**DADA2** is an R package for inferring amplicon sequence variants (ASVs) from FASTQ files.
To use it you only need to be familiar with R and Bash.
The main difference from building OTUs is that the process is clustering-free: there is no fixed similarity threshold, and the amplicon sequence variants are defined
at the finest level possible, down to differences of a single nucleotide.
The online [tutorial](https://benjjneb.github.io/dada2/tutorial.html) covers nearly everything needed to use it. Some things that did not work for me, and that
took me too long to realize:
- Cut your primers. `cutadapt` makes this really easy!
- If around 25-30% of the reads are lost during ASV generation, some of the parameters probably have to be changed.
* Are you sure the primers have been removed from the FASTQ files?
* What `maxEE` did you specify? If it is causing many reads to be lost, you can set a higher `maxEE`, and even different values for the forward and reverse reads (for example `c(2,4)`).
The algorithm takes the errors into account in the error-modelling phase, so this will not make your ASVs erroneous.
* Do the paired reads overlap? By how many bases? It should be >= 20 nt.
> If you follow the tutorial, a **tracking table** is generated at the end of the procedure, showing how many reads are lost at each step. It is the best way to find out where it failed.
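
The overlap question above can be checked with simple arithmetic before running anything. This is a minimal sketch, assuming both reads are truncated to fixed lengths and that you know the approximate amplicon length after primer removal. The function name and the example numbers (240, 160, 253) are illustrative, not taken from a real run:

```shell
# Expected overlap (nt) between a read pair after truncation:
#   overlap = truncLen_F + truncLen_R - amplicon_length
# Hypothetical arguments: $1 = forward truncLen, $2 = reverse truncLen,
# $3 = amplicon length with primers already removed.
expected_overlap() {
  echo $(( $1 + $2 - $3 ))
}

# Example: truncLen = c(240, 160) on an amplicon of about 253 nt
overlap=$(expected_overlap 240 160 253)
if [ "$overlap" -ge 20 ]; then
  echo "overlap of ${overlap} nt: OK"
else
  echo "overlap of ${overlap} nt: too short to merge reliably"
fi
```

A negative result means the truncated reads no longer reach each other at all, so the pairs cannot be merged.
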
- In the trimming step, `truncLen` cuts all reads to a specific length and *removes* every read shorter than that. It is therefore important to know the average read length: a `truncLen` longer than many of your reads will discard them, while one that is too short can leave the pairs without enough overlap to merge.
* The DADA2 pipeline includes a quality profile step (`plotQualityProfile`); use it when deciding where to cut.
* The trimming point is different for each run, so if you are working with multiple runs, each one has to be processed separately and then joined together with `mergeSequenceTables`.
* You should have a summary of the FASTQ files: the average length, the average quality of each sample, and so on. Many of the problems with recovering most of the reads
stem from a low-quality sample, or from reads that were not properly amplified. `seqkit` is a good tool for this kind of information.
- Taxonomy is assigned at the species level only for 100%, exact matches. As a result, some bacteria/eukarya may be identified differently
at that level compared with OTU results. This is explained in more detail [here](https://benjjneb.github.io/dada2/assign.html#species-assignment).
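
For the per-sample read statistics mentioned above, `seqkit stats` (for example `seqkit stats -a -T *.fastq.gz`) gives a full table. Where `seqkit` is not installed, a rough fallback is to compute the mean read length with `awk`. This sketch assumes strict four-line FASTQ records, and the function and file names are hypothetical:

```shell
# Mean read length of an uncompressed FASTQ stream read from stdin.
# Assumes strict 4-line records (sequence lines are not wrapped).
fastq_mean_length() {
  awk 'NR % 4 == 2 { n++; total += length($0) }
       END { if (n > 0) printf "%d reads, mean length %.1f\n", n, total / n }'
}

# Usage with a hypothetical file name:
#   zcat sampleA_R1.fastq.gz | fastq_mean_length
```
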
----
#### How to use it in our biocluster
Just use the `qsub` system as usual, with the `Rscript` command. As the name says, it simply executes the script.
First, load the following modules:
`module load Rstats/R-3.4.1`
`module load gcc/4.9.0`
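
A minimal sketch of a job script for this, written out with a here-document. The script name `dada2_pipeline.R` and the job file name `dada2_job.sh` are hypothetical, and scheduler directives (queue, memory, cores) are left out on purpose because they depend on how the cluster is configured:

```shell
# Write a small job script that loads the modules and runs the R script.
cat > dada2_job.sh <<'EOF'
#!/bin/bash
# Load R and the compiler toolchain
module load Rstats/R-3.4.1
module load gcc/4.9.0

# Run the DADA2 script non-interactively (hypothetical file name)
Rscript dada2_pipeline.R
EOF

# Submit it to the queue (commented out so the sketch can be run anywhere):
# qsub dada2_job.sh
```
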
My approach is:
- Cut the primers.
- Plot the quality profile and decide at which length to trim.
- Run the DADA2 procedure.
- Check the results; if too few reads remain, modify the parameters and run it again.
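
For the first step, here is a hedged sketch of primer removal on paired-end reads with `cutadapt`. The primer sequences are the commonly used 515F/806R pair for the 16S V4 region and are only an assumption here. The directory layout and file names are hypothetical as well:

```shell
# Assumed primers (515F / 806R, 16S V4); replace with the ones you used.
FWD="GTGYCAGCMGCCGCGGTAA"
REV="GGACTACNVGGGTWTCTAAT"

# Trim primers from one sample. $1 is a sample prefix such as "sampleA",
# expecting raw/sampleA_R1.fastq.gz and raw/sampleA_R2.fastq.gz (hypothetical layout).
trim_primers() {
  mkdir -p trimmed
  # -g / -G: 5' primer on the forward / reverse read
  # --discard-untrimmed: drop read pairs in which a primer was not found
  cutadapt -g "$FWD" -G "$REV" --discard-untrimmed \
    -o "trimmed/${1}_R1.fastq.gz" -p "trimmed/${1}_R2.fastq.gz" \
    "raw/${1}_R1.fastq.gz" "raw/${1}_R2.fastq.gz"
}

# Usage: trim_primers sampleA
```

The report `cutadapt` prints at the end shows how many pairs contained the primers, which is a quick check that the sequences given are the right ones.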