plotQualityProfile(fwd)
plotQualityProfile(rev)
Setup & input
Start with paired-end FASTQ files (e.g.
sample1_R1.fastq
,sample1_R2.fastq
)Files must be:
- Demultiplexed (split by sample)
- Primer-trimmed
DADA2 matches forward (
R1
) and reverse (R2
) files automatically by sample name.
R1
) and reverse (R2
) reads?
Paired-end sequencing (e.g. Illumina MiSeq) reads both ends of a DNA fragment, generating two FASTQ files per sample:
R1.fastq
→ Forward readR2.fastq
→ Reverse readThe forward read comes from the 5′ end of the top strand
The reverse read comes from the 5′ end of the bottom strand (reverse complement)
The two reads may overlap, allowing better reconstruction of the full DNA fragment
DADA2 uses paired-end reads to:
- Independently filter and model errors in each direction
- Merge the reads using their overlap to:
- Correct sequencing errors
- Reconstruct the full barcode (amplicon)
- Improve accuracy and resolution
- Correct sequencing errors
File | Description |
---|---|
R1.fastq |
Forward reads (5′→3′ of one strand) |
R2.fastq |
Reverse reads (5′→3′ of the other strand) |
→ Both are needed to build full-length, high-quality sequences.
Inspecting quality
- Forward reads: generally good quality
- Reverse reads: quality often drops toward the end
- Use this to decide truncation lengths (e.g. 240 bp forward, 160 bp reverse)
Filter & trim
filterAndTrim(fwd, filtF,
rev, filtR,truncLen = c(240,160),
maxN = 0,
maxEE = c(2,2),
truncQ = 2,
rm.phix = TRUE,
compress = TRUE,
multithread = TRUE)
- Removes poor-quality and ambiguous reads
- Drops PhiX contaminants
- Outputs retained read counts per sample
PhiX (or PhiX174) is a small viral genome often added as a control during Illumina sequencing runs.
- It helps calibrate the sequencer and improve performance in low-diversity samples (like amplicon-based eDNA).
- It’s not part of your actual biological sample, so it needs to be filtered out.
In DADA2, this is done automatically:
= TRUE rm.phix
Learn error rates
<- learnErrors(filtFs, multithread = TRUE)
errF <- learnErrors(filtRs, multithread = TRUE) errR
- Builds a parametric error model from observed data
- Separately for forward and reverse reads
Denoise / Sample inference
<- dada(filtFs, err = errF, multithread = TRUE)
dadaFs <- dada(filtRs, err = errR, multithread = TRUE) dadaRs
- Infers ASVs (Amplicon Sequence Variants)
- Corrects sequencing errors to identify true biological sequences
Merge paired reads
<- mergePairs(dadaFs, filtFs, dadaRs, filtRs) mergers
- Merges forward and reverse reads
- Requires overlap (e.g., ≥12 bp) and agreement in the overlap region
Build sequence table
<- makeSequenceTable(mergers) seqtab
- Rows = samples
- Columns = ASVs
- Cells = read counts
Chimera removal
<- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE) seqtab.nochim
- Removes PCR artifacts (chimeras)
- Yields a cleaned-up ASV table
Chimeras are artificial DNA sequences created during PCR amplification.
- They form when a partially extended DNA strand from one template accidentally binds to a different, but similar, template in the next PCR cycle.
- This creates a hybrid sequence that does not exist in any real organism.
These artefacts:
- Are common in amplicon-based workflows.
- Can inflate biodiversity estimates or lead to false species detections.
Track read loss
<- function(x) sum(getUniques(x))
getN <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
track colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
- Monitors how many reads pass each step
- Helps catch major data loss or pipeline issues
Assign taxonomy
<- assignTaxonomy(seqtab.nochim, "silva_nr_v138_train_set.fa.gz", multithread = TRUE)
taxa <- addSpecies(taxa, "silva_species_assignment_v138.fa.gz") taxa
- Classifies ASVs by comparing to a reference database (e.g., SILVA)
- Adds species-level IDs for exact matches
Optional: Species-level matching
Use curated references like MiFish for fish-specific identification
Alternatives:
- BLAST for manual species checks
- DECIPHER’s
IdTaxa()
function
Validate with mock community
- Run pipeline on a known “mock” sample
- Should recover expected species with high accuracy
Export to phyloseq
library(phyloseq)
<- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE),
ps tax_table(taxa),
sample_data(metadata))
- Enables diversity analysis, ordination, visualisation
Summary flow
Step | Action |
---|---|
1 | Import FASTQ files |
2 | Visualise quality |
3 | Filter & trim reads |
4 | Learn error rates |
5 | Infer ASVs (denoise) |
6 | Merge paired reads |
7 | Build sequence table |
8 | Remove chimeras |
9 | Track read retention |
10 | Assign taxonomy |
11 | Optional: species matching |
12 | Validate with known samples |
13 | Export to phyloseq |
Example code
# Load necessary libraries
library(dada2)
library(phyloseq)
library(Biostrings)
library(ggplot2)
# File set up
<- "MiSeq_SOP"
path list.files(path)
# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
<- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnFs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
fnRs
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
<- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
sample.names
# Inspect quality
plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])
# Filter & trim
<- file.path(path, "filtered", basename(fnFs))
filtFs <- file.path(path, "filtered", basename(fnRs))
filtRs <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
out truncLen=c(240,200), maxN=0, maxEE=c(2,2),
truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=FALSE)
head(out)
# Learn error rates
<- learnErrors(filtFs, multithread=FALSE)
errF <- learnErrors(filtRs, multithread=FALSE)
errR plotErrors(errF, nominalQ=TRUE)
# Dereplicate
<- derepFastq(filtFs, verbose=TRUE)
derepFs <- derepFastq(filtRs, verbose=TRUE)
derepRs
# DADA denoising
<- dada(derepFs, err=errF, multithread=FALSE)
dadaFs <- dada(derepRs, err=errR, multithread=FALSE)
dadaRs
# Merge paired reads
<- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
mergers
# Create ASV table
<- makeSequenceTable(mergers)
seqtab dim(seqtab)
table(nchar(getSequences(seqtab)))
# Remove chimeras
<- removeBimeraDenovo(seqtab, method="consensus",
seqtab.nochim multithread=FALSE, verbose=TRUE)
sum(seqtab.nochim)/sum(seqtab)
# Track reads through pipeline
<- function(x) sum(getUniques(x))
getN <- data.frame(
track input=out[,1],
filtered=out[,2],
denoisedF=sapply(dadaFs, getN),
denoisedR=sapply(dadaRs, getN),
merged=sapply(mergers, getN),
nonchim=rowSums(seqtab.nochim)
)rownames(track) <- sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
print(track)
# Taxonomic assignment (using Silva v138 trained classifier)
<- assignTaxonomy(seqtab.nochim, "silva_nr_v138_train_set.fa.gz", multithread=FALSE)
taxa print(head(taxa))
# Removing sequence rownames for display only
<- taxa
taxa.print rownames(taxa.print) <- NULL
head(taxa.print)
# Evaluate accuracy
rownames(seqtab.nochim)
<- sapply(strsplit(rownames(seqtab.nochim), "_"), `[`, 1)
sample.names rownames(seqtab.nochim) <- sample.names
<- seqtab.nochim["Mock",]
unqs.mock <- sort(unqs.mock[unqs.mock>0], decreasing=TRUE) # Drop ASVs absent in the Mock
unqs.mock cat("DADA2 inferred", length(unqs.mock), "sample sequences present in the Mock community.\n")
# Load expected reference sequences for the Mock community
<- getSequences(file.path(path, "HMP_MOCK.v35.fasta"))
mock.ref <- sum(sapply(names(unqs.mock), function(x) any(grepl(x, mock.ref))))
match.ref cat("Of those,", sum(match.ref), "were exact matches to the expected reference sequences.\n")
######
theme_set(theme_bw())
<- rownames(seqtab.nochim)
samples.out <- sapply(strsplit(samples.out, "D"), `[`, 1)
subject <- substr(subject,1,1)
gender <- substr(subject,2,999)
subject <- as.integer(sapply(strsplit(samples.out, "D"), `[`, 2))
day <- data.frame(Subject=subject, Gender=gender, Day=day)
samdf $When <- "Early"
samdf$When[samdf$Day>100] <- "Late"
samdfrownames(samdf) <- samples.out
<- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
ps sample_data(samdf),
tax_table(taxa))
<- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample
ps
<- Biostrings::DNAStringSet(taxa_names(ps))
dna names(dna) <- taxa_names(ps)
<- merge_phyloseq(ps, dna)
ps taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
ps
# Alpha diversity
plot_richness(ps, x="Day", measures=c("Shannon", "Simpson"), color="When")
# Beta diversity
<- transform_sample_counts(ps, function(otu) otu/sum(otu))
ps.prop <- ordinate(ps.prop, method="NMDS", distance="bray")
ord.nmds.bray
plot_ordination(ps.prop, ord.nmds.bray, color="When", title="Bray NMDS")
<- names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
top20 <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 <- prune_taxa(top20, ps.top20)
ps.top20 plot_bar(ps.top20, x="Day", fill="Family") + facet_wrap(~When, scales="free_x")
Understanding relative abundance in eDNA studies
When using eDNA, only a portion of the total sample is being used, and then PCR introduces bias by creating millions of DNA copies. This raises the question:
How can we trust that the number of DNA reads (or “read counts”) reflects the true abundance of species in the environment?
PCR amplifies DNA exponentially — ideally doubling the number of DNA copies with each cycle. However, this process introduces amplification bias, such as:
- Primer efficiency bias: some sequences bind primers better and amplify more easily.
- Template preference: certain DNA sequences are preferentially copied.
- Plateau effect: as resources deplete, amplification slows and stops.
- Subsampling: only a small portion of the total DNA is used for PCR and sequencing.
This means PCR doesn’t preserve the original proportions of DNA exactly — it confirms presence, but not perfect abundance. Read counts still be useful but cautiously. While PCR distorts absolute quantities, read counts still carry ecological signals, especially for dominant species. Researchers typically use:
Relative Read Abundance (RRA):
- 40% of reads = European seabass
- 10% of reads = Threespine stickleback
RRA allows us to compare patterns across sites or conditions, not exact organism numbers.
After sequencing, a read count table shows how many reads were detected for each species per sample. To make this data comparable, the following techniques are applied:
Normalisation
Goal: Adjust for different sequencing depths (i.e. different total read numbers per sample).
- Proportional scaling: convert raw counts to relative proportions (% of total reads).
- Log transformation: dampens the influence of very abundant species.
- Model-based approaches: e.g. DESeq2, edgeR — account for count variability statistically.
Rarefaction
Goal: Standardise sequencing effort across samples.
- Subsample each dataset to a common number of reads (e.g. 10,000 per sample).
- Ensures fair comparisons of diversity.
- Prevents overestimating richness in deeper sequenced samples.
Trade-off: some data (e.g. rare reads) is lost, but results are less biased.
Statistical tools
Once data is normalised or rarefied, we can use various ecological and statistical analyses:
- Diversity indices: Shannon, Simpson – to measure species richness and evenness
- Ordination: PCA, NMDS, PCoA – to visualise patterns between sites
- Hypothesis testing: PERMANOVA, ANOSIM – to compare groups or treatments
- Ecological modelling: Explore links between environmental drivers and communities