Setup & input

  • Start with paired-end FASTQ files (e.g. sample1_R1.fastq, sample1_R2.fastq)

  • Files must be:

    • Demultiplexed (split by sample)
    • Primer-trimmed
  • DADA2 matches forward (R1) and reverse (R2) files automatically by sample name.

Why forward (R1) and reverse (R2) reads?

Paired-end sequencing (e.g. Illumina MiSeq) reads both ends of a DNA fragment, generating two FASTQ files per sample:

  • R1.fastqForward read

  • R2.fastqReverse read

  • The forward read comes from the 5′ end of the top strand

  • The reverse read comes from the 5′ end of the bottom strand (reverse complement)

  • The two reads may overlap, allowing better reconstruction of the full DNA fragment

DADA2 uses paired-end reads to:

  1. Independently filter and model errors in each direction
  2. Merge the reads using their overlap to:
    • Correct sequencing errors
    • Reconstruct the full barcode (amplicon)
    • Improve accuracy and resolution
File Description
R1.fastq Forward reads (5′→3′ of one strand)
R2.fastq Reverse reads (5′→3′ of the other strand)

→ Both are needed to build full-length, high-quality sequences.

Inspecting quality

plotQualityProfile(fwd)
plotQualityProfile(rev)
  • Forward reads: generally good quality
  • Reverse reads: quality often drops toward the end
  • Use this to decide truncation lengths (e.g. 240 bp forward, 160 bp reverse)

Filter & trim

filterAndTrim(fwd, filtF,
              rev, filtR,
              truncLen = c(240,160),
              maxN = 0,
              maxEE = c(2,2),
              truncQ = 2,
              rm.phix = TRUE,
              compress = TRUE,
              multithread = TRUE)
  • Removes poor-quality and ambiguous reads
  • Drops PhiX contaminants
  • Outputs retained read counts per sample
What is PhiX and why is it removed?

PhiX (or PhiX174) is a small viral genome often added as a control during Illumina sequencing runs.

  • It helps calibrate the sequencer and improve performance in low-diversity samples (like amplicon-based eDNA).
  • It’s not part of your actual biological sample, so it needs to be filtered out.

In DADA2, this is done automatically:

rm.phix = TRUE

Learn error rates

errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
  • Builds a parametric error model from observed data
  • Separately for forward and reverse reads

Denoise / Sample inference

dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, multithread = TRUE)
  • Infers ASVs (Amplicon Sequence Variants)
  • Corrects sequencing errors to identify true biological sequences

Merge paired reads

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs)
  • Merges forward and reverse reads
  • Requires overlap (e.g., ≥12 bp) and agreement in the overlap region

Build sequence table

seqtab <- makeSequenceTable(mergers)
  • Rows = samples
  • Columns = ASVs
  • Cells = read counts

Chimera removal

seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
  • Removes PCR artifacts (chimeras)
  • Yields a cleaned-up ASV table
What are chimeras and why remove them?

Chimeras are artificial DNA sequences created during PCR amplification.

  • They form when a partially extended DNA strand from one template accidentally binds to a different, but similar, template in the next PCR cycle.
  • This creates a hybrid sequence that does not exist in any real organism.

These artefacts:

  • Are common in amplicon-based workflows.
  • Can inflate biodiversity estimates or lead to false species detections.

Track read loss

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
  • Monitors how many reads pass each step
  • Helps catch major data loss or pipeline issues

Assign taxonomy

taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v138_train_set.fa.gz", multithread = TRUE)
taxa <- addSpecies(taxa, "silva_species_assignment_v138.fa.gz")
  • Classifies ASVs by comparing to a reference database (e.g., SILVA)
  • Adds species-level IDs for exact matches

Optional: Species-level matching

  • Use curated references like MiFish for fish-specific identification

  • Alternatives:

    • BLAST for manual species checks
    • DECIPHER’s IdTaxa() function

Validate with mock community

  • Run pipeline on a known “mock” sample
  • Should recover expected species with high accuracy

Export to phyloseq

library(phyloseq)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE),
               tax_table(taxa),
               sample_data(metadata))
  • Enables diversity analysis, ordination, visualisation

Summary flow

Step Action
1 Import FASTQ files
2 Visualise quality
3 Filter & trim reads
4 Learn error rates
5 Infer ASVs (denoise)
6 Merge paired reads
7 Build sequence table
8 Remove chimeras
9 Track read retention
10 Assign taxonomy
11 Optional: species matching
12 Validate with known samples
13 Export to phyloseq

Example code

# Load necessary libraries
library(dada2)
library(phyloseq)
library(Biostrings)
library(ggplot2)

# File set up
path <- "MiSeq_SOP"
list.files(path)

# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

# Inspect quality
plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])

# Filter & trim
filtFs <- file.path(path, "filtered", basename(fnFs))
filtRs <- file.path(path, "filtered", basename(fnRs))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(240,200), maxN=0, maxEE=c(2,2),
                     truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=FALSE)
head(out)

# Learn error rates
errF <- learnErrors(filtFs, multithread=FALSE)
errR <- learnErrors(filtRs, multithread=FALSE)
plotErrors(errF, nominalQ=TRUE)

# Dereplicate
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)

# DADA denoising
dadaFs <- dada(derepFs, err=errF, multithread=FALSE)
dadaRs <- dada(derepRs, err=errR, multithread=FALSE)

# Merge paired reads
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

# Create ASV table
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
table(nchar(getSequences(seqtab)))

# Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus",
                                    multithread=FALSE, verbose=TRUE)
sum(seqtab.nochim)/sum(seqtab)

# Track reads through pipeline
getN <- function(x) sum(getUniques(x))
track <- data.frame(
  input=out[,1],
  filtered=out[,2],
  denoisedF=sapply(dadaFs, getN),
  denoisedR=sapply(dadaRs, getN),
  merged=sapply(mergers, getN),
  nonchim=rowSums(seqtab.nochim)
)
rownames(track) <- sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
print(track)

# Taxonomic assignment (using Silva v138 trained classifier)
taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v138_train_set.fa.gz", multithread=FALSE)
print(head(taxa))

# Removing sequence rownames for display only
taxa.print <- taxa
rownames(taxa.print) <- NULL
head(taxa.print)

# Evaluate accuracy
rownames(seqtab.nochim)
sample.names <- sapply(strsplit(rownames(seqtab.nochim), "_"), `[`, 1)
rownames(seqtab.nochim) <- sample.names

unqs.mock <- seqtab.nochim["Mock",]
unqs.mock <- sort(unqs.mock[unqs.mock>0], decreasing=TRUE) # Drop ASVs absent in the Mock
cat("DADA2 inferred", length(unqs.mock), "sample sequences present in the Mock community.\n")

# Load expected reference sequences for the Mock community
mock.ref <- getSequences(file.path(path, "HMP_MOCK.v35.fasta"))
match.ref <- sum(sapply(names(unqs.mock), function(x) any(grepl(x, mock.ref))))
cat("Of those,", sum(match.ref), "were exact matches to the expected reference sequences.\n")

######

theme_set(theme_bw())

samples.out <- rownames(seqtab.nochim)
subject <- sapply(strsplit(samples.out, "D"), `[`, 1)
gender <- substr(subject,1,1)
subject <- substr(subject,2,999)
day <- as.integer(sapply(strsplit(samples.out, "D"), `[`, 2))
samdf <- data.frame(Subject=subject, Gender=gender, Day=day)
samdf$When <- "Early"
samdf$When[samdf$Day>100] <- "Late"
rownames(samdf) <- samples.out

ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), 
               sample_data(samdf), 
               tax_table(taxa))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample

dna <- Biostrings::DNAStringSet(taxa_names(ps))
names(dna) <- taxa_names(ps)
ps <- merge_phyloseq(ps, dna)
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
ps

# Alpha diversity
plot_richness(ps, x="Day", measures=c("Shannon", "Simpson"), color="When")

# Beta diversity
ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu))
ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")

plot_ordination(ps.prop, ord.nmds.bray, color="When", title="Bray NMDS")

top20 <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
ps.top20 <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 <- prune_taxa(top20, ps.top20)
plot_bar(ps.top20, x="Day", fill="Family") + facet_wrap(~When, scales="free_x")

Understanding relative abundance in eDNA studies

RRA

When using eDNA, only a portion of the total sample is being used, and then PCR introduces bias by creating millions of DNA copies. This raises the question:

How can we trust that the number of DNA reads (or “read counts”) reflects the true abundance of species in the environment?

PCR amplifies DNA exponentially — ideally doubling the number of DNA copies with each cycle. However, this process introduces amplification bias, such as:

  • Primer efficiency bias: some sequences bind primers better and amplify more easily.
  • Template preference: certain DNA sequences are preferentially copied.
  • Plateau effect: as resources deplete, amplification slows and stops.
  • Subsampling: only a small portion of the total DNA is used for PCR and sequencing.

This means PCR doesn’t preserve the original proportions of DNA exactly — it confirms presence, but not perfect abundance. Read counts still be useful but cautiously. While PCR distorts absolute quantities, read counts still carry ecological signals, especially for dominant species. Researchers typically use:

  • Relative Read Abundance (RRA):

    • 40% of reads = European seabass
    • 10% of reads = Threespine stickleback

RRA allows us to compare patterns across sites or conditions, not exact organism numbers.

After sequencing, a read count table shows how many reads were detected for each species per sample. To make this data comparable, the following techniques are applied:

Normalisation

Goal: Adjust for different sequencing depths (i.e. different total read numbers per sample).

  • Proportional scaling: convert raw counts to relative proportions (% of total reads).
  • Log transformation: dampens the influence of very abundant species.
  • Model-based approaches: e.g. DESeq2, edgeR — account for count variability statistically.

Rarefaction

Goal: Standardise sequencing effort across samples.

  • Subsample each dataset to a common number of reads (e.g. 10,000 per sample).
  • Ensures fair comparisons of diversity.
  • Prevents overestimating richness in deeper sequenced samples.

Trade-off: some data (e.g. rare reads) is lost, but results are less biased.

Statistical tools

Once data is normalised or rarefied, we can use various ecological and statistical analyses:

  • Diversity indices: Shannon, Simpson – to measure species richness and evenness
  • Ordination: PCA, NMDS, PCoA – to visualise patterns between sites
  • Hypothesis testing: PERMANOVA, ANOSIM – to compare groups or treatments
  • Ecological modelling: Explore links between environmental drivers and communities