Monday, January 7, 2013

Dramatically reducing sequencing error via Duplex Tag sequencing

Note: This is a cross-post from my recent blogging over at The Molecular Ecologist - check out the blog if you haven't heard of it; it's a great resource for biologists grappling with high-throughput sequencing data.

An exciting new study was published in PNAS last month, an open access paper entitled "Detection of ultra-rare mutations by next-generation sequencing". This method has the potential to open up a new frontier in next-gen sequencing bioinformatics, since it allows virtually all PCR- and sequencing-generated errors to be tracked.

In this approach, authors Schmitt et al. used Duplex Tag sequencing: they tacked a sequence of 12 randomized nucleotides onto Illumina adaptors prior to conducting PCR (the forward and reverse adaptors are labelled with different tags, denoted here as A and B tag sequences). After library prep and sequencing, these tags can be tracked in two ways. First, sequences containing the same unique 12 bp tags in the same orientation (AB or BA) can be informatically grouped together and used to generate a Single-Strand Consensus Sequence (SSCS). Even this simple approach reduces error rates below those achieved by standard quality processing.
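
To make the grouping step concrete, here's a minimal Python sketch of the SSCS idea (the function names, thresholds, and toy reads are mine, not from the paper). It assumes the tags sit at the start of each read, and that reads within a family are already aligned and trimmed to equal length; for brevity it builds the consensus from read 1 only:

    from collections import Counter, defaultdict

    TAG_LEN = 12  # 12 randomized nucleotides on each adaptor

    def tag_family(read1, read2):
        """Combine the two 12 bp tags into a family key.

        An AB pair and its BA complement yield different keys,
        so the two strand families stay separate (as SSCS requires).
        """
        return read1[:TAG_LEN] + "+" + read2[:TAG_LEN]

    def single_strand_consensus(reads):
        """Majority-rule consensus across one tag family."""
        consensus = []
        for column in zip(*reads):
            base, count = Counter(column).most_common(1)[0]
            # Mask the position if no base clearly dominates
            consensus.append(base if count / len(column) > 0.5 else "N")
        return "".join(consensus)

    # Toy input: (read1, read2) pairs; the first 12 bp of each read is its tag
    read_pairs = [
        ("ACGTACGTACGT" + "TTGACA", "GGCCTTAAGGCC" + "AAATTT"),
        ("ACGTACGTACGT" + "TTGACA", "GGCCTTAAGGCC" + "AAATTT"),
        ("ACGTACGTACGT" + "TTGTCA", "GGCCTTAAGGCC" + "AAATTT"),
    ]

    families = defaultdict(list)
    for r1, r2 in read_pairs:
        families[tag_family(r1, r2)].append(r1[TAG_LEN:])

    # Require a minimum family size so the vote is meaningful
    sscs = {tag: single_strand_consensus(reads)
            for tag, reads in families.items() if len(reads) >= 3}
    print(sscs)  # {'ACGTACGTACGT+GGCCTTAAGGCC': 'TTGACA'}

Real pipelines vote across both reads of each pair and apply quality filters on top of this, but the core idea really is just "group by tag, then vote."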

However, the real power comes from combining information from reads containing the 12 bp tags in both orientations (AB and BA). The authors showed a dramatic reduction (near-elimination, in fact) of sequencing errors by using Duplex Consensus Sequences (DCS): information from complementary pools of PCR products representing both the sense and antisense DNA strands of the original reference DNA molecule. In DCS, any given mutation must be present across ALL reads (both AB- and BA-tagged), or else it likely represents a spurious sequencing error or a first-round PCR error (e.g. a mutation present in all sequences used to generate one SSCS but absent from the complementary reads used to generate the DCS).
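
In code, the duplex step boils down to a position-by-position comparison of the two complementary single-strand consensuses. Here's a hedged sketch (assuming the AB and BA consensuses have already been oriented to the same reference strand, so they're directly comparable):

    def duplex_consensus(sscs_ab, sscs_ba):
        """Keep a base only where both strand consensuses agree;
        disagreements are masked as 'N' and never called as mutations."""
        assert len(sscs_ab) == len(sscs_ba)
        return "".join(a if a == b and a != "N" else "N"
                       for a, b in zip(sscs_ab, sscs_ba))

    # A variant present in the AB consensus but not the BA consensus
    # (e.g. a first-round PCR error) is masked out of the duplex call:
    print(duplex_consensus("ACGTTA", "ACGATA"))  # -> ACGNTA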

Overview of Duplex Tag sequencing approach (Schmitt et al. 2012) 

Schmitt et al. estimated the error frequency of DCS at 3.8 × 10^−10 (and it is likely even lower, since this estimate assumes all mutations are equally probable, whereas the SSCS data actually show quite a strong mutational bias). This error rate is unprecedented and astoundingly low, considering that the error rate for standard data processing methods is typically 3.8 × 10^−3.
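
To put those two rates side by side, here's a quick back-of-the-envelope calculation (the dataset size is hypothetical; I'm just scaling the published rates):

    standard_rate = 3.8e-3   # errors per base, standard processing
    dcs_rate = 3.8e-10       # errors per base, duplex consensus

    bases = 1e7  # say, 10 million consensus bases examined
    print(f"standard: ~{standard_rate * bases:,.0f} spurious calls")  # ~38,000
    print(f"DCS:      ~{dcs_rate * bases:.4f} spurious calls")        # ~0.0038

In other words, a true variant present at one in a million would be buried under tens of thousands of artifacts with standard processing, but stands out clearly against a DCS background.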

The study also contains a number of super cool empirical validations of the accuracy of the Duplex Tag approach. Error rates from DCS were reduced to the point that DCS-derived estimates proved even more accurate than in vivo genetic measurements of mutation rates (here, the authors used the M13mp2 lacZ assay to determine mutation rates for reference DNA). In addition, Schmitt et al.'s DCS-based mutation rate estimates were consistent with previous measurements in M13mp2 substrates (where mutation has been extensively characterized) and in human mitochondrial DNA. DCS can also be used to pinpoint mutational hotspots and genomic mutation patterns by removing artifacts that have previously precluded such analyses. For example, the authors used their Duplex Tag method to identify one such hotspot in the region of replication initiation (the D-loop) of human mtDNA.

As if that weren't enough, the study also tracked the kinds of errors that arise in a typical PCR reaction. Bursts of DNA mutations pop up during the first round of PCR, and these strand-specific errors are carried through all subsequent cycles. The authors used a mutagenic protocol to show that oxidative DNA damage arising during PCR causes an excess of C→T and G→T mutations, and a characteristic mutation profile in SSCS analyses. These PCR-specific errors can't be identified through SSCS alone (since first-round PCR errors are propagated across all daughter copies), but they are easily pinpointed and eliminated using DCS.
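
One easy way to visualize this bias is to tally the mutation spectrum from SSCS variant calls - strand-specific artifacts like the G→T excess show up as an asymmetry in the counts. A rough sketch (the (ref, alt) variant format is my own assumption, not the paper's pipeline):

    from collections import Counter

    def mutation_spectrum(variants):
        """Tally substitution types from (ref_base, alt_base) calls.

        In SSCS data, damage-driven artifacts show up as an excess of
        particular substitution types; in DCS data the spectrum settles
        back toward the true mutation profile.
        """
        return Counter(f"{ref}->{alt}" for ref, alt in variants)

    sscs_calls = [("G", "T"), ("G", "T"), ("C", "T"), ("A", "G")]
    print(mutation_spectrum(sscs_calls))
    # Counter({'G->T': 2, 'C->T': 1, 'A->G': 1})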

Tracking mutational profiles of first-round PCR errors (Schmitt et al. 2012)

So what do these results mean for future studies? Well, first of all, Duplex tagging is an EASY method to apply: there are no extra steps in library prep, just modifications when you order your sequencing adaptors. However, the approach is limited by the need for overlapping fragments (e.g. fragment size limitations imposed by 2 × 250 PE runs on an Illumina MiSeq). The full workflow (deep error correction via DCS) is currently only applicable to shotgun metagenomic sequencing, where you start out with double-stranded fragments of DNA. That being said, amplicon sequencing approaches (rRNA or other marker genes) could still harness the SSCS approach to gain at least some reduction in error rates by pooling reads that share the same unique 12 bp tag.

Even applied to metagenomic approaches, Duplex Tag sequencing has the potential to give us an unprecedented view into the rare biosphere - we've always been able to look at abundant taxa, but ~1% sequencing error rates have persistently clouded our view of low-copy taxa (or rare gene variants) in environmental or microbiome samples. By removing virtually all error, future studies can ensure that observed variants are biologically real.

Reference:

Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, Loeb LA. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA, 109(36):14508–13.
