UMI management for somatic analyses

Get the most out of UMIs by selecting one of SomaVar’s three new analysis modes and by browsing a more comprehensive QC report.
Newsletter February 2021
Introduction
PCR is an essential step in most NGS protocols, both for library preparation and for enrichment in targets of interest. For each unique molecule, it generates a variable number of clones, or duplicates.
This step is known to introduce biases: it preferentially amplifies certain sequences and thus artificially inflates the coverage of a given position in the genome. It also introduces errors, which can prove problematic when searching for variants with low allele frequency.
Adding a unique molecular index (UMI, also known as a unique molecular identifier) to the sequencing preparation workflow, upstream of PCR, offers a solution to these problems. UMIs make it possible to identify all sequences originating from the same initial molecule, and thus improve sequencing accuracy by allowing most of these errors to be corrected [1].
Benefits of using UMIs
UMIs offer two major advantages:
- A more precise estimate of the allele frequency, through improved deduplication.
In the bioinformatic pipeline, PCR duplicates are identified during the deduplication step: reads aligning to exactly the same position in the reference genome are treated as clones of one and the same initial molecule. Only one of these sequences is then retained as the representative of the starting molecule for the remainder of the workflow.
However, two sequences with identical genomic coordinates could just as well come from two distinct molecules, originating from different cells. With a classic deduplication approach, these would be collapsed into a single molecule. Signal is therefore lost for variant detection, and what remains is only a partial representation of the signal at this locus.

When each molecule is tagged prior to PCR amplification, each clone can be traced back to its initial molecule: sequences that are identical after alignment but come from different molecules carry different UMIs, and are therefore no longer merged.
- Increased specificity when identifying low-frequency variants.
Multiple PCR clones can be used to increase the quality of the representative sequence of the original DNA fragment. Since the fragment was duplicated before sequencing and then sequenced multiple times, the copies can be compared to correct sequencing errors. Generating a consensus sequence from these duplicates, based on a majority vote at each position, largely eliminates background noise.
This becomes particularly useful when searching for variants with a very low allele frequency, for example when sequencing circulating tumor DNA (ctDNA), where errors generated during PCR or sequencing can quickly become problematic.
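The two advantages above can be illustrated with a minimal Python sketch. It is not SeqOne's implementation: the read representation (a tuple of alignment position, UMI and sequence), the example values and the function names are all assumptions made for illustration. Real pipelines also account for errors in the UMI itself, strand, and base qualities.

```python
from collections import Counter, defaultdict

# Hypothetical reads: (alignment start, UMI, sequence). Illustrative values only.
reads = [
    (1000, "AACG", "ACGTAC"),
    (1000, "AACG", "ACGTAC"),
    (1000, "AACG", "ACGAAC"),  # one clone carries an error at index 3
    (1000, "TTGC", "ACGTAC"),  # same position, different UMI: a distinct molecule
]


def count_molecules_by_position(reads):
    """Classic deduplication: one molecule per alignment position."""
    return len({pos for pos, _umi, _seq in reads})


def group_by_position_and_umi(reads):
    """UMI-aware grouping: one family per (position, UMI) pair."""
    families = defaultdict(list)
    for pos, umi, seq in reads:
        families[(pos, umi)].append(seq)
    return dict(families)


def majority_consensus(sequences):
    """Majority vote at each position across equal-length sequences.

    Ties are resolved in favour of the base seen first (a simplification).
    """
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


families = group_by_position_and_umi(reads)
consensus = {key: majority_consensus(seqs) for key, seqs in families.items()}
```

In this toy example, position-based deduplication sees a single molecule, while grouping by UMI recovers two, and the majority vote removes the isolated error in the three-read family.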
How to use SeqOne’s bioinformatics with UMIs
In practice, three UMI processing modes are offered on the platform when launching a compatible analysis:
- Standard mode (recommended mode): consensus sequences are generated from PCR duplicates with the same UMI, when their number is greater than or equal to 2. UMIs represented by a single sequence (singletons) are also preserved.
- High quality: consensus sequences are generated from PCR duplicates with the same UMI when their number is greater than or equal to 3, and UMIs supported by 1 (singletons) or 2 reads are eliminated from the analysis. This mode is also more stringent on base quality after consensus generation, and allows the detection of variants with an allele frequency below 1%. It is recommended when the sequencing depth is greater than 5000X, and for applications such as ctDNA sequencing.
- UMI disabled: UMIs are not used for deduplication, and they are trimmed from the end of the sequences before analysis.
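As a rough illustration of how the standard and high quality modes treat UMI families of different sizes, here is a minimal Python sketch. The dictionary layout, the key names and the function are hypothetical and do not reflect the platform's actual configuration; the high quality threshold is taken as 3 reads, consistent with the comparison table at the end of this article. The "UMI disabled" mode is not modelled.

```python
from collections import Counter

# Hypothetical encoding of the two UMI-aware modes:
# - min_reads: family size needed to build a consensus
# - keep_small_families: whether smaller families are kept as they are
MODES = {
    "standard": {"min_reads": 2, "keep_small_families": True},
    "high_quality": {"min_reads": 3, "keep_small_families": False},
}


def classify_families(family_sizes, mode):
    """Count UMI families by outcome for the given mode."""
    settings = MODES[mode]
    outcome = Counter()
    for size in family_sizes:
        if size >= settings["min_reads"]:
            outcome["consensus"] += 1
        elif settings["keep_small_families"]:
            outcome["kept_as_is"] += 1
        else:
            outcome["discarded"] += 1
    return dict(outcome)


# Illustrative family sizes (number of reads sharing each UMI).
sizes = [1, 1, 2, 3, 5]
standard = classify_families(sizes, "standard")
high_quality = classify_families(sizes, "high_quality")
```

With these made-up sizes, the standard mode keeps every family (three consensus sequences plus two singletons), whereas the high quality mode builds two consensus sequences and discards the other three families.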
The quality control report of the SomaVar analysis now provides a more detailed view of the composition of the sample, in particular the distribution of UMIs according to the number of sequences carrying them. A better understanding of the sample’s profile can then guide the choice of the most appropriate analysis mode.
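To give an idea of what such a distribution looks like, the following Python sketch computes it from a list of UMIs, one per read. This is only an illustration under simplifying assumptions: the function names and UMI values are invented, families are defined by UMI alone (ignoring alignment position), and it does not reproduce how the QC report is actually computed.

```python
from collections import Counter


def family_size_distribution(umis):
    """Map family size -> number of UMIs carried by that many reads."""
    reads_per_umi = Counter(umis)
    return dict(sorted(Counter(reads_per_umi.values()).items()))


def fraction_with_at_least(distribution, min_reads):
    """Fraction of UMI families supported by at least min_reads reads."""
    total = sum(distribution.values())
    kept = sum(n for size, n in distribution.items() if size >= min_reads)
    return kept / total if total else 0.0


# Hypothetical UMIs, one entry per read.
umis = ["AACG", "AACG", "AACG", "TTGC", "GGAT", "GGAT", "CCTA"]
distribution = family_size_distribution(umis)
```

A sample dominated by singletons would lose most of its families under a 3-read threshold, which is the kind of information that can help in choosing between modes.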
CNV analysis
Do you want to detect CNVs from your capture data with UMIs? The SomaCNVCapture pipeline will now be available in your UMI projects as well.
Regardless of the configuration selected when launching your SomaVar analyses in this project, the analysis of CNVs will be based on a standard approach: consensus sequences will be generated from PCR duplicates with the same index, and singletons will be preserved.
Current limitations and backward compatibility
- Only the following kits are currently supported on the SeqOne platform:
– QIAGEN QIAseq
– Agilent XTHS / Low input
– Agilent XTHS V2
– IDT xGen UDI-UMI
– Illumina TruSight Oncology 500
If you use another protocol, contact us!
- Only the SomaVar, SomaCNVCapture and SomaRNA worksets are compatible with UMI data.
- Each of the two new UMI configurations (standard, high quality) differs from the previous implementation in SomaVar v1.4, as summarized in the following table:
Parameter | SomaVar v1.4 UMI | SomaVar v1.5 UMI standard | SomaVar v1.5 UMI high quality |
Minimum number of reads per consensus | 2 | 2 | 3 |
Minimum base quality (Phred score) | 30 | 30 | 40 |
Reads outside consensus sequences filtered out | yes | no | yes |
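To put the quality thresholds in perspective, the short Python sketch below converts a Phred score into an error probability using the standard definition Q = -10 · log10(P). The function name is our own, and we assume here that the thresholds in the table follow this standard scale.

```python
def phred_to_error_probability(q):
    """Standard Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


p30 = phred_to_error_probability(30)  # about 1 error in 1,000 bases
p40 = phred_to_error_probability(40)  # about 1 error in 10,000 bases
```

Under this assumption, moving the threshold from 30 to 40 corresponds to a tenfold lower tolerated per-base error probability.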