Phylogenetic trees depend on the quality of sequence data to display hypotheses regarding the evolutionary history of organisms. This is important to keep in mind because Nextstrain trees do not show a branch support metric that would tell us how much confidence to place in a given branch. If we notice something new or unexpected on our trees, we have to make sure, to the best of our ability, that our data support the observed phylogeny. Here we discuss some quality checks that might help you gain confidence in your trees, using SARS-CoV-2 data as an example.
After reading this guide, you will:
- Become familiar with tree quality checks
- Become familiar with data quality checks that may be affecting your tree
- Be able to find data quality information
- Learn about natural processes that may affect your tree
Tree quality checks
Given the lack of branch support metrics on Nextstrain trees, the first thing we can do regarding tree quality is to look at our tree and evaluate if it is plausible based on what is already known. You can also look into clock plots to see if there are any outliers in your data.
Quality check 1: Does my tree look OK?
For example, let’s say that you decide to look at your tree based on Pango lineages because there is already support from the scientific community for how those SARS-CoV-2 lineages relate to each other.
Nextstrain SARS-CoV-2 tree colored by Pango lineage
The tree above highlights relationships based on Pango lineages. Notice that something is off. We know that Pango lineage BA.1.1 evolved from lineage BA.1. Why do we see BA.1.1 samples in two clades that have distinct evolutionary paths? In the tree above, BA.1.1 does not appear to be monophyletic: not all BA.1.1 samples descend from the same most recent common ancestor (MRCA), so they do not sit together on a single branch descending from BA.1 samples.
Tree with potential problems highlighted
This tree does not follow what we know about SARS-CoV-2 lineage evolution. Therefore, you need to be cautious with interpretation and step back to check if something potentially went wrong. See the section below for common data quality checks you can do.
Quality check 2: Do samples follow the known evolutionary rate?
The molecular evolution of SARS-CoV-2 can be visualized using Nextstrain, including analysis of its evolutionary rate. The evolutionary rate gives us an idea of how many mutations we should expect to see relative to a reference sequence over time. We can look at the "clock" view within Nextstrain to estimate how many mutations we should see in our samples. Go to SARS-CoV-2 global phylogeny to see the latest clock plot. Please note that the clock view is not available for Mpox trees.
Nextstrain clock plot view for SARS-CoV-2 focusing on subsampling of samples collected globally between June and August 2022.
Based on the clock plot above, we should expect samples collected between June and August 2022 to have accumulated between 60 and 90 mutations relative to the 2019 reference sequence. If samples deviate significantly from this range, we should take a closer look at our sequence data to make sure our workflow and data quality support the observed result. See the common data quality checks below.
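If you have mutation counts for your own samples, the comparison against the clock plot can be scripted. The following is a minimal sketch that uses only the 60 to 90 range quoted above for June to August 2022; the function name and the sample values are hypothetical, and the range must be re-read from the clock plot for any other collection window.

```python
from datetime import date

# Expected range read off the clock plot in this guide (June-August 2022).
# These numbers only apply to that window; re-read the plot for other dates.
WINDOW_START = date(2022, 6, 1)
WINDOW_END = date(2022, 8, 31)
EXPECTED_MIN, EXPECTED_MAX = 60, 90


def clock_check(collection_date, n_mutations):
    """Return 'ok', 'too few', 'too many', or 'outside window'."""
    if not (WINDOW_START <= collection_date <= WINDOW_END):
        return "outside window"
    if n_mutations < EXPECTED_MIN:
        return "too few"
    if n_mutations > EXPECTED_MAX:
        return "too many"
    return "ok"


# Hypothetical samples: (name, collection date, mutations relative to the reference)
samples = [
    ("sample_A", date(2022, 7, 10), 74),
    ("sample_B", date(2022, 6, 21), 112),
    ("sample_C", date(2022, 8, 2), 41),
]
for name, collected, n_mut in samples:
    print(name, clock_check(collected, n_mut))
```

A sample reported as "too few" or "too many" is not necessarily wrong; it is a prompt to review the data quality checks described in this guide.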
What could be wrong with my tree? Data quality checks
If you notice that something does not look right with your tree or you observe something unprecedented, you should review the quality of your data. Although quality control checks on consensus genome assembly should always be done prior to uploading data to CZ GEN EPI, it is possible to miss errors that may affect your tree. Note that CZ GEN EPI automatically performs quality checks through Nextclade and will flag sequences with bad quality when you upload your data. This is a useful feature to help you avoid bad quality sequences when you select specific samples (force-include) for your tree builds. CZ GEN EPI also runs quality checks on contextual data, including the removal of sequences with < 27,000 known bases (ATGC count) and those containing too many mutations based on the SARS-CoV-2 evolutionary rate and the time of collection. However, these quality checks are not performed on samples that are force-included in your tree builds.
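Because force-included samples skip these checks, it can be worth applying the known-base filter yourself before a tree build. The sketch below counts A, T, G, and C characters and applies the 27,000 threshold mentioned above; the function names and the toy sequences are illustrative assumptions, not CZ GEN EPI code.

```python
MIN_KNOWN_BASES = 27_000  # threshold quoted above for contextual data


def known_base_count(sequence):
    """Count unambiguous A, T, G, and C characters (case-insensitive)."""
    seq = sequence.upper()
    return sum(seq.count(base) for base in "ATGC")


def passes_known_base_filter(sequence, minimum=MIN_KNOWN_BASES):
    """True when the sequence has at least `minimum` known bases."""
    return known_base_count(sequence) >= minimum


# Toy sequences (not real genomes): 28,000 and 24,000 known bases plus Ns.
mostly_complete = "ACGT" * 7000 + "N" * 1903
many_gaps = "ACGT" * 6000 + "N" * 5903

print(known_base_count(mostly_complete), passes_known_base_filter(mostly_complete))
print(known_base_count(many_gaps), passes_known_base_filter(many_gaps))
```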
You can view overall sequence QC scores on your Sample page and some QC parameters directly on Nextstrain trees (see Where can I find data quality information?). However, if you need to check on the specifics of various sequence quality parameters, you can use Nextclade. Here we explain how to use and interpret QC parameters using Nextclade (see Nextclade Quality Control for more details).
To check on the quality of your sequences using Nextclade:
- Go to Nextclade and upload a fasta file with sequences of interest.
Nextclade web interface for uploading sequences. Once you upload the sequence file(s), click "Run" to view QC report.
- View the QC report, which provides an overall QC score for each sequence. The higher the score, the worse the quality of the sequence. Each QC parameter is colored according to a quality score threshold, so poor-quality sequences are easy to spot.
The Nextclade QC report is color coded. Quality control parameters are summarized under the QC column. To view QC parameter details for a given sample, simply hover over the parameters and a popup will display QC information. In the provided example, we can easily see that sequence #5 has bad quality (highlighted in red on the table). For details regarding other metrics in the report visit Nextclade.
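When you have many sequences, it can be easier to export the Nextclade results as a table and filter them in a script than to scan the colors by eye. The following sketch reads a tab-separated export and lists every sequence whose overall status is not "good". The column names (`seqName`, `qc.overallScore`, `qc.overallStatus`, `totalMissing`, `totalNonACGTNs`) are assumed from Nextclade's tabular output and may differ between versions, and the embedded rows are made-up stand-ins for a real export.

```python
import csv
import io

# Tiny stand-in for a Nextclade TSV export; real files have many more columns.
# Column names are an assumption and should be checked against your own export.
EXAMPLE_TSV = (
    "seqName\tqc.overallScore\tqc.overallStatus\ttotalMissing\ttotalNonACGTNs\n"
    "sample_1\t2.5\tgood\t120\t0\n"
    "sample_2\t45.0\tmediocre\t1900\t3\n"
    "sample_5\t310.0\tbad\t6400\t14\n"
)


def flagged_sequences(tsv_text):
    """Return (name, score, status) for every sequence not rated 'good', worst first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = [
        (row["seqName"], float(row["qc.overallScore"]), row["qc.overallStatus"])
        for row in reader
        if row["qc.overallStatus"] != "good"
    ]
    # Higher score means worse quality, so sort in descending order.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


for name, score, status in flagged_sequences(EXAMPLE_TSV):
    print(f"{name}: overall score {score} ({status})")
```

To use it on a real export, read the file contents into a string and pass that to `flagged_sequences`.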
Flagged QC parameters will alert you about potential sequence problems. However, not all flagged sequences will result from sequencing or assembly errors as there are natural processes that may lead to unexpected results, such as a high number of mutations (see below). You can explore the following QC parameters to identify potential errors in consensus genome sequences:
- Number of unknown nucleotides (N): Unknown nucleotides in a given position are denoted with "Ns" within sequences. These N sites within genome sequences are effectively missing data because they refer to nucleotides that could not be defined through sequencing or genome assembly. If your sequence contains > 3000 Ns, this QC parameter will be flagged as "bad" by Nextstrain. You can see the specific number of Ns for a given sequence in the Nextclade table by looking at the column labeled "Ns". If you have a lot of missing data, Nextstrain may artificially place your sample closer to the root because it doesn’t have enough information to compare it with other sequences on your tree.
- Number of mixed or ambiguous nucleotides (M): Mixed sites are positions in a sequence that carry a character other than A, C, G, T, or N. They arise when more than one nucleotide is likely to occupy a given site. These sites are labeled with IUPAC ambiguity codes, which specify the nucleotides that are likely to occupy the site. You can see the number of mixed sites for a given sequence in the Nextclade table by looking at the column labeled "non-ACGTN".
The number of mixed sites in your sequence will depend on the assembly pipeline you used to assemble your genome given that different assemblers may use different thresholds. For example, CZ ID uses a 75% frequency threshold to call a nucleotide during genome assembly. If there are multiple nucleotides mapping to a given site and none of them reach 75% frequency, CZ ID will call that a mixed site. Although mixed sites could indicate co-infection with multiple virus genotypes, mixed sites are often indicators of contamination. If your sequence contains more than 10 mixed sites, this QC parameter will be flagged as "bad" by Nextstrain.
- Number of “private” mutations (P): Private mutations refer to mutations that are unique to a given sequence after it has been placed on a SARS-CoV-2 phylogeny and map to a terminal branch (i.e., mutations are not found in sublineages or descendants). These mutations could indicate errors associated with sequencing and/or consensus genome assembly. Alternatively, private mutations may be present in an unusual variant without close relatives in the tree. Sequences with many private mutations are likely to have errors and this parameter will be flagged by Nextstrain.
There are three types of private mutations:
- Reversions: Reversions occur when a given sequence shows a mutation back to the original reference sequence. You can see this in trees when closely related sequences within a clade have a mutation but the sequence mapped to the external branch has the same nucleotide(s) as the reference sequence. Reversions can occur naturally at a very low frequency, while a large number of reversions may be seen as a result of genome assembly pipelines that fill in sites with no data with the reference sequence.
- Labeled mutations: Labeled mutations are mutations that are unique to a sequence in a given clade. However, the mutation has been observed in other clades of the tree. Labeled mutations can occur naturally as homoplasies (i.e., mutations that arise independently during virus evolution due to a process other than inheritance), and can also be due to sequence errors.
- Unlabeled mutations: Unlabeled mutations are mutations that are unique to a given sequence and have not been observed in sequences placed elsewhere in the tree.
- Mutation clusters (C): Genomic regions with more than six private mutations within a 100-nucleotide stretch will be flagged as single nucleotide polymorphism (SNP) clusters. There are regions of the genome that are known to be hypervariable. If there are SNP clusters outside of these regions, this QC parameter will be flagged as "mediocre" by Nextstrain. These clusters can be caused by poor data quality, which leads to artificially long branches in your tree and falsely indicates that a given sample has gained more mutations than expected. On the other hand, mutation clusters can also be the result of recombination or immune responses and reflect real biology.
- Premature stop codons (S): Unknown premature stop codons will lead to truncated genes and could indicate sequence errors especially when considering essential genes. If there is an unknown premature stop codon detected in your sequence, this parameter will be flagged as "mediocre" by Nextstrain. If two or more are detected, this parameter will be flagged as "bad".
- Frameshifts (F): Premature stop codons may result from frameshifts caused by inserted or deleted nucleotides relative to the reference sequence. Frameshifts that do not cause premature stop codons may alter predicted translation products in other ways. Some viruses are known to use frameshifting to increase protein coding capacity. However, if your sequence has unknown or uncommon frameshifts, this parameter will be flagged (QC score of 75 per frameshift detected). See Types of frameshifts and when to fix them for more information about real vs artifactual frameshifts.
- Number of gaps: Gaps are defined by Nextstrain as deletions relative to the reference sequence. Although these may represent real sequence changes that lead to variants, gaps may also be introduced through sequence errors.
- Number of insertions: Insertions are defined by Nextstrain as the number of inserted nucleotides relative to the reference sequence. Although these may represent real sequence changes that lead to variants, insertions may also be introduced through sequence errors.
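Several of the parameters above are simple counts that you can reproduce on a consensus sequence as a sanity check. The sketch below counts Ns and mixed sites and looks for SNP clusters in a list of private mutation positions, using the thresholds quoted in the list (more than 3000 Ns, more than 10 mixed sites, more than six private mutations within 100 nucleotides). The function names are hypothetical, and the sliding-window logic is a simplified illustration rather than Nextclade's actual implementation; in particular, it does not exclude known hypervariable regions.

```python
MAX_MISSING = 3000        # more than this many Ns is flagged "bad" (from the text)
MAX_MIXED = 10            # more than this many mixed sites is flagged "bad"
CLUSTER_WINDOW = 100      # window length in nucleotides
CLUSTER_MAX_PRIVATE = 6   # more than six private mutations in a window is a cluster


def count_missing_and_mixed(sequence):
    """Return (number of Ns, number of mixed sites) in a consensus sequence."""
    seq = sequence.upper()
    missing = seq.count("N")
    # Any letter other than A, C, G, T, or N is treated as an IUPAC ambiguity code.
    mixed = sum(1 for base in seq if base.isalpha() and base not in "ACGTN")
    return missing, mixed


def find_snp_clusters(private_positions, window=CLUSTER_WINDOW, limit=CLUSTER_MAX_PRIVATE):
    """Return (start, end) pairs where more than `limit` positions fall within `window` nt."""
    positions = sorted(private_positions)
    clusters = []
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window until it spans fewer than `window` nucleotides.
        while pos - positions[left] >= window:
            left += 1
        if right - left + 1 > limit:
            start = positions[left]
            if clusters and start <= clusters[-1][1]:
                clusters[-1] = (clusters[-1][0], pos)  # extend an overlapping cluster
            else:
                clusters.append((start, pos))
    return clusters


# Toy inputs (not real data)
n_missing, n_mixed = count_missing_and_mixed("ACGTNNRYacgtn-")
print(n_missing, n_mixed, n_missing > MAX_MISSING, n_mixed > MAX_MIXED)
print(find_snp_clusters([100, 110, 120, 130, 140, 150, 160, 5000]))
```

Private mutation positions are not available from the sequence alone; they come from placing the sequence on a tree, so in practice you would take them from the Nextclade results.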
Get more details using the alignment viewer
On the right-hand side of the Nextclade QC table you can see an alignment viewer. This is a useful feature that can help you pinpoint problematic regions in your sample. See here for how to use this information to check data quality using BAM files.
Nextclade results highlighting the alignment viewer on the right-hand side of the table.
Where can I see sample quality information?
As described in the section above, Nextclade is a very useful tool that provides detailed QC parameters. You can also view sample quality information using other tools, including:
- Nextstrain trees
You can view QC information directly on your Nextstrain tree by enabling tree coloring by QC parameters. This will allow you to spot potentially problematic samples. To easily view QC metrics, you can color the tree by missing data, mixed sites, reversion mutations, and/or potential contaminants. For more details about how to view trees in Nextstrain and enable coloring click here.
Targeted tree colored by missing data (refers to the number of unknown nucleotides or Ns). The tree can also be colored by other QC metrics found under the "Color By" dropdown menu on the left-hand side of the Nextstrain tree page.
- UShER
The output table summarizing phylogenetic placement results from UShER provides QC metrics, including number of Ns, mixed sites, closely related samples found in the same cluster as your sample, and number of maximally parsimonious placements.
UShER output table highlighting useful QC parameters and phylogenetic placement results. You can use results from UShER to evaluate sequence quality (number of Ns and mixed sites) and the placement of your sample. How does the placement compare with your Nextstrain tree based on the reported neighboring sample in the pre-calculated tree? Is there confidence in that placement? Look at the number of maximally parsimonious placements, which indicates the number of potentially good placements on the tree (the higher the number of placements, the less confidence in a given placement). See Phylogenetic analysis using UShER for details regarding how to run UShER and the output table.
- Consensus genome assembly reports
If you notice sequence quality issues in your sample (e.g., high number of mixed sites), you can go back to the assembly results and perform quality checks, such as reviewing the genome coverage plot and assembly report and evaluating how sequencing reads align to the reference genome using BAM files. Does read coverage support the consensus genome call for a given site of interest? For details see Building and analyzing a SARS-CoV-2 consensus genome and Data quality checks using BAM files.
Consensus genome report from CZ ID
What else could be happening if my sample quality looks good, but my phylogenetic tree looks off?
There are biological processes that might explain unexpected phylogenetic tree results. Always do your best to control for sample quality errors to help you distinguish between natural processes and data artifacts. Below we briefly discuss natural processes and scenarios that may impact your phylogenetic tree results, including recombination, co-infection, poor sampling, and host immune responses.
Recombination is the process of genetic exchange between organisms by swapping fragments of their genomes and recombining them to produce a new combination of genetic material. This is a natural and widespread process that generates diversity in viral populations. When a virus variant is the result of a recombination event it is called a recombinant.
Schematic representation of a recombination event between two virus variants whose genomes contain 3 genes. The resulting recombinant virus has genetic sequences from parental A (gene 2) and B (genes 1 and 3) viral genomes.
If you are working with recombinant samples, tree interpretation gets tricky because the assumption of common ancestry no longer applies (i.e., recombinants are not only slowly accumulating mutations relative to an ancestor, they are also exchanging pieces of genetic material).
Schematics illustrating the challenges of interpreting trees when recombinants are present. Note that tree interpretation will depend on which part of the genome you are investigating.
You need to be aware of the possibility of recombinant SARS-CoV-2 samples because this molecular process may result in mutation clusters, reversion mutations, and/or homoplasy. Mutation clusters and reversions are reported as part of Nextclade QC parameters discussed above. Homoplasy refers to mutations shared by separate lineages because these mutations result from a process other than inheritance from a common ancestor (see schematic above). If a given sample contains mutations representing homoplasies, you can see this information in your Nextstrain tree. Although homoplasies can occur naturally due to recombination and other evolutionary processes, these mutations can be indicative of sequence errors and it is important to check if the quality of the data supports your tree.
You can view information about homoplasies by hovering over a sample of interest in Nextstrain trees.
If an individual is infected with two virus variants, this co-infection may result in a high number of mixed sites when sequencing due to the presence of viruses with distinct genomes. Although co-infection of SARS-CoV-2 with other respiratory pathogens may be common, co-infection with multiple SARS-CoV-2 variants is rare. Therefore, a high number of mixed sites is often indicative of contamination. If your samples have a high number of mixed sites, check lab controls and your sequencing framework (e.g., plate map, sample concentrations, barcodes used) to rule out contamination. See “Troubleshooting mixed sites” within Troubleshooting QC issues for details.
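To see why a mixture of two genomes produces mixed sites, it helps to look at how a frequency threshold turns the reads at one position into a consensus call. The sketch below applies the 75% threshold described earlier for CZ ID: it adds bases in order of frequency until their combined share reaches the threshold, then reports a single base or the matching IUPAC code. This is a simplified illustration under that assumption, not CZ ID's actual pipeline code, and `call_site` is a hypothetical name.

```python
from collections import Counter

# IUPAC ambiguity codes for sets of two or more bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_site(read_bases, threshold=0.75):
    """Call one consensus character from the bases that reads show at a single site."""
    counts = Counter(base for base in read_bases.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N"  # no usable reads at this site
    chosen, covered = set(), 0
    # Add bases from most to least frequent until the threshold is reached.
    for base, n in counts.most_common():
        chosen.add(base)
        covered += n
        if covered / total >= threshold:
            break
    if len(chosen) == 1:
        return next(iter(chosen))
    return IUPAC[frozenset(chosen)]


print(call_site("A" * 90 + "G" * 10))  # one base dominates
print(call_site("A" * 60 + "G" * 40))  # neither base reaches 75%
```

In the second call neither base reaches the threshold alone, so the site is reported with an ambiguity code. A 60/40 split like this at many sites is the pattern expected from either co-infection or contamination, which is why lab controls are needed to tell them apart.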
If you encounter samples with a high number of unique mutations, it is possible that the mutations are the result of undetected virus circulation. This undetected circulation may allow for mutations to accumulate without being accounted for by surveillance efforts. However, samples with unique mutations should not deviate significantly from the known evolutionary rate of SARS-CoV-2. If you have samples with a high number of mutations, evaluate how they fit within a global SARS-CoV-2 clock plot.
A SARS-CoV-2 sample showing many unique mutations could be real if it represents a lineage that circulated in a population that was not well sampled. It could also represent a sample from an immunocompromised patient with a chronic viral infection. If the sequence is real, it should not be an outlier on the clock plot. In this example, the sample was real and the excess terminal branch length was likely due to poor sampling to detect the BA.5 variant.
RNA editing enzymes involved in immune responses to viral attack can create mutations within viral genomes that may lead to unexpected mutation clusters or patterns. For example, adenosine deaminases that act on RNA (ADARs) mediate adenosine to inosine (A-to-I) changes that are interpreted as A-to-G substitutions by viral replication enzymes. Cytidine deaminases of the apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (APOBEC) protein family lead to cytosine to uracil (C-to-U) conversions. Due to their genome editing capabilities, ADARs and APOBECs are thought to contribute to the evolution of various viral groups, including SARS-CoV-2 (see Potential APOBEC-mediated RNA editing of SARS-CoV-2 genomes and ADAR editing in viruses).
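If a sample carries an unexpected cluster of mutations, one quick check is to tally which substitution types it contains. The sketch below counts C-to-T changes (how C-to-U appears in sequence data) and A-to-G changes in a list of mutations written as reference base, position, new base. The function name and the input list are illustrative assumptions, and an excess of one substitution type is only a hint, not proof of enzyme-mediated editing. Editing on the opposite strand would appear as the complementary change (G-to-A or T-to-C), which this sketch does not count.

```python
import re
from collections import Counter

# Matches simple substitutions such as "C241T": reference base, position, new base.
MUTATION_PATTERN = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def editing_signature(mutations):
    """Tally substitution types that are consistent with APOBEC or ADAR editing."""
    counts = Counter()
    for mutation in mutations:
        match = MUTATION_PATTERN.match(mutation.upper())
        if not match:
            continue  # skip deletions, ambiguous calls, or malformed strings
        ref, _position, alt = match.groups()
        counts[f"{ref}>{alt}"] += 1
    total = sum(counts.values())
    return {
        "total": total,
        "C>T (APOBEC-like)": counts["C>T"],
        "A>G (ADAR-like)": counts["A>G"],
        "other": total - counts["C>T"] - counts["A>G"],
    }


# Example input for illustration only
example = ["C241T", "C3037T", "A23403G", "G28881A", "C14408T"]
print(editing_signature(example))
```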