Overview
One of the main functionalities of CZ GEN EPI is to facilitate phylogenetic tree building through Nextstrain. By easily accessing and building several types of phylogenetic trees, users are able to better understand how their samples fit within the context of pathogen dynamics inside and outside of a location of interest. Below is a description of phylogenetic tree types generated by CZ GEN EPI.
After reading this user guide, you will be able to:
- Understand ways of generating phylogenetic trees within CZ GEN EPI
- Understand the purpose of automatic tree builds
- Understand the purpose of on-demand tree builds
Generating Phylogenetic Trees
CZ GEN EPI currently generates phylogenetic trees in two ways:
- Automatically: CZ GEN EPI automatically generates an "Overview tree" that provides a broad overview of SARS-CoV-2 diversity. Automatic trees are generated on a weekly basis.
- On-demand: You, or someone within your group, can include samples of interest in "Overview", "Targeted", or "Non-contextualized" trees. On-demand trees are currently available for SARS-CoV-2 and Mpox and can be generated whenever users need them. Please note that on-demand tree building takes time (up to 12 hours).
All Nextstrain-generated trees can be accessed through the Phylogenetic Tree page. Note that phylogenetic placements made through UShER are not saved on the Phylogenetic Tree page.
Automatically generated Overview trees
SARS-CoV-2 automatic trees are built by CZ GEN EPI on a weekly basis. These routinely updated phylogenies are designed to provide a broad overview of the viral diversity in your samples and how it compares to viral diversity outside of your default location. Note that publicly available samples chosen for SARS-CoV-2 automatic overview trees are now sourced from GenBank instead of GISAID (click here for details).
Tree image highlighting different types of samples included in automatically generated phylogenetic trees.
Note the following:
- The tree will display samples automatically selected by Nextstrain, including:
- Samples from your default location: The number of samples from your location for the automatic build is limited to 2000. Therefore, subsampling will occur if there are more than 2000 samples available from your location.
- Samples outside of your location: Contextual samples that are closely related to samples in your location, including samples collected nationally and internationally, and randomly-selected samples. These automatically selected samples provide important context for understanding and interpreting the viral diversity and its evolution within your area. National samples will include samples collected within your state (maximum of 500 sequences) and the broader USA (maximum of 400 sequences). International samples will include a maximum of 100 sequences.
- The tree will include samples from the past 3 months (12 weeks) from your default location to provide better support and visualization of recent cases.
- All automatic tree builds will be named "{Group name} Contextual Recency-Focused Build" by default. However, you can edit tree names as needed through the Phylogenetic Tree page.
- Automatic trees fall within the "Overview" tree category, which is designed to better understand the overall picture of viral diversity within your location.
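The sample limits listed above can be summarized in a short sketch. This is only an illustration of the stated caps, not CZ GEN EPI or Nextstrain code: the dictionary keys, the function name, and the example counts are hypothetical, and the sketch ignores how Nextstrain decides which samples to keep.

```python
# Caps quoted in this guide for automatic Overview trees.
# The keys are hypothetical labels, not actual CZ GEN EPI or Nextstrain settings.
AUTOMATIC_TREE_CAPS = {
    "default_location": 2000,
    "state": 500,
    "usa": 400,
    "international": 100,
}


def samples_on_tree(available):
    """Return how many samples from each tier fit under the caps."""
    return {
        tier: min(available.get(tier, 0), cap)
        for tier, cap in AUTOMATIC_TREE_CAPS.items()
    }


# Hypothetical counts of samples available in each tier.
available = {
    "default_location": 3500,
    "state": 1200,
    "usa": 250,
    "international": 90000,
}
selected = samples_on_tree(available)
print(selected)
print("Total:", sum(selected.values()))
```

In this made-up example, the default location and state tiers are subsampled down to their caps, while the USA tier stays below its cap and is included in full.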
On-demand phylogenetic trees
There are three types of phylogenetic trees that you can build with SARS-CoV-2 or Mpox samples of interest: Overview, Targeted, and Non-contextualized trees. See Build on-demand trees to learn how to select samples and build each of the trees described below. Note that publicly available samples chosen for SARS-CoV-2 phylogenetic trees are now sourced from GenBank instead of GISAID (click here for details).
Overview trees
Overview trees: Purpose
On-demand Overview trees are designed to help you understand the overall picture of viral diversity for samples of interest defined by location, lineage(s), and/or time period. You can also specify additional samples of interest that fall outside of the defined location, lineage(s), and/or collection date range. Note that this contrasts with automatically generated Overview trees, where all the samples are selected by Nextstrain.
Overview trees: Included samples
The on-demand Overview trees are built around samples of interest, including your samples and public data from CZ GEN EPI and GenBank. This type of overview tree includes samples that you can select and/or define by location, lineage, and/or collection date range, genetically similar contextual samples, and randomly-selected samples. For SARS-CoV-2, this on-demand tree is similar to the CZ GEN EPI automatically generated Overview trees, but you can customize it. This on-demand tree version allows you to:
- Define a SARS-CoV-2 or Mpox sample set based on location and/or collection date range of interest. SARS-CoV-2 samples can also be defined by lineage. Note that if you don't specify location, lineage, and/or collection date range, Nextstrain will automatically select samples from your default location. If there are over 2000 samples from your location, Nextstrain will then automatically select samples through subsampling based on temporal representation.
- If you have over 2000 samples, you can specify samples to bypass subsampling and force the build to include samples of interest on the tree.
- Learn how to select samples for Overview trees here.
The randomly-selected samples on the on-demand Overview tree represent samples collected over time and space. These randomly-selected samples help represent the evolutionary history of the virus.
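To make the idea of subsampling based on temporal representation more concrete, here is a minimal sketch. It is not Nextstrain's actual subsampling logic: the function name, the choice of calendar months as time bins, and the example data are all assumptions made for illustration. It spreads picks evenly across collection months and always keeps samples that were explicitly specified, similar in spirit to the force-include option described above.

```python
import random
from collections import defaultdict
from datetime import date


def subsample_by_month(samples, cap, forced_ids=frozenset(), seed=0):
    """Pick up to `cap` samples, spreading picks across collection months.

    `samples` is a list of (sample_id, collection_date) tuples.
    Samples whose id is in `forced_ids` bypass subsampling and are always kept.
    """
    rng = random.Random(seed)
    forced = [s for s in samples if s[0] in forced_ids]

    # Group the remaining samples by (year, month) of collection.
    by_month = defaultdict(list)
    for sample in samples:
        if sample[0] not in forced_ids:
            by_month[(sample[1].year, sample[1].month)].append(sample)
    for group in by_month.values():
        rng.shuffle(group)

    picked = list(forced)
    # Take one sample per month in turn so no single month dominates.
    while len(picked) < cap and any(by_month.values()):
        for month in sorted(by_month):
            if by_month[month] and len(picked) < cap:
                picked.append(by_month[month].pop())
    return picked


# Hypothetical data: many samples in January and March, few in February.
samples = (
    [(f"jan-{i}", date(2023, 1, 1 + i % 28)) for i in range(50)]
    + [(f"feb-{i}", date(2023, 2, 1 + i % 28)) for i in range(5)]
    + [(f"mar-{i}", date(2023, 3, 1 + i % 28)) for i in range(50)]
)
picked = subsample_by_month(samples, cap=12, forced_ids={"jan-7"})
print(sorted(sample_id for sample_id, _ in picked))
```

With a real cap of 2000, the same idea applies: months with many samples are thinned out, while months with few samples are kept largely intact.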
Overview trees: Limitations
Note that tree customization will hinder an unbiased overview of the viral diversity within an area of interest. Customized Overview trees, where you select or define samples by location, lineage, and/or collection date range, are biased because the samples you choose are not evenly subsampled across time.
Targeted trees
Targeted trees: Purpose
Targeted trees facilitate outbreak investigations because they allow you to identify and examine samples most closely related to samples of interest, such as those from a potential outbreak. Compared to the Overview tree, the Targeted tree allows for a higher resolution of contextual samples by keeping as many closely related samples as possible. In contrast, contextual samples that have identical sequences are usually removed from Overview trees to allow more viral diversity.
Typical questions addressed with Targeted trees include:
- Are all of my samples part of the same outbreak?
- Did a given outbreak originate from a single introduction event?
- Has an outbreak been contained to a localized setting, or has it spread and begun circulating within the wider community?
Tree image highlighting different types of samples included in Targeted phylogenetic trees.
Targeted trees: Included samples
First, you select samples of interest, which become the focal samples for the tree (learn how to add samples to Targeted tree here). Nextstrain then adds twice as many contextual samples as focal samples, chosen for their close genetic similarity to the focal samples. You can opt to preferentially include contextual samples from a specific location. Roughly half of the contextual samples will be chosen on genetic similarity alone, regardless of collection location. The other half will represent samples collected over time from the preferred location and other regions within the same state or country.
Targeted trees also include samples that are randomly selected to represent samples collected over time and space. These randomly-selected samples help represent the evolutionary history of the virus.
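As a rough worked example of the proportions described above, the sketch below estimates how many contextual samples a Targeted tree might contain. It assumes that "twice" refers to the number of focal samples and that the contextual samples are split evenly. The function and key names are hypothetical, and randomly selected samples are left out because this guide does not state how many are added.

```python
def targeted_context_budget(n_focal):
    """Estimate contextual sample counts for a Targeted tree."""
    contextual_total = 2 * n_focal          # assumed: twice the focal samples
    by_similarity = contextual_total // 2   # any location, genetic similarity only
    by_location = contextual_total - by_similarity  # preferred location, over time
    return {
        "focal": n_focal,
        "contextual_total": contextual_total,
        "similarity_any_location": by_similarity,
        "preferred_location_over_time": by_location,
    }


budget = targeted_context_budget(25)
print(budget)
```

For 25 focal samples, this gives about 50 contextual samples, with around 25 chosen on genetic similarity alone and around 25 drawn from the preferred location and nearby regions. The actual split is approximate.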
Non-contextualized trees
Non-contextualized trees: Purpose
Non-contextualized trees are designed to provide an overview of samples of interest defined by location, lineages, and/or time period. Unlike other trees, samples of interest for Non-contextualized trees will not inform contextual samples because there are none in this tree type. Looking at specific samples of interest may help you evaluate the following:
- Do populations sampled by other groups within my location show different patterns of viral diversity than what my group has captured?
- If there is a different pattern in viral diversity sampled between groups in my area, is my group preferentially sampling certain viral lineages?
Example tree highlighting the only sample type included in Non-contextualized trees.
Non-contextualized trees: Included samples
Non-contextualized SARS-CoV-2 and Mpox trees show samples of interest based on location and/or collection date (up to 2000), including your samples and public data from CZ GEN EPI and GenBank. SARS-CoV-2 samples can also be defined by lineage. Note the following:
- If you don't specify location, lineage and/or collection date range, Nextstrain will automatically select samples from your default location. If there are over 2000 samples from your location, Nextstrain will automatically select samples through subsampling based on temporal representation.
- You can select additional samples of interest for the Non-contextualized tree build. These additional samples will be force-included in the tree regardless of the defined location, lineage, and/or collection date range (see how to add samples of interest to Non-contextualized trees here). If you select a large number of additional samples for the tree build, it may bias the proportion of lineages on the tree. This is because these additional user-selected samples are not subject to the same temporal subsampling as the Nextstrain-selected samples. Therefore, keeping user-selected samples to a minimum leads to a less biased tree.
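The effect of force-included samples on lineage proportions can be seen with a small calculation. The lineage names and counts below are made up for illustration, and the helper function is hypothetical.

```python
from collections import Counter


def lineage_shares(lineages):
    """Return the fraction of samples belonging to each lineage."""
    counts = Counter(lineages)
    total = sum(counts.values())
    return {lineage: count / total for lineage, count in counts.items()}


# Hypothetical tree contents after temporal subsampling.
subsampled = ["BA.2"] * 60 + ["XBB.1.5"] * 40
# Hypothetical user-selected samples that bypass subsampling.
forced = ["XBB.1.5"] * 50

before = lineage_shares(subsampled)
after = lineage_shares(subsampled + forced)
print("Before:", before)
print("After: ", after)
```

In this example, XBB.1.5 makes up 40% of the subsampled set but 60% of the tree once the 50 user-selected samples are added, even though nothing about the underlying population has changed.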
Non-contextualized trees: Limitations
Note that the lack of contextual data makes it impossible to make sound epidemiological inferences. Therefore, we caution against the use of non-contextualized trees for any purpose other than simply looking at viral diversity across samples of interest.