Overview
You can build three types of on-demand phylogenetic trees on CZ GEN EPI: Overview, Targeted, and Non-contextualized. The main difference between these tree types is sample composition, which in turn affects the inferences you can draw from a given tree. Below is an explanation of sample composition for each tree type.
After reading this user guide, you will be able to:
- Understand the difference in sample composition between on-demand tree types.
Sample composition for on-demand trees
A readable tree can fit only about 4000 samples. Therefore, CZ GEN EPI offers three phylogenetic tree types through Nextstrain that allow you to focus on different samples. Contextualized trees (Overview and Targeted types) show the relationships among focal samples that users are interested in (for example, samples from a given location or outbreak) and contextual samples that show how focal samples fit within sample diversity at large. Contextual samples are automatically selected by Nextstrain based on genetic similarity to focal samples. Consequently, focal sample composition affects contextual sample selection. In contrast, Non-contextualized trees show only the diversity within the samples of interest.
Contextualized trees (Overview and Targeted types) also include randomly-selected samples. Nextstrain selects these samples automatically and at random to represent samples collected over time and space, which helps represent the evolutionary history of the virus.
Graph showing the distribution of focal and contextual samples across types of on-demand trees. Note that for Overview trees, focal samples include user-selected samples (gray) and user-defined samples (light blue). These samples of interest differ in how the user chooses them. User-defined samples are defined by location, lineage, and/or collection date range of interest, whereas user-selected samples are added from the sample table and/or using sample IDs. User-selected samples are force-included regardless of defined parameters. Focal samples in Targeted trees only include user-selected samples.
The sample composition for each tree type is as follows:
Overview trees include:
- Focal samples: User-defined samples by location, lineage, and/or date range (up to 2000) and user-selected samples from the sample table and/or specified with sample IDs. If you don't specify a location, lineage, and/or collection date range, Nextstrain will automatically select samples from your default location. When you define a location, lineage(s), and/or date range of interest, Nextstrain will select all samples within these defined parameters. If there are more than 2000 user-defined samples or samples from your default location, Nextstrain will automatically subsample them based on temporal representation. All user-selected samples are included in the tree build.
- Contextual samples: Nextstrain-selected samples based on genetic similarities to the focal samples. Among the contextual samples, genetically identical sequences are removed to keep only 1 representative sequence to allow for more genetic diversity on the tree.
- Randomly-selected samples: Samples automatically selected by Nextstrain to represent virus diversity over space and time.
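The focal-sample rule for Overview trees can be illustrated with a minimal Python sketch. It is not the Nextstrain implementation: the guide only states that subsampling is "based on temporal representation", so the month-by-month round-robin used here is an assumption, and the names `subsample_by_month` and `overview_focal_samples` are hypothetical. The only values taken from the guide are the 2000-sample cap and the fact that user-selected samples are always included.

```python
import random
from collections import defaultdict
from datetime import date

MAX_USER_DEFINED = 2000  # cap on user-defined samples stated in this guide


def subsample_by_month(samples: list[tuple[str, date]],
                       cap: int = MAX_USER_DEFINED,
                       seed: int = 0) -> list[str]:
    """Return at most `cap` sample IDs, spread evenly across collection months.

    ASSUMPTION: the real temporal subsampling scheme is not described in the
    guide; a round-robin over calendar months is used here for illustration.
    """
    if len(samples) <= cap:
        return [sample_id for sample_id, _ in samples]

    rng = random.Random(seed)
    by_month: dict[tuple[int, int], list[str]] = defaultdict(list)
    for sample_id, collected in samples:
        by_month[(collected.year, collected.month)].append(sample_id)
    for ids in by_month.values():
        rng.shuffle(ids)

    chosen: list[str] = []
    months = sorted(by_month)
    # Take one sample per month in turn until the cap is reached.
    while len(chosen) < cap:
        for month in months:
            if by_month[month] and len(chosen) < cap:
                chosen.append(by_month[month].pop())
    return chosen


def overview_focal_samples(user_defined: list[tuple[str, date]],
                           user_selected: list[str],
                           cap: int = MAX_USER_DEFINED) -> list[str]:
    """User-defined samples are capped; user-selected samples are force-included."""
    focal = subsample_by_month(user_defined, cap)
    already_chosen = set(focal)
    focal.extend(s for s in user_selected if s not in already_chosen)
    return focal
```

Because user-selected samples are appended after the cap is applied, the focal set in this sketch can exceed 2000 samples, which matches the guide's statement that user-selected samples are included regardless of the defined parameters.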
Targeted trees include:
- Focal samples: User-selected samples from the sample table and/or specified with sample IDs.
- Contextual samples: Nextstrain-selected samples, numbering twice as many as the focal samples. Contextual samples are automatically selected based on close genetic similarity to focal samples. Users can opt to preferentially include contextual samples from a specific location. Roughly half of the contextual samples will be chosen based on genetic similarity alone, regardless of collection location. The other half will represent samples collected over time from the preferred location and other regions within the same state or country. All genetically identical sequences are kept in this tree.
- Randomly-selected samples: Samples automatically selected by Nextstrain to represent virus diversity over space and time.
Note (Mpox): Targeted tree builds for mpox do not include randomly-selected samples over time. Mpox targeted trees only include contextual samples that are closely related to user-selected focal samples at a 2 to 1 ratio. This allows users to build trees focused on specific samples or samples from specific clades.
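As a rough illustration of the Targeted tree proportions described above, the following Python sketch computes how many contextual samples to expect for a given number of focal samples. The function name `targeted_contextual_counts` is hypothetical, and the exact half-and-half split is an approximation of the "roughly half" wording in this guide. The number of randomly-selected samples is not stated in the guide, so it is left out.

```python
def targeted_contextual_counts(num_focal: int) -> dict[str, int]:
    """Approximate contextual sample counts for a Targeted tree.

    Based on this guide: contextual samples are twice the number of focal
    samples; roughly half are chosen by genetic similarity alone and the
    rest come from the preferred location and nearby regions.
    """
    if num_focal < 0:
        raise ValueError("num_focal must be non-negative")
    contextual = 2 * num_focal
    by_similarity = contextual // 2          # chosen regardless of location
    by_location = contextual - by_similarity  # preferred location and region
    return {
        "focal": num_focal,
        "contextual": contextual,
        "contextual_by_similarity": by_similarity,
        "contextual_by_location": by_location,
    }


counts = targeted_contextual_counts(25)
print(counts["contextual"])  # 50
```

For 25 user-selected focal samples, this gives about 50 contextual samples, about 25 of each kind.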
Non-contextualized trees include:
- User-defined samples: Defined by location, lineage, and/or date range of interest.
- User-selected samples: Selected from the sample table and/or specified with sample IDs. All user-selected samples will be on tree builds.
- Nextstrain-selected samples: If you don't specify a location, Nextstrain will automatically select samples from your default location. When you define a location, lineage(s), and/or date range of interest, Nextstrain will select all samples within these defined parameters. If there are over 2000 user-defined samples or samples from your default location, Nextstrain will automatically select samples through subsampling based on temporal representation.
Note that Non-contextualized trees are the only trees that do not contain contextual samples. Since this tree type only shows diversity within samples of interest without any contextual data, Non-contextualized trees are not suitable for making epidemiological inferences.
To minimize redundancy, CZ GEN EPI removes duplicate samples from all tree builds by matching the Public IDs to GISAID IDs (SARS-CoV-2), GenBank Isolate Names (SARS-CoV-2), or GenBank Accession IDs (Mpox). To ensure that the platform recognizes duplicated samples, users should update their sample Public IDs with GISAID or GenBank IDs whenever possible (see how to edit Public IDs).
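The ID matching described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the guide does not say which copy of a duplicated sample is kept, so this sketch assumes the uploaded copy is kept and the matching public-repository entry is dropped. The function name `remove_repository_duplicates` and the IDs in the example are hypothetical.

```python
def remove_repository_duplicates(uploaded_public_ids, repository_ids):
    """Drop repository entries whose ID matches an uploaded sample's Public ID.

    ASSUMPTION: the uploaded copy is the one kept on the tree. Samples with a
    missing Public ID cannot be matched, so their duplicates are not removed.
    """
    uploaded = {pid.strip() for pid in uploaded_public_ids if pid}
    return [rid for rid in repository_ids if rid not in uploaded]


# Hypothetical IDs, for illustration only.
repository = ["ID-A", "ID-B", "ID-C"]
uploaded = ["ID-B", None, " ID-C "]
print(remove_repository_duplicates(uploaded, repository))  # ['ID-A']
```

The sample with no Public ID (`None`) shows why updating Public IDs matters: without a GISAID or GenBank ID to match on, a duplicate cannot be detected.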