Jump to Section:
Quality control checks on consensus genome assemblies should always be performed before uploading data to CZ GEN EPI. However, it is possible to miss errors that may affect phylogenetic trees. If you observe unexpected results in phylogenetic trees it is important to perform quality checks on your data. Here we explain how to use BAM files to investigate genome quality metrics that are flagged by Nextclade.
After reading this guide, you will:
- Become familiar with BAM files
- Understand how to use Nextclade to pinpoint potentially problematic regions in consensus genomes
- Become familiar with the Integrative Genomics Viewer (IGV) interactive tool to explore read mapping data stored in BAM files
What are BAM files?
Sequence Alignment/Map (SAM) and Binary Alignment/Map (BAM) files contain information regarding where sequence reads align and how well do they match to a reference sequence. Both file types contain the same information. However, BAM files (binary format) are the compressed version of SAM files (tab-delimited text format). Here we focus on BAM files given that these files are smaller than SAM files and are commonly used to view genome alignment and mapping information.
You can view how sequencing reads map to a reference sequence using BAM files. This allows you to evaluate genome coverage and pinpoint poorly sequenced regions. This image was taken after uploading files to the Integrative Genomics Viewer (IGV) web app.
BAM files are useful to verify your genome assembly. For example, you can confirm potential SARS-CoV-2 mutations by evaluating how much sequence support there is for a given mutation. The provided example highlights a C-to-A mutation in position 10,449 (also written as “A10449C”). This mutation is well supported by sequence reads given that 100% of the reads had the same mutation.
BAM files store information for all the reads in a given dataset, such as read name, length, and quality and their mapping and alignment information against a reference sequence. BAM files are accompanied by BAM index files (BAI), which indicate where in the BAM file a specific read or set of reads can be found. BAI files do not contain any sequence data and can be thought of as a “table of contents” for the BAM file. When you are ready to view read mapping information for your consensus genome you should have two files per sample, namely “.bam” and “.bai” files.
Where can I get BAM files?
You can obtain BAM and accompanying BAI files from your genome assembly pipeline. For example, you can obtain “.bam” and “.bai” intermediate output files while generating viral consensus genomes in CZ ID. If your workflow does not provide BAM and BAI files, you can use other platforms that will generate these files, such as Samtools and Galaxy.
When should I look at read mapping information?
You should look at read mapping information if there are unexpected or questionable sites in your genome (e.g, high number of undetermined nucleotides, mixed sites, mutations, or frameshifts). You can use Nextclade to evaluate the quality of your genome. See Tree quality checks for a general description of genome quality metrics provided by Nextclade.
SARS-CoV-2 genomes are close to 30,000 bases (or 30 kb) and it may be overwhelming, and not practical, to look at how reads map to every genome site. However, Nextclade produces a quality control (QC) report that will let you evaluate if there are potential problems with your sequences and help you pinpoint where to look.
Using the QC report from Nextclade to find potential problematic regions
Below we walk you through an example of steps that can be taken to evaluate the quality of consensus genomes and find potential problematic regions using Nextclade. Let’s say that you found a SARS-CoV-2 sequence that seemed out of place in your phylogenetic tree. You need to figure out whether there are data quality issues that impact its placement on the tree. To do this, you:
- Submit the consensus genome of interest to Nextclade
Nextclade web interface for uploading sequences. Once you upload the sequence file(s), click on ‘Run’ to view QC report.
- You look at the output QC report and, sure enough, your consensus genome is flagged as “bad” due to unexpected frameshifts. See Tree quality checks and Nextclade Quality Control (QC) for details regarding the report format.
Nextclade QC report points to unexpected frameshifts in the uploaded consensus genome sequence. All other QC metrics look good.
- Although frameshifts occur naturally, they may point to sequence errors (see Types of frameshifts and when to fix them). You decide to look at the alignment viewer to get a better understanding of the problem (visit Nextclade for details regarding the alignment viewer). The frameshifts were detected in the nucleocapsid (N) and ORF9b coding regions so you decide to zoom in on those open reading frames (ORFs).The alignment viewer dropdown menu allows you to select genes of interest.
Details for frameshifts detected in nucleocapsid (N; left) and ORF9b (right) gene sequences. Gene portions affected by the frameshifts are highlighted with a red horizontal line within the alignment viewer.
Details regarding deletions and missing data (Ns) that potentially led to the frameshifts in nucleocapsid (N; left) and ORF9b (right) gene sequences. Deletions relative to the reference sequence are highlighted with black bars within the Nextclade alignment viewer.
Things to note from the alignment viewer:
- The frameshifts affected a significant portion of both genes suggesting that these genes are not functional, including the essential nucleocapsid gene. Therefore, this is probably a sequence error.
- The frameshifts were potentially caused by gaps (i.e., deletions relative to the reference sequence) and/or missing data (Ns).
- Although unexpected frameshifts were detected in two genes, both frameshifts are linked to a deletion in the same genomic region (sites 28,362 - 28,368). The frameshift impacts both genes due to their overlapping open reading frames.
Now you know that you need to check out the sequence read data that was mapped to the nucleocapsid gene (around genome position 28,360) for more details. You need to evaluate if read data supports the flagged frameshifts. Otherwise, you will need to update the consensus genome if there is a sequence error.
Using the Interactive Genomics Viewer (IGV) to view BAM files
The Interactive Genomics Viewer web app (IGV-Web) is an online interactive tool used to explore genomic data. Note that there is an IGV desktop app that you can download and install for free on your computer. See user guides for IGV desktop and IGV-Web for detailed user manuals and to stay up to date with the latest versions. Here we summarize steps to view BAM files using IGV-Web and explore genomic regions flagged by Nextclade.
To view read mapping data stored within BAM files:
- Download BAM and BAI files for the consensus genome of interest.
- Go to IGV-WebIGV-Web interface
- Select SARS-CoV-2 as your reference genome from the “Genome” dropdown menu.
Reference genome dropdown menu. Note that you can upload your own reference genome. However, the SARS-CoV-2 genome is already available within the platform.
Once you select SARS-CoV-2 as the reference genome, you will be able to see a genome ruler and annotations.
- Upload your BAM and BAI files using the “Tracks” dropdown menu.Tracks dropdown menu. Select “Local File…” to upload BAM and BAI files found on your computer.
Make sure to select two file types for a given dataset, namely “.bam” and “.bai”.
- Begin exploring read coverage and alignment along the reference genome and zoom in to view the genomic region(s) flagged by Nextclade. You can zoom in by specifying genome regions of interest using the search tool or by using the zoom sliding tool.
You will be able to see genome coverage after uploading BAM and BAI files. To view specific sequences, zoom in on genomic regions of interest using the zoom sliding tool on the right hand side of the page or by specifying a genome position range within the search tool.
- Here we continue with the example describing how to pinpoint genomic regions of interest using Nextclade from the previous section. To recap, we need to use the report from Nextclade to find the sequence of interest (around position 28,360) and evaluate if read data support frameshifts in our consensus genome.
Use the report from Nextclade to find the genomic region of interest. Here we overlaid part of the report and highlighted the region of interest in the reference sequence displayed by IGV-Web for visualization purposes. Once you find the region of interest, change the default settings to view bases by clicking on the alignment track settings dropdown menu.
- Start making your own alignment for your consensus genome (query) based on what you see. Make sure to scroll down to view more read information.
After finding the sequence of interest, start evaluating if the coverage and sequence reads support the assembled consensus genome.
- Try to pinpoint and understand differences between the original consensus and what you see through read mapping. Note if the problems are associated with the ends of sequence reads, which tend to be less reliable. During Illumina sequencing, the per base quality decreases towards the end of reads. Additionally, it is possible that not all the bases from adapter sequences are trimmed off the 3’ ends. Most assembly pipelines should trim low quality bases and adapters but how well they perform depends on the pipeline. Sequence assemblers may also be less accurate when aligning read ends if there is a deletion and only a few bases from the end span the deletion. For all these reasons, it is always good to check the alignment near read ends.
Scroll down to see more read details. If discrepancies are associated with read ends, you need to look for reads spanning the genomic region in question.
- What is the consensus for reads spanning the genomic region in question? After making your own alignment, evaluate how it compares to the original consensus sequence.
Compare your own alignment against the reference (based on read mapping) and identify if there is something that needs to be edited in the consensus genome.
- Edit your assembled consensus genome based on what you observed through read mapping.
Summary of what you learned through read mapping. You need to delete the Ns from the genomic region flagged by Nextclade.
- Upload your edited consensus genome to Nextclade and evaluate if there is any improvement on the QC report. As suspected, the frameshifts were due to sequence error.
In conclusion, the data did not support frameshifts. Edited sequence did not show frameshifts and passed all quality control metrics.
- Use your edited and corrected consensus genome for downstream analyses.