Ideally, users should make their data publicly available by submitting their sequences to GenBank. Note that you need an account with the National Center for Biotechnology Information (NCBI) to upload data to GenBank. Your lab can also create an NCBI group account so that multiple members of your group can submit sequences. Here we describe how to download sequences and metadata from your CZ GEN EPI account for GenBank sequence submissions. We also list the steps to submit the downloaded data.
After reading this guide, users will be able to:
- Download sequences and metadata from their CZ GEN EPI account
- Become familiar with GenBank consensus genome submissions
Downloading data to be uploaded into GenBank
For consensus genome submissions to GenBank, you will need a metadata file describing the source of your sequences (known as a source modifier table) and a sequence fasta file. You can easily download both files from your CZ GEN EPI account. The sequence file will be ready for your GenBank submission ‘as is’ when you download it. However, you may choose to include additional source information in your metadata file (see below). Download data files from your CZ GEN EPI account following these steps:
- Select the samples you are interested in submitting to GenBank from your Sample page and click on the ‘Download’ icon.
Click on the ‘Download’ icon on the right-hand side of the Sample page after selecting samples of interest.
- A ‘Select Download’ dialog box will appear. Select ‘GenBank Submission Template’. This will allow you to download a sequence file (fasta format) and a metadata file (tab-delimited file with ‘.tsv’ file extension).
Select ‘GenBank Submission Template’ from the ‘Select Download’ dialog box. Download your selection by clicking on ‘Download’.
If more than 999 samples are selected for download, you will see a warning message within the ‘Select Download’ dialog box indicating that the data will be downloaded in multiple files.
When 1000 samples or more are selected, the data will be split into multiple files containing data for up to 999 samples.
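The splitting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming samples are grouped consecutively; how CZ GEN EPI actually orders samples across files is not described in this guide, and the function name and sample IDs are hypothetical.

```python
from typing import List, Sequence

# Limit stated in this guide: each downloaded file holds up to 999 samples.
MAX_SAMPLES_PER_FILE = 999


def split_into_batches(
    sample_ids: Sequence[str], batch_size: int = MAX_SAMPLES_PER_FILE
) -> List[List[str]]:
    """Split sample IDs into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(sample_ids[start:start + batch_size])
        for start in range(0, len(sample_ids), batch_size)
    ]


# Hypothetical selection of 2000 samples.
batches = split_into_batches([f"sample_{n}" for n in range(1, 2001)])
print([len(batch) for batch in batches])  # [999, 999, 2]
```

Under this assumption, a selection of 2000 samples would arrive as three files, each of which can be submitted separately.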
Submitting downloaded consensus genome data to GenBank
Now that you have downloaded your data, you can submit it to GenBank ‘as is’, or you can first add further metadata (see the ‘Edit source modifier information’ step below). In this section we describe the steps to submit SARS-CoV-2 consensus genome data to GenBank. Whenever possible, genome sequences should be linked to their raw data by using the same BioProject accession number used for the Sequence Read Archive (SRA) submission. For details regarding this process, see a protocol from the Public Health Alliance for Genomic Epidemiology (PHA4GE) here. To submit SARS-CoV-2 consensus genome data to GenBank:
- Go to the GenBank Submission Portal and log in to your NCBI account.
NCBI sequence Submission Portal.
- Navigate to ‘My submissions’ and select GenBank under ‘Start a new submission’.
- Click on ‘New Submission’.
The submission process is divided into 9 sections. Follow the prompts from the Submission Portal.
Sections within the Submission Portal.
- Submission Type: Select SARS-CoV-2 for the submission type and provide a reference title (optional) for your submission.
Click ‘Continue’ after specifying your selections.
- Submitter: Enter your affiliation and contact information within the provided spaces.
Click ‘Continue’ after entering your information.
- Sequencing Technology: Provide requested information regarding sequencing technology, the assembly status of your sequences (should be ‘assembled’) and assembly program.
Click ‘Continue’ after providing information regarding the sequencing method, assembly status, and assembly program used.
- Sequences: Select the desired date for releasing the sequences to the public and upload the sequence fasta file you downloaded from CZ GEN EPI.
Specify when you would like to release sequences and upload your sequence file.
You should see your sequence file listed in the interface (you may delete it and upload it again if you notice something is wrong). Click ‘Continue’.
- If your sequences contain more than 10 unknown nucleotides (Ns), you will see a warning message listing the sequences with strings of Ns. You must select an option describing what those Ns represent: 1) a region of estimated length (e.g., 15 Ns that represent a gap of approximately 15 nucleotides that you were not able to sequence); OR 2) a region of unknown length. The correct choice depends on how your consensus genome assembly pipeline handles gaps.
Warning message regarding string of unknown nucleotides (Ns) within sequences.
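Before uploading, it can help to know which sequences are likely to trigger this warning. The sketch below scans FASTA text for runs of Ns. It assumes the threshold applies to consecutive Ns (more than 10 in a row), which is our reading of the warning and not a documented rule; the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {sequence ID: sequence}; the ID is the first word of the header."""
    chunks: Dict[str, List[str]] = {}
    seq_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            seq_id = words[0] if words else ""
            chunks[seq_id] = []
        elif seq_id is not None:
            chunks[seq_id].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def find_n_runs(sequence: str, min_length: int = 11) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of N runs of at least min_length."""
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


def report_n_runs(fasta_text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Map each sequence ID that has long N runs to the positions of those runs."""
    report = {}
    for seq_id, sequence in parse_fasta(fasta_text).items():
        runs = find_n_runs(sequence)
        if runs:
            report[seq_id] = runs
    return report
```

To use it, read the downloaded fasta file into a string and pass it to `report_n_runs`. Knowing where the runs fall makes it easier to decide whether they represent gaps of estimated or unknown length.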
- Sequence Processing: Indicate if you would like GenBank to automatically remove poor quality sequences from your submission.
Click ‘Continue’ after making your selection.
- Source Information: Indicate whether the sequence IDs listed in your fasta file represent isolates. Isolate names are the codes or descriptions used in your laboratory to track individual samples. Select ‘Yes’, because CZ GEN EPI automatically uses sequence IDs as isolate names. If you don’t want to use sequence IDs as isolate names, select ‘NONE of these’ and edit the metadata file downloaded from CZ GEN EPI to specify isolate names (see the ‘Edit source modifier information’ step below).
Select ‘Yes’ to use sequence IDs in your fasta file as isolate names (already in your downloaded metadata file from CZ GEN EPI).
- Source Modifiers: Add source modifiers (a set of entries that describe the source of your sequences). There are four required source modifiers for SARS-CoV-2 sequences: collection date (YYYY-MM-DD), country (Country: State, County), host (Homo sapiens), and isolate. All of these required source modifiers are included, in the correct format, in the metadata file downloaded from CZ GEN EPI. The easiest way to add source modifiers is therefore to upload the tab-delimited metadata file you already obtained from CZ GEN EPI.
Select ‘Upload a tab-delimited table’ to add source modifiers using downloaded metadata from CZ GEN EPI.
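If you have edited the metadata file, a quick check that the four required source modifiers are still present and filled in can save a failed upload. The sketch below is a minimal example. The column names (`Collection_date`, `Country`, `Host`, `Isolate`) are assumptions, so compare them against the header row of your own file and adjust as needed.

```python
import csv
import io
import re
from typing import List, Sequence

# Assumed header names; check them against your downloaded metadata file.
REQUIRED_COLUMNS = ("Collection_date", "Country", "Host", "Isolate")
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_source_modifiers(
    tsv_text: str,
    required: Sequence[str] = REQUIRED_COLUMNS,
    date_column: str = "Collection_date",
) -> List[str]:
    """Return a list of human-readable problems found in a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in required if name not in header]
    if problems:
        return problems
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for name in required:
            if not (row.get(name) or "").strip():
                problems.append(f"row {row_number}: empty {name}")
        value = (row.get(date_column) or "").strip()
        if value and not DATE_PATTERN.match(value):
            problems.append(f"row {row_number}: {date_column} {value!r} is not YYYY-MM-DD")
    return problems
```

An empty list means the table passed these basic checks. It does not guarantee that the Submission Portal will accept the file.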
- Edit source modifier information as needed. Note that you can download a source modifier template from the Submission Portal and add information manually. However, this is not necessary given that the required information is already in the metadata file downloaded from CZ GEN EPI. If you have BioProject and/or BioSample accession numbers for sequences, you can add this information by editing the CZ GEN EPI metadata file. To do this, simply import the metadata file as text into Excel (or copy the information into Excel) or use any text editor to add the information. You may also add columns for additional source modifiers listed here.
Once ‘Upload a tab-delimited table’ is selected, you will be able to download a source modifier template table.
Comparison between source modifier template downloaded from the Submission Portal and metadata file from CZ GEN EPI.
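As an alternative to editing the file by hand, accession numbers can be added with a short script, which avoids spreadsheet software reformatting other columns. The sketch below is one possible approach. The column names `Sequence_ID`, `BioProject` and `BioSample` are assumptions to verify against the source modifier template, and the accession values in the usage comment are placeholders, not real records.

```python
import csv
import io
from typing import Dict, Tuple


def add_accession_columns(
    tsv_text: str,
    accessions: Dict[str, Tuple[str, str]],
    id_column: str = "Sequence_ID",  # assumed name of the ID column
) -> str:
    """Return tab-delimited text with BioProject and BioSample columns filled in.

    accessions maps a sequence ID to a (BioProject, BioSample) pair.
    Rows without an entry keep any value they already had, or stay empty.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or [])
    if id_column not in fieldnames:
        raise KeyError(f"column {id_column!r} not found in header")
    for extra in ("BioProject", "BioSample"):
        if extra not in fieldnames:
            fieldnames.append(extra)

    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=fieldnames, delimiter="\t",
        lineterminator="\n", extrasaction="ignore",
    )
    writer.writeheader()
    for row in reader:
        bioproject, biosample = accessions.get(row[id_column], ("", ""))
        row["BioProject"] = bioproject or row.get("BioProject") or ""
        row["BioSample"] = biosample or row.get("BioSample") or ""
        writer.writerow(row)
    return output.getvalue()


# Usage with placeholder accessions:
# updated = add_accession_columns(original_text, {"seq1": ("PRJNA000000", "SAMN00000000")})
```

Writing the returned text to a `.tsv` file produces the tab-delimited file that the Submission Portal expects.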
If you are working with Excel, make sure the dates are in the correct format (YYYY-MM-DD).
If you are copying data into Excel, make sure to keep dates in the correct format.
If you open the metadata file in Excel make sure the date format is set to YYYY-MM-DD. You can select the required format from the ‘Format Cells’ dropdown menu.
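If a spreadsheet program has already rewritten your dates, they can be converted back to YYYY-MM-DD with a small helper like the one sketched below. The list of input formats is an assumption: it treats slash-separated dates as month-first (US style), so adjust `CANDIDATE_FORMATS` if your spreadsheet uses a day-first locale.

```python
from datetime import datetime
from typing import Sequence

# Assumed input formats, tried in order; slash dates are read as month/day/year.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")


def to_iso_date(value: str, formats: Sequence[str] = CANDIDATE_FORMATS) -> str:
    """Convert a date string to YYYY-MM-DD, or raise ValueError if no format matches."""
    text = value.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


print(to_iso_date("3/5/2021"))  # 2021-03-05
```

Raising an error on unrecognized values is deliberate: an ambiguous date is better reviewed by hand than silently guessed.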
- Save your edited file as a tab-delimited text file and upload it to the Submission Portal.
Upload the metadata file with source modifiers.
You should see your metadata file listed in the interface (you may delete it and upload it again if you get any errors that need to be corrected). Click ‘Continue’.
- References: Enter information regarding authors for the sequencing effort and any publications discussing the uploaded sequence data.
Click ‘Continue’ after adding author and publication information.
- Review & Submit: Review all the information and submit.
After reviewing all the information and looking over the GenBank Record Preview for your sequences on the right-hand side of the page, click ‘Submit’ to finish your submission.