Why has CZ GEN EPI made the transition from GISAID to GenBank data?
Since 2/26/23, CZ GEN EPI has been unable to access GISAID’s SARS-CoV-2 repository. We still have a cache of the SARS-CoV-2 samples published on GISAID before that date, but we cannot pull in samples published on GISAID after 2/26/23 to use as focal and contextual samples in phylogenetic trees. As a result, phylogenetic trees built between 2/26/23 and 4/13/23 contain only GISAID samples published no later than 2/26/23. To include up-to-date samples from public repositories, we switched to GenBank as our data source on 4/13/23. GenBank is the second leading open-access database of SARS-CoV-2 data and contains millions of SARS-CoV-2 samples.
How does this change impact me?
After 4/13/23, the publicly available samples chosen for your SARS-CoV-2 phylogenetic trees are sourced from GenBank instead of GISAID. From that date, you can only force-include GenBank samples on your trees, using GenBank isolate names in the format USA/CA-CZB-0000/2021. This change does not impact Mpox trees, which already source public samples from GenBank.
Trees built before 4/13/23 are not impacted by this change and still contain data from GISAID. The tree subtitle shows whether a tree uses GISAID or GenBank data.
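If you keep a list of identifiers to force-include, a quick local check can flag entries that do not look like the isolate name format above before you build a tree. The Python sketch below is illustrative only: the pattern (three slash-separated parts ending in a four-digit year) is an assumption inferred from the single example USA/CA-CZB-0000/2021, the function names are hypothetical, and this is not CZ GEN EPI's own validation.

```python
import re

# Assumed pattern, inferred only from the example USA/CA-CZB-0000/2021:
# exactly three slash-separated parts, the last one a four-digit year.
ISOLATE_NAME = re.compile(r"[^/]+/[^/]+/\d{4}")


def looks_like_isolate_name(name: str) -> bool:
    """Return True if the name matches the assumed three-part format."""
    return ISOLATE_NAME.fullmatch(name.strip()) is not None


def split_by_format(names):
    """Split names into (matching, non_matching) lists, trimming whitespace."""
    matching, non_matching = [], []
    for name in names:
        cleaned = name.strip()
        if looks_like_isolate_name(cleaned):
            matching.append(cleaned)
        else:
            non_matching.append(cleaned)
    return matching, non_matching
```

For example, `split_by_format(["USA/CA-CZB-0000/2021", "hCoV-19/USA/CA-CZB-0000/2021"])` places the first name in the matching list and the second, which has an extra leading segment, in the non-matching list. A name that passes this check is not guaranteed to exist in GenBank.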
Since we have a cache of previous GISAID samples, will those still end up on trees or will the focal and contextual samples on my tree only come from GenBank?
Trees built before 4/13/23 contain only publicly available samples from GISAID, together with the samples you uploaded to CZ GEN EPI. After 4/13/23, the publicly available samples on your phylogenetic trees come only from GenBank, so cached GISAID samples will not appear on new trees. This is because it is against GISAID’s terms of use to intermix GISAID samples with samples from other databases.
What is the difference between samples available on GISAID vs GenBank?
GISAID is the leading public repository for SARS-CoV-2 data, receives submissions from all over the world, and contains roughly 25% more samples than GenBank. GenBank, being a database located in the US, has more US-focused SARS-CoV-2 samples. GenBank still contains a large amount of international data, since it routinely ingests data from its International Nucleotide Sequence Database Collaboration (INSDC) partners, DDBJ and EMBL-EBI.
One important difference is that GISAID was established to foster data sharing by protecting data with stricter access policies, so its data requires a login to access. Data in GenBank is truly open access and in the public domain.
Will CZ GEN EPI be able to source sample data from GISAID again?
As of right now, we have no estimated date for when we’ll be able to pull data from GISAID again. We would like to provide GISAID data for contextual tree builds and will continue to try to work with GISAID on this issue. We will update users if and when anything changes.
What should I do if I have questions or concerns?
Please reach out to us at hello@czgenepi.org, and we will do our best to answer your questions and address your concerns.