Exercise 3: Download bulk genomic datasets for the pathogen
To perform advanced bioinformatic analysis for your research, you may need to find and download high-quality sequence data for your pathogen or a group of related organisms. The NCBI Datasets resource provides a simplified user interface to access genomic data by taxon. Upon selection of a taxonomic level, you can browse a table view of associated genomic metadata and quickly select your choice of entries to download as the metadata table; an annotation file; genomic, transcriptomic or protein sequence files; or all of these at once in a single compressed package file.
From your pathogen's NCBI taxonomy page you will explore a hierarchical view of the genomic assembly data available for taxa relevant to your organism. Upon selection of a taxonomic level, you will explore the assembly statistics, select the assemblies of interest, and download bulk genome, transcriptome and proteome datasets for further research.
Key NCBI Resource for this Exercise
NCBI Datasets is a developing resource that lets you easily gather genome, transcriptome and proteome sequence data and its metadata from across NCBI databases. Use it if you want to:
- quickly download genome, transcriptome and/or proteome sequences or metadata (such as for Measles morbillivirus, Listeria monocytogenes or Candida auris)
- get a very large dataset (for example, all of the almost 8 million SARS-CoV-2 genomic sequences)
NOTE: This is a relatively new resource which is still in development. The team is looking for feedback on what it does well, what could work better, and what helpful things could be added. Let us know!
Your Turn: Create and download relevant genomic datasets for the pathogen for future studies!
Search NCBI Datasets and display a taxonomic hierarchy view of genomic assemblies for the pathogen.
1. Return to the NCBI Datasets Taxonomy page that you found in Exercise 2.
If you need it, you can click here to get to a link for the pathogen record page.
Identified viral isolate
Identified bacterial isolate
Identified fungal isolate
Don't forget: You may need to search with [Candida] auris, not Candida auris. (explanation)
2. Click on the Browse Taxonomy button to see the taxonomic hierarchy view for your organism along with the number of genomic assemblies that NCBI currently has at each level.
- Mouse-over the name to see a pop-up of the scientific taxon. Please note that we include more levels than the standard classification levels: kingdom, phylum, class, order, family, genus and species.
- The number on the right next to the taxonomic node name indicates the number of genomic assemblies that NCBI houses for organisms at and below that level.
- What do you note about the distribution of genomic assemblies for your pathogen?
Note: You can adjust and even add additional specific taxa to the search box at the top. This way you can identify and compare the number of genomic assemblies and even create your own custom datasets with information beyond those in the explicit taxonomic lineage for your pathogen. For example, here's a link to all 3 pathogens in that view!
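To make "at and below that level" concrete, here is a minimal Python sketch of how such a count rolls up a hierarchy. The tiny tree and the assembly numbers are invented for illustration only; they are not real NCBI counts, and the function name is hypothetical.

```python
from typing import Dict, List

# Hypothetical toy hierarchy: parent taxon -> child taxa (illustration only).
CHILDREN: Dict[str, List[str]] = {
    "Listeriaceae": ["Listeria"],
    "Listeria": ["Listeria monocytogenes", "Listeria innocua"],
}

# Made-up numbers of assemblies assigned directly to each taxon (NOT real NCBI counts).
DIRECT: Dict[str, int] = {
    "Listeria": 1,
    "Listeria monocytogenes": 5,
    "Listeria innocua": 2,
}


def assemblies_at_and_below(taxon: str) -> int:
    """Count assemblies assigned to the taxon itself plus all of its descendants."""
    total = DIRECT.get(taxon, 0)
    for child in CHILDREN.get(taxon, []):
        total += assemblies_at_and_below(child)
    return total


if __name__ == "__main__":
    for name in ("Listeriaceae", "Listeria", "Listeria monocytogenes"):
        print(name, assemblies_at_and_below(name))
```

In this toy example the family and the genus both report 8 because the family has no assemblies of its own, which is the same pattern you may see when a higher node in the browser shows the same number as the level below it.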
Explore the Genome assembly table, customize your selection and download a file containing genomic data.
Note again: You can adjust and even add additional specific taxa to the search box at the top of this table, just like you were able to do on the hierarchy view. For example, here's a link to all 3 pathogens in that view!
- You may notice a green check-mark icon next to entries at the top of the list. It marks the reference genome(s), which tend to be of higher sequence and/or annotation quality.
- Click on the Select Columns button to see additional information that you can display.
- Which data would you like to use for selecting particular assemblies?
- Click Filters (above the table) to open a panel that will help you narrow the list based on criteria important to you, such as assemblies that have Annotations or are of a particular Assembly Level. You can also quickly restrict the list to Reference genomes.
- Use the check box at the top of the table to select all rows or check off rows for specific genomic assemblies of interest.
- THEN, click the Download button. You have the choice of downloading the metadata table OR creating your own dataset package!
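If you later want to repeat a similar selection outside the browser, the NCBI Datasets command-line tool offers a comparable download. The Python sketch below only assembles the command as a list of arguments and does not run it. The helper name `build_download_command` and its parameters are hypothetical; the flags shown (`--reference`, `--annotated`, `--assembly-level`, `--include`, `--filename`) reflect the `datasets` tool's options as understood at the time of writing, so confirm them with `datasets download genome taxon --help` for your installed version.

```python
import shlex
from typing import List, Optional, Sequence


def build_download_command(
    taxon: str,
    include: Sequence[str] = ("genome",),
    reference_only: bool = False,
    annotated_only: bool = False,
    assembly_levels: Optional[Sequence[str]] = None,
    filename: str = "ncbi_dataset.zip",
) -> List[str]:
    """Assemble (but do not run) a `datasets download genome taxon` command.

    Flag names are assumptions based on the CLI's help text; verify them
    against the version of the tool you have installed.
    """
    cmd = ["datasets", "download", "genome", "taxon", taxon]
    if reference_only:
        cmd.append("--reference")  # mirrors the Reference genomes filter
    if annotated_only:
        cmd.append("--annotated")  # mirrors the Annotations filter
    if assembly_levels:
        # mirrors the Assembly Level filter, e.g. ["complete", "chromosome"]
        cmd += ["--assembly-level", ",".join(assembly_levels)]
    cmd += ["--include", ",".join(include), "--filename", filename]
    return cmd


if __name__ == "__main__":
    command = build_download_command(
        "Listeria monocytogenes",
        include=("genome", "gff3", "protein"),
        reference_only=True,
        filename="listeria_reference.zip",
    )
    # Print a shell-quoted version that could be pasted into a terminal.
    print(shlex.join(command))
```

Returning a list rather than a single string keeps taxon names that contain spaces intact if you later hand the command to `subprocess.run`.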
What is available?
Learn more about the files.
NCBI Datasets Genome Data Package Contents
Dataset catalog: a list of each data file contained within or referenced by the package, along with the content type and location for each.
Annotation files in the following formats: Genome GBFF, Genome GFF3, or Genome GTF
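Once a package has been downloaded, a quick way to check what it contains is to list the files in the zip archive. The following is a minimal Python sketch using only the standard library. The function name `summarize_package` is hypothetical, and the sketch makes no assumption about the package's internal layout; it simply tallies whatever files are present by extension.

```python
import sys
import zipfile
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict


def summarize_package(zip_path: str) -> Dict[str, int]:
    """Count the files in a downloaded data package by file extension."""
    counts: Counter = Counter()
    with zipfile.ZipFile(zip_path) as package:
        for member in package.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(member).suffix or "(no extension)"
            counts[suffix] += 1
    return dict(counts)


if __name__ == "__main__":
    # Usage: python summarize_package.py path/to/your_package.zip
    if len(sys.argv) > 1:
        for suffix, n in sorted(summarize_package(sys.argv[1]).items()):
            print(f"{suffix}\t{n}")
```

Comparing these counts with the number of assemblies you selected is a simple sanity check that the sequence and annotation files you asked for actually arrived.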
Take-away message!
- Search in the NCBI Datasets resource and "Browse Taxonomy".
- From the NCBI Datasets Taxonomy browser, click on a number to explore the information, filter the retrieved genomic assemblies, and even download the table!
- Select the assemblies of interest and download a Data Package containing the specific genomic, transcriptomic or proteomic sequence or metadata you want.
For more advanced work...
- NCBI Datasets How-To Guides
- Download NCBI Datasets' command-line tool
- Help with NCBI Datasets APIs and Python or R libraries
- Intermediate-level, command-line virtual workshop featuring EDirect & the NCBI Datasets CLI:
-
- NCBI videos on YouTube:
- NCBI Minute Webinar: "Using NCBI Datasets for Downloading Sequence and Annotation for Genomes and Genes" (June 30, 2021)
- NCBI Minute Webinar: "Using NCBI Datasets Command-line Tools to Access Data and Metadata for Genomes" (September 22, 2021)
- Quick Tutorial: "Easy Access to NCBI Data with NLM's NCBI Datasets!" (July 20, 2022)
Last Reviewed: July 26, 2023