
Exercise 3: Download bulk genomic datasets for the pathogen

Task:  Download bulk genomic datasets for the pathogen and related organisms.

To perform advanced bioinformatic analysis for your research, you may need to find and download high-quality sequence data for your pathogen or a group of related organisms. The NCBI Datasets resource provides a simplified user interface to access genomic data by taxon. After selecting a taxonomic level, you can browse a table view of the associated genomic metadata and quickly select the entries you want to download as the metadata table; an annotation file; genomic, transcriptomic or protein sequence files; or all of these at once in a single compressed package file.

From your pathogen's NCBI taxonomy page you will explore a hierarchical view of the genomic assembly data available for taxa relevant to your organism. After selecting a taxonomic level, you will explore the assembly statistics, select the assemblies of interest, and download bulk genome, transcriptome and proteome datasets for further research.




Key NCBI Resource for this Exercise

NCBI Datasets is a developing resource that lets you easily gather genome, transcriptome and proteome sequence data and its metadata from across NCBI databases.

Use NCBI Datasets if you want to:
  • quickly download genome, transcriptome and/or proteome sequences or metadata (such as for Measles morbillivirus, Listeria monocytogenes or Candida auris)
or
  • get a very large dataset (for example, all of the almost 8 million SARS-CoV-2 genomic sequences)

NOTE: This is a relatively new resource which is still in development. The team is looking for feedback on what it does well, what could work better, and what helpful things could be added.
Let us know!




Your Turn: Create and download relevant genomic datasets for the pathogen for future studies!




Search NCBI Datasets and display a taxonomic hierarchy view of genomic assemblies for the pathogen.

  1. Return to the NCBI Datasets Taxonomy page that you found in Exercise 2.

    If you need it, you can click here to get to a link for the pathogen record page.

Alternatively, search the NCBI Datasets resource (https://www.ncbi.nlm.nih.gov/datasets) with your pathogen's name. If you need a reminder:
  • Identified viral isolate: Measles morbillivirus
  • Identified bacterial isolate: Salmonella enterica
  • Identified fungal isolate: Candida auris

Don't forget: You may need to search with [Candida] auris, not Candida auris.


2.  Click on the Browse Taxonomy button to see the taxonomic hierarchy view for your organism along with the number of genomic assemblies that NCBI currently has at each level.
    • Mouse over the name to see a pop-up of the scientific taxon. Please note that we include more levels than the standard classification levels: kingdom, phylum, class, order, family, genus and species.
    • The number on the right next to the taxonomic node name indicates the number of genomic assemblies that NCBI houses for organisms at and below that level.
  • What do you note about the distribution of genomic assemblies for your pathogen?
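
The "at and below that level" counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how NCBI computes the numbers: the taxon names, parent links and counts below are made-up placeholders, and `rolled_up_counts` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical lineage (child -> parent) and per-node assembly counts.
# These are made-up placeholder values, not NCBI data.
PARENT = {
    "species A": "genus X",
    "species B": "genus X",
    "genus X": "family F",
    "family F": None,
}
DIRECT_COUNTS = {"species A": 3, "species B": 2, "genus X": 1, "family F": 0}


def rolled_up_counts(parent, direct):
    """Return, for each taxon, the assemblies at that node plus all of its descendants."""
    totals = defaultdict(int)
    for taxon, count in direct.items():
        node = taxon
        # Walk up the lineage, adding this taxon's count to itself and every ancestor.
        while node is not None:
            totals[node] += count
            node = parent.get(node)
    return dict(totals)


counts = rolled_up_counts(PARENT, DIRECT_COUNTS)
for name in ("family F", "genus X", "species A", "species B"):
    print(name, counts[name])
```

In this toy lineage the family and genus totals are the same because every assembly sits at or below the genus; a large gap between two adjacent levels would suggest that sibling taxa contribute many assemblies.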



Note: You can adjust the search box at the top and even add more specific taxa to it. This way you can identify and compare the number of genomic assemblies and even create your own custom datasets with information beyond that in the explicit taxonomic lineage for your pathogen.  For example, here's a link to all 3 pathogens in that view!




Explore the Genome assembly table, customize your selection and download a file containing genomic data.

1.  Click on the number next to the taxonomic level you are interested in. This will take you to a Genome assembly table view showing some statistics for all of those genomic assemblies.

Again note: You can adjust and even add additional specific taxa to the search box at the top of this table, just like you were able to do on the hierarchy view. For example, here's a link to all 3 pathogens in that view!
    • You may notice green check-mark icons at the top of the list.  These indicate the reference genome(s), which tend to be of higher sequence and/or annotation quality.
    • Click on the Select Columns button to see additional information that you can display.
  • Which data would you like to use for selecting particular assemblies?



2. Scan the Genome assembly table view which displays some statistics for each of those genomic assemblies. You are also able to filter the rows to focus your dataset.
    • Click Filters (above the table) to open a panel which will help you narrow the list based on the criteria that matter to you, such as assemblies that have Annotations or are of a particular Assembly Level. You can also quickly limit the list to Reference genomes.
    • Use the check box at the top of the table to select all rows or check off rows for specific genomic assemblies of interest.
    • THEN, click the Download button.  You have the choice of downloading the metadata table OR creating your own dataset package!
What type of data is of interest to you?
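
If you download the metadata table, the same kind of filtering can be repeated offline. The sketch below is one possible approach using only the Python standard library; the column names ("Assembly Accession", "Assembly Level", "Annotation Name") and the sample rows are assumptions for illustration, so check the header of your own file before reusing it.

```python
import csv
import io

# Made-up rows standing in for a downloaded metadata table (TSV).
# Column names and accessions are assumptions, not real NCBI records.
SAMPLE_TSV = (
    "Assembly Accession\tAssembly Level\tAnnotation Name\n"
    "GCF_000000001.1\tComplete Genome\tExample annotation\n"
    "GCA_000000002.1\tScaffold\t\n"
    "GCA_000000003.1\tComplete Genome\t\n"
)


def filter_assemblies(tsv_text, level=None, annotated_only=False):
    """Return accessions matching an assembly level and, optionally, having an annotation."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in rows:
        if level is not None and row["Assembly Level"] != level:
            continue  # wrong assembly level
        if annotated_only and not row["Annotation Name"].strip():
            continue  # no annotation recorded for this assembly
        kept.append(row["Assembly Accession"])
    return kept


print(filter_assemblies(SAMPLE_TSV, level="Complete Genome"))
print(filter_assemblies(SAMPLE_TSV, level="Complete Genome", annotated_only=True))
```

For a real file, replace `io.StringIO(tsv_text)` with an open file handle; the filtering logic stays the same.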

What is available?

NCBI Datasets Genome Download menu with data options


Learn more about the files.

NCBI Datasets Genome Data Package Contents

Dataset catalog: a list of each data file contained within or referenced by the package, along with the content type and location for each.
Genome data table: metadata in tab-separated format (TSV), one row per genomic assembly
Genome data report: metadata in JSON Lines format, one line per genomic assembly
Sequence data report: metadata in JSON Lines format, one line per nucleotide sequence
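
Because the data reports use JSON Lines, each line can be parsed on its own as a JSON object. The following is a minimal sketch of reading such a report in Python; the two sample records and the field names (`accession`, `assemblyInfo`, `assemblyLevel`) are assumptions for illustration and should be checked against the report in your own package.

```python
import json

# Two made-up records standing in for a genome data report.
# The field names here are assumptions, not a documented schema.
SAMPLE_JSONL = (
    '{"accession": "GCF_000000001.1", "assemblyInfo": {"assemblyLevel": "Complete Genome"}}\n'
    '{"accession": "GCA_000000002.1", "assemblyInfo": {"assemblyLevel": "Scaffold"}}\n'
)


def read_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = read_jsonl(SAMPLE_JSONL)
for rec in records:
    # .get() avoids a KeyError if a record lacks one of the assumed fields.
    level = rec.get("assemblyInfo", {}).get("assemblyLevel", "unknown")
    print(rec.get("accession"), level)
```

Reading line by line also means a very large report can be processed without loading the whole file into memory.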

Sequence Files in the FASTA format
  • Genomic FASTA
  • Transcript FASTA - mRNAs for eukaryotes or coding sequences for viruses or prokaryotes.
  • Protein FASTA
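
A FASTA file is plain text: each record starts with a ">" header line, followed by one or more lines of sequence. As a rough sketch of how the downloaded sequence files could be inspected without extra libraries, the hypothetical `read_fasta` helper below yields header and sequence pairs; the sample records are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration only.
SAMPLE_FASTA = """>seq1 made-up example
ACGTAC
GT
>seq2 made-up example
MKV
"""

lengths = {h.split()[0]: len(s) for h, s in read_fasta(SAMPLE_FASTA.splitlines())}
print(lengths)
```

For a real file, pass an open file handle instead of `SAMPLE_FASTA.splitlines()`; a dedicated library such as Biopython is a reasonable alternative for larger projects.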

Annotation Files in the following formats: Genome GBFF, Genome GFF3, or Genome GTF
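
As an example of working with one of these annotation formats, GFF3 is a tab-separated text format with nine columns per feature line, where the third column holds the feature type. The sketch below counts feature types in a small made-up GFF3 snippet; `count_feature_types` is a hypothetical helper and the sample features are not taken from any real annotation.

```python
from collections import Counter

# Made-up GFF3 content for illustration only.
SAMPLE_GFF3 = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1\t900\t.\t+\t.\tID=gene1\n"
    "chr1\texample\tmRNA\t1\t900\t.\t+\t.\tID=rna1;Parent=gene1\n"
    "chr1\texample\tCDS\t10\t300\t.\t+\t0\tID=cds1;Parent=rna1\n"
    "chr1\texample\tgene\t1200\t2000\t.\t-\t.\tID=gene2\n"
)


def count_feature_types(text):
    """Count how often each feature type (column 3) appears in GFF3 text."""
    counts = Counter()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) == 9:  # only well-formed feature lines
            counts[cols[2]] += 1
    return counts


feature_counts = count_feature_types(SAMPLE_GFF3)
print(dict(feature_counts))
```

A quick count like this can help confirm that a downloaded annotation file contains the feature types you expect before starting a longer analysis.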







Take-away message!

Looking for a set of genomic assemblies for a taxon or several?
    • Search in the NCBI Datasets resource and "Browse Taxonomy".
Interested in accessing a table containing lots of metadata for different genomic assemblies?
    • From the NCBI Datasets Taxonomy browser, click on a number to explore the information, filter the retrieved genomic assemblies, and even download the table!
Use the Genome assembly table to download genomic data from multiple genomic assemblies (from similar or diverse organisms!).
    • Select those assemblies of interest and Download a Data Package containing the specific genomic, transcriptomic or proteomic sequence or metadata you want.


For more advanced work....

Last Reviewed: July 26, 2023