Exercise 2: Learn about the pathogen's genome, download and visualize genomic data

Task: Find a genome record for the pathogen and access related genomic sequence data

To perform useful bioinformatic analysis for your research, it is imperative to find and download high-quality sequence data. NCBI accepts nucleotide sequence data from research labs all over the world, but we also have a curation group that uses this data, along with published information, to create high-quality reference sequences and datasets for use in computational work.

You will search for a genome record for the pathogen and learn more about its genomic assembly, sequence and annotation. You will also learn how to visualize and explore this data, and how to download the genome, transcriptome and proteome sequences for further research.

Background

Where do the nucleotide sequences at NCBI come from?

Primary Nucleotide Sequence Repositories of the International Nucleotide Sequence Database Collaboration (INSDC) (https://www.insdc.org/):

These databases accept, store and share "primary sequences" - those nucleotide sequences that have been identified and submitted by researchers (submitters), who still “own” them.

The databases contain records of varying quality for both their metadata (descriptive information) and the sequences.
They also sometimes contain highly redundant sets of data, provided independently by anywhere from a few to hundreds of research labs.

NCBI has developed the Reference Sequence (RefSeq) Project (https://www.ncbi.nlm.nih.gov/refseq/), which creates and curates high-quality nucleotide and protein sequences from submitted data, supplemented with information from peer-reviewed, published literature. The data is produced, and therefore “owned” and updated, by NCBI.

The project aims to:

- Provide reference standards
- Represent all molecules in the central dogma
    - Eukaryotes: genomic, mRNA & ncRNA, proteins
    - Prokaryotes and Viruses: genomic, ncRNA & protein (no mRNA records)

RefSeq records have distinct, recognizable accessions that begin with a two-letter prefix followed by an underscore (_):

genomic: NC_, AC_, NG_, NZ_
RNA: NM_, NR_, XM_, XR_
protein: NP_ (YP_), XP_, *WP_
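The prefix list above can be turned into a quick lookup. This is only a minimal sketch: the function name is hypothetical, the mapping covers just the prefixes listed here, and the accessions in the loop are used purely as example strings.

```python
# Prefix-to-molecule mapping copied from the list above (not a complete
# catalog of every RefSeq prefix).
REFSEQ_PREFIXES = {
    "NC_": "genomic", "AC_": "genomic", "NG_": "genomic", "NZ_": "genomic",
    "NM_": "RNA", "NR_": "RNA", "XM_": "RNA", "XR_": "RNA",
    "NP_": "protein", "YP_": "protein", "XP_": "protein", "WP_": "protein",
}


def refseq_molecule_type(accession):
    """Return 'genomic', 'RNA' or 'protein' for a listed RefSeq prefix, else None."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)


# Example strings only; the last one has no "prefix and underscore" pattern.
for acc in ["WP_004199234.1", "NC_000913.3", "U00096.3"]:
    print(acc, "->", refseq_molecule_type(acc))
```

An accession without one of these prefixes returns None, which is a handy first hint that the record is probably a primary (submitter-owned) sequence rather than a RefSeq record.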

*NOTE on the WP accession.

A developing issue in GenBank and now in RefSeq: We have over 200,000 RefSeq bacterial assemblies - many of them have identical protein sequences - producing redundant, redundant, redundant protein records.
A Solution: Make one copy of a “shared protein sequence” to link all annotations in the Identical Proteins Group database.
For example: The gene carbapenem-hydrolyzing class A beta-lactamase is annotated on more than 4,700 bacterial genomic assemblies. Its encoded protein is included in the Identical Proteins Group (IPG) database and reported with the accession: WP_004199234.1, MULTISPECIES Taxonomic Group carbapenem-hydrolyzing class A beta-lactamase KPC-2 [Bacteria]
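To get a feel for the "one copy of a shared protein sequence" idea, here is a toy sketch that groups records by exact sequence identity. It is an illustration only, not how NCBI builds the Identical Proteins Group database; the function name, assembly labels, protein IDs and sequences are all hypothetical.

```python
import hashlib
from collections import defaultdict


def group_identical_proteins(records):
    """Group (assembly, protein_id, sequence) records by exact sequence identity."""
    groups = defaultdict(list)
    for assembly, protein_id, sequence in records:
        # Normalize case and whitespace so identical sequences hash identically.
        normalized = sequence.strip().upper()
        digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        groups[digest].append((assembly, protein_id))
    return dict(groups)


# Hypothetical toy records: the first two carry the same protein sequence.
records = [
    ("assembly_A", "prot_1", "MSLYRRLVLLSCLSWPLAG"),
    ("assembly_B", "prot_2", "mslyrrlvllsclswplag"),
    ("assembly_C", "prot_3", "MKKTAIAIAVALAGFATVAQA"),
]
groups = group_identical_proteins(records)
for digest, members in groups.items():
    print(digest[:12], len(members), members)
```

Each group would then need only one stored sequence, with the member list pointing back to every assembly that annotates it.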

What are genomic assemblies and what can you find at NCBI?

The genome of an organism consists of a set of one or more chromosomes. NCBI's Genome Assembly Model supports a single "assembly" record which lists metadata about the genomic construct as well as a collection of links to individual chromosome sequence records. Assemblies available at NCBI are either created by the research community and submitted to GenBank (assigned accessions beginning with GCA_) or created/curated with in-house annotation pipelines by the NCBI RefSeq Project Team (with accessions beginning with GCF_).
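If you work with lists of assembly accessions, the GCA_/GCF_ prefix tells you where each one came from. The sketch below assumes the usual accession layout of a prefix, nine digits and a version number; the function name is hypothetical and the accessions in the loop are placeholders, not real assemblies.

```python
import re

# Assumed layout: GCA_ or GCF_, nine digits, a dot, then a version number.
ASSEMBLY_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an assembly accession into source, number and version, else None."""
    match = ASSEMBLY_RE.match(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    source = "GenBank (submitted)" if prefix == "GCA" else "RefSeq (NCBI-curated)"
    return {"source": source, "number": number, "version": int(version)}


# Placeholder accessions for illustration only.
for acc in ["GCF_000000000.1", "GCA_000000000.2", "NC_000000.1"]:
    print(acc, "->", parse_assembly_accession(acc))
```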

Currently, NCBI has almost 1.8 million genome assemblies for over 144,500 species. For your own organism-of-interest, which one might you prioritize if you need to focus on one to start?

Viruses may have one or more reference genomes per species; assemblies are chosen based on the designated exemplar(s) of the International Committee on Taxonomy of Viruses (ICTV).

Prokaryotes may have one or more reference or representative genomes per species.

- Reference genomes are selected based on assembly and annotation quality, existing experimental support, and recognition as a community standard (ex: Escherichia coli str. K-12 substr. MG1655) or of clinical importance (ex: Escherichia coli O157:H7 str. Sakai or Mycobacterium tuberculosis H37Rv).
- Representative genomes are assigned to type strain assemblies if there is no current reference genome or if one other than the reference is scientifically significant and exhibits strong sequence diversity as compared to the assigned reference genome(s) (ex: Mycobacterium avium subsp. paratuberculosis K-10 or Streptococcus thermophilus JIM 8232).

Eukaryotes (incl. fungi & helminths) - no more than one reference or representative genome per species.

- Reference genomes are selected based on assembly and annotation quality, existing experimental support, and recognition as a community standard or of clinical importance (ex: Aspergillus fumigatus Af293).
- If there are no reference assemblies for a particular eukaryotic species, then RefSeq will select a Representative genome from the highest quality GenBank assembly (ex: Schistosoma mansoni ASM23792v2).

For more information: https://www.ncbi.nlm.nih.gov/assembly/help/

Key NCBI Resources for this Exercise

Now undergoing retirement (but the data is still here!):

NCBI Genome database - a catalog of species-level genome-specific information. It includes information and links to related data as curated by the RefSeq team and generated by annotation pipelines. The RefSeq Group handles viral & bacterial genome data differently than eukaryotic organisms, such as fungi and humans.

NCBI (Genome) Assembly database - a repository of genome sequence assemblies with information about submitters, statistics and links to the actual sequences.

NCBI Datasets has just recently added two new page types to display the above data!

Taxonomy pages - list information about the organism and related data at NCBI, as well as some summary information about a reference or representative genome.

Genome pages - display more detailed information about a particular genomic assembly and its annotations, including statistics, a way to visualize the genome and a mechanism to download sequence datasets.

Your Turn: Learn about the pathogen's genome!

Use the name of your patient's pathogen to begin your search following the steps below.

If you need a hint on what organism you found:

Identified viral isolate: Measles morbillivirus

Identified bacterial isolate: Salmonella enterica

Identified fungal isolate: Candida auris

Don't forget: You may need to search with [Candida] auris, not Candida auris.

Find the Genome record for your pathogen

1. Use the Search menu pull-down selector for Genome, or go directly to the Genome database homepage (https://www.ncbi.nlm.nih.gov/genome), and begin typing the name of the pathogen into the search text box.
  As you type, you'll see "autocomplete" display some names to help you!
  (This is based on information in our Taxonomy database - which we'll discuss in exercise 3.)
2. Click on the name of the pathogen you are looking for.
  You may get directly to a record page for an organism or to a list of possible organisms. If you get the list, click on the name of the one you'd like to focus on.
  If you need it, you can click here to get to a link for the pathogen record page.

There are 3 sections within this NCBI Datasets Taxonomy page:

- About the organism
- A summary of information for the organism's genome
- Links to this organism's data in other NCBI databases

To begin, let's focus on the Genome section.

NOTE: For those of you familiar with the old Genome & Assembly database records, they are still available for a time. You can access the old Genome page and link to related Assembly records with:
View the legacy Genome page.

3. In the Genome section, there is a link to browse all available genomes as well as some summary information about a particular one.

How many genomic assemblies exist for your pathogen (at the species and sub-species-level) at NCBI? (We will examine and discuss these in Exercise 3).

Is there a Reference or Representative genome indicated for your pathogen? What is the name and the accession for that particular assembly?

Download: You can quickly get metadata and various sequence datasets for this particular genomic assembly!

Learn more about the reference or representative genomic assembly

To learn more about the pathogen's genomic assembly, click on the reference or representative assembly's name to go to the NCBI datasets genome page.

If you need it, you can click here to get to pathogen genome page.

There are 6 sections within this Genome page:

- Summary
- Assembly statistics
- Assembly methods
- Assembly details
- Chromosomes - to access the sequence records, visualize or download a metadata table
- Revision history - to see if/when the page has been updated

What can you learn about this particular genomic assembly?

Quickly download sequence datasets for your reference or representative genomic assembly

At the top of this Genome page is a very convenient Download button!
It gives you all sorts of options for what you can get for this particular genomic assembly.

What kind of data might you want to get?

Learn more about the files.

NCBI Datasets Genome Data Package Contents

Dataset catalog: a list of each data file contained within or referenced by the package, along with the content type and location for each.
Genome data table: metadata in tab-separated format (TSV), one row per genomic assembly
Genome data report: metadata in JSON Lines format, one line per genomic assembly
Sequence data report: metadata in JSON Lines format, one line per nucleotide sequence
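JSON Lines simply means one JSON object per line, so the reports can be read one record at a time. The sketch below is a minimal reader under that assumption; the field names and values in the sample are placeholders, so check a real report for the actual keys before relying on them.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record report; real data reports use their own field names.
sample_report = io.StringIO(
    '{"accession": "GCF_000000000.1", "organism": "Example organism A"}\n'
    '{"accession": "GCA_000000000.2", "organism": "Example organism B"}\n'
)
reports = list(read_json_lines(sample_report))
for report in reports:
    print(report["accession"], report["organism"])
```

To read a downloaded report instead, pass an open file handle (for example, `open(path, encoding="utf-8")`) in place of the in-memory sample.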

Sequence Files in the FASTA format

Genomic FASTA
Transcript FASTA - mRNAs for eukaryotes or coding sequences for viruses or prokaryotes.
Protein FASTA
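Once you have a FASTA file, a quick sanity check is to count the records and their lengths. Here is a minimal sketch of a FASTA reader using only the Python standard library; the function name and the sample records are hypothetical, and a dedicated library would be a better fit for serious work.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records; sequence lines may wrap across several lines.
sample_fasta = io.StringIO(
    ">seq1 hypothetical record\nACGTAC\nGTAA\n>seq2 hypothetical record\nMKT\n"
)
fasta_lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(sample_fasta)}
print(fasta_lengths)
```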

Annotation Files in the following formats: Genome GBFF, Genome GFF3, or Genome GTF
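The GFF3 annotation file is plain tab-separated text with nine columns, so you can look up a gene by name without any special software. This is a minimal sketch assuming the standard GFF3 column layout and a `Name` attribute on gene features; the function names, sequence ID, coordinates and IDs in the sample lines are made up for illustration.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = unquote(value)  # GFF3 percent-encodes special characters
    return {
        "seqid": cols[0], "type": cols[2],
        "start": int(cols[3]), "end": int(cols[4]),
        "strand": cols[6], "attributes": attributes,
    }


def find_features_by_name(lines, name, feature_type="gene"):
    """Return features of the given type whose Name attribute matches exactly."""
    hits = []
    for line in lines:
        feature = parse_gff3_line(line)
        if (feature and feature["type"] == feature_type
                and feature["attributes"].get("Name") == name):
            hits.append(feature)
    return hits


# Hypothetical annotation lines; coordinates and IDs are placeholders.
gff3_lines = [
    "##gff-version 3",
    "seq_example\tRefSeq\tgene\t100\t2700\t.\t-\t.\tID=gene-0001;Name=gyrA",
    "seq_example\tRefSeq\tCDS\t100\t2700\t.\t-\t0\tID=cds-0001;Parent=gene-0001",
    "seq_example\tRefSeq\tgene\t3000\t4200\t.\t+\t.\tID=gene-0002;Name=geneB",
]
hits = find_features_by_name(gff3_lines, "gyrA")
for hit in hits:
    print(hit["seqid"], hit["start"], hit["end"], hit["strand"])
```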

Explore genome annotations in a graphical viewer

NCBI has two main biomolecular sequence viewers that we use for visualization and exploration of nucleotide or protein sequences and locations of types of additional annotations. These run on the same basic code base and have similar functionality.

- Any nucleotide or protein sequence, including viral or bacterial genomic sequences or a eukaryotic nucleotide sequence, can be explored in the NCBI Sequence Viewer. This is available for any sequence record by clicking on "Graphics" or "Graphical View".
- For eukaryotic genomes, we've expanded the types of data that can be accessed, displayed and compared with the NCBI Genome Data Viewer (a.k.a. GDV).

There's a lot you can do with these viewers, and I've put some links at the end of this page to video tutorials as well as to another workshop dedicated to working with them.

So for today, we'll just show you how to load your genome into one of these viewers and begin to play!

1. On the NCBI Datasets Genome page, scroll down to the Chromosome section.
  Here you can see each of the individual nucleotide chromosome records and their accessions as well as a little metadata for each chromosome.

How many chromosomes are listed for your pathogen?

Are there any you are surprised at seeing or that you would have expected to be there but are not?

2. For the viral or bacterial pathogen's main chromosome, look all the way on the right under Action and click on the vertical 3 dots - then click on "Graphical view".

For the fungal pathogen you'll note that there are 7 chromosomes listed.

For now, let's focus on chromosome 1. In that row, look all the way on the right under Action and click on the vertical 3 dots - then click on "Graphical view".

Note: You could also have clicked on an accession to go to the corresponding Nucleotide sequence record, and then clicked on "Graphics" to get to the same display!

As fungi are eukaryotes, you could also have clicked on Genome Data Viewer to use that particular browser. This enables you to search the entire genomic assembly and jump back and forth between different chromosomes.
Feel free to try this one on your own!

3. The viewer enables you to zoom in/out, pan & explore annotation tracks of different data types in the context of the chromosomal location.

Some things you can do with this genome browser:

- Use the slider to zoom in to see the annotations more clearly, click and drag in the ruler bar to zoom into a region, or search with an annotated accession or the name of a gene to zoom in directly to it. Try this!
    - for the viral pathogen: try searching with H (this encodes hemagglutinin - promotes viral infectivity by binding and aiding in cellular entry).
    - for the bacterial pathogen: try searching with gyrA (this encodes DNA gyrase - regulates DNA replication and when mutated confers resistance to the antibiotic ciprofloxacin).
    - for the fungal pathogen: try searching with CJI96_0001351 (this encodes FKS1 - a β-1,3-glucan synthase that, when mutated, confers resistance to the antifungal caspofungin).

- Click the colorful button to quickly toggle display of the gene product (ex: protein) bars.
- Place your cursor over a displayed bar and a pop-up window with annotation and links to more information about that feature will appear.
- There’s a whole lot of documentation and several tutorials to help you learn how to use this viewer. To start, click on the “?” icon in the upper right-hand corner of the viewer.

We could spend ALL day playing with the sequence viewer....
There are other workshops that help you learn to use this
and lots of video tutorials on YouTube!
see below

Take-away message!

Need a good quality genome sequence or dataset?

- Search in the Genome database for a Reference or Representative genome assembly record.

To download genomic datasets for the pathogen....

- Simply click on the blue Download button!

Visualize and explore a pathogen's genome assembly with annotations!

- Viral and bacterial genomes can be Graphically viewed in the Sequence Viewer
- Eukaryotic chromosomes (such as fungal sequences) can be Graphically viewed either in the Sequence Viewer or the full assembly can be displayed in the Genome Data Viewer.

For more advanced work....

Working with genome browsers

NCBI YouTube Tutorial Playlist for Sequence Viewer
NCBI YouTube Tutorial Playlist for Genome Data Viewer
NCBI Virtual Workshop material for: "Using NCBI's Genome Data Viewer to Visualize Eukaryotic Genome Data" (May 26, 2022)

Annotate & submit to NCBI your own genome data!

Prokaryotic Annotation Pipeline (PGAP) - this is available through a web interface, can be downloaded from GitHub or used in "the cloud"!
Viral Annotation DefineR (VADR) - an application for viral annotations, downloadable from GitHub or usable in "the cloud".
NCBI Submission Portal - the best place to go to learn about and begin to submit data to NCBI!

Working with APIs or accessing data via command-line or scripting? Try these!

Entrez Programming Utilities (EUtils) - the NCBI-wide set of APIs for accessing and downloading NCBI data
Entrez Direct (EDirect) - the NCBI-wide command-line tool for accessing and downloading NCBI data
NCBI Datasets - a new, quick dataset resource with its own command-line tool as well as programming utilities (APIs and Python- & R-related resources)
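As a small taste of scripting against E-utilities, the sketch below only builds an ESearch request URL and does not contact NCBI. The helper name is hypothetical, and the choice of the `assembly` database and the `[Organism]` search field is an assumption for illustration; see the E-utilities documentation for the full parameter list and usage guidelines.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Build an ESearch URL for a database and query term (no request is sent)."""
    params = {"db": database, "term": term, "retmode": "json", "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = build_esearch_url("assembly", "Salmonella enterica[Organism]")
print(url)
```

You can paste the printed URL into a browser to see the JSON response, or fetch it from a script once you have read NCBI's usage guidelines.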

Last Reviewed: July 27, 2023