Telling Data-Driven Stories Using NCBI Analysis and Visualization Tools

Using clear and compelling visuals to communicate research results is a key part of the scientific process. NCBI has a number of tools for both interactive exploration of data and creation of custom images. In this workshop we will demonstrate how they can be used for visualization and analysis of DNA/protein sequences, whole genomes, and macromolecular structures.

Structure of today's workshop:

Overview of major NCBI visualization and analysis tools (this page, ~30 minutes):
Breakout sessions with experts (~90+ minutes):
1. Genome Exploration and Visualization
2. Sequence Comparison and Visualization
3. Biomolecular Structure Visualization

Comparative Genome Viewer

Overview

What is CGV? Comparative Genome Viewer (CGV) visualizes eukaryotic whole genome assembly-assembly alignments. You can choose from a selection of alignments that are provided by NCBI.

Why use CGV?

At a high level, it is good for generating hypotheses, that is, finding interesting patterns and differences that can be investigated further experimentally.

For example, you might want to see how a disease-associated region or gene has evolved in related species.

More specifically, use CGV to find:

- chromosome rearrangements, such as inversions
- large-scale insertions or deletions
- blocks of synteny, such as gene order conservation, between different species
- gene duplications or gene loss
- genomic changes in older and newer assemblies for the same species, or between different strains of the same species

What can CGV do?: Compare two eukaryotic assemblies

CGV includes animals, plants, and fungi

- More than 400 assemblies...and growing!
- ~200 species...and growing!
- ~480 alignments...and growing!
- Offers different assemblies for the same organism where available
- Request alignments
- Download a graphics file of the current view and download alignment files
- Caveats: Does not contain mitochondrial data; assemblies must already be in GenBank

Sample common workflow:

Set up an alignment between human T2T-CHM13v2.0 and human GRCh38.p14.
Use "Find a gene in this alignment" to search for glucose 6 phosphate isomerase.
Click on the gene name under "Description" to see more details

Setting up a view:

Fill out the form on the home page to select an assembly pair with available whole genome alignment data. In steps 1 and 2, choose your species of interest, and then select assemblies in steps 3 and 4. You can start typing a species name or a common name and the form will automatically suggest available options.

cgv alignment choice menu

Once you’ve selected an assembly pair, press View Comparison to show the alignment in the graphical Comparative Genome Viewer. Contact us using the link at the bottom if you’d like to request additional assembly comparisons.

Examples of some common views:

Different assemblies for the same organism:

Two Apis alignments

Link to alignment

Cross-species comparisons:

dog fox alignment

Link to alignment

View alignments as a dotplot:

fox dog dotplot

Link to view
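
To help with reading a dotplot, here is a toy sketch of the idea in Python. It is not how CGV builds its plots (CGV draws precomputed whole-genome alignments), and the function name, word size, and sequences are made up for illustration. It only shows why conserved, collinear sequence appears as a run of points along a diagonal.

```python
def dotplot_matches(seq_a, seq_b, word=3):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    # Index every word of seq_b by its start position.
    positions = {}
    for j in range(len(seq_b) - word + 1):
        positions.setdefault(seq_b[j:j + word], []).append(j)

    # Look up each word of seq_a; every hit is one dot in the plot.
    matches = []
    for i in range(len(seq_a) - word + 1):
        for j in positions.get(seq_a[i:i + word], []):
            matches.append((i, j))
    return matches


# Hypothetical sequences: identical input gives dots on the main diagonal.
diagonal = dotplot_matches("ACGTAC", "ACGTAC", word=3)
print(diagonal)
```

In a real dotplot, a segment that runs along the opposite diagonal indicates an inversion, and a break or offset between segments suggests an insertion, deletion, or translocation.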

View at different scales (i.e. gene level):

gene-level cgv options

Four ways to access sequence-level data:

Resources:

Help documentation
Relevant CGV workshop
Check out our YouTube tutorial for an overview of how to compare genomes using this tool.

Genome Data Viewer and Sequence Viewer

What is the Genome Data Viewer? GDV is a genome browser supporting the exploration and analysis of more than 3070 eukaryotic RefSeq genome assemblies. The GDV browser displays biological information mapped to a genome, including gene annotation, variation data, BLAST alignments, and experimental study data from the NCBI GEO and dbGaP databases.

In addition to being able to compare a genomic assembly with annotations from hundreds of different NCBI tracks, you can also upload data tracks produced by BLAST, import track hub data, or create tracks from your own data. This workshop focuses on eukaryotic genome data other than human, and is designed to help beginning or advanced users of GDV take advantage of the large amount of diverse genomic data available at NCBI.

Some reasons you might want to use Genome Data Viewer:

Search a genomic assembly to display a region annotated with a particular gene, phenotype, genetic variant, and more
View variation data from the EVA (European Variation Archive)
Add other sources of variant data, including your own VCF files
View non-RefSeq annotated genomes available in GDV
Display BLAST results, for example, from a human query aligned to another organism
Download a table of annotations for a region, including variants from the EVA
Share your customized view with a colleague, and download a graphic for a presentation or manuscript.
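
For readers preparing their own variant data, the sketch below writes a bare-bones VCF file in Python. The variants and the sequence name `chr1` are placeholders, not real data, and the helper name is hypothetical. Sequence names in a real file need to match those used by the assembly you are viewing.

```python
import io


def write_minimal_vcf(records, handle):
    """Write (chrom, pos, ref, alt) tuples as a minimal VCF 4.2 file."""
    handle.write("##fileformat=VCFv4.2\n")
    handle.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    # Sort by sequence name, then by position.
    for chrom, pos, ref, alt in sorted(records, key=lambda r: (r[0], r[1])):
        # Positions are 1-based; "." marks a missing value.
        handle.write(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\n")


# Placeholder variants for illustration only.
example_records = [("chr1", 2000, "G", "A"), ("chr1", 1500, "C", "T")]
buffer = io.StringIO()
write_minimal_vcf(example_records, buffer)
vcf_text = buffer.getvalue()
print(vcf_text)
```

To produce a file for upload, pass an open file handle instead of the in-memory buffer.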

Finding Organisms and Genome Assemblies

The GDV home page allows you to find and select organisms and assemblies available to view in the browser. You can search directly for common or scientific names using the search box, or click on nodes in the tree to explore different taxa. Enter gene names, dbSNP ids, phenotypes, assembly components/scaffolds, or sequence accessions into the search box on the assembly information panel. Examples of searches relevant to the selected organism are shown below the box to assist you in constructing queries. You can provide location information as a range, point, or cytogenetic band.

genome data viewer organism selection tree

Below is an overview of GDV that highlights each of the main page elements. The left sidebar contains a series of widgets that provide tools for manipulating the display. The center of the page contains an instance of the NCBI Sequence Viewer, where tracks and track data are visualized. This GDV browser view can be customized in several ways. The caret arrow can be used to show or minimize each of the widgets on the left sidebar. These widgets can also be re-ordered by pressing within the header to select, and then dragging and dropping the widget into a desired location in the sidebar.

Tracks and User Data

One of the most powerful features of the Genome Data Viewer is the ability to load data to view on top of a genome assembly, such as:

Stored results from a BLAST search
NCBI-hosted tracks such as gene expression and sequence variation
Data from the Sequence Read Archive (SRA)
Data from Track Hubs, like GENCODE, and 8k+ others from the Track Hub Registry
User data, such as your own gene expression library mapped to the reference genome

NCBI-hosted track configuration options: Some available public Track Hubs:
NCBI track hub options

Sequence Viewer

NCBI Sequence Viewer is a graphical sequence display tool that is found on many NCBI pages, including NCBI Gene and Nucleotide record pages. It is also the core graphical component of the NCBI genome browsers, e.g. Genome Data Viewer. Sequence Viewer provides a linear graphical representation of features annotated or aligned to individual sequence accessions. You can use the pan arrows and zoom slider on the left side of the Sequence Viewer toolbar to navigate within this viewer; changes in the displayed range will be automatically propagated elsewhere in the genome browser, including to the Chromosome Region Selector and the Exon Navigator.

You can access NCBI Primer-BLAST, set markers, download sequence and track data, generate a PDF or SVG image, and more from within the Sequence Viewer toolbar and context menus. The Tracks button (in the upper right of the toolbar) allows you to configure the display with tracks provided by NCBI or to add custom tracks and BLAST request IDs. You can even embed the sequence data viewer into your own project!

BLAST and Related Viewers

NCBI’s BLAST is one of the most widely used bioinformatics tools in the world. It was designed to use a nucleotide or protein sequence to search and quickly find similar sequences in very large databases. While originally designed for studying evolutionary relationships, BLAST is now often used to identify the gene name and/or source organism of a sequence, related sequences from other organisms (homologs), and the location of a sequence within a larger reference sequence, such as a chromosome or genome.

What is BLAST?

BLAST is an acronym for Basic Local Alignment Search Tool
- Pub: Altschul, S F et al. “Basic local alignment search tool.” Journal of molecular biology vol. 215,3 (1990): 403-10. PMID: 2231712
A tool for searching databases of biological sequences, DNA or protein, using a sequence as a query
Matches and aligns regions of biological sequences: DNA-DNA, Protein-Protein
Programs:
- protein query, protein database
  - blastp, psi-blast
- nucleotide query, nucleotide database
  - megablast, blastn
- translating searches (useful for unannotated sequences)
  - protein query, nucleotide database: tblastn
  - nucleotide query, protein database: blastx
  - translated nucleotide query, translated nucleotide database: tblastx
BLAST is the most widely used sequence similarity search tool in the world
Web interfaces to databases at NCBI and many sites around the world

Standalone tool and BLAST-ready databases available for download from NCBI
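
The program list above can be written as a small lookup. The Python sketch below is illustrative only: the function names, file path, and database name are hypothetical. The second helper assembles an argument list for the standalone BLAST+ tools using the `-query`, `-db`, `-evalue`, and `-outfmt` options, and it does not run anything.

```python
def choose_blast_program(query_type, database_type, translate=False):
    """Pick a BLAST program for a query/database combination.

    query_type and database_type are "protein" or "nucleotide".
    translate=True selects tblastx for nucleotide-vs-nucleotide searches.
    """
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "nucleotide"): "tblastx" if translate else "blastn",
        ("protein", "nucleotide"): "tblastn",
        ("nucleotide", "protein"): "blastx",
    }
    key = (query_type, database_type)
    if key not in table:
        raise ValueError(f"unsupported combination: {key}")
    return table[key]


def build_blast_command(program, query_path, database, evalue=1e-5):
    """Return an argument list for a standalone BLAST+ search (not executed)."""
    return [
        program,
        "-query", query_path,
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",  # tabular output
    ]


# Hypothetical file and database names.
program = choose_blast_program("nucleotide", "protein")
command = build_blast_command(program, "my_transcript.fasta", "my_protein_db")
print(" ".join(command))
```

With a local BLAST+ installation and a formatted database, the list could be passed to `subprocess.run`. That step is left out here so the sketch runs without BLAST installed.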

What does BLAST do?

Finds high scoring local alignments between two sequences (protein or DNA)
Includes a model of score distributions for random local alignments
Provides statistical significance for matches / alignments
BLAST returns non-chance similarities between biological sequences.
- If similarities are not due to chance, then they must be due to something else!
  - Evolutionary relatedness (homology)
  - Simple identification
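
The statistical significance mentioned above is usually reported as an expect value (E-value), the number of alignments with at least a given score expected by chance. The Python sketch below uses the Karlin-Altschul relations E = K·m·n·e^(−λS) and bit score S' = (λS − ln K) / ln 2. The values of λ and K, the score, and the lengths are illustrative placeholders. BLAST itself also applies corrections such as effective sequence lengths, which are omitted here.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_length, database_length, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def expect_from_bits(bits, query_length, database_length):
    """The same expectation computed from a bit score: E = m * n * 2**(-bits)."""
    return query_length * database_length * 2.0 ** (-bits)


# Placeholder statistical parameters and search-space sizes (assumptions).
LAMBDA, K = 0.267, 0.041
raw, m, n = 120, 300, 50_000_000

bits = bit_score(raw, LAMBDA, K)
e_raw = expect_value(raw, m, n, LAMBDA, K)
e_bits = expect_from_bits(bits, m, n)
print(f"bit score = {bits:.1f}, E-value = {e_raw:.3g}")
```

Both routes give the same E-value. This is why bit scores can be compared across searches that use different scoring systems, while raw scores cannot.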

How and why do people use BLAST?

1. Database searches
  1. To find homologs in other species, model organisms
    - Homology is related to function
      - A human protein that is a significant match to a yeast protein may have similar functions in both
      - Homology is not a guarantee of the same function. Conversely proteins with similar functions may not be evolutionarily related.
        
        wings of birds, forelimb of mammals (homologs with related but not (exactly) the same function)
        
        wings of butterflies vs. wings of birds (not homologs but with similar function)
  2. To identify a sequence — is this sequence already in the database and what is it?
  3. To identify or classify an organism from a sequence
    - environmental samples, organism associated metagenomes
    - specialized queries and databases: Targeted loci, barcode sequence (rRNA genes, cytochrome oxidase)
2. Alignment tool
  - To quickly align, match up positions in related sequences
3. Alignment plus database searches
  - To annotate other sequences (find genes, identify exons)
    - Aligning mRNA to genomic DNA
    - Matching proteins to genomic translations

Viewing BLAST Output:

A web-BLAST output report has multiple options for viewing your results. Two of these options that allow you to take a closer look at your results and customize your views are NCBI Tree Viewer and Multiple Sequence Alignment (MSA) Viewer. Here is a relevant snippet of the results page, a protein BLAST search using accession NP_001179.1 as input against the NR protein database:

links to tree viewer and msa viewer from blast result

Tree Viewer:

NCBI Tree Viewer (TV) is the graphical display for phylogenetic trees. TV can visualize trees in Newick, Nexus, and ASN formats. To start using Tree Viewer go to the application homepage and look at some examples and demos. Here is the tree for the BLAST search above:

tree viewer example

The following actions can be performed with a tree:

Zooming and navigation
Displaying in different layouts
Selecting branches and viewing selections
Collapsing/Expanding branches
Customizing labels
Rooting at midpoint and Re-rooting at nodes
Uploading/Downloading phylogenetic tree files
Creating PDF

For more information, check out the Frequently Asked Questions page or browse through video tutorials, web tutorials, or manuals.
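
Newick, one of the tree formats mentioned above, is plain text with nested parentheses. As a rough illustration, the Python sketch below parses a simple Newick string and lists its leaf labels. The tree and its branch lengths are made up, and the parser handles only unquoted labels and branch lengths, not quoted names, comments, or embedded whitespace.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # end of this group
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional label, then optional ":length".
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Collect leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


# Made-up example tree; branch lengths are not real measurements.
tree = parse_newick("((human:0.1,chimp:0.12):0.05,mouse:0.4);")
leaves = leaf_names(tree)
print(leaves)
```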

MSA Viewer:

The Multiple Sequence Alignment (MSA) Viewer is a web application that visualizes sequence (nucleotide or protein) alignments created by programs such as MUSCLE or CLUSTAL, including alignments from NCBI BLAST results. Users can also upload and view their own alignment files in alignment FASTA or ASN format. The MSA home page includes links to sample protein and DNA alignment sessions. We also recommend going through the Guide.

Here is what the resulting MSA view looks like for our sample BLAST result:
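
In a FASTA alignment file, each record holds one sequence padded with gap characters so that all records have the same length. The Python sketch below reads such text, checks that the lengths agree, and computes a simple pairwise percent identity. The two sequences and the function names are made up for illustration. Ignoring gapped columns is one of several conventions and is not necessarily the one MSA Viewer uses.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into {name: gapped_sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name] += line
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return records


def percent_identity(seq_a, seq_b):
    """Percent identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


# Made-up two-sequence alignment.
alignment_text = """>seq1 example protein
MKT-LLV
>seq2 example protein
MKTALIV
"""
alignment = read_aligned_fasta(alignment_text)
identity = percent_identity(alignment["seq1"], alignment["seq2"])
print(f"{identity:.1f}% identity")
```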

msa viewer result example

Examples of various other alignment styles:

Like Sequence Viewer, both Tree Viewer and MSA Viewer can be embedded into your own projects!

Viewing Structures in iCn3D

Overview

The amount of biomolecular structure data produced by researchers is growing rapidly and helping push scientific discovery. Knowledge of biomolecular structure helps scientists understand how a molecule works, and this knowledge can be used to influence function, predict binding partners, and understand biological pathways. As such, researchers in the life sciences can benefit from an increased understanding of biomolecular structure and of resources that build upon structural data, which is the focus of this workshop.

What is iCn3D?

The "I see in 3D" (iCn3D) Structure Viewer is both a web-based 3D viewer and a structure analysis tool. Analyses can be run interactively or in batch mode using Node.js scripts based on the npm package icn3d. iCn3D synchronizes the display of 3D structure, 2D interaction, and 1D sequences and annotations. A custom display can be saved as a short URL or a PNG image. Because the tool is web-based, no downloads are needed to start visualizing and learning!

iCn3D

Why do people use iCn3D? Just a few options:

Create custom 3D images for publications or educational materials
Highlight important features like active site residues, point mutations, and binding partners
Analyze the effects of genetic mutations on protein structure
Compare experimentally derived and predicted protein structures
Interactively view 3D alignments of similar structures
Incorporate iCn3D into your own pages
And MUCH, MUCH more

Some iCn3D features of interest:

iCn3D can export sharable links (https://structure.ncbi.nlm.nih.gov/icn3d/share.html?XCxR6fSTmXHxR3o1A)
iCn3D supports command-line analysis with either Python scripts (https://github.com/ncbi/icn3d/tree/master/icn3dpython) or Node.js scripts (https://github.com/ncbi/icn3d/tree/master/icn3dnode)
iCn3D can also be used in Jupyter Notebook (https://pypi.org/project/icn3dpy)
3D printing
Annotate and align AlphaFold structures
Create contact map
Precalculated symmetry
Dynamically calculated symmetry
Electron density map
EM map
Transmembrane protein
Solvent Accessible Area
VR and AR views!

There is also a simpler version of iCn3D that is easier to embed in your own website or projects.

A comparison with other tools:

table of comparisons with other tools

a: iCn3D aligns structures (PDB or AlphaFold) based on structures or sequences.
b: iCn3D sharable links can be a short URL or a URL containing the address of an iCn3D PNG image
c: iCn3D supports command-line analysis with either Python scripts or Node.js scripts
d: iCn3D can also be used in Jupyter Notebook