BLAST Scoring and Statistics

Alignment scoring

Position-independent scoring

Traditional BLAST uses position-independent scoring: the same substitution gets the same score at any position in the alignment.

Nucleotide Scoring

Nucleotide alignments use an identity scoring system: a simple match/mismatch scheme with a positive score for a match, a negative score for a mismatch, and penalties for opening and extending gaps. The image below shows how BLAST scores and represents a nucleotide alignment.
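As a rough illustration of the match/mismatch and gap scheme described above, here is a minimal Python sketch that computes a raw score for two sequences that are already aligned. The function name and the parameter values (match 2, mismatch -3, gap open 5, gap extend 2) are assumptions chosen for illustration. The actual values depend on the program and settings you select.

```python
def score_nucleotide_alignment(query, subject, match=2, mismatch=-3,
                               gap_open=5, gap_extend=2):
    """Raw score for two aligned, equal-length strings ('-' marks a gap).

    Parameter values are illustrative assumptions, not guaranteed defaults.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence currently carries a gap, if any
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_side:
                score -= gap_open      # charged once when a gap starts
            score -= gap_extend        # charged for every gapped column
            gap_side = side
        else:
            score += match if q == s else mismatch
            gap_side = None
    return score


# Six matches, one mismatch, and one single-column gap.
print(score_nucleotide_alignment("ACGTACGT", "ACGAAC-T"))  # 2
```

In this sketch a gap of length n costs gap_open + n * gap_extend, so one long gap is cheaper than several short ones covering the same number of columns.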

You can use BLAST 2 Sequences to see a megablast alignment between a human insulin transcript (NM_000207.3) and a predicted insulin transcript (XM_043971863.1) from the colocolo opossum.

Set up the search and run it by clicking the BLAST button.

The above alignment was produced by the megablast program, which is less sensitive (but faster) than blastn.

Set up the blastn search and run it.

Do you see differences in the alignment and score when using the more sensitive program?

Protein Scoring

Protein alignments use a scoring system based on the frequencies of amino acid substitutions in related proteins. The default scoring matrix is BLOSUM62, shown below. The BLOSUM series uses substitution frequencies observed in ungapped alignment blocks of related proteins. BLOSUM62 includes information up to 62% identity. Experiments have shown that this is the best general-purpose scoring system. Other available matrices for protein BLAST include several from the BLOSUM series tuned to different evolutionary distances and several from the PAM series.

The numbers in BLOSUM62 are log-odds ratios of the observed substitution frequency to the background frequency. Substitutions that occur more often than expected by chance have positive scores, those that occur less often than expected by chance have negative scores, and those that occur at the background frequency get a score of zero.
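The log-odds idea can be sketched in a few lines of Python. The function name and the frequencies below are hypothetical, chosen only to make the arithmetic easy to follow. The scaling (two times the base-2 logarithm, rounded to an integer) assumes the half-bit convention commonly described for BLOSUM62.

```python
import math


def log_odds_score(observed_pair_freq, background_a, background_b, scale=2.0):
    """Integer log-odds score for one ordered amino acid pair.

    observed_pair_freq: how often the pair is seen in trusted alignments.
    background_a, background_b: overall frequency of each amino acid.
    scale=2.0 with log base 2 gives half-bit units (an assumption here).
    """
    expected_by_chance = background_a * background_b
    return round(scale * math.log2(observed_pair_freq / expected_by_chance))


# Hypothetical frequencies: chance alone predicts 0.05 * 0.04 = 0.002.
print(log_odds_score(0.004, 0.05, 0.04))   # 2  (seen twice as often as chance)
print(log_odds_score(0.002, 0.05, 0.04))   # 0  (seen exactly as often as chance)
print(log_odds_score(0.0005, 0.05, 0.04))  # -4 (seen four times less often)
```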

It's easy to understand the BLOSUM62 scores based on amino acid chemistry and protein structure. Substitutions between amino acids with side chains of similar size and chemistry have positive scores (e.g., aspartate (D)/glutamate (E)). Those involving dissimilar side chains have negative scores (e.g., phenylalanine (F)/glutamine (Q)). Self-substitution scores are along the diagonal and, in part, reflect the abundance of the amino acids. Rare amino acids such as tryptophan (W) have relatively high scores. Common amino acids such as valine (V), leucine (L), and isoleucine (I) have lower scores. The relatively high self-scores for proline (P) and glycine (G) may be because these amino acids often have special roles in determining protein structure. Keep in mind, though, that the substitution scores in the BLOSUM matrices are based on observed frequencies, not on any predictions from amino acid properties.
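To make these patterns concrete, the sketch below stores a small hand-copied subset of BLOSUM62 entries for the pairs mentioned above and uses it to score a short ungapped segment. The dictionary covers only a handful of pairs and the helper names are hypothetical. A real search uses the full 20 x 20 matrix plus gap costs.

```python
# Small subset of BLOSUM62 (symmetric), keyed by alphabetically sorted pairs.
BLOSUM62_SUBSET = {
    ("D", "E"): 2,    # similar acidic side chains
    ("F", "Q"): -3,   # dissimilar side chains
    ("W", "W"): 11,   # rare amino acid, high self-score
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,  # common, lower self-scores
    ("P", "P"): 7, ("G", "G"): 6,                 # structurally special residues
}


def pair_score(a, b):
    """Look up a substitution score; the order of a and b does not matter."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def score_ungapped(query, subject):
    """Sum the substitution scores over an ungapped aligned segment."""
    if len(query) != len(subject):
        raise ValueError("aligned segments must have the same length")
    return sum(pair_score(q, s) for q, s in zip(query, subject))


print(pair_score("E", "D"))          # 2
print(pair_score("Q", "F"))          # -3
print(score_ungapped("DWG", "EWG"))  # 19 = 2 + 11 + 6
```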

The image below shows how BLAST scores and represents a protein alignment.

You can use BLAST 2 Sequences to see a blastp alignment between the human creatine kinase M protein sequence (NP_001815.2) and a bacterial arginine kinase protein (MCP4285491.1).

Set up the search and run it by clicking the BLAST button.

Position-dependent scoring

The position-independent scoring systems make the unrealistic assumption that every position in a protein or nucleotide sequence is equally likely to change. The position-specific scoring strategies described next do a better job of modeling real biological sequences and increase sensitivity.

Specialized protein BLAST programs such as PSI-BLAST and the Conserved Domain Database (CDD) Search (RPS-BLAST) generate or search a database of Position-Specific Scoring Matrices (PSSMs). In a PSSM, the score for a particular substitution depends on the position in the alignment. This is a better model of proteins because it can represent the fact that some amino acids are less likely to change than others: those directly involved in catalysis or in substrate, cofactor, or partner interactions, and those required for critical structural elements. PSSMs are derived from multiple sequence alignments. PSI-BLAST builds the alignment on the fly from a BLAST search, while CDD Search uses a curated database of conserved domains. PSSMs are better than ordinary BLAST at detecting distant protein relationships and can have a more direct relationship to protein structure and function.
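The following toy Python sketch shows how a position-specific matrix could be derived from an ungapped multiple alignment and why the same residue can score differently at different positions. It is a deliberately simplified illustration, not the procedure PSI-BLAST or CDD Search actually uses (those apply sequence weighting and more elaborate pseudocounts). The four-letter alphabet, uniform background, pseudocount, and function names are all assumptions.

```python
import math
from collections import Counter


def build_pssm(aligned_seqs, background, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Simplified: unweighted counts plus a background-proportional pseudocount.
    """
    n = len(aligned_seqs)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        scores = {}
        for aa, bg in background.items():
            freq = (counts[aa] + pseudocount * bg) / (n + pseudocount)
            scores[aa] = math.log2(freq / bg)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, seq):
    """Score a sequence by reading each residue's score from its own column."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Toy data: column 1 is invariant (H), column 3 is variable (A or K).
background = {"A": 0.25, "G": 0.25, "H": 0.25, "K": 0.25}
alignment = ["HGA", "HGK", "HAK", "HGK"]
pssm = build_pssm(alignment, background)

print(round(pssm[0]["K"], 2), round(pssm[2]["K"], 2))  # K is penalized in column 1, favored in column 3
print(score_with_pssm(pssm, "HGK") > score_with_pssm(pssm, "KGH"))  # True
```

The invariant first column gives H a high score and every other residue a strongly negative one, which is how a PSSM captures conserved functional or structural positions.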

You'll use PSI-BLAST for one example in this workshop. CDD search runs by default in all of our protein examples and will show any conserved domains in your protein queries.

Last Reviewed: July 3, 2023