Introduction to BLAST
David Fristrom
Bibliographer/Librarian
Science and Engineering Library
fristrom@bu.edu
617 358-4124
What is BLAST?
Free, online service from National Center for
Biotechnology Information (NCBI)
http://blast.ncbi.nlm.nih.gov/Blast.cgi
What is BLAST?
BLAST : Nucleotide/Protein Sequence Databases
as
Google : Internet
Some Uses for BLAST

Identify an unknown sequence
Build a homology tree for a protein
Get clues about protein structure by
finding similar proteins with known
structures
Map a sequence in a genome
Etc., etc.
What is BLAST?
Basic Local Alignment Search Tool
Alignment
AACGTTTCCAGTCCAAATAGCTAGGC
===--===   =-===-==-======
AACCGTTC---TACAATTACCTAGGC
Hits (+1): 18
Misses (-2): 5
Gaps (existence -2, extension -1): 1, length: 3
Score = 18 * 1 + 5 * (-2) - 2 - 2 = 4
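The scoring scheme on this slide can be sketched in a few lines of Python. The function name and the `charge_every_position` flag are hypothetical. By default the sketch assumes a gap pays the existence penalty once plus the extension penalty for each position after the first; setting the flag charges extension for every gapped position instead, which is the other common convention.

```python
def alignment_score(hits, misses, gap_lengths,
                    match=1, mismatch=-2,
                    gap_existence=-2, gap_extension=-1,
                    charge_every_position=False):
    """Score an alignment from its counts of hits, misses, and gaps."""
    score = hits * match + misses * mismatch
    for length in gap_lengths:
        # Assumption: extension applies to (length - 1) positions by default,
        # or to all `length` positions when charge_every_position is True.
        extended = length if charge_every_position else length - 1
        score += gap_existence + gap_extension * extended
    return score


# Counts from the slide: 18 hits, 5 misses, one gap of length 3.
print(alignment_score(18, 5, [3]))
print(alignment_score(18, 5, [3], charge_every_position=True))
```

The two conventions differ by one extension penalty per gap, so it is worth checking which one a given tool uses before comparing raw scores.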
Global Alignment
Compares total length of two
sequences
Needleman, S.B. and Wunsch, C.D. A
general method applicable to the search
for similarities in the amino acid sequence
of two proteins. J Mol Biol. 48(3):443-53 (1970).
Local Alignment
Compares segments of sequences
Finds cases when one sequence is a part
of another sequence, or they only match in
parts.
Smith, T.F. and Waterman, M.S. Identification of
common molecular subsequences. J Mol Biol.
147(1):195-7 (1981)
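The difference between global and local alignment shows up clearly in a small dynamic-programming sketch. This is a minimal, score-only version of each idea, not the published algorithms in full: it assumes a linear gap penalty (every gapped position costs the same) and illustrative scoring values, and the function names are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-2, gap=-2):
    """Needleman-Wunsch style: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-2, gap=-2):
    """Smith-Waterman style: best-scoring pair of segments, floor at zero."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The two sequences share only the segment ACGT.
print(global_score("GGACGTGG", "TTACGTTT"))
print(local_score("GGACGTGG", "TTACGTTT"))
```

The global score is dragged down by the mismatched flanks, while the local score reflects only the shared segment.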
Search Tool
By aligning query sequence against all
sequences in a database, alignment can
be used to search database for similar
sequences
But alignment algorithms are slow
What is BLAST?
Quick, heuristic alignment algorithm
Divides query sequence into short words,
and initially only looks for (exact) matches
of these words, then tries extending
alignment.
Much faster, but can miss some
alignments
Altschul, S.F. et al. Basic local alignment search
tool. J Mol Biol. 215(3):403-10 (1990).
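The word-matching idea can be illustrated with a toy Python sketch. It is a simplification under stated assumptions, not NCBI's implementation: words must match exactly, extension is ungapped, and the function names, word size, scores, and `x_drop` cutoff are all hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    index = defaultdict(list)
    for s in range(len(subject) - word_size + 1):
        index[subject[s:s + word_size]].append(s)
    return [(q, s)
            for q in range(len(query) - word_size + 1)
            for s in index.get(query[q:q + word_size], [])]


def extend_seed(query, subject, q, s, word_size=3,
                match=1, mismatch=-2, x_drop=4):
    """Extend a seed in both directions without gaps.

    Returns (best_score, query_start, query_end), with query_end exclusive.
    Extension stops once the score falls more than x_drop below the best seen.
    """
    def walk(i, j, step, score):
        best, best_i = score, i - step  # last position already included
        while (0 <= i < len(query) and 0 <= j < len(subject)
               and best - score <= x_drop):
            score += match if query[i] == subject[j] else mismatch
            if score > best:
                best, best_i = score, i
            i += step
            j += step
        return best, best_i

    seed_score = word_size * match
    right_best, right_end = walk(q + word_size, s + word_size, +1, seed_score)
    left_best, left_start = walk(q - 1, s - 1, -1, right_best)
    return left_best, left_start, right_end + 1


for q, s in find_seeds("AACCGTTC", "AACGTTTCCAGT"):
    print((q, s), extend_seed("AACCGTTC", "AACGTTTCCAGT", q, s))
```

Because only positions that share an exact word are ever examined, most of the database is skipped, which is where both the speed and the risk of missed alignments come from.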
What is BLAST?
BLAST is not Google
BLAST is like doing an experiment: to get
good, meaningful results, you need to
optimize the experimental conditions
Sample Search
Human beta globin (HBB)
Subunit of hemoglobin
Accession number: NP_000509

Limit to mouse to more easily show
differences between searches
Interpreting Results
Score: Normalized (bit) score of alignment,
based on substitution matrix and gap penalties.
Can be compared across searches
Max score: Score of the single best aligned
segment of a database sequence
Total score: Sum of scores of all aligned
segments of that database sequence
Interpreting Results
Query coverage: Percentage of the query
sequence that is aligned
E Value: Number of matches with a score at
least this high expected by chance. For low
values, approximately equal to p, the
probability of such a random alignment
Typically, E < 0.05 is required to be
considered significant
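The relation between E and p can be checked numerically. The sketch below assumes chance hits follow a Poisson distribution, so the probability of at least one chance hit is p = 1 - exp(-E); the function name is hypothetical.

```python
import math


def e_value_to_p(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-10, 0.001, 0.05, 1.0, 10.0):
    print(f"E = {e:g}  p = {e_value_to_p(e):.6g}")
```

For small E the two values are nearly identical, while for large E the probability saturates at 1 even though E keeps growing.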
Getting the most out of BLAST

1. What kind of BLAST?
2. Pick an appropriate database
3. Pick the right algorithm
4. Choose parameters
Step 0:
Do you need to use BLAST?
Step 1:
Nucleotide BLAST vs. protein BLAST
Largely determined by your query sequence
BUT
If your nucleotide sequence can be translated to a
peptide sequence, you probably want to do so (use a tool
such as the ExPASy Translate tool)
Protein BLAST searches are more sensitive and more
biologically meaningful
Sometimes it makes sense to use other BLAST programs
Specialized Search: blastx

Search protein database using
a translated nucleotide query
Use to find homologous proteins to a
nucleotide coding region
Translates the query sequence in all six
reading frames
Often the first analysis performed with a
newly determined nucleotide sequence
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#blastx
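Six-frame translation, as used by blastx, can be sketched in plain Python. This is an illustration only: it assumes the standard genetic code, marks stop codons with `*` and unrecognized codons with `X`, and uses hypothetical function names and a short made-up input.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; 'X' for unknown codons."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, peptide in six_frame_translation("ATGGTGCATCTGACTCCTGAGGAG").items():
    print(frame, peptide)
```

Each of the six peptides is then searched against the protein database, which is why translated searches cost more than a plain protein query.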
Specialized Search: tblastn

Search translated nucleotide database
using a protein query
Performs six-frame translation of the
nucleotide database
Find homologous protein coding regions in
unannotated nucleotide sequences such
as expressed sequence tags (ESTs) and
draft genome records (HTG)
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#tblastn
Specialized Search: tblastx

Search translated nucleotide database
using a translated nucleotide query
Both translations use all six frames
Useful in identifying potential proteins
encoded by single pass read ESTs
Good tool for identifying novel genes
Computationally intensive
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#tblastx
Even More Specialized
Make specific primers with Primer-BLAST

Search trace archives
Find conserved domains in your sequence (cds)
Find sequences with similar conserved domain architecture
(cdart)
Search sequences that have gene expression profiles (GEO)
Search immunoglobulins (IgBLAST)
Search for SNPs (snp)
Screen sequence for vector contamination (vecscreen)
Align two (or more) sequences using BLAST (bl2seq)
Search protein or nucleotide targets in PubChem BioAssay
Search SRA transcript libraries
Constraint Based Protein Multiple Alignment Tool
Step 2: Choose a Database

Too large:
Takes longer
Too many results
More random, meaningless matches
Too small or wrong one:

Miss significant matches
Protein Databases
Non-redundant protein sequences (nr)
Kitchen-sink:
Translations of GenBank coding sequences (CDS)

RefSeq Proteins
PDB (RCSB Protein Data Bank - 3d-structure)
SwissProt
Protein Information Resource (PIR)
Protein Research Foundation (Japanese DB)
Reference proteins (refseq_protein)

NCBI Reference Sequences: Comprehensive, integrated, nonredundant, well-annotated set of sequences
Swissprot protein sequences (swissprot)

Swiss-Prot: European protein database (no incremental updates)
Protein Databases
Patented protein sequences (pat)
Patented sequences
Protein Data Bank proteins (pdb)

Sequences from RCSB Protein Data Bank
with experimentally determined structures
Environmental samples (env_nr)

Protein sequences from environmental
samples (not associated with known
organism)
Nucleotide Databases
Human genomic + transcript
http://www.ncbi.nlm.nih.gov/genome/guide/human/
Mouse genomic + transcript

http://www.ncbi.nlm.nih.gov/genome/guide/mouse/
Nucleotide collection (nr/nt)

nr stands for non-redundant, but it isn't
GenBank (NCBI)
EMBL (European Nucleotide Sequence Database)
DDBJ (DNA Databank of Japan)
PDB (RCSB Protein Data Bank - 3d-structure)
Kitchen sink, but not HTGS 0, 1, 2, EST, GSS, STS,
PAT, WGS
Reference mRNA sequences (refseq_rna)
Reference genomic sequences
(refseq_genomic)
NCBI Reference Sequences: Comprehensive,
integrated, non-redundant, well-annotated set
of sequences
NCBI Genomes (chromosome)

Complete genomes and chromosomes from
Reference Sequences
Expressed sequence tags (est)
Non-human, non-mouse ESTs (est_others)
http://www.ncbi.nlm.nih.gov/About/primer/est.html
http://www.ncbi.nlm.nih.gov/dbEST/index.html
Genomic survey sequences (gss)

Like EST, but genomic rather than cDNA (mRNA)
random "single pass read" genome survey sequences.

cosmid/BAC/YAC end sequences
exon trapped genomic sequences
Alu PCR sequences
transposon-tagged sequences
http://www.ncbi.nlm.nih.gov/dbGSS/index.html
High throughput genomic sequences (HTGS)
Unfinished sequences (phase 1-2). Finished sequences are
already in nr/nt
http://www.ncbi.nlm.nih.gov/HTGS/
Patent sequences (pat)

Patented genes
Protein Data Bank (pdb)

Sequences from RCSB Protein Data Bank with
experimentally determined structures
http://www.rcsb.org/pdb/home/home.do
Human ALU repeat elements
(alu_repeats)
Database of repetitive elements
Sequence tagged sites (dbsts)

Short sequences with known locations from
GenBank, EMBL, DDBJ
Whole-genome shotgun reads (wgs)

http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
Environmental samples (env_nt)
Nucleotide sequences from environmental
samples (not associated with known
organism)
Database Options
Limit to (or exclude) an organism
Exclude Models (XM/XP)
Model reference sequences produced by
NCBI's Genome Annotation project. These
records represent the transcripts and proteins
that are annotated on the NCBI Contigs,
which may have been generated from
incomplete data.
Entrez Query
Use Entrez query syntax to limit search
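For scripted searches, the same database options can be expressed as request parameters. The sketch below only builds the form-encoded parameters for the sample search (NP_000509 limited to mouse) and does not contact NCBI. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `EXPECT`, `ENTREZ_QUERY`) are given as I understand NCBI's BLAST URL API and should be checked against the current documentation; the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the web service endpoint shown earlier in these slides, over https.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(query, program="blastp", database="nr",
                        organism=None, expect=0.05):
    """Return form-encoded parameters for submitting a BLAST search."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,        # accession number or raw sequence
        "EXPECT": expect,      # expect threshold
    }
    if organism:
        # Entrez query syntax: limit the search to one organism.
        params["ENTREZ_QUERY"] = f'"{organism}"[Organism]'
    return urlencode(params)


print(BLAST_URL + "?" + build_blast_request("NP_000509", organism="Mus musculus"))
```

Entrez operators such as `NOT` can be used in the same field to exclude an organism rather than limit to it.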
Step 3:
Choose an Algorithm
How close a match are you looking for?
Determines how similarities are scored
Affects speed of search and chance of
missing match
Again, what is the goal of the search?
blastp
Protein-protein BLAST
Standard protein BLAST
PSI-BLAST
Position-Specific Iterated BLAST
Finds more distantly related matches
Iterates: Initial search results provide
information on allowed mutations at each position;
subsequent searches use these to create a
custom position-specific scoring matrix
PHI-BLAST
Pattern Hit Initiated BLAST
Variation of PSI-BLAST
Specify a pattern that hits must match
Use when you know protein family has a
signature pattern: active site, structural domain,
etc.
Better chance of eliminating false positives
Example: VKAHGKKV
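What it means for a hit to match a pattern can be sketched with a regular expression. This example assumes the pattern is written in a small subset of PROSITE-style syntax (elements separated by `-`, `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats). The function names and the test peptide are hypothetical, and the code does not reproduce PHI-BLAST's own matching or scoring.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            regex = "."                          # any residue
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"   # forbidden residues
        else:
            regex = residue                      # literal or [allowed set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return start positions (0-based) of every pattern match in sequence."""
    regex = re.compile(pattern_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


# Illustrative peptide fragment containing the slide's example pattern.
sample_fragment = "GNPKVKAHGKKVLGAF"
print(find_pattern("V-K-A-H-G-K-K-V", sample_fragment))
print(pattern_to_regex("[LIVM]-x(2)-{P}-G"))
```

Requiring the pattern restricts the candidates before any alignment scoring happens, which is how the pattern helps remove false positives.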
megablast
Nucleotide BLAST
Finds highly similar sequences
Very fast
Use to identify a nucleotide sequence
blastn
Nucleotide BLAST
Use to find less similar sequences
discontiguous megablast
Nucleotide BLAST
Even more dissimilar sequences
Use to find diverged sequences (possible
homologies) from different organisms
Ma, B., Tromp, J. and Li, M. PatternHunter: faster and more
sensitive homology search. Bioinformatics. 18(3):440-5 (2002).
Step 4:
Algorithm Parameters
Fine-tune the algorithm
Short Queries
Expect threshold: The lower it is, the fewer
false positives (but you might miss real
hits)
Scoring Matrix:
PAM: Point Accepted Mutation
Empirically derived chance that a substitution will be accepted,
based on closely related proteins
Higher PAM numbers correspond to greater evolutionary
distance
BLOSUM: Blocks Substitution Matrix
Another empirically derived matrix, based on more distantly
related proteins
Lower BLOSUM numbers correspond to greater evolutionary
distance
Compositional adjustment changes the matrix to take into
account the overall composition of the sequence
Filters and Masking
Can ignore low complexity regions in
searching
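The idea behind masking can be shown with a toy filter based on Shannon entropy. This is not the filter BLAST itself applies; the window size, entropy cutoff, mask character, and function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="N"):
    """Replace every window whose entropy is below min_entropy with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACGTGCATGACTAAAAAAAAAAAAAAAAGCTAGCATCGAT"))
```

Masked positions are skipped when looking for initial word matches, so repetitive stretches such as poly-A runs do not generate large numbers of meaningless hits.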
Additional Sources
Pevsner, Jonathan. Bioinformatics and Functional
Genomics, 2nd ed. (Wiley-Blackwell, 2009)
BLAST help pages:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
Slides from class on similarity searching; lots of
technical details on algorithms and similarity
matrices:
http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Similarity/simsrchlast.html
