BLAST Activity
Bioinformatics
Part 1: Your first BLAST search
Below is the mRNA sequence for insulin from a South American rodent,
the Degu (Octodon degus).
>gi|202471|gb|M57671.1|OCOINS Octodon degus insulin mRNA, complete cds
GCATTCTGAGGCATTCTCTAACAGGTTCTCGACCCTCCGCCATGGCCCCGTGGATGCATCTCCTCACCGT
GCTGGCCCTGCTGGCCCTCTGGGGACCCAACTCTGTTCAGGCCTATTCCAGCCAGCACCTGTGCGGCTCC
AACCTAGTGGAGGCACTGTACATGACATGTGGACGGAGTGGCTTCTATAGACCCCACGACCGCCGAGAGC
TGGAGGACCTCCAGGTGGAGCAGGCAGAACTGGGTCTGGAGGCAGGCGGCCTGCAGCCTTCGGCCCTGGA
GATGATTCTGCAGAAGCGCGGCATTGTGGATCAGTGCTGTAATAACATTTGCACATTTAACCAGCTGCAG
AACTACTGCAATGTCCCTTAGACACCTGCCTTGGGCCTGGCCTGCTGCTCTGCCCTGGCAACCAATAAAC
CCCTTGAATGAG
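The header line of the record above uses the older pipe-delimited NCBI style. As a minimal sketch (the function name parse_legacy_header and the dictionary keys are illustrative assumptions, not part of any NCBI tool), its fields could be separated like this:

```python
def parse_legacy_header(header_line):
    """Split a header of the form
    >gi|<gi number>|<db>|<accession.version>|<locus> <description>
    into its parts. The key names below are chosen for illustration only.
    """
    text = header_line.lstrip(">").strip()
    identifier, _, description = text.partition(" ")
    fields = identifier.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession|locus style header")
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


header = ">gi|202471|gb|M57671.1|OCOINS Octodon degus insulin mRNA, complete cds"
info = parse_legacy_header(header)
print(info["accession"], "-", info["description"])
```

The accession (here M57671.1) is the part to look for when checking whether the best hit in the search is the query sequence itself.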
We will now use a BLASTN search at NCBI to determine whether this sequence looks like the
human mRNA for insulin. There are two ways we can do this:
▪ Search the entire database and look for human hits in the results,
▪ specifically search the human part of the database.
Search against NR
1. Follow the "nucleotide blast" link from the main BLAST page.
2. In the section "Program Selection" select the option "Somewhat similar sequences
(blastn)"
3. Choose "Nucleotide collection (nr/nt)" as the search database. NR is the "Non
Redundant" database, which contains all non-redundant (non-identical) sequences from
GenBank and the full genome databases.
4. Click the BLAST button to launch the search.
After the search has completed, make yourself familiar with the BLAST output page. After a
header with some information about the search, there are three main parts:
▪ Graphic Summary
▪ Each hit is represented by a line showing which part of the query sequence the
alignment covers. The lines are coloured according to alignment score.
▪ Descriptions
▪ a table with a one-line description of each hit with some alignment statistics.
▪ Alignments
▪ the actual alignments between the query and the database hits.
Note that you can toggle between hiding and showing each part by clicking on the part
title (try it!).
First, take a look at the best hit. Since our search sequence (the query) was taken from
GenBank which is part of NR, we should find an identical sequence in the search. Make sure
this is the case!
QUESTION 1.1:
Answer the following questions about the best hit:
Make a new BLASTN search with the same query sequence, this time
with Database set to Human genomic + transcript (Human G+T).
>gi|202471|gb|M57671.1|OCOINS Octodon degus insulin mRNA, complete cds
GCATTCTGAGGCATTCTCTAACAGGTTCTCGACCCTCCGCCATGGCCCCGTGGATGCATCTCCTCACCGT
GCTGGCCCTGCTGGCCCTCTGGGGACCCAACTCTGTTCAGGCCTATTCCAGCCAGCACCTGTGCGGCTCC
AACCTAGTGGAGGCACTGTACATGACATGTGGACGGAGTGGCTTCTATAGACCCCACGACCGCCGAGAGC
TGGAGGACCTCCAGGTGGAGCAGGCAGAACTGGGTCTGGAGGCAGGCGGCCTGCAGCCTTCGGCCCTGGA
GATGATTCTGCAGAAGCGCGGCATTGTGGATCAGTGCTGTAATAACATTTGCACATTTAACCAGCTGCAG
AACTACTGCAATGTCCCTTAGACACCTGCCTTGGGCCTGGCCTGCTGCTCTGCCCTGGCAACCAATAAAC
CCCTTGAATGAG
Remember again to select Somewhat similar sequences (blastn) under Program Selection.
Note: even though you may not have found exactly the same database entry in the two
searches, the alignment should be the same. Make sure this is the case by comparing the
actual alignments in the two windows where you made the searches.
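When comparing the alignments from the two searches, it can help to reduce each one to a few numbers (alignment length, identities, gaps, percent identity). The sketch below does this for two aligned rows of equal length, with "-" marking a gap. The function name and the short example rows are hypothetical and are not taken from the actual BLAST output.

```python
def alignment_summary(query_row, subject_row):
    """Summarise one pairwise alignment given as two gapped strings."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    if not query_row:
        raise ValueError("empty alignment")
    pairs = list(zip(query_row.upper(), subject_row.upper()))
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    return {
        "length": len(pairs),
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / len(pairs),
    }


# Hypothetical aligned rows, only to show the calculation.
summary = alignment_summary("ACGT-A", "ACCTTA")
print(summary)
```

If the two searches report the same alignment, these summaries should agree, even when the database entries have different names.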
QUESTION 1.3:
Answer the same questions as before about the best hit you found in this
search.
QUESTION 1.5:
As discussed in the lecture, there will be a risk of getting false positive results (hits to
sequences that are not related to our input sequence) by purely stochastic means. In this first
part of the exercise, we will be investigating this further, by examining what happens when we
submit randomly generated sequences to BLAST searches.
Rather than giving out a set of pre-generated DNA/Peptide sequences where you only have
our word for their randomness, you'll be generating your own random sequences with the
SeqGen server. We previously used d4/d20 dice to generate these sequences manually, but we
have decided to let the computer do the work for you to save some time. It is important to
understand that these computer-generated sequences are totally random, just as if you were
rolling a die to determine each nucleotide/amino acid in each sequence.
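The dice analogy can be sketched in a few lines of Python. This is not the SeqGen server's code, only an illustration under the assumption that every residue is drawn independently and with equal probability (a d4 for nucleotides, a d20 for amino acids). The seed value is arbitrary.

```python
import random

DNA_ALPHABET = "acgt"                      # 4 outcomes, like a d4
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 outcomes, like a d20


def random_sequence(alphabet, length, rng):
    """Draw `length` residues independently and uniformly from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(2024)  # arbitrary seed, only to make the run repeatable
dna_sequences = [random_sequence(DNA_ALPHABET, 25, rng) for _ in range(3)]
peptide_sequences = [random_sequence(PROTEIN_ALPHABET, 25, rng) for _ in range(3)]

for number, sequence in enumerate(dna_sequences, start=1):
    print(number, sequence)
```

Because each position is independent of the others, any similarity such a sequence shows to a database entry arises by chance alone.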
2. cgagcacttctcggtgccgtactcg
3. cagtcggccgggtcggagggtcaga
▪ Follow the "nucleotide blast" link from the main BLAST page, and, as before,
▪ select the option "Somewhat similar sequences (blastn)" in the section "Program
Selection".
VERY IMPORTANT: In this special situation, where we BLAST short artificial
sequences, we need to turn off some of the automatic adjustments NCBI applies when short
sequences are detected. Otherwise, we will not be able to see the intended results:
▪ Extend the "Algorithm parameters" section (see the screen shot below) in
order to gain access to fine-tuning the options.
1. Deselect the "Automatically adjust parameters for short input
sequences" option.
2. Set the E-value cut-off ("Expect threshold") to 50
Remember to adjust the BLAST settings
▪ Paste in your three sequences in FASTA format and start the BLAST search.
Browsing BLAST results: select which of your query sequences to inspect in the
drop-down box near the top of the page
QUESTION 2.2:
Answer the following small questions, and document your findings by pasting in
examples of alignments / text snippets from the overview table:
▪ Do you find any sequences that look like your input sequences? (Paste in a
few example alignments in your report.)
▪ What is the typical length of the hits (the alignment length)? Approximately 22
▪ What is the typical % identity? 100%
▪ In what range are the bit scores ("max score")? 39.2
▪ Notice: This is conceptually the same as the "alignment score" we
have already met in the pairwise alignment exercise.
▪ What is the range of the E-values? 4.1
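The link between bit score and E-value can be sketched with the commonly quoted relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the (effective) lengths of the query and the database. The code below is a simplified illustration: it ignores the effective-length corrections BLAST applies, and the database length used is a hypothetical round number, not the real size of nr/nt.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'): expected number of chance hits scoring >= S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def implied_search_space(bit_score, e_value):
    """Rearranged form: the m * n product implied by a reported score and E-value."""
    return e_value * 2.0 ** bit_score


HYPOTHETICAL_DB_LENGTH = 10 ** 11  # assumed size, for illustration only
QUERY_LENGTH = 25                  # length of the random sequences used here

for score in (30.0, 39.2, 50.0):
    print(score, expected_chance_hits(score, QUERY_LENGTH, HYPOTHETICAL_DB_LENGTH))

# Search space implied by the values noted above (bit score 39.2, E-value 4.1).
print(implied_search_space(39.2, 4.1))
```

Each additional bit of score halves the E-value. A short query cannot reach a high bit score, which is why even a near-perfect 22-nucleotide match is expected several times by chance in a large database.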
QUESTION 2.3:
https://www.bioinformatics.org/sms2/random_protein.html
QUESTION 2.4:
Report the sequences in FASTA format.
1. HVDGVAIPPDSAYFEVSDFSNHQWA
2. FHFCPHARVMFDLDPESIWKNDYWT
3. CWPACNFPEEGDVGNWKPHFVEHDL
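A short sketch of how the three peptides above could be written in FASTA format is given below. The record names (random_peptide_1 and so on) are placeholders chosen for illustration; any short identifier after ">" would do.

```python
def to_fasta(named_sequences, line_width=70):
    """Format (name, sequence) pairs as FASTA text, wrapping long sequences."""
    lines = []
    for name, sequence in named_sequences:
        lines.append(">" + name)
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"


# Sequences from the list above; the names are hypothetical labels.
peptides = [
    ("random_peptide_1", "HVDGVAIPPDSAYFEVSDFSNHQWA"),
    ("random_peptide_2", "FHFCPHARVMFDLDPESIWKNDYWT"),
    ("random_peptide_3", "CWPACNFPEEGDVGNWKPHFVEHDL"),
]

fasta_text = to_fasta(peptides)
print(fasta_text, end="")
```

The resulting text can be pasted directly into the query box; BLAST treats each ">" line as the start of a new query.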
Locate the "Protein BLAST" page at NCBI and choose blastp as the
algorithm to use.
▪ What is the typical length of the alignment and do they contain gaps?
The typical length is 700; the gaps are between 10 and 25%.
▪ If we had used the default E-value cut-off of 10 would any hits have been
found? No significant similarity found.
QUESTION 2.6:
Notice that in both cases it's possible to transfer information, for example, information about
gene family/protein domains. We have already touched upon a comparison of (potentially)
evolutionarily related sequences in the pairwise alignment exercise. However, this time we do
not start out with two sequences we assume are related, but we rather start out with a single
sequence ("query sequence") which we will use to search the databases for homologs (we
often informally speak of "BLAST hits", when discussing the sequences found).
BLAST example 1
Let's start out with a sequence that will produce some good hits in the database. The
sequence below is a full-length transcript (mRNA) from a prokaryote. Let's find out what it
is.
>Unknown_transcript01
CCACTTGAAACCGTTTTAATCAAAAACGAAGTTGAGAAGATTCAGTCAACTTAACGTTAATATTTGTTTC
CCAATAGGCAAATCTTTCTAACTTTGATACGTTTAAACTACCAGCTTGGACAAGTTGGTATAAAAATGAG
GAGGGAACCGAATGAAGAAACCGTTGGGGAAAATTGTCGCAAGCACCGCACTACTCATTTCTGTTGCTTT
TAGTTCATCGATCGCATCGGCTGCTGAAGAAGCAAAAGAAAAATATTTAATTGGCTTTAATGAGCAGGAA
GCTGTTAGTGAGTTTGTAGAACAAGTAGAGGCAAATGACGAGGTCGCCATTCTCTCTGAGGAAGAGGAAG
TCGAAATTGAATTGCTTCATGAATTTGAAACGATTCCTGTTTTATCCGTTGAGTTAAGCCCAGAAGATGT
GGACGCGCTTGAACTCGATCCAGCGATTTCTTATATTGAAGAGGATGCAGAAGTAACGACAATGGCGCAA
TCAGTGCCATGGGGAATTAGCCGTGTGCAAGCCCCAGCTGCCCATAACCGTGGATTGACAGGTTCTGGTG
TAAAAGTTGCTGTCCTCGATACAGGTATTTCCACTCATCCAGACTTAAATATTCGTGGTGGCGCTAGCTT
TGTACCAGGGGAACCATCCACTCAAGATGGGAATGGGCATGGCACGCATGTGGCCGGGACGATTGCTGCT
TTAAACAATTCGATTGGCGTTCTTGGCGTAGCGCCGAGCGCGGAACTATACGCTGTTAAAGTATTAGGGG
CGAGCGGTTCAGGTTCGGTCAGCTCGATTGCCCAAGGATTGGAATGGGCAGGGAACAATGGCATGCACGT
TGCTAATTTGAGTTTAGGAAGCCCTTCGCCAAGTGCCACACTTGAGCAAGCTGTTAATAGCGCGACTTCT
AGAGGGGTTCTTGTTGTAGCGGCATCTGGGAATTCAGGTGCAGGCTCAATCAGCTATCCGGCCCGTTATG
CGAACGCAATGGCAGTCGGAGCGACTGACCAAAACAACAACCGCGCCAGCTTTTCACAGTATGGCGCAGG
GCTTGACATTGTCGCACCAGGTGTAAACGTGCAGAGCACATACCCAGGTTCAACGTATGCCAGCTTAAAC
GGTACATCGATGGCTACTCCTCATGTTGCAGGTGCAGCAGCCCTTGTTAAACAAAAGAACCCATCTTGGT
CCAATGTACAAATCCGCAATCATCTAAAGAATACGGCAACGAGCTTAGGAAGCACGAACTTGTATGGAAG
CGGACTTGTCAATGCAGAAGCGGCAACACGCTAATCAATAATAATAGGAGCTGTCCCAAAAGGTCATAGA
TAAATGACCTTTTGGGGTGGCTTTTTTACATTTGGATAAAAAAGCACAAAAAAATCGCCTCATCGTTTAA
AATGAAGGTACC
BLASTN search
Perform a BLAST search in the NR/NT database (BLASTN) using default settings. Remember
to set Expect threshold back to the default value, 10.
QUESTION 3.1:
(Once again remember to document your findings)
BLASTP search
Now let's try to do the same at the protein level.
▪ Find the best ORF using VirtualRibosome (hint: remember to search all
positive reading frames) and save or copy the sequence in FASTA
format.
https://services.healthtech.dtu.dk/service.php?VirtualRibosome-2.0
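To make the idea of "searching all positive reading frames" concrete, here is a minimal Python sketch. It is not VirtualRibosome's implementation. It assumes the standard genetic code, treats only ATG (M) as a start, and also accepts an ORF that runs off the end of the sequence without a stop codon. The function names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def longest_orf(dna):
    """Return (frame, protein) for the longest M-initiated stretch in frames 1-3."""
    best_frame, best_protein = None, ""
    for offset in range(3):
        for segment in translate(dna[offset:]).split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best_protein):
                best_frame, best_protein = offset + 1, segment[start:]
    return best_frame, best_protein


# First 60 nucleotides of Unknown_transcript01, translated in frame 1.
print(translate("CCACTTGAAACCGTTTTAATCAAAAACGAAGTTGAGAAGATTCAGTCAACTTAACGTTAA"))
# Small made-up example where the only ORF is in frame 3.
print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A translation that contains many "*" characters is a whole-frame translation rather than a single ORF; the protein to use for the BLASTP search is the longest uninterrupted stretch starting with M.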
QUESTION 3.2:
(Document!)
PLETVLIKNEVEKIQST*R*YLFPNRQIFLTLIRLNYQLGQVGIKMRREPNEETVGENCRKHRTT
HFCCF*FIDRIGC*RSKRKIFNWL**AGSC**VCRTSRGK*RGRHSL*GRGSRN*IAS*I*NDSCFIR*VK
PRRCGRA*TRSSDFLY*RGCRSNDNGAISAMGN*PCASPSCP*PWIDRFWCKSCCPRYRYFHSSR
LKYSWWR*LCTRGTIHSRWEWAWHACGRDDCCFKQFDWRSWRSAERGTIRC*SIRGERFRFG
QLDCPRIGMGREQWHARC*FEFRKPFAKCHT*ASC**RDF*RGSCCSGIWEFRCRLNQLSGPLC
ERNGSRSD*PKQQPRQLFTVWRRA*HCRTRCKRAEHIPRFNVCQLKRYIDGYSSCCRCSSPC*T
KEPILVQCTNPQSSKEYGNELRKHELVWKRTCQCRSGNTLINNNRSCPKRS*INDLLGWLFYI
WIKKHKKIASSFKMKV
▪ Do we find any conserved protein domains? (Hint: they are indicated at the very top of
the result page, also while the search is running; alternatively, try
https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
▪ Identifying known protein domains can provide important clues to the function
of an unknown protein.
▪ Do we find any significant hits? (E-value?) E-values range from 0 to 0.003
▪ Are all the best hits the same category of enzymes?
▪ From what you have seen, which is better for identifying intermediate-quality hits -
DNA or Protein BLAST? Protein BLAST is the better choice for identifying
such hits
BLAST example 2
In the previous section, we have been cheating a bit by using a sequence that was already in
the database - let's move on to the following sequence instead.
INSTRUCTIONS: You are free to write the combined answer to this question in a free
style essay-like fashion - just be sure to include the subquestions in your answers. In
an exam situation, you will need to put all the clues together yourself, reason about the
tools/databases to use, and document your findings.
Subquestion:
Cover the following in your answer:
Typically, this will be useful if you have a gene of known function from one organism (say a
cell-cycle controlling gene from Yeast, Saccharomyces cerevisiae) and want to find the human
homolog/ortholog to this gene (genes that control cell division are often involved in cancer).
When you have been performing the BLAST searches, you have probably already noticed
that it is possible to search specifically in the Human and Mouse genomes (these
databases only contain sequences from Human/Mouse). It's also possible to restrict the
output from searches in the large databases (e.g. NR) to specific organisms.
A growing number of organisms have been fully sequenced, and the research teams
responsible for a large-scale genome project typically put up their own Web resources for
accessing the data. For example the Yeast genome is principally hosted in the Saccharomyces
Genome Database (SGD - www.yeastgenome.org) - it should be noted that SGD also offers
BLAST as a means to search the database.
Let's do a small study of the relationship between the histones found in Yeast and in
humans (evolutionary distance: ~1-1.5 billion years).
Look up the HTA2 gene in SGD (http://www.yeastgenome.org - use the search box at the top
of the page). Notice that a brief description of the function of the gene and its protein product is
displayed (a huge amount of additional information can be found further down the page - much
of it Yeast specific).
QUESTION 4.1:
What information is given about the relationship between this gene and the gene
"HTA1"?
Browse the page and locate the link to the protein sequence. Save the sequence as a
file, we'll need it in a moment.
NCBI
Now return to the NCBI blastp page.
Set Database to "Reference proteins (refseq_protein)", and
enter Saccharomyces cerevisiae in the Organism field (and accept the
suggestion with taxid:4932).
QUESTION 4.2:
(Remember to document your answers)
Tip: click on the Gene links under Related Information (to the right of the alignments) to
see the gene names for the protein hits.
The next step is to search the translated version of the human genome.
QUESTION 4.3:
▪ How many high-confidence hits (with E-value better than 10^-10) are found?
(Approximately)
▪ What are all the high-confidence hits called?
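Counting hits below an E-value threshold is easy to automate once the results are in tabular form. The sketch below assumes a tab-separated hit table with the twelve standard columns of BLAST+ tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The three rows are invented for illustration and are not real search results.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def high_confidence_hits(tabular_text, max_evalue=1e-10):
    """Return (subject id, E-value) for rows with an E-value below the cut-off."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, row))
        e_value = float(record["evalue"])
        if e_value < max_evalue:
            hits.append((record["sseqid"], e_value))
    return hits


# Hypothetical rows, built with explicit tabs so the layout is unambiguous.
example_rows = [
    ["HTA2_query", "subject_A", "92.0", "120", "9", "0", "1", "120", "100", "459", "3e-60", "190"],
    ["HTA2_query", "subject_B", "60.0", "110", "44", "0", "5", "114", "10", "339", "2e-25", "95.1"],
    ["HTA2_query", "subject_C", "35.0", "40", "26", "0", "30", "69", "1", "120", "0.8", "28.5"],
]
example_table = "\n".join("\t".join(row) for row in example_rows) + "\n"

confident = high_confidence_hits(example_table)
print(len(confident), "hits with E-value below 1e-10:", confident)
```

The cut-off is passed as a parameter so the same function can be reused with a stricter or looser threshold.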
Concluding remarks
Today we have been using BLAST to find a number of homologous genes (and protein
products). If we want to go even deeper into the analysis of the homologs, the next logical step
would be to build a dataset of the full-length versions of the sequences we have found (not just
the part found by the local alignment in BLAST).
A further analysis could consist of a series of pairwise alignments (for finding out what is
similar/different between pairs of sequences) or a multiple alignment which could form the
basis of establishing the evolutionary relationship between the entire set of sequences.
BLAST can also be used as a way to build a dataset of sequences based on a known
"seed" sequence. As we saw in the GenBank exercise, free-text searching in GenBank
can be difficult. If, for instance, we wanted to build a dataset of variants of the insulin
gene, an easy way to get around this would be to BLAST the normal version of the insulin
gene against the sequence database of choice and pick the best-matching hits from the results.