
Chapter 8 BLASTX

Chapter 8 BLASTX Analysis


Search the NCBI database to determine if your gene codes for a protein that is found in other organisms.
a. Background
In the previous step you compared your DNA sequence with other DNA sequences in the NCBI database. In this step you will use the predicted protein sequence derived from your Artemia cDNA to search the protein databases for similar proteins.

i) Why should you look for protein sequence similarity if you already have a very good nucleotide sequence match?

As biologists, we want to know as much as we can about the function of the gene we have isolated. Because you are searching with a gene from Artemia, an organism whose genome sequence has not been completed, the odds are very poor that you will find an exact sequence match. There may be little or nothing known about the function of this gene. On the other hand, the protein sequence coded by your gene may be similar to proteins in other organisms that have been well characterized. Information about these other proteins can tell you a great deal about your gene, from the predicted phenotype of a mutant to the enzymatic activity or the three-dimensional structure of the protein.

ii) If there are similar proteins in other organisms, why would the genes that code for them not be identified in the BLASTN searches?

The BLASTN searches will find genes that are very similar to your DNA. These will likely code for homologous proteins. However, due to the degeneracy of the genetic code, it is very possible for two proteins with very similar amino acid sequences to be coded by very different DNA sequences. This is because most amino acids are coded by multiple codons. For example, there are six different codons each for Arginine, Leucine and Serine. Valine, Alanine, Threonine, Glycine and Proline are coded by four different codons each. Although the base pair sequence of a gene may change as two species diverge, if there are strong requirements for a particular amino acid at each position in the protein, then the protein sequence may stay the same.
An extreme example of this is shown in Fig 8-1.

             R   S   L   L   S   R   S   R
C. remanei  AGG TCG TTA CTA TCG AGG AGT AGA
             |       |   |       |       |
C. elegans  CGT AGC CTT TTG AGT CGA TCG CGG
             R   S   L   L   S   R   S   R

Fig 8-1 An example of a poor DNA match that codes for identical proteins.
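If you would like to check Fig 8-1 yourself, the short Python sketch below translates both sequences with the standard genetic code, counts the matching bases, and counts how many codons specify a few of the amino acids mentioned above. It is only an illustration and is not part of the NCBI tools; the names `CODON_TABLE`, `translate` and `percent_identity` are our own.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper().replace(" ", "")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(seq_a, seq_b):
    """Percent of positions that match in two equal-length sequences."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# The two sequences from Fig 8-1.
c_remanei = "AGG TCG TTA CTA TCG AGG AGT AGA".replace(" ", "")
c_elegans = "CGT AGC CTT TTG AGT CGA TCG CGG".replace(" ", "")

# Number of codons that specify each amino acid (the degeneracy of the code).
codons_per_amino_acid = Counter(CODON_TABLE.values())

print(translate(c_remanei), translate(c_elegans))
print(round(percent_identity(c_remanei, c_elegans), 1), "% DNA identity")
print({aa: codons_per_amino_acid[aa] for aa in "RLSV"})
```

Both translations come out as RSLLSRSR even though only 5 of the 24 bases match.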


Although the DNA sequences in Fig 8-1 show only about 20% identity (5 of 24 bases), the amino acid sequences are 100% identical. A DNA match this poor is usually not considered significant and would not be picked up in the BLASTN search. However, this match would be significant in an alignment of protein sequences. We therefore need to perform a search to determine whether there are protein sequences similar to the protein sequences coded for by your DNA.

iii) Which protein sequence should you use?

To perform the search we need to determine the protein sequence that is coded for by your DNA. However, remember that there are six possible reading frames for your DNA: the three reading frames on the top strand in the 5' to 3' direction as your DNA is written (the +1, +2, and +3 reading frames in green) and the three reading frames coded by the bottom strand in the other direction (-1, -2, and -3) (Fig 8-2).

It may be possible to determine which direction (+ or -) is correct. For example, if there is a poly-A at the end of the SP sequence then the correct open reading frame must be +1, +2 or +3. The sequence would be the non-template strand of DNA (Fig 8-3). In contrast, if the sequence begins with a poly-T (corresponding to a poly-A sequence on the blue DNA strand) then the correct open reading frame must be -1, -2 or -3 (Fig 8-4). The red sequence would be the template strand of the gene.

Fig 8-2. Possible reading frames of your cDNA

Fig 8-3. Possible direction of the ORF if the DNA sequence ends with a run of poly-As.

Fig 8-4. Possible direction of the ORF if the DNA sequence begins with a run of poly-Ts.
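The poly-A / poly-T rule above can be written as a small check. The Python sketch below is only an illustration: the function name `guess_strand` and the minimum run length of 10 bases are our own assumptions, and it only looks at the very first and very last bases, so any vector or primer sequence would need to be trimmed off first.

```python
import re


def guess_strand(seq, min_run=10):
    """Guess the ORF direction from a poly-A tail or a leading poly-T.

    Returns "+" if the sequence ends in a run of A's, "-" if it begins
    with a run of T's, and None if neither is present.
    min_run is an arbitrary illustrative threshold, not a standard value.
    """
    seq = seq.upper()
    if re.search(r"A{%d,}$" % min_run, seq):
        return "+"   # read as written: frames +1, +2 or +3
    if re.search(r"^T{%d,}" % min_run, seq):
        return "-"   # reverse complement: frames -1, -2 or -3
    return None      # no clue: all six frames remain possible


print(guess_strand("ATGGCCGTA" + "A" * 15))   # ends with poly-A
print(guess_strand("T" * 15 + "TACGGCCAT"))   # begins with poly-T
print(guess_strand("ATGGCCGTACGT"))           # neither
```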

If there is no poly-A then the open reading frame could be in either direction. You would then have to look for a long open reading frame among all six possible reading frames. However, if your clone does not contain the complete coding region of a gene and is mainly 3' UTR, then it may contain only a short piece of the real open reading frame, which would not be identified in the ORF analysis. Because of these potential problems, we suggest that you perform a search with the protein sequences derived from all six reading frames of your cDNA. Although this sounds like a lot of work, it is actually very simple using the blastx program, which takes a DNA sequence, translates it in all six reading frames, and then uses these translations to search the protein databases.
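To make the idea of six reading frames concrete, here is a minimal Python sketch of a six-frame translation. It illustrates the concept only and is not the code that blastx itself runs; the function names are our own, the standard genetic code is assumed, and any codon containing an unknown base is written as X.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the bottom strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become X."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame (+1..+3, -1..-3) to its translation."""
    top = dna.upper()
    bottom = reverse_complement(top)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(top[offset:])
        frames[-(offset + 1)] = translate(bottom[offset:])
    return frames


# Example: the C. remanei sequence from Fig 8-1.
example = "AGGTCGTTACTATCGAGGAGTAGA"
for frame, protein in sorted(six_frame_translations(example).items()):
    print(f"{frame:+d}  {protein}")
```

A stop codon appears as * in the output, so the frame holding the real open reading frame is usually the one with the longest stretch without a *.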


b. Performing a BLASTX search at NCBI with your DNA sequence.


1. Select the BLASTX option at the NCBI BLAST server site. Use your browser to connect to the NCBI BLAST home page http://www.ncbi.nlm.nih.gov/BLAST/. This is the same site you used to do the BLASTN search in the previous step. However, instead of doing a Nucleotide search (blastn), select the blastx option, "Search protein databases using a translated nucleotide query". The interface of this search site is very similar to the BLASTN search dialog box (Fig 8-5). (See Chapter 7 for a more in-depth description of the steps and options.)

Fig 8-5 BLASTX Search page

2. Enter the DNA sequence into the Search dialog box and select the BLAST! button. Use the default settings to search the Non-redundant protein sequences (nr) database.

3. Select the search and format options that you want for your data output. For some proteins you may get hundreds of hits. Therefore, we suggest you limit the number on the first search. Recheck that all the information is correct and submit the request.
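The same choices you make on the web form (program, database, query, and a limit on the number of hits) can also be written down as request parameters. The Python sketch below only builds the parameter string and does not contact NCBI. The parameter names (CMD, PROGRAM, DATABASE, QUERY, HITLIST_SIZE) follow NCBI's BLAST URL API as we understand it, so check the current NCBI documentation before relying on them; the function name and the default limit of 50 hits are our own choices for illustration.

```python
from urllib.parse import urlencode

# Address of the NCBI BLAST service (assumed; verify against NCBI's docs).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastx_request(dna, hitlist_size=50):
    """Return the URL-encoded form body for a blastx search against nr.

    Nothing is sent; this only shows which settings the search needs.
    """
    params = {
        "CMD": "Put",                   # submit a new search
        "PROGRAM": "blastx",            # translated nucleotide vs protein
        "DATABASE": "nr",               # non-redundant protein sequences
        "QUERY": dna,                   # your DNA sequence
        "HITLIST_SIZE": hitlist_size,   # limit the number of hits returned
    }
    return urlencode(params)


# Stand-in query: the short C. remanei sequence from Fig 8-1.
body = build_blastx_request("AGGTCGTTACTATCGAGGAGTAGA", hitlist_size=25)
print(BLAST_URL)
print(body)
```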

c. Analysis of a BLASTX search at NCBI with your DNA sequence.


The blastx report is very similar to the blastn report. The first part shows a Graphic View of the matches, followed by a List of the matches and then the Individual Alignments. With the same 429 bp DNA fragment that was used in the blastn EX1 example (Fig 7-8), there are a number of very good hits with scores over 80 (Fig 8-6). More interestingly, although the BLASTN search with the EX2 sequence showed no significant matches (Fig 7-9), the BLASTX search with the same sequence shows a large number of very good matches (Fig 8-7).

Fig 8-6. Graphic view of BLASTX with EX1

Fig 8-7 Graphic view of BLASTX with EX2

Why are there so many good matches with the EX2 blastx search when there were only a few with the blastn search? The answer is the degeneracy of the genetic code. Although the base pair sequence of a gene may change as two species diverge, if there are strong requirements for a particular amino acid at each position in the protein, then the protein sequence will stay the same. The fact that there were so many strong matches indicates that the protein sequence is conserved between different species even though the DNA sequence is not.

Scrolling further down the EX1 BLASTX report shows the list of matches (Fig 8-8). The first three matches are to protein sequences from Sphaerius sp., Dascillus cervinus, and Carabus granulatus, which are all beetles. The sequences are from the same study that determined the Sphaerius sp. DNA sequence in the BLASTN search. The fourth sequence is from a different study, but is from a gene from Tribolium castaneum, commonly referred to as the red flour beetle. This information indicates that the Artemia gene is most closely related to genes from beetles. Scanning further down the list shows matches to other insects such as Papilio dardanus (African swallowtail butterfly), Drosophila melanogaster (fruit fly), and Bombyx mori (domestic silkworm). Further down the list are matches to Xenopus laevis (African clawed frog), Mus musculus (house mouse), Pongo pygmaeus (orangutan), and Equus caballus (horse). All of these matches show relatively low E-values (less than 3e-13), suggesting that they are significant. This result indicates that the protein coded by the gene we isolated is strongly conserved in a wide range of organisms.

Fig 8-8. List of BLASTX EX1 matches

Examining the alignment of the predicted Artemia sequence with the Sphaerius protein sequence shows that they are 57% identical and 77% similar (identical or conserved amino acids) in sequence (Fig 8-9). Similarity indicates positions where the two amino acids have similar chemical properties, such as Glutamate (E) vs Aspartate (D), Alanine (A) vs Valine (V), or Isoleucine (I) vs Leucine (L).
>gi|69608657|emb|CAJ01895.1| ubiquitin/ribosomal protein S30e fusion protein [Sphaerius sp. APV-2005]
Length=131

 Score = 103 bits (258), Expect = 2e-21
 Identities = 68/119 (57%), Positives = 92/119 (77%), Gaps = 3/119 (2%)
 Frame = +3

Query  63   IMQIHLRGSDSSTQVINCDEGDCVIALKEQVAALEGVKVSEVRLFANGTPLTEDIPLNGI  242
            ++Q+H+RG   S  V++C+  + +  +K+++AALE VK  ++ L+A GTP+ +D  ++
Sbjct  1    MIQLHIRGQ--SQHVLDCNGDEKIGQIKDRIAALENVKAKDICLYAEGTPVEDDSVVSAF  58

Query  243  QDT-IDFSVPLLGGKVHGSLARAGKVKGQTPkvdkqekkkkktgrckRRIQYNRRFVNV  416
                +D ++PLLGGKVHGSLARAGKVK QTPKV+KQEKKKKKTGR KRRIQYNRRFVNV
Sbjct  59   ASVDLDLNIPLLGGKVHGSLARAGKVKQQTPKVEKQEKKKKKTGRAKRRIQYNRRFVNV  117

Fig 8-9. Alignment of the best BLASTX match to EX1. + indicates similar amino acids at that position in the two sequences. The gray, lowercase sequence kvdkqekkkkktgrck in the Query indicates that this region was removed from the analysis because of low sequence complexity.

Repeating the search without the low complexity filter gives similar matches. However, the E-value for this match is 3e-31 instead of 2e-21.

Notice that there is a one-residue gap in the alignment of the predicted Artemia sequence. Unlike in DNA alignments, these gaps do not strongly decrease the significance of the alignments. Many times these gaps are in flexible loops on the surface of the protein. The addition or loss of these loops may not significantly alter the overall fold or function of the protein.

To find out more about the protein, follow the Accession link, which brings you to the report page for the sequence file. This lists a number of papers that have investigated the function of this protein. We will discuss this and other information on the protein in Chapter 10.
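The Identities, Positives and Gaps values in the header of Fig 8-9 can be recounted directly from the three rows of the alignment. The Python sketch below does this for the EX1 / Sphaerius alignment, with the two blocks of each row joined end to end. It is only an illustration with names of our own: identical positions are counted by comparing the Query and Sbjct rows, similar positions by counting the + signs in the middle row, and percentages are truncated to whole numbers, which reproduces the figures printed in this report.

```python
# Rows copied from Fig 8-9, the two alignment blocks joined end to end.
QUERY = ("IMQIHLRGSDSSTQVINCDEGDCVIALKEQVAALEGVKVSEVRLFANGTPLTEDIPLNGI"
         "QDT-IDFSVPLLGGKVHGSLARAGKVKGQTPkvdkqekkkkktgrckRRIQYNRRFVNV")
MIDDLE = ("++Q+H+RG   S  V++C+  + +  +K+++AALE VK  ++ L+A GTP+ +D  ++  "
          "    +D ++PLLGGKVHGSLARAGKVK QTPKV+KQEKKKKKTGR KRRIQYNRRFVNV")
SBJCT = ("MIQLHIRGQ--SQHVLDCNGDEKIGQIKDRIAALENVKAKDICLYAEGTPVEDDSVVSAF"
         "ASVDLDLNIPLLGGKVHGSLARAGKVKQQTPKVEKQEKKKKKTGRAKRRIQYNRRFVNV")


def alignment_stats(query, middle, sbjct):
    """Count alignment length, identities, positives and gaps."""
    length = len(query)
    # Identical residues; lowercase (filtered) letters are compared as uppercase.
    identities = sum(q != "-" and q.upper() == s.upper()
                     for q, s in zip(query, sbjct))
    # Positives = identical positions plus the similar ones marked with +.
    positives = identities + middle.count("+")
    gaps = query.count("-") + sbjct.count("-")
    return {"length": length, "identities": identities,
            "positives": positives, "gaps": gaps}


stats = alignment_stats(QUERY, MIDDLE, SBJCT)
for name in ("identities", "positives", "gaps"):
    percent = 100 * stats[name] // stats["length"]
    print(f"{name} = {stats[name]}/{stats['length']} ({percent}%)")
```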

d. Fill out the DSAP


Use the information from the BLASTX search to fill out the BLASTX table in the DSAP as shown in Fig 8-10. Since we want to know how conserved your protein is across different species, please list the top five different organisms that your sequence matches. In box 5.b) enter some specific comments to help you remember anything particular about the search and results.

Fig 8-10. BLASTX page of the DSAP form for EX1.
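If your list of matches is long, picking out the top five different organisms can be done mechanically, because each description line ends with the organism name in square brackets. The Python sketch below shows one way to do it. The function name is our own, and apart from the first line (taken from Fig 8-9) the description lines are made-up placeholders that only reuse organism names mentioned in this chapter; they are not real BLASTX output.

```python
import re


def top_organisms(descriptions, limit=5):
    """Return the first `limit` different organisms, in hit order."""
    organisms = []
    for line in descriptions:
        found = re.search(r"\[([^\[\]]+)\]\s*$", line)
        if not found:
            continue  # no organism in brackets at the end of the line
        name = found.group(1)
        if name not in organisms:
            organisms.append(name)
        if len(organisms) == limit:
            break
    return organisms


hits = [
    # Real description line from Fig 8-9:
    "ubiquitin/ribosomal protein S30e fusion protein [Sphaerius sp. APV-2005]",
    # Placeholder lines, for illustration only:
    "placeholder description [Dascillus cervinus]",
    "placeholder description [Carabus granulatus]",
    "placeholder description [Tribolium castaneum]",
    "second placeholder hit [Tribolium castaneum]",
    "placeholder description [Papilio dardanus]",
    "placeholder description [Drosophila melanogaster]",
]

for rank, organism in enumerate(top_organisms(hits), start=1):
    print(rank, organism)
```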
