
CBE647 Bioinformatics

Laboratory 1

The goal of this lab is to give you practical experience using the NCBI interface: how to navigate
the website and how to perform basic and advanced searches. You will need to draw on what you
learned in lectures, as well as the online help available to you through the website itself.
Becoming experienced in using these sites will help you in the future as you try to identify novel
genes or perform your own research.

A. Finding Public Biological Databases

The 2010 Database Issue of Nucleic Acids Research is the sixteenth in a series dedicated
to factual biological databases. Such databases are an essential resource for working
biologists and this compilation provides descriptions of the most important of these
databases and serves to introduce newly compiled databases that provide specialist
information in the biological area. NAR Online contains hotlinks to all of the databases in the
compilation as well as brief summaries of their content.

Go to the NAR (http://www.oxfordjournals.org/nar/database/a/)

Visit databases of your own selection to see how these databases are accessed and what
information is available. Include GenBank (Nucleotide Sequence Databases), KEGG
(Metabolic Pathways), UniProtKB (Proteins), OMIM and GO (Gene Ontology) in your visit.

B. NCBI Entrez and Searching Biological Databases

Problem: Triose Phosphate Isomerase

We are going to investigate the human triose phosphate (or triosephosphate) isomerase 1
gene. This gene is responsible for the reaction that converts dihydroxyacetone phosphate to
glyceraldehyde-3-phosphate in glycolysis. Glycolysis is the pathway in cells where a simple
sugar (glucose) is transformed into two pyruvate molecules, which are then used to
generate energy for the cell. Much is known about this gene, and when it is deficient, severe
problems can occur. A loss of function mutation in this gene would be lethal.

First, visit the NCBI website (http://ncbi.nlm.nih.gov/) and visit the All Databases page.

1. What would be a good search query to use for this gene that would specify both the
name of the gene as well as the organism?

Use this search query in the Gene, Protein, and Nucleotide sections of Entrez. You should
observe different results in each, although much of the information will overlap.
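
The same kind of query can also be sent to Entrez programmatically. The sketch below only builds
the request URLs for the NCBI E-utilities esearch endpoint and does not contact the server. The
helper names build_entrez_term and esearch_url are hypothetical, the [All Fields] and [Organism]
field tags are one possible choice among several, and the gene and organism shown are an unrelated
example so that question 1 is left for you to answer.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_term(gene_text, organism):
    """Combine a gene phrase and an organism name into one Entrez search term.

    Hypothetical helper: the field tags used here are an assumption about
    what makes a reasonable query, not the only valid answer.
    """
    return f"{gene_text}[All Fields] AND {organism}[Organism]"


def esearch_url(db, term, retmax=5):
    """Return the esearch URL for one Entrez database (no network access)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


# Unrelated example gene and organism, used for illustration only.
example_term = build_entrez_term("insulin", "Mus musculus")
for database in ("gene", "protein", "nucleotide"):
    print(esearch_url(database, example_term))
```

Opening one of the printed URLs in a browser returns the matching record IDs as XML, which mirrors
what the web search boxes do for each database.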

2. What are the RefSeq accession numbers for the mRNA form and for the protein form of
this gene?

3. On what chromosome is this gene found?

4. How many amino acids are in the protein chain? What are the first five?

One of the useful abilities of Entrez is to cross-reference recent publications that relate to
this gene. A recently published paper implicates the triose phosphate isomerase protein in the
disease lupus.

5. Who were the three authors of this paper? What is this paper's unique PubMed ID?

C. Determination of the Open Reading Frame (ORF) of the Hemoglobin Alpha 2 (HBA2) Gene

In this exercise you will learn how to determine an open reading frame (ORF) and the gene
product it encodes. A reading frame is a way of dividing the sequence of nucleotides in a
nucleic acid (DNA or RNA) molecule into a set of consecutive, non-overlapping triplets. Where
these triplets equate to amino acids or stop signals during translation, they are called codons.
A single strand of a nucleic acid molecule has a phosphoryl end (called the 5'-end) and a
hydroxyl end (called the 3'-end). These ends define the 5'→3' direction. An open reading frame
(ORF) is the part of a reading frame that contains no stop codons.
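
To make the definition concrete, here is a minimal sketch that splits one strand into codons for
each of its three forward reading frames and checks whether a frame is free of stop codons. The
function names codons and is_open and the short test sequence are made up for illustration.

```python
# The three stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq, frame=0):
    """Split seq into consecutive, non-overlapping triplets.

    frame is the offset (0, 1 or 2) of the first triplet; leftover bases
    at the end that do not fill a whole triplet are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def is_open(codon_list):
    """Return True when the list of codons contains no stop codon."""
    return not any(codon in STOP_CODONS for codon in codon_list)


toy = "ATGGCCTAAG"  # made-up sequence, not a real gene
for frame in range(3):
    triplets = codons(toy, frame)
    print(frame + 1, triplets, "open" if is_open(triplets) else "contains a stop")
```

The same ten bases give three different sets of triplets, and only some of them are interrupted by
a stop codon. This is why the choice of frame matters when you look for the coding sequence.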

1. Retrieve the alpha 2 globin mRNA sequence (NM_000517) from the GenBank database.
Can you manually identify the Open Reading Frame (ORF), i.e., the coding sequence
(e.g., in Notepad or WordPad)? Proceed by determining the start and stop codons (use a
genetic code table). Note that the sequence contains triplets of nucleotides that are
identical to the start/stop codons but which are not the true start and stop codons. Why is
that?

2. Once you have determined the ORF of the HBA2 gene, translate the first 10 codons to
the amino acid sequence (use a genetic code table).

3. Are the ORF and the amino acid sequence confirmed by the NM_000517 annotation in
the GenBank database?

4. For the automatic determination of putative ORFs you can also use the ORF finder at
the NCBI site. Go to the ORF finder and copy/paste the NM_000517 sequence or just
type in the accession code (the program is linked to the GenBank database). The results
are the ORFs for all six reading frames. The longest ORF is most probably the frame
that will be translated to the protein. By clicking on the largest ORF, the corresponding
translation is given. Is this correct?
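
The six-frame search described in question 4 can be sketched in a few lines. This is a simplified
illustration that assumes an ORF runs from an ATG to the next in-frame stop codon. It is not the
algorithm used by the NCBI ORF finder, which offers other start-codon and length options. The
function names and the toy sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ATG...stop stretch in one reading frame.

    end is exclusive and the stretch includes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield start, i + 3
            start = None


def longest_orf(seq):
    """Return (strand, frame number 1-3, ORF sequence) or None if no ORF."""
    seq = seq.upper()
    best = None
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame):
                if best is None or end - start > len(best[2]):
                    best = (strand, frame + 1, strand_seq[start:end])
    return best


toy = "GGATGGCTGCATAAC"  # made-up sequence, not NM_000517
print(longest_orf(toy))
```

Running the same function on the cleaned NM_000517 sequence gives one way to cross-check the ORF
you identified by hand, keeping in mind that the longest ORF is only the most probable candidate.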

D. Sequence Extraction

This part of the lab will guide you through the process of getting DNA sequences using
the NCBI GenBank database as a source.

STEP 1
Go to the NCBI website

STEP 2
Choose your search type (Nucleotide) and enter your search item in the box. Some
examples of search items are:
- ara h2
- opsins

You will get a lot of results but for the purpose of this lab, find the following links and
click on them:
- Ara h2 ==> AY158467
- Opsins ==> NM_020061

STEP 3
The new page contains the DNA coding sequence for the proteins at the bottom, below
ORIGIN. Click and drag the cursor to highlight the entire sequence, right click the
highlighted sequence and select copy to store it.
- Ara h2 ==> from atggc to tactaa
- Opsins ==> from cggctgccgt to ccaa

*** Also paste this sequence into a .txt file and save it. Remember to delete the numbers at
the beginning of each row.***
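
Deleting the row numbers by hand is error-prone for long records. A minimal sketch of the cleanup
is shown below, assuming the text was copied from the ORIGIN section of a GenBank record. The
function name clean_origin is hypothetical and the sample text is made up, not taken from AY158467
or NM_020061.

```python
import re


def clean_origin(text):
    """Strip position numbers, spaces and record markers from ORIGIN text.

    Lines starting with ORIGIN and the closing '//' line are skipped;
    everything that is not a letter is removed from the remaining lines.
    """
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if line.upper().startswith("ORIGIN") or line == "//":
            continue
        pieces.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(pieces).lower()


# Made-up sample laid out like a GenBank ORIGIN block.
sample = """ORIGIN
        1 atggctaagc tcaccatact
       21 agtagccctc tactaa
//"""

print(clean_origin(sample))
```

The cleaned string can then be saved to the .txt file or pasted directly into a translation tool.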

STEP 4
Open the ExPASy page to view the translation tool. This tool will do in seconds what would
take you hours to do by hand. It reads the codons in the sequence and translates them into
amino acids.

STEP 5
Right click the cursor in the box below "Please enter DNA" and select paste to enter your
gene sequence. To the right of "Output format", select "Includes nucleotide sequence" from
the drop-down menu and click "Translate Sequence".
Your results under 5'3' Frame 1 should show the amino acid (protein) sequence of the
gene in capital letters below the corresponding codons of the gene.
Notice that:
- Ara h2 ==> the gene starts with atg and the corresponding amino acid is M for
methionine
- Opsins ==> the gene starts with cgg and the corresponding amino acid is R for arginine.
The other frames translate the same sequence from a different starting position (5'3'
Frames 2 and 3) or from the opposite strand (the 3'5' frames).
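
What the translation tool does for one frame can be sketched as a lookup in the standard genetic
code. The table below is built from the usual 64-letter amino acid string for the standard code,
with codons ordered by the bases T, C, A, G. The function name translate is hypothetical, and the
example inputs use only the opening bases quoted above.

```python
from itertools import product

# Standard genetic code: codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate one forward frame (offset 0, 1 or 2) into one-letter codes.

    '*' marks a stop codon and 'X' marks a triplet with an unknown base.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


print(translate("atg"))        # first codon of the Ara h2 sequence
print(translate("cggctgccg"))  # first three codons of the opsin sequence
```

Passing frame=1 or frame=2 shifts the starting position by one or two bases, which corresponds to
forward frames 2 and 3 in the tool's output.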

STEP 6
Click on the 5'3' Frame 1 link to open another window with just the protein sequence.
Click and drag the cursor to highlight the entire sequence, right click the highlighted
sequence and select copy to store it. Now we are going to BLAST the sequence!
BLAST is a tool that will match your sequence to any other similar sequences and give
you a description of what your gene is or does.
Click http://www.ncbi.nlm.nih.gov/BLAST to open BLAST.

STEP 7
Click "protein blast" and right click to paste your protein sequence into the large text
box.
Click BLAST.
You have just asked the BLAST program to search the entire NCBI protein database for
matches to your sequence.
The BLAST results page can be a lot to take in, but the colour-coded graph shows the
most similar sequence in red and other sequences that are less similar in magenta,
green, blue and black.

Under the graph, click on one of the links with a high score. On the resulting page, look
for a DEFINITION or TITLE that will give you information about your gene sequence. For
the examples we have been using, one of them is a peanut allergen and the other is an
eye gene related to long-wave sensitivity and colour blindness.
Can you tell which is which?

Submission Instructions:

Please submit your individual laboratory report in the proper format by Monday (13 Oct
2014) for EH222 8A or Thursday (16 Oct 2014) for EH222 8B.
