
Week 2 Exercises + Answers (Bioinformatics)

Introduction
This exercise has two main goals:
1) Introduction to the types of DNA data contained in the GenBank database (data format,
visualization, cross-database links, how biological "features" such as genes are annotated
and described as coordinates in the DNA sequence).
2) Practice searching the online version of GenBank hosted at the NCBI. Since the number of
sequences in GenBank is HUGE, it is critically important to be able to search and filter the
information. Filtering out unwanted sequences in particular can be a challenge, as we shall see.
Where to find GenBank
The GenBank database is hosted at NCBI (National Center for Biotechnology Information,
USA) (Link: http://www.ncbi.nlm.nih.gov/). Besides the main GenBank database, NCBI also
hosts a number of other biological databases (for example whole-genome databases for
human, mouse, chimp etc.). In this particular exercise we will concentrate on the classical
"GenBank" database (http://www.ncbi.nlm.nih.gov/genbank/).
Using the "Entrez" database browser
ALL the NCBI databases can be queried through a common search interface named Entrez.
On nearly all NCBI web pages a search box can be found in the upper part of the page,
allowing easy access for searching the individual databases (or searching across all
databases). Click on the following link to open a new browser window with Entrez, where
the focus is pre-set to search in the GenBank database:
http://www.ncbi.nlm.nih.gov/nucleotide
(Alternatively go to the main NCBI webpage and choose "Nucleotide" as the database).

Part 1: Concerning the DATA in GenBank


This part of the exercise is about the types of data hosted in GenBank.
Searching for a specific ID
The typical reasons for searching for a specific ID in GenBank are looking up information
from the literature (e.g. a gene found in a study), following up on information from other
databases, investigating lists of interesting genes etc. In this part of the exercise we will be
working with a set of alpha-globin genes.
• Search for AB001981 - by default the result is shown in the GenBank format.

QUESTION 1.1:
a) How many genes are contained in this entry?
Inspecting the FEATURE table of the entry reveals that two CDS regions are defined;
therefore there are two genes in this entry. As stated on the GenBank hand-out, "CDS" is the
most stable definition of a protein coding gene used in the GenBank format - sometimes
"gene" will also be present, but CDS is more commonly used.

b) From which organism does the DNA originate?


Columba livia (Rock pigeon / domestic pigeon)

This study source was downloaded by 100000885376840 from CourseHero.com on 05-08-2024 09:39:08 GMT -05:00

https://www.coursehero.com/file/187119543/BI-W2-Ex-Ansdocx/
c) What kind of information is contained within the HEADER and within the FEATURE
block?
The HEADER contains general information about the entry: Organism, publication references,
keywords, accession-ID etc. The FEATURE table contains information that refers to
coordinates in the DNA sequence - for example definition of CDS regions.

PubMed links
Notice that the publication from which the DNA sequence originates is cited (and linked via
a PubMed ID) within the header. Sometimes multiple publications related to the same gene
are listed. This is of great importance since it makes it possible to trace the source(s) of the
DNA sequence and investigate whether the experiments carried out can be trusted.
This can be of real importance if something seems "wrong" with the sequence (for example
if this particular gene exhibits a really strange intron/exon structure compared to other
closely related genes, or if it simply doesn't match ANY other known genes of the same
family). By investigating the original publication it's possible to double-check the
experimental procedure. It may be that the article correctly states the gene to be of type
XXX, but when the data was submitted it was accidentally annotated as YYY (it is the original
researchers' responsibility to double-check this). There can also be more serious problems
with the experiments ranging from bad/wrong PCR primers, to contamination with DNA
from a different species during a cloning step.
NEVER FORGET: biological data CAN be wrong.
• Investigate the PubMed link(s):
▪ Follow the PubMed link from the sequence entry.
▪ Observe that it is always possible to read the ABSTRACT of the publication in
PubMed, even if access to the publication requires a subscription. For most
(new) publications there will also be a direct link to the publication itself.
▪ Return to the sequence entry once again (or perform the search again if you
closed the window).
GenBank vs. FASTA format
• View the sequence entry in FASTA format (simply click on "FASTA" in the top part of
the page, below the page title).
Now the entire GenBank entry is shown in FASTA format.

QUESTION 1.2:
a) What happened to the alpha-globin genes? Can they still be found?
Since the FEATURE table has been thrown away, we no longer have the coordinates for the
genes. As such they are "in there" somewhere, but we cannot find them without using
external information.

b) Which part of the GenBank entry has been converted?


The entire "ORIGIN" block (all the DNA sequence) has been converted to FASTA format. The
FEATURE table is discarded. From the HEADER block the definition (title) and accession
number are preserved; the rest is discarded.
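
The conversion described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how NCBI does it: the function name `genbank_to_fasta` and the toy record `TOY0001` are made up for the example, and the parser assumes a single, well-formed record.

```python
def genbank_to_fasta(gb_text, width=70):
    """Convert one GenBank flat-file record (as text) to FASTA text."""
    accession = ""
    definition = []
    sequence = []
    section = None  # which multi-line block we are currently inside
    for line in gb_text.splitlines():
        if line.startswith("DEFINITION"):
            section = "definition"
            definition.append(line[len("DEFINITION"):].strip())
        elif line.startswith("ACCESSION"):
            section = None
            accession = line.split()[1]
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif line.startswith("//"):
            break  # end of record
        elif section == "definition" and line.startswith(" "):
            definition.append(line.strip())  # continuation of the title
        elif section == "origin":
            # keep only the bases; drop position numbers and spaces
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        else:
            section = None  # other HEADER lines and the FEATURE table are discarded
    seq = "".join(sequence)
    header = ">" + accession + " " + " ".join(definition)
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + chunks) + "\n"


# Hypothetical toy record (not a real GenBank entry)
TOY_RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Toy record for illustration,
            complete cds.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     CDS             join(1..6,13..18)
ORIGIN
        1 atggccaaat ttgggtaacc cggg
//
"""

print(genbank_to_fasta(TOY_RECORD))
```

Note that the `join(...)` coordinates never reach the output, which is exactly why the genes can no longer be located from the FASTA file alone.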

Observe that the name of the sequence is based on the name of the GenBank entry.
• Go back to GenBank format (click on "GenBank").

TASK: Save the GenBank "raw data" on your own computer:

• Click on "Send to" in the upper right part of the page

• Choose "Complete Record", "File" and "GenBank (full)" and click on "Create file"
• Locate the downloaded file on your own computer
• By default it has a pretty generic name ("sequence.gb") - rename the file to
"AB001981.gb"
Notice: The reason for renaming the file is simply good file management practice
- just by skimming the filenames we can now guess that it's a GenBank file ("*.gb")
and that it contains the "AB001981" entry.
• Open it in jEdit.
Notice: What we have now is the "raw" data behind the information shown online,
with no fancy HTML formatting and cross-links.
• Verify that the contents of the file are as expected by inspecting it in jEdit (it should
look exactly like the information shown online).
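
For readers who prefer scripting over clicking, the same download can be sketched with NCBI's E-utilities. The handout itself only uses the web interface, so treat this as an assumption-laden aside: the helper names `efetch_url` and `local_filename` are invented for the example, and the actual download call is left commented out because it needs network access.

```python
from urllib.parse import urlencode

# EFetch endpoint of the NCBI E-utilities (not mentioned in the handout)
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build the URL for one nucleotide record; rettype is 'gb' or 'fasta'."""
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def local_filename(accession, rettype="gb"):
    """Name the file after its content instead of the generic 'sequence.gb'."""
    extension = {"gb": "gb", "fasta": "fasta"}[rettype]
    return accession + "." + extension


print(efetch_url("AB001981"))
print(local_filename("AB001981"))

# To actually download (requires network access):
# from urllib.request import urlretrieve
# urlretrieve(efetch_url("AB001981"), local_filename("AB001981"))
```

Naming the file from the accession follows the same file-management reasoning as the manual rename in the task.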

QUESTION 1.3: Does the downloaded file have UNIX or Windows line-endings?
The downloaded file has Unix line endings. Remember from the jEdit exercise that line
endings are indicated by the letters "U", "W" or "M" in the lower right hand corner of the
jEdit window.
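
The same check can be done without an editor. The sketch below counts the three line-ending styles in the raw bytes of a file; the function name `detect_line_endings` and the returned labels are choices made for this example.

```python
def detect_line_endings(data: bytes) -> str:
    """Classify the line endings found in raw file content."""
    crlf = data.count(b"\r\n")            # Windows: carriage return + line feed
    lf = data.count(b"\n") - crlf         # Unix: line feed alone
    cr = data.count(b"\r") - crlf         # classic Mac: carriage return alone
    counts = {"Windows (CRLF)": crlf, "Unix (LF)": lf, "Mac (CR)": cr}
    present = [name for name, n in counts.items() if n > 0]
    if not present:
        return "none"
    if len(present) > 1:
        return "mixed"
    return present[0]


print(detect_line_endings(b"LOCUS\nDEFINITION\n"))
print(detect_line_endings(b"LOCUS\r\nDEFINITION\r\n"))

# For the downloaded file (must be opened in binary mode):
# with open("AB001981.gb", "rb") as handle:
#     print(detect_line_endings(handle.read()))
```

The file has to be read in binary mode, since text mode in Python translates line endings before the program sees them.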

Exploring the genes defined in a GenBank entry


Go back to the GenBank entry in your browser. Click the first "CDS" element (Alpha-D).
• CDS = CoDing Sequence: the PROTEIN CODING part of a gene. Basically: the
sequence you get when the CODING exons are concatenated (UTR regions are
ignored). A complete CDS starts with a START codon and ends with a STOP codon.
• Hopefully it's quite intuitive why some of the sequence is highlighted - otherwise
discuss it within the group (or with the instructor).
Repeat the same procedure for the other CDS (Alpha-A).
• When looking at the FEATURE table, the first line of text in the definition of each CDS
is as follows:

join(1104..1192,1306..1510,1614..1742)
join(4915..5009,5165..5369,5474..5602)

QUESTION 1.4: Based on your observations:


a) What do these numbers mean?
The "join" statements define how to extract the coding sequence from the entire length of
DNA in the entry: "join(1104..1192,1306..1510,1614..1742)" is basically a recipe
for pasting together the three intervals, which gives the protein coding part of the gene:
the coding exons glued together. The CDS will always start with a START codon (e.g. ATG)
and end with a STOP codon (e.g. TAA).

b) How many coding exons does each gene contain?


Each gene contains three coding exons. Note: from a CDS definition we don't get any
information about UnTranslated Regions (UTRs) that are often found before and after the
coding region in the mRNA.
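
The answers to Question 1.4 can be checked with a short script. It is a minimal sketch that only handles plain forward-strand locations like the two shown here (no "complement(...)" and no partial markers such as "<" or ">"); the function names and the toy sequence are made up for the example.

```python
import re


def parse_join(location):
    """Return the intervals of a join(...) location as (start, end) pairs.

    Coordinates are 1-based and inclusive, as in the GenBank FEATURE table.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def cds_length(location):
    """Total number of bases covered by all intervals."""
    return sum(end - start + 1 for start, end in parse_join(location))


def extract_cds(sequence, location):
    """Paste the intervals together (convert to 0-based Python slices)."""
    return "".join(sequence[start - 1:end] for start, end in parse_join(location))


ALPHA_D = "join(1104..1192,1306..1510,1614..1742)"
ALPHA_A = "join(4915..5009,5165..5369,5474..5602)"

for name, location in [("Alpha-D", ALPHA_D), ("Alpha-A", ALPHA_A)]:
    exons = parse_join(location)
    length = cds_length(location)
    print(name, "-", len(exons), "coding exons,", length, "bp,",
          "multiple of 3:", length % 3 == 0)

# Toy sequence: coding parts in upper case, everything else in lower case
toy = "ccATGGCCttttAAAggTAAcc"
print(extract_cds(toy, "join(3..8,13..15,18..20)"))
```

Both CDS lengths (423 bp and 429 bp) are multiples of three, as expected for a sequence that is read codon by codon from START to STOP.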

View both CDSs in FASTA format (click "Send to" in the upper right corner, choose
"Coding Sequences" and set format to "FASTA")

QUESTION 1.5: What do the numbers in the sequence title represent?
The first number is the GI number (GenInfo Identifier, taken from the VERSION line in the
header). The subsequent numbers are the positions (coordinates) in the original entry (taken
from the join line).

• Switch to Graphic view (click on Graphics at the top of the page)

An interactive graphical representation of the GenBank entry will now be shown. The
upper part of the visualization shows the entire length of the entry (5,891 bp) with
bars representing the individual exons within the two genes.
▪ The zoomed view below can be changed by dragging the transparent box
with the blue borders in the overview representation at the top of the page.
▪ The zoom level can be changed.
▪ By "mousing over" the bars, additional information about that particular
feature will be shown.
The graphical overview is mostly useful for inspecting GenBank entries with multiple
genes (some entries have hundreds of embedded genes). Play around with the
interface for a few minutes to see what functionality is offered.

Part 2: Searching GenBank


The key issue to keep in mind when searching GenBank is to avoid drowning in huge
amounts of irrelevant data. It is therefore of great importance to filter out unwanted
information, WITHOUT losing the relevant entries. Today we will work with searching the
TEXTUAL annotation of GenBank entries (keywords, free text etc). We will later get back to
sequence based searches (BLAST).
In the first part of the exercise we'll investigate various ways to search using insulin as the
example.
Naïve search
Search for GenBank entries containing the term "insulin"
• Just do a simple search for INSULIN - don't put anything else in the search box.
Observe the following:
• A large number of entries are found.
• Go through a few pages of results and notice that we are offered data from a diverse
set of sources: experimental work, patent applications, predicted genes, partial genes
etc.
QUESTION 2.1.1:
a) How many search results were returned?
198,542 hits

b) Are they all from Human? If no, give a counterexample. (Would you have expected
them to be all human?)
No. One example is the first hit, M57671.1, "Octodon degus insulin mRNA, complete cds", which
is from a degu, a rat-like rodent from Chile. In fact, you can see on the right side of the
results page that only 11,216 hits are from human. There is no reason to expect only human
results from GenBank, since it is not a human-centric database.

c) Are they all insulin? If no, give a counterexample.


No. There are many hits to complete or partial chromosome sequences which contain a lot of
other genes. An example is JWIN03000075.1, "Camelus dromedarius breed African isolate
Drom800 Contig74, whole genome shotgun sequence".

By default the search term is matched against ALL POSSIBLE fields in the GenBank entries -
including almost all text in the HEADER and FEATURE table. It's even possible to pick up
entries where the match is to one of the authors' names and not a gene name! (Perhaps not
an issue for insulin.) Luckily it is possible to restrict the search to specific pre-indexed fields
in the HEADER and FEATURE table ("Search fields"), which makes the search much more
focused.

How the search is interpreted


When you do a naïve search (just write a few terms Google-style), GenBank tries to interpret
what you most likely meant, and it has a behind-the-scenes scheme for sorting the results to
push the most interesting ones to the top. It is actually possible to see exactly how your
search query is interpreted by locating the SEARCH DETAILS box.
QUESTION 2.1.2:
a) What has your search for "insulin" been expanded into?
In the Search details box, you find "insulin[All Fields]".

Spend a few moments to investigate the HEADER section of the GenBank entry you have all
received as a hand-out (X01831) to get an idea of how the data is related to specific sections
(e.g. KEYWORDS and ORGANISM which we will use in a moment).
Try to find a search result that appears NOT to be the real insulin gene, and see why it was
picked up by the search. If you have trouble finding one in your own result, search
for DL142095.1 which came up around page 200 when the exercise was written.
The main issue here is that we find entries where "insulin" is mentioned anywhere in the
entry, and sometimes it's unrelated genes like "Insulin-receptor", "Insulin inhibitor" etc.
Searching for human insulin
Search for human insulin and see what happens.
QUESTION 2.1.3:
a) How many search results were returned?
17,342 hits.

b) Can you find the human insulin entry? (If yes, write down its title and accession)
Yes, it is among the hits on the first page of results.
Title: Homo sapiens insulin (INS) gene, complete cds
Accession numbers: AH002844 J00265 J00268

c) How was your search interpreted by the system (the SEARCH DETAILS box)?
("Homo sapiens"[Organism] OR human[All Fields]) AND insulin[All Fields]

Advanced search
Looking at the SEARCH DETAILS from the naïve searches we have just performed gives us a
good idea of how we can build our own more powerful searches. This can be done in two
ways:
1. Simply writing the advanced search string yourself (e.g. "insulin[title]" - to search in
the title field)

2. Using the "Search builder" to put together the query bit by bit.
But why did the naïve search for "human insulin" go so well?
• If you just need a single (and well-known) gene from one of the well-known model
organisms, a simple search will indeed work very well (much like when you do
a Google search and get your desired hit on the first page).
• However, there are some situations where it's beneficial to specify the search in
more detail - e.g. for building data sets of the same gene across multiple species, or
just trying to locate a slightly more obscure gene (the same as when the link you were
looking for on Google was on page 10+ and you had to provide more accurate search
terms).

Now we are going to narrow down the search to specific parts of the annotation.
• Click on Advanced at the top of the page.
This brings up a form with a "Search Builder" that can be used to select and combine
terms restricted to specific fields.
• Select "Organism" and enter human.
• Select "Title" and enter insulin.
• Click "Search"
QUESTION 2.2:
a) How many hits do we have now?
5415 hits.

b) Are they all from Human? If no, give a counterexample.


Yes (except for 10 hits that are synthetic constructs, but based on human sequence). See the
"Top Organisms" box on the right.

c) Do they all appear to be insulin genes? If no, give a counterexample.
No.
• There are many examples of insulin-degrading enzyme, insulin-like growth factor,
insulin receptor and insulin-induced genes.
• Many entries are mRNA and therefore not gene entries.

• Now use the "Search Builder" to search for insulin in other fields instead of "Title"
(still with "Organism" set to human)

QUESTION 2.3:
a) How many hits are found when "Keyword" is set to insulin?
9 hits.

b) How many hits are found when "Protein Name" is set to insulin?
13 hits.

c) Find the correct Human Insulin gene entry (the correct hit). Write down its accession
number, Locus name and Definition (title).
Accession numbers: AH002844 J00265 J00268, Locus name: AH002844, Definition (title):
"Human insulin gene, complete cds".

Note that the "Search Builder" is simply a tool for filling out the search box. If you know the
names of the available search fields, it is often more convenient to type your search with the
field names manually. A schematic overview of the search fields can be found on the NCBI
homepage: Search Fields and Qualifiers.
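
Typing the field names manually also makes it easy to generate queries from a script. The sketch below builds the same query as the Search Builder example (Organism = human, Title = insulin) and wraps it in a URL for the ESearch service of NCBI's E-utilities. E-utilities is not part of this exercise, and the helper names `fielded` and `esearch_url` are invented for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# ESearch endpoint of the NCBI E-utilities (not mentioned in the handout)
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Attach a search field to a term, quoting multi-word phrases."""
    if " " in term:
        term = '"' + term + '"'
    return term + "[" + field + "]"


def esearch_url(query, retmax=20):
    """Build an ESearch URL for the Nucleotide database."""
    return ESEARCH + "?" + urlencode({"db": "nuccore", "term": query, "retmax": retmax})


query = " AND ".join([fielded("human", "Organism"), fielded("insulin", "Title")])
print(query)
print(esearch_url(query))

# Round trip: the URL-encoded term decodes back to the query we typed
decoded = parse_qs(urlparse(esearch_url(query)).query)["term"][0]
print(decoded == query)
```

The query string printed first is exactly what could be pasted into the search box on the web page.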

Combining search terms using boolean operators: NOT, AND and OR


Figure: Venn diagrams for Boolean logic
Our next task will be to find full length insulin genes from as many different organisms as
possible using the Title field. Note that it might have been easier to use the Protein
name or Keyword fields, but with Title we can immediately see the results of what we are
doing, so we are using it for pedagogical reasons. We will now type the searches directly into
the Search Box without using the Search Builder.
• Let's start out with a new clean search for insulin:
Query:

insulin[title]

The number of hits is very high, and there are many partial genes and mRNA entries.
• Let's now specify that the entries should be complete:

insulin[title] AND complete[title]

About the use of AND: The AND keyword is implicitly used whenever you enter more than
one search term: "human globin" will be interpreted as "human AND globin" and only
results where BOTH terms are found will be reported. We could therefore have omitted the
"AND" in the previous query.
Observe that we still have many hits that are not actually insulin, so we want to add search
terms to AVOID in order to bring down the false positive rate. By a brief inspection of some
of the search hits, it turns out that some of them are, e.g., insulin receptors.
• Let's get rid of these with the NOT keyword:

insulin[title] complete[title] NOT receptor[title]

Conceptually what we are doing here is to conduct a number of searches that are either
COMBINED or SUBTRACTED from each other. The "receptor[title]" search term finds all
entries where this term is found. This list is then excluded from the combined "insulin[title]
AND complete[title]" list by using the NOT operator.
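
The idea of combining and subtracting result lists maps directly onto set operations. In the sketch below each search returns a set of accessions, AND is a set intersection and NOT is a set difference. The four records are invented toy data, and the word matching is far simpler than what Entrez actually does.

```python
import re

# Hypothetical toy "database": accession -> title
TITLES = {
    "T1": "Toy species insulin gene, complete cds",
    "T2": "Toy species insulin receptor gene, complete cds",
    "T3": "Toy species insulin mRNA, partial cds",
    "T4": "Toy species actin gene, complete cds",
}


def search_title(word):
    """Return the set of accessions whose title contains the word."""
    return {
        accession
        for accession, title in TITLES.items()
        if word.lower() in re.findall(r"[a-z]+", title.lower())
    }


insulin = search_title("insulin")      # three hits
complete = search_title("complete")    # three hits
receptor = search_title("receptor")    # one hit

# insulin[title] AND complete[title]
both = insulin & complete
# insulin[title] complete[title] NOT receptor[title]
filtered = both - receptor

print(sorted(both))
print(sorted(filtered))
```

Each additional NOT term simply subtracts one more set from the current result.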

The use of boolean operators can be visualized graphically using Venn diagrams (see the
figure to the right). A good strategy for narrowing down a GenBank search is to build a list of
"kill words"/"filter words" (terms to avoid). More terms can be added to the list as search
results are inspected and it becomes clear why strange entries appear on the result list.

A word of caution: Be careful not to throw the baby out with the bath water - don't add
kill-words that are so broad that they will actually exclude the gene(s) we are looking for.
And don't add kill-words without specifying a search field - e.g. the search

insulin[title] complete[title] NOT receptor

would exclude some real insulin hits that just happened to mention "receptor" in some
reference!
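
The difference between a fielded and an unfielded kill-word can be shown on toy data. The three records below are invented, and the matching is a plain substring test, so this only illustrates the principle rather than how Entrez indexes entries.

```python
# Hypothetical toy records: T1 is a real insulin gene whose reference
# title happens to mention "receptor"; T2 is an insulin receptor gene.
RECORDS = [
    {"acc": "T1", "title": "Toy insulin gene, complete cds",
     "reference": "Binding of insulin to its receptor"},
    {"acc": "T2", "title": "Toy insulin receptor gene, complete cds",
     "reference": "Cloning of a toy gene"},
    {"acc": "T3", "title": "Toy insulin gene, complete cds",
     "reference": "Sequence of a toy gene"},
]


def has(record, word, field=None):
    """True if word occurs in the given field, or anywhere if field is None."""
    text = record[field] if field else " ".join(record.values())
    return word.lower() in text.lower()


def search(kill_field):
    """insulin[title] complete[title] NOT receptor[kill_field]."""
    return [
        r["acc"] for r in RECORDS
        if has(r, "insulin", "title")
        and has(r, "complete", "title")
        and not has(r, "receptor", kill_field)
    ]


print(search("title"))  # kill-word restricted to the title
print(search(None))     # kill-word matched against all fields
```

With the kill-word restricted to the title, T1 survives; without a field it is lost because of a single word in its reference.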
• The final part of the exercise is to continue finding terms to exclude on your own.
The point is to bring down the number of search results to a level where it's
easy to pick the correct ones. Remember: the task is to find full length insulin genes
from as many different organisms as possible using the Title field.
QUESTION 2.4:
a) Which search term did you end up using?
b) How many search results do you get now?
Notice: There are several possible answers to this question, as it will be a balance between
filtering out False Positives (things that are NOT insulin) without filtering out (too many) True
Positives (things that are actually insulin).

The important thing here is not the precise search string, but that you understand the
principle of using "kill-words". One possible answer could be:
insulin[title] complete[title] NOT mRNA[title] NOT receptor[title] NOT receptor-like[title]
NOT "insulin like"[title] NOT "insulin degrading"[title] NOT "growth factor"[title]
NOT "family member"[title] NOT "insulin induced"[title] NOT "insulin dependent"[title]
NOT "insulin promoter"[title]

which gives 19 hits, representing 12 organisms and some synthetic constructs.
Note the use of double quotes ("") to add two-word "kill phrases".
Note: don't kill "insulin precursor"! Insulin is always synthesized as a precursor,
preproinsulin, that contains a signal peptide, a propeptide, and the two mature chains.
More about insulin in the exercises next week.
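
Since the kill-word list tends to grow as results are inspected, it can be convenient to keep it as a list and generate the search string from it. The helper `build_query` below is a made-up convenience function for this purpose; it quotes multi-word phrases and attaches the same field to every term.

```python
def build_query(required, kill_words, field="title"):
    """Build a search string: required terms, then one NOT per kill-word."""
    def tag(term):
        # multi-word phrases must be quoted to be searched as a phrase
        quoted = '"' + term + '"' if " " in term else term
        return quoted + "[" + field + "]"

    query = " ".join(tag(term) for term in required)
    for word in kill_words:
        query += " NOT " + tag(word)
    return query


kill_list = ["mRNA", "receptor", "insulin like"]
query = build_query(["insulin", "complete"], kill_list)
print(query)

# Add another kill phrase after inspecting the results, then rebuild
kill_list.append("growth factor")
print(build_query(["insulin", "complete"], kill_list))
```

Because every kill-word gets an explicit field, the script cannot produce the unfielded "NOT receptor" form warned against above.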

"Free exercise"
Now it's time to perform a number of GenBank searches on your own. It's important to think
about the search strategy - discuss this within the group.
QUESTION 3: Do at least three of the tasks below and report your findings. Remember to write
down the search string you ended up using for each question.
1. Find the Rat and Mouse Insulin gene

(rat[ORGANISM] OR mouse[ORGANISM]) AND insulin[KEYWORD]

This gives 10 hits.


By manual inspection of the results, I then pick the following entries:
• J00748 - Rat insulin II gene (ins-2) with two introns
• J00747 - Rat insulin-I (ins-1) gene
• X04724 - Mouse preproinsulin gene II
• X04725 - Mouse preproinsulin gene I
Note: rodents have two copies of the insulin gene in their genomes.
Note: using "Protein Name" as field yields no results - you cannot assume that
entries are always annotated with Protein Name.

2. Find the alcohol-dehydrogenase gene from as many organisms as possible.

"alcohol dehydrogenase"[title] complete[title] NOT mRNA[title] NOT
synthetic[title]

which gives 1818 hits.


Note: as many as 360 of these hits are from one organism, Populus nigra (Poplar
tree).

3. Find the alpha-globin gene from Capra hircus - (Remember: Alpha-globin is part of
hemoglobin).

"Capra hircus"[ORGANISM] AND "alpha globin"[title]

This gives 6 hits. There are 2 alpha globin genes, HBAI and HBAII, and they are both
present in two entries. Correct answers could be:
• EU938074 Capra hircus I alpha globin (HBAI) gene, complete cds
• EU938078 Capra hircus II alpha globin (HBAII) gene, complete cds

4. Find the alpha-globin gene from all ruminants - (hint: inspect the ORGANISM fields
in a GenBank entry from an animal you know to be a ruminant, in order to pick up a
good search term). If you want to go deeper into the taxonomy, the Tree of Life
project has an entry on placental mammals here:
http://tolweb.org/tree?group=Eutheria&contgroup=Mammalia

Ruminantia[ORGANISM] AND "alpha globin"[title]

This yields 16 hits (which will need a bit of clean-up).

5. Find the actin gene from as many organisms as possible.


Avoid mRNA and entries that are part of whole chromosomes, cosmids etc

actin[title] AND actin[protein name] NOT mRNA[title] NOT partial[title]

which yields 428 hits.

6. Find the human insulin receptor gene. Avoid partial genes / single exons in the
results.

human[organism] "insulin receptor"[title] NOT mRNA[title] NOT substrate[title] NOT partial[title]

gives 74 hits, with #1 or #2 being the right one:

• NG_008852.2 Homo sapiens insulin receptor (INSR), RefSeqGene on chromosome 19
• AH002851.2 Homo sapiens insulin receptor (INSR) gene, complete cds
