BI W2 Ex Ans
Introduction
This exercise has two main goals:
1) Introduction to the types of DNA data contained in the GenBank database (data format,
visualization, cross-database links, how biological "features" such as genes are annotated
and described as coordinates in the DNA sequence).
2) Practice searching the online version of GenBank hosted at the NCBI. Since the number of
sequences in GenBank is HUGE, it's critically important to be able to search and filter the
information. Filtering out the unwanted sequences in particular can be a challenge, as we shall see.
Where to find GenBank
The GenBank database is hosted at NCBI (National Center for Biotechnology Information,
USA) (Link: http://www.ncbi.nlm.nih.gov/). Besides the main GenBank database, NCBI also
hosts a number of other biological databases (for example whole-genome databases for
human, mouse, chimp etc.). In this particular exercise we will concentrate on the classical
"GenBank" database (http://www.ncbi.nlm.nih.gov/genbank/).
Using the "Entrez" database browser
ALL the NCBI databases can be queried through a common search interface named Entrez.
On nearly all NCBI webpages a search box can be found in the upper part of the page,
allowing easy access for searching the individual databases (or searching across all
databases). Click on the following link to open up a new browser window with Entrez, where
the focus is pre-set to search in the GenBank database:
http://www.ncbi.nlm.nih.gov/nucleotide
(Alternatively go to the main NCBI webpage and choose "Nucleotide" as the database).
QUESTION 1.1:
a) How many genes are contained in this entry?
Inspecting the FEATURE table of the entry reveals that two CDS regions are defined;
therefore there are two genes in this entry. As stated on the GenBank hand-out "CDS" is the
most stable definition of a protein coding gene used in the GenBank format - sometimes
"gene" will also be present, but CDS is more commonly used.
This study source was downloaded by 100000885376840 from CourseHero.com on 05-08-2024 09:39:08 GMT -05:00
https://www.coursehero.com/file/187119543/BI-W2-Ex-Ansdocx/
c) What kind of information is contained within the HEADER and within the FEATURE
block?
The HEADER contains general information about the entry: Organism, publication references,
keywords, accession-ID etc. The FEATURE table contains information that refers to
coordinates in the DNA sequence - for example definition of CDS regions.
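The split between HEADER and FEATURE table can also be seen directly in the flat file. Below is a minimal sketch, assuming the usual layout where the feature table starts at a line beginning with FEATURES and the sequence starts at ORIGIN. The record TOY_RECORD and the function split_genbank are made up for illustration and are not a real GenBank entry or an official parser.

```python
# Toy record in GenBank flat-file layout; the contents are made up for illustration.
TOY_RECORD = """\
LOCUS       TOY0001                   30 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record, not a real GenBank entry.
ACCESSION   TOY0001
KEYWORDS    .
FEATURES             Location/Qualifiers
     source          1..30
     CDS             4..18
                     /product="toy protein"
ORIGIN
        1 aaaatggcta aaggttaatt tttttttttt
//
"""


def split_genbank(text):
    """Split one GenBank-format record into header, feature and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
            continue  # skip the column-title line itself
        if line.startswith("ORIGIN"):
            current = "sequence"
            continue
        if line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections


parts = split_genbank(TOY_RECORD)

# Header field names start in the first column.
header_keys = [line.split()[0] for line in parts["header"] if not line.startswith(" ")]

# Feature keys (source, CDS, ...) start in column 6; qualifier lines are indented further.
feature_keys = [line.split()[0] for line in parts["features"] if line[5:6].strip()]

print("Header fields:", header_keys)
print("Feature keys: ", feature_keys)
```

Only the feature lines carry coordinates (e.g. 4..18); the header lines describe the entry as a whole.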
PubMed links
Notice that the publication from which the DNA sequence originates is cited (and linked via
a PubMed ID) within the header. Sometimes multiple publications related to the same gene
are listed. This is of great importance since it makes it possible to trace the source(s) of the
DNA sequence and investigate if the experiments carried out are to be trusted.
This can be of real importance if something seems "wrong" with the sequence (for example
if this particular gene exhibits a really strange intron/exon structure compared to other
closely related genes, or if it simply doesn't match ANY other known genes of the same
family). By investigation of the original publication it's possible to double-check the
experimental procedure. It may be that the article correctly states the gene to be of type
XXX, but when the data were submitted it was accidentally annotated as YYY (it is the original
researchers' responsibility to double-check this). There can also be more serious problems
with the experiments ranging from bad/wrong PCR primers, to contamination with DNA
from a different species during a cloning step.
NEVER FORGET: biological data CAN be wrong.
Investigate the PubMed link(s):
Follow the PubMed link from the sequence entry.
Observe that it is always possible to read the ABSTRACT of the publication in
PubMed, even if access to the publication requires subscription. For most
(new) publications there will also be a direct link to the publication itself.
Return to the sequence entry once again (or perform the search again if you
closed the window).
GenBank vs. FASTA format
View the sequence entry in FASTA format (Simply click on "FASTA" in the top part of
the page, below the page title)
Now the entire GenBank entry is shown in FASTA format.
QUESTION 1.2:
a) What happened to the alpha-globin genes? Can they still be found?
Since the FEATURE table has been thrown away, we no longer have the coordinates for the
genes. As such they are "in there" somewhere, but we cannot find them without using
external information.
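To make the point concrete, here is a small sketch of what a program can recover from FASTA text: a title line and the sequence, nothing else. The record TOY_FASTA and the function parse_fasta are hypothetical and only meant as an illustration.

```python
# Toy FASTA text; the title and sequence are made up for illustration.
TOY_FASTA = """\
>TOY0001.1 Hypothetical toy sequence (made up for illustration)
aaaatggctaaaggt
taatttttttttttt
"""


def parse_fasta(text):
    """Return a list of (title, sequence) tuples from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


records = parse_fasta(TOY_FASTA)
title, sequence = records[0]
print("Title:   ", title)
print("Sequence:", sequence)
```

There is no field in the parsed result that could hold gene coordinates, which is why the CDS positions have to come from somewhere else (for example the GenBank-format version of the same entry).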
Observe that the name of the sequence is based on the name of the GenBank entry.
Go back to GenBank format (Click on "GenBank")
Choose "Complete Record", "File" and "Genbank(full)" and click on "Create file"
Locate the downloaded file on your own computer
By default it has a pretty generic name ("sequence.gb") - rename the file to
"AB001981.gb"
Notice: The reason for renaming the file is simply a practice of good file management
- now we can guess, just by skimming the filenames, that it's a GenBank file ("*.gb")
and that it contains the "AB001981" entry.
Open it in jEdit.
Notice: What we have now is the "raw" data behind the information shown online,
with no fancy HTML formatting and cross-links.
Verify that the contents of the file are as expected by inspecting it in jEdit (it should
look exactly like the information shown online).
QUESTION 1.3: Does the downloaded file have UNIX or Windows line-endings?
The downloaded file has Unix line endings. Remember from the jEdit exercise that line
endings are indicated by the letters "U", "W" or "M" in the lower right hand corner of the
jEdit window.
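The same check can be done without an editor by counting the end-of-line bytes. This is a minimal sketch; the function name detect_line_endings and its return labels are our own choice, not part of jEdit or GenBank.

```python
def detect_line_endings(data: bytes) -> str:
    """Classify the line endings in raw file bytes."""
    crlf = data.count(b"\r\n")        # Windows: carriage return + line feed
    lf = data.count(b"\n") - crlf     # Unix: line feed not preceded by CR
    cr = data.count(b"\r") - crlf     # classic Mac: CR not followed by LF
    counts = {"Windows (CRLF)": crlf, "Unix (LF)": lf, "classic Mac (CR)": cr}
    used = [name for name, n in counts.items() if n > 0]
    if not used:
        return "none"
    if len(used) > 1:
        return "mixed"
    return used[0]


print(detect_line_endings(b"LOCUS\nORIGIN\n"))
print(detect_line_endings(b"LOCUS\r\nORIGIN\r\n"))

# For the downloaded file (assuming it was saved as AB001981.gb in the current folder):
# with open("AB001981.gb", "rb") as handle:
#     print(detect_line_endings(handle.read()))
```

Opening the file in binary mode ("rb") matters here, because text mode would silently translate the line endings before we get to count them.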
join(1104..1192,1306..1510,1614..1742)
join(4915..5009,5165..5369,5474..5602)
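The two join(...) lines above are CDS locations: each a..b pair is one exon given as 1-based, inclusive coordinates in the DNA sequence. The sketch below shows how such a location can be turned into numbers. The helper names parse_join and spliced_length are hypothetical, and the parser only handles simple forward-strand joins like these two (no complement(...) and no partial markers such as < or >).

```python
import re


def parse_join(location):
    """Turn 'join(a..b,c..d)' into a list of (start, end) tuples (1-based, inclusive)."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(exons):
    """Total number of bases once the exons are joined together."""
    return sum(end - start + 1 for start, end in exons)


locations = [
    "join(1104..1192,1306..1510,1614..1742)",
    "join(4915..5009,5165..5369,5474..5602)",
]

for location in locations:
    exons = parse_join(location)
    print(location)
    print("  exon lengths:", [end - start + 1 for start, end in exons])
    print("  spliced CDS length:", spliced_length(exons))
```

For the two locations above this gives spliced lengths of 423 and 429 bases, both multiples of three, as expected for a complete CDS.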
View both of the CDSs in FASTA format (click "Send to" in the upper right corner, choose
"Coding Sequences" and set format to "FASTA")
QUESTION 1.5: What do the numbers in the sequence title represent?
The first number is the GI number (GenInfo Identifier, taken from the VERSION line in the header). The
subsequent numbers are the positions (coordinates) in the original gene entry (taken from
the join line).
b) Are they all from Human? If no, give a counterexample. (Would you have expected
them to be all human?)
No. There is e.g. the first hit, M57671.1, "Octodon degus insulin mRNA, complete cds" which
is from a Degu, a rat-like rodent from Chile. In fact, you can see on the right side of the
results page that only 11,216 hits are from human. There is no reason to expect only human
results from GenBank, since it is not a human-centric database.
By default the search term is matched against ALL POSSIBLE fields in the GenBank entries -
including almost all text in the HEADER and FEATURE table. It's even possible to pick up
entries where the match is to one of the authors' names and not a gene name! (Perhaps not
an issue for insulin). Luckily it is possible to restrict the search to specific pre-indexed fields
in the HEADER and FEATURE table ("Search fields"), which makes it possible to make the
search much more focused.
Spend a few moments to investigate the HEADER section of the GenBank entry you have all
received as a hand-out (X01831) to get an idea of how the data is related to specific sections
(e.g. KEYWORDS and ORGANISM which we will use in a moment).
Try to find a search result that appears NOT to be the real insulin gene, and see why it was
picked up by the search. If you have trouble finding one in your own result, search
for DL142095.1 which came up around page 200 when the exercise was written.
The main issue here is that we find entries where "insulin" is mentioned anywhere in the
entry, and sometimes it's unrelated genes like "Insulin-receptor", "Insulin inhibitor" etc.
Searching for human insulin
Search for human insulin and see what happens.
QUESTION 2.1.3:
a) How many search results were returned?
17,342 hits.
b) Can you find the human insulin entry? (If yes, write down its title and Accession)
Yes, it is among the hits on the first page of results.
Title: Homo sapiens insulin (INS) gene, complete cds
Accession numbers: AH002844 J00265 J00268
c) How was your search interpreted by the system (the SEARCH DETAILS box)?
("Homo sapiens"[Organism] OR human[All Fields]) AND insulin[All Fields]
Advanced search
Looking at the SEARCH DETAILS from the naïve searches we have just performed gives us a
good idea of how we can build our own more powerful searches. This can be done in two
ways:
1. Simply writing the advanced search string yourself (e.g. "insulin[title]" - to search in
the title field)
2. Using the "Search builder" to put together the query bit by bit.
But why did the naïve search for "human insulin" go so well?
If you just need a single (and well-known) gene from one of the well-known model
organisms, it will indeed work very well to do a simple search. (Much like when you do
a Google search and get your desired hit on the first page).
However, there are some situations where it's beneficial to specify the search in
more detail - e.g. for building data sets of the same gene across multiple species, or
just trying to locate a slightly more obscure gene. (Same as when the link you were
looking for on Google was on page 10+ and you have to provide more accurate search
terms).
Now we are going to narrow down the search to specific parts of the annotation.
Click on Advanced in the top of the page.
This brings up a form with a "Search Builder" that can be used to select and combine
terms restricted to specific fields.
Select "Organism" and enter human.
Select "Title" and enter insulin.
Click "Search"
QUESTION 2.2:
a) How many hits do we have now?
5415 hits.
QUESTION 2.3:
a) How many hits are found when "Keyword" is set to insulin?
9 hits.
b) How many hits are found when "Protein Name" is set to insulin?
13 hits.
c) Find the correct Human Insulin gene entry (the correct hit). Write down its accession
number, Locus name and Definition (title).
Accession numbers: AH002844 J00265 J00268, Locus name: AH002844, Definition (title):
"Human insulin gene, complete cds".
Note that the "Search Builder" is simply a tool for filling out the search box. If you know the
names of the available search fields, it is often more convenient to type your search with the
field names manually. A schematic overview of the search fields can be found on the NCBI
homepage: Search Fields and Qualifiers.
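A search string typed with field names can also be sent to NCBI from a script. The exercise itself only uses the web interface, so the following is a side note: a sketch that assumes NCBI's E-utilities "esearch" service and its db, term, retmax and retmode parameters. It only builds the request URL and does not contact NCBI. The function name esearch_url is our own.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the E-utilities esearch service (an assumption of this sketch;
# the exercise itself only uses the Entrez web pages).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nuccore", retmax=20):
    """Build an esearch URL for a fielded Entrez query (no network access here)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode takes care of escaping spaces, brackets and quotes in the query.
    return ESEARCH + "?" + urlencode(params)


url = esearch_url("human[Organism] AND insulin[Title]")
print(url)

# Decoding the URL again shows that the query survived the escaping unchanged.
print(parse_qs(urlparse(url).query)["term"][0])
```

The query text is exactly what would be typed into the search box; only the escaping differs.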
insulin[title]
The number of hits is very high, and there are many partial genes and mRNA entries.
Let's now specify that the entries should be complete:
insulin[title] AND complete[title]
About the use of AND: The AND keyword is implicitly used whenever you enter more than
one search term: "human globin" will be interpreted as "human AND globin" and only
results where BOTH terms are found will be reported. We could therefore have omitted the
"AND" in the previous query.
Observe that we still have many hits that are not actually insulin, so we want to add search
terms to AVOID in order to bring down the false positive rate. By a brief inspection of some
of the search hits, it turns out that some of them are, e.g., insulin receptors.
Let's get rid of these with the NOT keyword:
insulin[title] AND complete[title] NOT receptor[title]
Conceptually what we are doing here is to conduct a number of searches that are either
COMBINED or SUBTRACTED from each other. The "receptor[title]" search term finds all
entries where this term is found. This list is then excluded from the combined "insulin[title]
AND complete[title]" list by using the NOT operator.
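The combine-and-subtract idea maps directly onto set operations. In the sketch below each search term is represented by the set of entries it finds; the IDs are placeholders invented for the example, not real accession numbers.

```python
# Hypothetical hit lists for three single-term searches (placeholder IDs).
insulin_title = {"ID1", "ID2", "ID3", "ID4", "ID5"}
complete_title = {"ID2", "ID3", "ID4", "ID9"}
receptor_title = {"ID4", "ID7"}

# AND keeps only entries found by BOTH searches (set intersection).
combined = insulin_title & complete_title

# NOT removes every entry found by the unwanted search (set difference).
filtered = combined - receptor_title

print("insulin AND complete:             ", sorted(combined))
print("insulin AND complete NOT receptor:", sorted(filtered))
```

Note that ID7 never shows up in the result even though it is in the receptor list: NOT can only remove entries, it never adds any.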
The use of boolean operators can be visualized graphically using Venn diagrams (see the
figure to the right). A good strategy for narrowing down a GenBank search is to build a list of
"kill words"/"filter words" (terms to avoid). More terms can be added to the list as search
results are inspected, and it's found out why strange entries appear on the result list.
A word of caution: Be careful not to throw the baby out with the bath water - don't add
kill-words that are so broad that they will actually exclude the gene(s) we are looking for.
And don't add kill-words without specifying a search field - e.g. the search
would exclude some real insulin hits that just happened to mention "receptor" in some
reference!
The final part of the exercise is to continue finding terms to exclude on your
own. The point is to bring down the number of search results to a level where it's
easy to pick the correct ones. Remember: the task is to find full length insulin genes
from as many different organisms as possible using the Title field.
QUESTION 2.4:
a) Which search term did you end up using?
b) How many search results do you get now?
Notice: There are several possible answers to this question, as it will be a balance between
filtering out False Positives (things that are NOT insulin) and keeping (most of) the True
Positives (things that are actually insulin).
The important thing here is not the precise search string, but that you understand the
principle of using "kill-words". One possible answer could be:
insulin[title] complete[title] NOT mRNA[title] NOT receptor[title] NOT receptor-like[title]
NOT "insulin like"[title] NOT "insulin degrading"[title] NOT "growth factor"[title] NOT "family
member"[title] NOT "insulin induced"[title] NOT "insulin dependent"[title] NOT "insulin
promoter"[title] which gives 19 hits, representing 12 organisms and some synthetic
constructs.
Note: the use of double quotes ("") to add two-word "kill phrases".
Note: don't kill "insulin precursor"! Insulin is always synthesized as a precursor,
preproinsulin, that contains a signal peptide, a propeptide, and the two mature chains.
More about insulin in the exercises next week.
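Since the kill-word list tends to grow as results are inspected, it can be handy to keep it as a list and generate the search string from it. This is one possible sketch; the helper names fielded and build_query are hypothetical. It relies on the implicit AND between terms and puts multi-word phrases in double quotes, as in the answer above.

```python
def fielded(term, field="title"):
    """Quote multi-word phrases and append the search-field tag."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def build_query(required, kill_words, field="title"):
    """Combine required terms (implicit AND) with NOT-ed kill words."""
    parts = [fielded(term, field) for term in required]
    parts += ["NOT " + fielded(term, field) for term in kill_words]
    return " ".join(parts)


required = ["insulin", "complete"]
kill_words = ["mRNA", "receptor", "insulin like", "growth factor"]

query = build_query(required, kill_words)
print(query)
```

Every kill word gets the same field tag, which avoids the mistake of adding an unrestricted kill word by accident.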
"Free exercise"
Now it's time to perform a number of GenBank searches on your own. It's important to think
about the search strategy - discuss this within the group.
QUESTION 3: Do at least three of the below and report your findings. Remember to write
down the search string you ended up using for each question.
1. Find the Rat and Mouse Insulin gene
"alcohol dehydrogenase"[title] complete[title] NOT mRNA[title] NOT
synthetic[title]
3. Find the alpha-globin gene from Capra hircus - (Remember: Alpha-globin is part of
hemoglobin).
This gives 6 hits. There are 2 alpha globin genes, HBAI and HBAII, and they are both
present in two entries. Correct answers could be:
EU938074 Capra hircus I alpha globin (HBAI) gene, complete cds
EU938078 Capra hircus II alpha globin (HBAII) gene, complete cds
4. Find the alpha-globin gene from all ruminants - (hint: inspect the ORGANISM fields
in a GenBank entry from an animal you know to be a ruminant, in order to pick up a
good search term). If you want to go deeper into the taxonomy, the Tree of Life
project has an entry on placental mammals here:
http://tolweb.org/tree?group=Eutheria&contgroup=Mammalia
6. Find the human insulin receptor gene. Avoid partial genes / single exons in the
results.