
Student’s name (Last, First): Malki, Ibrahim

Section: 006 Instructor: Zainab Tanvir

Laboratory N° 1
Bioinformatics: Self-Guided Internet-based Exercise
on Databases for the Storage and Data Mining

This exercise aims to introduce you to some of the relevant databases and bioinformatics tools for
examining and comparing different pieces of biological information. Biological databases are an
important resource (Maloney et al., 2010) for the study of biochemistry, molecular genetics,
transmission genetics, cell biology, evolution and many other branches of the biological sciences.

Biological databases contain enormous amounts of information about the sequences and structures of
nucleic acids (DNA and RNA) and proteins; gene structures and chromosomes; metabolic pathways
and enzymes; signaling mechanisms, etc. Some of them include software tools that can be used to
analyze such data. Often, the software can be used directly through a web browser (web apps).
Freestanding applications must be downloaded and installed on your computer or a local network.

The analysis of biological macromolecules (especially DNA, RNA and proteins) is based on the
fundamental principle of gene expression, also known as the Central Dogma of Molecular Genetics,
represented in this oversimplified diagram:

The Internet hyperlinks are active in this Word document, which is the version you should work in.
Do not use the PDF version in the Lab Manual. Enter your answers by double-clicking the
phrase STARTTTYPINGTHERE and start typing.

Important: Always give your document a title that includes your name and other pertinent
information. “Untitled 1.docx” is not a good name, neither are “Graph.xlsx” or “ExtraCredit.pdf.”
You can imagine how many papers we get from students curiously named “Untitled 1.” So, here’s a
suggestion (assuming that you are using Microsoft Word):

LastName_FirstName_202_Section_NN_Bioinformatics.docx. For example, Mr. Paul Whittick
sends a paper to Mr. Sergio Capellutti, instructor for section 3. So Mr. Whittick gives
his paper the unmistakable name “Whittick_Paul_202_03_Bioinformatics.docx,”
and not a generic “Untitled37.docx.”

1. Finding Databases on the World Wide Web

We'll start by finding databases (Honts, 2003). You may click on the URLs in this document. Describe
in a short sentence the function of each website. The home page usually has a brief
description of the purpose its creators had in mind. Some titles are obvious (e.g. OMIM =
120:202 Foundations of Biology CMB Laboratory/Fall 2018 Page 2 of 13

Online Mendelian Inheritance in Man); others are not. For example, if you read the top of BLAST’s
first page, you’ll find: “BLAST finds regions of similarity between biological sequences. The program
compares nucleotide or protein sequences to sequence databases and calculates the statistical
significance.” BLAST stands for Basic Local Alignment Search Tool and any biology student should
become familiar with it. Click “Learn more” to find an expanded description.

1.1. General databases and tools for bioinformatics studies

National Center for Biotechnology Information
Brief description: This site has data about genomes and has tools to help analyze gene

Brief description: This site finds similarities between a given nucleotide or protein sequence and
sequences that have already been researched, which helps to determine what the given gene is and how
it relates to known data.

Brief description: This site provides sources such as e-books and online journals for
biomedical research.

Online Mendelian Inheritance in Man (OMIM)
Brief description: This site provides information on human genes and phenotypes.

NCBI Conserved Domain Search
Brief description: This site is used to understand and apply multiple sequence alignment (MSA)
models, and it has data on proteins.

CDART: Conserved Domain Architecture Retrieval Tool
Brief description: This site finds similarities between protein sequences by using specific

European Bioinformatics Institute
Brief description: This site provides a way to search DNA and RNA sequences.

Protein Data Bank
Brief description: This site is used to understand the features of proteins.

GenomeNet Database Resources
Brief description: This site provides information on genomes.

1.2. Access points for integrated suites of sequence analysis tools

Multiple sequence alignment (protein)
Brief description: This site is used to find differences in protein sequences.

Brief description: This site is used to understand various life science disciplines such as
phylogeny and population genetics.

Multiple sequence alignments
Brief description: This site provides various tools for desired size and methodology for
aligning biological sequences.

PRABI (Rhone-Alpes Bioinformatics Center)
Brief description: This site is used to learn about bioinformatics and biostatistics by providing
access to information and online tools.

Biology Workbench/San Diego Supercomputer Center (currently unavailable for lack of funding)
Brief description: This site is used to learn about bioinformatics and is known for its
convenient use of modeling and analysis tools.

1.3. Some resources for human genomics

The Human Genome (NCBI)
Brief description: This site is used to search for human genes.

Human Genome Browser Gateway (UCSC)
Brief description: This site is used to understand species and genetics using phylogeny

Brief description: This site is used to focus on the parts of the human genome that are more
active (important proteins, regulatory functions, etc.).

1.4. Databases with entire genomic sequences

National Center for Genome Resources
Brief description: This site has publications regarding bioinformatic research and real-world

J. Craig Venter Institute
Brief description: This site has information about latest genomics research.

Gramene: A Resource for Comparative Grass Genomics
Brief Description: This site focuses on information regarding plant genomics.

 You may want to visit the website and sign up to be notified when the Biology Workbench becomes available.

Maize GDB (Maize Genetics and Genomics Database)
Brief Description: This site is used to view plant genomics research.

1.5. Example of a specialized structure prediction tool

COILS Server
Brief description: This site is used to compare specific protein regions known as “coiled coil

1.6. Metabolic and signaling pathways

BioCyc (several organisms)
Brief description: This site is used to search for and examine genomes.

EcoCyc (Escherichia coli)
Brief description: This site contains extensive research on the bacterium known as E. coli.

Saccharomyces cerevisiae (brewer’s yeast)
Brief description: This site contains extensive research on the genome of the “brewer’s yeast”.

Arabidopsis thaliana (thale cress)
Brief description: This site has extensive research on the model plant known as thale cress.

Danio rerio (zebra fish)
Brief description: This site is used to find research on the genes of the zebra fish.

Mus musculus (mouse)
Brief description: This site has research on the genome of the mouse.

Homo sapiens (human)
Brief description: This site contains information about metabolites found in humans.

1.7. Additional learning resources (notice the absence of Wikipedia on this list)
Brief description: This site is used to look up taxonomic terms.

Gene ontology:

Brief description: This site is used to find similarities between two different sets of genes.

Phylogenetic trees: tree
Brief description: This site has information about phylogenetic trees.
Brief description: This site explains the problems that arise when making phylogenetic trees.
Brief description: This site is used to analyze viruses.

Google Scholar
Brief description: This site is a search engine for scientific journals and articles.

Brief description: This site is a comprehensive search engine for information on a wide variety
of scientific and nonscientific data.

1.8. The National Center for Biotechnology Information (NCBI)

NCBI is a comprehensive network of databases that include information on nucleotidyl sequences
(e.g. chromosomal DNA, mRNA, non-protein–coding RNAs), amino acyl sequences (proteins), taxonomy,
and genetically based diseases (also known as “inborn errors of metabolism”). Here’s a diagram that
illustrates the relationships among these different databases:

You may want to continue exploring NCBI. This link will take
you to a comprehensive list of all databases in it:

2. Case Study: An Unknown Human Nucleotidyl Sequence

Specific Learning Objectives

1. Describe what GenBank files are and be able to read them.
2. Describe what FASTA format is and learn how to identify sequences in FASTA format.
3. Become familiar with the BLAST program (check NCBI websites) and learn how to use it.

NOTE: Your instructor may decide to assign you a sequence that differs from the one in this section.
If this is the case, enter modifications to this document as necessary.

The nucleotidyl-residue (or “nucleotide,” for short) sequence on the following page comes from a
human DNA sequencing project. You are given the task of identifying the location of this sequence
within the human genome (Alaie et al., 2012). The problem is that the human genome is made up of 3
billion base pairs (bp). To check even 1000 bp by eye in search of this sequence is quite time-
consuming (as you will find out shortly). Imagine if you had to check a billion nucleotides in a

Notice that the sequence provided below is in FASTA format, i.e., it does not start directly with
nucleotide abbreviations (A, G, T, C), nor does it include numbers, spaces or symbols. Instead, a
name or designation for the sequence is written in the first line, preceded by the “>” symbol.
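
The FASTA layout described above (a description line starting with “>”, followed by one or more lines of sequence) is simple enough to read with a few lines of Python. The sketch below is only an illustration: `read_fasta` is a hypothetical helper name, and the record is a made-up example, not the sequence used in this lab.

```python
# Minimal FASTA reader: a sketch, not a replacement for an established library.
# The record below is a made-up example, not the sequence used in this lab.

def read_fasta(text: str) -> dict[str, str]:
    """Map each description line (without the leading '>') to its joined sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith(">"):
            header = line[1:]                 # description line
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()   # sequence lines are concatenated
    return records

example = """>example_sequence hypothetical test record
ACGTACGTAC
GGCTTA
"""

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
# example_sequence hypothetical test record 16 ACGTACGTACGGCTTA
```

Note that the line breaks inside the sequence carry no meaning; the reader joins them into one continuous string.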

Start by scanning (by eye) the given sequence (3360-bp) in search of the location of the following
short nucleotide stretches. Devise your own method.



Mark the sequences on your printout of this document (underline or use a highlighter) or on the
electronic document, as requested by your instructor.

Please note the time at the beginning of your search and answer the following questions once you
have located your sequence.

1. Describe the method you used to find the sequence stretches (visual comparison? computer-

I used the computer’s Find command (Ctrl+F) to find these sequences.

2. How long did it take for you to find your sequence?

Sequence i) about 5 seconds
Sequence ii) about 5 seconds
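
The Find-command approach reported above amounts to an exact substring search. A minimal Python sketch of the same idea follows. The sequence and the two short stretches are made-up placeholders (the stretches assigned in this lab are not reproduced here), and `find_all` is a hypothetical helper name. Joining the lines first matters, because a stretch that spans a line break in the document can otherwise be missed.

```python
# Exact-match search for short stretches in a longer sequence (what Ctrl+F does).
# All sequences here are made-up placeholders.

def find_all(sequence: str, stretch: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of `stretch`."""
    positions = []
    start = sequence.find(stretch)
    while start != -1:
        positions.append(start + 1)                 # report positions counted from 1
        start = sequence.find(stretch, start + 1)   # allow overlapping hits
    return positions

# Sequence as it might appear in a document, wrapped over several lines.
wrapped = """ACGTTGCAGG
CGGCGGATCC
GATTACAGGC"""
sequence = "".join(wrapped.split())   # remove line breaks before searching

print(find_all(sequence, "GGCGG"))    # [9, 12]
print(find_all(sequence, "GATTACA"))  # [21]
```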

2.3. BLAST
Let us explore the efficiency of using vast online databases and online search tools to locate and
identify unknown nucleotide sequences. One such search tool is called BLAST (Basic Local
Alignment Search Tool). This program compares a nucleotidyl (DNA, RNA) or amino acyl sequence
(protein) of interest to online databases looking for regions of local similarity and calculates the
statistical significance of matches. One such online database is NCBI’s GenBank, which contains the
sequences of at least three full-length human genomes and, being hosted by the National Library of
Medicine (a branch of the National Institutes of Health), is free to the public.
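
To make “regions of local similarity” concrete, the sketch below scores the best local alignment between two short sequences with the Smith-Waterman dynamic-programming algorithm. This is not how BLAST itself works: BLAST uses a faster heuristic and also estimates statistical significance, which this sketch does not. The scoring values (+2 for a match, -1 for a mismatch, -2 for a gap) and the example sequences are arbitrary assumptions chosen for illustration.

```python
# Smith-Waterman local alignment score (illustrative; BLAST uses a faster heuristic).
# Scoring values and sequences are arbitrary assumptions.

def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the score of the highest-scoring local alignment between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap              # gap in b
            left = score[i][j - 1] + gap            # gap in a
            score[i][j] = max(0, diagonal, up, left)   # 0 lets an alignment restart
            best = max(best, score[i][j])
    return best

query = "CGGCGG"
subject = "TTACGGCGGAT"
print(local_alignment_score(query, subject))   # 12: all six query bases match
print(local_alignment_score("AAAA", "CCCC"))   # 0: no similar region
```

The key design point is the `max(0, ...)` step: because a score never drops below zero, a strongly matching region is found even when the sequences around it are unrelated.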

Finding sequences of known (or putative) function in a database that have similarity to your
sequence of interest may allow you to identify the gene family to which your sequence belongs or the
functional significance of your sequence, if any. You will use a BLAST search to uncover information
about an unknown sequence. Copy and paste the unknown sequence (either the one from the previous page
or the one provided by your section’s instructor) into a new Word document and save it on your
computer’s hard drive. Give it a title in the format 202_Test_Sequence_LastName_FirstName.docx
(example: 202_Test_Sequence_McKinnell_James.docx).

1. Go to NCBI BLAST website at

2. In the resulting page, scroll down to Basic Blast and click on the link nucleotide blast. Copy the
first line of the nucleotide sequence in the Word document and paste it in the “Enter Query
Sequence” box. (The top line, preceded by the “>” sign, is the description of what the sequence

3. Leave the settings as they are, but make sure that Human genomic + transcript is selected in the
Choose Search Set options. Scroll to the bottom of the page and click the BLAST button in the
left-hand corner. Wait for results. Did your sequence find any matches in the human genome?
What could be the reason for this result?
This sequence doesn’t contain enough information to compare to other genes.

4. Now try a longer sequence. Copy the first three lines and paste this sequence into the “Enter
Query Sequence” box and click BLAST again. Did your query match any sequence in the human
genome database?
If so, what match did it locate?
The Homo sapiens fragile X mental retardation 1 (FMR1) gene.

5. Next copy one line that is roughly in the middle of the provided sequence and paste it into the
“Query Sequence” box and run the BLAST search again. Did you get a result this time?

6. Propose a reason for why this one line yielded a different result than the one line at the beginning
of the sequence.
Maybe the first line came from a noncoding region.

7. Click on the first of the matches that your search yielded. This match should be with a sequence
within GenBank. What is the name of this gene? What is the Sequence ID?
The name of this gene is the FMR1 gene and the sequence ID is NM_001185076.1.

8. On what chromosome is this gene located?

It is located on the X chromosome.
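
Steps 3 to 6 compare queries of different lengths. One general reason that longer queries are more informative can be seen with a back-of-the-envelope calculation: in a random sequence with equal base frequencies, one specific stretch of k nucleotides is expected to occur about N / 4^k times among N positions. The sketch below applies this to the 3 billion bp figure quoted earlier. It assumes a random, unbiased, single-stranded genome, which real genomes are not, and it is not the statistic that BLAST reports, so treat the numbers as order-of-magnitude intuition only.

```python
# Expected number of chance occurrences of a k-nucleotide stretch in a random genome.
# Assumes equal base frequencies and independent positions (a simplification).

GENOME_SIZE = 3_000_000_000   # approximate human genome size quoted in this lab

def expected_chance_hits(k: int, genome_size: int = GENOME_SIZE) -> float:
    """Expected exact matches of one specific k-mer: genome_size / 4**k."""
    return genome_size / 4 ** k

for k in (10, 16, 20, 30):
    print(f"{k:>2} nt: {expected_chance_hits(k):.3g} expected chance matches")
```

Under these assumptions a 10-nucleotide stretch is expected to turn up by chance thousands of times, whereas a 30-nucleotide stretch is expected essentially never, so a long exact match is unlikely to be a coincidence.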

3. Conclusion
A fully processed messenger RNA (mRNA) contains nucleotide triplets in a particular sequence that
are read from an initiation codon (AUG) up to one or two termination codons (out of three: UAG,
UAA, UGA). The expression of a eukaryotic gene is controlled by DNA sequences called regulatory
regions. The regulatory regions include the gene’s promoter, which binds RNA polymerase once the
transcription factors have bound the DNA and made that site accessible, and one or more enhancers
that also bind transcription factors and contribute to the control of gene expression.
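
The reading of an mRNA in triplets, from the initiation codon to a termination codon, can be sketched as follows. The mRNA here is a made-up example (it is not the FMR1 transcript), `coding_codons` is a hypothetical helper name, and the sketch simply starts at the first AUG it finds, which is a simplification of how initiation sites are chosen in real transcripts.

```python
# Read an mRNA from the first AUG to the first in-frame termination codon.
# The mRNA below is a made-up example; it is not the FMR1 transcript.

STOP_CODONS = {"UAG", "UAA", "UGA"}

def coding_codons(mrna: str) -> list[str]:
    """Return the codons from the first AUG up to (not including) the first stop."""
    start = mrna.find("AUG")
    if start == -1:
        return []                        # no initiation codon found
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break                        # translation terminates here
        codons.append(codon)
    return codons

mrna = "GGCGCAUGGAGGAGCUGUAAGCC"   # untranslated bases, then AUG ... UAA
print(coding_codons(mrna))         # ['AUG', 'GAG', 'GAG', 'CUG']
```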

Usually, the expression of a gene can be modified if one of its regulatory regions undergoes a
mutation. This mutation may be of immense significance, even if the change involves a single base
substitution, since a transcription factor’s recognition of the site is sequence-specific. Mutations may
involve more substantial changes to the gene’s regulatory regions, such as multiple nucleotide
deletions, or, as in the case of the gene under study in this lab, multiple nucleotide additions which
may eventually result in the silencing of this gene.

The gene you searched codes for the so-called fragile-X mental retardation protein (FMRP). The
promoter of this gene contains a variable number of the trinucleotide repeat CGG. Individuals with
no disease (normal phenotype or wildtype) have promoters containing <60 CGG repeats. Individuals
whose promoters contain 60–200 trinucleotide repeats are said to possess a “premutation” that
renders them susceptible to movement problems (ataxia) later in life. Individuals whose promoters
have >200 CGG trinucleotide repeats are afflicted with fragile-X syndrome and display a wide range
of symptoms that include mental retardation, large testes, etc. In turn, FMRP is involved in the
transport of RNA transcripts to polyribosomes located at sites of protein synthesis. In neurons these
sites include the terminals of axons. Loss of expression of FMRP has far-reaching consequences for an
affected individual.
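
The repeat thresholds above lend themselves to a small worked example. The sketch below counts the longest uninterrupted run of CGG units in a sequence and classifies it with the cut-offs stated in this handout (<60, 60–200, >200). The function names and the synthetic test sequences are assumptions made for illustration; this is not a diagnostic method.

```python
import re

# Classify a sequence by its longest uninterrupted run of CGG repeats,
# using the thresholds stated in this handout (<60, 60-200, >200).
# The test sequences are synthetic.

def longest_cgg_run(dna: str) -> int:
    """Return the number of units in the longest run of consecutive CGG repeats."""
    runs = re.findall(r"(?:CGG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)

def classify(repeats: int) -> str:
    if repeats < 60:
        return "normal"
    if repeats <= 200:
        return "premutation"
    return "full mutation (fragile-X syndrome)"

for n in (30, 90, 250):
    synthetic = "ATGC" + "CGG" * n + "TTAA"
    count = longest_cgg_run(synthetic)
    print(count, classify(count))
# 30 normal
# 90 premutation
# 250 full mutation (fragile-X syndrome)
```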

4. Questionnaire

1. Consider the sequence you searched using the BLAST program.

Would you predict that this gene comes from a healthy person, a person with a premutation,
or a person afflicted with fragile-X syndrome, just by looking at the sequence?

Explain your reasoning.

A healthy person, because the sequence contains fewer than 60 CGG repeats.

2. We used the default database when conducting our BLAST search. This database contains
only human genome sequences. Imagine that the sequence you subjected to the BLAST search
yielded no matches (regardless of the length of the sequence you entered into the Query box).
What would you infer about that sequence?
The sequence is not from the human genome and is probably from another species.

3. What result would you predict if we searched that sequence against all known sequences?

It doesn’t matter if we do this because the sequence that we have is from the FMR1 gene,
therefore that is what will show up.

A database containing all known nucleotide sequences exists and is called “nucleotide
collection (nr/nt).” This database can be found on the BLAST site under “Choose Search Set.”
At “Database” you will see that “Human genomic + transcript” is selected. Select
“Others” instead and you will find that the “nucleotide collection (nr/nt)” database is
automatically selected. Run your search against this vast database.

4. How do your results differ from the original search?

There are many more results here: before I had 7 hits, and with this database I got 100.

5. Describe the capabilities of a BLAST search.

The capabilities of a BLAST search depend on the parameters of the search and the accuracy
of the information provided. The tool can show variations of a specific gene, and how
many variations are found is contingent on the aforementioned factors.

6. What could be the possible limitations of a BLAST search?

The use of a single query is a limitation because more complex search engines allow multiple
queries to optimize the results.

7. BLAST is often nicknamed “the Google of DNA search tools.” Compare a BLAST search to a
Google search and list one possible similarity and one possible difference.
Both BLAST and Google work on the same concept: searching for information related to a
provided set of parameters. One major difference is that BLAST requires very specific
search input to show results, meaning you enter exactly what you want to find,
whereas Google requires only minimal input to produce results.

5. Discussion
You are given a sequence of DNA and told that it is human. You are asked to find out its identity and
whether it has similarity to sequences in other organisms. Please describe the bioinformatics tool, the
database, and the procedure you would use to find such information. Give two possible outcomes of
your search.
I would use NCBI CDART to see whether the human DNA is similar to sequences in other organisms
because, according to the website, “CDART finds protein similarities across significant evolutionary
distances using sensitive domain profiles rather than direct sequence similarity.” This is important
because the proteins the nucleotides code for are more important than the nucleotide sequences
themselves. One possible outcome is that the gene is particular to humans and no other species has
anything similar; another is that the gene has similarities with genes in other organisms.

Once you have completed the exercise, provide your instructor with a hard copy, or submit via
SafeAssign, or send it via e-mail, as s/he indicates.


Alaie A, Teller V, Qiu W-g (2012) A bioinformatics module for use in an introductory biology
laboratory. Am Biol Teach 74:318-332.

Honts JE (2003) Evolving strategies for the incorporation of bioinformatics within the undergraduate
cell biology curriculum. CBE Life Sci Educ 2:233-247.

Maloney M, Parker J, LeBlanc M, Woodard CT, Glackin M, Hanrahan M (2010) Bioinformatics and
the undergraduate curriculum. CBE Life Sci Educ 9:172-174.

National Center for Biotechnology Information (2005) NCBI Help Manual. URL:
Accessed: 7 May 2018