
An Introduction to Bioinformatics

Advanced Genetics

Names:

Instructions. Work in pairs to complete this guided worksheet. Make sure you follow the
instructions exactly, as failure to do so will result in a failed analysis. The goals for today are for
you to:

1) Become familiar with some of the capabilities of the NCBI database and the BLAST (Basic
Local Alignment Search Tool) program.
2) Apply basic knowledge regarding gene structure in eukaryotes to select appropriate
DNA sequences for phylogenetic analysis.
3) Learn how to import and align BLAST sequences using MEGA7.
4) Learn how to create a phylogenetic tree in MEGA 7 using your aligned sequences.

You will need these skills to complete your human mitochondrial DNA lab report, so paying
close attention and taking good notes today will be critical to your success later in the unit.

Part 1: Introduction to the NCBI Database

1) Go to the NCBI database at https://www.ncbi.nlm.nih.gov/ . The current iteration of the
home page has a search window at the top, a list of resource categories on the left, a list
of popular resources on the right and several very useful links to NCBI data and tools in
the middle. These pages are constantly (and annoyingly) under revision, so do not panic
if they do not look like you expect them to look from this handout! If you use a bit of
critical thought you will find what you need.

2) Click on “Analyze” in the middle of the page and answer the following questions:
a. Name the Tool you would use if you wanted to find regions of local similarity
between any two biological sequences in the NCBI database (Hint: this is the tool
we will be using…)

b. Name the tool you would use to identify conserved domains present in a protein
sequence

c. Name the tool you would use to find regions of local similarity between your
sequence and whole genome sequences.
3) Go back to the NCBI home page, type the term ‘cystic fibrosis’ into the top search
window and click on ‘Search’. This will look for that term in all of the NCBI databases.

4) Obviously, this disease has been well studied! If you look under ‘literature’ you will see
that over 61,000 full-text articles have been published related to this disease, alone.
There are also abundant links to information about the protein, the gene that codes for
the protein, its homologs, etc.

5) Click on “OMIM” under “Health”. The Online Mendelian Inheritance in Man database
should be familiar to all of you who took Biology 102, since it is the same database you
used to try to identify your unknown genetic disorder in that class. It is a great resource
for disease-related genes. Click on record #602421 – Cystic Fibrosis Transmembrane
Conductance Regulator; CFTR and answer the following questions:

a. What is the chromosomal location of the CFTR gene?

b. What type of protein is coded for by CFTR?

6) Now let’s get started with our phylogenetic analysis. The first step is to search for CFTR
homologs* using BLAST.

(*Homologs are genes/proteins that have originated from a common ancestral
sequence and still retain significant and recognizable amounts of structural similarity.
There are two types of homologs. Orthologs evolved from a common ancestral gene
during speciation, and share a common function despite their presence in different
species. An example of an ortholog is DNA polymerase in humans and bacteria. While
these genes do have some small DNA and protein sequence differences, their protein
products have the same function: they make copies of cellular DNA during DNA
replication. Paralogs, by contrast, arise via gene duplication within a species, and then
evolve new functions over time. An example of paralogs in humans is hemoglobin (an
iron and oxygen-binding protein found in our red blood cells) and myoglobin (an iron
and oxygen-binding protein found in our muscles). When you do a BLAST search you
may find both types of homologs, so it is important to only choose genes that reflect the
type of relationship you want to study.)

7) Go back to the NCBI home page, type ‘cystic fibrosis’ in the search window and choose
‘nucleotide’ from the left-hand drop-down menu before clicking on ‘search’. You should
see that there are over 200,000 sequences related in some way to cystic fibrosis. We
are NOT going to try to align all of these! Choose the second record – it should be
Human cystic fibrosis mRNA, encoding a presumed transmembrane conductance
regulator (CFTR), mRNA – and click on it. This will take you to the Genbank record for
this sequence. The record contains a ton of information about the gene, including
references, its function, the locations of introns and the exons, and the cDNA sequence.

a. What is cDNA? What does it lack that genomic DNA has?

b. Do you expect introns or exons to be more highly conserved during evolution? Why?

8) In this analysis, I want you to analyze closely related homologs, so we will be using the
human cDNA sequence as our ‘query’. Go to the top of the Genbank record and look for
the sequence accession number. Write it down: M28668.1. You will use that number in
Part 2.

Part 2. Using MEGA7 to align homologous sequences.

Now that you have identified a sequence of interest using the NCBI database, it is time to use
MEGA7 to find homologs and to generate a phylogenetic tree.

1) Open MEGA 7. When it has finished loading you will see its Main window. In the upper
left-hand corner, choose Align and then Do BLAST Search.

2) Make sure you have selected the blastn (nucleotide BLAST) tab from the BLAST window
(this is the default in MEGA 7 and should already be selected). Below that tab you will
see a large box. In a BLAST analysis you can type accession numbers or entire sequences
into the search box. You can also upload sequence files if you have them in the correct
format. Since our sequence is over 6,000 bases long, typing in the accession number is
much simpler. Do that now. It should look like this (but with a different accession
number):
3) Under ‘Choose Search Set’, make sure the ‘Others’ button is clicked. We do not want to
limit ourselves simply to human and mouse genes.

4) Under Program Selection, choose ‘Somewhat similar sequences (blastn)’.

5) Click on ‘BLAST’. Do NOT check the ‘Show results in a new window’ box.

6) A results window will now appear, but will probably not show your matches, yet. It will
take the BLAST tool a little while to compare your 6000+ base sequence to all of the
millions of others in the NCBI database. When your matches are ready, they will be
shown as a graphic summary (long, red lines mean high similarity) followed by a
‘Descriptions’ box. That box will show a table that has the following information:

· Description. The name of the aligned sequence with a link to the alignment in the
box at the bottom of the web page. Note that the first record should be the actual
human sequence for which you entered the accession number. If this is not the case
recheck your accession number and try again!
· Max score and Total score. Both are statistical representations of the strength of
the match. High is good.
· Query cover. This tells you what percentage of your human sequence is represented in
the match. For the first sequence, you can see that 100% of the bases in your input
human sequence were used in the match. This is expected because that top match
actually is the gene you entered. If you look a few genes down, only 97% of your
gene was used. In this analysis, we are not looking for perfect matches, but rather a
variety of closely related homologs, so we will use a cutoff of 75% Query cover.
· E value. This tells you how many matches at least this strong you would expect to find
purely by chance in a database of this size (rather than because the sequences actually
are similar). This should be very low if you want a significant match. We will not be
using anything higher than 0 in this analysis, but if we were looking at genes with less
similarity we could.
· Ident. This is the percent identity between your sequence (Query) and the matched
sequence (Subject, or Sbjct in the BLAST alignments). Note that high is good, but not
always best here, as it is only calculated using the bases in the alignment. So, an
alignment that has 100% identity but only has a Query cover of 5% of your gene is
probably not as strong as an alignment with 95% identity and 95% Query cover. We
will not use a cutoff for Ident.
· Accession. A link to the Genbank record for the sequence.
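
To make the Query cover and Ident columns concrete, here is a minimal Python sketch of how the two percentages relate to each other. The Hit container and both example hits are made up for illustration (they are not real BLAST output), and the calculation assumes a single aligned region per hit, whereas BLAST itself can combine several aligned regions when it reports Query cover.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One row of a results table (hypothetical container, illustrative values)."""
    description: str
    query_start: int   # first query base in the alignment (1-based, inclusive)
    query_end: int     # last query base in the alignment (1-based, inclusive)
    identical: int     # identical positions within the alignment
    aligned: int       # alignment length, including gap columns
    e_value: float


def query_cover(hit: Hit, query_length: int) -> float:
    """Percent of the query sequence that falls inside the alignment."""
    return 100.0 * (hit.query_end - hit.query_start + 1) / query_length


def percent_identity(hit: Hit) -> float:
    """Percent of aligned positions that are identical."""
    return 100.0 * hit.identical / hit.aligned


def passes_cutoffs(hit: Hit, query_length: int, min_cover: float = 75.0) -> bool:
    """Apply this worksheet's cutoffs: Query cover of at least 75% and an E value of 0."""
    return query_cover(hit, query_length) >= min_cover and hit.e_value == 0.0


QUERY_LENGTH = 6129  # length of the human CFTR record used in this worksheet

# Two made-up hits: a short perfect match and a long, slightly imperfect one.
short_perfect = Hit("made-up hit A", 1, 306, 306, 306, 1e-150)
long_close = Hit("made-up hit B", 1, 5823, 5532, 5823, 0.0)

for hit in (short_perfect, long_close):
    print(hit.description,
          round(query_cover(hit, QUERY_LENGTH)),
          round(percent_identity(hit)),
          passes_cutoffs(hit, QUERY_LENGTH))
```

The short hit has the higher identity but covers only a sliver of the gene, so it fails the 75% cutoff, while the long hit passes.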

7) Next you must choose and download sequences for your alignment. Let’s start with
your human sequence, since that will be a part of everyone’s analyses. Click on the
sequence name in the Description column. This should take you to a graphical view of
the alignment of your sequence (Query) with the sequence from the database (Sbjct).
The first part of the alignment will look like the sequence shown below. You can see a
few things in this record.
· The gene name itself says that it is an mRNA (more accurately a cDNA) sequence,
and that the total length of the sequence in the Genbank record is 6129 bases. We
will limit this analysis to other cDNA sequences, as aligning cDNA and genomic
sequences (that still contain the introns) requires a level of analysis a bit too
advanced for the first time through this!
· The ‘Strand’ match is Plus/Plus. That means the coding strand of your sequence
matched the coding strand of the database sequence. If you had seen Plus/Minus,
that would mean the reverse complement of your sequence matched the database
sequence, and you would need to reverse-complement the database sequence before
adding it to the alignment. Again, in today’s assignment we will not be doing that (I
think all of the matching sequences are cDNA), although your assigned reading goes
through that process should you ever need it.
· Identities are shown as lines between the top (query) and bottom (subject)
sequences. Note this first alignment is 100% identical so there are vertical lines
between every base.
· A link to the Genbank record.
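
If you ever do run into a Plus/Minus match, the reverse-and-complement operation is simple enough to sketch in a few lines of Python. This is only an illustration of the idea (the function name and the short test sequence are made up); MEGA has its own tools for this.

```python
# Translation table that swaps each base for its complement.
# Any other character (for example N) is left unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(COMPLEMENT)[::-1]


example = "ATGCAGAGG"  # made-up sequence, for illustration only
print(reverse_complement(example))
```

Applying the function twice returns the original sequence, which is a handy way to check that nothing was lost.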
8) Once you have made sure that your sequence is cDNA and has a Plus/Plus alignment,
click on the Genbank link and scroll down the page until you see Features on the left-
hand side. Find “CDS” (short for coding sequence). This tells you where the start and
stop codons are in your gene sequence. Remember that mRNA has untranslated
sequences at both the 5’ and the 3’ ends, and we do not want to analyze them – just the
actual sequence that codes for the protein. In this case our start codon begins at base
133 and the coding region ends at base 4575.

On the right-hand side of the Genbank window, you will see a box that says, “Change
Region Shown”. Click on the arrow to open the box, then click on “Selected Region”,
write in the first and last base of the coding region (in this case 133-4575), and then click
on ‘Update View’. Finally, click the “Add to Alignment” (red plus sign) button at the top
of the page to add the coding sequence to the alignment.
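
The 133-4575 region can also be checked with a little arithmetic. The sketch below assumes Genbank-style coordinates (1-based, with both ends included) and shows how they map onto a Python slice. The function names and the toy sequence are made up; the last helper only builds a request address for NCBI's E-utilities efetch service using its seq_start and seq_stop parameters, and does not contact NCBI.

```python
from urllib.parse import urlencode


def extract_region(sequence: str, first: int, last: int) -> str:
    """Return bases first..last, counting from 1 and including both ends."""
    if not 1 <= first <= last <= len(sequence):
        raise ValueError("region falls outside the sequence")
    return sequence[first - 1:last]


def region_length(first: int, last: int) -> int:
    """Number of bases in an inclusive 1-based region."""
    return last - first + 1


def efetch_fasta_url(accession: str, first: int, last: int) -> str:
    """Build (but do not send) an efetch request for one region in FASTA format."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": first,
        "seq_stop": last,
    }
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + urlencode(params)


toy = "GGATGAAATAGCC"  # made-up sequence with a tiny coding region at bases 3-11
print(extract_region(toy, 3, 11))

cds_bases = region_length(133, 4575)
print(cds_bases, cds_bases % 3, cds_bases // 3)
print(efetch_fasta_url("M28668.1", 133, 4575))
```

The human coding region works out to 4443 bases, which divides evenly into 1481 codons, as a complete coding sequence should.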

7) MEGA 7’s Input Sequence Label box will open up. For the first word, choose H. sapiens
(genus, species); for the second word, enter or choose CFTR (gene name). Use the same
convention for all of the other sequences you choose. Click OK.

8) The human CFTR coding sequence will now show up in a new Alignment Explorer
window.
9) Go back to the MEGA web browser window and click on the left arrow at the upper right
of the window to get back to the tab for the original alignment.

10) Click on the NCBI BLAST tab to reopen the alignment window and then repeat steps 7-9
for 5 more genes. Make sure they have at least a 75% Query cover and represent 5
different species. Also make sure to enter the correct first and last bases in the coding
sequence for each species, based on the Genbank record. Are these sequences
orthologs or paralogs? Explain in the space below.
11) You will now have 6 sequences in your Alignment Explorer Window. Make sure they all
start with ATG (can you tell me why?), then close the MEGA web browser window.
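
A check like this is easy to automate when you have more than a handful of sequences. The sketch below is a hypothetical helper, not a MEGA feature, and it assumes each sequence is a complete coding region that includes its stop codon. The two test sequences are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(seq: str) -> list:
    """Return a list of problems found in a coding sequence (empty if none)."""
    seq = seq.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if seq[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems


good = "ATGAAATAG"    # made-up: ATG, one codon, stop
bad = "CCATGAAATAG"   # made-up: two extra bases left over from the 5' end

print(check_coding_sequence(good))
print(check_coding_sequence(bad))
```

A sequence that fails the first two checks usually means the wrong first or last base was typed into the “Selected Region” box.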

12) It is a good idea to save your data at this point. In the Alignment Explorer, under Data,
choose Save Session and save your data as CFTR_unaligned, since you have not aligned
the data yet. The file will save in .masx format.

13) It is finally alignment time! MEGA provides two alignment methods: ClustalW and
MUSCLE. We will use the latter, as it is generally more reliable. In the Alignment
Explorer, under Alignment, choose Align by MUSCLE (Codons) since our sequences are
coding sequences. This will tell the program to avoid making alignments that would
insert stop codons where they don’t belong.
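
One common way to keep codons intact during alignment is to align the translated protein sequences first and then copy each gap back into the DNA as a block of three dashes. The Python sketch below illustrates only that last step, under the assumption that the protein alignment has already been made. It is not MEGA's actual code, and the function name and example sequences are made up.

```python
def thread_codons(aligned_protein: str, coding_seq: str) -> str:
    """Rebuild one row of a DNA alignment from the matching aligned protein row.

    Every amino acid is replaced by its codon and every protein gap ('-')
    becomes three DNA gaps, so gaps never split a codon.
    """
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq), 3)]
    residues = aligned_protein.replace("-", "")
    if len(coding_seq) % 3 != 0 or len(residues) != len(codons):
        raise ValueError("protein and coding sequence do not match in length")
    next_codon = iter(codons)
    return "".join("---" if symbol == "-" else next(next_codon)
                   for symbol in aligned_protein)


# Made-up example: the first sequence lacks one amino acid found in the second.
row_a = thread_codons("MK-L", "ATGAAACTG")
row_b = thread_codons("MKRL", "ATGAAACGTCTG")
print(row_a)
print(row_b)
```

Both rows come out the same length, and the gap in the first row sits exactly where the missing codon would be.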

14) A settings window will now open up. For now, just accept the default settings by clicking
on “Compute”. Click on ‘yes’ when the program asks if you want to remove gaps prior
to the alignment. Depending on the number and length of your sequences, this can take
from seconds to hours. In our case, it will only take seconds.

15) Now it is time to save your alignment. In the Alignment Explorer window, under ‘Data’
choose ‘Export Alignment’. Choose MEGA format and name your file CFTR_aligned.
Name your data CFTR and click ‘yes’ to confirm that you have a protein-coding
nucleotide sequence. Now you can close your Alignment Explorer window.

16) You should be back to your main MEGA window. To generate a phylogenetic tree,
under Phylogeny choose Construct/Test Maximum Likelihood Tree. There are several
different algorithms that can be used to build a tree at this point, and each has its
pros and cons, so we will just pick this one because it is at the top. Choose your
CFTR_aligned.meg file and click on ‘open’. Again, simply use the default values for now.

17) A new window will open that not only shows your phylogenetic tree, but also provides
you with a complete figure legend! Under “Image” save your tree as an enhanced
metafile (.emf), and also save the figure legend. Insert the tree in the space below, and
then copy and paste the figure legend below it. The next page shows what mine looked
like.

Figure 1. Molecular Phylogenetic analysis by Maximum Likelihood method
The evolutionary history was inferred by using the Maximum Likelihood method based on the
Tamura-Nei model [1]. The tree with the highest log likelihood (-9656.12) is shown. Initial
tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and
BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite
Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value.
The tree is drawn to scale, with branch lengths measured in the number of substitutions per
site. The analysis involved 6 nucleotide sequences. Codon positions included were
1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated. There
were a total of 4038 positions in the final dataset. Evolutionary analyses were conducted in
MEGA7 [2].

1. Tamura K. and Nei M. (1993). Estimation of the number of nucleotide substitutions in the
control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and
Evolution 10:512-526.
2. Kumar S., Stecher G., and Tamura K. (2016). MEGA7: Molecular Evolutionary Genetics
Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution 33:1870-1874.
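
Two phrases in the figure legend are worth unpacking: “All positions containing gaps and missing data were eliminated” and “substitutions per site”. The Python sketch below illustrates the first idea directly and the second only in its simplest form, as the proportion of sites that differ between two sequences. That is a much cruder measure than the Tamura-Nei Maximum Likelihood estimate MEGA reports, and the three-sequence alignment is made up for illustration.

```python
def complete_deletion(alignment: dict) -> dict:
    """Drop every column in which any sequence has a gap or a missing base."""
    names = list(alignment)
    rows = [alignment[name].upper() for name in names]
    bases = set("ACGT")
    keep = [i for i in range(len(rows[0])) if all(row[i] in bases for row in rows)]
    return {name: "".join(row[i] for i in keep) for name, row in zip(names, rows)}


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of sites at which two aligned, gap-free sequences differ."""
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


# Made-up toy alignment; real CFTR coding sequences are thousands of bases long.
toy_alignment = {
    "H_sapiens_CFTR": "ATGCAGAGGTCG",
    "M_musculus_CFTR": "ATGCAGAAGTCG",
    "G_gallus_CFTR": "ATG---AAGTCC",
}

trimmed = complete_deletion(toy_alignment)
print(len(trimmed["H_sapiens_CFTR"]), "positions kept")
print(p_distance(trimmed["H_sapiens_CFTR"], trimmed["M_musculus_CFTR"]))
print(p_distance(trimmed["H_sapiens_CFTR"], trimmed["G_gallus_CFTR"]))
```

Removing the gapped columns is why the legend reports fewer positions in the final dataset than the full length of the human coding sequence.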
