
Bioinformatics 1: Lecture 3

Similarity and homology
Pairwise sequence alignment
Dot matrices
ORFs
The Substitution matrix
Gapless alignment

What does alignment mean?


Alignment of two or more character sequences implies the one-to-one association of a subset of those characters.

This association has biological meaning.


1. common ancestry
2. structural equivalence

Homology versus similarity


Two sequences are homologs of each other if they have a common ancestor.
There are two kinds of homologs:
orthologs = homologs by speciation
paralogs = homologs by duplication

Two sequences are similar to each other to the degree that the alignment score is good.

Similarity is a metric. Homology is an inference.
i.e. "We infer from their high sequence similarity that they are homologs."

Degrees of similarity have different meaning


Similarity scale, running from random sequence similarity to identical:
I am pretty sure these sequences are unrelated.
I don't know what to think.
These sequences may have a common ancestor.
These sequences may have a similar function.
These sequences very likely have the same function.

"Similarity" versus "distance" metrics


distance==0 means identical, similarity==0 means unrelated

Inversely related. Both are metrics.
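
A minimal sketch of the inverse relationship, assuming two sequences of equal length that are already aligned position by position. Hamming distance and fraction identity are illustrative stand-ins chosen here, not the metrics used in the lecture (which scores similarity with an alignment score), and the function names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of aligned positions that differ (0 means identical)."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def fraction_identity(a: str, b: str) -> float:
    """Fraction of aligned positions that match (1.0 means identical)."""
    return 1.0 - hamming_distance(a, b) / len(a)


if __name__ == "__main__":
    # As the distance goes up, the similarity goes down.
    for s, t in [("GACGT", "GACGT"), ("GACGT", "GACTT"), ("GACGT", "CTGAC")]:
        print(s, t, hamming_distance(s, t), fraction_identity(s, t))
```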

Seqlab function "Compare"


With thousands of bases, it is impossible to plot all dots in the matrix. Instead we look for stretches of sequence with few mismatches. If the number of mismatches is less than the cutoff, plot a dot or line.

Example: AAGACGTTTA versus GACGTACT, plotting all diagonals with at least 4 out of 5 matches.
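
The windowed comparison can be sketched as follows. This is an illustration of the idea only, not SeqLab's Compare implementation; the parameter names `window` and `stringency` borrow the terms used in the Compare exercise, and the function name is hypothetical.

```python
def dot_matrix(seq1: str, seq2: str, window: int = 5, stringency: int = 4):
    """Return the 0-based (i, j) start of every diagonal stretch of length
    `window` that has at least `stringency` matching characters."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            # Count matches along the diagonal starting at box (i, j).
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # The two sequences from the slide, with at least 4 out of 5 matches.
    for i, j in dot_matrix("AAGACGTTTA", "GACGTACT", window=5, stringency=4):
        print(f"plot a line starting at seq1[{i}], seq2[{j}]")
```

Each hit would be drawn as a short diagonal line of length `window`; overlapping hits on the same diagonal merge into one longer line.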

In class exercise:

Self Dot matrix exercise


Log onto modlab machine.
xhost +bioinf45.bio.rpi.edu
telnet bioinf45.bio.rpi.edu
(...login..password..)
echo $DISPLAY
setenv DISPLAY modlabn (...etc if necessary)
seqlab &
("seqlab" = start seqlab, "&" = background it)

In class exercise:

Self Dot matrix for a centromere


In NCBI (Netscape):
Search the nucleotide database.
Use preview/index to find telomere
Select accession and type AY738108 (now there is 1 result)
Display it. Send to text.
Copy-and-paste into an editor (vi) in the Unix window (logged into bioinf45).
Call the file centromere.gb (gb is for GenBank format)

In class exercise:

Self Dot matrix for a centromere


In SeqLab:
Import the file you created, centromere.gb
Select the sequence.
Run Compare (Functions-->pairwise..->Compare) to make a self dot matrix.
In Options... set window to 31 and stringency to 29.
Use Crosshairs to locate the line near 300, indicating a long repeat.
Where exactly are the two copies of the long repeat?

Finding genes
The DNA sequence does not tell us where the genes are. [Genes are segments of DNA that are transcribed into RNA.] However, it does tell us where the open reading frames (ORFs) are. We can find ORFs by looking for regions that have no STOP codons. If we think it is a protein gene, then we can also find the translation start site ATG.
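
A minimal sketch of the stop-codon scan described above, assuming the standard stop codons (TAA, TAG, TGA) and only the three forward reading frames. The function name and the `min_codons` cutoff are hypothetical, and this is not how SeqLab's tools are implemented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 20):
    """Yield (frame, start, end, first_atg) for each stop-free stretch.

    Positions are 0-based. `end` is exclusive and excludes the stop codon.
    `first_atg` is the position of the first in-frame ATG, or None.
    """
    dna = dna.upper()
    for frame in range(3):
        # Index just past the last complete codon in this frame.
        last = frame + max(0, (len(dna) - frame) // 3) * 3
        start, first_atg = frame, None
        for pos in range(frame, last, 3):
            codon = dna[pos:pos + 3]
            if codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield frame, start, pos, first_atg
                start, first_atg = pos + 3, None
            elif codon == "ATG" and first_atg is None:
                first_atg = pos
        # A stretch that runs off the end of the sequence is still open.
        if (last - start) // 3 >= min_codons:
            yield frame, start, last, first_atg


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"  # made-up sequence
    for orf in find_orfs(toy, min_codons=3):
        print(orf)
```

The reverse strand would need the same scan on the reverse complement, which is left out here to keep the sketch short.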

In class exercise:

Finding genes using Seqlab


Upload the sequence: sars.gb
Translate it: Select the sequence. Edit-->Translate. All three; select align translation, then OK.
Find all start sites "ATG" in the DNA sequence: Select DNA sequence. Edit-->Find-->ATG
What frame are they in?

In class exercise:

Finding genes using Seqlab


Find all STOP codons ("*") in the translated sequence: Select all three translations. Edit-->Find-->*
Which STOP goes with which Start?
Find the start and end positions of the first ORF (> 20 amino acids) in SARS.

In class exercise:

Finding ORFs using Frames


The easier way:
Select the DNA sequence.
Run Frames. (Functions-->Gene finding..-->Frames)
Compare the ORFs found by Frames with the locations of ORFs when displayed using Graphics Features.

Alignment matrix
To prepare an alignment, we first consider the score for aligning any one character of the first sequence to one character of the other sequence (one association, one match)
  G A C G T A C T
A 0 1 0 0 0 1 0 0
A 0 1 0 0 0 1 0 0
G 1 0 0 1 0 0 0 0
A 0 1 0 0 0 1 0 0
C 0 0 1 0 0 0 1 0
G 1 0 0 1 0 0 0 0
T 0 0 0 0 1 0 0 1
T 0 0 0 0 1 0 0 1
T 0 0 0 0 1 0 0 1
A 0 1 0 0 0 1 0 0

Not all matches are equal


So far, we have considered an aligned pair to be either a Match or a Mismatch. Is there something in between? Yes. Some mutations are more "conservative" than others. (Consider the wobble base...) If we know how conservative a mutation is, then we have a measure of similarity that is no longer "black & white."

Conservative mutations
DNA: A change in the 3rd base of a codon (and, less often, in the first base) sometimes conserves the amino acid.
Protein: A change between amino acids that are in the same chemical class conserves the chemical environment. For example: Lys to Arg is conservative because both are positively charged.
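
A short sketch to check the DNA claim, assuming the standard genetic code written in the usual T, C, A, G table order. It counts, for each codon position, the fraction of single-base changes that leave the amino acid unchanged. The function name is hypothetical; stop codons are skipped as starting points, and a change to a stop is counted as non-conservative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at `position` (0, 1 or 2) that
    keep the encoded amino acid the same."""
    same = total = 0
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # start only from codons that encode an amino acid
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODE[mutant] == aa
    return same / total


if __name__ == "__main__":
    for pos in range(3):
        print(f"codon position {pos + 1}: {synonymous_fraction(pos):.2f}")
```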

Conservative amino acid changes


[Figure: sidechain structures for each conservative pair]

Lys <--> Arg
Ile <--> Leu
Ser <--> Thr
Asp <--> Glu
Asn <--> Gln

If the chemistry of the sidechain is conserved, then the mutation is less likely to change structure/function.

Did the genetic code evolve?


Mutations in the first position usually conserve the chemical nature of the sidechain.

[Figure legend: non-polar, polar, polar/charged]

Amino acid substitution matrices


Two 20x20 substitution matrices are used: BLOSUM & PAM.

BLOSUM62

   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  4
C  0  9
D -2 -3  6
E -1 -4  2  5
F -2 -2 -3 -3  6
G  0 -3 -1 -2 -3  6
H -2 -3 -1  0 -1 -2  8
I -1 -1 -3 -3  0 -4 -3  4
K -1 -3 -1  1 -3 -2 -1 -3  5
L -1 -1 -4 -3  0 -4 -3  2 -2  4
M -1 -1 -3 -2  0 -3 -2  1 -1  2  5
N -2 -3  1  0 -3  0  1 -3  0 -3 -2  6
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7
Q -1 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5
R -1 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5
S  1 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4
T  0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5
V  0 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4
W -3 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11
Y -2 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2  7

Each number is the score for aligning a single pair of amino acids.

What is the score for this alignment?
ACEPGAA
ASDDGTV
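
One way to check a hand-computed score for the pair ACEPGAA / ASDDGTV is sketched here. It assumes the score of a gapless alignment is simply the sum of the per-column BLOSUM62 scores. To keep it short, only the seven pairs this alignment needs are copied from the BLOSUM62 table; the names `PAIRS`, `pair_score` and `alignment_score` are hypothetical.

```python
# BLOSUM62 scores for just the pairs needed, keyed in alphabetical order.
PAIRS = {
    ("A", "A"): 4,
    ("C", "S"): -1,
    ("D", "E"): 2,
    ("D", "P"): -1,
    ("G", "G"): 6,
    ("A", "T"): 0,
    ("A", "V"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a pair regardless of order (the matrix is symmetric)."""
    return PAIRS[tuple(sorted((a, b)))]


def alignment_score(top: str, bottom: str) -> int:
    """Sum the substitution scores column by column."""
    if len(top) != len(bottom):
        raise ValueError("a gapless alignment needs equal-length rows")
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


if __name__ == "__main__":
    print(alignment_score("ACEPGAA", "ASDDGTV"))  # 4 - 1 + 2 - 1 + 6 + 0 + 0 = 10
```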

A teacher's dilemma
To understand...               You first need to know...
Multiple sequence alignment    Substitution matrices
Substitution matrices          Phylogenetic trees
Phylogenetic trees             Multiple sequence alignment

A more sensitive dot matrix: the alignment matrix


Fill in each box with the substitution score.

Each diagonal represents a different alignment, whose score is the sum of the boxes.

[Figure: alignment matrix with sequence 1 along one axis and sequence 2 along the other]

In class exercise:

Gapless Alignment
Fill in each box using BLOSUM62. Score all diagonals.

Each diagonal represents a different alignment, whose score is the sum of the boxes.

[Figure: empty alignment matrix. Axis letters as extracted from the slide: Y K K G E R / G D I / K R]

What is the best gapless alignment?
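
A sketch of scoring every diagonal, assuming the alignment matrix has already been filled in. The example matrix uses BLOSUM62 scores for the made-up pair GKR (rows) versus KRG (columns), which is not the pair from the exercise; the function names are hypothetical.

```python
def diagonal_scores(matrix):
    """Map each diagonal offset d = j - i to the sum of its boxes."""
    scores = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            scores[j - i] = scores.get(j - i, 0) + value
    return scores


def best_gapless(matrix):
    """Return (offset, score) of the highest-scoring diagonal."""
    scores = diagonal_scores(matrix)
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Rows: G, K, R.  Columns: K, R, G.  Values copied from BLOSUM62.
    matrix = [
        [-2, -2, 6],
        [5, 2, -2],
        [2, 5, -2],
    ]
    print(diagonal_scores(matrix))
    print(best_gapless(matrix))  # offset -1 pairs K with K and R with R
```

An offset of d means character i of the row sequence is paired with character i + d of the column sequence; characters that fall off either end are left unaligned.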

still time?

Pseudo code for alignment matrix


Starting with the BLOSUM array, write pseudocode to fill in the alignment matrix.

(1) Convert both sequences to numbers. (A=1, C=2, D=3, etc. Use amino acids in the same order as the BLOSUM matrix)

(2) Loop over boxes (nested loops). Put the appropriate BLOSUM number in the box.

still time?

Pseudo code for alignment matrix


read blosum[1..20][1..20]
read firstseq[1..N]
read secondseq[1..M]
for (i from 1 to N) do
  for (j from 1 to M) do
    ---- fill in the blank ----
  enddo
enddo
write alignmentmatrix[1..N][1..M]
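
One possible way to fill in the blank, sketched in Python rather than pseudocode. It assumes the BLOSUM62 values are stored as a lower triangle in the order A C D E F G H I K L M N P Q R S T V W Y, and it uses 0-based indices (A=0, C=1, ...) where the pseudocode uses 1-based ones. The function names and the example sequences GKR and KRG are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

LOWER_TRIANGLE = """
4
0 9
-2 -3 6
-1 -4 2 5
-2 -2 -3 -3 6
0 -3 -1 -2 -3 6
-2 -3 -1 0 -1 -2 8
-1 -1 -3 -3 0 -4 -3 4
-1 -3 -1 1 -3 -2 -1 -3 5
-1 -1 -4 -3 0 -4 -3 2 -2 4
-1 -1 -3 -2 0 -3 -2 1 -1 2 5
-2 -3 1 0 -3 0 1 -3 0 -3 -2 6
-1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7
-1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5
-1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5
1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4
0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5
0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4
-3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11
-2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7
"""


def load_blosum():
    """Expand the lower triangle into a full symmetric 20x20 array."""
    rows = [[int(x) for x in line.split()]
            for line in LOWER_TRIANGLE.strip().splitlines()]
    n = len(rows)
    full = [[0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            full[i][j] = full[j][i] = value
    return full


def fill_alignment_matrix(firstseq: str, secondseq: str, blosum):
    """Step (1): convert letters to numbers. Step (2): nested loops."""
    # str.index raises ValueError for any letter outside the 20 amino acids.
    first = [AMINO_ACIDS.index(c) for c in firstseq]
    second = [AMINO_ACIDS.index(c) for c in secondseq]
    matrix = [[0] * len(second) for _ in first]
    for i, a in enumerate(first):
        for j, b in enumerate(second):
            matrix[i][j] = blosum[a][b]  # the "blank" in the pseudocode
    return matrix


if __name__ == "__main__":
    for row in fill_alignment_matrix("GKR", "KRG", load_blosum()):
        print(row)
```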
