Bioinformatics 1: Lecture 3
Similarity and homology
Pairwise sequence alignment
Dot matrices
ORFs
The substitution matrix
Gapless alignment
What does alignment mean?

Alignment of two or more character sequences implies the one-to-one association of a subset of those characters.
This association has biological meaning.

1. common ancestry
2. structural equivalence
Homology versus similarity

Two sequences are homologs of each other if they have a common ancestor.
There are two kinds of homologs:
orthologs = homologs by speciation
paralogs = homologs by duplication
Two sequences are similar to each other to the degree that the alignment score is good.
Similarity is a metric. Homology is an inference.
i.e. "We infer from their high sequence similarity that they are homologs."
Degrees of similarity have different meaning

(Similarity scale, running from random sequence to identical:)
I am pretty sure these sequences are unrelated.
I don't know what to think.
These sequences may have a common ancestor.
These sequences may have a similar function.
These sequences very likely have the same function.
"Similarity" versus "distance" metrics

distance==0 means identical, similarity==0 means unrelated
Inversely related. Both are metrics.
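As a minimal sketch of the inverse relationship, assume two equal-length sequences that are already aligned position by position. Counting matches gives a simple similarity and counting mismatches gives a simple distance. The function names are illustrative only, and a real similarity score would normally come from a substitution matrix rather than a plain match count.

```python
def match_similarity(a: str, b: str) -> int:
    """Similarity: number of aligned positions that match."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b))


def hamming_distance(a: str, b: str) -> int:
    """Distance: number of aligned positions that differ."""
    return len(a) - match_similarity(a, b)


seq = "GACGTACT"
print(hamming_distance(seq, seq))        # identical sequences: distance 0
print(match_similarity("AAAA", "CCCC"))  # nothing in common: similarity 0
```

For equal-length sequences the two numbers always add up to the sequence length, so one goes up exactly when the other goes down.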
Seqlab function "Compare"

With thousands of bases, it is impossible to plot all dots in the matrix. Instead we look for stretches of sequence with few mismatches. If the number of mismatches is less than the cutoff, plot a dot or line.
Example: AAGACGTTTA versus GACGTACT.
The plot shows all diagonals with at least 4 out of 5 matches.
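The windowed filter can be sketched in a few lines. This is not SeqLab's Compare implementation, only an illustration of the idea under the assumption that a "dot" is reported at the start of every window that meets the cutoff. The names `windowed_dots`, `window`, and `min_matches` are hypothetical.

```python
def windowed_dots(seq1, seq2, window=5, min_matches=4):
    """Return (i, j) start positions (0-based) of every diagonal window
    of length `window` that has at least `min_matches` identical pairs."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


hits = windowed_dots("AAGACGTTTA", "GACGTACT")
print(hits)  # [(2, 0), (3, 1)]
```

Both hits lie on the same diagonal (i - j = 2), which is where GACGT in the first sequence lines up with GACGT in the second.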
In class exercise:
Self Dot matrix exercise

Log onto modlab machine.
xhost +bioinf45.bio.rpi.edu
telnet bioinf45.bio.rpi.edu
(...login..password..)
echo $DISPLAY
setenv DISPLAY modlabn (...etc if necessary)
seqlab &
The "&" backgrounds it. This starts SeqLab.
In class exercise:
Self Dot matrix for a centromere

In NCBI (Netscape):
Search the nucleotide database.
Use preview/index to find telomere.
Select accession and type AY738108 (now there is 1 result).
Display it. Send to text.
Copy-and-paste into an editor (vi) in the Unix window (logged into bioinf45).
Call the file centromere.gb (gb is for GenBank format).
In class exercise:
Self Dot matrix for a centromere

In SeqLab:
Import the file you created, centromere.gb.
Select the sequence.
Run Compare (Functions-->pairwise..->Compare) to make a self dot matrix.
In Options... set window to 31 and stringency to 29.
Use Crosshairs to locate the line near 300, indicating a long repeat.
Where exactly are the two copies of the long repeat?
Finding genes
The DNA sequence does not tell us where the genes are. [Genes are segments of DNA that are transcribed into RNA.] However, it does tell us where the open reading frames (ORFs) are. We can find ORFs by looking for regions that have no STOP codons. If we think it is a protein gene, then we can also find the translation start site ATG.
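A minimal sketch of this idea: scan each of the three forward reading frames, open an ORF at the first ATG, and close it at the next STOP codon. The function name, the `min_codons` parameter, and the choice to ignore the reverse strand are simplifying assumptions for illustration, not how any particular gene finder works.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=20):
    """Return (frame, start, end) for each ATG...STOP stretch on the
    forward strand. Positions are 0-based; `end` is exclusive and
    includes the STOP codon. Only ORFs with at least `min_codons`
    codons before the STOP are kept."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon in this stretch
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None                     # look for the next ORF
    return orfs


toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
print(find_orfs(toy, min_codons=2))  # [(2, 2, 17)]
```

The toy sequence has a single short ORF in the third frame. For a real genome, `min_codons` would be raised to filter out the many short ORFs that occur by chance.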
In class exercise:
Finding genes using Seqlab

Upload the sequence: sars.gb
Translate it: Select the sequence. Edit-->Translate. Choose "All three", select "align translation", then OK.
Find all start sites "ATG" in the DNA sequence: Select the DNA sequence. Edit-->Find-->ATG
What frame are they in?
In class exercise:
Finding genes using Seqlab

Find all STOP codons ("*") in the translated sequence: Select all three translations. Edit-->Find-->*
Which STOP goes with which Start?
Find the start and end positions of the first ORF (> 20 amino acids) in SARS.
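The same steps can be sketched outside SeqLab: translate the three forward frames, list the positions of "ATG" in the DNA, and list the positions of "*" in each translation. The sketch assumes the standard genetic code, written in the compact form where bases are ordered T, C, A, G. The helper names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frames(dna):
    """Translate the three forward reading frames; unknown codons become X."""
    dna = dna.upper()
    proteins = []
    for frame in range(3):
        codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
        proteins.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return proteins


def find_all(text, target):
    """Return every 0-based position where `target` occurs in `text`."""
    return [i for i in range(len(text)) if text.startswith(target, i)]


dna = "ATGGCCTAA"
print(translate_frames(dna))                       # ['MA*', 'WP', 'GL']
print([(p, p % 3) for p in find_all(dna, "ATG")])  # (position, frame) of each ATG
print([find_all(p, "*") for p in translate_frames(dna)])
```

A start at DNA position p belongs to frame p % 3, so a STOP "goes with" a start only if it appears in the translation of that same frame, downstream of the start.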
In class exercise:
Finding ORFs using Frames

The easier way:
Select the DNA sequence.
Run Frames. (Functions-->Gene finding..-->Frames)
Compare the ORFs found by Frames with the locations of ORFs when displayed using Graphics Features.
Alignment matrix
To prepare an alignment, we first consider the score for aligning any one character of the first sequence to one character of the other sequence (one association, one match).

Rows: AAGACGTTTA. Columns: GACGTACT. 1 = match, 0 = mismatch.

     G  A  C  G  T  A  C  T
A    0  1  0  0  0  1  0  0
A    0  1  0  0  0  1  0  0
G    1  0  0  1  0  0  0  0
A    0  1  0  0  0  1  0  0
C    0  0  1  0  0  0  1  0
G    1  0  0  1  0  0  0  0
T    0  0  0  0  1  0  0  1
T    0  0  0  0  1  0  0  1
T    0  0  0  0  1  0  0  1
A    0  1  0  0  0  1  0  0
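A short sketch that builds this match/mismatch matrix for the two example sequences. The function name is illustrative, and the 1/0 scoring is the simple identity scoring used on this slide.

```python
def identity_matrix(seq1, seq2):
    """One row per character of seq1, one column per character of seq2;
    each box holds 1 for a match and 0 for a mismatch."""
    return [[1 if a == b else 0 for b in seq2] for a in seq1]


seq1, seq2 = "AAGACGTTTA", "GACGTACT"
matrix = identity_matrix(seq1, seq2)

print(" ", *seq2)
for base, row in zip(seq1, matrix):
    print(base, *row)
```

Replacing the 1/0 rule with a lookup in a substitution matrix turns this dot matrix into the scored alignment matrix discussed later in the lecture.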
Not all matches are equal

So far, we have considered an aligned pair to be either a match or a mismatch. Is there something in between? Yes. Some mutations are more "conservative" than others. (Consider the wobble base...) If we know how conservative a mutation is, then we have a measure of similarity that is no longer "black & white."
Conservative mutations
DNA: A change in the 3rd base of a codon, and sometimes in the 1st base, can leave the amino acid unchanged.
Protein: A change between amino acids of the same chemical class conserves the chemical environment. For example, Lys to Arg is conservative because both are positively charged.
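The DNA claim can be checked by brute force. The sketch below assumes the standard genetic code and counts, for each codon position, how many single-base changes leave the encoded amino acid unchanged. STOP codons are skipped, and the function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_counts():
    """Count single-base changes that keep the amino acid, per codon position."""
    counts = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # ignore STOP codons as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa:
                    counts[pos] += 1
    return counts


first, second, third = synonymous_counts()
print("synonymous changes by position:", first, second, third)
```

The third position accounts for most of the silent changes, the first position for a few, which matches the "3rd base, and sometimes the first base" wording above.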
Conservative amino acid changes

[Figure: side-chain structures of the paired amino acids]
Lys <--> Arg
Ile <--> Leu
Ser <--> Thr
Asp <--> Glu
Asn <--> Gln
If the chemistry of the sidechain is conserved, then the mutation is less likely to change structure/function.
Did the genetic code evolve?

Mutations in the first position usually conserve the chemical nature of the sidechain.
[Figure legend: non-polar, polar, polar/charged]
Amino acid substitution matrices

Two families of 20x20 amino acid substitution matrices are in common use: BLOSUM and PAM.
BLOSUM62 (lower triangle; the matrix is symmetric)

     A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
A    4
C    0   9
D   -2  -3   6
E   -1  -4   2   5
F   -2  -2  -3  -3   6
G    0  -3  -1  -2  -3   6
H   -2  -3  -1   0  -1  -2   8
I   -1  -1  -3  -3   0  -4  -3   4
K   -1  -3  -1   1  -3  -2  -1  -3   5
L   -1  -1  -4  -3   0  -4  -3   2  -2   4
M   -1  -1  -3  -2   0  -3  -2   1  -1   2   5
N   -2  -3   1   0  -3   0   1  -3   0  -3  -2   6
P   -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7
Q   -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5
R   -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5
S    1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4
T    0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5
V    0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4
W   -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11
Y   -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7

Each number is the score for aligning a single pair of amino acids.
What is the score for this alignment?
ACEPGAA
ASDDGTV
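To check an answer to the question above, the score of a gapless alignment is just the sum of the matrix entries for each aligned pair. The sketch below embeds the lower triangle of BLOSUM62 from this slide; the names `ORDER`, `LOWER_TRIANGLE`, `load_blosum62`, and `alignment_score` are illustrative choices, not part of any library.

```python
ORDER = "ACDEFGHIKLMNPQRSTVWY"

# Lower triangle of BLOSUM62, one row per amino acid in ORDER.
LOWER_TRIANGLE = """
4
0 9
-2 -3 6
-1 -4 2 5
-2 -2 -3 -3 6
0 -3 -1 -2 -3 6
-2 -3 -1 0 -1 -2 8
-1 -1 -3 -3 0 -4 -3 4
-1 -3 -1 1 -3 -2 -1 -3 5
-1 -1 -4 -3 0 -4 -3 2 -2 4
-1 -1 -3 -2 0 -3 -2 1 -1 2 5
-2 -3 1 0 -3 0 1 -3 0 -3 -2 6
-1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7
-1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5
-1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5
1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4
0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5
0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4
-3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11
-2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7
"""


def load_blosum62():
    """Expand the lower triangle into a symmetric {(a, b): score} lookup."""
    scores = {}
    rows = [line.split() for line in LOWER_TRIANGLE.strip().splitlines()]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            a, b = ORDER[i], ORDER[j]
            scores[a, b] = scores[b, a] = int(value)
    return scores


BLOSUM62 = load_blosum62()


def alignment_score(seq1, seq2):
    """Sum the substitution scores of each aligned pair (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("a gapless alignment needs equal-length sequences")
    return sum(BLOSUM62[a, b] for a, b in zip(seq1, seq2))


for a, b in zip("ACEPGAA", "ASDDGTV"):
    print(a, b, BLOSUM62[a, b])   # per-position scores
print("total:", alignment_score("ACEPGAA", "ASDDGTV"))
```

Printing the per-position scores before the total makes it easy to compare against a hand calculation.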
A teacher's dilemma
To understand...                You first need to know...
Multiple sequence alignment     Substitution matrices
Substitution matrices           Phylogenetic trees
Phylogenetic trees              Multiple sequence alignment
A more sensitive dot matrix: the alignment matrix

Fill in each box with the substitution score.
Each diagonal represents a different alignment, whose score is the sum of the boxes.
[Figure: grid with sequence 1 along one axis and sequence 2 along the other]
In class exercise:
Gapless Alignment
Sequence 1: Y K K G E R
Sequence 2: G D I K R
Fill in each box using BLOSUM62. Score all diagonals.
Each diagonal represents a different alignment, whose score is the sum of the boxes.
What is the best gapless alignment?
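The search over diagonals can be sketched generically. To stay short and to leave the exercise open, this sketch uses the DNA example from the dot matrix slides with simple match = 1, mismatch = 0 scoring; a BLOSUM62 lookup could be passed as `score` instead. The function name is illustrative, and leaving overhanging ends unscored is an assumption.

```python
def best_gapless_alignment(seq1, seq2, score):
    """Score every diagonal of the alignment matrix and return
    (best_score, offset), where offset = i - j for the aligned
    positions seq1[i] and seq2[j]. Overhanging ends are not scored."""
    best = None
    for offset in range(-(len(seq2) - 1), len(seq1)):
        total = 0
        for j in range(len(seq2)):
            i = j + offset
            if 0 <= i < len(seq1):
                total += score(seq1[i], seq2[j])
        if best is None or total > best[0]:
            best = (total, offset)
    return best


def identity(a, b):
    """Match = 1, mismatch = 0."""
    return 1 if a == b else 0


total, offset = best_gapless_alignment("AAGACGTTTA", "GACGTACT", identity)
print(total, offset)  # 5 2
```

An offset of 2 means the second sequence starts under the third character of the first, which lines up GACGT with GACGT.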
still time?
Pseudo code for alignment matrix

Starting with the BLOSUM array, write pseudocode to fill in the alignment matrix.
(1) Convert both sequences to numbers. (A=1, C=2, D=3, etc. Use amino acids in the same order as the BLOSUM matrix.)
(2) Loop over boxes (nested loops). Put the appropriate BLOSUM number in the box.
still time?
Pseudo code for alignment matrix

read blosum[1..20][1..20]
read firstseq[1..N]
read secondseq[1..M]
for (i from 1 to N) do
    for (j from 1 to M) do
        ---- fill in the blank ----
    enddo
enddo
write alignmentmatrix[1..N][1..M]
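One possible way to fill in the blank, written as a runnable Python sketch rather than the intended classroom answer. It follows the two steps of the pseudocode, with 0-based indices instead of 1-based ones. The BLOSUM62 values are embedded as a lower triangle, and all names are illustrative.

```python
ORDER = "ACDEFGHIKLMNPQRSTVWY"

# Lower triangle of BLOSUM62, one row per amino acid in ORDER.
LOWER_TRIANGLE = """
4
0 9
-2 -3 6
-1 -4 2 5
-2 -2 -3 -3 6
0 -3 -1 -2 -3 6
-2 -3 -1 0 -1 -2 8
-1 -1 -3 -3 0 -4 -3 4
-1 -3 -1 1 -3 -2 -1 -3 5
-1 -1 -4 -3 0 -4 -3 2 -2 4
-1 -1 -3 -2 0 -3 -2 1 -1 2 5
-2 -3 1 0 -3 0 1 -3 0 -3 -2 6
-1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7
-1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5
-1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5
1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4
0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5
0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4
-3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11
-2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7
"""


def read_blosum():
    """Build the full symmetric 20x20 array from the lower triangle."""
    rows = [[int(v) for v in line.split()]
            for line in LOWER_TRIANGLE.strip().splitlines()]
    blosum = [[0] * len(ORDER) for _ in ORDER]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            blosum[i][j] = blosum[j][i] = value
    return blosum


def alignment_matrix(firstseq, secondseq, blosum):
    # (1) Convert both sequences to numbers, in BLOSUM order (A=0, C=1, ...).
    first = [ORDER.index(c) for c in firstseq]
    second = [ORDER.index(c) for c in secondseq]
    # (2) Nested loops: put the matching BLOSUM number in each box.
    matrix = [[0] * len(second) for _ in first]
    for i in range(len(first)):
        for j in range(len(second)):
            matrix[i][j] = blosum[first[i]][second[j]]
    return matrix


matrix = alignment_matrix("ACEPGAA", "ASDDGTV", read_blosum())
for row in matrix:
    print(" ".join(f"{v:3d}" for v in row))
```

The blank in the pseudocode corresponds to the single assignment inside the inner loop. Summing any diagonal of the printed matrix gives the score of the corresponding gapless alignment.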
