0 views

Uploaded by poojohnson

catam old

save

- MIT ALL Courses
- Lsm 3241
- Reference 1
- swaminat
- BREW Final Report September 2006
- Shaft Alignment Optimization Genetic Algorithms
- Iddo-McCreight_slides on Suffix Tree Updates
- rr322303-bio-informatics
- Blast Analisis II
- PROSITE
- Bioeknologi
- Mining Sequence in Biological Data
- Need and Role of Scala Implementations in Bioinformatics
- Public Att Biotech Hoban
- genetic databases ppt
- figueroatext3
- section_four
- 07 Forensik
- KI67 Biotech - DB-070-EN.pdf
- Nightingale, Martin - 2004 - The Myth of the Biotech Revolution
- Chassy B. Food Safety Evaluation of Crops Produced Through Biotech 2002
- tmpEF99.tmp
- EU Version: The So-Called Scientific "Consensus": Why the Debate on GMO Safety is Not Over
- The So-Called Scientific "Consensus": Why the Debate on GMO Safety is Not Over
- Research Methodology 27012012
- interview with dr lawrence mays
- DNA Barcoding Identifies a Third Invasive Species of Eleutherodactylus
- Notice: Agency Information Collection Activities; Proposals, Submissions, and Approvals
- Applied Biology
- Proteomics - Technologies, Markets and Companies
- Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
- Dispatches from Pluto: Lost and Found in the Mississippi Delta
- The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution
- Sapiens: A Brief History of Humankind
- Yes Please
- The Unwinding: An Inner History of the New America
- Grand Pursuit: The Story of Economic Genius
- This Changes Everything: Capitalism vs. The Climate
- A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
- The Emperor of All Maladies: A Biography of Cancer
- The Prize: The Epic Quest for Oil, Money & Power
- John Adams
- Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
- The World Is Flat 3.0: A Brief History of the Twenty-first Century
- Rise of ISIS: A Threat We Can't Ignore
- The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
- Smart People Should Build Things: How to Restore Our Culture of Achievement, Build a Path for Entrepreneurs, and Create New Jobs in America
- Team of Rivals: The Political Genius of Abraham Lincoln
- The New Confessions of an Economic Hit Man
- How To Win Friends and Influence People
- Angela's Ashes: A Memoir
- Steve Jobs
- Bad Feminist: Essays
- You Too Can Have a Body Like Mine: A Novel
- The Incarnations: A Novel
- The Light Between Oceans: A Novel
- The Silver Linings Playbook: A Novel
- Leaving Berlin: A Novel
- Extremely Loud and Incredibly Close: A Novel
- The Sympathizer: A Novel (Pulitzer Prize for Fiction)
- A Man Called Ove: A Novel
- The Master
- Bel Canto
- The Blazing World: A Novel
- The Rosie Project: A Novel
- The First Bad Man: A Novel
- We Are Not Ourselves: A Novel
- Brooklyn: A Novel
- The Flamethrowers: A Novel
- Life of Pi
- The Love Affairs of Nathaniel P.: A Novel
- Lovers at the Chameleon Club, Paris 1932: A Novel
- The Bonfire of the Vanities: A Novel
- The Perks of Being a Wallflower
- A Prayer for Owen Meany: A Novel
- The Cider House Rules
- Wolf Hall: A Novel
- The Art of Racing in the Rain: A Novel
- The Wallcreeper
- Interpreter of Maladies
- The Kitchen House: A Novel
- Beautiful Ruins: A Novel
- Good in Bed

You are on page 1of 4

**9.3 Protein Comparison in Bioinformatics (8 units)
**

This project is computationally intensive but mostly self-contained mathematically. Some un-

derstanding of random variables (covered in the Part IA Probability course) is required.

Introduction

**Sequence comparison and alignment, combined with the systematic collection and search of
**

databases containing biomolecular sequences, both DNA and protein, has become an essential

part of modern molecular biology. Molecular sequence data is important because two proteins

with similar sequences often have similar functions or structures. This means that we can learn

about the function of a protein in humans by studying the functions of proteins with similar

sequences in simpler organisms such as yeast, fruit flies, or frogs.

In this project we will examine methods for comparison of two sequences.

We will work with two strings S and T of lengths m and n respectively, composed of characters

from some finite alphabet. Write Si for the ith character of S, and S[i, j] for the substring

Si , . . . , Sj . If i > j then S[i, j] is the empty string. A prefix of S is a substring S[1, k] for some

k 6 m (possibly the empty prefix). Similarly, a suffix of S is a substring S[k, m] with k > 1.

1 The edit distance

**This section follows work originally done by Needleman and Wunsch [4], although the notation
**

we use is slightly different.

Suppose S = fruit and T = berry. We can transform S into T by

1. Replacing f with b,

2. Inserting e,

3. Matching r with r,

4. Replacing u with r,

5. Replacing i with y,

6. Deleting t.

**We call RIMRRD the editing transcript. The alignment of S and T is read vertically character by
**

character and is given by:

RIMRRD

f ruit

berry

There are, of course, many possible ways to transform one string into another; but from an

evolutionary point of view there must be a cost associated with any action other than matching.

The optimal edit transcripts are those that involve the least number of edit operations (Replace,

Insert, and Delete). We define the edit distance d(S, T ) to be the minimal number of edits

between S and T . Let D(i, j) = d(S[1, i], T [1, j]). Observe that d(S, T ) = D(m, n).

**July 2012/Part II/9.3 Page 1 of 4 °
**

c University of Cambridge

pointing to one of D(i − 1. we are interested in finding an optimal editing transcript and alignment of S and T rather than just d(S. find the score v between proteins A and B and give the first 50 steps of the optimal alignment. j) = min{D(i − 1. D(i. a protein. and D(0. This can be done by assigning a pointer when calculating D(i. Explain your reasoning carefully. Question 1 Prove that for all i. In this section we adjust our scoring algorithm in order to capture some of these biological considerations. What boundary conditions D(0. i. Question 3 Modify your algorithm so as to produce one possible optimal alignment between two strings. j) did you use? Question 2 Write a program to find the edit distance between two strings. protein A is for humans and protein B for yellowfin tuna. A gene is a sequence of DNA which can be translated into a sequence of amino acids. so two protein sequences might be relatively similar over several regions. b) is some suitable function which you should determine. Approximately twenty types of amino acid (the exact number is species dependent) are involved in the construction of each protein. b) which you found in Section 1.’order’.) Find the edit distance between them. and give the first 50 steps of an optimal alignment. we will talk about maximising a score rather than minimizing a distance.e. and some mutations are more likely than others. Use it to find the edit distance between shesells and seashells. What is the complexity of your algorithm? Usually. T ) be the maximum score of all edit transcripts from S to T . For historical reasons. July 2012/Part II/9.3 Page 2 of 4 ° c University of Cambridge . (Both proteins are myoglobin. Tj )}. and the BLOSUM matrices of Henikoff and Henikoff [3]. the appropriate BLOSUM matrix can be generated using the command blosum(62. T ). j − 1). Mutations in DNA will lead to changes in the sequence of amino acids.∗ 3 Scoring for gaps Some mechanisms for DNA mutations involve the deletion or insertion of large chunks of DNA. 2 Scoring matrix A protein is essentially a long sequence of amino acids. where s(a.’CSTPAGNDEQHRKMILVFYW’). j − 1) + 1. Proteins are often composed of combinations of different domains from a relatively small reper- toire. D(i − 1. D(i. but differ in other regions where one protein contains a certain domain but the other does not. Question 4 Using the BLOSUM matrix blosum. Let v(S. or D(i − 1. D(i. j). j − 1) + s(Si . j). There are various schemes for assessing the probability of a mutation from amino acid a to amino acid b.. currently the two dominant schemes are the PAM matrices introduced by Dayhoff [2]. j − 1). Take proteins A and B from the file proteins. The adjustment is achieved by replacing the scoring function s(a.txt on the CATAM website. j) + 1. and scoring −8 for each Insert or Delete. 0). j > 0 D(i. ∗ If your version of MATLAB has the Bioinformatics Toolbox installed. 0).txt from the CATAM website for the scoring function s.

then lim inf n→∞ xn is defined by lim inf xn = lim inf xm . E(vgap (U n . is to show that this limit exists.txt on the CATAM website. Let the score of inserting/deleting a sequence of length l be fixed: w(l) = u for all l > 1. Let U n be a random protein of length n: all the amino acids U1n . j) + w(i − k)}. 4 Statistical significance We may now ask at what threshold a score vgap (S. Let vgap (S. a) = s(b. k) + w(j − k)}. find the gap-weighted score vgap (C. i]. 06k6j−1 F (i. T ) be the gap-weighted score between S and T . Suppose there are only two letters in our alphabet. One way to obtain a quality mark in this project.) Using the BLOSUM matrix from Sec- tion 2 for the scoring function s. V n )) exists and is strictly positive. with P (Uin = a) = p and P (Uin = b) = 1 − p. n→∞ n→∞ m>n Note that the limit is always guaranteed to exist. j). and write Vgap (i. j − 1) + s(Si . G(i. Question 5 Find and implement such an algorithm. V n )). This is discussed further in most textbooks on analysis. Iterating the above equations on the n by m grid has complexity of O(mn2 + nm2 ). . a and b. . Then Vgap (i. though not the only way. j) for vgap (S[1. limn→∞ n−1 E(vgap (U n . What boundary conditions do you use? Question 6 Take proteins C and D from the file proteins. Question 8 Let u = −3. Let s(a. b) = 1 and s(a. 06k6i−1 G(i. j) = max {Vgap (k. . T [1. . Prove that for all 0 6 p 6 1. and why it has complexity O(mn). Let w(l) < 0. corresponding. Question 7 Consider two random proteins U n and V n . (Both proteins are keratin structures in humans. j). l > 1. D) and give the first 50 steps of the optimal alignment. F (i. say. j)}. independent and identically distributed. T ) should be declared to have biological significance? Let us simplify the problem slightly.) In fact.3 Page 3 of 4 ° c University of Cambridge . if w(l) takes some fixed value u for all l > 1. Happily. b) = s(b. though it may be +∞ or −∞. Now vary n and estimate the limit of n−1 E(vgap (U n . Tj ). V n )). j]). Explain how your algorithm works. we can still align two protein strings taking gaps into account. to hydrophobic and hydrophilic amino acids. E(i. p = 12 .At some computational cost. and for all u 6 0. July 2012/Part II/9. j) = Vgap (i − 1. a) = −1. then there exists an algorithm for finding vgap which has complexity O(mn). j) = max {Vgap (i. Explain how you arrive at your estimate of the limit. and u = −12 as the fixed score of insertion/deletion. Write a program to estimate n−1 E(vgap (U n . Unn are independent and identically distributed. n→∞ n (Note: if xn is a sequence of real numbers. j) = max {E(i. V n )) lim inf > 0. be the score of deleting (or inserting) a sequence of amino acids of length l from (or into) a protein.

0) = Vsfx (0. j − 1) + s(−. Pro- ceeding of the National Academy of Science 89: 10915-10919. 1978 [3] Henikoff. Bulletin of Mathematical Biology 48: 603-616. B. A general method applicable to the search for similarities in the amino acid sequence of two proteins. j]). j) + s(S i . we will use the same scoring as in Section 2. V (i − 1. T ) seems to be of much higher complexity than solving the global alignment problem. T ′ a substring of T }. only a small region of the protein is critical to its function and only this region will be conserved throughout the evolutionary process.M. T ′ a prefix of T }.B. 1986 [2] Dayhoff. Prove that 0.) Finding vsub (S. Often. We will also write s(−.G. T ′ a suffix of T }. Question 11 Find vsub for proteins C and D. M. Suppose we restrict ourselves to suffixes of S and T : vsfx (S. S.O. we will solve it using an algorithm whose complexity is still only O(mn). and Henikoff J. Journal of Molecular Biology 48: 443-453. it is useful to identify these highly conserved regions. −) = s(−. namely. j − 1) + s(S . Amazingly. −) < 0 for the score of an insertion or deletion. Atlas of Protein Sequence and Structure 5: 345-352.3 Page 4 of 4 ° c University of Cambridge . When we identify two proteins which perform similar functions but look superficially different. T ). sfx i j Vsfx (i. Question 9 Prove carefully that vsub (S. j) = 0. A model of evolutionary change in proteins. For example. T ′ ) : S ′ a prefix of S.C. j) = max V sfx (i − 1. S. Schwartz. T [1. 1992 [4] Needleman. and Erickson. We aim to find a pair of substrings S ′ and T ′ of S and T with the highest alignment score. and Orcutt. vsub (S. References [1] Altschul. T ′ ) : S ′ a substring of S. though. using the BLOSUM matrix from Sec- tion 2 and s(a. T ′ ) : S ′ a suffix of S.5 Local Alignment Full alignment of proteins is meaningful when the two strings are members of the same family. T ). B.. T ) = max{v(S ′ . as there are Θ(n2 m2 ) combinations of substrings of S and T . and Wunsch. S. T ) = max{v(S ′ . T ) = max{vsfx (S ′ . j) for vsfx (S[1.D. sfx j with boundary conditions Vsfx (i. the full sequences of the oxygen-binding proteins myoglobin and haemoglobin are very similar. V (i. Question 10 Write Vsfx (i. 1970 July 2012/Part II/9. Optimal sequence alignment using affine gap costs. Amino acid substitution matrics from protein blocks. We will first define a slightly easier problem. C. a) = −2 for all amino acids a. R. a) = s(a. −). i]. (For simplicity.W.

- MIT ALL CoursesUploaded bypradeep kumar
- Lsm 3241Uploaded bybismarck
- Reference 1Uploaded byKarthi Shanmugam
- swaminatUploaded byIvica Kelam
- BREW Final Report September 2006Uploaded byvassilyh
- Shaft Alignment Optimization Genetic AlgorithmsUploaded byJoy Deb Nath
- Iddo-McCreight_slides on Suffix Tree UpdatesUploaded byrickjdk1
- rr322303-bio-informaticsUploaded bySRINIVASA RAO GANTA
- Blast Analisis IIUploaded byUlises Ortiz Gutierrez
- PROSITEUploaded byarpita-rehni-1702
- BioeknologiUploaded bykartikagustirahmadhani
- Mining Sequence in Biological DataUploaded byAbhinav Barik
- Need and Role of Scala Implementations in BioinformaticsUploaded byAtif Sarwar
- Public Att Biotech HobanUploaded bysubjject16
- genetic databases pptUploaded byrengachen
- figueroatext3Uploaded byapi-256959970
- section_fourUploaded byapi-3779164
- 07 ForensikUploaded byDani Yustiardi
- KI67 Biotech - DB-070-EN.pdfUploaded bySamuel Madureira Silva
- Nightingale, Martin - 2004 - The Myth of the Biotech RevolutionUploaded bycabruca1
- Chassy B. Food Safety Evaluation of Crops Produced Through Biotech 2002Uploaded byHéctor Mendoza
- tmpEF99.tmpUploaded byFrontiers
- EU Version: The So-Called Scientific "Consensus": Why the Debate on GMO Safety is Not OverUploaded byFood and Water Watch
- The So-Called Scientific "Consensus": Why the Debate on GMO Safety is Not OverUploaded byFood and Water Watch
- Research Methodology 27012012Uploaded byChandrakanth
- interview with dr lawrence maysUploaded byapi-332371318
- DNA Barcoding Identifies a Third Invasive Species of EleutherodactylusUploaded byjegarciap
- Notice: Agency Information Collection Activities; Proposals, Submissions, and ApprovalsUploaded byJustia.com
- Applied BiologyUploaded byZul HM
- Proteomics - Technologies, Markets and CompaniesUploaded byBharat Book Bureau