Chapter 3.
Pairwise Sequence Alignment
An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. It is the result of two distinct processes:
- The first is the occurrence of a mutation in the portion of the gene template that produces one amino acid of a protein.
- The second is the acceptance of the mutation by the species as the new predominant form.
To be accepted, the new amino acid usually must function in a way similar to the old one: chemical and physical similarities are found between the amino acids that are observed to interchange frequently.
LEARNING OBJECTIVES
Upon completing this chapter, you should be able to:
- Define homology as well as orthologs and paralogs.
- Explain how PAM (accepted point mutation) matrices are derived.
- Contrast the utility of PAM and BLOSUM scoring matrices.
- Define dynamic programming and explain how global and local pairwise alignments are performed.
- Perform pairwise alignment of protein or DNA sequences at the NCBI website.
INTRODUCTION
One of the most basic questions about a gene or protein is whether it is related to any other gene or protein. Relatedness of two proteins at the sequence level suggests that they are homologous, and they may share a common function. By analyzing DNA and protein sequences we can identify the domains and motifs that are shared among a group of molecules. These analyses of the relatedness of genes and proteins are accomplished by aligning sequences. As the genomes of more and more species are sequenced, the task of determining how proteins are related within an organism and between organisms becomes increasingly important to our understanding.
In this chapter we introduce pairwise sequence alignment. We adopt an evolutionary perspective in describing how the amino acids (or nucleotides) of two sequences can be aligned and compared. We then describe algorithms and programs for pairwise alignment.
Two genes (or proteins) are homologous if they evolved from a common ancestor.
PROTEIN ALIGNMENT: OFTEN MORE INFORMATIVE THAN DNA ALIGNMENT
Given the choice of aligning a DNA sequence or the sequence of the protein it encodes, it is often more informative to compare the protein sequences. There are several reasons:
- Many changes in a DNA sequence (particularly at the third position of a codon) do not change the amino acid that is specified.
- Many amino acids share related biophysical properties (for example, lysine and arginine are both basic amino acids).
- The important relationships between related (but mismatched) amino acids in an alignment can be accounted for using scoring systems (described in this chapter).
- DNA sequences are less informative. Protein sequence comparisons can identify homologous sequences while the corresponding DNA sequence comparisons cannot (Pearson, 1996).
When a nucleotide sequence is analyzed, it is often used to study the protein that it encodes. In Chapter 4 (on BLAST) we will see that it is easy to move between DNA and protein sequence information. For example, the TBLASTN tool on the NCBI BLAST website lets a protein sequence be used as the query to find related proteins encoded in a DNA database.
Nevertheless, in many cases it is appropriate to compare nucleotide sequences. This comparison can be important:
- in confirming the identity of a DNA sequence in a database search
- in searching for polymorphisms
- in analyzing the identity of a cloned cDNA fragment
- in comparing regulatory regions
- in many other applications.
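The first reason given above (silent changes at third codon positions) can be illustrated with a small sketch. The two DNA strings are made-up examples rather than database sequences, and the codon table is deliberately restricted to the few standard-code codons they use.

```python
# Partial standard genetic code: only the codons needed for this toy example.
CODON_TABLE = {
    "GCT": "A", "GCC": "A",   # alanine
    "AAA": "K", "AAG": "K",   # lysine
    "GGT": "G", "GGA": "G",   # glycine
}

def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def percent_identity(seq1, seq2):
    """Percent of positions that are identical in two equal-length sequences."""
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)

dna_1 = "GCTAAAGGT"  # hypothetical sequence
dna_2 = "GCCAAGGGA"  # same codons with every third position changed

print(translate(dna_1), translate(dna_2))
print(percent_identity(dna_1, dna_2))
print(percent_identity(translate(dna_1), translate(dna_2)))
```

The DNA strings differ at three of nine positions, yet both translate to the same peptide, so the protein comparison retains a signal that the nucleotide comparison has partly lost.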
DEFINITIONS: HOMOLOGY, SIMILARITY, IDENTITY
Consider the globin family of proteins, beginning with the human protein myoglobin.
Myoglobin and hemoglobin are thought to have diverged some 450 million years ago, near the time that the human and cartilaginous fish lineages diverged.
1. Homology
Two sequences are homologous if they share a common evolutionary ancestry.
(Some researchers use the term analogous to refer to proteins that are not homologous but share some similarity by chance. Such proteins are presumed not to have descended from a common ancestor.)
- There are no degrees of homology: sequences are either homologous or not.
- Homologous proteins almost always share a significantly related three-dimensional structure.
- When two sequences are homologous, their amino acid or nucleotide sequences usually share significant identity.
- Homology is a qualitative inference.
- Notably, two molecules may be homologous without sharing statistically significant amino acid (or nucleotide) identity.
- In general, three-dimensional structures diverge much more slowly than amino acid sequence identity between two proteins.
Identity and similarity are quantities that describe the relatedness of sequences.
Two homologous proteins may be orthologous or paralogous.
- Orthologs are homologous sequences in different species that arose from a common ancestral gene during speciation.
(Where the homology is the result of speciation, so that the history of the gene reflects the history of the species (for example, α hemoglobin in man and mouse), the genes should be called orthologous (ortho = exact).)
- Orthologs are presumed to have similar biological functions.
EX: Human and rat myoglobin both transport oxygen in muscle cells.
EX: Figure 3.2 shows a tree of myoglobin orthologs. There is a human myoglobin gene and a rat gene. Humans and rodents diverged about 90 million years ago (MYA) (see Chapter 19), at which time a single ancestral myoglobin gene diverged by speciation.
- Paralogs (para = in parallel) are homologous sequences that arose by a mechanism such as gene duplication.
(Where the homology is the result of gene duplication, so that both copies have descended side by side during the history of an organism (for example, α and β hemoglobin), the genes should be called paralogous (para = in parallel).)
EX: Human alpha 1 globin is paralogous to alpha 2 globin; indeed, these two proteins share 100% amino acid identity. Human alpha 1 globin and beta globin are also paralogs.
Notably, orthologs and paralogs do not necessarily have the same function.
In practice, two DNA (or protein) sequences are defined as homologous based on achieving significant alignment scores.
We can assess the relatedness of any two proteins by performing a pairwise alignment with the NCBI BLASTP tool (for proteins) or BLASTN (for nucleotides). Perform the following steps:
1. Choose the program BLASTP for our comparison of two proteins. Check the box "Align two or more sequences".
2. Enter the sequences or their accession numbers. Here we use the sequence of human beta globin in FASTA format, and for myoglobin we use the accession number.
3. Select any optional parameters.
   - You can choose from eight scoring matrices: BLOSUM90, BLOSUM80, BLOSUM62, BLOSUM50, BLOSUM45, PAM250, PAM70, PAM30. Select PAM250.
   - You can change the gap creation penalty and the gap extension penalty.
   - For BLASTN searches you can change the reward and penalty values.
   - There are other parameters you can change, such as word size, expect value, filtering, and dropoff values.
4. Click "Align". The output includes a pairwise alignment using the single-letter amino acid code.
Note that the FASTA format also uses the single-letter amino acid code.
If we allow gaps in the alignment to account for deletions or insertions in the two sequences, the number of possible alignments rises exponentially. We therefore need a computer algorithm to perform the alignment.
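To get a feel for how quickly the number of gapped alignments grows, the following sketch counts them with a simple recurrence. The counting convention is an assumption made here for illustration: every alignment column is a residue pair, a gap in the first sequence, or a gap in the second, and alignments that differ only in the order of adjacent gaps are counted separately.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Count gapped alignments of two sequences of lengths n and m.

    The last column of any alignment is one of three kinds, which gives
    a three-way recurrence over the remaining prefixes.
    """
    if n == 0 or m == 0:
        return 1  # the remaining residues can only be aligned against gaps
    return (count_alignments(n - 1, m - 1)   # last residues paired
            + count_alignments(n - 1, m)     # residue of sequence 1 vs gap
            + count_alignments(n, m - 1))    # gap vs residue of sequence 2

for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

Even for two sequences of a few dozen residues the count is far too large to enumerate, which is why alignment programs rely on an algorithm rather than exhaustive comparison.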
2. Identity & similarity
Identity is the extent to which two amino acid (or nucleotide) sequences are invariant, that is, the positions at which the aligned residues are exactly the same.
Note that this particular alignment is called local:
- Only a subset of the two proteins is aligned.
- The first and last few amino acid residues of each protein are not displayed.
A global pairwise alignment includes all residues of both sequences.
Another aspect of this pairwise alignment is that some of the aligned residues are similar but not identical: they share similar biochemical properties. Similar pairs of residues are structurally or functionally related (for example, both basic, both negatively charged, both positively charged, or both aromatic).
- The percent similarity of two protein sequences is the sum of the identical and the similar matches.
- In general it is more useful to consider the identity shared by two protein sequences rather than the similarity, because similarity measures may be based upon a variety of definitions of how related (similar) two amino acid residues are to each other.
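The distinction between identity and similarity can be sketched in a few lines. The similarity groups below are a hypothetical choice made for this example (real tools usually derive similarity from a scoring matrix), which is exactly why percent similarity depends on the definition used. Percentages here are taken over the full alignment length, which is one of several conventions.

```python
# Hypothetical similarity groups, chosen only for illustration.
SIMILAR_GROUPS = [
    set("KRH"),    # basic
    set("DE"),     # acidic
    set("FWY"),    # aromatic
    set("ILVM"),   # hydrophobic
    set("ST"),     # hydroxyl-containing
]

def are_similar(a, b):
    """True if two different residues fall in the same group above."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)

def identity_and_similarity(row_1, row_2):
    """Return (percent identity, percent similarity) for two aligned rows.

    Gap characters ('-') count toward the alignment length but never match.
    """
    identical = similar = 0
    for a, b in zip(row_1, row_2):
        if a == "-" or b == "-":
            continue
        if a == b:
            identical += 1
        elif are_similar(a, b):
            similar += 1
    length = len(row_1)
    return 100.0 * identical / length, 100.0 * (identical + similar) / length

# Made-up aligned rows of equal length.
row_1 = "MKLV-DE"
row_2 = "MRLIADQ"
identity, similarity = identity_and_similarity(row_1, row_2)
print(round(identity, 1), round(similarity, 1))
```

Changing the contents of `SIMILAR_GROUPS` changes the similarity figure but leaves the identity figure untouched.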
In summary, pairwise alignment is the process of lining up two sequences to achieve maximal levels of identity (and maximal levels of conservation in the case of amino acid alignments).
The purpose is to assess the degree of similarity and the possibility of homology between two molecules.
Note that it is not appropriate to describe two sequences as "highly homologous"; instead, it can be said that they share a high degree of similarity.
GAPS
Pairwise alignment is useful as a way to identify mutations that have occurred during evolution and have caused divergence of the sequences of the two proteins we are studying.
The most common mutations are:
- Substitutions: occur when a mutation results in the codon for one amino acid being changed into that for another.
- Insertions and deletions: occur when residues are added or removed. Insertions or deletions (even those just one character long) are referred to as gaps in the alignment.
Note that one of the effects of adding gaps is to make the overall length of each aligned sequence exactly the same. The addition of gaps can help to create an alignment that models the evolutionary changes that have occurred.
In a typical scoring scheme there are two gap penalties, together called affine gap costs:
- The first is a score of -a for creating a gap.
- The second is a penalty of -b for each residue that the gap extends. If a gap extends for k residues it is assigned a penalty of -(a + bk).
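The affine gap cost -(a + bk) can be written directly as a small function. The values chosen for a and b below are illustrative assumptions, and `total_gap_penalty` is a hypothetical helper that sums the penalty over every run of gap characters in one row of an alignment.

```python
def affine_gap_penalty(k, a, b):
    """Penalty for a single gap that spans k residues: -(a + b*k)."""
    if k <= 0:
        return 0
    return -(a + b * k)

def total_gap_penalty(aligned_row, a, b):
    """Sum the affine penalties over every run of '-' in one alignment row."""
    total, run = 0, 0
    for ch in aligned_row + "X":      # sentinel character closes a trailing gap
        if ch == "-":
            run += 1
        else:
            total += affine_gap_penalty(run, a, b)
            run = 0
    return total

A, B = 11, 1  # illustrative gap creation and extension costs
print(affine_gap_penalty(1, A, B))
print(affine_gap_penalty(3, A, B))
print(total_gap_penalty("AC---GT-A", A, B))
```

With these values one gap of three residues costs 14, whereas three separate one-residue gaps would cost 36, so the scheme favors a few long gaps over many short ones.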
PAIRWISE ALIGNMENT, HOMOLOGY, AND EVOLUTION OF LIFE
If two proteins are homologous, they share a common ancestor.
- We can compare protein sequences from many species and ask whether the sequences are homologous or not.
- The study of homologous protein (or DNA) sequences by pairwise alignment involves an investigation of the evolutionary history of that protein (or gene).
On the time scale of life on Earth, the divergence of different species is established through several kinds of data, especially the fossil record.
SCORING MATRICES
Margaret Dayhoff (1966, 1978) provided a model of the rules by which evolutionary change occurs in proteins. The Dayhoff model is examined here in seven steps. It provides the basis of a quantitative scoring system for pairwise alignments between any proteins, whether they are closely or distantly related.
DAYHOFF MODEL STEP 1 (OF 7): ACCEPTED POINT MUTATIONS
Dayhoff and colleagues considered the problem of how to assign scores to aligned amino acid residues.
An accepted point mutation, abbreviated PAM, is defined as a replacement of one amino acid in a protein by another residue that has been accepted by natural selection. It occurs when:
1. A gene undergoes a DNA mutation such that it encodes a different amino acid.
2. The entire species adopts that change as the predominant form of the protein.
Intuitively, conservative replacements would be the most readily accepted.
(To define accepted mutations, Dayhoff and colleagues relied on empirically observed amino acid substitutions. They also used phylogenetic analyses: rather than simply comparing amino acid sequences directly with each other, they compared them with the inferred ancestral sequences.)
DAYHOFF MODEL STEP 2: FREQUENCY OF AMINO ACIDS
To model the probability that one aligned amino acid in a protein changes to another, we need to know the frequency of occurrence of each amino acid.
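Counting amino acid frequencies is straightforward. The sketch below uses three made-up peptide strings (not real proteins) and returns each residue's normalized frequency; `amino_acid_frequencies` is a name introduced here for illustration.

```python
from collections import Counter

def amino_acid_frequencies(sequences):
    """Return {amino acid: fraction of all residues} over the given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

toy_sequences = ["MKVLA", "MKALG", "GAVLK"]  # hypothetical peptides
freqs = amino_acid_frequencies(toy_sequences)
for aa in sorted(freqs):
    print(aa, round(freqs[aa], 3))
```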
DAYHOFF MODEL STEP 3: RELATIVE MUTABILITY OF AMINO ACIDS
To calculate the relative mutability of each amino acid, Dayhoff and colleagues divided the number of times that amino acid was observed to mutate (m_i) by its overall frequency of occurrence (f_i).
A fairly large proportion (about 20%) of the observed interchanges required two nucleotide changes; the others required only one. In some cases, such as Gly and Trp, only a single-nucleotide change would be required for the substitution, yet it was never observed empirically, presumably because such a change has been rejected by natural selection.
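The division described in this step can be sketched as follows. The counts and frequencies are hypothetical toy numbers, not Dayhoff's data, and the result is left unscaled.

```python
def relative_mutability(times_mutated, frequency):
    """Divide each amino acid's observed number of changes by its frequency."""
    return {aa: times_mutated[aa] / frequency[aa] for aa in times_mutated}

# Hypothetical values for three amino acids (illustration only).
times_mutated = {"N": 48, "A": 90, "W": 2}
frequency = {"N": 0.04, "A": 0.09, "W": 0.01}

mutability = relative_mutability(times_mutated, frequency)
for aa, value in sorted(mutability.items(), key=lambda item: -item[1]):
    print(aa, round(value))
```

In this toy data alanine changes most often in absolute terms, but after dividing by frequency asparagine comes out as the most mutable, which is the point of normalizing.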
DAYHOFF MODEL STEP 4: MUTATION PROBABILITY MATRIX FOR THE EVOLUTIONARY DISTANCE OF 1 PAM
Each element M_ij of the matrix (i: row, j: column) shows the probability that an original amino acid j will be replaced by another amino acid i over a defined evolutionary interval.
In this case, the interval is one PAM, which is defined as the unit of evolutionary divergence in which 1% of the amino acids have been changed between the two protein sequences.
NOTE: the evolutionary interval of this PAM matrix is defined in terms of percent amino acid divergence and not in units of years.
The nondiagonal elements of this matrix have the values:
$$M_{ij} = \frac{\lambda\, m_j\, A_{ij}}{\sum_{i} A_{ij}} \qquad (3.1)$$
- M_ij: the probability that an original amino acid j will be replaced by the amino acid from row i.
- A_ij: an element of the accepted point mutation matrix (such as the value corresponding to an original alanine being substituted by an arginine).
- λ: a proportionality constant.
- m_j: the mutability of the jth amino acid.
The diagonal elements of the matrix in Figure 3.9 have the values:
$$M_{jj} = 1 - \lambda\, m_j \qquad (3.2)$$
- M_jj: the probability that the original amino acid j will remain without undergoing a substitution to another amino acid.
(Dayhoff and colleagues used the assumption that accepted amino acid mutations are undirected, that is, they are equally likely in either direction. In the PAM1 matrix, the close relationship of the proteins makes it unlikely that the ancestral residue is entirely different from both of the observed, aligned residues.)
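A toy version of this step is sketched below for a three-letter alphabet. It assumes the usual Dayhoff forms, M_ij = λ m_j A_ij / Σ_i A_ij off the diagonal and M_jj = 1 - λ m_j on it, and it picks λ so that the expected fraction of changed residues is 1%, matching the definition of 1 PAM. All numbers (counts, mutabilities, frequencies) are hypothetical.

```python
# Toy accepted point mutation counts A[i][j] (symmetric, zero diagonal).
A = [[0, 30, 110],
     [30, 0, 20],
     [110, 20, 0]]
m = [1.0, 0.65, 1.34]   # hypothetical relative mutabilities
f = [0.5, 0.3, 0.2]     # hypothetical amino acid frequencies (sum to 1)

def mutation_probability_matrix(A, m, f, pam=0.01):
    """Build a column-stochastic mutation probability matrix for 1 PAM."""
    n = len(A)
    # Expected change is sum_j f_j * (1 - M_jj) = lambda * sum_j f_j * m_j,
    # so this choice of lambda makes it equal to `pam`.
    lam = pam / sum(f[j] * m[j] for j in range(n))
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * m[j]                    # diagonal form
            else:
                M[i][j] = lam * m[j] * A[i][j] / col_sums[j]  # off-diagonal form
    return M, lam

M, lam = mutation_probability_matrix(A, m, f)
for row in M:
    print([round(x, 5) for x in row])
```

Each column sums to 1 because an original residue j either stays the same or becomes one of the other residues.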
DAYHOFF MODEL STEP 5: PAM250 AND OTHER PAM MATRICES
The PAM1 matrix was based upon alignments of closely related protein sequences, having an average of 1% change.
Other PAM matrices, such as PAM100 and PAM250, were generated to reflect the kinds of amino acid substitutions that occur in distantly related proteins.
In the matrix of Figure 3.9, λ is chosen to correspond to an evolutionary distance of 1 PAM. As we make λ larger, we model a greater evolutionary distance. So we could, for example, make a PAM2, PAM3, or PAM4 matrix by multiplying λ. However, this approach fails for greater evolutionary distances:
- Adjusting λ does not account for multiple substitutions at the same position.
- The PAM250 matrix is instead produced by repeated matrix multiplication of the PAM1 matrix by itself (raising it to the 250th power). It is one of the common matrices used for BLAST database searches.
- This matrix applies to an evolutionary distance at which proteins share about 20% amino acid identity.
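The difference between scaling λ and multiplying matrices can be seen on a small example. The 3 x 3 matrix below is a hypothetical stand-in for PAM1 (columns sum to 1, diagonals of 0.99); it is not Dayhoff's matrix, and the helper names are introduced here for illustration.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result

# Hypothetical PAM1-like matrix: entry [i][j] is P(original j -> i).
TOY_PAM1 = [[0.990, 0.004, 0.003],
            [0.006, 0.990, 0.007],
            [0.004, 0.006, 0.990]]

toy_pam2 = mat_pow(TOY_PAM1, 2)
toy_pam250 = mat_pow(TOY_PAM1, 250)

print(round(toy_pam2[0][0], 6))      # compare with 0.98 from doubling lambda
print([round(x, 3) for x in toy_pam250[0]])
```

Doubling λ would give a diagonal of exactly 0.98, whereas the squared matrix gives a slightly larger value because some residues change and then change back; that correction for multiple substitutions grows with distance.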
DAYHOFF MODEL STEP 6: FROM A MUTATION PROBABILITY
MATRIX TO A RELATEDNESS ODDS MATRIX.
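$$R_{ij} = \frac{M_{ij}}{f_i} \qquad (3.3)$$
Equation (3.3) describes an odds ratio: M_ij is the probability that amino acid j is replaced by amino acid i, and f_i is the frequency with which amino acid i occurs by chance.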
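As a minimal sketch of this step, the function below divides each mutation probability by the frequency of the replacement residue, assuming the usual Dayhoff form R_ij = M_ij / f_i. The matrix and frequencies are hypothetical toy values for a three-letter alphabet.

```python
def relatedness_odds(M, f):
    """Return R with R[i][j] = M[i][j] / f[i] (odds relative to chance)."""
    n = len(M)
    return [[M[i][j] / f[i] for j in range(n)] for i in range(n)]

# Hypothetical mutation probability matrix (columns sum to 1) and frequencies.
TOY_M = [[0.6, 0.2, 0.3],
         [0.1, 0.5, 0.3],
         [0.3, 0.3, 0.4]]
TOY_F = [0.5, 0.2, 0.3]

R = relatedness_odds(TOY_M, TOY_F)
for row in R:
    print([round(x, 2) for x in row])
```

A value above 1 means the pairing is seen more often than expected by chance, and a value below 1 means it is seen less often.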
