CS262 Lecture 1 Notes

Computational Genomics / Biology for CS262 / Sequence Alignment
Scribed by Hong T. Lam, January 6, 2004

The Goal of Genomics
Study Organisms at the DNA Level
• Read a complete genome such as the human DNA. (DNA Sequencing & Assembly)
• Identify parts such as the genes encoded by the DNA sequence. (Gene Finding)
• Figure out the connections between parts, such as how genes interact with each other.
• Gene Expression: The process by which the genetic code is translated into structures present and functioning in the cell. Expressed genes are transcribed into different types of RNA, of which mRNA is the only type that is translated into proteins. Gene expression provides information about how a gene functions and how it differs from other genes. DNA microarrays can be used to compare gene expression in different populations of cells, since cells differ in their gene expression patterns and levels. (Microarrays & Regulation)

Study Evolution at the DNA Level
• Compare whole genomes from multiple organisms. (Large-Scale Comparative Genomics)
• Quantify the evolution of biological sequence. (Phylogeny & Evolution)
• Uncover the evolutionary tree.

The Role of CS in Biology
Computer science plays an essential role in biology. With biology becoming an information science, new high-throughput technology is needed. The shift to high-throughput technologies in biology has led to an explosion of genomic data.

Basic Computational Methods for Analysis of Biological Sequences
• Sequence Alignment Algorithms
• Dynamic Programming
• Hidden Markov Models

Hong T. Lam

Page 1

1/10/2004

Genomics Applications Using Basic Computational Methods
• DNA Sequencing: The process of determining the exact order of the long string of bases (A, T, C, G) that makes up the DNA of an organism. The genomes of several organisms, including human, have been completely sequenced.
• Comparison of DNA and proteins across organisms
• Discovery of genes, promoters, and regulatory sites

Paradigms in Biology
There are two paradigms in biology.
• Molecular Paradigm (Genetic Dogma): DNA → RNA → polypeptide
DNA is transcribed into RNA (rRNA, tRNA, snRNA, mRNA) through a process known as transcription. mRNA is translated into polypeptides through a process called translation, and the polypeptides then fold into 3-D protein structures. An organism consists of many different types of proteins.
• Evolution Paradigm: All organisms originate from a common ancestor and are connected by an evolutionary tree.

Basic Biology for CS262
Structures of Biomolecules
• The cell is composed of DNA in the nucleus and proteins in the cytoplasm, all of which are enclosed in a lipid membrane.
• The nucleic acids (DNA and RNA) form the genetic material of all living organisms. They are found mainly in the nucleus of the cell.
• A nucleotide has three components:
   • Sugar (ribose in RNA, deoxyribose in DNA)
   • Phosphoric acid
   • Nitrogenous base: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in DNA or Uracil (U) in RNA

Two nucleotides are linked together by attaching the phosphate group on the 5’ carbon of one nucleotide's sugar to the 3’ carbon atom of the sugar of the other nucleotide.

Nucleic acids are linear, unbranched polymers of nucleotides. While RNA is single-stranded, DNA consists of two strands that run in opposite directions to each other (anti-parallel). The strands are joined together by pairing of the nitrogenous bases (Watson & Crick base pairs). By convention, DNA and RNA are read from the 5’ end to the 3’ end; these labels refer to the numbering of the carbon atoms on the sugar ring.

[Figure: base pairing in a double-stranded DNA segment (A=T, G=C) shown next to the corresponding single-stranded RNA, in which T is replaced by U (T→U).]

Three nucleotides of an mRNA strand form a codon that specifies one amino acid. This makes sense because a codon made from only one or two nucleotides would not produce enough combinations (codons) to code for all 20 of the known amino acids.

1 nucleotide = 4 possible codons
2 nucleotides = 4 * 4 = 16 possible codons
3 nucleotides = 4 * 4 * 4 = 64 possible codons for 20 amino acids

Since a three-nucleotide codon produces 64 possible combinations and there are only 20 known amino acids, the genetic code must be redundant (degenerate): several different codons specify the same amino acid. The parsimony principle (that the simplest solution is often right) argues against a four-nucleotide codon.
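The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the names `codon_count` and `shortest_sufficient_length` are made up for this example and do not come from the lecture.

```python
# Count the distinct codons of each length over the 4-letter RNA alphabet
# and find the shortest length that can encode 20 amino acids.
ALPHABET = "ACGU"
NUM_AMINO_ACIDS = 20  # the 20 amino acids cited in the notes


def codon_count(k: int) -> int:
    """Number of distinct codons of length k."""
    return len(ALPHABET) ** k


def shortest_sufficient_length(targets: int = NUM_AMINO_ACIDS) -> int:
    """Smallest codon length with at least `targets` distinct codons."""
    k = 1
    while codon_count(k) < targets:
        k += 1
    return k


for k in (1, 2, 3):
    enough = codon_count(k) >= NUM_AMINO_ACIDS
    print(f"{k} nucleotide(s): {codon_count(k)} codons, enough: {enough}")
```

Lengths 1 and 2 give 4 and 16 codons, both fewer than 20, so 3 is the shortest length that works.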

Two amino acids form a dipeptide.

[Figure: two amino acids, each of the form H2N-CHR-COOH, are joined by a peptide bond (CO-NH) to give the dipeptide H2N-CHR-CO-NH-CHR-COOH.]

A linear sequence of amino acids forms a polypeptide, which folds to form a complex 3-D protein structure. The structure of a protein is intimately connected to its function.

How does DNA function?
• In the cell, DNA provides all the information needed to function. This raises two questions about DNA as the carrier of genetic information.
Q: How is the information stored in DNA?
A: Stored as nucleotide sequences.
Q: How is the stored information used?
A: Used in protein synthesis.
• Ribosomes are the sites of protein synthesis. Since DNA is mainly found in the nucleus and ribosomes are found in the cytoplasm, how does information flow from DNA to protein? An intermediary is needed: ribonucleic acid (RNA). Three types of RNA are involved (mRNA, tRNA, rRNA).

Messenger RNA (mRNA) is synthesized on a DNA template by a process called transcription, during which information is copied from one strand of DNA to mRNA. mRNA serves as the messenger that tells the ribosomes what proteins to make.

So, how is the information carried by the mRNA interpreted? Think of an mRNA sequence as a sequence of “triplets”, for example, AUGCCGGGAGUAUAG as AUG-CCG-GGA-GUA-UAG. Each triplet (codon) maps to an amino acid. Hence, the sequence of triplets (codons) is translated to a sequence of amino acids according to the genetic code. In 1968, Nirenberg and Khorana (together with Holley) received the Nobel Prize in Physiology or Medicine for cracking the universal genetic code, which maps each triplet (codon) to an amino acid. It shows how the nucleotide language of mRNA is translated into the amino acid language of proteins.
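The triplet reading of the example sequence can be sketched in Python as follows. The codon table here is deliberately partial: it contains only the five codons that appear in the example, with their assignments from the standard genetic code, and the function names are hypothetical.

```python
# Partial codon table: only the codons used in the example sequence.
# A full table would have all 64 codons.
CODON_TABLE = {
    "AUG": "Met",   # also the start codon
    "CCG": "Pro",
    "GGA": "Gly",
    "GUA": "Val",
    "UAG": "Stop",  # stop codon
}


def split_codons(mrna: str) -> list[str]:
    """Split an mRNA string into consecutive triplets."""
    if len(mrna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [mrna[i:i + 3] for i in range(0, len(mrna), 3)]


def translate(mrna: str) -> list[str]:
    """Map each codon to its amino acid, stopping at a stop codon."""
    peptide = []
    for codon in split_codons(mrna):
        amino_acid = CODON_TABLE[codon]  # KeyError for codons not in the partial table
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide


print("-".join(split_codons("AUGCCGGGAGUAUAG")))  # AUG-CCG-GGA-GUA-UAG
print("-".join(translate("AUGCCGGGAGUAUAG")))     # Met-Pro-Gly-Val
```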


Transfer RNA (tRNA) floats freely in the cytoplasm. It is the molecule that carries an amino acid to the ribosome when the information on the mRNA calls for that amino acid to be added to the protein being synthesized. Each tRNA binds one specific amino acid. In 1965, Robert Holley solved the structure of tRNA. Although tRNA is a single-stranded molecule, stretches of complementary nucleotides hydrogen bond to form short double-stranded regions, which bend the tRNA into a cloverleaf shape. All tRNAs have a similar cloverleaf structure. At a position on one of the leaves, a sequence of three nucleotides forms an anti-codon, which base pairs with a specific mRNA codon. This anti-codon/codon binding is crucial: it is what matches each amino acid to its codon.

• rRNA serves as part of the structure of the ribosome, the protein/RNA complex that synthesizes proteins according to the information carried by the mRNA.
• So, to put this all together: The DNA code is transcribed into a complementary mRNA molecule within the nucleus. The mRNA enters the cytoplasm, where it associates with a ribosome. The mRNA code is then translated into a polypeptide chain. The codon AUG signals the start of translation. An activated tRNA ferries the first amino acid, methionine, to the ribosome. The tRNA anti-codon binds to the AUG codon on the mRNA. The whole complex shifts and the next codon is read by another tRNA. As the two amino acids are held in position, a peptide bond is formed between them. The second tRNA accepts the growing protein chain and the methionine tRNA is released. The process continues until a stop codon is encountered, at which point translation is finished. The complete peptide chain is released, and the ribosome disassembles to be reused for translating another mRNA.
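The first steps of this summary, transcription into a complementary mRNA and codon/anti-codon pairing, can be sketched in Python. This is a simplified illustration under stated assumptions: the template strand is written 3’→5’ so that the mRNA comes out 5’→3’, the anti-codon is also written 3’→5’ so that it lines up with its codon position by position, and the function names and the example template are made up for this sketch.

```python
# Base-pairing rules (Watson-Crick, with U replacing T in RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') complementary to a DNA template strand (3'->5')."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


def anticodon(codon: str) -> str:
    """Return the tRNA anti-codon (3'->5') that pairs with an mRNA codon (5'->3')."""
    return "".join(RNA_PAIR[base] for base in codon)


mrna = transcribe("TACGGC")  # hypothetical 6-base template
print(mrna)                  # AUGCCG
print(anticodon(mrna[:3]))   # UAC, pairing with the start codon AUG
```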

What is a gene?
• A genome is the entire DNA content of an organism: the set of all its genes plus non-coding (“junk”) DNA.
• A gene is a sequence of nucleotides on the DNA that encodes a polypeptide.


Central Dogma of Molecular Biology
DNA → RNA → Protein → Phenotype

[Figure (zoom in): DNA is transcribed into tRNA, rRNA, snRNA, and mRNA; mRNA is translated into a polypeptide.]

DNA is transcribed into different types of RNA (tRNA, rRNA, snRNA, mRNA). Transcription consists of three key steps: initiation, elongation, and termination. The transcripts (mRNA molecules) contain the information to be translated into polypeptides that form proteins.
• Each gene has its own promoter(s). Promoters are sequences in the DNA just upstream of the transcribed region that define the sites of initiation. The role of the promoter is to attract RNA polymerase to the correct start site so that transcription can be initiated.
• The mRNA transcripts are sometimes edited before they serve as a blueprint for a protein. The processing involves the removal of intervening, non-coding sequences (introns) from the transcript. Exons, the nucleotide segments whose codons will be expressed, are spliced together to form the mature mRNA.
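Splicing can be pictured as cutting the exons out of the transcript and concatenating them. The Python sketch below assumes the exon boundaries are already known and given as half-open index intervals; the sequences, the interval representation, and the function names are all hypothetical and chosen only for illustration.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments (0-based, half-open intervals, in order)."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def introns_between(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the intervals lying between consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


# Made-up transcript: two exons separated by one intron.
exon1, intron, exon2 = "AUGCCG", "GUAAGUUUUCAG", "GGAGUAUAG"
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

print(splice(pre_mrna, exons))  # AUGCCGGGAGUAUAG
for start, end in introns_between(exons):
    print("removed intron:", pre_mrna[start:end])
```

In practice the hard part is finding the exon boundaries, which is the gene-finding problem mentioned earlier; this sketch takes them as given.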

How are genes regulated?
• An adult multi-cellular organism contains a wide variety of cell types, such as muscle, nerve, and blood cells, even though all of these cell types contain the same DNA. This differentiation arises because different cell types express different genes. Hence, genes can be switched on and off. This raises several questions about the regulation of genes.
Q: What turns genes on and off?
Q: When is a gene turned on or off?
Q: Where (in which cells) is a gene turned on?
Q: How many copies of the gene product are produced?


Regulatory sequences are binding sites for proteins. They are often short stretches of DNA (~25 nucleotides), consisting of inexactly repeating patterns called motifs. Motifs stand out as highly conserved regions in a multiple sequence alignment.

Complete Genomes & Evolution
There has been an explosion of genomic data. Complete genomes of some organisms have been sequenced (human, pig, dog, rat, mouse, etc.). DNA in these different organisms has been compared to study evolution occurring at the DNA level, resulting from sequence edits (insertion, deletion, mutation) and rearrangements (inversion, translocation, duplication). Similarity between DNA sequences has suggested that all organisms come from a common ancestor, connected by an evolutionary tree (evolution paradigm). The evolutionary process occurs at different rates. If DNA mutations occur in non-critical regions, they are incorporated into the next generation. If the mutations occur in critical regions, they are unlikely to be propagated onward. However, some mutations have positive effects, and thus are conserved in subsequent generations, such as in the case of the highly conserved Interleukin regions found in human and mouse. Sequence conservation implies functionality. The fact that evolution did not modify a region of the sequence suggests that it is functionally important to the organism.

[Figure: Interleukin regions in human and mouse]

Pairwise sequence alignment can be used to find sequences conserved between organisms. It can reveal whether sequences are related. This information can help to determine their functional and structural roles and provide clues to the common ancestor.

Sequence Alignment
Given two strings, x = x1x2…xM and y = y1y2…yN, and a scoring function for calculating matched letters and gap penalty, an alignment is an assignment of gaps to positions 0,…,M in x and 0,…,N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence.


AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
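The definition of an alignment can be turned into a small validity check. The Python sketch below is one possible reading of the definition, not code from the lecture: both gapped rows must have the same length, removing the gaps must give back the original strings, and no column may pair a gap with a gap. The function name and the short test strings (prefixes of the two sequences above) are chosen only for illustration.

```python
GAP = "-"


def is_valid_alignment(x: str, y: str, row_x: str, row_y: str) -> bool:
    """Check that (row_x, row_y) is an alignment of x and y."""
    # Every letter must be lined up with a letter or a gap in the other row.
    if len(row_x) != len(row_y):
        return False
    # Gaps are only inserted; the letters and their order are unchanged.
    if row_x.replace(GAP, "") != x or row_y.replace(GAP, "") != y:
        return False
    # A column of two gaps aligns nothing and is not allowed.
    return all(not (a == GAP and b == GAP) for a, b in zip(row_x, row_y))


x, y = "AGGCTA", "TAGCTA"
print(is_valid_alignment(x, y, "-AGGCTA", "TAG-CTA"))    # True
print(is_valid_alignment(x, y, "AGGCTA", "TAG-CTA"))     # False: rows differ in length
print(is_valid_alignment(x, y, "--AGGCTA", "-TAG-CTA"))  # False: first column is gap/gap
```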
Optimal Alignment
What is a good alignment? It is the “best” way to match the letters of one sequence with those of the other. The problem is: how do we define “best”?

If an alignment is a hypothesis that two sequences come from a common ancestor through sequence edits, then optimal alignment means finding the least-cost transformation of one sequence into the other using a set of allowed operations (sequence edits, inversions, translocations, duplications). The least-cost transformation is measured as the edit distance between the two sequences, defined as the minimum number of edit operations needed to transform the first string into the second.

Since most DNA changes during evolution are due to insertion, deletion, and substitution, the edit distance can be used as a rough measure of the number of DNA replications separating two sequences. Although the edit distance is not an accurate model of the underlying evolutionary process, it serves as an approximation that is easy to optimize algorithmically.

Likewise, an optimal alignment is a pairing of the sequences that retains the order of letters in each sequence, introducing gaps where necessary, such that the scoring function returns an optimal score.
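As an illustration of why the edit distance is easy to optimize, here is the standard dynamic-programming computation in Python. It is a sketch under one assumption: only insertion, deletion, and substitution are allowed, each at cost 1 (inversions, translocations, and duplications are not modeled). The function name is chosen for this example.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning x into y."""
    m, n = len(x), len(y)
    # dist[i][j] = edit distance between the prefixes x[:i] and y[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete all i letters of x[:i]
    for j in range(1, n + 1):
        dist[0][j] = j  # insert all j letters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete x[i-1]
                dist[i][j - 1] + 1,                 # insert y[j-1]
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[m][n]


print(edit_distance("AGGCTA", "TAGCTA"))  # 2
```

The table has (M+1) * (N+1) entries and each is filled in constant time, so the running time is proportional to M * N.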

Scoring Function
Match: +m
Mismatch: –s
Gap: –d

Score F = (# matches) * m – (# mismatches) * s – (# gaps) * d

The optimality of an alignment is measured by the value of the scoring function. The total score of an alignment is the sum of the terms for each pair of aligned letters and the terms for each gap. A match receives a score of +m, a mismatch receives a penalty of s (a score of –s), and a gap receives a penalty of d (a score of –d).
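The scoring function can be evaluated directly on a given alignment. In the Python sketch below, the parameter values m = 1, s = 1, d = 2 are arbitrary choices for illustration, not values from the lecture, and the function name is hypothetical.

```python
def score_alignment(row_x: str, row_y: str, m: int = 1, s: int = 1, d: int = 2) -> int:
    """F = (# matches) * m - (# mismatches) * s - (# gaps) * d."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1        # a letter aligned to a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# Two candidate alignments of AGGCTA and TAGCTA:
print(score_alignment("-AGGCTA", "TAG-CTA"))  # 5 matches, 2 gaps: 5 - 4 = 1
print(score_alignment("AGGCTA", "TAGCTA"))    # 4 matches, 2 mismatches: 4 - 2 = 2
```

With these parameters the ungapped alignment scores higher. Lowering the gap penalty to d = 1 and raising the match score to m = 2 reverses the ranking (8 versus 6), which shows that the “best” alignment depends on the choice of m, s, and d.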
