RNA-seq experiments for bioinformaticians

This presentation covers some quick facts about RNA-seq experiments and then short-read alignment methods.
Published by: ashis_biswas on Mar 18, 2013
Copyright:Attribution Non-commercial


RNA-seq experiment for Bioinformaticians

Ashis Kumer Biswas
BioMeCIS at CSE, UT Arlington

April 12, 2012

Ashis Kumer Biswas

RNA-seq and Bioinformatics

Outline

1. Basics of RNA-seq technology
   Quick facts about RNA-seq
   RNA-seq steps
2. RNA-seq for Bioinformaticians
   Short-Read Alignments

Quick facts about RNA-seq

It is a massively parallel sequencing method for transcriptome analysis.

What is transcriptome

The transcriptome T is the set of RNA molecules in a cell or a population of cells.

What is RNA

Figure: The Cell [1]

Figure: DNA vs. RNA [1]

Figure: RNA secondary structure for the RNA sequence (5’ end)–ACCCCCUCCUUCCUUGGAUCAAGGGGCUCAA–(3’ end)

What is RNA

Types of RNA:
mRNA (messenger RNA): carries the code for the synthesis of one or more proteins from the DNA in the nucleus into the cytoplasm, where protein manufacturing takes place in the organelle called the “ribosome”.
tRNA (transfer RNA): brings amino acids to the ribosome, where the mRNA is translated into amino acid sequences.
rRNA (ribosomal RNA): rRNA and some proteins combine to form a nucleoprotein called the “ribosome”, which serves as the site of protein synthesis and carries the necessary enzymes.

What is RNA

More types of RNA:
ncRNA (non-coding RNA): RNAs that are not translated into protein. Examples:
   tRNA
   rRNA
   snoRNA (small nucleolar RNA): guides the chemical modifications of other RNAs.
   miRNA (microRNA): a post-transcriptional regulator.
   siRNA (small interfering RNA): involved in the RNA interference pathway (i.e., in a certain gene expression pathway).
   piRNA (piwi-interacting RNA): forms RNA-protein complexes that regulate some post-transcriptional gene expression.
   lncRNA (long ncRNA): non-coding RNA longer than 200 nucleotides.

What is RNA

Roles of RNA in the “central dogma of molecular biology”: (figure)

What is transcriptome

The transcriptome T is the set of RNA molecules. A genome does not change in a living cell except for mutation. In contrast, a transcriptome varies according to external environmental conditions, across different stages of the cell cycle, or in disease conditions.

What is Genome

The full set of DNA sequences of an organism is called its genome. Humans have 23 pairs of chromosomes.

Why analyze the transcriptome?

The research branch “transcriptomics” deals with:
Examining expression profiles (i.e., expression levels) of mRNAs in a given cell population.
Interpreting the functional elements of the genome.
Revealing the molecular constituents of cells and tissues.
Understanding disease.

The transcriptome can be seen as a precursor for the proteome, i.e., the entire set of proteins expressed by a genome.

What is Massively Parallel Sequencing

This technique makes it possible to sequence 1 million to several hundred million short reads (50-400 bases) simultaneously from amplified DNA clones.
This technology emerged in late 1996 and has been commercially available since 2005.
Sequencing cost has decreased; the ultimate goal is $1000 per genome sequenced.


RNA-seq steps


Step 1

RNAs that have a poly-A tail (i.e., a tail of many adenine (A) bases) are isolated from the cytoplasm of the sample cells.

Figure: Mature mRNA (m-mRNA)

Step 2

The poly-A RNAs are reverse transcribed to produce double-stranded cDNA (complementary DNA).

Transcription

It is the process of producing single-stranded mRNA from a double-stranded DNA sequence.

Reverse Transcription

It is the opposite of transcription.
It is a way of acquiring a gene sequence: the double-stranded DNA fragment from which the mRNA was transcribed (minus any introns that were spliced out of the mature mRNA).
After reverse transcription, the double-stranded DNA produced is called cDNA (complementary DNA).
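The base-pairing rule behind reverse transcription can be sketched in a few lines of Python. This is only a toy model of the sequence relationship, not of the enzymatic reaction; the function names `first_strand_cdna` and `second_strand_cdna` are made up for this illustration, and sequences are assumed to be written 5' to 3'.

```python
# Complement of each RNA base in the DNA alphabet.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first cDNA strand (5'->3') for an mRNA written 5'->3'.

    The strand is complementary and antiparallel to the mRNA, so the
    complemented bases are read in reverse order.
    """
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))


def second_strand_cdna(mrna):
    """Return the second cDNA strand: the mRNA sequence with U replaced by T."""
    return mrna.replace("U", "T")


mrna = "AUGGCC"
print(first_strand_cdna(mrna))   # GGCCAT
print(second_strand_cdna(mrna))  # ATGGCC
```

Together the two strands form the double-stranded cDNA that is fragmented and sequenced in the following steps.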

Step 3

The cDNAs are randomly fragmented into pieces of 35-400 base pairs.

Step 4

Using massively parallel high-throughput sequencing machines (e.g., Illumina, SOLiD, Roche, etc.), the library of short cDNA fragments is sequenced.

Sequenced files

Suppose this is one short-read sequence: (figure)

The second section of the file contains the quality of each character of the sequence.

Phred Quality Score: $Q = -10 \log_{10} P$, where $P$ is the base-calling error probability measured by the sequencing machine.

In other words, $P = 10^{-Q/10}$.

For example, if $Q = 30$, then $P = 10^{-30/10} = 10^{-3} = \frac{1}{1000}$.

So, the base call accuracy would be $1 - P = 1 - \frac{1}{1000} = 99.9\%$.

The range of Phred scores $Q$ is $[0, 93]$.
The Phred scores $Q$ are converted to ASCII characters using a shift of 33 (ASCII letter $= Q + 33$).
The ASCII codes therefore range over $[33, 126]$, i.e., the characters [!, ~].

Here are the scores after the conversion: (figure)
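The Phred formulas and the shift-by-33 encoding above can be checked with a short Python sketch. The function names are illustrative only, and the quality string `"!5?I~"` is a made-up example rather than data from a real sequencing run.

```python
def phred_to_error_prob(q):
    """Error probability P = 10^(-Q/10) for a Phred score Q."""
    return 10 ** (-q / 10)


def decode_quality_string(quality, offset=33):
    """Convert ASCII quality characters to Phred scores (Q = ASCII code - 33)."""
    return [ord(ch) - offset for ch in quality]


def encode_quality_scores(scores, offset=33):
    """Convert Phred scores to ASCII quality characters (ASCII code = Q + 33)."""
    return "".join(chr(q + offset) for q in scores)


scores = decode_quality_string("!5?I~")
print(scores)  # [0, 20, 30, 40, 93]

for q in scores:
    p = phred_to_error_prob(q)
    print(f"Q={q:2d}  P={p:.3g}  accuracy={100 * (1 - p):.4g}%")
```

For Q = 30 this reproduces the 99.9% accuracy worked out above, and the characters `!` and `~` correspond to the two ends of the score range, 0 and 93.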

Step 5

Align the short-read sequences to exonic reference sequences.

Types of short-reads: (figure)

Step 6

Quantify the expression levels.

Units of measurements

For each transcript, the expression level is quantified using a metric called RPKM.
RPKM: number of reads per kilobase of transcript per million mapped reads.

Suppose that an RNA-seq experiment yields 10 million short reads, of which only 8 million could be mapped to the reference genome. Of those mapped reads, 1000 alignments map to a transcript of size 1 kilobase. So, the RPKM score for that transcript is

$$\mathrm{RPKM} = \frac{1000}{1 \times 8} = 125$$

Why normalize by length? The number of RNA-seq reads generated from a transcript is proportional to that transcript’s relative abundance in the sample, and also to its length. Suppose a sample has 2 transcripts, A and B, both of which are present at the same abundance. If B is twice as long as A, an RNA-seq experiment will contain twice as many reads from B as from A. So, in the RPKM calculation the read counts are normalized by each transcript’s length.
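The RPKM arithmetic can be written as a small Python function. This is a minimal sketch; the function name `rpkm` and its parameter names are chosen for this illustration, and the read counts for transcripts A and B are made-up numbers consistent with the "B is twice as long as A" scenario.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


# Worked example: 1000 reads on a 1 kb transcript, 8 million mapped reads.
print(rpkm(1000, 1000, 8_000_000))  # 125.0

# Transcripts A and B at equal abundance; B is twice as long and
# therefore collects twice as many reads (hypothetical counts).
print(rpkm(500, 1000, 8_000_000))   # A: 62.5
print(rpkm(1000, 2000, 8_000_000))  # B: 62.5
```

Note that the denominator uses the 8 million mapped reads, not the 10 million reads sequenced. After length normalization A and B receive the same score, which matches their equal abundance.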

RNA-seq steps

So, this is RNA-seq: (figure)

What we can get from an RNA-seq experiment

To quantify (count) the mRNA abundance.
To quantify the changes in the expression level of each transcript during the developmental stages of cells, or under different conditions.

RNA-seq for Bioinformaticians

Each RNA-seq experiment (“lane”) produces more than 10 million short-reads.
These are required to be aligned to the reference genome.
Identifying the non-coding RNA.
... and many more.

Short-Read Alignments

The Mapping Problem

Input: $m$ short-reads $S_1, S_2, \ldots, S_m$, each $l$ bp (base pairs) long, and an approximate reference genome $R$.
Output: What are the positions $x_1, x_2, \ldots, x_m$ along $R$ where each short read matches?

In the human genome example:
Length of the genome $|R|$ is $3 \times 10^9$ bp.
$m$ is usually $10^7$ to $10^8$.
Length of each short-read $l$ is typically 50-200 bp.

The Mapping Problem: Solution 1

Naive Algorithm: Scan the reference genome $R$ for each short-read $S_i$, matching the read at each position $p$ and picking the best match.
Time Complexity: $O(ml|R|)$
For the human genome example: if $m = 10^8$, $l = 50$, $|R| = 3 \times 10^9$, then complexity $= 10^8 \times 50 \times 3 \times 10^9 = 1.5 \times 10^{19}$.
This is clearly impractical.
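A minimal Python sketch of the naive scan follows. The slide does not say how the "best match" is scored, so counting mismatched characters (Hamming distance) is an assumption made here, and the reference and reads are toy strings.

```python
def naive_best_match(read, reference):
    """Slide the read along the reference and keep the position with the
    fewest mismatches. Returns (position, mismatches).

    Each of the |R| - l + 1 positions costs O(l) comparisons, which gives
    O(l * |R|) per read and O(m * l * |R|) for m reads.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for p in range(len(reference) - len(read) + 1):
        window = reference[p:p + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = p, mismatches
    return best_pos, best_mismatches


reference = "ACGTTGCATGCA"
print(naive_best_match("TGCA", reference))  # (4, 0): exact match
print(naive_best_match("GTTC", reference))  # (2, 1): one mismatch
```

Ties are resolved in favor of the leftmost position, which is a design choice of this sketch only.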

The Mapping Problem: Solution 2

KMP (Knuth-Morris-Pratt) Algorithm:
Time Complexity: $O(m(l + |R|)) = O(ml + m|R|)$
For the human genome example: if $m = 10^8$, $l = 50$, $|R| = 3 \times 10^9$, then complexity $\approx 3 \times 10^{17}$.
This is also impractical.
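For reference, here is a compact Python sketch of KMP exact matching. It is a textbook formulation with illustrative function names and is not taken from any particular aligner.

```python
def build_failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it. Built in O(l) time."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(pattern, text):
    """Return all start positions of exact matches in O(l + |R|) time."""
    if not pattern:
        return []
    fail = build_failure_table(pattern)
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = fail[k - 1]  # allow overlapping matches
    return positions


print(kmp_find_all("TGCA", "ACGTTGCATGCA"))  # [4, 8]
print(kmp_find_all("ANA", "BANANA"))         # [1, 3]
```

Each read is processed in time linear in the reference, but the whole reference is still rescanned once per read, which is where the $m|R|$ term comes from.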

The Mapping Problem: Solution 3

Using Suffix Tree: First build a suffix tree for $R$. Once the tree is built, for each $S_i$ we can find matches efficiently by traversing the tree from the root.
Time Complexity: $O(ml + |R|)$
For the human genome example: if $m = 10^8$, $l = 50$, $|R| = 3 \times 10^9$, then complexity $\approx 8 \times 10^9$.
This looks practical. But saving the tree requires $O(|R| \log |R|)$ bits, i.e., ∼64 GB, which is impractical for most of today’s desktop computers.
This approach also allows only EXACT matching.
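The idea of indexing once and then traversing from the root can be illustrated with a much simpler structure. The Python sketch below builds an uncompressed suffix trie, which is not a true suffix tree: it takes quadratic time and space and is only usable on toy inputs, whereas a real suffix tree is built in linear time (for example with Ukkonen's algorithm). The class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # next character -> child Node
        self.starts = []    # start positions of suffixes passing through here


def build_suffix_trie(text):
    """Insert every suffix of text into a trie (O(|R|^2), toy sizes only)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def find_exact(trie, read):
    """Follow the read from the root; cost is O(l), independent of |R|."""
    node = trie
    for ch in read:
        node = node.children.get(ch)
        if node is None:
            return []  # the read does not occur exactly
    return node.starts


trie = build_suffix_trie("ACGTTGCATGCA")
print(find_exact(trie, "TGCA"))  # [4, 8]
print(find_exact(trie, "GTTC"))  # []
```

The second query shows the limitation noted above: a read with a single mismatch is simply not found.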

The Mapping Problem: Solution 4

BWT (Burrows-Wheeler Transform): It was invented by Burrows and Wheeler in 1994.
It is used in the data compression program bzip2.
The transformation does not change the characters’ values. It changes the order of the characters.
If substrings occur frequently in the original text, the transformed text will contain runs in which a single character repeats multiple times in a row.
This kind of transformed text can be easily compressed by other algorithms such as the “Move-To-Front Transform” or “Run-Length Encoding”.

Input:  SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output: TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
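The transform itself can be sketched in Python by sorting all rotations of the text and reading off the last column. Two assumptions are made here: a `$` terminator that sorts before every other character is appended (a common convention that makes inversion unambiguous), and the toy text "BANANA" is used. Production tools derive the BWT from a suffix array instead of materializing every rotation.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy sizes only)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("BANANA"))  # ANNB$AA
```

The output is a permutation of the input characters, with equal characters tending to cluster together (`NN`, `AA`).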

The Mapping Problem: Solution 4

BWT (Burrows-Wheeler Transform): (step-by-step figures)

This shows the BWT is reversible!
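Reversibility can be demonstrated with the classic table-rebuilding procedure: repeatedly prepend the transformed column to the table and re-sort. The Python sketch below assumes the text was terminated with `$` before transforming; this simple version is quadratic and meant only for toy inputs.

```python
def inverse_bwt(transformed, terminator="$"):
    """Recover the original text from its Burrows-Wheeler Transform."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the BWT column to every row, then sort the rows again.
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The row ending with the terminator is the original text.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(inverse_bwt("ANNB$AA"))  # BANANA
```

After as many rounds as there are characters, the table holds every rotation of the original text in sorted order, so nothing is lost by the transform.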

The Mapping Problem: Solution 4

BWT (Burrows-Wheeler Transform): Can we answer these questions now?
Is the letter “B” followed by “A” or vice versa?
Is the substring “ANA” present in the original text?
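Both questions can be answered directly on the transformed text with backward search, the core operation of BWT-based indexes. The Python sketch below assumes the text is "BANANA" terminated with `$`, whose transform is `ANNB$AA`. The rank helper `occ` scans a prefix each time for clarity; real indexes precompute these counts.

```python
from collections import Counter


def count_occurrences(bwt_text, pattern):
    """Count exact occurrences of pattern using only the BWT of the text."""
    # first_row[c] = first row of the sorted rotations that starts with c.
    counts = Counter(bwt_text)
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def occ(ch, i):
        """Number of occurrences of ch in bwt_text[:i]."""
        return bwt_text[:i].count(ch)

    lo, hi = 0, len(bwt_text)  # half-open range of matching rows
    for ch in reversed(pattern):  # extend the match one character leftward
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


b = "ANNB$AA"  # BWT of "BANANA$"
print(count_occurrences(b, "BA"))   # 1: "B" is followed by "A"
print(count_occurrences(b, "AB"))   # 0: "A" is never followed by "B"
print(count_occurrences(b, "ANA"))  # 2: "ANA" is present, twice
```

The loop runs once per pattern character, so with precomputed rank counts the query time depends on the read length and not on the genome length.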

The Mapping Problem

The popular program TopHat [2] uses the BWT algorithm to map short-reads to the reference genome. To store the transformed text of the human genome we need only $2 \times 3 \times 10^9$ bits (2 bits per base, about 750 MB).


Questions?


Thanks!


References

[1] Introduction and basic molecular biology. [Online]. Available: http://compbio.pbworks.com/w/page/16252897/Introduction%20and%20Basic%20Molecular%20Biology
[2] C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. Kelley, H. Pimentel, S. Salzberg, J. Rinn, and L. Pachter, “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks,” Nature Protocols, vol. 7, no. 3, pp. 562–578, 2012.
