RNA-Seq and ChIP-Seq principles

What is RNA-Seq analysis?


RNA-Seq (RNA sequencing) uses sequencing to identify and quantify the
RNA in a biological sample at a given moment.
Conceptually, RNA-Seq analysis is quite straightforward. It goes like this:
1. Produce sequencing data from a transcriptome in a “normal”
(control) state.
2. Match sequencing reads to a genome or transcriptome.
3. Count how many reads align to the region (feature). Let’s say 100
reads overlap with Gene A.
Now say the cell is subjected to a different condition, for example, a cold
shock is applied.
1. Produce sequencing data from the transcriptome in the “perturbed”
state.
2. Match sequencing reads to the same genome or transcriptome.
3. Count how many reads align to the same region (feature), gene A.
Suppose there are 200 reads overlapping with Gene A.

200 is twice as big as 100, so we can state that there is a two-fold
increase in the coverage. Since the coverage is proportional to the
abundance of the transcript (expression level), we can report that the
transcription level for gene A doubled under cold shock. We call this
change differential expression.
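
The arithmetic above can be written out as a short sketch. The counts are the illustrative values from the text, the function name fold_change is hypothetical, and the comparison assumes that both samples were sequenced to the same depth, which real data rarely satisfies without normalization.

```python
def fold_change(control_count, perturbed_count):
    """Ratio of perturbed to control read counts for a single feature."""
    if control_count <= 0:
        raise ValueError("control count must be positive")
    return perturbed_count / control_count

control = 100    # reads overlapping Gene A in the control state
perturbed = 200  # reads overlapping Gene A after the cold shock

fc = fold_change(control, perturbed)
print(f"Gene A fold change: {fc}")
```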

What is quantifying?
Quantification means that sequencing is used to determine not the
composition of DNA fragments but their abundance.

What are the main methods for quantifying RNA-Seq reads?


A) Quantifying against a genome:
When quantifying against a genome, the input data will be:

• Sequencing reads (FASTQ)
• Genome including non-transcribed regions (FASTA)
• Annotations that label the features on the genome (BED, GFF, GTF)

The approach will intersect the resulting alignment file with the
annotations to produce abundances that are then filtered to retain
statistically significant results.
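
The intersection step can be sketched in a few lines. This is a minimal illustration, not how any particular read counter works: the features and alignments are made-up tuples using 0-based, half-open coordinates, and strand, spliced reads, and multi-mapping reads are all ignored.

```python
from collections import Counter

# Hypothetical annotation records: (chromosome, start, end, feature name).
features = [
    ("chr1", 100, 500, "geneA"),
    ("chr1", 800, 1200, "geneB"),
]

# Hypothetical aligned reads: (chromosome, start, end).
alignments = [
    ("chr1", 120, 220),
    ("chr1", 450, 550),
    ("chr1", 600, 700),
    ("chr1", 900, 1000),
    ("chr2", 100, 200),
]

def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def count_reads(alignments, features):
    """Count, for every feature, the reads that overlap it."""
    counts = Counter({name: 0 for _, _, _, name in features})
    for chrom, start, end in alignments:
        for f_chrom, f_start, f_end, name in features:
            if chrom == f_chrom and overlaps(start, end, f_start, f_end):
                counts[name] += 1
    return counts

counts = count_reads(alignments, features)
for name, n in counts.items():
    print(name, n)
```

The nested loop is quadratic and only suitable for toy input; real data would call for sorted intervals or an interval index.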
B) Classifying against a transcriptome
When classifying against a transcriptome, the input data will be:

• Sequencing reads (FASTQ)
• A transcriptome file that contains all transcripts of the organism
(FASTA)

The approach will directly produce abundances that are then filtered to
retain statistically significant results.

What is the typical outcome of an RNA-Seq analysis?


The goal of most RNA-Seq analyses is to find genes or transcripts that
change across experimental conditions. This change is called
differential expression. By observing these genes and transcripts, we can
infer the functional characteristics of the different states.

How does RNA-seq analysis work?


The RNA-seq protocol turns the RNA produced by a cell into DNA (cDNA,
complementary DNA) via a process known as reverse transcription. The
resulting DNA is then sequenced, and from the observed abundances of
DNA, we attempt to infer the original amounts of RNA in the cell.

What is the first decision when performing an RNA-Seq analysis?


In general, the most important distinguishing factor between methods is
the choice of the reference frame:
1. You may want to quantify against a genome, a database of all the
DNA of the organism.
2. You may choose to compare against a transcriptome, a database of
all known transcripts for the organism.
Using a genome allows for the discovery of potentially new transcripts and
gene isoforms. Using a transcriptome usually means more accurate
quantification but only against a predetermined “ground truth”.

How do I quantify mRNA abundances?
The most common measures used currently are:
1. Counts: The number of reads overlapping with a transcript.
2. RPKM/FPKM: Reads/Fragments per kilobase of transcript per
million mapped reads.
3. TPM: Transcripts per million.

How do I compare mRNA abundances?


Two types of comparisons may be necessary, and different criteria should
be used for each:
1. Within-sample comparisons. In this case, we compare the
expression of genes within the same experiment. For example: in
this experiment, is gene A expressed at a higher level than gene B?
2. Between-sample comparisons. In this case, we compare the
expression of genes across experimental conditions. For example:
has the expression of gene A changed across different
experimental conditions?

What is normalization?
When we assign values to the same labels in different samples, it
becomes essential that these values are comparable across the samples.
The process of ensuring that these values are expressed on the same
scale is called normalization.
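
A minimal sketch of one simple normalization, scaling raw counts to counts per million so that samples of different sequencing depth become comparable. The sample values and the function name counts_per_million are hypothetical, and this is only one of several possible normalization schemes.

```python
def counts_per_million(counts):
    """Scale raw read counts so that every sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}

# Hypothetical samples: sample_b was sequenced twice as deeply as sample_a.
sample_a = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_b = {"geneA": 200, "geneB": 600, "geneC": 1200}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

for gene in sample_a:
    print(gene, cpm_a[gene], cpm_b[gene])
```

In this made-up case the raw count of every gene doubles between the samples, yet the scaled values are identical: the difference came from sequencing depth, not from expression.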

What is the RPKM?


If N is the total number of reads mapped to a transcript and C is
the total number of reads mapped for the sample, we cannot just take N /
C as our measure of gene expression. A single copy of a longer transcript
will produce more fragments (larger N) than a single copy of a shorter
transcript. Instead, we may choose to divide the fraction of the reads
mapped to the transcript by the effective length L of the transcript,
measured in base pairs:
RPKM = 10^9 * N / (L * C)

The factor 10^9 combines the two scalings in the name: 10^3 for “per
kilobase” and 10^6 for “per million mapped reads”. Whereas N and C are
integer counts, 1/L is the inverse of a distance.
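
The formula translates directly into code. This is a sketch with made-up numbers; the function and parameter names are illustrative, with N, L and C as defined in the text.

```python
def rpkm(n_reads, length_bp, total_mapped):
    """RPKM = 10^9 * N / (L * C).

    n_reads      -- N, reads mapped to the transcript
    length_bp    -- L, effective transcript length in base pairs
    total_mapped -- C, total reads mapped in the sample
    """
    if length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return 1e9 * n_reads / (length_bp * total_mapped)

# Hypothetical transcript: 500 reads, 2,000 bp long, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(f"RPKM = {value}")

# A transcript twice as long with twice the reads gets the same RPKM.
same = rpkm(1000, 4000, 10_000_000)
print(f"RPKM = {same}")
```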

What is FPKM?
FPKM is an extension of the already flawed concept of RPKM to paired-end
reads. Whereas RPKM refers to reads, FPKM computes the same values
over read pair fragments.

What is TPM?
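TPM (transcripts per million) divides the length-normalized count N / L of
a transcript by the sum of N / L over all transcripts in the sample, rather
than by the total read count C, and multiplies the result by a million:
TPM = 10^6 * (N / L) / sum(N / L)
As a result, the TPM values of a sample always add up to one million.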
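
A small sketch of the TPM computation with made-up counts and lengths. It reads the “sum” in the formula as the sum of N / L over all transcripts in the sample, which is the usual definition of TPM; the function name tpm and the transcript names are hypothetical.

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and transcript lengths."""
    # Reads per base for each transcript (N / L).
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads assigned to any transcript")
    return {t: 1e6 * rate / total for t, rate in rates.items()}

# Hypothetical sample with three transcripts.
counts = {"tx1": 100, "tx2": 200, "tx3": 300}
lengths = {"tx1": 1000, "tx2": 2000, "tx3": 1000}

values = tpm(counts, lengths)
for name, v in values.items():
    print(name, round(v, 1))
print("sum:", round(sum(values.values()), 1))
```

Here tx2 has twice the reads of tx1 but is also twice as long, so both receive the same TPM.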

What kind of questions can we answer with a statistical test?


Here is a selection:
• How accurate (close to reality) are these results?
• For which observations do values change between conditions?
• Are there genes for which there is a trend in the data?

What types of statistical tests are common?


The pairwise comparison is one of the most common and conceptually
most straightforward tests. For example, a pairwise comparison would
compare the expression of a gene between two conditions. A gene that is
found to have changed its expression is called differentially expressed.
The set of all genes with modified expression forms what is called the
differential expression (DE).

What does a p-value mean?


A p-value is the probability of obtaining an effect at least as large as
the one you observe by random chance alone, that is, when no real
difference exists.
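
One way to make this definition concrete is a permutation test: shuffle the condition labels many times and count how often chance alone produces a difference at least as large as the observed one. This is a sketch on made-up expression values; the function name is hypothetical, and RNA-Seq tools typically rely on model-based tests rather than on this procedure.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_b) - mean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Split the shuffled values into two groups of the original sizes.
        diff = abs(mean(pooled[n_a:]) - mean(pooled[:n_a]))
        if diff >= observed:
            hits += 1
    # Adding one counts the observed labelling itself and avoids p = 0.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical normalized expression of one gene, four replicates each.
control = [98, 105, 101, 96]
cold_shock = [198, 210, 190, 205]

p = permutation_p_value(control, cold_shock)
print(f"estimated p-value: {p:.4f}")
```

With only four replicates per group there are few distinct ways to split the values, so the estimate cannot become arbitrarily small no matter how large the effect is.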

What does a differential expression file look like?


A differential expression file describes the changes in gene expression
across two conditions. It will be similar to:
• id: Gene or transcript name that the differential expression is
computed for,
• baseMean: The average normalized value across all samples,
• baseMeanA, baseMeanB: The average normalized gene expression
for each condition,
• foldChange: The ratio baseMeanB/baseMeanA,
• log2FoldChange: log2 transform of foldChange. When we apply a 2-
based logarithm the values become symmetrical around 0. A log2
fold change of 1 means a doubling of the expression level, a log2
fold change of -1 shows a halving of the expression level,
• pval: The probability that this effect is observed by chance,
• padj: The adjusted probability that this effect is observed by
chance.
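
The symmetry of the log2 transform is easy to check. The sketch below derives the foldChange and log2FoldChange columns from the two condition means; the gene names, the values, and the helper name fold_change_columns are made up for illustration.

```python
import math

def fold_change_columns(gene_id, base_mean_a, base_mean_b):
    """Build the fold-change columns of one differential expression row."""
    if base_mean_a <= 0 or base_mean_b <= 0:
        raise ValueError("condition means must be positive")
    fold = base_mean_b / base_mean_a
    return {
        "id": gene_id,
        "baseMeanA": base_mean_a,
        "baseMeanB": base_mean_b,
        "foldChange": fold,
        "log2FoldChange": math.log2(fold),
    }

doubled = fold_change_columns("geneA", 100.0, 200.0)
halved = fold_change_columns("geneB", 200.0, 100.0)

print(doubled["foldChange"], doubled["log2FoldChange"])
print(halved["foldChange"], halved["log2FoldChange"])
```

On the raw scale a doubling (2.0) and a halving (0.5) sit at unequal distances from 1, while on the log2 scale they are mirror images around 0.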

When many genes are tested at once, use padj rather than pval, as this
adjusted value corrects for the so-called multiple testing error: it
accounts for the many alternatives and their chances of influencing the
results that we see.

What are Kallisto and Salmon?


Kallisto and Salmon are software packages for quantifying transcript
abundances. The tools perform a pseudo-alignment of reads against a
transcriptome. In pseudo-alignment, the program tries to identify for each
read the target that it originates from.
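
The idea of finding the compatible targets of a read without computing a base-by-base alignment can be illustrated with a toy k-mer lookup. This is not how Kallisto or Salmon are implemented; the sequences, the names, and the value of k are invented, and the sketch only shows that shared k-mers can narrow a read down to a set of transcripts.

```python
def kmers(seq, k):
    """All distinct substrings of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index

def compatible_targets(read, index, k):
    """Intersect the transcript sets of all indexed k-mers in a read."""
    targets = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # k-mer absent from the index, e.g. a sequencing error
        targets = set(hits) if targets is None else targets & hits
    return targets if targets is not None else set()

# Hypothetical transcriptome: tx1 and tx1b share their first nine bases.
transcripts = {
    "tx1": "ACGTACGTTAGC",
    "tx1b": "ACGTACGTTCCA",
    "tx2": "TTGACCGTAGGA",
}
K = 5
index = build_index(transcripts, K)

for read in ["CGTACGTT", "CGTTAGC", "GACCGTAG", "GGGGGGGG"]:
    print(read, sorted(compatible_targets(read, index, K)))
```

A read that falls in the region shared by tx1 and tx1b stays ambiguous between the two, which is why quantification tools need a further statistical step to distribute such reads.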

How do I visualize the differentially expressed genes?


A simplified method for visualizing differentially expressed genes is a
heatmap.

What is ChIP-Seq?
ChIP-Seq stands for chromatin immunoprecipitation followed by
sequencing. In a nutshell, the process consists of a laboratory protocol
(abbreviated as ChIP) by the end of which the full DNA content of a cell is
reduced to a much smaller subset of it. This subset of DNA is then
sequenced (abbreviated as Seq) and is mapped against a known
reference genome.

How are RNA-Seq studies different from ChIP-Seq studies?


Whereas both approaches appear to operate under similar constraints and
count reads over intervals, there are several significant differences:

1. The DNA fragments coming from a ChIP-Seq study are much
shorter than a transcript studied in RNA-Seq analysis.
2. The DNA fragments of a ChIP-Seq experiment are less localized
than transcripts.
3. ChIP-Seq data produces a larger number of false
positives. In a typical RNA-Seq experiment, only transcripts are
isolated identically, and we can be fairly sure that an observed
sequence did exist as RNA. In contrast, ChIP-Seq protocols strongly
depend on the properties of the reagents, selectivity, and specificity
of the binding process, and these properties will vary across
different proteins and across the genome.

What are the processing steps for ChIP-Seq data?


The general processing steps are as follows:
1. Visualize and correct the quality of the sequencing data.
2. Align sequencing reads to a reference genome.
3. Call peaks from the alignment BAM files.
4. Visualize the resulting peak files and signal files.
5. Find the biological interpretation of the positions where the peaks
are observed.
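
Step 3 can be illustrated with a deliberately naive sketch: pile up the aligned reads into per-base coverage and report the regions where the depth reaches a threshold. Real peak callers use statistical models and usually a control sample; the read coordinates, the region length, and the threshold below are made up.

```python
def coverage(alignments, length):
    """Per-base read depth over a region of the given length."""
    depth = [0] * length
    for start, end in alignments:  # 0-based, half-open intervals
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1
    return depth

def call_peaks(depth, min_depth):
    """Return (start, end) intervals where depth >= min_depth."""
    peaks = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos                 # a peak opens here
        elif d < min_depth and start is not None:
            peaks.append((start, pos))  # the peak closes before pos
            start = None
    if start is not None:
        peaks.append((start, len(depth)))
    return peaks

# Hypothetical reads on a 100 bp region: three overlap, one is isolated.
reads = [(10, 30), (15, 35), (20, 40), (60, 80)]
depth = coverage(reads, 100)
peaks = call_peaks(depth, min_depth=3)
print(peaks)
```

Only the stretch covered by all three overlapping reads is reported; the isolated read never reaches the threshold.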
Transcriptomics data analysis
Transcriptomics technologies are the techniques used to study an
organism's transcriptome, the sum of all of its RNA transcripts. A
transcriptome captures a snapshot in time of the total transcripts present
in a cell. There are two key contemporary techniques in the field:
microarrays, which quantify a set of predetermined sequences, and RNA-
Seq, which uses high-throughput sequencing to record all transcripts.
Measuring the expression of an organism's genes in different tissues or
conditions, or at different times, gives information on how genes are
regulated and reveals details of an organism's biology. It can also be used
to infer the functions of previously unannotated genes. Transcriptome
analysis has enabled the study of how gene expression changes in
different organisms and has been instrumental in the understanding of
human disease. An analysis of gene expression in its entirety allows
detection of broad coordinated trends which cannot be discerned by more
targeted assays.

Transcriptomics data
All transcriptomic methods require RNA to first be isolated from the
experimental organism before transcripts can be recorded. Although
biological systems are incredibly diverse, RNA extraction techniques are
broadly similar and involve mechanical disruption of cells or tissues,
separation of RNA from undesired biomolecules including DNA, and
concentration of the RNA via precipitation from solution or elution from a
solid matrix.

Microarrays
Microarrays consist of short nucleotide oligomers, known as "probes",
which are typically arrayed in a grid on a glass slide. Transcript
abundance is determined by hybridisation of fluorescently labelled
transcripts to these probes. The fluorescence intensity at each probe
location on the array indicates the transcript abundance for that probe
sequence.

RNA-Seq
RNA-Seq refers to the combination of a high-throughput sequencing
methodology with computational methods to capture and quantify
transcripts present in an RNA extract. The nucleotide sequences
generated are typically around 100 bp in length, but can range from 30
bp to over 10,000 bp depending on the sequencing method used. Both
low-abundance and high-abundance RNAs can be quantified in an RNA-
Seq experiment. RNA-Seq may be used to identify genes within
a genome or to identify which genes are active at a particular point in
time, and read counts can be used to accurately model the relative gene
expression level.

RNA-Seq data analysis
RNA-Seq experiments generate a large volume of raw sequence reads
which have to be processed to yield useful information. Data analysis
usually requires a combination of bioinformatics software tools that vary
according to the experimental design and goals. The process can be
broken down into four stages: quality control, alignment, quantification,
and differential expression. Nowadays, most popular RNA-Seq programs
are run from a command-line interface, either in a Unix environment or
within the R/Bioconductor statistical environment.
