
RAFAEL IRIZARRY: Here we provide a quick example of how statistical models can

provide insights into the nature of high throughput data from a new technology.
The particular example we're going to be looking at
was a data set generated with a relatively new technology called
RNA sequencing.
Several groups have suggested that the variability
we see across replicates in this technology
is well described by Poisson distributions.
And we're going to use a simulation based on the Poisson distribution
to help us understand the properties of a commonly used summary statistic.
So the first thing we're going to do is we're
going to generate, using a simulation, two replicates.
So we are going to create data for 10,000 genes.
And we're going to assume that their rates range from 2^1 to 2^16.
And the next step, we're going to actually generate two replicates.
They have the same lambdas, but they're two independent realizations
of this experiment.
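This setup can be sketched in Python with numpy; this is not the instructor's actual code, and the variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rates for 10,000 hypothetical genes, log-spaced from 2^1 to 2^16
N = 10000
lambdas = 2 ** np.linspace(1, 16, N)

# Two independent replicates: same per-gene rates,
# but two independent Poisson draws
x = rng.poisson(lambdas)
y = rng.poisson(lambdas)
```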
Now, the question that we're going to ask
is if it's appropriate to consider the log ratio as a summary statistic.
And if it is, if there's something we should know about this log ratio
before we interpret the results.
Log ratios are commonly used in biology to summarize differences,
so it's of particular interest here.
So what we're going to do is look at the subset
where both x and y are positive, so that we can form a log ratio,
and plot the log ratio against the lambdas
to see how the properties of these log ratios change as lambda changes.
And we do this because it is something that
has been observed in many high-throughput technologies:
there is a relationship between the mean values, for example the mean
levels of gene expression, and the behavior of summary statistics.
So let's make that plot.
This is based on the simulation.
And we see an interesting result, namely that when
the lambdas are on the low end of the spectrum, the variability of the log ratio
is much higher than at the high end of the spectrum.
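This observation can be checked numerically rather than by eye, by comparing the spread of the log ratios at the low and high ends of the lambda range. A Python sketch under the same simulation setup, with my own variable names:

```python
import numpy as np

rng = np.random.default_rng(2)
lambdas = 2 ** np.linspace(1, 16, 10000)
x = rng.poisson(lambdas)
y = rng.poisson(lambdas)

# Keep only genes where both replicates are positive,
# so the log ratio is defined
pos = (x > 0) & (y > 0)
log_ratio = np.log2(x[pos] / y[pos])
lam = lambdas[pos]

# Standard deviation of the log ratio in the lowest and
# highest quarters of the lambda range
low = log_ratio[lam < np.quantile(lam, 0.25)]
high = log_ratio[lam > np.quantile(lam, 0.75)]
print(low.std(), high.std())
```

The spread at the low end comes out many times larger than at the high end.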
This is actually not surprising.
If you know the mathematics of the Poisson, you can actually predict this.
But with a simulation, we can see it as well.
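For reference, here is the prediction the mathematics gives (my own derivation, not spelled out in the lecture): if X and Y are independent Poisson(lambda), the delta method gives Var(log2(X/Y)) ≈ 2 / (lambda · ln(2)^2), so the standard deviation of the log ratio shrinks like 1/sqrt(lambda). A quick numerical check in Python:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 1024.0  # one moderately large rate

# Many independent replicate pairs, all at the same rate
x = rng.poisson(lam, size=100000)
y = rng.poisson(lam, size=100000)
m = np.log2(x / y)

# Delta-method approximation to the standard deviation
predicted = np.sqrt(2 / lam) / np.log(2)
print(m.std(), predicted)
```

The empirical standard deviation agrees closely with the delta-method value at this lambda; the approximation degrades for very small lambdas, where counts of zero also become common.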
So if you're going to be using log ratios as a summary statistic,
you should be aware of this fact: for low values of lambda,
you can get very large log ratios just by chance,
even when you're considering technical replicates.
So to see that this is in fact what you see with actual data,
we load an experiment that has RNA-seq data.
And we're going to take two columns.
I know from before that these are in fact replicates,
so let's take those two.
I'm going to, again, only consider cases where both counts are positive.
And now, because lambda is an unknown parameter that we don't observe,
I'm going to instead use the average of the two replicates
as an estimate of lambda,
and I'm going to plot the log ratio against that.
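With real count data the computation is the same, substituting the average of the two replicates for the unknown lambda. A sketch of that step in Python, using simulated counts as a stand-in for the loaded experiment (the actual dataset is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)
lambdas = 2 ** np.linspace(1, 16, 10000)
# Stand-ins for two replicate columns of a real count table
x = rng.poisson(lambdas)
y = rng.poisson(lambdas)

# Keep genes where both counts are positive
pos = (x > 0) & (y > 0)
x, y = x[pos], y[pos]

# Average of the two replicates as an estimate of lambda;
# the plot is then log_ratio against lambda_hat
lambda_hat = (x + y) / 2
log_ratio = np.log2(x / y)
```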
And what we see with the actual data is very similar behavior.
This is the simulation, and this is the actual data.
Again, for low values of our estimate of lambda,
we see larger variability of the log ratios.
So again, this is a very simple example of how statistical models can give us
insights into the nature of the data that we're tasked with analyzing.
