
MKTG 776 Applied Probability Models

Spring 2015 February 25, 2015


Paper 1 [Add title]
[Add an executive summary; the professor recommends it]
Motivation
The motivation behind this question is that, in any scientific testing process
in which a microorganism colony or formation is studied, the manager or decision
maker conducting the study could improve their analysis of the results if they
were able to forecast the expected results ahead of time and thus judge any
deviations from those expected results [the concept of normality is limited to
the normal distribution]. Additionally, given the various uses of yeast in
particular, there is an inherent interest in being able to estimate how much yeast
colonies grow in a given time period.
The dataset being analyzed was obtained from GitHub (exact URL:
https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/HistData/Yeast.csv;
the entire data set is shown in the sidebar) [I would include only a small sample
of the dataset, perhaps in the appendix]. The dataset shows the results of
Student's yeast cell count study, which included four different samples (A, B, C, D)
of 400 unique yeast colonies each. For each sample, we see the number of
observations that had a yeast cell count of 0, 1, 2, etc. after a given time
period. Some samples cut off at a relatively low count of 5 or 6 yeast cells, while
other samples show much higher counts of 9 or more yeast cells. Because of the way
the study was conducted, one can sum the frequencies for each respective yeast cell
count (i.e., add up all the 0s, all the 1s, etc.) and obtain aggregate data for each
count that we can use to test overall yeast cell growth over one time period. This
is a very important assumption, as it allows us to generate a larger sample, which
we expect to behave in a way that can be summarized with the Negative Binomial
distribution (NBD) discussed in class. In particular, we believe that the yeast
cells behave in a way that can be captured by a Poisson distribution allowing for
gamma heterogeneity. I selected this data set because it has a high number of
observations, it is a fairly simple and straightforward count data set, and it has
enough variation across its samples to support analysis that leads to interesting
takeaways and conclusions.

[Sidebar table: Yeast Cell Count Data Set]
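As a sketch of what "a Poisson distribution allowing for gamma heterogeneity" implies, the derivation below mixes a Poisson count over a gamma-distributed rate. The symbols are assumptions made for illustration: lambda is an individual colony's Poisson rate, and r and alpha are the shape and scale (rate) parameters of the gamma distribution, following the usual NBD notation.

```latex
\begin{align*}
P(X = x \mid \lambda) &= \frac{\lambda^{x} e^{-\lambda}}{x!},
\qquad
g(\lambda \mid r, \alpha) = \frac{\alpha^{r} \lambda^{r-1} e^{-\alpha\lambda}}{\Gamma(r)} \\[4pt]
P(X = x \mid r, \alpha)
  &= \int_{0}^{\infty} \frac{\lambda^{x} e^{-\lambda}}{x!}\,
     \frac{\alpha^{r} \lambda^{r-1} e^{-\alpha\lambda}}{\Gamma(r)}\, d\lambda
   = \frac{\alpha^{r}}{\Gamma(r)\, x!} \int_{0}^{\infty} \lambda^{x+r-1} e^{-(\alpha+1)\lambda}\, d\lambda \\[4pt]
  &= \frac{\Gamma(r + x)}{\Gamma(r)\, x!}
     \left(\frac{\alpha}{\alpha + 1}\right)^{r}
     \left(\frac{1}{\alpha + 1}\right)^{x},
\qquad x = 0, 1, 2, \dots \\[4pt]
E[X] &= \frac{r}{\alpha},
\qquad
\operatorname{Var}[X] = \frac{r}{\alpha} + \frac{r}{\alpha^{2}}
\end{align*}
```

Because the variance exceeds the mean by r/alpha^2, this mixture can only describe data that are at least as dispersed as a plain Poisson.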
The exercise at hand is to fit an appropriate count model to the data set. I
elected to fit an NBD (Poisson-gamma mixture) model, as this model is used for
count data bounded at 0 with no upper limit, and it expresses the count behavior in
stochastic terms that sum individual yeast cell behavior into the aggregate
behavior of a population. Additionally, several papers in the literature discuss
the applicability of this model to biological behaviors. The NBD has two
parameters, r and alpha, that determine the shape of the heterogeneity across the
population. [I would put the histogram without the red line here] For this
particular data set, I would expect an r somewhere between 1 and 2, as the data
seem to have a very round top that does not drop off steeply until the later part
of the distribution, and a low alpha, close to 1, as the distribution is not
heavily concentrated on any particular value.
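A minimal sketch of how such an NBD fit could be carried out by maximum likelihood on an aggregated frequency table is shown below. It is not the author's actual procedure: the function names are hypothetical, and the frequency table is made-up illustrative data rather than the yeast counts. The sketch relies on the fact that, for a fixed r, the likelihood is maximized when the model mean r/alpha equals the sample mean, which leaves a one-dimensional search over r.

```python
import math


def nbd_log_pmf(x, r, alpha):
    """Log of P(X = x) for an NBD with gamma shape r and scale (rate) alpha."""
    return (math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(alpha / (alpha + 1.0))
            - x * math.log(alpha + 1.0))


def nbd_pmf(x, r, alpha):
    """P(X = x) under the NBD."""
    return math.exp(nbd_log_pmf(x, r, alpha))


def log_likelihood(freq, r, alpha):
    """Log-likelihood of a frequency table {count: number of observations}."""
    return sum(n * nbd_log_pmf(x, r, alpha) for x, n in freq.items())


def fit_nbd(freq, r_low=0.01, r_high=200.0, tol=1e-8):
    """Maximum likelihood fit via a golden-section search over log(r).

    For each candidate r, alpha is set to r / sample mean. Assumes the data
    are overdispersed (variance > mean); otherwise r drifts to r_high.
    """
    total = sum(freq.values())
    mean = sum(x * n for x, n in freq.items()) / total

    def profile(log_r):
        r = math.exp(log_r)
        return log_likelihood(freq, r, r / mean)

    a, b = math.log(r_low), math.log(r_high)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if profile(c) > profile(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    r_hat = math.exp((a + b) / 2.0)
    return r_hat, r_hat / mean


# Hypothetical frequency table (NOT the yeast data): count -> observations
freq = {0: 30, 1: 40, 2: 35, 3: 25, 4: 15, 5: 8, 6: 4, 7: 2, 8: 1}
r_hat, alpha_hat = fit_nbd(freq)
print(f"r = {r_hat:.3f}, alpha = {alpha_hat:.3f}, "
      f"LL = {log_likelihood(freq, r_hat, alpha_hat):.2f}")
```

Replacing `freq` with the aggregated yeast frequencies would give estimates to compare against the prior expectation of an r between 1 and 2 and an alpha close to 1.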

On the right is the model output from an initial NBD model. As expected, the
estimated r and alpha of 1.81 and 0.86 make sense and produce the expected vs.
actual distribution shown below. The chi-squared goodness-of-fit test provides a
value of 0.37, which supports the reliability of the model.
[You are confusing gamma with the parameter that we called r in class. Although
mathematically it makes no difference, I would try to keep the class nomenclature
to keep some jerk from criticizing you for it.]
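The chi-squared goodness-of-fit check mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the computation behind the reported value of 0.37: the function names are hypothetical, the observed frequencies are made up, and only r = 1.81 and alpha = 0.86 are taken from the text. The last cell is treated as a right-censored "6 or more" cell so that the expected frequencies sum to the number of observations.

```python
import math


def nbd_pmf(x, r, alpha):
    """P(X = x) for an NBD with gamma shape r and scale (rate) alpha."""
    log_p = (math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
             + r * math.log(alpha / (alpha + 1.0))
             - x * math.log(alpha + 1.0))
    return math.exp(log_p)


def expected_counts(r, alpha, n_obs, max_count):
    """Expected frequencies for cells 0, ..., max_count - 1 and 'max_count or more'."""
    probs = [nbd_pmf(x, r, alpha) for x in range(max_count)]
    probs.append(1.0 - sum(probs))  # right-censored tail cell
    return [n_obs * p for p in probs]


def chi_squared(observed, expected, n_params=2):
    """Pearson chi-squared statistic and its degrees of freedom.

    Degrees of freedom = cells - 1 - number of estimated parameters.
    """
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1 - n_params
    return stat, dof


# Hypothetical observed frequencies (NOT the yeast data) for counts
# 0, 1, 2, 3, 4, 5 and "6 or more".
observed = [30, 40, 35, 25, 15, 8, 7]

# r and alpha are the estimates reported in the text; because the observed
# frequencies are made up, the resulting statistic says nothing about the
# actual fit to the yeast data.
expected = expected_counts(r=1.81, alpha=0.86, n_obs=sum(observed), max_count=6)
stat, dof = chi_squared(observed, expected)
print(f"chi-squared = {stat:.2f} on {dof} degrees of freedom")
```

The statistic would then be compared against a chi-squared distribution with `dof` degrees of freedom to obtain a p-value; cells with very small expected frequencies are conventionally pooled before doing so.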
After these initial results, I explored extending this model to zero-inflated and
one-inflated negative binomial distributions in an attempt to capture any excess of
yeast cell counts at 0 or 1. The zero-inflated model produced meaningless values
for the parameters, and the one-inflated model was fairly similar to the original
model, with an improvement that was not statistically significant enough to justify
its use. Additionally, I struggle to find a story that would explain why the
results would be inflated at 1 in particular. [Put this the other way around. If
you do not have a story, you do not fit the inflated models! I would say that,
although you cannot think of a story, because you do not see why the heterogeneity
of the sample would need adjusting, you want to investigate the other two models
anyway out of intellectual curiosity.]
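For reference, one common way to write a zero-inflated NBD is sketched below; the exact specification used in the analysis is not stated, so this form and the symbol pi (the extra probability mass placed at zero) are assumptions.

```latex
\begin{align*}
P_{\text{ZI}}(X = 0) &= \pi + (1 - \pi)\left(\frac{\alpha}{\alpha + 1}\right)^{r} \\[4pt]
P_{\text{ZI}}(X = x) &= (1 - \pi)\,
  \frac{\Gamma(r + x)}{\Gamma(r)\, x!}
  \left(\frac{\alpha}{\alpha + 1}\right)^{r}
  \left(\frac{1}{\alpha + 1}\right)^{x},
\qquad x = 1, 2, \dots
\end{align*}
```

A one-inflated variant would move the extra mass pi from x = 0 to x = 1 in the same way. Setting pi = 0 recovers the plain NBD, which is why the inflated models can be compared against the original fit as nested alternatives.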
My next step was to fit models to the individual samples in an attempt to
determine whether there are any differences in the behavior of the smaller samples
[the idea of fitting the small samples and then aggregating is wrong]. In practice,
if a researcher were running experiments with samples of yeast colonies, it would
be important to determine whether a forecast for the rest of the population could
be made from one sample or whether several samples would have to be tested before a
useful forecast could be produced.

More tests are needed, but once we analyze the samples individually, three of them
produce meaningless results. Interestingly, the aggregate works and is logical, but
the individual samples either do not fit an NBD model or are too small. So at the
aggregate level the noise cancels out, but at the sample level it does not. What
else behaves like yeast? What other distributions would work to fit these types of
data?
