Sampling Theory,
Correlation, & Linear
Regression Techniques
Case Studies
Dr. Shaheen
8/9/2017
This document presents case studies on Sampling Theory & Techniques, Correlation, and Linear Regression
Techniques (Simple Linear Regression, Multiple Linear Regression & Logistic Regression).
Sampling Theory & Techniques
Discussion Questions
4. Are there useful results here? Which ones are useful? Are they sufficient, or is further
study needed?
o There are some useful results here, but even after follow-up efforts, only about 70%
have responded. We still lack information from about 30% of these people. Considering
the differences between initial mailing and follow-up responses, these 30% could well
be below any of the reported averages; we simply do not know. In practice, non-response
is often a problem that cannot be eliminated.
o The pilot study should probably not be combined with the other responses. The pilot
study was not a random sample to begin with and has served its purpose in testing the
questionnaire.
o It might be argued that all other responses (first and second sample, initial mailings and
follow-up responses) are useful and should be combined. The resulting average ($3,374)
has a standard error of $64, which is less than the $100 required. This then represents
the planned spending of the approximately 70% of members who are likely to answer
these questionnaires.
o Ideas for further study could include different design methods or additional follow-up of
non-respondents.
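The precision check in the third point above (a standard error of $64 against the $100 required) can be sketched in a few lines of Python. The response values below are hypothetical, since the survey data are not reproduced here; only the $100 target comes from the case. Treat this as an illustration of the calculation, not as the analysis actually performed.

```python
import math
import statistics

# Hypothetical planned-spending responses in dollars (NOT the survey data).
responses = [3000, 3500, 3200, 3600, 3400]

TARGET_SE = 100.0  # required precision stated in the case

mean = statistics.mean(responses)
# Standard error of the mean: sample standard deviation divided by sqrt(n).
std_error = statistics.stdev(responses) / math.sqrt(len(responses))
meets_target = std_error < TARGET_SE

print(f"mean = {mean:.0f}, standard error = {std_error:.1f}, "
      f"meets target: {meets_target}")
```

With only five made-up responses the target is missed. Because the standard error shrinks in proportion to 1/sqrt(n), combining more responses is what brings it under the required value.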
The techies (scientists) in the laboratory have been lobbying you, and management in
general, to include just one more laboratory step. They think it’s a good idea, although you have
some doubt because one of them is known to be good friends with the founder of the start-up
biotechnology company that makes the reagent used in the reaction. But if adding this step works
as expected, it could help immensely in reducing production costs. The trouble is, the test results
just came back and they don’t look so good. Discussion at the upcoming meeting between the
technical staff and management will be spirited, so you’ve decided to take a look at the data.
Your firm is anticipating government approval from the Food and Drug Administration
(FDA) to market a new medical diagnostic test made possible by monoclonal antibody
technology, and you are part of the team in charge of production. Naturally, the team has been
investigating ways to increase production yields or lower costs.
Discussion Questions
1. Does the amount of purifier have a significant effect on yield according to this regression
analysis? Based on this alone, would you be likely to recommend including a purifying
step in the production process?
2. What would you recommend? Are there any other considerations that might change your
mind?
This case is about the need to look at data and not to jump to conclusions based only on an
analysis that requires specific assumptions (e.g., the linear model) that may not be satisfied. It is
also about the problems of the real world where answers may not be clear, even once some data
have been obtained.
No, the amount of purifier does not have a significant effect on yield. The t statistic is
1.805, and the two-sided p-value, 0.105, is not significant at the 5% level. It is not even
significant if a one-sided test is used. (Credit should certainly be given to students who
realize that a one-sided test is appropriate here: the p-value would then be about
0.052 = 0.105/2.) Based on this alone, there is no evidence that
purification affects yield other than randomly.
But this hypothesis testing conclusion is a weak one: we fail to reject the null hypothesis,
perhaps only for lack of enough data against it. The result is close to significant with the
one-sided test, and the sample size is small (n=11). Perhaps the effect of the purifier has been
masked by randomness of the production process, and more data would be needed in order
to detect it.
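The p-values quoted in this answer can be checked from the t statistic alone. The sketch below assumes a simple linear regression, so the degrees of freedom are n - 2 = 9; it integrates the t density numerically with the standard library rather than relying on a statistics package. The function names are illustrative only.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=20000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # by symmetry, half the mass lies above zero

t_stat = 1.805   # from the regression output
n = 11           # sample size given in the case
df = n - 2       # assumption: one explanatory variable plus an intercept

p_one_sided = t_upper_tail(t_stat, df)
p_two_sided = 2 * p_one_sided

print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
```

Both values land just above 0.05 (one-sided) and near 0.105 (two-sided), which is why the conclusion is "not significant" but only narrowly so for the one-sided test.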
2. What would you recommend? Are there any other considerations that might change your
mind?
From the scatter plot it looks like the purifier helps (increases yield) somewhat through
about amount 8, but at higher purifier amounts the yield decreases. A regression omitting the
last point is highly significant (t=3.49 and p=0.008), and a regression omitting the last two
points is even more so (t=7.12 and p=0.0002). On the other hand, it can be argued that
we are not really free to delete points we don’t like. After all, by doing that you could end up
with a significant relationship in nearly any situation.
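The effect of dropping the final points can be illustrated with a small sketch. The data below are hypothetical (the case data are not reproduced here) and were chosen only to mimic the pattern described: yield rising with purifier amount and then falling at the two highest amounts. The function name `slope_t_statistic` is likewise an illustrative choice.

```python
import math

def slope_t_statistic(x, y):
    """t statistic for the slope in a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope / std_err

# Hypothetical data: 11 runs, yield rises and then drops at the last two amounts.
amount = list(range(1, 12))
yield_ = [10, 12, 13, 15, 16, 18, 19, 21, 22, 14, 9]

t_all = slope_t_statistic(amount, yield_)
t_trimmed = slope_t_statistic(amount[:-2], yield_[:-2])

print(f"all 11 points: t = {t_all:.2f}")
print(f"last two omitted: t = {t_trimmed:.2f}")
```

The jump in the t statistic after trimming shows how strongly a few points at the high end can drive the result, and also why deleting points after seeing the data is hard to defend.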
If the true relationship is an increase in yield at low purifier levels and a decline at high
levels, then the linear model assumption is violated and statistical inference may not be
valid. Thus the finding of “not significant” is open to some question. If the cost is not too
high, you might consider viewing this as a pilot study. In the next stage, more data would be
collected to guide us towards the optimal level of purifier to obtain the highest yield. Some
students may want to consider the cost of the purifier, arguing that the optimal level will be
somewhat under the amount that would produce the highest yield.
Everybody seems to disagree about just why so many parts have to be fixed or thrown
away after they are produced. Some say that it's the temperature of the production process, which
needs to be held constant (within a reasonable range). Others claim that it's clearly the density of
the product, and that if we could only produce a heavier material, the problems would disappear.
Then there is Ole, who has been warning everyone forever to take care not to push the equipment
beyond its limits. This problem would be the easiest to fix, simply by slowing down the
production rate; however, this would increase costs. Interestingly, many of the workers on the
morning shift think that the problem is “those inexperienced workers in the afternoon,” who,
curiously, feel the same way about the morning workers.
Ever since the factory was automated, with computer network communication and bar
code readers at each station, data have been piling up. You have finally decided to have a look.
After your assistant aggregated the data by four-hour blocks and then typed in the AM/PM
variable, you found the following note on your desk with a printout of the data already loaded
into the computer network. The variables include:
Naturally you decide to run a multiple regression to predict the defect rate from all of the
explanatory variables, the idea being to see which (if any) are associated with the occurrence of
defects. There is also the hope that if a variable helps predict defects, then you might be able to
control (reduce) defects by changing its value.
Discussion Questions
1. What are the obvious conclusions from the hypothesis tests in the regression output?
2. Look through the data. Do you find anything that calls into question the regression
results? Perform further analysis as needed.
1. What are the obvious conclusions from the hypothesis tests in the regression output?
The F test is very highly significant (p=4.37E-12, so p<0.001). The t tests for Temperature
and for AM/PM are significant. This says that temperature variability has a significant
effect on the defect rate, adjusting for the other factors. The same holds for AM/PM. This suggests
that you should concentrate your attention on these two factors: controlling temperature
variability (because higher temperature variability is associated with significantly more
defects), and seeing why defects seem to be significantly lower in the morning, all else
equal.
None of the t tests for the explanatory variables is significant. However, the F test
remains highly significant. This suggests multicollinearity, which is confirmed by the
correlation matrix for these variables, which shows three variables (Temperature, Density and
Rate) correlated with one another at ±0.9 or stronger.
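A correlation screen of this kind can be sketched as follows. The four short columns are hypothetical stand-ins for the factory variables (the actual data are not reproduced here), and the 0.9 cutoff simply mirrors the figure quoted above. The helper name `pearson` and the dictionary keys are illustrative.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, chosen only to show the pattern described in the text.
data = {
    "temperature": [1.0, 1.3, 1.4, 1.9, 2.1, 2.4],
    "density": [30.0, 29.0, 27.5, 26.0, 25.0, 23.0],
    "rate": [200, 220, 240, 260, 280, 300],
    "am_pm": [0, 1, 0, 1, 0, 1],
}

THRESHOLD = 0.9  # cutoff taken from the "±0.9 or stronger" remark

# Collect every pair of explanatory variables whose |r| reaches the cutoff.
flagged = []
for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    if abs(r) >= THRESHOLD:
        flagged.append((a, b, round(r, 3)))

for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r}")
```

When several explanatory variables are flagged together like this, their individual t tests can all be non-significant even though the overall F test is highly significant, because each variable adds little once the others are in the model.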
Since, in the multiple regression, no single X variable is significant (holding the others
constant), some students may want to consider these variables one at a time. Here are results
of ordinary bivariate regressions to predict the defect rate from each of the X variables. All are
significant now, except AM/PM, which makes no significant difference.
One suggestion that is consistent with the analysis of the corrected data is to concentrate on
the variable that “would be the easiest to fix”, namely slowing down the rate of production.
The high correlations between rate and each of the other X variables (except AM/PM)
suggest that a lower rate may result in less temperature variability and a higher density, each
of which is associated with fewer defects. Of course this will raise costs. Some students may
have something to say about the cost-benefit trade-off between a lower defect rate (the
benefit) and fewer items produced (the cost).