Summer 2021 Midterm Examination

Statistics 100
Due 14 July 2021, 5:00 PM

– The exam consists of 3 problems and 7 pages. The exam is worth 200 points, and the point
value for each question is displayed.
– The exam covers material from Units 1, 2, and 3.
– Your solutions are due at 5:00 pm, 14 July 2021. No late submissions will be accepted.
– Solutions must be uploaded to the course website. Submit 1) a PDF file produced from R
Markdown and 2) the R Markdown file used to produce the PDF. Name the files with your
first initial and last name; e.g. stat100_midterm_summer2021_j_vu.
– Before submitting the exam, read and (electronically) sign the statement on the second page
confirming that you have worked independently.
– Be sure to read the questions carefully. Some parts of a problem statement may ask for more
than one calculation.
– Some parts of a question may require the answer to an earlier part. If you cannot solve
the earlier part, you can still receive partial credit for the later parts; make up a reasonable
answer for the earlier part to use in subsequent parts of the problem.
– Show your work and explain your reasoning; the final answer is not as important as the
process by which you arrived at that answer. We can more easily give partial credit if you
have written out your steps clearly.
– You may use any course materials while working this exam, including the lecture slides,
labs, lab notes, section materials, problem set solutions, and the OI Biostat textbook. While
access to non-course materials is also permitted, the exam has not been written to require
any outside research.
– All your work must be your own. Collaboration is strictly forbidden, including any
discussion about resolving technical issues related to knitting files, etc. This includes posting any
questions about the exam or discussing the exam on the class Slack channels.
– Answers must be in your own words. Plagiarism is not acceptable and we will very likely
detect if an answer has been copied from a website.
– Once the exam is released, the teaching staff will not be able to provide assistance with
working through problems, or to answer questions about concepts covered in class.
– Office hours will be held for help with technical issues, such as files that may not knit, or R
code that may not run. If you are experiencing technical issues, please contact the teaching
staff via email (or private messages on Slack).
– Be sure to save your work frequently and to keep copies on your local machine.
Knitting errors and unexpected loss of work are not acceptable excuses for late submission.

PROBLEM 1: SHORT ANSWER (70 pts. total)
a) (10 pts.) A number of studies have been conducted to investigate the association between
sleep duration and risk of stroke in middle-aged and older adults. One such study was
conducted in a sample of 31,000 participants with an average age of 62 years. Information
on sociodemographic characteristics and sleep duration was obtained by a self-administered
questionnaire. The study took place over 6 years, during which 1,557 cases of stroke were
documented. The risk of stroke was higher in participants who reported sleeping at least 9
hours per night, in comparison to the group who reported sleeping 7 to 8 hours per night;
the relative risk was 1.23.
A news article reporting the results stated: “Sleeping a lot may increase the risk for stroke. A
new study has found that compared with sleeping seven to eight hours a night, sleeping nine
or more hours increased the relative risk for stroke by 23 percent. Maintaining appropriate
sleep duration might hold great promise as primary prevention of stroke.”
Write a short response to the newspaper editor explaining clearly why the article is potentially
misleading. Be sure to use language accessible to a general audience without a
statistics background. Limit your answer to at most six sentences.
b) (8 pts.) Suppose that you have purchased a 40-piece box of salt water taffy; taffy is a type
of soft, chewy candy that became popular in the United States in the 1800s. The box is
advertised as containing a random assortment of 8 flavors. Calculate the probability that at
least one flavor is missing from the box.
If using an algebraic approach, explain your reasoning and any necessary assumptions. If
using a simulation approach, be sure to clearly comment any code and describe your logic.
c) (52 pts.) Accurately distinguishing lung cancer from benign lung disease remains challenging,
even with the use of imaging scans; computed tomography (CT) scans are known to have
high sensitivity but poor specificity for lung cancer diagnosis. Tumor markers, molecules
produced by a tumor associated with a cancer or by the body in response to a cancer, may
be useful for clinical diagnosis.
Consider two tumor markers for lung cancer, CYFRA 21-1 and CEA, which tend to be elevated
in patients with lung cancer relative to those with benign lung disease. A study was
conducted on patients with known lung cancer status to assess how these tumor markers
could be used for clinical diagnosis. The study team observed that in patients with lung
cancer, CYFRA 21-1 is normally distributed with mean 4.7 ng/mL and standard deviation
9.2 ng/mL while CEA is normally distributed with mean 5.9 ng/mL and standard deviation
19.8 ng/mL. In patients with benign lung disease, CYFRA 21-1 is normally distributed with
mean 1.6 ng/mL and standard deviation 4.3 ng/mL while CEA is normally distributed with
mean 2.2 ng/mL and standard deviation 5.3 ng/mL.
Use the data from this study to answer the following questions.
i. Compute the sensitivity and specificity of a diagnostic test based on classifying patients
with CYFRA 21-1 level greater than 3.3 ng/mL as having lung cancer.
ii. Compute the sensitivity and specificity of a diagnostic test based on classifying patients
with CEA level greater than 5.0 ng/mL as having lung cancer.
iii. Explain the reasoning behind why a diagnostic test with low sensitivity may not be
recommended for use in the general population but appropriate for use in high-risk
groups, such as patients presenting with several risk factors or symptoms strongly predictive
of lung cancer. Use language accessible to someone who has not taken a statistics
course. Limit your answer to no more than six sentences.
iv. Suppose a high-risk patient is tested for elevated CYFRA 21-1 level and found to have
CYFRA 21-1 level below the cutoff in part i. Explain whether it seems reasonable to rule
out lung cancer for this patient based on this test result and the test features computed
in part i. Limit your answer to no more than six sentences, referencing numerical
results as necessary.
The study team is interested in whether a diagnostic test based on both CYFRA 21-1 and
CEA is an improvement over tests based solely on one of the markers. Suppose that a patient
is classified as having lung cancer if at least one of the markers is above the cutoffs used in
parts i. and ii.; i.e., a patient tests positive for lung cancer if CYFRA 21-1 level is greater
than 3.3 ng/mL, CEA level is greater than 5.0 ng/mL, or both are elevated.
v. Compute the sensitivity and specificity of this diagnostic test. State any assumptions
necessary to make the calculation and comment on whether those assumptions seem
reasonable.
vi. Does the diagnostic test based on both markers represent an improvement over the
tests in parts i. and ii. for use in high-risk patients? Explain your answer, referencing
numerical results to support your reasoning.
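The sensitivity and specificity calculations in part c) reduce to normal tail probabilities. The following is a minimal Python sketch for illustration only (the exam itself calls for R). It uses the means, standard deviations, and cutoffs stated in the problem; the function names are hypothetical. The lines for the "either marker elevated" rule assume the two markers are independent within each patient group, which the problem statement does not guarantee.

```python
from statistics import NormalDist


def sensitivity(mean_cancer, sd_cancer, cutoff):
    """P(marker > cutoff) among patients who have lung cancer."""
    return 1 - NormalDist(mean_cancer, sd_cancer).cdf(cutoff)


def specificity(mean_benign, sd_benign, cutoff):
    """P(marker <= cutoff) among patients with benign lung disease."""
    return NormalDist(mean_benign, sd_benign).cdf(cutoff)


# CYFRA 21-1 with a 3.3 ng/mL cutoff (parameters from the problem statement)
sens_cyfra = sensitivity(4.7, 9.2, 3.3)
spec_cyfra = specificity(1.6, 4.3, 3.3)

# CEA with a 5.0 ng/mL cutoff (parameters from the problem statement)
sens_cea = sensitivity(5.9, 19.8, 5.0)
spec_cea = specificity(2.2, 5.3, 5.0)

# "Either marker elevated" rule.
# ASSUMPTION: the two markers are independent within each patient group.
# A cancer patient is missed only if both markers are at or below their cutoffs.
sens_either = 1 - (1 - sens_cyfra) * (1 - sens_cea)
# A benign patient tests negative only if both markers are at or below their cutoffs.
spec_either = spec_cyfra * spec_cea

print(f"CYFRA 21-1: sensitivity={sens_cyfra:.3f}, specificity={spec_cyfra:.3f}")
print(f"CEA:        sensitivity={sens_cea:.3f}, specificity={spec_cea:.3f}")
print(f"Either:     sensitivity={sens_either:.3f}, specificity={spec_either:.3f}")
```

Under the independence assumption, the combined rule can only raise sensitivity and lower specificity relative to either single-marker test; if the markers are positively correlated, both shifts would be smaller.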

PROBLEM 2: ICE CREAM SALES (60 pts. total)
Congratulations! You have inherited an ice cream truck for the summer of 2022. You will be
selling ice cream from the truck every day that summer: 101 days in total, from Memorial Day
weekend until Labor Day weekend. This problem will step you through projecting how much ice
cream you expect to sell (as well as how much revenue you expect to make) during the summer.
a) (8 pts.) Ice cream sales are known to be very dependent on weather. Suppose that for any
particular day, the weather is either rainy or sunny; on average, one-third of the days are
rainy and two-thirds of the days are sunny. Compute the expected number of sunny days
in Summer 2022 and the probability of there being more rainy days than sunny days in
Summer 2022. You may assume that the weather is independent between days.
b) (22 pts.) It is more realistic to think that tomorrow’s weather depends on today’s weather
(for all days throughout the summer). Suppose that if it is sunny today, there is an 80%
chance that tomorrow will also be sunny; however, if it is rainy today, there is only a 30%
chance that tomorrow will be sunny. Additionally, suppose that the weather on ‘Day 0’ (i.e.,
the day before your ice cream truck opens) is known to be sunny.
Based on a simulation with 10,000 replicates, estimate the probability of there being more
rainy days than sunny days in Summer 2022. Write a brief paragraph outlining the logic of
the simulation and clearly comment your code.
The number of ice cream cones sold on a typical day depends on the weather. When it is sunny,
the number of ice cream cones sold is approximately normally distributed with mean 200 and
standard deviation 40. When it is rainy, the number of ice cream cones sold is approximately
normally distributed with mean 120 and standard deviation 30. Use the round() function to
round the simulated number of cones sold to the nearest whole number.
You make $2.00 in profit for each ice cream cone sold and your fixed operating cost is $300
per day.
c) (14 pts.) Based on a simulation with 10,000 replicates, estimate the probability that you lose
money on any one randomly selected day in the summer; write a brief paragraph outlining
the logic of the simulation and clearly comment your code. Assume the weather follows the
pattern described in part a).
d) (16 pts.) Based on a simulation with 10,000 replicates, estimate the probability that you
lose money with your ice cream truck during the summer of 2022; write a brief paragraph
outlining the logic of the simulation and clearly comment your code. Assume the weather
follows the pattern described in part b).
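The day-to-day weather dependence in part b) is a two-state Markov chain. The following is a minimal Python sketch for illustration only (the exam itself calls for R). It takes the transition probabilities, the sunny Day 0, the 101-day season, and the 10,000 replicates from the problem statement; the function names and the seed are hypothetical choices.

```python
import random


def simulate_weather(n_days, p_sun_after_sun, p_sun_after_rain, rng, start_sunny=True):
    """Return a list of booleans (True = sunny) for days 1..n_days."""
    sunny = start_sunny  # weather on Day 0, which is not counted in the season
    days = []
    for _ in range(n_days):
        # Tomorrow's chance of sun depends only on today's weather
        p_sun = p_sun_after_sun if sunny else p_sun_after_rain
        sunny = rng.random() < p_sun
        days.append(sunny)
    return days


def prob_more_rainy(n_reps, n_days=101, seed=2021):
    """Estimate P(rainy days > sunny days) over one season by simulation."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_reps):
        days = simulate_weather(n_days, 0.8, 0.3, rng)
        n_sunny = sum(days)
        if n_days - n_sunny > n_sunny:
            hits += 1
    return hits / n_reps


estimate = prob_more_rainy(10_000)
print(f"Estimated P(more rainy than sunny days): {estimate:.3f}")
```

Each replicate is one simulated summer. The same loop could be extended to draw daily sales conditional on the simulated weather, which is the structure part d) asks for.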

PROBLEM 3: VACCINE ACCEPTANCE (70 pts. total)
Vaccine development and availability are not the only obstacles to controlling a pandemic. A
sufficient share of the population must be vaccinated to reach herd immunity: the point at which
enough of the population is immune such that person-to-person spread is unlikely. Herd immunity
allows for individuals who cannot be vaccinated to be protected from disease.
However, as of July 2021, the rate of COVID-19 vaccinations in the United States has declined
and it is unclear whether herd immunity might be reached in the near future. A number of
strategies are being considered to encourage more individuals to get vaccinated, such as door-to-door
outreach and setting up clinics in workplaces. Studies investigating factors associated with
vaccine hesitancy are currently underway; the results of these studies will inform strategy behind
vaccination efforts.
One study conducted in July 2020, before COVID-19 vaccines were available, investigated
self-reported willingness to receive COVID-19 vaccination by asking survey participants to consider
several hypothetical vaccines with various features. Researchers collected data from 1,971
participants via a 15-minute online survey, presenting respondents with 10 hypothetical vaccines and
asking how likely or unlikely they would be to receive each vaccine. For example, one vaccine
participants were asked to consider was described as being 70% effective, approved and licensed
by the US Food and Drug Administration, developed in the United States, and endorsed by Vice
President Joe Biden. The survey also included questions about demographic characteristics.
Survey respondents were recruited from a pool of US adults by a survey sampling company.
Data from the study are in the file vax.Rdata. The descriptions of the variables are as follows.
Variable       Description
work_status    recent change in employment status
work_home      ability to work from home
covid_future   view about where the US stands in the COVID-19 outbreak
flu_vaccine    past frequency of receiving flu vaccine
age            age in years
ideology       political ideology
party3         political party
gender         gender
income         annual income range in USD
education      highest level of education completed
vax_safety     opinion on vaccine safety in general
ethnicity      race/ethnicity
likely         vaccine acceptance score
Some additional background on certain variables:
– The covid_future variable records responses to the question “Which comes closer to your
view about where the US stands in the coronavirus outbreak: the worst is behind us, or the
worst is yet to come?”
– The ideology variable records responses to the question “Politically, I consider myself. . . ”
with seven options ranging from very liberal to very conservative. The party3 variable
records responses to the question “In politics, as of today, do you consider yourself a
Republican, a Democrat, or an Independent?”
– The vax_safety variable records responses to the question “In general, how safe do you
think vaccines are?”
– The vaccine acceptance score ranges from 1 to 7 and represents the average response to the
question “How likely or unlikely would you be to get Vaccine X?” for the 10 hypothetical
COVID-19 vaccines described to survey participants. The score can be thought of as a
measure of how willing someone is to receive a COVID-19 vaccination, with a higher score
representing greater willingness.
Use these data to answer the following questions.
a) (16 pts.) Explore the data.
i. Briefly summarize the distribution of vaccine acceptance score. Reference appropriate
numerical and graphical summaries as needed.
ii. Write an informative summary describing features of the study participants with
respect to the variables age, income, and education. Reference appropriate numerical
and graphical summaries as needed. Limit your answer to no more than five sentences.
iii. Compute the proportion of respondents with vaccine acceptance score greater than 4;
scores higher than 4 represent willingness to get a vaccine.
iv. Might the proportion in part iii. represent an underestimate or overestimate of the
proportion of US adults willing to get a COVID-19 vaccine? Explain your reasoning.
b) (4 pts.) From these data, compute and interpret the relative risk of being unwilling to get a
vaccine (i.e., having vaccine acceptance score less than or equal to 4) comparing individuals
who identify as being very conservative versus those who identify as being very liberal.
c) (12 pts.) Investigate the relationship between vaccine acceptance score, age, and political
party identification.
i. Create a graphical summary that shows the association between age and vaccine
acceptance score. Describe what you see, referencing an appropriate numerical summary.
ii. Repeat part i. separately among Democrats and among Republicans.
iii. Write a single sentence conveying the main idea from the investigation in parts i. and
ii., using language accessible to someone who has not taken a statistics course.
d) (18 pts.) Investigate the relationship between vaccine acceptance score, education level, and
ideology.
i. Describe the association between vaccine acceptance score and education level,
referencing numerical and graphical summaries as needed.
ii. Describe the association between vaccine acceptance score and ideology, referencing
numerical and graphical summaries as needed.
iii. Does the association between vaccine acceptance score and ideology seem to differ
according to education level? Explain your answer, referencing numerical and graphical
summaries as needed. Limit your answer to no more than six sentences.
e) (6 pts.) Describe the association between vaccine acceptance score and ability to work from
home. Is the nature of the association as you might expect? Explain your answer.

f) (6 pts.) Prepare a short statement, no more than eight sentences long, summarizing the
main findings of the analyses with respect to understanding who should be targeted in public
health campaigns to address vaccine hesitancy. Reference previous numerical results
as needed. If desired, you may also reference numerical results of analyses that were not
explicitly called for in the previous parts of this question.
g) (8 pts.) Describe two limitations of this study for understanding features associated with
vaccine acceptance (or vaccine hesitancy) in the US population during the current stage of
the pandemic (i.e., July 2021). Limit your answer to no more than eight sentences.
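The relative risk in part b) is the ratio of two proportions: the share of unwilling respondents in one group divided by the share in the comparison group. The following is a minimal Python sketch for illustration only (the exam itself calls for R). The function name is hypothetical and the counts are made up for the example; they are not taken from vax.Rdata.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Risk of the event in group A divided by the risk in group B."""
    risk_a = events_a / total_a
    risk_b = events_b / total_b
    return risk_a / risk_b


# HYPOTHETICAL counts, for illustration only (not from vax.Rdata):
# group A: 60 of 150 respondents unwilling; group B: 30 of 150 unwilling
rr = relative_risk(events_a=60, total_a=150, events_b=30, total_b=150)
print(f"Relative risk (A vs. B): {rr:.2f}")
```

With these made-up counts, the risk of being unwilling is 0.40 in group A and 0.20 in group B, so group A is twice as likely as group B to be unwilling.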
