
2021

HOMEWORK
ASSIGNMENT 1
Probability and Statistics
SCENARIO: U.S. CENSUS
We took a random sample from the 2000 U.S. Census. Here is part of the dataset:

1. Who are the individuals described by this data?


2. What type of variable is Zipcode?
3. What type of variable is Annual Income?

CLINICAL DEPRESSION AND DRUG TREATMENT

Background
Clinical depression is the most common mental illness in the United States, affecting 19
million adults each year (Source: NIMH, 1999). Nearly 50% of individuals who
experience a major episode will have a recurrence within 2 to 3 years. Researchers are
interested in comparing therapeutic solutions that could delay or reduce the incidence of
recurrence.

In a study conducted by the National Institutes of Health, 109 clinically depressed
patients were separated into three groups, and each group was given one of two active
drugs (imipramine or lithium) or no drug at all. For each patient, the dataset contains
the treatment used, the outcome of the treatment, and several other interesting
characteristics.

Here is a summary of the variables in our dataset:

• Hospt: The patient's hospital, represented by a code for each of the 5 hospitals (1, 2, 3,
5, or 6)

• Treat: The treatment received by the patient (Lithium, Imipramine, or Placebo)

• Outcome: Whether or not a recurrence occurred during the patient's treatment
(Recurrence or No Recurrence)

• Time: Either the time in days till the first recurrence, or if a recurrence did not occur,
the length in days of the patient's participation in the study.

• AcuteT: The time in days that the patient was depressed prior to the study.

• Age: The age of the patient in years, when the patient entered the study.

• Gender: The patient's gender (1 = Female, 2 = Male)

Here's a snapshot of the first 50 patients in the dataset with gender recoded to display
Female or Male:
4. Who are the individuals described by this data?
5. Which of the following variables is categorical? Check all that apply.
6. Which of the following variables is quantitative? Check all that apply.
7. What is the difference between the two bar charts?

8. What do the results suggest about how the students are divided
across the three body image categories?
9. How does the middle group of students (19.6%) feel about their
weight?
10. How do the vast majority of students (71.3%) feel about their
weight?
11. What was the body perception that occurred the least often?

EXAM GRADES
Here are the exam grades of 15 students:

88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73

We first need to break the range of values into intervals (also called "bins" or "classes").
In this case, since our dataset consists of exam scores, it will make sense to choose
intervals that typically correspond to the range of a letter grade, 10 points wide: 40-50,
50-60, ... 90-100. By counting how many of the 15 observations fall in each of the
intervals, we get the following table:
Exam Grades

Score       Count
[40-50)     1
[50-60)     2
[60-70)     4
[70-80)     5
[80-90)     2
[90-100]    1
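
The counting step described above can be sketched in Python. This is only an illustration and not part of the original assignment: the function name `count_bins` is hypothetical, and it assumes intervals that include their left endpoint and exclude their right endpoint, except for the last interval, which is closed on both sides (as in the table).

```python
grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]

def count_bins(values, start=40, stop=100, width=10):
    """Count values per interval [left, left + width); the last interval also includes `stop`."""
    edges = list(range(start, stop, width))  # left endpoints: 40, 50, ..., 90
    counts = {left: 0 for left in edges}
    for v in values:
        if not start <= v <= stop:
            raise ValueError(f"{v} is outside the range {start}-{stop}")
        # Left endpoint of the interval containing v; a score equal to
        # `stop` is assigned to the last (closed) interval.
        left = min((v - start) // width * width + start, edges[-1])
        counts[left] += 1
    return counts

counts = count_bins(grades)
for left, n in counts.items():
    print(f"{left}-{left + 10}: {n}")
```

The printed counts should agree with the table, and they add up to the 15 observations.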

12. What percentage of students earned less than a grade of 70 on the exam?

13. An instructor asked her students how much time (to the nearest hour) they spent
studying for the midterm. The data are displayed in the following histogram:
What do the numbers on the horizontal axis represent?
14. An instructor asked her students how much time (to the nearest hour) they spent
studying for the midterm. The data are displayed in the following histogram:

What do the numbers on the vertical axis represent?

15. An instructor asked her students how much time (to the nearest hour) they spent
studying for the midterm. The data are displayed in the following histogram:
What percentage of students study 6 or more hours for the midterm?

16. Thirty-two students were asked the number of servings of fruits and vegetables
they eat daily. The results are displayed in the histogram below.

How many of the students surveyed eat at least 4 servings of fruits and vegetables
daily?

17. Thirty-two students were asked the number of servings of fruits and vegetables
they eat daily. The results are displayed in the histogram below.
What percentage of the students surveyed eat no more than 3 servings of fruits and
vegetables daily?

18. Thirty-two students were asked the number of servings of fruits and vegetables
they eat daily. The results are displayed in the histogram below.

What proportion of the students surveyed eats exactly 5 servings of fruits and
vegetables daily?

19. A survey was conducted to see how many phone calls people made daily. The
results are displayed in the table below:
Number of calls made    Frequency
1-4                     16
5-8                     11
9-12                    5
13-16                   3
17-20                   1

How many of the people surveyed make fewer than 9 phone calls daily?

20. A survey was conducted to see how many phone calls people made daily. The
results are displayed in the table below:

Number of calls made    Frequency
1-4                     16
5-8                     11
9-12                    5
13-16                   3
17-20                   1

How many people were surveyed?

Different Shape Distribution


We will use the Best Actor Oscar winners (1970-2013) dataset to practice what you've learned about
describing the histogram. Below is the histogram of Best Actor Oscar winners from 1970 - 2013
grouped by age.

21. What is the shape of this histogram?


22. What is the most likely shape of the distribution of age of death from natural causes
(heart disease, cancer, etc.) when represented by a histogram?
23. What is the most likely shape of the distribution of the age at which a child takes its
first steps?
24. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?

SAT Math scores of 1,000 future engineers and scientists.
Results of rolling a six-sided die 1,000 times.
Cholesterol levels of 1,000 adults.
Shoe sizes of 1,000 men and women.
Prices of 1,000 California homes.

25. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?

SAT Math scores of 1,000 future engineers and scientists.
Results of rolling a six-sided die 1,000 times.
Cholesterol levels of 1,000 adults.
Shoe sizes of 1,000 men and women.
Prices of 1,000 California homes.

26. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?

SAT Math scores of 1,000 future engineers and scientists.
Results of rolling a six-sided die 1,000 times.
Cholesterol levels of 1,000 adults.
Shoe sizes of 1,000 men and women.
Prices of 1,000 California homes.

27. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?

SAT Math scores of 1,000 future engineers and scientists.
Results of rolling a six-sided die 1,000 times.
Cholesterol levels of 1,000 adults.
Shoe sizes of 1,000 men and women.
Prices of 1,000 California homes.

28. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?

SAT Math scores of 1,000 future engineers and scientists.
Results of rolling a six-sided die 1,000 times.
Cholesterol levels of 1,000 adults.
Shoe sizes of 1,000 men and women.
Prices of 1,000 California homes.

29. Here are the numbers of hours that 9 students spend on the computer on a typical
day: 1 6 7 5 5 8 11 12 15

What is the mode number of hours spent on the computer?

30. Here are the numbers of hours that 9 students spend on the computer on a typical
day: 1 6 7 5 5 8 11 12 15

What kind of distribution is formed by the data from the above 9 students?

31. Here are the numbers of hours that 9 students spend on the computer on a typical
day: 1 6 7 5 5 8 11 12 15

Which of the following is the mean number of hours spent on the computer?

32. A recent survey asked 90 students, "How many hours do you spend on the computer
in a typical day?"

Of the 90 respondents, 3 said 1 hour, 5 said 2 hours, 15 said 3 hours, 25 said 4
hours, 20 said 5 hours, 15 said 6 hours, 5 said 7 hours, 1 said 8 hours, and 1 said 9
hours.

What is the average (mean) number of hours spent on the computer?


33. Here are the numbers of hours that 9 students spend on the computer on a typical
day: 1 6 7 5 8 5 11 12 15

What is the median number of hours spent on the computer?

Background
A study was done in order to find out whether pamphlets containing information for
cancer patients are written at a level that the cancer patients can understand. Tests
were administered to measure the reading levels of 63 cancer patients, and the
readability levels of 30 cancer pamphlets were evaluated based on such factors as the
lengths of the sentences and the number of polysyllabic words. Both the reading and
readability levels correspond to grade levels, but patients' reading levels of less than
grade 3 and above grade 12 cannot be determined exactly. (Source: Short, Moriarty,
and Cooley. (1995). "Readability of Educational Materials for Cancer Patients." Journal of
Statistics Education, v.3, n.2)

The following tables indicate the number of patients at each reading level and the
number of pamphlets at each readability level.

Comment
Note: For both the reading level and readability level, the data are presented in a
grouped form where the count represents the frequency of occurrence of that level. In
the readability data, for example, the count of level 6 is 3 which means that the first
three data points are 6 6 6; the count of level 7 is 3 which means that the next three
data points are 7 7 7; the count of level 8 is 8 which means that the next eight data
points are 8 8 8 8 8 8 8 8, and so on.
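
The expansion described in the note can be sketched in Python as follows. This is an illustrative sketch, not part of the original assignment: the function name `expand` is hypothetical, and only the three counts quoted in the note (levels 6, 7, and 8) are used, because the full tables are not reproduced here.

```python
# Only the counts quoted in the note above; the remaining readability
# levels are deliberately left out because the table is not shown here.
grouped = {6: 3, 7: 3, 8: 8}

def expand(grouped_counts):
    """Turn {level: count} into the ordered list of individual data points."""
    points = []
    for level in sorted(grouped_counts):
        # Repeat each level as many times as its count.
        points.extend([level] * grouped_counts[level])
    return points

points = expand(grouped)
print(points)       # 6 6 6 7 7 7 8 8 8 8 8 8 8 8
print(len(points))  # number of data points covered by these three levels
```

Once grouped data are expanded this way (or simply counted through in order), finding a mode or a median works exactly as it does for an ordinary ordered list.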

Answer the following questions:

34. Which of the following is the mode of the readability level for the pamphlets?
35. Explain why you cannot calculate the mean (average) reading level of patients
given the above data.
36. Now find the median reading level of the patients. Note that the data are already
ordered—that's good!
37. How many observations (n) are there for the reading level of cancer patients?
38. Since n is odd, the median reading level for patients will be which observation in the
ordered data? (center observation or average of the two center observations)
39. What is the rank of the observation that represents the median reading level for
patients?
40. Find the median readability level of the pamphlets.
41. Can you conclude that the pamphlets are well matched to the patients' reading
levels? Look carefully at the data.
42. In the histogram below, what will be the relationship between the mean and median
of the collected data?

43. The SAT Math scores of 1,000 future engineers and physicists are recorded. What
will be the relationship between the mean and median of the collected data?

Depression Days

In the workplace, depression is a leading cause of absenteeism and loss of productivity
(Greenberg et al., 1993). To assess the degree to which people suffer from depression
prior to receiving treatment, data were collected on the number of days that 105
patients were depressed prior to starting a new treatment. These data are displayed in
the following table and histogram:

Days         Count
[20-60)      5
[60-100)     10
[100-140)    20
[140-180)    30
[180-220)    16
[220-260)    10
[260-300)    6
[300-340)    4
[340-380)    2
[380-420)    0
[420-460)    0
[460-500)    0
[500-540]    2

Which of the following is a possible value of the median number of days that patients
were depressed?

53 170 220 290

44. Using this same histogram of 105 patients, which of the following is most likely to
be true?

The mean will be larger than the median
The median will be larger than the mean
The mean and median will be about the same

45. Using this same histogram of 105 patients, what percentage of patients had 220 or
more days of depression?
46. A survey taken in a large statistics class contained the question: "What's the fastest
you have driven a car (in miles per hour)?" The five-number summary for the 87
males surveyed is: min = 55, Q1 = 95, Median = 110, Q3 = 120, Max = 155

Should the largest observation in this data set be classified as an outlier?

47. A survey taken of 140 sports fans asked the question: "What is the most you have
ever spent for a ticket to a sporting event?" The five-number summary for the data
collected is: min = 85, Q1 = 130, Median = 145, Q3 = 150, Max = 250

Should the smallest observation be classified as an outlier?

Now let's look at the five-number summary of the age of Best Actor Oscar winners
(1970-2001). The five-number summary is: Min: 29, Q1: 38, M: 43.5, Q3: 50.5, Max: 76

48. Half of the actors won the Oscar before what age?
49. What is the range covered by the middle 50% of the ages?
50. The boxplot below displays ratings for TV shows during sweeps week.

A) 50% of the shows have a rating greater than which value?
B) What is the percentage of shows that have a rating of less than 20?
C) Within which interval of rating would you expect to find the largest number of
shows?
51. The following boxplot displays the winning times for women Ironman champions
from 1979 to 2014 (note the race was conducted twice in 1982).
A) What is the value of Q1 in this boxplot (hover over the image to see the five-
number summary)?
B) Look at the data points on the left side of the graph, and compare them to the
boxplot on the right. What do you notice about how the data are grouped, and the
corresponding parts of the boxplot?
C) What percentage of the winners had a time of less than 9.5 hours (hover over the
image to see the five-number summary)?
D) What is the IQR of this dataset?
E) Within which interval of times would you expect to find the largest number of
winners?
52. The Web site www.kiplinger.com published the "Best Values in Public Colleges,"
saying they were determined by rankings based on data from more than 500 public
four-year colleges and universities. Using out-of-state rankings that considered the
academic quality, total out-of-state costs, and average costs after financial aid, the
50 best-rated colleges were selected. The boxplot pictured displays the average
debt per graduate for these colleges.

A) What percentage of the graduates will have a debt greater than $25,000?
B) 50% of the debts owed are smaller than what amount?
C) Within which interval of debts would you expect to find the largest number of
graduates?
SCENARIO: GRADUATION RATE
The percentage of each entering freshman class that graduated on time was recorded
for each of six colleges at a major university over a period of several years.

In order to compare the graduation rates among the different colleges, we created side-
by-side boxplots (graduation rate by college), and supplemented the graph with
numerical measures.

Summary Data    College A    College B    College C    College D    College E    College F
Min             43.20        67.30        54.50        74.10        54.50        57.50
Q1              50.95        69.55        56.58        76.65        56.88        65.05
Median          63.75        70.15        67.65        79.00        59.15        72.00
Q3              70.50        73.05        71.58        81.10        63.70        81.28
Max             73.80        76.70        74.80        84.60        71.30        87.40

53. Based on the boxplots and data, which of the six colleges has the best on-time
graduation rate?
Background
At the end of a statistics course, the 27 students in the class were asked to rate the
instructor on a number scale of 1 to 9 (1 being "very poor," and 9 being "best instructor
I've ever had"). Assume that the average rating in each of the three classes is 5 (which
should be visually reasonably clear from the histograms), and recall the interpretation of
the SD as a "typical" or "average" distance between the data points and their mean. The
following table provides three hypothetical sets of rating data:

Rating       1    2    3    4    5    6    7    8    9
Class I      1    0    0    0    22   0    0    0    1
Class II     12   0    0    0    1    0    0    0    12
Class III    2    2    2    2    2    2    2    2    2

And here are the histograms of the data:

54. Judging from the table and the histograms, which class would have the largest
standard deviation?
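
The interpretation of the SD recalled in the background above, a "typical" distance between the data points and their mean, can be sketched in Python. The two small rating lists below are made up for illustration and are not taken from the table. The function name `sd` is hypothetical, and the sketch assumes the version of the formula that divides by n (many textbooks divide by n - 1 for sample data).

```python
import math

def sd(values):
    """Standard deviation: root of the average squared distance from the mean (divides by n)."""
    mean = sum(values) / len(values)
    squared_distances = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_distances) / len(values))

# Two made-up sets of ratings, both with mean 5.
tight = [4, 5, 6]    # ratings close to the mean
spread = [1, 5, 9]   # ratings far from the mean

sd_tight = sd(tight)
sd_spread = sd(spread)
print(round(sd_tight, 3), round(sd_spread, 3))
```

Both lists have the same mean, but the list whose values sit farther from that mean has the larger SD. The same reasoning applies when comparing the shapes of the three histograms.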
