
Jason Sylvester
Math 1040

Skittles Term Project Part 2






Variables are classified into four levels of measurement: Nominal, Ordinal, Interval, and
Ratio. Each level of measurement is defined by the following criteria: type of data
(qualitative/quantitative), whether there is a natural order (more means more), whether
differences (add/subtract) make sense, and whether ratios (fractions) make sense.
Candy color is a qualitative variable because its level of measurement is nominal
(name only): there is no natural order, differences don't make sense (adding/subtracting
colors is meaningless), and ratios don't make sense when classifying the variable.
The number of candies per bag is a quantitative variable because its level of
measurement is ratio. Natural order (more means more), differences (add/subtract), and
ratios (fractions) all make sense.


Summary statistics: number of candies per bag

n    Mean   Std. dev.  Median  Range  Min  Max  Q1  Q3    IQR  Mode
60   59.18  3.11       59      14     50   64   58  61.5  3.5  59

Five number summary


Min = 50, Q1 = 58, Median = 59, Q3 = 61.5, Max = 64
Fences
Lower Fence: 58-1.5(3.5) = 52.75
Upper Fence: 61.5+1.5(3.5) = 66.75
Outliers
There were two outliers, 50 and 52, both below the lower fence. The bag of skittles I
purchased had 57 total candies. It was not an outlier because it's less than the upper
fence and greater than the lower fence.
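
The fence arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `tukey_fences` is made up for illustration, and the three candidate values are the ones mentioned in the text rather than the full class data set.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Q1 and Q3 are taken from the summary statistics in the text.
lower, upper = tukey_fences(58, 61.5)

# Only the values discussed in the text: the two flagged bags and my own bag.
candidates = [50, 52, 57]
outliers = [x for x in candidates if x < lower or x > upper]

print(lower, upper, outliers)  # 52.75 66.75 [50, 52]
```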
Shape of the distributions
It is not appropriate to discuss the shape of the qualitative distribution because the
qualitative data are not numeric. However, it is appropriate to discuss the shape of the
quantitative data because there is a logical order (more means more), and both differences
(add/subtract) and ratios (fractions) make sense. Also, outliers (extremes) affect the shape
of a quantitative frequency distribution.


Skittles Term Project Part 3





The question was asked: "Can height be used to predict the number of candies that will be in a
bag of Skittles you purchase?"

What do I think the results will be? You will not be able to use the height of a person to predict
the number of candies per bag. A correlation between the two variables would be neither relevant
nor sensible. The height of a person does not influence the number of candies per bag since
bags are purchased at random.

Height is the explanatory variable, and number of candies per bag is the response variable. The
number of candies per bag is the response variable because it is the variable of interest, whose
value would be explained/predicted by another variable, height. Height is the explanatory
variable since it would explain the value of the response variable.
Scatter Diagram


















There is no significant relationship between the two variables. With n = 60, the critical value
used is .361. Because the absolute value of the correlation coefficient (0.17042887) is less than
the critical value, there is no significant linear relationship. The correlation coefficient is also
not close to -1 or 1, which would indicate a strong negative or strong positive relationship.
Because the dots are so spread out, no linear pattern is visible. The general direction of the
diagram is not easily seen, but the slope of the regression line tells us that it's positive.


This is exactly what I expected, since the height of a person is not really a good way to
predict the number of candies in a bag of skittles purchased from the store.

Because there is no significant linear relation, the best prediction is the mean of the response
variable (59.18 candies per bag).

Using the data, the regression equation would be Y (y hat) = 50.713668 + 0.1287705X. If a
person is 63.5 inches tall, we plug 63.5 in for X to get an approximation for the number of
candies per bag: 50.713668 + 0.1287705(63.5) = 58.891, or 59 candies (rounded). However, it was
not appropriate to use the regression equation since there is no linear relation.
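
As a quick check of the substitution, the regression line can be evaluated in Python. The coefficients are the ones reported above; the function name `predict_candies` is only an illustrative choice.

```python
# Coefficients reported in the text; the names are illustrative only.
INTERCEPT = 50.713668
SLOPE = 0.1287705

def predict_candies(height_in):
    """Evaluate y-hat = INTERCEPT + SLOPE * height (height in inches)."""
    return INTERCEPT + SLOPE * height_in

y_hat = predict_candies(63.5)
print(round(y_hat, 3))  # 58.891
```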

Using the regression output, we can then determine the coefficient of determination, which
measures the proportion of variation in the response variable explained by the regression line
(percent of variation in Y explained by X). To find the coefficient of determination we square the
correlation coefficient (0.17042887) and get .0290459997, or 2.9%. We see that 2.9% of the
variation in the number of candies per bag can be explained by the height of the person who
purchased the bag. Since this is so low, using height as the explanatory variable is not a good
way to predict the number of candies per bag.
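
The squaring step is small enough to verify directly. The sketch below uses only the correlation coefficient reported in the text.

```python
r = 0.17042887        # correlation coefficient reported in the text
r_squared = r ** 2    # coefficient of determination

print(f"{r_squared:.4f} -> {r_squared:.1%}")  # 0.0290 -> 2.9%
```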

If we were to assume there is a significant relationship between height and number of candies
per bag, would it be appropriate to predict the number of candies in a bag purchased by retired
Houston Rockets player Yao Ming, who is 90 inches tall? No, because Yao Ming's height of 90
inches is outside the range of heights in the sample, so the prediction would be extrapolation.

For a similar analysis on a smaller data set, I used a systematic sample: starting with the
second row of data and taking every 10th row after that, I got a sample of size 6. Plugging the
data into my calculator, I got a correlation coefficient of .1457 and a regression equation of
Y (y hat) = 52.9615 + .0769X. The critical value for n = 6 is .811. Since the absolute value of
the correlation coefficient (.1457) is less than the critical value, there is no significant
relationship between X and Y for the smaller data set.


Skittles Term Project Part 4




The bag of skittles I purchased contained the following:
Red - 16
Orange - 11
Yellow - 13
Green - 9
Purple - 8
Total Number of Candies - 57
My height in inches is 71

What is the probability that both Skittles are purple if you select them with replacement?
Give your answer correct to four decimal places.

There are a total of 8 purple skittles, and the total number of skittles in the bag is 57. The
probability of selecting two skittles (with replacement) that are both purple is (8/57)^2 = .0197

What is the probability that both Skittles are purple if you select them without replacement?
Give your answer correct to four decimal places.

The probability of selecting two purple skittles (without replacement) is (8/57) (7/56) = .0175

What is the probability that at least one Skittle is purple if you select them with replacement?

"At least one" means 1 - P(none). The probability of none is the probability that neither
selected skittle is purple, (1 - 8/57)^2. So the probability of selecting at least 1 purple
skittle (with replacement) is 1 - (1 - 8/57)^2 = .2610.
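
The three two-draw probabilities can be reproduced with exact fractions. This is a sketch using the counts from my bag; the variable names are illustrative.

```python
from fractions import Fraction

purple, total = 8, 57   # counts from the bag described in the text

p = Fraction(purple, total)
both_with = p * p                                    # both purple, with replacement
both_without = p * Fraction(purple - 1, total - 1)   # both purple, without replacement
at_least_one = 1 - (1 - p) ** 2                      # complement of "no purple", with replacement

for name, value in [("both, with replacement", both_with),
                    ("both, without replacement", both_without),
                    ("at least one, with replacement", at_least_one)]:
    print(f"{name}: {float(value):.4f}")
```

Working with `Fraction` keeps the intermediate values exact, so rounding happens only once, when the result is printed.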

Suppose all of the Skittles in the class data set are combined into one large bowl and you are
going to randomly select one Skittle.

What is the probability that you select a green Skittle?

There are 710 green skittles in the bowl, and there is a total of 3,551 skittles in the bowl. The
probability of selecting a green skittle is 710/3,551 = .1999

What is the probability that you select a Skittle that is NOT green?

To find the probability of selecting a skittle that is not green we use the complement rule, which
is 1 - P(selecting a green skittle). So the probability that the skittle is not green is 1 - .1999 =
.8001.

What is the probability that you select a Skittle that is red OR yellow?

To find the probability that you select a skittle that is red or yellow we use the addition rule for
disjoint events which is P(selecting a red) + P(selecting a yellow). So the probability of selecting
a red or yellow is (716/3551) + (726/3551) = .4061

What is the probability that you select a Skittle that is orange GIVEN that it is a secondary
color (secondary colors are green, orange and purple)?

To find the probability that a skittle is orange given that it's a secondary color, we first add
together the number of skittles that are secondary colors (green + orange + purple). The
total number of skittles that are a secondary color is 710 + 698 + 701 = 2,109. Next, we
divide the number of orange skittles by the total number of skittles that are secondary
colors. We get 698/2,109 = .3310.
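
The single-draw probabilities for the combined bowl follow the same pattern, so they can be sketched together. The color counts are the ones quoted in the answers above; the dictionary and variable names are illustrative.

```python
from fractions import Fraction

# Color counts for the combined class data, as quoted in the text.
counts = {"red": 716, "orange": 698, "yellow": 726, "green": 710, "purple": 701}
total = sum(counts.values())  # 3,551

p_green = Fraction(counts["green"], total)
p_not_green = 1 - p_green                                             # complement rule
p_red_or_yellow = Fraction(counts["red"] + counts["yellow"], total)   # addition rule, disjoint events

# Conditional probability: restrict the sample space to secondary colors.
secondary = counts["green"] + counts["orange"] + counts["purple"]
p_orange_given_secondary = Fraction(counts["orange"], secondary)

for label, value in [("green", p_green), ("not green", p_not_green),
                     ("red or yellow", p_red_or_yellow),
                     ("orange | secondary", p_orange_given_secondary)]:
    print(f"{label}: {float(value):.4f}")
```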

Problem 3: Suppose all of the Skittles in the class data set are combined into one large bowl
and you are going to randomly select ten Skittles with replacement and count how many are
yellow.

Show that this meets the requirements of the binomial probability distribution and identify n
and p.

The criteria for a binomial probability experiment are as follows:
1. Is the experiment performed a fixed number of times? Yes, 10.
2. Are the trials independent (the outcome of one trial does not affect the outcome of the
others)? Yes, because we are selecting skittles with replacement. Selecting one skittle
does not affect the outcome of selecting another skittle.
3. Are there 2 disjoint outcomes, success or failure, for each trial? Yes, either the skittle
is yellow or it's not.
4. Is the probability of success the same for each trial? Yes, you are equally likely to select a
yellow skittle in each trial.
The number of independent trials (n) of the experiment is 10.
The probability of success (drawing a yellow skittle) for each trial is p = 726/3,551 = .2044.

What is the probability that exactly 4 of the 10 Skittles are yellow?

We plug (n, p, x) into the calculator using the binomialpdf function, with n = 10, p = .2044, and
x = 4. The probability that exactly 4 of the 10 skittles are yellow is .093.

For samples of size 10, what is the expected value and standard deviation for the number of
yellow skittles that will be included?

The expected value is (n)(p), so (10)(.2044) = 2.044. The standard deviation is
sqrt(np(1 - p)) = sqrt((10)(.2044)(.7956)) = 1.28.
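
The calculator results for the binomial questions can be reproduced from the formula directly. This is a sketch: `binomial_pmf` is a hypothetical helper written out by hand, not the calculator's own function.

```python
from math import comb, sqrt

def binomial_pmf(n, p, x):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.2044   # values from the text

prob_4 = binomial_pmf(n, p, 4)   # exactly 4 yellow out of 10
mean = n * p                     # expected value np
sd = sqrt(n * p * (1 - p))       # standard deviation sqrt(np(1 - p))

print(f"{prob_4:.3f} {mean:.3f} {sd:.2f}")  # 0.093 2.044 1.28
```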

Problem 4: For this problem, treat a 2.17 ounce bag of Skittles as an individual. Suppose the
values for our class data are the parameter values for all 2.17 ounce bags of Skittles. In other
words, assume μ = mean number of candies per bag in our class data set and σ = standard
deviation of number of candies per bag in our class data set (you computed these values in
Part 2).

Describe the sampling distribution for the mean number of candies per bag for samples of 32
bags. Include center, spread and shape. Note: The shape of the SAMPLING DISTRIBUTION is
different from the shape of the population, which you determined in Part 2 of the project.

Earlier in Part 2 we found that the mean number of candies per bag is 59.18 and that the
standard deviation is 3.11. For n = 32, the sampling distribution of the mean has a center of
59.18 (mean) and a spread (standard error of the mean) of 3.11/sqrt(32) = .5498. Because the
sample size is greater than 30, we know that the shape is approximately normal.

What is the probability that the mean number of candies per bag for a sample of 32 bags is
greater than 58.5?

To find P(x bar > 58.5) we first calculate the z score: (58.5 - 59.18)/(.5498) = -1.24. We then
find the probability using the table or calculator. We get a probability of 1 - .1075 = .8925. So
there's an 89.25% chance that a sample of 32 bags has a mean number of candies per bag that is
greater than 58.5.
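
The same calculation can be sketched with Python's standard library, treating the class values as the population parameters as the problem instructs. Because the code does not round z to -1.24 before looking up the probability, its answer is about .892 and differs slightly from the table value of .8925.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 59.18, 3.11, 32   # class-data values treated as parameters

se = sigma / sqrt(n)                          # standard error of the mean
sampling_dist = NormalDist(mu=mu, sigma=se)   # approximate sampling distribution of x-bar

z = (58.5 - mu) / se
p_greater = 1 - sampling_dist.cdf(58.5)       # P(x-bar > 58.5)

print(f"se={se:.4f} z={z:.2f} p={p_greater:.4f}")
```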





Skittles Term Project Part 5



Explain in general the purpose and meaning of a confidence interval. (5 points)



In most cases it's not feasible to measure an entire population. Instead, we take a simple
random sample of size n from a population whose parameter is unknown and use it to build
an interval. A confidence interval is an interval of numbers based on a point estimate that
gives a range of likely values for the unknown parameter, together with a stated level of
confidence in the method.

Identify the requirements for computing confidence intervals. List the requirements
separately for a confidence interval for a population proportion and for a population
mean. (5 points)

In order to construct a confidence interval for the population proportion you need the
following:
1. An approximately normal sampling distribution of p hat: np(1 - p) ≥ 10
2. Independent trials: n ≤ 0.05N (sample size is no more than 5% of the population).
In order to construct a confidence interval for the population mean you need the
following:
1. Sample data come from a simple random sample or randomized experiment.
2. Sample size is small relative to the population: n ≤ 0.05N
3. The data come from a population that is normally distributed, OR the sample size is
large (n ≥ 30).
Using values for the class data that you computed in Part 2 of the project, construct a 99%
confidence interval estimate for the true proportion of yellow candies using the class data
as your sample. Remember that for this computation, n is the number of CANDIES for the
entire class data. Include all your work, showing the formula used and appropriate values
inserted (neatly written and scanned or typed). (10 points)

Since our parameter of interest is a proportion of qualitative data, we will compute the
confidence interval for the proportion. We will use n = 3,551, x = 726, and a c-level of .99.
Using technology, we use the 1-PropZInt command and get a confidence interval of
(.18702, .22188), and p hat of .2044.
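
For reference, the interval reported by the calculator can be rebuilt from the usual one-proportion z-interval formula, p hat ± z*·sqrt(p hat(1 - p hat)/n). The function below is a hand-written sketch with an illustrative name, not the calculator's implementation.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_interval(x, n, confidence):
    """One-proportion z-interval: p_hat +/- z* * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# x = yellow candies, n = all candies in the class data (values from the text).
low, high = proportion_z_interval(x=726, n=3551, confidence=0.99)
print(f"({low:.5f}, {high:.5f})")
```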

Give an appropriate interpretation of your interval. (5 points)

We are 99% confident that the true proportion of yellow skittles is between .18702 and
.22188.


Based on your interval for the true proportion of yellow candies, was the proportion of
yellow candies in the single bag of candy you purchased a likely value for the true
population proportion? Explain how you know using actual values from your data and
computations. (5 points)

The proportion of yellow candies in my bag was extremely close to the upper bound of the
confidence interval, but just outside it, so it was not a likely value for the true population
proportion. The proportion of yellow candies in my bag was 13/57, which equals about .2281,
and the upper bound of the confidence interval is .22188.

Using values you computed in Part 2 of the project, construct a 95% confidence interval
estimate for the true mean number of candies per bag using the class data as your
sample, but for this computation, n is the number of BAGS. Include all your work, showing
the formula used and appropriate values inserted (neatly written and scanned or
typed). (10 points)

Since our parameter of interest is the mean of quantitative data, we will compute the
t-interval. We will use n = 60, mean = 59.18, standard deviation = 3.11, and a c-level of .95.
Using technology, we use the t-interval command and get a confidence interval of
(58.377, 59.983).
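
The t-interval can be rebuilt by hand as x bar ± t*·s/sqrt(n). Python's standard library has no t distribution, so the sketch below hard-codes the critical value t* = 2.001 for 95% confidence with 59 degrees of freedom. That value is an assumption taken from a t-table, not something reported in the project.

```python
from math import sqrt

n, x_bar, s = 60, 59.18, 3.11   # values from the text
T_STAR = 2.001                  # ASSUMED table value: 95% confidence, df = n - 1 = 59

margin = T_STAR * s / sqrt(n)   # critical value times the standard error
low, high = x_bar - margin, x_bar + margin

print(f"({low:.3f}, {high:.3f})")
```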

Give an appropriate interpretation of your interval. (5 points)

We are 95% confident that the true mean number of candies per bag is between 58.377 and
59.983.

Based on your interval for the true mean number of candies per bag, was the total
number of candies in the single bag you purchased a likely value for the population
mean? Explain how you know using actual values from your data and computations. (5
points)

The number of candies in my bag was slightly below the lower bound of the confidence
interval, so it was not a likely value for the population mean, although it was close. The
number of candies in my bag was 57, which is slightly less than the lower bound (58.377) of
the confidence interval.




Skittles Term Project Part 6




Purpose and meaning of a hypothesis test: A hypothesis test is a procedure, based on sample
results and probability, for testing statements about a population. It weighs two statements (the
null hypothesis and the alternative hypothesis) and decides whether the sample provides enough
evidence to reject the null hypothesis in favor of the alternative.

1. Null hypothesis: H0: p = .20, 20% of all skittles are red. Alternative hypothesis: H1: p ≠
.20, the proportion of red skittles is not 20%.
2. Conditions for performing the hypothesis test:
a. Simple random sample or randomized experiment? Convenience sampling, but the
test will work in this case.
b. n ≤ .05N? 3551 ≤ .05(all skittles)? Yes
c. np(1 - p) ≥ 10? (3551)(.20)(1 - .20) ≥ 10? Yes
3. The Test Statistic: 1-PropZTest (.20(p), 716(x), 3551(n)), so Z0 = .2433
4. The P-Value: .8076
5. Compare the p-value to alpha (reject H0 if p-value < alpha). .8076 (p-value) > .05 (alpha).
So we do not reject the null hypothesis. There is insufficient evidence to conclude that
the proportion of red skittles is not 20%.
6. The Type I and Type II errors are as follows:
a. Type I error: Reject H0 when H0 is true. Reject that 20% of all skittles are red,
when 20% of all skittles are actually red.
b. Type II error: Do not reject H0 when H0 is false and H1 is true. Do not reject that
20% of all skittles are red, when the proportion of red skittles is actually different
from 20%.
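
The test statistic and p-value for the proportion test can be sketched from the standard one-proportion z-test formula. The function name is illustrative, and the p-value it computes may differ from the calculator's .8076 in the last decimal place.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(x, n, p0):
    """Two-sided z-test of H0: p = p0; returns (z, p_value)."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # standard error uses p0 under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # both tails
    return z, p_value

# x = red candies, n = all candies in the class data (values from the text).
z0, p_value = one_proportion_z_test(x=716, n=3551, p0=0.20)
reject = p_value < 0.05

print(f"z0={z0:.4f} p-value={p_value:.4f} reject H0: {reject}")
```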

1. Null hypothesis: H0: μ = 58, the mean number of skittles per bag is 58. Alternative
hypothesis: H1: μ > 58, the mean number of skittles per bag is greater than 58.
2. Conditions for performing the hypothesis test:
a. Simple random sample or randomized experiment? Convenience sampling, but the
test will work in this case.
b. No outliers and normal, or n ≥ 30? n = 60 ≥ 30.
c. Independent, n ≤ .05N? 60 ≤ .05(all bags)? Yes
3. The Test Statistic: T-Test (58(μ0), 59.18(xBar), 3.11(s), 60(n), Right Tailed), so T0 =
2.9390
4. The P-Value: .0023
5. Compare the p-value to alpha (reject H0 if p-value < alpha). .0023 (p-value) < .05 (alpha).
So we reject the null hypothesis. There is sufficient evidence to conclude that the true
mean number of skittles per bag is greater than 58.
6. Interpret the p-value: If the mean number of skittles per bag is really 58, then the
probability of getting a sample mean of 59.18 or greater is .0023.
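
The t statistic for the mean test is easy to reproduce by hand. The sketch below computes only the statistic and the degrees of freedom; the p-value needs a t distribution, which Python's standard library does not provide, so that step is left to the calculator.

```python
from math import sqrt

mu0, x_bar, s, n = 58, 59.18, 3.11, 60   # values from the text

t0 = (x_bar - mu0) / (s / sqrt(n))       # one-sample t statistic
df = n - 1                               # degrees of freedom

print(f"t0={t0:.4f} df={df}")  # t0=2.9390 df=59
```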


Skittles Term Project Part 7 Reflection





This semester we have been learning the concepts of statistics. As we've learned each
concept we've applied it to our skittles project. At the beginning of the semester all of the
students in the class were asked to purchase a bag of original skittles. All the students counted
the number of orange, green, yellow, red, and purple skittles in their bag. We then combined all
the class's data and used it throughout the semester for parts 2-6 of the project. Each part of
the project helped us solidify the statistical concepts we had just learned.
Throughout this semester I have learned a ton about collecting data, the importance of data,
and what we can learn/infer from the data that we collect. I learned that data can easily be
misrepresented, and that collecting bad/wrong data can lead to incorrect assumptions and
decisions. I learned that the sampling methods used to collect data are extremely important, and
sampling must be done in a way that will not skew or misrepresent your explanatory or response
variables. I also learned that graphs can be misleading, and that it's important to pay close
attention to what you're looking at. I've been able to identify graphs on the web and in social
media that looked great but were misleading.
I've learned a lot about inference, and what we can conclude and predict based on the
information we collect. For instance, during the project we were asked, "Can height be used to
predict the number of candies that will be in a bag of skittles?" Obviously we couldn't, as this
doesn't make sense, but it's been interesting to think about the real-world applications and what
can be concluded from data that's collected. We can actually use math to see if there is any
correlation between two variables. I've learned a lot throughout this semester and it's been
interesting to learn statistics. I can now apply what I've learned to my degree, and it's exciting!