
Jayel Kirby

MATH 1040 T/R @ 10am


November 25, 2014

Term Project
Stat The Rainbow
Introduction
I started this project by purchasing a 2.17-ounce bag of Original Skittles. I counted and
recorded the number of candies of each color: red (13), orange (8), yellow (20), green (9), and
purple (11). The total number of candies in my bag of Skittles was 61. This information was
submitted to my instructor. All students in my class were given the same assignment. Our
instructor compiled the data from 38 students and reported the totals. Out of 2,435 candies (in 38
bags), 500 were red, 446 were orange, 474 were yellow, 503 were green, and 512 were purple.
Using this data, I developed Pie and Pareto Charts showing the number of candies by color (as
shown below and on the following page).
Organizing and Displaying Categorical Data: Colors
[Pie chart: Number of Skittles, By Color, In 38 Bags. Purple, 512 (21.03%); Green, 503 (20.66%);
Red, 500 (20.53%); Yellow, 474 (19.47%); Orange, 446 (18.32%)]

[Pareto chart: Number Of Skittles In 38 Bags. Bars in descending order: Purple, Green, Red,
Yellow, Orange; vertical axis from 0 to 500]

According to these charts, the counts for the five colors are fairly similar; no single color
stands out dramatically from the others.
The following table demonstrates the data from my own sample bag in comparison to the
data collected from the class as a whole:
Comparison of Individual Data to Class Data

Color     My Sample   Class Sample   My Proportion   Class Proportion
Red       13          500            .213            .205
Orange    8           446            .131            .183
Yellow    20          474            .328            .195
Green     9           503            .148            .207
Purple    11          512            .180            .210
Total     61          2435           1.000           1.000
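
The proportions in the table can be checked with a few lines of Python. This is only a minimal sketch: the counts are the ones reported above, while the names (my_counts, class_counts, proportions) are made up for illustration.

```python
# Color counts taken from the comparison table.
my_counts = {"Red": 13, "Orange": 8, "Yellow": 20, "Green": 9, "Purple": 11}
class_counts = {"Red": 500, "Orange": 446, "Yellow": 474, "Green": 503, "Purple": 512}


def proportions(counts):
    """Return each color's share of the total, rounded to three decimals."""
    total = sum(counts.values())
    return {color: round(n / total, 3) for color, n in counts.items()}


my_props = proportions(my_counts)
class_props = proportions(class_counts)

# Print one row per color: my proportion, then the class proportion.
for color in my_counts:
    print(f"{color:<7} {my_props[color]:.3f}  {class_props[color]:.3f}")
```

Dividing each count by its own total (61 for my bag, 2435 for the class) is what makes the two samples comparable even though their sizes are so different.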

I was surprised that my own findings did not agree with those of the class. Because my own
bag had far more yellow candies than any other color (20, compared with 13 for red, the next
most common), I assumed that most bags would contain a greater number of yellow candies. Yet,
according to the class data, yellow candies were outnumbered by every other color except orange.
Doing this project has helped me to better understand the importance of using a large data
sample in order to make more accurate inferences about an entire population.

Organizing and Displaying Quantitative Data: the Number of Candies per Bag
Another set of information that the class data supplied was the number of candies in each
bag. There were 61 candies in my bag. As stated earlier, the total number of candies in all 38
bags was 2,435. The mean number of candies in each bag was 64.1. The standard deviation of
the number of candies per bag was 13.2; the 5-number summary was: 45, 59, 61, 62, 114.
Since my bag had 61 candies, it was exactly the same as the median number in our class, yet it
was not the same as the mean. Below, you will see a histogram and a box plot that I developed
with this data.

These charts (above, on previous page) indicate a right-skewed distribution of data, with
a somewhat bell-shaped center. I didn't expect to see such a gap between the third quartile and
the right whisker. When this data is drawn up in a modified box plot (as shown below), a number
of outliers are revealed. I believe this suggests the possibility that a few (I'm guessing 3)
students gathered their data from Skittles packages that were larger than the designated
2.17-ounce size. If that was the case, then their data literally skewed the results, as the box
plot below reflects a slightly left-skewed, rather than extremely right-skewed, distribution. It
would also have been an example of a non-random sampling error, since the data wasn't collected
from similar samples (sample bags of the same package size).
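
The outliers in a modified box plot can be identified from the 5-number summary alone. The sketch below assumes the usual 1.5 x IQR rule for the fences; the rule itself is not stated in this report, so treat it as an assumption.

```python
# Five-number summary reported above: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = 45, 59, 61, 62, 114

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr     # values above this are flagged as outliers

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("minimum is an outlier:", minimum < lower_fence)
print("maximum is an outlier:", maximum > upper_fence)
```

Under that rule the fences are 54.5 and 66.5, so both the smallest bag (45) and the largest bag (114) are flagged. The summary alone cannot show how many other bags fall outside the fences.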

Quantitative vs. Categorical Reflection


It is important to differentiate between quantitative data and categorical data.
Categorical data cannot be meaningfully averaged. For example, it wouldn't make sense to
compute the mean of the numbers on the jerseys of a football team; the answer would be
meaningless. Jersey numbers are used to identify specific players, not to count or measure them.
So it stands to reason that different types of data require different charts to reflect them.
When comparing the number of different colors of candies within a sample, I used a Pie Chart and
a Pareto Chart, because these charts work best to display categorical data, such as color. On
the other hand, when displaying quantitative data, histograms and box plots are more
appropriate. Histograms work well for quantitative data, because they have class boundaries that
range from a low limit to a high one, and can include a full range of values. The color of a
Skittles candy doesn't fall within a range; either it is one color, or it is another. Since
Pareto Charts have gaps between bars, and Histograms do not, it wouldn't make sense to use a
Histogram to display categorical information. Although it may sound a bit confusing and
complicated at first glance, common sense guides statisticians to recognize the appropriate
display for each type of data.

Confidence Interval Estimates


A confidence interval gives a low number and a high number between which a population
value (such as a proportion, mean, or standard deviation) is expected to fall. For example, a
significance level of .05 corresponds to a 95% confidence level: about 95% of the intervals
built this way would contain the true value. Below, you will find some confidence intervals
based on our previous candy data. The work for this information is on the following page.

Specific Value                                    Confidence Level   Confidence Interval
Proportion of yellow candies                      99%                0.174 < p < 0.215
Mean number of candies per bag                    95%                59.762 < μ < 68.438
Standard deviation of number of candies per bag   98%                11.26 < σ < 20.76

Based on these confidence interval estimates, I can make the following statements:

I have 99% confidence that the true proportion of yellow candies among all Skittles is
between 17.4% and 21.5%.

I have 95% confidence that the true mean number of candies per bag of Skittles is between
59.8 and 68.4.

I have 98% confidence that the true standard deviation of the number of candies per bag is
between 11.3 and 20.8.
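
The three intervals can be reproduced from the class summary figures. The following is a rough sketch rather than the original worked solution. The t and chi-square critical values are table values entered by hand and are assumptions: 2.026 is the two-tailed 95% value for 37 degrees of freedom, and the chi-square values are those for 30 degrees of freedom, which happen to reproduce the bounds reported for the standard deviation. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Summary figures from the class data reported earlier.
n_candies = 2435      # total candies in 38 bags
n_yellow = 474        # yellow candies
n_bags = 38
mean_per_bag = 64.1   # sample mean, as rounded in the text
sd_per_bag = 13.2     # sample standard deviation, as rounded in the text

# ASSUMPTION: critical values read from standard tables, not from the report.
T_CRIT_95_DF37 = 2.026
CHI2_RIGHT_98_DF30 = 50.892
CHI2_LEFT_98_DF30 = 14.953


def proportion_ci(successes, n, confidence):
    """Normal-approximation interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def mean_ci(mean, sd, n, t_crit):
    """t interval for a population mean, given a table critical value."""
    margin = t_crit * sd / sqrt(n)
    return mean - margin, mean + margin


def sd_ci(sd, n, chi2_right, chi2_left):
    """Chi-square interval for a population standard deviation."""
    sum_sq = (n - 1) * sd ** 2
    return sqrt(sum_sq / chi2_right), sqrt(sum_sq / chi2_left)


p_low, p_high = proportion_ci(n_yellow, n_candies, 0.99)
m_low, m_high = mean_ci(mean_per_bag, sd_per_bag, n_bags, T_CRIT_95_DF37)
s_low, s_high = sd_ci(sd_per_bag, n_bags, CHI2_RIGHT_98_DF30, CHI2_LEFT_98_DF30)

print(f"{p_low:.3f} < p < {p_high:.3f}")
print(f"{m_low:.3f} < mu < {m_high:.3f}")
print(f"{s_low:.2f} < sigma < {s_high:.2f}")
```

With software that supplies exact chi-square values for 37 degrees of freedom, the interval for the standard deviation would come out somewhat differently, so that last line should be read with the table assumption in mind.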

Hypothesis Tests
When a claim is made about a characteristic of an entire population, a hypothesis test
can be run on a simple random sample to find how likely the sample results would be if the claim
were true. With the data from such a test, a determination can be made, at a specified
significance level, whether there is sufficient evidence to reject the original claim.
For instance, for the claim that 20% of all Skittles candies are red, I can run a hypothesis
test at a 0.05 significance level. Since the z-score for this test (0.65) is less than the critical value
(1.96), there isn't sufficient evidence to reject the claim that 20% of all Skittles candies are red.
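
A minimal sketch of this test, using the class totals reported earlier (500 red candies out of 2435), might look like the following. The variable names are illustrative only, and the two-tailed setup is an assumption based on the critical value of 1.96.

```python
from math import sqrt
from statistics import NormalDist

n = 2435        # candies in the class sample
reds = 500      # red candies in the class sample
p0 = 0.20       # claimed proportion of red candies
alpha = 0.05    # significance level

p_hat = reds / n
# One-proportion z statistic, using the claimed proportion in the standard error.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value

reject = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject claim: {reject}")
```

The statistic comes out near 0.66, in line with the 0.65 reported above; the small difference presumably comes from rounding in the hand calculation. Either way it falls well short of 1.96.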
Another example would be to test the accuracy of the claim that the mean number of
candies in a bag of Skittles is 55, using a 0.01 significance level. Since the t-stat for this test
(4.250) is greater than the critical value (2.715), there is sufficient evidence to reject the claim
that the mean number of candies in a bag of Skittles is 55.
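
The t statistic can be sketched the same way from the class mean (64.1), standard deviation (13.2), and sample size (38 bags). In this sketch the hypothesized mean is taken to be 55, because that is the value that reproduces the reported statistic of 4.250; treat it as an assumption. The critical value 2.715 is the one quoted above.

```python
from math import sqrt

n = 38           # bags in the class sample
x_bar = 64.1     # sample mean candies per bag
s = 13.2         # sample standard deviation
mu0 = 55         # ASSUMPTION: hypothesized mean that reproduces t = 4.250
t_crit = 2.715   # critical value quoted in the text (0.01 level, two-tailed)

# One-sample t statistic: distance from the claim in standard-error units.
t_stat = (x_bar - mu0) / (s / sqrt(n))

reject = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, critical value = {t_crit}, reject claim: {reject}")
```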
The work for both of these hypothesis tests can be found on the following page.

Interval Estimates and Hypothesis Testing: Reflection


The purpose of using confidence interval estimates is to be able to draw conclusions
about the whole population based on the data from a sample. I can't possibly count how many
candies are in every Skittles package to find out the true proportion of yellow candies. But with a
sample size of 38 bags, I can get relatively close. I would be able to get even closer to the true
proportion if I used a larger sample size, like 50 or even 100 bags.
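
The effect of sample size can be illustrated by comparing the 99% margin of error for the proportion of yellow candies at different sample sizes. This is a sketch under assumptions: it holds the class proportion (474 out of 2435) fixed, and the 100-bag case is hypothetical, using roughly 64.1 candies per bag to get 6410 candies.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 474 / 2435                 # class proportion of yellow candies
z = NormalDist().inv_cdf(0.995)    # critical value for 99% confidence


def margin_of_error(n):
    """Margin of error for a proportion estimated from n candies."""
    return z * sqrt(p_hat * (1 - p_hat) / n)


# 61 candies = my one bag, 2435 = the 38 class bags,
# 6410 = a HYPOTHETICAL 100 bags at about 64.1 candies each.
for n in (61, 2435, 6410):
    print(f"n = {n:>4}: margin of error = {margin_of_error(n):.3f}")
```

The margin shrinks as the number of candies grows, which is why a single bag says little about the true proportion while the combined class data pins it down much more tightly.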
The purpose of hypothesis testing is to check the accuracy of a claim concerning an entire
population, by testing data obtained from a sample. The two claims on the previous page were
good examples of this. Still, there is the possibility of error. Earlier, I stated my suspicion that
three students gathered data from bags that were larger than 2.17 ounces. If that was the case, our
confidence intervals and hypothesis testing could be a bit off. Three out of 38 may not seem like
a lot, but it is 7.9%, which exceeds the 5% rule. So, to reach more accurate conclusions, I would
need to have data from a sample where all of the information was gathered from bags that were
the same size.