Statistics
Terri Miller revised July 14, 2015
Basic Terms.
A population is the set of all objects under study, a sample is any subset of a population,
and a data point is an element of a set of data.
The frequency is the number of times a particular data point occurs in the set of data. A
frequency distribution is a table that lists each data point and its frequency. The relative
frequency is the frequency of a data point expressed as a percentage of the total number
of data points.
x f
1 1
3 3
4 4
2 2
5 1
Data is often described as ungrouped or grouped. Ungrouped data is data given as individual
data points. Grouped data is data given in intervals.
1, 3, 6, 4, 5, 6, 3, 4, 6, 3, 6
Number of
television sets Frequency
0 2
1 13
2 18
3 0
4 10
5 2
Total 45
Example 5. Grouped data.
Organizing Data.
Given a set of data, it is helpful to organize it. This is usually done by creating a frequency
distribution.
Example 6. Ungrouped data.
Given the following set of data, we would like to create a frequency distribution.
1 5 7 8 2
3 7 2 8 7
To do this we will count up the data by making a tally (a tick mark in the tally column for
each occurrence of the data point). As before, we will designate the data points by x.
x tally
1 |
2 ||
3 |
4
5 |
6
7 |||
8 ||
Now we add a column for the frequency, which is simply the number of tick marks for
each data point. We will also total the number of data points. As we have done previously,
we will represent the frequency with f.
x tally f
1 | 1
2 || 2
3 | 1
4 0
5 | 1
6 0
7 ||| 3
8 || 2
Total 10
This is a frequency distribution for the data given. We could also include a column for the
relative frequency as part of the frequency distribution. We will use rel f to indicate the
relative frequency.
x       tally   f    rel f
1       |       1    1/10 ∗ 100% = 10%
2       ||      2    2/10 ∗ 100% = 20%
3       |       1    10%
4               0    0%
5       |       1    10%
6               0    0%
7       |||     3    3/10 ∗ 100% = 30%
8       ||      2    20%
Totals          10   100%
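The tally procedure of Example 6 can also be sketched in a few lines of Python. This is only an illustration, not part of the calculator instructions; the variable names (data, counts, table) are chosen here for the sketch.

```python
from collections import Counter

# Data from Example 6.
data = [1, 5, 7, 8, 2, 3, 7, 2, 8, 7]

counts = Counter(data)   # frequency f of each data point x
n = len(data)            # total number of data points

# List every value from the lowest to the highest so that data points
# with frequency 0 (here 4 and 6) still appear in the table.
table = []
for x in range(min(data), max(data) + 1):
    f = counts[x]            # a Counter returns 0 for a value never seen
    rel_f = f * 100 / n      # relative frequency as a percentage
    table.append((x, f, rel_f))

print("x  f  rel f")
for x, f, rel_f in table:
    print(f"{x}  {f}  {rel_f:.0f}%")
print(f"Total {n}")
```

The frequencies should add up to the number of data points, and the relative frequencies to 100%, just as in the Totals row of the table.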
For our next example, we will use the data to create groups (or categories) for the data and
then make a frequency distribution.
Example 7. Grouped Data.
Given the following set of data, we want to organize the data into groups. We have decided
that we want to have 5 intervals.
26 18 21 34 18
38 22 27 22 30
25 25 38 29 20
24 28 32 33 18
Since we want to group the data, we will need to find out the size of each interval. To do
this we must first identify the highest and the lowest data point. In our data the highest
data point is 38 and the lowest is 18. Since we want 5 intervals, we make the computation
$$\frac{\text{highest} - \text{lowest}}{\text{number of intervals}} = \frac{38 - 18}{5} = \frac{20}{5} = 4$$
Since we need to include all points, we always take the next integer above the computed
value as the length of our intervals. Since we computed 4, the length of our
intervals will be 5. Now we set up the first interval
lowest ≤ x < lowest + 5 which results in 18 ≤ x < 23.
MAT 142 - Module ST 5
Our next interval is obtained by adding 5 to each end of the first one:
18 ≤ x < 23
23 ≤ x < 28
28 ≤ x < 33
33 ≤ x < 38
38 ≤ x < 43.
Now we are ready to tally the data and make the frequency distribution. Be careful to make
sure that a data point that is the same number as the end of the interval is placed in the
correct interval. This means that the data point 33 is counted in the interval 33 ≤ x < 38
and NOT in the interval 28 ≤ x < 33.
x              tally     f    rel f
18 ≤ x < 23    |||||||   7    7/20 ∗ 100% = 35%
23 ≤ x < 28    |||||     5    5/20 ∗ 100% = 25%
28 ≤ x < 33    ||||      4    4/20 ∗ 100% = 20%
33 ≤ x < 38    ||        2    2/20 ∗ 100% = 10%
38 ≤ x < 43    ||        2    10%
Totals                   20   100%
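The grouping steps of Example 7 (interval length, interval endpoints, tally) can be sketched in Python as follows. The names used (num_intervals, width, intervals, freq) are illustrative choices for this sketch only.

```python
import math

# Data from Example 7.
data = [26, 18, 21, 34, 18, 38, 22, 27, 22, 30,
        25, 25, 38, 29, 20, 24, 28, 32, 33, 18]
num_intervals = 5

lowest, highest = min(data), max(data)
computed = (highest - lowest) / num_intervals   # (38 - 18) / 5 = 4.0

# Take the next integer above the computed value, so 4 becomes 5.
width = math.floor(computed) + 1

# Each interval is lower <= x < lower + width.
intervals = [(lowest + i * width, lowest + (i + 1) * width)
             for i in range(num_intervals)]

# A data point equal to an endpoint (such as 33) is counted in the
# interval where it is the lower end.
freq = [sum(1 for x in data if lo <= x < hi) for lo, hi in intervals]

n = len(data)
for (lo, hi), f in zip(intervals, freq):
    print(f"{lo} <= x < {hi}   f = {f}   rel f = {f * 100 / n:.0f}%")
```

Writing the test as lo <= x < hi is what places 33 in the interval 33 ≤ x < 38 rather than in 28 ≤ x < 33.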
Histogram.
Now that we have the data organized, we want a way to display the data. One such display
is a histogram, which is a bar chart that shows how the data are distributed among each
data point (ungrouped) or in each interval (grouped).
Number of
television sets Frequency
0 2
1 13
2 18
3 0
4 10
5 2
[Figure: histogram of the television-set data, frequency against number of television sets]
Example 9. Histogram example for grouped data. We will use the data from example 7.
The histogram would look as follows.
Mode.
The mode is the data point which occurs most frequently. It is possible to have more than
one mode; if there are two modes, the data is said to be bimodal. It is also possible for
a set of data to have no mode; this situation occurs if the number of modes gets to
be “too large”. It is not really possible to define “too large”, but one should exercise good
judgment. A reasonable, though very generous, rule of thumb is that if the data
points accounted for in the list of modes make up half or more of the data points, then there
is no mode.
Note: if the data is given as a list of data points, it is often easiest to find the mode by
creating a frequency distribution. This is certainly the most organized method for finding
it. In our examples we will use frequency distributions.
Example 10. A data set with a single mode.
Consider the data from example 8:
Number of
television sets Frequency
0 2
1 13
2 18
3 0
4 10
5 2
You can see from the table that the data point which occurs most frequently is 2 as it has a
frequency of 18. So the mode is 2.
Example 11. A data set with two modes.
Consider the data:
Number of
hours of television Frequency
0 1
0.5 4
1 8
1.5 9
2 13
2.5 10
3 11
3.5 13
4 5
4.5 3
You can see from the table that the data points 2 and 3.5 both occur with the highest
frequency of 13. So the modes are 2 and 3.5.
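Reading the modes off a frequency distribution amounts to finding the largest frequency and listing every data point that attains it. A minimal Python sketch using the table from Example 11 (the dictionary name freq is an illustrative choice):

```python
# Frequency distribution from Example 11: hours of television -> frequency.
freq = {0: 1, 0.5: 4, 1: 8, 1.5: 9, 2: 13,
        2.5: 10, 3: 11, 3.5: 13, 4: 5, 4.5: 3}

highest = max(freq.values())   # the largest frequency in the table

# Every data point whose frequency equals the largest one is a mode.
modes = sorted(x for x, f in freq.items() if f == highest)

print(highest, modes)
```

The sketch only lists the candidates; deciding whether there are "too many" modes to report is still the judgment call described above.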
Median.
The median is the data point in the middle when all of the data points are arranged in
order (high to low or low to high). To find where it is, we take into account the number
of data points. If the number of data points is odd, divide the number of data points by 2
and then round up to the next integer; the resulting integer is the location of the median.
If the number of data points is even, there are two middle values. We take the number of
data points and divide by 2; this integer is the location of the first of the two middle
values, and the next location is the second. We then average these two middle values to get the median.
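The location rule above can be sketched as a short Python function. The two sample lists at the bottom are made up for illustration and do not come from the examples in this module.

```python
import math

def median(data):
    """Median by the location rule: sort, then pick the middle position(s)."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # Odd count: divide by 2 and round up to get the 1-based location.
        location = math.ceil(n / 2)
        return ordered[location - 1]
    # Even count: n / 2 is the first middle location, the next one is the second.
    first = n // 2
    return (ordered[first - 1] + ordered[first]) / 2

# Hypothetical data, for illustration only.
print(median([7, 1, 5, 3, 9]))   # five points: location 3 of 1, 3, 5, 7, 9
print(median([4, 1, 3, 2]))      # four points: average of locations 2 and 3
```

The subtraction of 1 is needed only because Python lists are numbered from 0 while the locations in the text are numbered from 1.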
Example 13. An odd number of data points with no frequency distribution.
Mean.
The mean is the average of the data points; it is denoted x̄. There are three types of data
for which we would like to compute the mean: ungrouped of frequency 1, ungrouped with a
frequency distribution, and grouped.
The first type, ungrouped of frequency 1, is data given to you as a list
and not organized into a frequency distribution. When this happens, we compute the
average as we have always done: add up all of the data points and divide by the number of
data points. To write a formula for this, we use the capital Greek letter sigma, Σx. This
just means to add up all of the data points. We will use n to represent the number of data
points.
$$\text{mean: } \bar{x} = \frac{\sum x}{n}$$
This corresponds to the left hand column of your calculator instructions.
Example 16. Given the ungrouped data list below:
10 15 13 25 22 53 47
We would enter the data into the calculator following the instructions in the left hand column;
the result is x̄ = 26.4285714.
When we have a frequency distribution for the data, we take the average as before but
remember that the frequency gives the number of times that the data point occurs.
$$\text{mean: } \bar{x} = \frac{\sum (f x)}{n}$$
This corresponds to the right hand column of your calculator instructions.
Example 17. Given the frequency distribution for ungrouped data below:
Number of
television sets Frequency
0 2
1 13
2 18
3 0
4 10
5 2
We would enter the data into our calculators following the directions in the right hand
column. The result is x̄ = 2.2.
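Both mean formulas can be checked without a calculator. The Python sketch below recomputes Examples 16 and 17; the variable names are illustrative choices.

```python
# Example 16: ungrouped data given as a list (every frequency is 1).
data = [10, 15, 13, 25, 22, 53, 47]
mean_list = sum(data) / len(data)          # sum of x divided by n

# Example 17: ungrouped data with a frequency distribution
# (number of television sets -> frequency).
freq = {0: 2, 1: 13, 2: 18, 3: 0, 4: 10, 5: 2}
n = sum(freq.values())                     # n is the total of the frequencies
mean_freq = sum(f * x for x, f in freq.items()) / n   # sum of f*x divided by n

print(round(mean_list, 7), mean_freq)
```

Note that for a frequency distribution n is the sum of the frequencies (45 here), not the number of rows in the table.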
Our final type of data is grouped data. This requires a computation before we can begin.
Since we cannot enter the entire interval as a data point, we use a representative for each
interval, xi. This representative is the midpoint of the interval. To find the midpoint of an
interval, add the two endpoints and divide by 2. These midpoints are the numbers that you use as
data points for computing the mean.
Example 18. Mean for Grouped data.
We will use the data from example 7 again:
x f
18 ≤ x < 23 7
23 ≤ x < 28 5
28 ≤ x < 33 4
33 ≤ x < 38 2
38 ≤ x < 43 2
Total 20
Start by calculating the representative for each interval.
$$\frac{23 + 18}{2} = \frac{41}{2} = 20.5$$
Since this is the midpoint of the first interval and the intervals have length 5, we find the
rest by adding 5 to this one.
x xi f
18 ≤ x < 23 20.5 7
23 ≤ x < 28 25.5 5
28 ≤ x < 33 30.5 4
33 ≤ x < 38 35.5 2
38 ≤ x < 43 40.5 2
Totals 20
Now enter the data as directed using the right hand column of the calculator directions. You
use xi as the data point (list 1) and the frequency as usual in list 2. The result should be
x̄ = 27.25.
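The grouped-data computation of Example 18 can be sketched in Python as well: find the midpoint of each interval, then apply the frequency formula with the midpoints as data points. The names intervals, midpoints and grouped_mean are illustrative.

```python
# Intervals (lower end, upper end) and frequencies from Example 18.
intervals = [(18, 23), (23, 28), (28, 33), (33, 38), (38, 43)]
freq = [7, 5, 4, 2, 2]

# Representative of each interval: add the two endpoints and divide by 2.
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

n = sum(freq)
grouped_mean = sum(f * xi for xi, f in zip(midpoints, freq)) / n

print(midpoints, grouped_mean)
```

Because each interval is replaced by its midpoint, the result is an approximation of the mean of the original 20 data points rather than their exact average.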
Measures of Dispersion
In the last section we looked at measures of central tendency. These let us know what the
middle of a set of data looks like. We are now going to look at measures of dispersion. These
will tell us how spread out a set of data is.
One measure of spread is the range. The range is the difference between the highest data
value and the lowest data value.
We are going to be using the following data for the next few examples. Below is a list of the
heights, in inches, of the 2015/16 Phoenix Suns roster players.
84 73 78 85 77 75 85 82 75 82 76 78 80
Example 19. What is the range of the heights of the players on the current Phoenix Suns
roster?
Solution: The tallest player is 85 inches and the shortest player is 73 inches. This means
that the range of the heights is 85 − 73 = 12.
The standard deviation is a number that describes the amount of variation or dispersion
of a set of data values. The standard deviation is the square root of the variance.
The standard deviation is the square root of the sum of the squares of the differences between
each data value and the mean, divided by one less than the number of data values.
$$S_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Example 20. Find the standard deviation of the heights of the players on the Phoenix Suns
2015/16 roster. How many of the players fall within one standard deviation of the mean?
Solution: We first must find the mean of the player heights. We do this by adding all the
heights together and dividing by the number of players. (Note: Round the mean to two
decimal places and any squared values to four decimal places.)
$$\bar{x} = \frac{84 + 73 + 78 + 85 + 77 + 75 + 85 + 82 + 75 + 82 + 76 + 78 + 80}{13} = 79.23$$
Now that we have the mean, we can calculate the standard deviation. This is most easily
done by creating a table.
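As a cross-check on the hand computation in Example 20, the following Python sketch applies the standard deviation formula directly to the heights. It uses the unrounded mean, so the last digits may differ slightly from a table built with the rounding rules in the note above; the variable names are illustrative.

```python
import math

# Heights, in inches, from the roster list above.
heights = [84, 73, 78, 85, 77, 75, 85, 82, 75, 82, 76, 78, 80]

n = len(heights)
mean = sum(heights) / n

# One entry per player: the squared difference between height and mean.
squared_diffs = [(x - mean) ** 2 for x in heights]

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(sum(squared_diffs) / (n - 1))

# Players whose height is within one standard deviation of the mean.
within_one = [x for x in heights if mean - s <= x <= mean + s]

print(round(mean, 2), round(s, 2), len(within_one))
```

The list squared_diffs plays the role of the last column of the hand-made table, and its sum is the numerator under the square root.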
Normal Distribution
Many types of data are normally distributed. This is often known as the bell curve. In
normally distributed data, the mean, median and mode are equal.
Any data which is normally distributed can be transformed to a standard normal distribution.
The standard normal distribution has a mean, median, and mode of 0 (µ = 0) and a standard
deviation of 1 (σ = 1). Normally distributed data can be summarized by a rule often referred
to as the 68-95-99.7 rule. This rule says that approximately 68% of the data fall within one
standard deviation of the mean; 95% of the data fall within two standard deviations of the
mean; and 99.7% of the data fall within three standard deviations of the mean.
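The three percentages in the rule can be reproduced from the standard normal distribution itself. The sketch below assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the dictionary name shares is an illustrative choice.

```python
from statistics import NormalDist

standard = NormalDist()   # mean 0 and standard deviation 1 by default

# Share of the data within k standard deviations of the mean:
# area to the left of +k minus area to the left of -k.
shares = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share * 100:.1f}%")
```

The computed shares are about 68.3%, 95.4% and 99.7%, which the rule rounds to 68, 95 and 99.7.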
[Figures: Standard Normal Distribution; 68-95-99.7 Rule]
A z-score is a statistical measure that tells us how many standard deviations above or below
the mean a particular data value is in a normal distribution. A positive z-score indicates
that the data value is above the mean. A negative z-score indicates that the data value is
below the mean. A z-score of 0 indicates that the data value is equal to the mean.
$$z = \frac{x - \mu}{\sigma}$$
Using a z-score, we can use what we know about the standard normal distribution to determine
the area under the standard normal curve that falls below or above a particular data
value. There are tables that will give us this information or allow us to calculate it. We have
provided one such z-table as part of the formula sheet. Our z-table gives the area under
the normal curve to the left of a particular z-value.
Example 21. Human pregnancies are normally distributed with a mean of 266 days and a
standard deviation of 16 days. What is the probability that a pregnancy will last fewer than
270 days?
Solution: Let’s start by looking at a picture for this problem.
[Figure: normal curve for the pregnancy lengths, with 270 days marked]
In order to answer this question, we will need to determine how many standard deviations
away from the mean 270 days is. This is what we get when we calculate the z-score.
$$z = \frac{270 - 266}{16} = 0.25$$
We now need to look this z-score up in the z-table. We go down the left column of the
positive side of the table until we get to 0.2. We then go across the .2 row until we get to
the 0.05 column.
Z-table
z    0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1  0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2  0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3  0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4  0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5  0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6  0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7  0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8  0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9  0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0  0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1  0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2  0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3  0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4  0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5  0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6  0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7  0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8  0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9  0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0  0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1  0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2  0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3  0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4  0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5  0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6  0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7  0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.9  0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0  0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1  0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2  0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3  0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4  0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
This number, 0.5987, is the probability that a pregnancy will last fewer than 270 days.
Example 22. Human pregnancies are normally distributed with a mean of 266 days and a
standard deviation of 16 days. What percent of pregnancies will last more than 280 days?
Solution: Let’s start by looking at a picture for this problem.
[Figure: normal curve with the area above 280 days shaded]
In order to answer this question, we will need to determine how many standard deviations
away from the mean 280 days is. This is what we get when we calculate the z-score.
$$z = \frac{280 - 266}{16} = 0.875$$
We now need to look this z-score up in the z-table. Since our z-table only has z-scores to two
decimal places, we round 0.875 and look up 0.88. We go down the left column of the positive
side of the table until we get to 0.8. We then go across the 0.8 row until we get to the 0.08
column.
This number, 0.8106, is the probability that a pregnancy will last fewer than 280 days. We
are asked about more than 280 days, so we need to subtract this number from 1. Thus
the probability that a pregnancy will last more than 280 days is 1 − 0.8106 = 0.1894. We
change this probability to a percent by multiplying it by 100 to find that 18.94% of
pregnancies will last more than 280 days.
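The table lookups in Examples 21 and 22 can be checked with a short Python sketch. It assumes Python 3.8 or later for statistics.NormalDist, whose cdf method returns the area to the left of a value, the same quantity the z-table lists. Variable names are illustrative.

```python
from statistics import NormalDist

# Pregnancy lengths: mean 266 days, standard deviation 16 days.
pregnancy = NormalDist(mu=266, sigma=16)

# Example 21: probability of fewer than 270 days.
z_270 = (270 - 266) / 16                 # z-score 0.25
p_less_270 = pregnancy.cdf(270)          # area to the left of 270

# Example 22: percent of pregnancies lasting more than 280 days.
z_280 = (280 - 266) / 16                 # z-score 0.875
p_more_280 = 1 - pregnancy.cdf(280)      # area to the right of 280

# The same answer by the table method, with z rounded to 0.88 as in the text.
p_more_280_table = 1 - NormalDist().cdf(0.88)

print(round(p_less_270, 4))
print(round(p_more_280 * 100, 2), round(p_more_280_table * 100, 2))
```

For Example 21 the sketch gives about 0.5987, the table entry for z = 0.25. For Example 22 the unrounded z-score gives roughly 19.1%, slightly above the 18.94% obtained from the table, because the table method first rounds 0.875 to 0.88.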
Example 23. Human pregnancies are normally distributed with a mean of 266 days and a
standard deviation of 16 days. What is the probability that a pregnancy will last between
250 and 275 days?
Solution: Let’s start by looking at a picture for this problem.
[Figure: normal curve for the pregnancy lengths, with 275 days marked]
In order to answer this question, we will need to determine how many standard deviations
away from the mean 250 days is. This is what we get when we calculate the z-score.
$$z = \frac{250 - 266}{16} = -1$$
We go down the left column of the negative side of the table until we get to −1.0. We then
go across the −1.0 row until we get to the 0.00 column. This gives us 0.1587.
We now need to determine how many standard deviations away from the mean 275
days is. This is what we get when we calculate the z-score.
$$z = \frac{275 - 266}{16} = 0.5625$$
We now need to look this z-score up in the z-table. Since our z-table only has z-scores to two
decimal places, we will need to look up 0.56. We go down the left column of the positive
side of the table until we get to 0.5. We then go across the 0.5 row until we get to the 0.06
column. This gives us 0.7123.
In order to find the probability that a pregnancy lasts between 250 and 275 days, we need
to subtract the probability of less than 250 days from the probability of less than 275 days.
This gives us 0.7123 − 0.1587 = 0.5536 as the probability of a pregnancy lasting between
250 and 275 days.
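The subtraction of two left-hand areas in Example 23 can be sketched in Python as well, again assuming Python 3.8 or later for statistics.NormalDist. The names p_exact and p_table are illustrative.

```python
from statistics import NormalDist

pregnancy = NormalDist(mu=266, sigma=16)

# Area to the left of 275 minus area to the left of 250.
p_exact = pregnancy.cdf(275) - pregnancy.cdf(250)

# The table method: z-scores rounded to two decimal places (0.56 and -1.00).
standard = NormalDist()
p_table = standard.cdf(0.56) - standard.cdf(-1.00)

print(round(p_exact, 4), round(p_table, 4))
```

The table-style value agrees with 0.5536; the unrounded computation is about 0.554, and the small gap comes from rounding 0.5625 down to 0.56 before the lookup.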
In the previous three examples, we first found a z-score and then looked it up in the
z-table to get the information that we needed. We can also use a z-table in reverse:
start from a known probability and find the z-score that goes with it.
Example 24. All incoming freshmen at a major university are required to take a mathematics
placement exam. The scores are normally distributed with a mean of 420 and a
standard deviation of 45. If a student scores less than a certain score, he or she will have to
take a review course. Find the cutoff score at which 25% of the students would have to take
the review course.
Solution: For this problem we want to find a test score so that 25% of the students score
below that score. The picture below illustrates what we are looking for.
[Figure: normal curve with mean 420 and the lowest 25% of scores shaded]
The 25% indicates the value within the z-table that we need to look for. We first
change the 25% to the decimal 0.25. This is the number within the table that we need to
find. Once we find the value, we need to determine which z-score corresponds to it.
z     0.09   0.08   0.07   0.06   0.05   0.04   0.03   0.02   0.01   0.00
-1.0  0.1379 0.1401 0.1423 0.1446 0.1469 0.1492 0.1515 0.1539 0.1562 0.1587
-0.9  0.1611 0.1635 0.1660 0.1685 0.1711 0.1736 0.1762 0.1788 0.1814 0.1841
-0.8  0.1867 0.1894 0.1922 0.1949 0.1977 0.2005 0.2033 0.2061 0.2090 0.2119
-0.7  0.2148 0.2177 0.2206 0.2236 0.2266 0.2296 0.2327 0.2358 0.2389 0.2420
-0.6  0.2451 0.2483 0.2514 0.2546 0.2578 0.2611 0.2643 0.2676 0.2709 0.2743
-0.5  0.2776 0.2810 0.2843 0.2877 0.2912 0.2946 0.2981 0.3015 0.3050 0.3085
-0.4  0.3121 0.3156 0.3192 0.3228 0.3264 0.3300 0.3336 0.3372 0.3409 0.3446
-0.3  0.3483 0.3520 0.3557 0.3594 0.3632 0.3669 0.3707 0.3745 0.3783 0.3821
-0.2  0.3859 0.3897 0.3936 0.3974 0.4013 0.4052 0.4090 0.4129 0.4168 0.4207
-0.1  0.4247 0.4286 0.4325 0.4364 0.4404 0.4443 0.4483 0.4522 0.4562 0.4602
-0.0  0.4641 0.4681 0.4721 0.4761 0.4801 0.4840 0.4880 0.4920 0.4960 0.5000
When we look at the partial table above, we see that there is no exact entry within the
table for 0.2500. The two values that are closest to 0.2500 are 0.2483 and 0.2514. We want
to pick the value that is closest. We see that 0.2500 − 0.2483 = 0.0017 and
0.2514 − 0.2500 = 0.0014. Thus, 0.2514 is the closest to 0.2500. We now follow the row to
the left and the column up to find the z-score associated with the probability 0.2514.
This score is z = −0.67.
We now have the z-score. We need to figure out which test score will create this z-score. To
do this, we plug what we know into the formula $z = \frac{x - \mu}{\sigma}$ and solve for x.
$$-0.67 = \frac{x - 420}{45}$$
$$-30.15 = x - 420$$
$$389.85 = x$$
Since scores are whole numbers, we will round this number to 390. Thus a score of 390 on the
exam will be the cutoff score at which 25% of the students would have to take the review
course.
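Working backward from a probability to a score, as in Example 24, corresponds to the inverse of the table lookup. The sketch below assumes Python 3.8 or later, where statistics.NormalDist offers an inv_cdf method for this; the variable names are illustrative.

```python
from statistics import NormalDist

# Placement exam scores: mean 420, standard deviation 45.
scores = NormalDist(mu=420, sigma=45)

# Score below which 25% of the students fall.
cutoff = scores.inv_cdf(0.25)

# The z-score with 25% of the area to its left, for comparison with the table.
z_cut = NormalDist().inv_cdf(0.25)

# The table method from the text: z = -0.67, then solve for x.
cutoff_table = 420 + (-0.67) * 45

print(round(z_cut, 2), round(cutoff, 2), round(cutoff_table, 2))
print(round(cutoff))
```

The unrounded z-score is close to −0.67, and both routes round to the same cutoff of 390.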