
Chapter 3: Displaying and Summarizing Quantitative Data

Histograms

What is the difference between a bar graph and a histogram?

❑ A histogram is for quantitative data.

❑ The bars in a histogram are touching.

❑ Each bin represents an interval of numbers.

❑ The height of each bar represents the frequency or the relative frequency within that
particular interval.

❑ Histograms are good for large data sets!

Example #1: Grouped Frequency Distribution of student weights


Below are the weights (lbs) of the class of 79 students.
The first step is to create a frequency table. Unfortunately, a simple frequency table with one row per weight
would be too big, containing over 100 rows. To simplify the table, we group the weights into class intervals, as
shown in the table below.

Class Interval    Frequency    Relative Frequency
100 to <115            2             0.025
115 to <130           10             0.127
130 to <145           21             0.266
145 to <160           15             0.190
160 to <175           15             0.190
175 to <190            8             0.101
190 to <205            3             0.038
205 to <220            1             0.013
220 to <235            2             0.025
235 to <250            2             0.025
Total                 79             1.000
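The relative frequencies in the table can be checked with a short calculation. The following is a minimal sketch in Python (an assumption; these notes otherwise use Minitab). The variable names are illustrative, and the intervals and frequencies are copied from the table above.

```python
# Class intervals (lower bound, upper bound) and frequencies copied from the table.
intervals = [(100, 115), (115, 130), (130, 145), (145, 160), (160, 175),
             (175, 190), (190, 205), (205, 220), (220, 235), (235, 250)]
frequencies = [2, 10, 21, 15, 15, 8, 3, 1, 2, 2]

total = sum(frequencies)  # total number of students

# Relative frequency = class frequency / total number of observations.
relative = [round(f / total, 3) for f in frequencies]

for (low, high), f, rf in zip(intervals, frequencies, relative):
    print(f"{low} to <{high}: frequency {f:2d}, relative frequency {rf:.3f}")
```

Multiplying a relative frequency by 100 gives the percentage of students in that interval.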

Bin Size: The width of each interval when the entire span of values is sliced into equal-width intervals (here, 15 lbs).

In a histogram, the class frequencies are represented by bars. The height of each bar corresponds to its
class frequency. A histogram of these data is shown below:

The histogram makes it plain that most of the weights are in the middle of the distribution, with fewer
weights in the extremes. You can also see that the distribution is not symmetric; the weights extend to the
right farther than they do on the left. The distribution is therefore said to be skewed. (We'll have more to
say about the shape of distributions in our discussion on summarizing distributions.)

a) Estimate the percentage of students weighing between 130 and 175 pounds.

b) What percentage of students were less than 130 pounds or more than 205?

c) What percentage of students were less than 130 pounds and more than 205?

If we wanted to create a relative frequency histogram, we would replace the frequencies on the y-axis
with percentages.

Stem Plots

*Good for small data sets; a stem plot is a very quick way to see the distribution of the data.

Guidelines For Constructing Regular Stem Plots (Stem and Leaf Display)
1. Put all of your data in order from the smallest to the largest value.
2. Separate observations into stem (all but rightmost digit) and leaf (final digit).
3. Write stems in a vertical column; draw a vertical line to the right of the column.
4. Write the leaves to the right of the appropriate stem, in increasing order (round or truncate data if
necessary)
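The four guidelines can be sketched in code. This is a minimal Python illustration, not the method Minitab uses; the function name `stem_plot` is hypothetical, and it assumes the data are non-negative whole numbers. The sample values are the twelve lightest weights read off the stem plot in Example #1.

```python
from collections import defaultdict

def stem_plot(values):
    """Return the rows of a regular stem plot for non-negative whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):              # guideline 1: order the data
        leaves[v // 10].append(v % 10)    # guideline 2: stem = all but last digit, leaf = last digit
    rows = []
    # Guidelines 3 and 4: one row per stem with leaves in increasing order.
    # Empty stems are kept so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}".rstrip())
    return rows

# The twelve lightest weights (stems 10 to 12) from the class of 79 students.
weights = [103, 113, 117, 120, 121, 121, 124, 124, 124, 125, 125, 125]
print("\n".join(stem_plot(weights)))
```

The printed rows match the first three rows of the stem plot in Example #1.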

Example #1: Graph of the weights of a class of 79 students:

10 3
11 37
12 011444555
13 000000455589
14 000000000555
15 000000555567
16 000005558
17 0000005555
18 0358
19 5
20 00
21 0
22 55
23 79

Back-to-Back Stem Plots (Comparative Stem Plots):


This type of stem plot is used to compare two different data sets. The guidelines are the same as the
regular stem plot except the two data sets share the internal stem. Leaves for one data set will extend to
the left and the other set of leaves for the second data set will extend to the right.

Example: Minitab Output of Weights of College Students from example #1, separated by gender.

Comparative stem plot of weights   N = 79   Leaf Unit = 1.0

Female |    | Male
     3 | 10 |
     3 | 11 | 7
554410 | 12 | 145
 95000 | 13 | 0004558
   000 | 14 | 000000555
 75000 | 15 | 0005556
     0 | 16 | 00005558
     0 | 17 | 000005555
       | 18 | 0358
     5 | 19 |
     0 | 20 | 0
       | 21 | 0
       | 22 | 55
       | 23 | 79

**It is reasonable to separate the weights by gender because women tend to weigh less than men on average.

Split Stem Plots:
Follow the guidelines for a regular stem plot. The difference is you will write each stem value twice and
the leaves will be separated into two different categories (Low and High). The first set of leaves will
consist of digits from 0-4 (L) and the second set of leaves will consist of digits from 5-9 (H).
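As a rough illustration of the split rule, the sketch below writes each stem twice, first with leaves 0-4 (L) and then with leaves 5-9 (H). It is written in Python under the assumption that the data are non-negative whole numbers; the function name `split_stem_plot` is hypothetical. The sample values are the twelve lightest weights from the class of 79 students.

```python
def split_stem_plot(values):
    """Return the rows of a split stem plot: each stem appears twice (L, then H)."""
    ordered = sorted(values)
    rows = []
    for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1):
        in_stem = [v % 10 for v in ordered if v // 10 == stem]
        low = "".join(str(d) for d in in_stem if d <= 4)    # leaves 0-4
        high = "".join(str(d) for d in in_stem if d >= 5)   # leaves 5-9
        rows.append(f"{stem} L | {low}".rstrip())
        rows.append(f"{stem} H | {high}".rstrip())
    return rows

weights = [103, 113, 117, 120, 121, 121, 124, 124, 124, 125, 125, 125]
print("\n".join(split_stem_plot(weights)))
```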

Example #2: The following data represent the mean August temperature (Fahrenheit) in 20 U.S. cities:
64 64 68 69 70 71 71 72 74 75
76 76 76 77 81 82 82 83 85 98
a) Construct a split-stem plot.

b) What percent of the 20 cities have a mean August temperature in the 80s?

Describing your quantitative graphs using:
Shape of the distribution (How your data is spread out),
Center
Spread

Shape

A. Symmetric (mirror image)

Uniform Bell/normal

B. Skewed (positive or right, negative or left)

If the upper tail stretches out much farther than the lower tail = positively skewed, right skewed
If the lower tail stretches out much farther than the upper tail = negatively skewed, left skewed

C. Modes (Uni, Bi, Multi)

Unimodal Bimodal

D. Gap: (Space in between data)

E. Outlier: Unusual observation apart from the group. (Either high or low)

CENTER:

❑ Measuring the Center: Sample Mean

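SAMPLE MEAN ($\bar{X}$): This measure of the average is found by adding up all of the values in the data
set and dividing by the number of values.

Notation: $\bar{X}$ = ________   $X_i$ = ________   $i$ = ________

$n$ = ________   $\sum$ = ________

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}$$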

• The true mean of a population uses the notation μ.


• The sample mean is the most appropriate measure of center for normal or approximately normal
distributions.
• The sample mean is influenced by a skewed distribution and sensitive to outliers, so it is not an
appropriate measure of the center in these cases. It is not resistant to outliers.

❑ Measuring the Center: Median

Median: “The Midpoint” or typical value where half of the values are above the median and half are
below. (50% mark)

Calculation:
1st: Put all the data in order from smallest to largest.
2nd: Find the middle position using the following location formula: $\text{location} = \dfrac{n+1}{2}$

1. For an odd number of values: Single middle value is the median

2. For an even number of values: The average of the two middle values (add and divide by 2)

• The median is not influenced by outliers. It is a resistant measure (not sensitive).


• More appropriate measure of center for skewed distributions or distributions with outliers.
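The difference between the two measures of center can be seen in a small calculation. The sketch below is in Python; the function names and both data sets are hypothetical and chosen only for illustration. The second data set is the first with its largest value replaced by an outlier.

```python
def sample_mean(values):
    """Add up all the values and divide by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data, located at position (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                            # odd n: single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2    # even n: average of the two middle values

# Hypothetical data, for illustration only.
typical = [2, 3, 5, 7, 8]
with_outlier = [2, 3, 5, 7, 80]

print(sample_mean(typical), median(typical))            # mean and median agree
print(sample_mean(with_outlier), median(with_outlier))  # mean is pulled up, median is unchanged
```

The outlier moves the mean from 5 to 19.4 while the median stays at 5, which is what "resistant" means here.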

SPREAD:

❑ Measuring Spread: Range and IQR

1) Range= (Quick & Easy, but not very informative!)

How to determine quartiles:

▪ 25th percentile= first quartile = Q1 = Median of the smaller half of data


▪ 50th percentile = second quartile = Q2 =Median
▪ 75th percentile = third quartile = Q3 = Median of the larger half of data

2) Inter-quartile Range (IQR)= third quartile – first quartile
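The quartile rule above (median of each half) can be sketched as follows. This Python example rests on two assumptions that are not stated in these notes: the range is taken as maximum minus minimum, and when n is odd the median is left out of both halves (some textbooks and software use other conventions, so results may differ slightly). The data are hypothetical.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: when n is odd, the median is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # smaller half of the data
    upper = ordered[n - half:]    # larger half (skips the median when n is odd)
    return median(lower), median(ordered), median(upper)

data = [1, 3, 4, 6, 8, 9, 12]     # hypothetical data, for illustration only
q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)   # assumed definition: largest minus smallest
iqr = q3 - q1
print(q1, q2, q3, data_range, iqr)
```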

Example #1: Below are the numbers of deaths from tornadoes in the U.S. from 1990-2000.

53 39 39 33 69 30 25 67 130 94 40

a) Calculate the median.

b) Calculate the range.

c) Calculate the Interquartile range (IQR)

Example #2: Histogram representing the lengths (minutes) of International Calls


[Histogram: Length of International Calls Using a $5 prepaid calling card. x-axis: Length of Calls (0 to 64, ticks every 16); y-axis: Frequency (0 to 14)]

Find the quartile intervals:

Q1= Q2 (Median)= Q3=

Example #3: U.S. Department of Labor Bureau of Labor Statistics
Unemployment Rates June 2016

State           Rate (%)    State             Rate (%)
South Dakota      2.7       Michigan            4.6
Nebraska          3.0       Florida             4.7
Vermont           3.2       Oklahoma            4.8
North Dakota      3.2       Oregon              4.8
Hawaii            3.3       North Carolina      4.9
Virginia          3.7       Ohio                5.0
Idaho             3.7       Kentucky            5.0
Minnesota         3.8       California          5.4
Iowa              4.0       Wyoming             5.7
Utah              4.0       Arizona             5.8
Montana           4.2       Mississippi         5.9
Maryland          4.3       Nevada              6.4
Texas             4.5       Alaska              6.7

1. Find the median.

2. Find the first and third quartile

3. Find the IQR

❑ Measuring Spread: Variance and Standard Deviation

Variance measures how far the data values are from the mean: it is the average of the squared deviations
from the mean (dividing by n − 1 for a sample). The Standard Deviation is simply the square root of the variance.

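Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$

Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$, where $N$ is the number of values in the population

Notation              Sample       Population
Mean                  $\bar{X}$    $\mu$
Standard deviation    $s$          $\sigma$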
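To make the two formulas concrete, here is a minimal Python sketch that computes both standard deviations from their definitions. The function names and the data are hypothetical; the only difference between the two functions is the divisor (n − 1 for a sample, the full count for a population).

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by the number of values."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, for illustration only

s = math.sqrt(sample_variance(data))          # sample standard deviation
sigma = math.sqrt(population_variance(data))  # population standard deviation
print(round(s, 3), sigma)
```

For these values the mean is 5 and the squared deviations sum to 32, so the population standard deviation is exactly 2, while the sample standard deviation is slightly larger because it divides by 7 instead of 8.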

Properties of the standard deviation and variance:
1. Sensitive to OUTLIERS and SKEWNESS
2. Standard deviation is greater than, or equal to, zero.
3. Values that are very close together have a small standard deviation.
4. Values that are very far apart have a large standard deviation.

Comparing the Mean and the Median:


When we look at distributions, where is the mean and where is the median?

NORMAL “SYMMETRIC” CURVE SKEWED DISTRIBUTIONS

• The mean will be pulled towards the tail of the skewed data.
• For Normal Bell Shaped Distributions, the mean is the more appropriate measure of the center. The
standard deviation is the appropriate measure of spread.
• For Skewed Distributions, the median is the more appropriate measure of the center. The IQR is the
appropriate measure of spread.

Example: The back-to-back stemplot below gives the bowling scores of male and female participants in
the finals of a national tournament.

 Males |    | Females
       | 23 | 89
     0 | 24 | 028
     3 | 25 | 0267
    80 | 26 | 378
 88421 | 27 | 25
997632 | 28 | 6

Identify the best measure of center and spread for each gender, based on the distributions. Then calculate
these values and justify your choice of this measure in one sentence:

Males Choice of center_____________ calculated value_____________

Choice of spread_____________ calculated value_____________

Justification of choice_______________________________________________________

Females Choice of center_____________ calculated value_____________

Choice of spread_____________ calculated value_____________

Justification of choice_______________________________________________________

5-NUMBER SUMMARY
The 5-number summary of a distribution consists of the following:

Minimum Quartile 1 (Q1) Median (Q2) Quartile 3 (Q3) Maximum
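The five values can be collected in one step. The Python sketch below uses a hypothetical function name and hypothetical data, and it assumes the median-of-halves quartile rule with the median left out of both halves when n is odd; other quartile conventions can give slightly different Q1 and Q3.

```python
def _median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return (
        ordered[0],                    # minimum
        _median(ordered[:half]),       # Q1: median of the smaller half
        _median(ordered),              # median (Q2)
        _median(ordered[n - half:]),   # Q3: median of the larger half
        ordered[-1],                   # maximum
    )

data = [4, 1, 12, 6, 9, 3, 8]          # hypothetical data, for illustration only
print(five_number_summary(data))
```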

Example: The back-to-back stemplot below gives the bowling scores of male and female participants in
the finals of a national tournament.

 Males |    | Females
       | 23 | 89
     0 | 24 | 028
     3 | 25 | 0267
    80 | 26 | 378
 88421 | 27 | 25
997632 | 28 | 6

Calculate (and clearly label) the five-number summary for the distribution of male scores

BOXPLOTS: We use the 5-NUMBER SUMMARY to construct box plots. Boxplots can be useful for
comparing several groups of data.

Constructing a boxplot if there are no outliers in the data set:


A box is drawn connecting the first and third quartiles, with a line across at the median and then lines are
drawn out to the smallest and largest observations.

Side-by-side skeletal boxplots comparing the distributions of earnings for two levels of education:

Criterion for Outliers


We can flag mild outliers using the 1.5(IQR) Criterion and extreme outliers using the 3(IQR) Criterion:

Step 1: Determine Quartile 1 and Quartile 3 from the data set.

Step 2: Compute the Interquartile Range (IQR) IQR= Q3-Q1

Step 3: Determine the Fences. Fences serve as cutoff points for determining outliers

               Mild Outliers (open circle)    Extreme Outliers (asterisk)
Lower Fence    Q1 - 1.5(IQR)                  Q1 - 3(IQR)
Upper Fence    Q3 + 1.5(IQR)                  Q3 + 3(IQR)

If you identify an outlier in a data set and are asked to graph the data, you must create a modified box
plot! In a modified box plot drawn by hand, you will flag a mild outlier with an open circle and an
extreme outlier with an asterisk. You will extend the lines from your quartiles to the smallest and
largest values that are not outliers. Note: In Minitab, the box plot “flags” all outliers with an asterisk.
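The three steps of the outlier criterion can be sketched in code. This Python example uses hypothetical function names and hypothetical quartiles (Q1 = 13, Q3 = 18); it treats a value as an outlier only when it falls strictly outside a fence, which is an assumption about how ties on a fence are handled.

```python
def fences(q1, q3):
    """Return the mild and extreme fences as (lower, upper) pairs."""
    iqr = q3 - q1                                         # Step 2: IQR = Q3 - Q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),         # 1.5(IQR) criterion
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),          # 3(IQR) criterion
    }

def classify(value, q1, q3):
    """Label a value using the fences (Step 3)."""
    f = fences(q1, q3)
    low_extreme, high_extreme = f["extreme"]
    low_mild, high_mild = f["mild"]
    if value < low_extreme or value > high_extreme:
        return "extreme outlier"      # flagged with an asterisk
    if value < low_mild or value > high_mild:
        return "mild outlier"         # flagged with an open circle
    return "not an outlier"

# Hypothetical quartiles, for illustration only (Step 1 is assumed done).
q1, q3 = 13, 18
print(fences(q1, q3))
for value in (20, 30, 50):
    print(value, classify(value, q1, q3))
```

With these quartiles the IQR is 5, so the mild fences are 5.5 and 25.5 and the extreme fences are -2 and 33.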

Example #1: Oxygen capacity
To better understand the effects of exercise and aging on various circulatory functions, a study was done
on 16 middle-aged male runners. The following data set gives oxygen capacity values (ml/kg per
minute) measured while the participants pedaled at a specified rate on a bicycle.
14 16 18 19 20 21 21 21 22 22 23 23 24 26 35 36
Are there any outliers? Check the mild and extreme criteria and show all work! Then draw a boxplot to
represent the distribution of your data.

Step 1: Determine Quartile 1 and Quartile 3 from the data set.

Step 2: Compute the Interquartile Range

Step 3: Determine the Fences. Fences serve as cutoff points for determining outliers

Mild Outliers (open circle)

Lower Fence=

Upper Fence=

Extreme Outliers (asterisk)

Lower Fence=

Upper Fence=

____________________________________________________________________
10 15 20 25 30 35 40

Example #2:

The following back-to-back stemplot represents the amount a sample of students spent on their last
haircut. Because there are significant differences between males and females, the data was separated by
gender.
Male spending Female spending
4 3 2 0 | 1 |
5 3 1 0 0 | 2 | 0 5 5
5 4 3 2 | 3 | 5 7 7 9
2 | 4 | 3 5 6
| 5 | 0 3
| 6 | 0
| 7 | 0
| 8 | 5
| 9 |
| 10 | 0
I. Calculate (and label) the five-number summary for the amount spent by females.

II. Construct, and clearly label, a modified boxplot for the amount spent by females.
a) Check for outliers, show all work:

IQR = _________________

Mild outlier check: Lower Fence: _________________

Upper Fence: _________________

Extreme outlier check: Lower Fence: _________________

Upper Fence: _________________

b) Modified Boxplot:

_____________________________________________________________________________
20 30 40 50 60 70 80 90 100

Comparing Groups

Example: Wine prices (textbook page 90 problem 53)


The boxplots display case prices (in dollars) of varieties of wines produced by vineyards along three of
the Finger Lakes in upstate New York.
[Figure: Boxplot of CasePrice vs Location. y-axis: CasePrice (50 to 150); x-axis: Location (Cayuga, Keuka, Seneca)]

a) Which lake region produces the most expensive wine?

b) Which lake region produces the cheapest wine?

c) In which region are the wines generally more expensive?

Outliers
What should be done with outliers? First try to understand them in the context of the data. A histogram
can show how the outlier fits with the rest of the data:

• Is there a large gap between the outlier and the rest of the data?

• Is the outlier a value at the end of a stretched out tail?

• Could the outlier be an error?

What you should NOT do:

• Leave an outlier in place without comment, and proceed as if nothing were unusual.

• Drop an outlier without comment just because it is unusual.

Example: Don’t automatically discard outliers! In 1985 three researchers (Farman, Gardiner, and
Shanklin) were puzzled by some data gathered by the British Antarctic Survey showing that the ozone levels
for Antarctica had dropped 10% below normal January levels. The puzzle was why the Nimbus 7 satellite,
which had instruments aboard for recording ozone levels, hadn't recorded similarly low ozone
concentrations. When they examined the data from the satellite it didn't take long to realize that the
satellite was in fact recording these low concentration levels and had been doing so for years. But
because the ozone concentrations recorded by the satellite were so low they were being treated as outliers
by a computer program and discarded! The Nimbus 7 satellite had in fact been gathering evidence of low
ozone levels since 1976. The damage to our atmosphere caused by chlorofluorocarbons went undetected
and untreated for up to nine years because outliers were discarded without being examined.

Standardizing
When comparing scores from different variables it is helpful to standardize the values, to determine how
many standard deviations the value is away from the mean.

$$z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$$

So, $Z = \dfrac{X - \bar{X}}{s}$ or $Z = \dfrac{X - \mu}{\sigma}$

where μ (population mean) and σ (population standard deviation) are given.

The z-score tells us how many standard deviations an observation is above or below the mean.

Example:
• A Z-score of 1 means the observation is 1 standard deviation larger than the mean.
• A Z-Score of –2 means the observation is 2 standard deviations smaller than the mean.
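The standardizing formula is a one-line calculation. In the Python sketch below, the function name `z_score` and the numbers (mean 70, standard deviation 10) are hypothetical and chosen so that the results match the two interpretations above.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / sd

# Hypothetical distribution with mean 70 and standard deviation 10.
print(z_score(80, 70, 10))   # 1 standard deviation above the mean
print(z_score(50, 70, 10))   # 2 standard deviations below the mean
```

Because z-scores are unit-free, values from different variables can be compared by comparing their z-scores.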

Example: Dan is working diligently to get through his general education requirements at a local
community college. As a result, he is enrolled in a statistics course as well as a history course. On his
first round of exams, Dan got a 79% on his statistics exam and an 84% on his history exam. The class
results for his statistics exam were normally distributed with a mean of 64% and a standard deviation of
6.23%. The results of his history exam were also normally distributed with a mean of 78% and standard
deviation of 3.27%. Which of the following is true about Dan’s performance?
A. Dan performed better on his statistics exam because the z-score of his statistics exam was larger
than the z-score for his history exam.
B. Dan performed better on his history exam because the z-score of his history exam was closer to 0
than the z-score for his statistics exam.
C. Dan performed better on his history exam because the z-score of his history exam was larger than
the z-score for his statistics exam.
D. Dan performed better on his statistics exam because the z-score of his statistics exam was closer
to 0 than the z-score for his history exam.

Timeplots
A graph of data collected over a period of time (measured by seconds, days, months, years, etc.). Time
goes on the x-axis. This graph is useful when looking for trends or patterns over time.

Total Revenues and Outlays in CBO's Baseline and Under the President's Budget (Percentage of
GDP)

http://www.cbo.gov/ftpdocs/

Practice Problems:
1. A. The median score of a student’s seven quizzes is 80 points. The instructor allows each student to
drop the lowest score, which in this case is 62 points. The median score of the remaining six is:
a) 71 points.
b) 80 points.
c) 83 points.
d) Cannot be determined from the information given.

B. Using the information above: if 80 was the mean score, could the mean of the remaining 6 be found?

2. A severe drought affected several western states for 3 years. A Christmas tree farmer is worried about the
drought’s effect on the size of his trees. To decide whether the growth of the trees has been retarded, the
farmer takes a sample of the heights of 15 trees and obtains the following results (in inches):
60 57 62 69 46 54 64 60 58 75 51 49 67 65 44

a. Create a split stem-and-leaf plot based on the above data.

b. 25% of the heights are above what value?

Answer

c. Calculate the mean and the standard deviation for the data above:

Mean___________________ Standard Deviation _________________

3. The distribution of payoffs at a racetrack has a mean of $5.85 and a median of $3.40. We can conclude
the following:
a) More than half the winners get less than a $3.40 payoff
b) More than half the winners get less than a $5.85 payoff
c) The distribution of payoffs is symmetric
d) The distribution of payoffs is left-skewed

4. About how many music CDs do you own? Responses to this question for 24 students are shown in the
stem-and-leaf plot below:

Stem-and-leaf of CDs   N = 24
Leaf Unit = 10

0 | 001222233
0 | 55569
1 | 002
1 | 5
2 | 002
2 | 5
3 | 0
3 |
4 |
4 | 5

a. Describe the shape of the dataset.

b. The mean of this distribution is


a) Higher than the median c) Equal to the median
b) Lower than the median d) Could be higher or lower than the median

c. Calculate the mean and median for this distribution.

d. Would the mean or the median be the best measure of center for this dataset? Explain your choice.

5. An athlete completed an 800-m race in 150 seconds. The distribution of 800-m race times followed a
bell curve, with mean 165 seconds and standard deviation 7 seconds. The same athlete also competed in a swim,
finishing in 12.25 minutes. The distribution of swim times also followed a bell curve, with mean 15
minutes and standard deviation 1.5 minutes. In which event does the athlete have a better standing relative
to the other competitors in the event?

6. A meteorological station in Hawaii has gathered the following average daily wind speeds over 43 days.
In the following histogram, these average speeds (miles/h) are displayed:

a) What percentage of speeds was above 60 miles/h or below 20 miles/h?

A) 7% B) 38% C) 14% D) 26% E) 16%

Answer:

b) Describe in detail the shape, center (find the median interval), and spread of the distribution.

Shape:

Center (Median interval):

Spread:

▪ For a histogram, report the range from the beginning interval to the highest value interval (that
does not contain an outlier).
