Quantitative Variables

Author(s)
David M. Lane
Prerequisites
Variables
1. Stem and Leaf Displays
2. Histograms
3. Frequency Polygons
4. Box Plots
5. Box Plot Demonstration
6. Bar Charts
7. Line Graphs
8. Dot Plots
As discussed in the section on variables in Chapter 1, quantitative
variables are variables measured on a numeric scale. Height,
weight, response time, subjective rating of pain, temperature, and
score on an exam are all examples of quantitative variables.
Quantitative variables are distinguished from categorical
(sometimes called qualitative) variables such as favorite color,
religion, city of birth, and favorite sport in which there is no
ordering or measuring involved.
There are many types of graphs that can be used to portray
distributions of quantitative variables. The upcoming sections cover
the following types of graphs: (1) stem and leaf displays, (2)
histograms, (3) frequency polygons, (4) box plots, (5) bar charts,
(6) line graphs, (7) scatter plots (discussed in a different chapter),
and (8) dot plots. Some graph types such as stem and leaf displays
are best-suited for small to moderate amounts of data, whereas
others such as histograms are best-suited for large amounts of
data. Graph types such as box plots are good at depicting
differences between distributions. Scatter plots are used to show
the relationship between two variables.

Stem and Leaf Displays


Author(s)
David M. Lane
Prerequisites
Distributions
Learning Objectives
1. Create and interpret basic stem and leaf displays
2. Create and interpret back-to-back stem and leaf displays
3. Judge whether a stem and leaf display is appropriate for a
given data set
A stem and leaf display is a graphical method of displaying data. It
is particularly useful when your data are not too numerous. In this
section, we will explain how to construct and interpret this kind of
graph.
As usual, an example will get us started. Consider Table 1 that
shows the number of touchdown passes (TD passes) thrown by
each of the 31 teams in the National Football League in the 2000
season.
Table 1. Number of touchdown passes.
37,33,33,32,29,28,28,23,22,
22,22,21,21,21,20,20,19,19,
18,18,18,18,16,15,14,14,14,
12,12,9,6

A stem and leaf display of the data is shown in Figure 1. The left
portion of Figure 1 contains the stems. They are the numbers 3, 2,
1, and 0, arranged as a column to the left of the bars. Think of
these numbers as 10s digits. A stem of 3, for example, can be used
to represent the 10s digit in any of the numbers from 30 to 39. The
numbers to the right of the bar are leaves, and they represent the
1s digits. Every leaf in the graph therefore stands for the result of
adding the leaf to 10 times its stem.
3|2337
2|001112223889
1|2244456888899
0|69

Figure 1. Stem and leaf display of the number of touchdown passes.


To make this clear, let us examine Figure 1 more closely. In the top
row, the four leaves to the right of stem 3 are 2, 3, 3, and 7.
Combined with the stem, these leaves represent the numbers 32,
33, 33, and 37, which are the numbers of TD passes for the first
four teams in Table 1. The next row has a stem of 2 and 12 leaves.
Together, they represent 12 data points, namely, two occurrences of
20 TD passes, three occurrences of 21 TD passes, three occurrences
of 22 TD passes, one occurrence of 23 TD passes, two occurrences
of 28 TD passes, and one occurrence of 29 TD passes. We leave it
to you to figure out what the third row represents. The fourth row
has a stem of 0 and two leaves. It stands for the last two entries in
Table 1, namely 9 TD passes and 6 TD passes. (The latter two
numbers may be thought of as 09 and 06.)
One purpose of a stem and leaf display is to clarify the shape of
the distribution. You can see many facts about TD passes more
easily in Figure 1 than in Table 1. For example, by looking at the
stems and the shape of the plot, you can tell that most of the teams
had between 10 and 29 passing TDs, with a few having more and a
few having less. The precise numbers of TD passes can be
determined by examining the leaves.
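The construction just described can be sketched in a few lines of Python. This is only a minimal illustration: the function name stem_and_leaf is our own, and the sketch assumes non-negative whole numbers below 100, as in Table 1.

```python
from collections import defaultdict

# Number of touchdown passes from Table 1.
td_2000 = [37, 33, 33, 32, 29, 28, 28, 23, 22,
           22, 22, 21, 21, 21, 20, 20, 19, 19,
           18, 18, 18, 18, 16, 15, 14, 14, 14,
           12, 12, 9, 6]

def stem_and_leaf(values):
    """Return display rows, highest stem first (hypothetical helper)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)      # 10s digit and 1s digit
        leaves[stem].append(leaf)
    top, bottom = max(leaves), min(leaves)
    # Walk every stem in the range so that an empty stem still gets a row.
    return [f"{stem}|{''.join(str(d) for d in leaves[stem])}"
            for stem in range(top, bottom - 1, -1)]

rows = stem_and_leaf(td_2000)
print("\n".join(rows))
```

For the data in Table 1, the printed rows should reproduce Figure 1.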
We can make our figure even more revealing by splitting each
stem into two parts. Figure 2 shows how to do this. The top row is
reserved for numbers from 35 to 39 and holds only the 37 TD
passes made by the first team in Table 1. The second row is
reserved for the numbers from 30 to 34 and holds the 32, 33, and
33 TD passes made by the next three teams in the table. You can
see for yourself what the other rows represent.
3|7
3|233
2|889
2|001112223
1|56888899
1|22444
0|69

Figure 2. Stem and leaf display with the stems split in two.
Figure 2 is more revealing than Figure 1 because the latter
figure lumps too many values into a single row. Whether you should
split stems in a display depends on the exact form of your data. If
rows get too long with single stems, you might try splitting them
into two or more parts.
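Splitting stems can be sketched in the same way. In this illustration (the function name split_stem_and_leaf is an assumption of ours), each stem gets an upper row for leaves 5 to 9 and a lower row for leaves 0 to 4, and empty rows are dropped only at the two ends of the display.

```python
# Number of touchdown passes from Table 1.
td_2000 = [37, 33, 33, 32, 29, 28, 28, 23, 22,
           22, 22, 21, 21, 21, 20, 20, 19, 19,
           18, 18, 18, 18, 16, 15, 14, 14, 14,
           12, 12, 9, 6]

def split_stem_and_leaf(values):
    """Rows with each stem split in two (hypothetical helper)."""
    top, bottom = max(values) // 10, min(values) // 10
    rows = []
    for stem in range(top, bottom - 1, -1):
        for low, high in ((5, 9), (0, 4)):      # upper half first
            leaves = sorted(v % 10 for v in values
                            if v // 10 == stem and low <= v % 10 <= high)
            rows.append((stem, leaves))
    # Trim empty half-stems at either end, but keep gaps inside the display.
    while rows and not rows[0][1]:
        rows.pop(0)
    while rows and not rows[-1][1]:
        rows.pop()
    return [f"{stem}|{''.join(map(str, leaves))}" for stem, leaves in rows]

split_rows = split_stem_and_leaf(td_2000)
print("\n".join(split_rows))
```

For Table 1 this should give the seven rows of Figure 2.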
There is a variation of stem and leaf displays that is useful for
comparing distributions. The two distributions are placed back to
back along a common column of stems. The result is a back-to-back
stem and leaf graph. Figure 3 shows such a graph. It
compares the numbers of TD passes in the 1998 and 2000 seasons.
The stems are in the middle, the leaves to the left are for the 1998
data, and the leaves to the right are for the 2000 data. For
example, the second-to-last row shows that in 1998 there were
teams with 11, 12, and 13 TD passes, and in 2000 there were two
teams with 12 and three teams with 14 TD passes.
       11|4|
         |3|7
      332|3|233
     8865|2|889
 44331110|2|001112223
987776665|1|56888899
      321|1|22444
        7|0|69

Figure 3. Back-to-back stem and leaf display. The left side shows the
1998 TD data and the right side shows the 2000 TD data.
Figure 3 helps us see that the two seasons were similar, but that
only in 1998 did any teams throw more than 40 TD passes.
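A back-to-back display can be sketched as follows. The 1998 values are read off the left side of Figure 3, and the helper names half_rows and back_to_back are our own; treat this as an illustration under those assumptions rather than a standard routine.

```python
# 1998 values as read from the left side of Figure 3; 2000 values from Table 1.
td_1998 = [41, 41, 33, 33, 32, 28, 28, 26, 25, 24, 24, 23, 23, 21, 21,
           21, 20, 19, 18, 17, 17, 17, 16, 16, 16, 15, 13, 12, 11, 7]
td_2000 = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
           19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

def half_rows(values, top_stem):
    """Map (stem, half) to sorted leaves; half 1 holds 5-9, half 0 holds 0-4."""
    rows = {(s, h): [] for s in range(top_stem, -1, -1) for h in (1, 0)}
    for v in sorted(values):
        rows[(v // 10, (v % 10) // 5)].append(v % 10)
    return rows

def back_to_back(left, right):
    """Return display lines with the stems in the middle column."""
    top = max(left + right) // 10
    l, r = half_rows(left, top), half_rows(right, top)
    keys = list(l)                                   # highest row first
    used = [i for i, k in enumerate(keys) if l[k] or r[k]]
    keys = keys[used[0]:used[-1] + 1]                # trim empty end rows
    width = max(len(l[k]) for k in keys)
    lines = []
    for k in keys:
        # Left-hand leaves are read outward from the stem, so reverse them.
        left_leaves = "".join(map(str, reversed(l[k])))
        right_leaves = "".join(map(str, r[k]))
        lines.append(f"{left_leaves:>{width}}|{k[0]}|{right_leaves}")
    return lines

lines = back_to_back(td_1998, td_2000)
print("\n".join(lines))
```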
There are two things about the football data that make them
easy to graph with stems and leaves. First, the data are limited to
whole numbers that can be represented with a one-digit stem and a
one-digit leaf. Second, all the numbers are positive. If the data
include numbers with three or more digits, or contain decimals, they
can be rounded to two-digit accuracy. Negative values are also
easily handled. Let us look at another example.
Table 2 shows data from the case study Weapons and
Aggression. Each value is the mean difference over a series of trials
between the times it took an experimental subject to name
aggressive words (like punch) under two conditions. In one
condition, the words were preceded by a non-weapon word such as
"bug." In the second condition, the same words were preceded by a
weapon word such as "gun" or "knife." The issue addressed by the
experiment was whether a preceding weapon word would speed up
(or prime) pronunciation of the aggressive word compared to a
non-weapon priming word. A positive difference implies greater priming
of the aggressive word by the weapon word. Negative differences
imply that the priming by the weapon word was less than for a
neutral word.
Table 2. The effects of priming (thousandths of a second).
43.2,42.9,35.6,25.6,25.4,23.6,
20.5,19.9,14.4,12.7,11.3,10.2,
10.0,9.1,7.5,5.4,4.7,3.8,2.1,1.2,
-0.2,-6.3,-6.7,-8.8,-10.4,-10.5,
-14.9,-14.9,-15.0,-18.5,-27.4

You see that the numbers range from 43.2 to -27.4. The first value
indicates that one subject was 43.2 milliseconds faster pronouncing
aggressive words when they were preceded by weapon words than
when preceded by neutral words. The value -27.4 indicates that
another subject was 27.4 milliseconds slower pronouncing
aggressive words when they were preceded by weapon words.
The data are displayed with stems and leaves in Figure 4. Since
stem and leaf displays can only portray two whole digits (one for the
stem and one for the leaf), the numbers are first rounded. Thus, the
value 43.2 is rounded to 43 and represented with a stem of 4 and a
leaf of 3. Similarly, 42.9 is rounded to 43. To represent negative
numbers, we simply use negative stems. For example, the bottom
row of the figure represents the number -27. The second-to-last
row represents the numbers -10, -10, -15, etc. Once again, we
have rounded the original values from Table 2.
4|33
3|6
2|00456
1|00134
0|1245589
-0|0679
-1|005559
-2|7

Figure 4. Stem and leaf display with negative numbers and rounding.
Observe that the figure contains a row headed by "0" and another
headed by "-0." The stem of 0 is for numbers between 0 and 9,
whereas the stem of -0 is for numbers between 0 and -9. For
example, the fifth row of the figure holds the numbers 1, 2, 4, 5, 5,
8, 9 and the sixth row holds 0, -6, -7, and -9. Values that are
exactly 0 before rounding should be split as evenly as possible
between the "0" and "-0" rows. In Table 2, none of the values are 0
before rounding. The "0" that appears in the "-0" row comes from
the original value of -0.2 in the table.
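The handling of negative stems, including the "-0" row, can be sketched as below. The function name signed_stem_and_leaf is our own, and only a subset of Table 2 is used, chosen for illustration. The sketch prints only stems that have leaves and places a value of exactly 0 on the "0" stem rather than splitting such values between the two rows.

```python
# A subset of the values in Table 2, chosen for illustration.
effects = [43.2, 42.9, 35.6, 25.4, 12.7, 9.1, 1.2,
           -0.2, -6.3, -8.8, -14.9, -27.4]

def signed_stem_and_leaf(values):
    """Rows for values rounded to whole numbers, with separate 0 and -0 stems."""
    rows = {}
    for v in values:
        stem, leaf = divmod(abs(round(v)), 10)
        # The sign of the original value picks the stem, so -0.2 lands on "-0".
        label = f"-{stem}" if v < 0 else str(stem)
        rows.setdefault(label, []).append(leaf)

    def position(label):
        # Larger positions print first: ..., 1, 0, -0, -1, ...
        return -int(label[1:]) - 1 if label.startswith("-") else int(label)

    ordered = sorted(rows, key=position, reverse=True)
    return [f"{label}|{''.join(map(str, sorted(rows[label])))}"
            for label in ordered]

signed_rows = signed_stem_and_leaf(effects)
print("\n".join(signed_rows))
```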
Although stem and leaf displays are unwieldy for large data sets,
they are often useful for data sets with up to 200 observations.
Figure 5 portrays the distribution of populations of 185 US cities in
1998. To be included, a city had to have between 100,000 and
500,000 residents.

Figure 5. Stem and leaf display of populations of 185 US cities with
populations between 100,000 and 500,000 in 1998.
Since a stem and leaf plot shows only two-place accuracy, we had to
round the numbers to the nearest 10,000. For example, the largest
number (493,559) was rounded to 490,000 and then plotted with a
stem of 4 and a leaf of 9. The fourth highest number (463,201) was
rounded to 460,000 and plotted with a stem of 4 and a leaf of 6.
Thus, the stems represent units of 100,000 and the leaves
represent units of 10,000. Notice that each stem value is split into
five parts: 0-1, 2-3, 4-5, 6-7, and 8-9.
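The rounding step can be made concrete with a small sketch. The function name stem_leaf_part is our own; it assumes populations below 1,000,000 and uses Python's built-in round, which is only one possible rounding convention.

```python
def stem_leaf_part(population):
    """Return (stem, leaf, part) for one city (hypothetical helper)."""
    rounded = round(population, -4)          # nearest 10,000
    stem = rounded // 100_000                # units of 100,000
    leaf = (rounded % 100_000) // 10_000     # units of 10,000
    part = leaf // 2                         # 0: leaves 0-1, 1: 2-3, ..., 4: 8-9
    return stem, leaf, part

print(stem_leaf_part(493_559))   # largest city in the example
print(stem_leaf_part(463_201))   # fourth highest city in the example
```

Both cities fall on stem 4, with leaves 9 and 6 respectively, as described above.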
Whether your data can be suitably represented by a stem and
leaf graph depends on whether they can be rounded without loss of
important information. Also, their extreme values must fit into two
successive digits, as the data in Figure 5 fit into the 10,000 and
100,000 places (for leaves and stems, respectively). Deciding what
kind of graph is best suited to displaying your data thus requires
good judgment. Statistics is not just recipes!

Question 1 out of 7.
True or false: A stem and leaf display is a good method of displaying
large amounts of data.
Answer: False. Stem and leaf displays can be unwieldy with large
amounts of data because every single data value is shown in the figure.
Histograms
Author(s)
David M. Lane
Prerequisites
Distributions, Graphing Qualitative Data
Learning Objectives
1. Create a grouped frequency distribution
2. Create a histogram based on a grouped frequency distribution
3. Determine an appropriate bin width
A histogram is a graphical method for displaying the shape of a
distribution. It is particularly useful when there are a large number
of observations. We begin with an example consisting of the scores
of 642 students on a psychology test. The test consists of 197
items, each graded as "correct" or "incorrect." The students' scores
ranged from 46 to 167.

The first step is to create a frequency table. Unfortunately, a
simple frequency table would be too big, containing over 100 rows.
To simplify the table, we group scores together as shown in Table 1.
Table 1. Grouped Frequency Distribution of Psychology Test Scores
Interval's     Interval's     Class
Lower Limit    Upper Limit    Frequency
39.5           49.5           3
49.5           59.5           10
59.5           69.5           53
69.5           79.5           107
79.5           89.5           147
89.5           99.5           130
99.5           109.5          78
109.5          119.5          59
119.5          129.5          36
129.5          139.5          11
139.5          149.5          6
149.5          159.5          1
159.5          169.5          1

To create this table, the range of scores was broken into intervals,
called class intervals. The first interval is from 39.5 to 49.5, the
second from 49.5 to 59.5, etc. Next, the number of scores falling
into each interval was counted to obtain the class frequencies.
There are three scores in the first interval, 10 in the second, etc.
Class intervals of width 10 provide enough detail about the
distribution to be revealing without making the graph too "choppy."
More information on choosing the widths of class intervals is
presented later in this section. Placing the limits of the class
intervals midway between two numbers (e.g., 49.5) ensures that
every score will fall in an interval rather than on the boundary
between intervals.
In a histogram, the class frequencies are represented by bars.
The height of each bar corresponds to its class frequency. A
histogram of these data is shown in Figure 1.

Figure 1. Histogram of scores on a psychology test.


The histogram makes it plain that most of the scores are in the
middle of the distribution, with fewer scores in the extremes. You
can also see that the distribution is not symmetric: the scores
extend to the right farther than they do to the left. The distribution
is therefore said to be skewed. (We'll have more to say about
shapes of distributions in the chapter "Summarizing Distributions.")
In our example, the observations are whole numbers.
Histograms can also be used when the scores are measured on a
more continuous scale such as the length of time (in milliseconds)
required to perform a task. In this case, there is no need to worry
about fence-sitters since they are improbable. (It would be quite a
coincidence for a task to require exactly 7 seconds, measured to the
nearest thousandth of a second.) We are therefore free to choose
whole numbers as boundaries for our class intervals, for example,
4000, 5000, etc. The class frequency is then the number of
observations that are greater than or equal to the lower bound, and
strictly less than the upper bound. For example, one interval might
hold times from 4000 to 4999 milliseconds. Using whole numbers as
boundaries avoids a cluttered appearance, and is the practice of
many computer programs that create histograms. Note also that
some computer programs label the middle of each interval rather
than the end points.
Histograms can be based on relative frequencies instead of
actual frequencies. Histograms based on relative frequencies show
the proportion of scores in each interval rather than the number of
scores. In this case, the Y-axis runs from 0 to 1 (or somewhere in
between if there are no extreme proportions). You can change a
histogram based on frequencies to one based on relative
frequencies by (a) dividing each class frequency by the total
number of observations, and then (b) plotting the quotients on the
Y-axis (labeled as proportion).
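As a brief illustration of steps (a) and (b), the sketch below converts the class frequencies of the psychology test example into proportions. The frequencies are those of the 13 class intervals in Table 1 (they sum to the 642 students); the variable names are our own.

```python
# Class frequencies for the intervals 39.5-49.5 through 159.5-169.5.
frequencies = [3, 10, 53, 107, 147, 130, 78, 59, 36, 11, 6, 1, 1]
lower_limits = [39.5 + 10 * i for i in range(len(frequencies))]

total = sum(frequencies)                          # number of observations
relative = [f / total for f in frequencies]       # step (a): divide by the total

# Step (b): these proportions are what would be plotted on the Y-axis.
for low, p in zip(lower_limits, relative):
    print(f"{low:6.1f} to {low + 10:6.1f}: {p:.3f}")
```

The proportions sum to 1, and the tallest bar (79.5 to 89.5) holds about 23% of the scores.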
There is more to be said about the widths of the class intervals,
sometimes called bin widths. Your choice of bin width determines
the number of class intervals. This decision, along with the choice of
starting point for the first interval, affects the shape of the
histogram. There are some "rules of thumb" that can help you
choose an appropriate width. (But keep in mind that none of the
rules is perfect.) Sturges' rule is to set the number of intervals as
close as possible to 1 + Log2(N), where Log2(N) is the base 2 log of
the number of observations. The formula can also be written as 1 +
3.3 Log10(N), where Log10(N) is the log base 10 of the number of
observations. According to Sturges' rule, 1000 observations would
be graphed with 11 class intervals since 10 is the closest integer to
Log2(1000). We prefer the Rice rule, which is to set the number of
intervals to twice the cube root of the number of observations. In
the case of 1000 observations, the Rice rule yields 20 intervals
instead of the 11 recommended by Sturges' rule. For the psychology
test example used above, Sturges' rule recommends 10 intervals
while the Rice rule recommends 17. In the end, we compromised
and chose 13 intervals for Figure 1 to create a histogram that
seemed clearest. The best advice is to experiment with different
choices of width, and to choose a histogram according to how well it
communicates the shape of the distribution.
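The two rules of thumb are easy to compute. The sketch below is a minimal illustration; the function names sturges and rice are our own, and rounding to the nearest whole number is our reading of "as close as possible."

```python
import math

def sturges(n):
    """Sturges' rule: 1 + log2(n), rounded to the nearest whole number."""
    return round(1 + math.log2(n))

def rice(n):
    """Rice rule: twice the cube root of n, rounded to the nearest whole number."""
    return round(2 * n ** (1 / 3))

# 1000 observations, and the 642 psychology test scores.
for n in (1000, 642):
    print(n, sturges(n), rice(n))
```

For 1000 observations the rules give 11 and 20 intervals, and for the 642 test scores they give 10 and 17, in agreement with the figures quoted above.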
To provide experience in constructing histograms, we have
developed an interactive demonstration. The demonstration reveals
the consequences of different choices of bin width and of lower
boundary for the first interval.
Frequency Polygons
Author(s)
David M. Lane
Prerequisites
Histograms
Learning Objectives
1. Create and interpret frequency polygons
2. Create and interpret cumulative frequency polygons
3. Create and interpret overlaid frequency polygons
Frequency polygons are a graphical device for understanding the
shapes of distributions. They serve the same purpose as
histograms, but are especially helpful for comparing sets of data.
Frequency polygons are also a good choice for
displaying cumulative frequency distributions.
To create a frequency polygon, start just as for histograms, by
choosing a class interval. Then draw an X-axis representing the
values of the scores in your data. Mark the middle of each class
interval with a tick mark, and label it with the middle value
represented by the class. Draw the Y-axis to indicate the frequency
of each class. Place a point in the middle of each class interval at
the height corresponding to its frequency. Finally, connect the
points. You should include one class interval below the lowest value
in your data and one above the highest value. The graph will then
touch the X-axis on both sides.
A frequency polygon for 642 psychology test scores shown in
Figure 1 was constructed from the frequency table shown in Table 1.
Table 1. Frequency Distribution of Psychology Test Scores.
Lower Limit    Upper Limit    Count    Cumulative Count
29.5           39.5           0        0
39.5           49.5           3        3
49.5           59.5           10       13
59.5           69.5           53       66
69.5           79.5           107      173
79.5           89.5           147      320
89.5           99.5           130      450
99.5           109.5          78       528
109.5          119.5          59       587
119.5          129.5          36       623
129.5          139.5          11       634
139.5          149.5          6        640
149.5          159.5          1        641
159.5          169.5          1        642
169.5          179.5          0        642

The first label on the X-axis is 35. This represents an interval
extending from 29.5 to 39.5. Since the lowest test score is 46, this
interval has a frequency of 0. The point labeled 45 represents the
interval from 39.5 to 49.5. There are three scores in this interval.
There are 147 scores in the interval that surrounds 85.
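The points of the frequency polygon can be listed with a short sketch. The X positions use the axis labels given in the text (35, 45, and so on), and the counts are those of Table 1, with an empty interval added at each end; the variable names are our own.

```python
# Axis labels for the intervals 29.5-39.5 through 169.5-179.5.
labels = list(range(35, 185, 10))

# Counts per interval; the first and last are the added empty intervals,
# which make the polygon touch the X-axis on both sides.
counts = [0, 3, 10, 53, 107, 147, 130, 78, 59, 36, 11, 6, 1, 1, 0]

# One point per interval: (label on the X-axis, frequency on the Y-axis).
points = list(zip(labels, counts))

for x, y in points:
    print(x, y)
```

Connecting these points in order gives the polygon; a plotting library could then draw the line.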
You can easily discern the shape of the distribution from Figure
1. Most of the scores are between 65 and 115. It is clear that the
distribution is not symmetric inasmuch as good scores (to the right)
trail off more gradually than poor scores (to the left). In the
terminology of Chapter 3 (where we will study shapes of
distributions more systematically), the distribution is skewed.

Figure 1. Frequency polygon for the psychology test scores.


A cumulative frequency polygon for the same test scores is shown
in Figure 2. The graph is the same as before except that the Y value
for each point is the number of students in the corresponding class
interval plus all numbers in lower intervals. For example, there are
no scores in the interval labeled "35," three in the interval "45," and
10 in the interval "55." Therefore, the Y value corresponding to "55"
is 13. Since 642 students took the test, the cumulative frequency
for the last interval is 642.
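The running totals can be checked with a brief sketch. It uses the counts from Table 1 and the standard-library function itertools.accumulate; the variable names are our own.

```python
from itertools import accumulate

# Axis labels and counts for the intervals 29.5-39.5 through 169.5-179.5.
labels = list(range(35, 185, 10))
counts = [0, 3, 10, 53, 107, 147, 130, 78, 59, 36, 11, 6, 1, 1, 0]

# Each Y value is the count for that interval plus all counts below it.
cumulative = list(accumulate(counts))

for x, y in zip(labels, cumulative):
    print(x, y)
```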

Figure 2. Cumulative frequency polygon for the psychology test
scores.
Frequency polygons are useful for comparing distributions. This is
achieved by overlaying the frequency polygons drawn for different
data sets. Figure 3 provides an example. The data come from a task
in which the goal is to move a computer cursor to a target on the
screen as fast as possible. On 20 of the trials, the target was a
small rectangle; on the other 20, the target was a large rectangle.
Time to reach the target was recorded on each trial. The two
distributions (one for each target) are plotted together in Figure 3.
The figure shows that, although there is some overlap in times, it
generally took longer to move the cursor to the small target than to
the large one.

Figure 3. Overlaid frequency polygons.


It is also possible to plot two cumulative frequency distributions in
the same graph. This is illustrated in Figure 4 using the same data
from the cursor task. The difference in distributions for the two
targets is again evident.

Figure 4. Overlaid cumulative frequency polygons.


You might be curious about your own performance in the cursor
task. Try the task yourself, and compare your times with ours.

Question 1 out of 3.
A frequency polygon is very similar to a
histogram
stem and leaf display
listing of raw data
Answer: histogram. Frequency polygons do not list the raw data, as
stem and leaf plots do. Frequency polygons are very similar to
histograms, except histograms have bars and frequency polygons have
dots and lines connecting the frequencies of each class interval.
Box Plots
Author(s)
David M. Lane
Prerequisites
Percentiles, Histograms, Frequency Polygons
Learning Objectives
1. Define basic terms including hinges, H-spread, step, adjacent
value, outside value, and far out value
2. Create a box plot
3. Create parallel box plots
4. Determine whether a box plot is appropriate for a given data
set
We have already discussed techniques for visually representing data
(see histograms and frequency polygons). In this section, we present
another important graph called a box plot. Box plots are useful for
identifying outliers and for comparing distributions. We will explain
box plots with the help of data from an in-class experiment. As part
of the "Stroop Interference Case Study," students in introductory
statistics were presented with a page containing 30 colored
rectangles. Their task was to name the colors as quickly as possible.
Their times (in seconds) were recorded. We'll compare the scores
for the 16 men and 31 women who participated in the experiment
by making separate box plots for each gender. Such a display is said
to involve parallel box plots.

There are several steps in constructing a box plot. The first relies
on the 25th, 50th, and 75th percentiles in the distribution of scores.
Figure 1 shows how these three statistics are used. For each gender,
we draw a box extending from the 25th percentile to the
75th percentile. The 50th percentile is drawn inside the box.
Therefore,
the bottom of each box is the 25th percentile,
the top is the 75th percentile,
and the line in the middle is the 50th percentile.
The data for the women in our sample are shown in Table 1.
Table 1. Women's times.
14,15,16,16,17,17,17,17,
17,18,18,18,18,18,18,19,
19,19,20,20,20,20,20,20,
21,21,22,23,24,24,29

For these data, the 25th percentile is 17, the 50th percentile is 19,
and the 75th percentile is 20. For the men (whose data are not
shown), the 25th percentile is 19, the 50th percentile is 22.5, and the
75th percentile is 25.5.
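The three percentiles for the women can be checked with Python's standard library. Percentile definitions differ between textbooks and programs, so the choice of the "exclusive" method of statistics.quantiles is an assumption on our part; for these data it reproduces the values quoted above.

```python
from statistics import quantiles

# Women's times (in seconds) from Table 1.
women = [14, 15, 16, 16, 17, 17, 17, 17, 17, 18,
         18, 18, 18, 18, 18, 19, 19, 19, 20, 20,
         20, 20, 20, 20, 21, 21, 22, 23, 24, 24, 29]

# n=4 asks for the three cut points that divide the data into quarters.
q25, q50, q75 = quantiles(women, n=4, method="exclusive")
print(q25, q50, q75)
```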

Figure 1. The first step in creating box plots.


Before proceeding, the terminology in Table 2 is helpful.
Table 2. Box plot terms and values for women's times.
Name                 Formula                                       Value
Upper Hinge          75th Percentile                               20
Lower Hinge          25th Percentile                               17
H-Spread             Upper Hinge - Lower Hinge                     3
Step                 1.5 x H-Spread                                4.5
Upper Inner Fence    Upper Hinge + 1 Step                          24.5
Lower Inner Fence    Lower Hinge - 1 Step                          12.5
Upper Outer Fence    Upper Hinge + 2 Steps                         29
Lower Outer Fence    Lower Hinge - 2 Steps                         8
Upper Adjacent       Largest value below Upper Inner Fence         24
Lower Adjacent       Smallest value above Lower Inner Fence        14
Outside Value        A value beyond an Inner Fence but not         29
                     beyond an Outer Fence
Far Out Value        A value beyond an Outer Fence                 None

Continuing with the box plots, we put "whiskers" above and below
each box to give additional information about the spread of the
data. Whiskers are vertical lines that end in a horizontal stroke.
Whiskers are drawn from the upper and lower hinges to the upper
and lower adjacent values (24 and 14 for the women's data).
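The box plot terms can be computed directly from their definitions. The sketch below takes the hinges for the women's data as given in the text (20 and 17) and derives the remaining quantities; the variable names are our own.

```python
# Women's times (in seconds) from Table 1.
women = [14, 15, 16, 16, 17, 17, 17, 17, 17, 18,
         18, 18, 18, 18, 18, 19, 19, 19, 20, 20,
         20, 20, 20, 20, 21, 21, 22, 23, 24, 24, 29]

upper_hinge, lower_hinge = 20, 17        # 75th and 25th percentiles

h_spread = upper_hinge - lower_hinge
step = 1.5 * h_spread
upper_inner, lower_inner = upper_hinge + step, lower_hinge - step
upper_outer, lower_outer = upper_hinge + 2 * step, lower_hinge - 2 * step

# Whiskers end at the adjacent values, the most extreme values inside the inner fences.
upper_adjacent = max(x for x in women if x < upper_inner)
lower_adjacent = min(x for x in women if x > lower_inner)

# Outside values lie beyond an inner fence but not beyond an outer fence.
outside = [x for x in women
           if upper_inner < x <= upper_outer or lower_outer <= x < lower_inner]
# Far out values lie beyond an outer fence.
far_out = [x for x in women if x > upper_outer or x < lower_outer]

print(h_spread, step, upper_inner, lower_inner, upper_outer, lower_outer)
print(upper_adjacent, lower_adjacent, outside, far_out)
```

The time of 29 seconds equals the upper outer fence, so it counts as an outside value rather than a far out value.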

Figure 2. The box plots with the whiskers drawn.

Although we don't draw whiskers all the way to outside or far out
values, we still wish to represent them in our box plots. This is
achieved by adding additional marks beyond the whiskers.
Specifically, outside values are indicated by small "o's" and far out
values are indicated by asterisks (*). In our data, there are no far
out values and just one outside value. This outside value of 29 is for
the women and is shown in Figure 3.

Figure 3. The box plots with the outside value shown.


There is one more mark to include in box plots (although sometimes
it is omitted). We indicate the mean score for a group by inserting a
plus sign. Figure 4 shows the result of adding means to our box
plots.

Figure 4. The completed box plots.


Figure 4 provides a revealing summary of the data. Since half
the scores in a distribution are between the hinges (recall that the
hinges are the 25th and 75th percentiles), we see that half the
women's times are between 17 and 20 seconds, whereas half the
men's times are between 19 and 25.5. We also see that women
generally named the colors faster than the men did, although one
woman was slower than almost all of the men. Figure 5 shows the
box plot for the women's data with detailed labels.

Figure 5. The box plot for the women's data with detailed labels.
Box plots provide basic information about a distribution. For
example, a distribution with a positive skew would have a longer
whisker in the positive direction than in the negative direction. A
larger mean than median would also indicate a positive skew. Box
plots are good at portraying extreme values and are especially good
at showing differences between distributions. However, many of the
details of a distribution are not revealed in a box plot, and to
examine these details one should create a histogram and/or a stem
and leaf display.
Here are some other examples of box plots:
Time to move the mouse over a target
Draft lottery
VARIATIONS ON BOX PLOTS
Statistical analysis programs may offer options on how box plots are
created. For example, the box plots in Figure 6 are constructed from
our data but differ from the previous box plots in several ways.
1. Outliers are not marked.
2. The means are indicated by green lines rather than plus signs.
3. The mean of all scores is indicated by a gray line.
4. Individual scores are represented by dots. Since the scores
have been rounded to the nearest second, any given dot might
represent more than one score.
5. The box for the women is wider than the box for the men
because the widths of the boxes are proportional to the
number of subjects of each gender (31 women and 16 men).

Figure 6. Box plots showing the individual scores and the means.
Each dot in Figure 6 represents a group of subjects with the
same score (rounded to the nearest second). An alternative
graphing technique is to jitter the points. This means spreading out
different dots at the same horizontal position, one dot for each
subject. The exact horizontal position of a dot is determined
randomly (under the constraint that different dots don't overlap
exactly). Spreading out the dots helps you to see multiple
occurrences of a given score. However, depending on the dot size
and the screen resolution, some points may be obscured even if the
points are jittered. Figure 7 shows what jittering looks like.

Figure 7. Box plots with the individual scores jittered.
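Jittering can be sketched as follows. This is only an illustration of the idea, not the procedure used to draw Figure 7: the function name jitter and its parameters spread and seed are our own, and the sketch simply gives each dot a random horizontal offset.

```python
import random

def jitter(scores, spread=0.2, seed=0):
    """Pair each score with a random horizontal offset in [-spread, spread]."""
    rng = random.Random(seed)            # fixed seed so the plot is repeatable
    return [(rng.uniform(-spread, spread), s) for s in scores]

# A few of the women's times, including repeated scores.
scores = [17, 17, 17, 18, 18, 20, 20, 20, 20, 29]
dots = jitter(scores)

for x, y in dots:
    print(f"{x:+.3f}", y)
```

Each score keeps its own value on the score axis; only the horizontal position varies, so repeated scores no longer fall on top of one another.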


Different styles of box plots are best for different situations, and
there are no firm rules for which to use. When exploring your data,
you should try several ways of visualizing them. Which graphs you
include in your report should depend on how well different graphs
reveal the aspects of the data you consider most important.
Question 1 out of 6.
What is the upper hinge?

The upper hinge is the 75th percentile. It is the top of the box.
Bar Charts
Author(s)
David M. Lane

Prerequisites
Graphing Qualitative Variables
Learning Objectives
1. Create and interpret bar charts
2. Judge whether a bar chart or another graph such as a box plot
would be more appropriate
In the section on qualitative variables, we saw how bar charts could
be used to illustrate the frequencies of different categories. For
example, the bar chart shown in Figure 1 shows how many
purchasers of iMac computers were previous Macintosh users,
previous Windows users, and new computer purchasers.

Figure 1. iMac buyers as a function of previous computer ownership.


In this section, we show how bar charts can be used to present
other kinds of quantitative information, not just frequency counts.
The bar chart in Figure 2 shows the percent increases in the Dow
Jones, Standard and Poor 500 (S & P), and Nasdaq stock indexes
from May 24th 2000 to May 24th 2001. Notice that both the S & P
and the Nasdaq had negative increases, which means that they
decreased in value. In this bar chart, the Y-axis is not frequency but
rather the signed quantity percentage increase.

Figure 2. Percent increase in three stock indexes from May 24th 2000
to May 24th 2001.
Bar charts are particularly effective for showing change over time.
Figure 3, for example, shows the percent increase in the Consumer
Price Index (CPI) over four three-month periods. The fluctuation in
inflation is apparent in the graph.

Figure 3. Percent change in the CPI over time. Each bar represents
percent increase for the three months ending at the date indicated.
Bar charts are often used to compare the means of different
experimental conditions. Figure 4 shows the mean time it took one
of us (DL) to move the mouse to either a small target or a large
target. On average, more time was required for small targets than
for large ones.

Figure 4. Bar chart showing the means for the two conditions.
Although bar charts can display means, we do not recommend them
for this purpose. Box plots should be used instead since they
provide more information than bar charts without taking up more
space. For example, a box plot of the mouse-movement data is
shown in Figure 5. You can see that Figure 5 reveals more about the
distribution of movement times than does Figure 4.

Figure 5. Box plots of times to move the mouse to the small and large
targets.

The section on qualitative variables presented earlier in this chapter
discussed the use of bar charts for comparing distributions. Some
common graphical mistakes were also noted. The earlier discussion
applies equally well to the use of bar charts to display quantitative
variables.
Question 1 out of 2.
True or false: Bar charts can only be used for qualitative variables.
Answer: False. Although bar charts can be used for qualitative
variables, they can also portray quantitative variables.
Line Graphs
Author(s)
David M. Lane
Prerequisites
Bar Graphs
Learning Objectives
1. Create and interpret line graphs
2. Judge whether a line graph would be appropriate for a given
data set
A line graph is a bar graph with the tops of the bars represented by
points joined by lines (the rest of the bar is suppressed). For
example, Figure 1 was presented in the section on bar charts and
shows changes in the Consumer Price Index (CPI) over time.

Figure 1. A bar chart of the percent change in the CPI over time. Each
bar represents percent increase for the three months ending at the
date indicated.
A line graph of these same data is shown in Figure 2. Although the
figures are similar, the line graph emphasizes the change from
period to period.

Figure 2. A line graph of the percent change in the CPI over time.
Each point represents percent increase for the three months ending
at the date indicated.
Line graphs are appropriate only when both the X- and Y-axes
display ordered (rather than qualitative) variables. Although bar
graphs can also be used in this situation, line graphs are generally
better at comparing changes over time. Figure 3, for example,
shows percent increases and decreases in five components of the
Consumer Price Index (CPI). The figure makes it easy to see that
medical costs had a steadier progression than the other
components. Although you could create an analogous bar chart, its
interpretation would not be as easy.

Figure 3. A line graph of the percent change in five components of the
CPI over time.
Let us stress that it is misleading to use a line graph when the X-axis
contains merely qualitative variables. Figure 4 inappropriately
shows a line graph of the card game data from Yahoo, discussed in
the section on qualitative variables. The defect in Figure 4 is that it
gives the false impression that the games are naturally ordered in a
numerical way.

Figure 4. A line graph, inappropriately used, depicting the number of
people playing different card games on Sunday and Wednesday.
Question 1 out of 2.
Line graphs are most similar to
bar charts.

histograms.

stem and leaf displays.

frequency polygons.

A line graph is a bar graph with the tops of the bars represented by
points joined by lines.
Dot Plots
Author(s)
David M. Lane
Prerequisites
Bar Charts
Learning Objectives
1. Create and interpret dot plots
2. Judge whether a dot plot would be appropriate for a given data
set
Dot plots can be used to display various types of information. Figure
1 uses a dot plot to display the number of M & M's of each color
found in a bag of M & M's. Each dot represents a single M & M. From
the figure, you can see that there were 3 blue M & M's, 19 brown M
& M's, etc.

Figure 1. A dot plot showing the number of M & M's of various colors
in a bag of M & M's.

The dot plot in Figure 2 shows the number of people playing
various card games on the Yahoo website on a Wednesday. Unlike
Figure 1, the location rather than the number of dots represents the
frequency.

Figure 2. A dot plot showing the number of people playing various
card games on a Wednesday.
The dot plot in Figure 3 shows the number of people playing on
a Sunday and on a Wednesday. This graph makes it easy to
compare the popularity of the games separately for the two days,
but does not make it easy to compare the popularity of a given
game on the two days.

Figure 3. A dot plot showing the number of people playing various
card games on a Sunday and on a Wednesday.

Figure 4. An alternate way of showing the number of people playing
various card games on a Sunday and on a Wednesday.
The dot plot in Figure 4 makes it easy to compare the days of the
week for specific games while still portraying differences among
games.
Question 1 out of 3.
Dot plots are typically used to represent frequencies.
True

False

Although dot plots could be used to represent statistics such as
means, it is not recommended. They are typically used for
frequencies.

Statistical Literacy
Author(s)
Seyd Ercan and David Lane

Are Commercial Vehicles in Texas Unsafe?
Prerequisites
Graphing Distributions

A news report on the safety of commercial vehicles in Texas stated
that one out of five commercial vehicles have been pulled off the
road in 2012 because they were unsafe. In addition, 12,301
commercial drivers have been banned from the road for safety
violations.
The author presents the bar chart below to provide information
about the percentage of fatal crashes involving commercial vehicles
in Texas since 2006. The author also quotes DPS director Steven
McCraw:
Commercial vehicles are responsible for approximately 15
percent of the fatalities in Texas crashes. Those who choose to
drive unsafe commercial vehicles or drive a commercial vehicle
unsafely pose a serious threat to the motoring public.

WHAT DO YOU THINK?


Based on what you have learned in this
chapter, does this bar chart provide enough
information to conclude that unsafe or
unsafely driven commercial vehicles pose a
serious threat to the motoring public? What
might you conclude if 30 percent of all the
vehicles on the roads of Texas in 2010 were
commercial and accounted for 16 percent of
fatal crashes?
This bar chart does not provide enough information to draw such a
conclusion because we don't know, on average, in a given year
what percentage of all vehicles on the road are commercial vehicles.
For example, if 30 percent of all the vehicles on the roads of Texas
in 2010 are commercial ones and only 16 percent of fatal crashes
involved commercial vehicles, then commercial vehicles are safer
than non-commercial ones. Note that in this case 70 percent of
vehicles are non-commercial and they are responsible for 84
percent of the fatal crashes.
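
The comparison above can be checked with a little arithmetic. The following is a minimal sketch that uses the hypothetical 30 percent figure from the question (not measured data) and an illustrative helper named `relative_involvement`, which is not part of the original text. It divides each group's share of fatal crashes by its share of vehicles; a ratio below 1 means the group is involved in fewer fatal crashes than its numbers on the road would suggest.

```python
def relative_involvement(crash_share, vehicle_share):
    """Ratio of a group's share of fatal crashes to its share of vehicles."""
    return crash_share / vehicle_share

# Hypothetical figures taken from the question above (not measured data).
commercial = relative_involvement(0.16, 0.30)
non_commercial = relative_involvement(0.84, 0.70)

print(f"commercial:     {commercial:.2f}")
print(f"non-commercial: {non_commercial:.2f}")
```

Under these assumed figures the commercial ratio is well below 1 and the non-commercial ratio is above 1, which is the sense in which commercial vehicles would be the safer group.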

Linear By Design
Prerequisites
Graphing Distributions
Fox News aired the line graph below showing the number
unemployed during four quarters between 2007 and 2010.

WHAT DO YOU THINK?
Does Fox News' line graph provide misleading information? Why or why not?

There are major flaws with the Fox News graph. First, the title of
the graph is misleading. Although the data show the number
unemployed, the Fox News graph is titled "Job Loss by Quarter."
Second, the intervals on the X-axis are misleading. Although there
are 6 months between September 2008 and March 2009 and 15
months between March 2009 and June 2010, the intervals are
represented in the graph by very similar lengths. This gives the
false impression that unemployment increased steadily.
The graph presented below is corrected so that distances on the X-axis
are proportional to the number of days between the dates. This
graph shows clearly that the rate of increase in the number
unemployed is greater between September 2008 and March 2009
than it is between March 2009 and June 2010.
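
The idea of a time-proportional X-axis can be sketched in a few lines. The example below is only an illustration: it uses whole months as a stand-in for days, it covers only the three dates named in the text, and the helper `months_between` is a hypothetical name introduced here.

```python
def months_between(start, end):
    """Whole months from start to end, each given as a (year, month) pair."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])

# The three dates named in the text, as (year, month) pairs.
dates = [(2008, 9), (2009, 3), (2010, 6)]

# X-position of each date, measured in months from the first date.
positions = [months_between(dates[0], d) for d in dates]

# Width of each interval between neighboring dates.
gaps = [b - a for a, b in zip(positions, positions[1:])]

print(positions)
print(gaps)
```

Plotting the points at these positions, rather than at equally spaced positions, makes the second interval two and a half times as wide as the first, so the slopes of the line segments reflect the actual rates of change.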

Learning Objectives

Construct a stem-and-leaf plot.

Understand the importance of a stem-and-leaf plot in statistics.

Construct and interpret a pie chart.

Construct and interpret a bar graph.

Create a frequency distribution chart.

Construct and interpret a histogram.

Use technology to create graphical representations of data.

What is the puppet doing? She can't be cutting a pizza, because the pieces are all
different colors and sizes. It seems like she is drawing some type of a display to show
different amounts of a whole circle. The colors must represent different parts of the
whole. As you proceed through this lesson, refer back to this picture so that you will be
able to create a meaningful and detailed answer to the question, "What is the puppet
doing?"

Pie Charts
Pie charts, or circle graphs, are used extensively in statistics. These graphs appear
often in newspapers and magazines. A pie chart shows the relationship of the parts to
the whole by visually comparing the sizes of the sections (slices). Pie charts can be
constructed by using a hundreds disk or by using a circle. The hundreds disk is built on
the concept that the whole of anything is 100%, while the circle is built on the concept
that 360° is the whole of anything. Both methods of creating a pie chart are acceptable,
and both will produce the same result. The sections have different colors to enable an
observer to clearly see the differences in the sizes of the sections. The following
example will first be done by using a hundreds disk and then by using a circle.
Example 10
The Red Cross Blood Donor Clinic had a very successful morning collecting blood
donations. Within 3 hours, 25 people had made donations, and the following is a table
showing the blood types of the donations:

Blood Type          A    B    O    AB
Number of donors    7    5    9    4

Construct a pie chart to represent the data.


Solution:
Step 1: Determine the total number of donors: 7+5+9+4=25.
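Step 2: Express each donor number as a percent of the whole by using the
formula $\text{Percent} = \frac{f}{n} \times 100\%$, where $f$ is the frequency and $n$ is the total number.

$\frac{7}{25} \times 100\% = 28\%$
$\frac{5}{25} \times 100\% = 20\%$
$\frac{9}{25} \times 100\% = 36\%$
$\frac{4}{25} \times 100\% = 16\%$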
Step 3: Use a hundreds disk and simply count the correct number for each blood type
(1 line = 1 percent).
Step 4: Graph each section. Write the name and correct percentage inside the section.
Color each section a different color.

The above pie chart was created by using a hundreds disk, which is a circle with 100
divisions in groups of 5. Each division (line) represents 1 percent. From the graph, you
can see that more donations were of Type O than any other type. The fewest number of
donations of blood collected was of Type AB. If the percentages had not been entered in
each section, these same conclusions could have been made based simply on the size
of each section.
Solution:
Step 1: Determine the total number of donors: 7+5+9+4=25.
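Step 2: Express each donor number as the number of degrees of a circle that it
represents by using the formula $\text{Degrees} = \frac{f}{n} \times 360^\circ$, where $f$ is the frequency and $n$ is
the total number.

$\frac{7}{25} \times 360^\circ = 100.8^\circ$
$\frac{5}{25} \times 360^\circ = 72^\circ$
$\frac{9}{25} \times 360^\circ = 129.6^\circ$
$\frac{4}{25} \times 360^\circ = 57.6^\circ$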
Step 3: Using a protractor, graph each section of the circle.
Step 4: Write the name and correct percentage inside each section. Color each section
a different color.

The above pie chart was created by using a protractor and graphing each section of the
circle according to the number of degrees needed. From the graph, you can see that
more donations were of Type O than any other type. The fewest number of donations of
blood collected was of Type AB. Notice that the percentages have been entered in each
section of the graph and not the numbers of degrees. This is because degrees would
not be meaningful to an observer trying to interpret the graph. In order to create a pie
chart by using a circle, it is necessary to use the formula to calculate the number of
degrees for each section, and in order to create a pie chart by using a hundreds disk, it
is necessary to use the formula to determine the percentage for each section. In the
end, however, both methods result in identical graphs.
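
Both formulas from Example 10 can be tried side by side. The sketch below is illustrative only: the function names `percent` and `degrees` are introduced here, and the pairing of the counts 7 and 5 with blood types A and B is an assumption based on the order of the sum 7 + 5 + 9 + 4.

```python
# Donor counts from Example 10. The pairing of 7 and 5 with types A and B
# is assumed from the order of the sum 7 + 5 + 9 + 4.
donors = {"A": 7, "B": 5, "O": 9, "AB": 4}

total = sum(donors.values())

def percent(f, n):
    """Percent = (f / n) * 100, rounded to one decimal place."""
    return round(f / n * 100, 1)

def degrees(f, n):
    """Degrees = (f / n) * 360, rounded to one decimal place."""
    return round(f / n * 360, 1)

for blood_type, f in donors.items():
    print(f"{blood_type:>2}: {percent(f, total)}%  {degrees(f, total)} degrees")
```

Because both functions scale the same fraction f/n, the sections come out in the same proportions whichever method is used; only the units (percent of a hundreds disk or degrees of a circle) differ.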
Example 11
A new restaurant is opening in town, and the owner is trying very hard to complete the
menu. He wants to include a choice of 5 salads and has presented his partner with the
following pie chart to represent the results of a recent survey that he conducted of the
townspeople. The survey asked the question, "What is your favorite kind of salad?"

Use the pie chart to answer the following questions:

1. Which salad was the most popular choice?
2. Which salad was the least popular choice?
3. If 300 people were surveyed, how many people chose each type of salad?
4. What is the difference between the number of people who chose the spinach
salad and the number of people who chose the garden salad?

Solution:
1. The most popular salad was the caesar salad.
2. The least popular salad was the taco salad.

3. Caesar salad: $35\% = \frac{35}{100} = 0.35$, so $(300)(0.35) = 105$ people
Taco salad: $10\% = \frac{10}{100} = 0.10$, so $(300)(0.10) = 30$ people
Spinach salad: $17\% = \frac{17}{100} = 0.17$, so $(300)(0.17) = 51$ people
Garden salad: $13\% = \frac{13}{100} = 0.13$, so $(300)(0.13) = 39$ people
Chef salad: $25\% = \frac{25}{100} = 0.25$, so $(300)(0.25) = 75$ people
4. The difference between the number of people who chose the spinach salad and the
number of people who chose the garden salad is $51 - 39 = 12$ people.
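
The conversion from pie-chart percentages to head counts in Example 11 can be sketched as follows. The variable names are illustrative, and integer arithmetic is used only because every product here happens to be an exact multiple of 100.

```python
# Percentages read from the pie chart in Example 11.
salad_percent = {"Caesar": 35, "Taco": 10, "Spinach": 17, "Garden": 13, "Chef": 25}
surveyed = 300

# count = surveyed * percent / 100; integer division is exact here because
# every product is a multiple of 100.
salad_count = {name: surveyed * pct // 100 for name, pct in salad_percent.items()}

print(salad_count)
print(salad_count["Spinach"] - salad_count["Garden"])
```

A quick sanity check is that the percentages add to 100 and the counts add back to the 300 people surveyed.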
If we revisit the puppet who was introduced at the beginning of the lesson, you should
now be able to create a story that details what she is doing. An example would be that
she is in charge of the student body and is presenting to the students the results of a
questionnaire regarding student activities for the first semester. Of the 5 activities, the
one that is orange in color is the most popular. The students have decided that they
want to have a winter carnival week more than any other activity.
Stem-and-Leaf Plots
In statistics, data is represented in tables, charts, and graphs. One disadvantage of
representing data in these ways is that the actual data values are often not retained.
One way to ensure that the data values are kept intact is to graph the values in a
stem-and-leaf plot. A stem-and-leaf plot is a method of organizing the data that includes
sorting the data and graphing it at the same time. This type of graph uses a stem as the
leading part of a data value and a leaf as the remaining part of the value. The result is a
graph that displays the sorted data in groups, or classes. A stem-and-leaf plot is used
most when the number of data values is large.
Example 12
At a local veterinarian school, the number of animals treated each day over a period of
20 days was recorded. Construct a stem-and-leaf plot for the data set, which is as
follows:

28, 34, 23, 35, 16, 17, 47, 5, 60, 26, 39, 35, 47, 35, 38, 35, 55, 47, 54, 48
Solution:
Step 1: Create the stem-and-leaf plot.
Some people prefer to arrange the data in order before the stems and leaves are
created. This will ensure that the values of the leaves are in order. However, this is not
necessary and can take a great deal of time if the data set is large. We will first create
the stem-and-leaf plot, and then we will organize the values of the leaves.

The leading digit of a data value is used as the stem, and the trailing digit is used as the
leaf. The numbers in the stem column should be consecutive numbers that begin with
the smallest class and continue to the largest class. If there are no values in a class, do
not enter a value in the leaf; just leave it blank.
Step 2: Organize the values in each leaf row.

Now that the graph has been constructed, there is a great deal of information that can
be learned from it.

The number of values in the leaf column should equal the number of data values that
were given in the table. The value that appears the most often in the same leaf row is
the trailing digit of the mode of the data set. The mode of this data set is 35. For 7 of the
20 days, the number of animals receiving treatment was between 34 and 39. The
veterinarian school treated a minimum of 5 animals and a maximum of 60 animals on
any one day. The median of the data can be quickly calculated by using the values in
the leaf column to locate the value in the middle position. In this stem-and-leaf plot, the
median is the mean of the numbers represented by the 10th and
the 11th leaves: $\frac{35 + 35}{2} = \frac{70}{2} = 35$.
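
The construction in Example 12 can also be sketched in code. This is a minimal illustration rather than a standard routine: the function `stem_and_leaf` is a name introduced here, and it assumes two-digit data so that the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict
from statistics import median, mode

# Animals treated per day, from Example 12.
data = [28, 34, 23, 35, 16, 17, 47, 5, 60, 26,
        39, 35, 47, 35, 38, 35, 55, 47, 54, 48]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (ones digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    # Include every consecutive stem, even one that has no leaves.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

print("median:", median(data), "mode:", mode(data))
```

Sorting before splitting means the leaves in each row are already in order, which combines Step 1 and Step 2 of the solution into a single pass.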
Example 13
The following numbers represent the growth (in centimeters) of some plants after 25
days.
Construct a stem-and-leaf plot to represent the data, and list 3 facts that you know
about the growth of the plants.

18, 10, 37, 36, 61, 39, 41, 49, 50, 52, 57, 53, 51, 57, 39, 48, 56, 33, 36, 19, 30, 41, 51, 38, 60
Solution:

Answers will vary, but the following are some possible responses:

From the stem-and-leaf plot, the growth of the plants ranged from a minimum of
10 cm to a maximum of 61 cm.

The median of the data set is the value in the 13th position, which is 41 cm.
There was no growth recorded in the class of 20 cm, so there is no number in the
leaf row.
The data set is multimodal.

Bar Graphs
The different types of graphs that you have seen so far are plots to use with quantitative
variables. A qualitative variable can be plotted using a bar graph. A bar graph is a plot
made of bars whose heights (vertical bars) or lengths (horizontal bars) represent the
frequencies of each category. There is 1 bar for each category, with space between
each bar, and the data that is plotted is discrete data. Each category is represented by
intervals of the same width. When constructing a bar graph, the category is usually
placed on the horizontal axis, and the frequency is usually placed on the vertical axis.
These values can be reversed if the bar graph has horizontal bars.
Example 14
Construct a bar graph to represent the depth of the Great Lakes:
Lake Superior 1,333 ft.
Lake Michigan 923 ft.
Lake Huron 750 ft.
Lake Ontario 802 ft.
Lake Erie 210 ft.
Solution:
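
The bar graph itself is a drawing, but the way it is constructed can be sketched in text form. The example below is only an illustration: the scale of one mark per 50 ft and the helper `bar` are assumptions made here, not part of the original solution.

```python
# Depths from Example 14, in feet.
depths = {"Superior": 1333, "Michigan": 923, "Huron": 750, "Ontario": 802, "Erie": 210}

FEET_PER_MARK = 50  # assumed scale: one '#' for every 50 ft

def bar(value, unit=FEET_PER_MARK):
    """Return a bar whose length is proportional to value."""
    return "#" * round(value / unit)

# One horizontal bar per lake, with the category on the left and the
# length of the bar standing for the depth.
for lake, depth in depths.items():
    print(f"{lake:<9}|{bar(depth)} {depth} ft")
```

Each lake gets exactly one bar, all bars share the same scale, and the bars are separated from one another, which are the same rules that apply when the graph is drawn on grid paper.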

Example 15
The following bar graph represents the results of a survey to determine the type of TV
shows watched by high school students:

Use the bar graph to answer the following questions:

1. What type of show is watched the most?
2. What type of show is watched the least?
3. Approximately how many students participated in the survey?
4. Does the graph show the differences between the preferences of males and
females?

Solution:
1. Sit-coms are watched the most.
2. Quiz shows are watched the least.
3. Approximately 45 + 20 + 18 + 6 + 35 + 16 = 140 students participated in the survey.
4. No, the graph does not show the differences between the preferences of males
and females.

If bar graphs are constructed on grid paper, it is very easy to keep the intervals the
same size and to keep the bars evenly spaced. In addition to helping in the appearance
of the graph, grid paper also enables you to more accurately determine the frequency of
each class.
Example 16
The following bar graph represents the part-time jobs held by a group of grade 10
students:

Using the above bar graph, answer the following questions:


1. What was the most popular part-time job?
2. What was the part-time job held by the least number of students?
3. Which part-time jobs employed 10 or more of the students?
4. Is it possible to create a table of values for the bar graph? If so, construct the
table of values.
5. What percentage of the students worked as a delivery person?

Solution:
1. The most popular part-time job was in the fast food industry.
2. The part-time job of tutoring was the one held by the least number of students.
3. The part-time jobs that employed 10 or more students were in the fast food, delivery,
lawn maintenance, and grocery store businesses.
4. Yes, it's possible to create a table of values for the bar graph.
Part-Time Job     Number of Students
Baby Sitting
Fast Food         14
Delivery          12
Lawn Care         13
Grocery Store     10
Tutoring

5. The percentage of the students who worked as a delivery person was approximately
19.4%.

Histograms
An extension of the bar graph is the histogram. A histogram is a type of vertical bar
graph in which the bars represent grouped continuous data. While there are similarities
between a bar graph and a histogram, such as each bar being the same width, a
histogram has no spaces between the bars. The quantitative data is grouped according
to a determined bin size, or interval. The bin size refers to the width of each bar, and the
data is placed in the appropriate bin.
The bins, or groups of data, are plotted on the x-axis, and the frequencies of the bins
are plotted on the y-axis. A grouped frequency distribution is constructed for the
numerical data, and this table is used to create the histogram. In most cases, the
grouped frequency distribution is designed so there are no breaks in the intervals. The
last value of one bin is actually the first value counted in the next bin. This means that if
you had groups of data with a bin size of 10, the bins would be represented by the
notation [0-10), [10-20), [20-30), etc. Each bin appears to contain 11 values, which is 1
more than the desired bin size of 10. Therefore, the last digit of each bin is counted as
the first digit of the following bin.

The first bin includes the values 0 through 9, and the next bin includes the values 10
through 19. This makes the bins the proper size. Bin sizes are written in this manner to
simplify the process of grouping the data. The first bin can begin with the smallest
number of the data set and end with the value determined by adding the bin width to
this value, or the bin can begin with a reasonable value that is smaller than the smallest
data value.
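
The half-open bin notation can be made concrete with a short sketch. The function `bin_for` is a hypothetical helper introduced here; it assumes bins of equal width that begin at a chosen starting value.

```python
def bin_for(value, width=10, start=0):
    """Return the half-open bin [low, high) that contains value."""
    low = start + ((value - start) // width) * width
    return (low, low + width)

# Boundary values show which bin "owns" each endpoint.
for v in (0, 9, 10, 19, 20):
    low, high = bin_for(v)
    print(f"{v:>2} falls in [{low}-{high})")
```

A value that sits exactly on a boundary, such as 10, is counted in the bin that starts there and not in the bin that ends there, so every value belongs to exactly one bin.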
Example 17
Construct a frequency distribution table with a bin size of 10 for the following data,
which represents the ages of 30 lottery winners:

38, 41, 29, 33, 40, 74, 66, 45, 60, 55, 25, 52, 54, 61, 46, 51, 59, 57, 66, 62, 32, 47, 65, 50, 39, 22, 35, 72, 77, 49
Solution:
Step 1: Determine the range of the data by subtracting the smallest value from the
largest value.

Range: $77 - 22 = 55$
Step 2: Divide the range by the bin size to ensure that you have at least 5 groups of
data. A histogram should have from 5 to 10 bins to make it meaningful: $\frac{55}{10} = 5.5 \approx 6$.
Since you cannot have 0.5 of a bin, the result indicates that you will have at least 6 bins.
Step 3: Construct the table.
Bin        Frequency
[20-30)    3
[30-40)    5
[40-50)    6
[50-60)    8
[60-70)    5
[70-80)    3

Step 4: Determine the sum of the frequency column to ensure that all the data has been
grouped.

3+5+6+8+5+3=30
When data is grouped in a frequency distribution table, the actual data values are lost.
The table indicates how many values are in each group, but it doesn't show the actual
values.
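
Steps 1 and 2 above amount to one small calculation, sketched below. The function name `bin_count` is introduced here for illustration; it follows the steps literally by dividing the range by the bin size and rounding up.

```python
import math

def bin_count(smallest, largest, bin_size):
    """Range divided by bin size, rounded up to a whole number of bins."""
    return math.ceil((largest - smallest) / bin_size)

# Ages of the lottery winners: smallest 22, largest 77, bin size 10.
print(bin_count(22, 77, 10))
```

This gives the minimum number of bins. When the range is an exact multiple of the bin size, or when the first bin starts well below the smallest value, one more bin may be needed so that the largest value still falls inside a half-open interval.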

There are many different ways to create a distribution table and many different
distribution tables that can be created. However, for the purpose of constructing a
histogram, the method shown works very well, and it is not difficult to complete. When
the number of data values is very large, another column is often inserted in the
distribution table. This column is a tally column, and it is used to account for the number
of values within a bin. A tally column facilitates the creation of the distribution table and
usually allows the task to be completed more quickly.
Example 18
The numbers of years of service for 75 teachers in a small town are listed below:

1, 6, 11, 26, 21, 18, 2, 5, 27, 33, 7, 15, 22, 30, 8,
31, 5, 25, 20, 19, 4, 9, 19, 34, 3, 16, 23, 31, 10, 4,
2, 31, 26, 19, 3, 12, 14, 28, 32, 1, 17, 24, 34, 16, 1,
18, 29, 10, 12, 30, 13, 7, 8, 27, 3, 11, 26, 33, 29, 20,
7, 21, 11, 19, 35, 16, 5, 2, 19, 24, 13, 14, 28, 10, 31
Using the above data, construct a frequency distribution table with a bin size of 5.
Solution:

Range: $35 - 1 = 34$, and $\frac{34}{5} = 6.8 \approx 7$
You will have 7 bins.
For each value that is in a bin, draw a stroke in the Tally column. To make counting the
strokes easier, draw 4 strokes and cross them out with the fifth stroke. This process
bundles the strokes in groups of 5, and the frequency can be readily determined.
Bin        Tally             Frequency
[0-5)      |||| |||| |       11
[5-10)     |||| ||||         9
[10-15)    |||| |||| ||      12
[15-20)    |||| |||| ||||    14
[20-25)    |||| ||           7
[25-30)    |||| ||||         10
[30-35)    |||| |||| ||      12

11+9+12+14+7+10+12=75

Now that you have constructed the frequency table, the grouped data can be used to
draw a histogram. Like a bar graph, a histogram requires a title and properly labeled
x- and y-axes.
Example 19
Use the data from Example 17 that displays the ages of the lottery winners to construct
a histogram. The data is shown again below:
Bin        Frequency
[20-30)    3
[30-40)    5
[40-50)    6
[50-60)    8
[60-70)    5
[70-80)    3

Solution:
Use the data as it is represented in the distribution table to construct the histogram.

From looking at the tops of the bars, you can see how many winners were in each
category, and by adding these numbers, you can determine the total number of winners.
You can also determine how many winners were within a specific category. For
example, you can see that 8 winners were 60 years of age or older. The graph can also
be used to determine percentages. For example, it can answer the question, "What
percentage of the winners were 50 years of age or older?" as follows:

$\frac{16}{30} \approx 0.533$, and $(0.533)(100\%) \approx 53.3\%$.
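
Reading a percentage off a histogram is a matter of adding the right bars and dividing by the total. The sketch below assumes the frequencies 3, 5, 6, 8, 5, 3 from the sum in Example 17 and uses an illustrative helper, `percent_at_or_above`, that is not part of the original text.

```python
# Frequencies from the distribution table in Example 17, keyed by the
# lower boundary of each bin of width 10.
frequency = {20: 3, 30: 5, 40: 6, 50: 8, 60: 5, 70: 3}

def percent_at_or_above(table, boundary):
    """Percentage of all values that fall in bins starting at or above boundary."""
    total = sum(table.values())
    selected = sum(f for low, f in table.items() if low >= boundary)
    return selected / total * 100

print(f"{percent_at_or_above(frequency, 50):.1f}%")
```

The boundary has to coincide with the edge of a bin. A question such as "55 or older" could not be answered from the histogram alone, because the grouped table no longer holds the individual ages.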

Example 20
a) Use the data and the distribution table that represent the years of service of the
teachers from Example 18 to construct a histogram to display the data. The distribution
table is shown again below:
Bin        Tally             Frequency
[0-5)      |||| |||| |       11
[5-10)     |||| ||||         9
[10-15)    |||| |||| ||      12
[15-20)    |||| |||| ||||    14
[20-25)    |||| ||           7
[25-30)    |||| ||||         10
[30-35)    |||| |||| ||      12

b) Now use the histogram to answer the following questions.


i) How many teachers teach in this small town?
ii) How many teachers have worked for less than 5 years?
iii) If teachers are able to retire when they have taught for 30 years or more, how many
are eligible to retire?
iv) What percentage of the teachers still have to teach for 10 years or fewer before they
are eligible to retire?
v) Do you think that the majority of the teachers are young or old? Justify your answer.
Solution:

a)
b) i) 11+9+12+14+7+10+12=75
In this small town, 75 teachers are teaching.
ii) 11 teachers have taught for less than 5 years.
iii) 12 teachers are eligible to retire.
iv) $\frac{17}{75} \approx 0.2267$, and $(0.2267)(100\%) \approx 23\%$
Approximately 23% of the teachers must teach for 10 years or fewer before they are
eligible to retire.
v) Answers will vary, but one possible answer is that the majority of the teachers are
young, because 46 have taught for less than 20 years.
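
The answers to part b) can be reproduced from the frequency column alone. The following is a sketch under the assumption that the frequencies are those listed in the sum 11 + 9 + 12 + 14 + 7 + 10 + 12; the variable names are illustrative.

```python
# Years-of-service frequencies from Example 18, keyed by the lower
# boundary of each bin of width 5.
frequency = {0: 11, 5: 9, 10: 12, 15: 14, 20: 7, 25: 10, 30: 12}

# i) total number of teachers
total = sum(frequency.values())
# ii) teachers with less than 5 years of service
under_5 = frequency[0]
# iii) teachers with 30 or more years, who are eligible to retire
can_retire = sum(f for low, f in frequency.items() if low >= 30)
# iv) teachers with 20 to just under 30 years, at most 10 years from retiring
within_10 = sum(f for low, f in frequency.items() if 20 <= low < 30)
# v) teachers with less than 20 years of service
under_20 = sum(f for low, f in frequency.items() if low < 20)

print(total, under_5, can_retire, under_20)
print(f"{within_10 / total:.0%}")
```

Every question in part b) asks about a range whose ends fall on bin boundaries, which is why the grouped table is enough to answer them.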
Technology can also be used to plot a histogram. The TI-83 can be used to create a
histogram by using STAT and STAT PLOT on the calculator.
Example 21
Scientists have invented a new dietary supplement that is supposed to increase the
weight of a piglet within its first 3 months of growth. Farmer John fed this supplement to
his stock of piglets, and at the end of 3 months, he recorded the weights of 50 randomly
selected piglets.
The following table is the recorded weights (in pounds) of the 50 selected piglets:

120, 111, 65, 110, 114, 72, 116, 105, 119, 114, 93, 113, 99, 118, 108, 97, 107, 95, 113, 75, 84, 120, 102, 104, 84,
97, 121, 69, 100, 101, 107, 118, 77, 105, 109, 78, 89, 68, 74, 103, 87, 67, 79, 90, 109, 94, 106, 96, 92, 88
Using the above data set and the TI-83, construct a histogram to represent the data.
Solution:

Using the TRACE feature will give you information about the data in each bar of the
histogram.

The TRACE feature tells you that in the first bin, which is [60-70), there are 4 values.

The TRACE feature tells you that in the second bin, which is [70-80), there are 6 values.
To advance to the next bin, or bar, of the histogram, use the cursor and move to the
right. The information obtained by using the TRACE feature will enable you to create a
frequency table and to draw the histogram on paper.
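
The same bin counts can be obtained without a calculator. The sketch below is a Python stand-in for the TRACE feature, not a description of how the TI-83 works; the weights are as read from the table in Example 21, and a bin width of 10 starting at 60 is assumed to match the bins [60-70) and [70-80) reported above.

```python
from collections import Counter

# Piglet weights (in pounds) as read from the table in Example 21.
weights = [120, 111, 65, 110, 114, 72, 116, 105, 119, 114, 93, 113, 99,
           118, 108, 97, 107, 95, 113, 75, 84, 120, 102, 104, 84,
           97, 121, 69, 100, 101, 107, 118, 77, 105, 109, 78, 89, 68,
           74, 103, 87, 67, 79, 90, 109, 94, 106, 96, 92, 88]

BIN_WIDTH = 10

# Key each weight by the lower boundary of its bin, e.g. 65 -> 60 for [60-70).
counts = Counter(w // BIN_WIDTH * BIN_WIDTH for w in weights)

for low in sorted(counts):
    print(f"[{low}-{low + BIN_WIDTH}): {counts[low]}")
```

The first two lines of output should agree with the TRACE readings quoted above, and the counts taken together should account for all 50 piglets.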
The shape of a histogram can tell you a lot about the distribution of the data, as well as
provide you with information about the mean, median, and mode of the data set. The
following are some typical histograms, with a caption below each one explaining the
distribution of the data, as well as the characteristics of the mean, median, and mode.
Distributions can have other shapes besides the ones shown below, but these represent
the most common ones that you will see when analyzing data. In each of the graphs
below, the distributions are not perfectly shaped, but are shaped enough to identify an
overall pattern.

a)
Figure a represents a bell-shaped distribution, which has a single peak and tapers off to
both the left and to the right of the peak. The shape appears to be symmetric about the
center of the histogram. The single peak indicates that the distribution is unimodal. The
highest peak of the histogram represents the location of the mode of the data set. The
mode is the data value that occurs the most often in a data set. For a symmetric
histogram, the values of the mean, median, and mode are all the same and are all
located at the center of the distribution.

b)
Figure b represents a distribution that is approximately uniform and forms a rectangular,
flat shape. The frequency of each class is approximately the same.

c)
Figure c represents a right-skewed distribution, which has a peak to the left of the
distribution and data values that taper off to the right. This distribution has a single peak
and is also unimodal. For a histogram that is skewed to the right, the mean is located to
the right on the distribution and is the largest value of the measures of central tendency.
The mean has the largest value because it is strongly affected by the outliers on the
right tail that pull the mean to the right. The mode is the smallest value, and it is located
to the left on the distribution. The mode always occurs at the highest point of the peak.
The median is located between the mode and the mean.

d)
Figure d represents a left-skewed distribution, which has a peak to the right of the
distribution and data values that taper off to the left. This distribution has a single peak
and is also unimodal. For a histogram that is skewed to the left, the mean is located to
the left on the distribution and is the smallest value of the measures of central tendency.
The mean has the smallest value because it is strongly affected by the outliers on the
left tail that pull the mean to the left. The median is located between the mode and the
mean.
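
The ordering of the mode, median, and mean in skewed distributions can be checked on a small example. The values below are made up purely for illustration and do not come from any figure in this lesson; the second list is simply a mirror image of the first.

```python
from statistics import mean, median, mode

# Made-up values chosen only to illustrate the two shapes.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
left_skewed = [15 - x for x in right_skewed]  # mirror image of the first set

for name, values in (("right-skewed", right_skewed), ("left-skewed", left_skewed)):
    print(name, "mode:", mode(values), "median:", median(values), "mean:", mean(values))
```

In the right-skewed list the few large values pull the mean above the median and the mode, and in the mirrored list the few small values pull it below them, which matches the descriptions of Figures c and d.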

e)
Figure e has no shape that can be defined. The only defining characteristic about this
distribution is that it has 2 peaks of the same height. This means that the distribution is
bimodal.
Another type of graph that can be drawn to represent the same set of data as a
histogram represents is a frequency polygon. A frequency polygon is a graph
constructed by using lines to join the midpoints of each interval, or bin. The heights of
the points represent the frequencies. A frequency polygon can be created from the
histogram or by calculating the midpoints of the bins from the frequency distribution
table. The midpoint of a bin is calculated by adding the upper and lower boundary
values of the bin and dividing the sum by 2.
Example 22

The following histogram represents the marks made by 40 students on a math 10 test.
Use the histogram to construct a frequency polygon to represent the data.

Solution:

There is no data value greater than 0 and less than 20. The jagged line that is inserted
on the x-axis is used to represent this fact. The area under the frequency polygon is the
same as the area under the histogram and is, therefore, equal to the frequency values
that would be displayed in a distribution table. The frequency polygon also shows the
shape of the distribution of the data, and in this case, it resembles a bell curve.
Example 23
The following distribution table represents the number of miles run by 20 randomly
selected runners during a recent road race:
Bin            Frequency
[5.5-10.5)
[10.5-15.5)
[15.5-20.5)
[20.5-25.5)
[25.5-30.5)
[30.5-35.5)
[35.5-40.5)

Using this table, construct a frequency polygon.


Solution:
Step 1: Calculate the midpoint of each bin by adding the 2 numbers of the interval and
dividing the sum by 2.

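Midpoints:
$\frac{5.5 + 10.5}{2} = \frac{16}{2} = 8$
$\frac{10.5 + 15.5}{2} = \frac{26}{2} = 13$
$\frac{15.5 + 20.5}{2} = \frac{36}{2} = 18$
$\frac{20.5 + 25.5}{2} = \frac{46}{2} = 23$
$\frac{25.5 + 30.5}{2} = \frac{56}{2} = 28$
$\frac{30.5 + 35.5}{2} = \frac{66}{2} = 33$
$\frac{35.5 + 40.5}{2} = \frac{76}{2} = 38$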
Step 2: Plot the midpoints on a grid, making sure to number the x-axis with a scale that
will include the bin sizes. Join the plotted midpoints with lines.
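
Step 1 can be sketched in a few lines. The function name `midpoint` is introduced here for illustration; the bin boundaries are those of the distribution table in Example 23.

```python
# Bin boundaries from the distribution table in Example 23.
bins = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
        (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]

def midpoint(lower, upper):
    """Midpoint of a bin: the sum of its boundaries divided by 2."""
    return (lower + upper) / 2

midpoints = [midpoint(lo, hi) for lo, hi in bins]
print(midpoints)
```

Because every bin has the same width of 5, the midpoints are also 5 apart, which gives the plotted points of the frequency polygon an even spacing along the x-axis.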

A frequency polygon usually extends 1 unit below the smallest bin value and 1 unit
beyond the greatest bin value. This extension gives the frequency polygon an
appearance of having a starting point and an ending point, which provides a view of the
distribution of data. If the data set were very large so that the number of bins had to be
increased and the bin size decreased, the frequency polygon would appear as a smooth
curve.
Lesson Summary
In this lesson, you learned how to represent data that was presented in various forms.
Data that could be represented as percentages was displayed in a pie chart, or circle
graph. Discrete data that was qualitative was displayed on a bar graph. Finally,
continuous data that was grouped was graphed on a histogram or on a frequency
polygon. You also learned to detect characteristics of a distribution by simply observing
the shape of a histogram. Once again, technology was shown to be an asset when
constructing a histogram.
Points to Consider

Can any of these graphs be used for comparing data?

Can these graphs be used to display solutions to problems in everyday life?

How do these graphs compare to ones presented in previous lessons?

Learning Objectives

Represent data that has a linear pattern on a graph.

Represent data using a broken-line graph.


Understand the difference between continuous data and discrete data as it
applies to a line graph.

Represent data that has no definite pattern as a scatter plot.

Draw a line of best fit on a scatter plot.

Use technology to create both line graphs and scatter plots.

Before you continue to explore the concept of representing data graphically, it is very
important to understand the meaning of some basic terms that will often be used in this
lesson. The first such definition is that of a variable. In statistics, a variable is simply a
characteristic that is being studied. This characteristic assumes different values for
different elements, or members, of the population, whether it is the entire population or
a sample. The value of the variable is referred to as an observation, or a measurement.
A collection of these observations of the variable is a data set.
Variables can be quantitative or qualitative. A quantitative variable is one that can be
measured numerically. Some examples of a quantitative variable are wages, prices,
weights, numbers of vehicles, and numbers of goals. All of these examples can be
expressed numerically. A quantitative variable can be classified as discrete or
continuous. A discrete variable is one whose values are all countable and that does not
include any values between 2 consecutive values of a data set. An example of a
discrete variable is the number of goals scored by a team during a hockey game.
A continuous variable is one that can assume any countable value, as well as all the
values between 2 consecutive numbers of a data set. An example of a continuous
variable is the number of gallons of gasoline used during a trip to the beach.
A qualitative variable is one that cannot be measured numerically but can be placed in
a category. Some examples of a qualitative variable are months of the year, hair color,
color of cars, a person's status, and favorite vacation spots. The following flow chart
should help you to better understand the above terms.

Example 1
Select the best descriptions for the following variables and indicate your selections by
marking an x in the appropriate boxes.
Variable                          Quantitative  Qualitative  Discrete  Continuous
Number of members in a family
A person's marital status
Length of a person's arm
Color of cars
Number of errors on a math test
Solution:
Variable                          Quantitative  Qualitative  Discrete  Continuous
Number of members in a family          x                        x
A person's marital status                            x
Length of a person's arm               x                                   x
Color of cars                                        x
Number of errors on a math test        x                        x

Variables can also be classified as dependent or independent. When there is a linear
relationship between 2 variables, the values of one variable depend upon the values of
the other variable. In a linear relation, the values of y depend upon the values of x.
Therefore, the dependent variable is represented by the values that are plotted on
the y-axis, and the independent variable is represented by the values that are plotted
on the x-axis.
Example 2

Sally works at the local ballpark stadium selling lemonade. She is paid $15.00 each time
she works, plus $0.75 for each glass of lemonade she sells. Create a table of values to
represent Sally's earnings if she sells 8 glasses of lemonade. Use this table of values to
represent her earnings on a graph.
Solution:
The first step is to write an equation to represent her earnings and then to use this
equation to create a table of values.

y = 0.75x + 15, where y represents her earnings and x represents the number of
glasses of lemonade she sells.

Number of Glasses of Lemonade    Earnings
0                                $15.00
1                                $15.75
2                                $16.50
3                                $17.25
4                                $18.00
5                                $18.75
6                                $19.50
7                                $20.25
8                                $21.00

The dependent variable is the money earned, and the independent variable is the
number of glasses of lemonade sold. Therefore, money is on the y-axis, and the
number of glasses of lemonade is on the x-axis.
From the table of values, Sally will earn $21.00 if she sells 8 glasses of lemonade.

Now that the points have been plotted, the decision has to be made as to whether or not
to join them. Between every 2 points plotted on the graph are an infinite number of
values. If these values are meaningful to the problem, then the plotted points can be
joined. This type of data is called continuous data. If the values between the 2 plotted
points are not meaningful to the problem, then the points should not be joined. This type
of data is called discrete data. Since glasses of lemonade are represented by whole
numbers, and since fractions or decimals are not appropriate values, the points
between 2 consecutive values are not meaningful in this problem. Therefore, the points
should not be joined. The data is discrete.
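The table of values in Example 2 can also be generated with a few lines of code. The following is a minimal Python sketch, assuming the equation y = 0.75x + 15 from the solution; the function name earnings_table is a hypothetical helper introduced only for illustration.

```python
def earnings_table(rate, base, max_units):
    """Return (x, y) pairs for y = rate * x + base, with x = 0, 1, ..., max_units."""
    return [(x, rate * x + base) for x in range(max_units + 1)]


# Sally's earnings: $15.00 per shift plus $0.75 per glass, for 0 to 8 glasses.
table = earnings_table(rate=0.75, base=15.00, max_units=8)

for glasses, dollars in table:
    # Only whole numbers of glasses appear, which matches the discrete data.
    print(f"{glasses:>7}  ${dollars:.2f}")
```

Because x takes only whole-number values here, the pairs correspond to separate points on the graph rather than a joined line.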
Now it is time to revisit the problem presented in the introduction.
The local arena is trying to attract as many participants as possible to attend the
community's "Skate for Scoliosis" event. Participants pay a fee of $10.00 for registering,
and, in addition, the arena will donate $3.00 for each hour a participant skates, up to a
maximum of 6 hours. Create a table of values and draw a graph to represent a
participant who skates for the entire 6 hours. How much money can a participant raise
for the community if he/she skates for the maximum length of time?
Solution:
The equation for this scenario is y=3x+10, where y represents the money made by
the participant, and x represents the number of hours the participant skates.

Number of Hours Skating    Money Earned
0                          $10.00
1                          $13.00
2                          $16.00
3                          $19.00
4                          $22.00
5                          $25.00
6                          $28.00

The dependent variable is the money made, and the independent variable is the
number of hours the participant skated. Therefore, money is on the y-axis, and time is
on the x-axis as shown below:

A participant who skates for the entire 6 hours can make $28.00 for the "Skate for
Scoliosis" event. The points are joined, because the fractions and decimals between 2
consecutive points are meaningful for this problem. A participant could skate for 30
minutes, and the arena would pay that skater $1.50 for the time skating. The data is
continuous.
Linear graphs are important in statistics when several data sets are used to represent
information about a single topic. An example would be data sets that represent different
plans available for cell phone users. These data sets can be plotted on the same grid.
The resulting graph will show intersection points for the plans. These intersection points
indicate a coordinate where 2 plans are equal. An observer can easily interpret the
graph to decide which plan is best, and when. If the observer is trying to choose a plan
to use, the choice can be made easier by seeing a graphical representation of the data.
Example 3

The following graph represents 3 plans that are available to customers interested in
hiring a maintenance company to tend to their lawn. Using the graph, explain when it
would be best to use each plan for lawn maintenance.

Solution:
From the graph, the base fee that is charged for each plan is obvious. These values are
found on the y-axis. Plan A charges a base fee of $200.00, Plan C charges a base fee
of $100.00, and Plan B charges a base fee of $50.00. The cost per hour can be
calculated by using the values of the intersection points and the base fee in the
equation y=mx+b and solving for m. Plan B is the best plan to choose if the lawn
maintenance takes less than 12.5 hours. At 12.5 hours, Plan B and Plan C both cost
$150.00 for lawn maintenance. After 12.5 hours, Plan C is the best deal, until 50 hours
of lawn maintenance is needed. At 50 hours, Plan A and Plan C both cost $300.00 for
lawn maintenance. For more than 50 hours of lawn maintenance, Plan A is the best
plan. All of the above information was obvious from the graph and would enhance the
decision-making process for any interested client.
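The hourly rates mentioned in the solution can be recovered from the base fees and the intersection points by solving y = mx + b for m. The following Python sketch assumes only the values stated above (base fees of $200, $50, and $100, and intersections at 12.5 hours/$150 and 50 hours/$300); the helper names slope_from_point, cost, and cheapest are hypothetical.

```python
def slope_from_point(base_fee, hours, total_cost):
    """Solve total_cost = m * hours + base_fee for the hourly rate m."""
    return (total_cost - base_fee) / hours


# Base fees (y-intercepts) as read from the graph in Example 3.
base = {"A": 200.00, "B": 50.00, "C": 100.00}

# Hourly rates from the intersection points given in the solution.
rate = {
    "B": slope_from_point(base["B"], 12.5, 150.00),  # B meets C at (12.5, 150)
    "C": slope_from_point(base["C"], 12.5, 150.00),
    "A": slope_from_point(base["A"], 50.0, 300.00),  # A meets C at (50, 300)
}


def cost(plan, hours):
    """Total cost of a plan for a given number of hours."""
    return rate[plan] * hours + base[plan]


def cheapest(hours):
    """Name of the plan with the lowest total cost at the given hours."""
    return min(base, key=lambda plan: cost(plan, hours))


for hours in (5, 30, 60):
    plan = cheapest(hours)
    print(f"{hours} hours: Plan {plan} at ${cost(plan, hours):.2f}")
```

Checking one value on each side of the two intersection points reproduces the conclusion read from the graph: Plan B first, then Plan C, then Plan A.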
The above graphs represent linear functions, and are called linear (line) graphs. Each of
these graphs has a defined slope that remains constant when the line is plotted. A
variation of this graph is a broken-line graph. This type of line graph is used when it is
necessary to show change over time. A line is used to join the values, but the line has
no defined slope. However, the points are meaningful, and they all represent an
important part of the graph. Usually a broken-line graph is given to you, and you must
interpret the given information from the graph.
Example 4

The following graph is an example of a broken-line graph, and it represents the time of a
round-trip journey, driving from home to a popular campground and back.

a) How far is it from home to the picnic park?


b) How far is it from the picnic park to the campground?
c) At what 2 places did the car stop?
d) How long was the car stopped at the campground?
e) When does the car arrive at the picnic park?
f) How long did it take for the return trip?
g) What was the speed of the car from home to the picnic park?
h) What was the speed of the car from the campground to home?
Solution:
a) It is 40 miles from home to the picnic park.
b) It is 60 miles from the picnic park to the campground.
c) The car stopped at the picnic park and at the campground.
d) The car was stopped at the campground for 15 minutes.
e) The car arrived at the picnic park at 11:00 am.
f) The return trip took 1 hour.
g) The speed of the car from home to the picnic park was 40 mi/h.
h) The speed of the car from the campground to home was 100 mi/h.
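The speed in part h) follows from dividing the distance by the elapsed time. As a small illustration, the Python sketch below uses only the values given in the solution (40 miles, 60 miles, and a 1-hour return trip); the function name average_speed is a hypothetical helper.

```python
def average_speed(distance_miles, time_hours):
    """Average speed over a segment: distance divided by elapsed time."""
    return distance_miles / time_hours


# Values taken from the solution to Example 4.
home_to_park = 40          # miles, part a)
park_to_campground = 60    # miles, part b)
return_time = 1.0          # hours, part f)

# The return trip covers both legs in a single stretch.
return_distance = home_to_park + park_to_campground
print(average_speed(return_distance, return_time), "mi/h")
```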
Example 5
Sam decides to spend some time with his friend Aaron. He hops on his bike and starts
off to Aaron's house, but on his way, he gets a flat tire and must walk the remaining
distance. Once he arrives at Aaron's house, they repair the flat tire, play some poker,
and then Sam returns home. On his way home, Sam decides to stop at the mall to buy
a book on how to play poker. The following graph represents Sam's adventure:

a) How far is it from Sam's house to Aaron's house?
b) How far is it from Aaron's house to the mall?
c) At what time did Sam have a flat tire?
d) How long did Sam stay at Aaron's house?
e) At what speed did Sam travel from Aaron's house to the mall and then from the mall
to home?
Solution:
a) It is 25 km from Sam's house to Aaron's house.
b) It is 15 km from Aaron's house to the mall.
c) Sam had a flat tire at 10:00 am.
d) Sam stayed at Aaron's house for 1 hour.
e) Sam traveled at a speed of 30 km/h from Aaron's house to the mall and then at a
speed of 40 km/h from the mall to home.
Often, when real-world data is plotted, the result is a linear pattern. The general
direction of the data can be seen, but the data points do not all fall on a line. This type of
graph is called a scatter plot. A scatter plot is often used to investigate whether or not
there is a relationship or connection between 2 sets of data. The data is plotted on a
graph such that one quantity is plotted on the x-axis and one quantity is plotted on
the y-axis. The quantity that is plotted on the x-axis is the independent variable, and the
quantity that is plotted on the y-axis is the dependent variable. If a relationship does
exist between the 2 sets of data, it will be easy to see if the data is plotted on a scatter
plot.
The following scatter plot shows the price of peaches and the number sold:

The connection is obvious: when the price of peaches was high, the sales were low,
but when the price was low, the sales were high.
The following scatter plot shows the sales of a weekly newspaper and the temperature:

There is no connection between the number of newspapers sold and the temperature.
Another term used to describe 2 sets of data that have a connection or a relationship
is correlation. The correlation between 2 sets of data can be positive or negative, and it
can be strong or weak. The following scatter plots will help to enhance this concept.

If you look at the 2 sketches that represent a positive correlation, you will notice that the
points are around a line that slopes upward to the right. When the correlation is
negative, the line slopes downward to the right. The 2 sketches that show a strong
correlation have points that are bunched together and appear to be close to a line that is
in the middle of the points. When the correlation is weak, the points are more scattered
and not as concentrated.
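The direction and strength described above can also be summarized with a single number. One common choice, which this lesson itself does not introduce, is the Pearson correlation coefficient r. The Python sketch below is an illustration only: the data are made up, the names pearson_r and describe are hypothetical, and the 0.7 cutoff between "strong" and "weak" is an arbitrary assumption rather than a standard rule.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe(r, strong_cutoff=0.7):
    """Label r by strength and direction (the cutoff is an assumed convention)."""
    if r == 0:
        return "no correlation"
    strength = "strong" if abs(r) >= strong_cutoff else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


# Made-up data for illustration: sales fall as the price rises.
price = [1, 2, 3, 4, 5]
sold = [50, 41, 33, 18, 12]
print(describe(pearson_r(price, sold)))
```

A value of r near 1 or -1 corresponds to points bunched closely around a line, and a value near 0 corresponds to widely scattered points.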
In the sales of newspapers and the temperature, there was no connection between the
2 data sets. The following sketches represent some other possible outcomes when
there is no correlation between data sets:

Example 6
Plot the following points on a scatter plot, with m as the independent variable and n as
the dependent variable. Number both axes from 0 to 20. If a correlation exists between
the values of m and n, describe the correlation (strong negative, weak positive, etc.).

m    4    9   13   16   17    6    7   18   10
n    5    3   11   18    6   11   18   12   16


Solution:

Example 7
Describe the correlation, if any, in the following scatter plot:

Solution:
In the above scatter plot, there is a strong positive correlation.
You now know that a scatter plot can have either a positive or a negative correlation.
When this exists on a scatter plot, a line of best fit can be drawn on the graph. The line
of best fit must be drawn so that the sums of the distances to the points on either side
of the line are approximately equal and such that there are an equal number of points
above and below the line. Using a clear plastic ruler makes it easier to meet all of these
conditions when drawing the line. Another useful tool is a stick of spaghetti, since it can
be easily rolled and moved on the graph until you are satisfied with its location. The
edge of the spaghetti can be traced to produce the line of best fit. A line of best fit can
be used to make estimations from the graph, but you must remember that the line of
best fit is simply a sketch of where the line should appear on the graph. As a result, any
values that you choose from this line are not very accurate; the values are more of a
ballpark figure.
Example 8
The following table consists of the marks achieved by 9 students on chemistry and math
tests:
Student    Chemistry Marks    Math Marks
A          49                 29
B          46                 23
C          35                 10
D          58                 41
E          51                 38
F          56                 36
G          54                 31
H          46                 24
I          53

Plot the above marks on a scatter plot, with the chemistry marks on the x-axis and the
math marks on the y-axis. Draw a line of best fit, and use this line to estimate the mark
that Student I would have made in math had he or she taken the test.
Solution:

If Student I had taken the math test, his or her mark would have been between 32 and
37.
Scatter plots and lines of best fit can also be drawn by using technology. The TI-83 is
capable of both graphing a scatter plot and inserting the line of best fit onto the
scatter plot.
Example 9
Using the data from Example 8, create a scatter plot and draw a line of best fit with the
TI-83.
Student    Chemistry Marks    Math Marks
A          49                 29
B          46                 23
C          35                 10
D          58                 41
E          51                 38
F          56                 36
G          54                 31
H          46                 24
I          53

Solution:

The calculator can now be used to determine a linear regression equation for the given
values. The equation can be entered into the calculator, and the line will be plotted on
the scatter plot.

From the line of best fit, the calculated value for Student I's math test mark was 33.6.
Remember that the mark that you estimated was between 32 and 37.
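The same kind of calculation can be sketched outside the calculator. The Python example below assumes that the regression in question is ordinary least squares and fits it to the 8 students who have both marks; the function name least_squares is a hypothetical helper, and Student I (chemistry mark 53) is left out of the fit because the math mark is missing.

```python
def least_squares(xs, ys):
    """Slope and intercept of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Marks for the 8 students who wrote both tests (Example 8).
chemistry = [49, 46, 35, 58, 51, 56, 54, 46]
math_marks = [29, 23, 10, 41, 38, 36, 31, 24]

m, b = least_squares(chemistry, math_marks)

# Estimate the missing math mark for Student I, whose chemistry mark is 53.
estimate = m * 53 + b
print(f"y = {m:.3f}x + {b:.3f}; estimate at x = 53: {estimate:.1f}")
```

With the coefficients rounded to one decimal place (y = 1.3x - 35.3), the estimate at x = 53 is 33.6, the value quoted above; the unrounded coefficients give a slightly higher value, about 33.7. Either way, the result falls inside the range of 32 to 37 estimated by hand.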
Lesson Summary
In this lesson, you learned how to represent data by graphing a straight line of the
form y=mx+b, and also by using a scatter plot and a line of best fit. Interpreting a
broken-line graph was also presented in this lesson. You learned about correlation as it
applies to a scatter plot and how to describe the correlation of a scatter plot. You also
learned how to draw a line of best fit on a scatter plot and to use this line to make
estimates from the graph. The final topic that was demonstrated in the lesson was how
to use the TI-83 calculator to produce a scatter plot and how to graph a line of best fit by
using linear regression.
Points to Consider

Can any of these graphs be used for comparing data?

Can the equation for the line of best fit be used to calculate values?

Is any other graphical representation of data used for estimations?

Learning Objectives

Construct and interpret a box-and-whisker plot.

Use technology to create box-and-whisker plots.

Box-and-Whisker Plots

In traditional statistics, data is organized by using a frequency distribution. The results of
the frequency distribution can then be used to create various graphs, such as a
histogram or a frequency polygon, which indicate the shape or nature of the distribution.
The shape of the distribution will allow you to confirm various conjectures about the
nature of the data.
To examine data in order to identify patterns, trends, or relationships, exploratory data
analysis is used. In exploratory data analysis, organized data is displayed in order to
make decisions or suggestions regarding further actions. A box-and-whisker
plot (often called a box plot) can be used to graphically represent the data set, and the
graph involves plotting 5 specific values. The 5 specific values are often referred to as
a five-number summary of the organized data set. The five-number summary consists
of the following:
1. The lowest number in the data set (minimum value)
2. The lower quartile, Q1 (the median of the first half of the data set)
3. The median of the entire data set (median)
4. The upper quartile, Q3 (the median of the second half of the data set)
5. The highest number in the data set (maximum value)

The display of the five-number summary produces a box-and-whisker plot as shown
below:

The above model of a box-and-whisker plot shows 2 horizontal lines (the whiskers) that
each contain 25% of the data and are of the same length. In addition, it shows that the
median of the data set is in the middle of the box, which contains 50% of the data. The
lengths of the whiskers and the location of the median with respect to the center of the
box are used to describe the distribution of the data. It's important to note that this is just
an example. Not all box-and-whisker plots have the median in the middle of the box and
whiskers of the same size.
Information about the data set that can be determined from the box-and-whisker plot
with respect to the location of the median includes the following:
a) If the median is located in the center or near the center of the box, the distribution is
approximately symmetric.
b) If the median is located to the left of the center of the box, the distribution is positively
skewed.
c) If the median is located to the right of the center of the box, the distribution is
negatively skewed.
Information about the data set that can be determined from the box-and-whisker plot
with respect to the length of the whiskers includes the following:
a) If the whiskers are the same or almost the same length, the distribution is
approximately symmetric.
b) If the right whisker is longer than the left whisker, the distribution is positively skewed.
c) If the left whisker is longer than the right whisker, the distribution is negatively
skewed.
The length of the whiskers also gives you information about how spread out the data is.
A box-and-whisker plot is often used when the number of data values is large. The
center of the distribution, the nature of the distribution, and the range of the data are
very obvious from the graph. The five-number summary divides the data into quarters
by use of the medians of the upper and lower halves of the data. Remember that, unlike
the mean, the median of the entire data set is not affected by outliers, so it is the
measure of central tendency that is most often used in exploratory data analysis.

Example 24
For the following data sets, determine the five-number summaries:
a) 12, 16, 36, 10, 31, 23, 58
b) 144, 240, 153, 629, 540, 300
Solution:
a) The first step is to organize the values in the data set as shown below:

12, 16, 36, 10, 31, 23, 58 → 10, 12, 16, 23, 31, 36, 58
Now complete the following list:
Minimum value 10
Q1 12
Median 23
Q3 36
Maximum value 58
b) The first step is to organize the values in the data set as shown below:

144, 240, 153, 629, 540, 300 → 144, 153, 240, 300, 540, 629
Now complete the following list:
Minimum value 144
Q1 153
Median 270
Q3 540
Maximum value 629
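The steps used in Example 24 can be sketched in Python as follows. This is one possible implementation of the method described in this lesson, in which Q1 and Q3 are the medians of the lower and upper halves and, for an odd number of values, the overall median belongs to neither half. The function names are hypothetical, other software may use a different quartile convention, and the sketch assumes at least 2 data values.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least 2 values."""
    data = sorted(values)          # step 1: organize the data
    half = len(data) // 2
    lower = data[:half]            # first half (overall median excluded if n is odd)
    upper = data[-half:]           # second half
    return data[0], median(lower), median(data), median(upper), data[-1]


print(five_number_summary([12, 16, 36, 10, 31, 23, 58]))
print(five_number_summary([144, 240, 153, 629, 540, 300]))
```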
Example 25

Use the data set for Example 24 part a) and the five-number summary to construct a
box-and-whisker plot to model the data set.
Solution:
The five-number summary can now be used to construct a box-and-whisker plot. Be
sure to provide a scale on the number line that includes the range from the minimum
value to the maximum value.
a) Minimum value 10
Q1 12
Median 23
Q3 36
Maximum value 58

It is clearly visible that the right whisker is much longer than the left whisker. This indicates
that the distribution is positively skewed.
Example 26
For each box-and-whisker plot, list the five-number summary and describe the
distribution based on the location of the median.

Solution:
a) Minimum value 4
Q1 6
Median 9
Q3 10
Maximum value 12
The median of the data set is located to the right of the center of the box, which
indicates that the distribution is negatively skewed.
b) Minimum value 225
Q1 250
Median 300
Q3 325
Maximum value 350
The median of the data set is located to the right of the center of the box, which
indicates that the distribution is negatively skewed.
c) Minimum value 60
Q1 70
Median 75
Q3 95
Maximum value 100
The median of the data set is located to the left of the center of the box, which indicates
that the distribution is positively skewed.
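The rule applied in Example 26 compares the median with the center of the box, that is, the midpoint of Q1 and Q3. A minimal Python sketch of that comparison follows; the function name skew_from_median is hypothetical, and the sketch tests for an exact match with the center rather than "near the center", so in practice some tolerance would be needed before calling a distribution symmetric.

```python
def skew_from_median(q1, med, q3):
    """Classify a distribution by where the median sits inside the box."""
    center = (q1 + q3) / 2
    if med < center:
        return "positively skewed"      # median to the left of the center
    if med > center:
        return "negatively skewed"      # median to the right of the center
    return "approximately symmetric"


# Five-number summaries from Example 26: (min, Q1, median, Q3, max).
summaries = {
    "a": (4, 6, 9, 10, 12),
    "b": (225, 250, 300, 325, 350),
    "c": (60, 70, 75, 95, 100),
}

for label, (_, q1, med, q3, _) in summaries.items():
    print(label, skew_from_median(q1, med, q3))
```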
Example 27
The numbers of square feet (in 100s) of 10 of the largest museums in the world are
shown below:
650, 547, 204, 213, 343, 288, 222, 250, 287, 269
Construct a box-and-whisker plot for the above data set and describe the distribution.
Solution:
The first step is to organize the data values as follows:

20,400  21,300  22,200  25,000  26,900  28,700  28,800  34,300  54,700  65,000
Now calculate the median, Q1, and Q3.
Median = (26,900 + 28,700) / 2 = 55,600 / 2 = 27,800
Q1 = 22,200
Q3 = 34,300
Next, complete the following list:
Minimum value 20,400
Q1 22,200
Median 27,800
Q3 34,300
Maximum value 65,000

The right whisker is longer than the left whisker, which indicates that the distribution is
positively skewed.
The TI-83 or TI-84 can also be used to create a box-and-whisker plot. In the following
examples, the TI-83 is used. In the next chapter, key strokes using the TI-84 will be
presented to you. The five-number summary values can be determined by using the
TRACE feature of the calculator or by using CALC and 1-Var Stats.
Example 28
The following numbers represent the number of siblings in each family for 15 randomly
selected students:

4, 1, 2, 2, 5, 3, 4, 2, 6, 4, 6, 1, 7, 8, 4
Use technology to construct a box-and-whisker plot to display the data. List the
five-number summary values.
Solution:

Note that when creating a box-and-whisker plot with a TI calculator, you don't have to
actually sort the data. The calculator will sort the data automatically when creating the
box-and-whisker plot.
The five-number summary can be obtained from the calculator in 2 ways.
1. The following results are obtained by simply using the TRACE feature and the left
and right arrows:

The values at the bottom of each screen are the five-number summary.
2. The second method involves pressing STAT and using 1-Var Stats on the CALC
menu for L1:

Many data sets contain values that are either extremely high values or extremely low
values compared to the rest of the data values. These values are called outliers. There
are several reasons why a data set may contain an outlier. Some of these are listed
below:

1. The value may be the result of an error made in measurement or in observation.
   The researcher may have measured the variable incorrectly.
2. The value may simply be an error made by the researcher in recording the value.
   The value may have been written or typed incorrectly.
3. The value could be a result obtained from a subject not within the defined
   population. A researcher recording marks from a math 12 examination may have
   recorded a mark by a student in grade 11 who was taking math 12.
4. The value could be one that is legitimate but is extreme compared to the other
   values in the data set. (This rarely occurs, but it is a possibility.)

If an outlier is present because of an error in measurement, observation, or recording,
then either the error should be corrected, or the outlier should be omitted from the data
set. If the outlier is a legitimate value, then the statistician must make a decision as to
whether or not to include it in the set of data values. There is no rule that tells you what
to do with an outlier in this case.
One method for checking a data set for the presence of an outlier is to follow the
procedure below:
1. Organize the given data set and determine the values of Q1 and Q3.
2. Calculate the difference between Q1 and Q3. This difference is called
   the interquartile range (IQR): IQR = Q3 - Q1.
3. Multiply the difference by 1.5, subtract this result from Q1, and add it to Q3.
4. The results from Step 3 will be the range into which all values of the data set
   should fit. Any values that are below or above this range are considered outliers.

Example 29
Using the procedure outlined above, check the following data sets for outliers:
a) 18, 20, 24, 21, 5, 23, 19, 22
b) 13, 15, 19, 14, 26, 17, 12, 42, 18
Solution:
a) Organize the given data set as follows:

18, 20, 24, 21, 5, 23, 19, 22 → 5, 18, 19, 20, 21, 22, 23, 24
Determine the values for Q1 and Q3.
5, 18, 19, 20, 21, 22, 23, 24
Q1 = (18 + 19) / 2 = 37 / 2 = 18.5
Q3 = (22 + 23) / 2 = 45 / 2 = 22.5
Calculate the difference between Q1 and Q3: Q3 - Q1 = 22.5 - 18.5 = 4.0.
Multiply this difference by 1.5: (4.0)(1.5) = 6.0.
Finally, compute the range.
Q1 - 6.0 = 18.5 - 6.0 = 12.5
Q3 + 6.0 = 22.5 + 6.0 = 28.5
Are there any data values below 12.5? Yes, the value of 5 is below 12.5 and is,
therefore, an outlier.
Are there any values above 28.5? No, there are no values above 28.5.
b) Organize the given data set as follows:

13, 15, 19, 14, 26, 17, 12, 42, 18 → 12, 13, 14, 15, 17, 18, 19, 26, 42
Determine the values for Q1 and Q3.
12, 13, 14, 15, 17, 18, 19, 26, 42
Q1 = (13 + 14) / 2 = 27 / 2 = 13.5
Q3 = (19 + 26) / 2 = 45 / 2 = 22.5
Calculate the difference between Q1 and Q3: Q3 - Q1 = 22.5 - 13.5 = 9.0.
Multiply this difference by 1.5: (9.0)(1.5) = 13.5.
Finally, compute the range.
Q1 - 13.5 = 13.5 - 13.5 = 0
Q3 + 13.5 = 22.5 + 13.5 = 36.0
Are there any data values below 0? No, there are no values below 0.
Are there any values above 36.0? Yes, the value of 42 is above 36.0 and is, therefore,
an outlier.
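The four-step procedure used in Example 29 can be sketched in Python as shown below. The sketch assumes the quartile method used in this lesson (medians of the lower and upper halves, with the overall median excluded when the number of values is odd) and at least 2 data values; the function names median and find_outliers are hypothetical.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def find_outliers(values):
    """Return (lower limit, upper limit, outliers) using the 1.5 * IQR rule."""
    data = sorted(values)                  # Step 1: organize, then find Q1 and Q3
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[-half:])
    iqr = q3 - q1                          # Step 2: interquartile range
    step = 1.5 * iqr                       # Step 3: 1.5 times the IQR
    low, high = q1 - step, q3 + step
    outliers = [v for v in data if v < low or v > high]   # Step 4
    return low, high, outliers


print(find_outliers([18, 20, 24, 21, 5, 23, 19, 22]))
print(find_outliers([13, 15, 19, 14, 26, 17, 12, 42, 18]))
```

Values exactly equal to either limit are not flagged, which matches the wording "below or above this range" in Step 4.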
Lesson Summary
You have learned the significance of the median as it applies to dividing a set of data
values into quartiles. You have also learned how to apply these values to the
five-number summary needed to construct a box-and-whisker plot. In addition, you have
learned how to construct a box-and-whisker plot and how to obtain the five-number
summary by using technology. The last topic that you learned about in this lesson was
the meaning of the term outlier. Some reasons why an outlier might exist in a data set
and the procedure for determining whether or not a data set contains an outlier were
also discussed.
Points to Consider

Are there still other ways to represent data graphically?

Are there other uses for a box-and-whisker plot?

Can box-and-whisker plots be used for comparing data sets?

Vocabulary
Bar graph
A plot made of bars whose heights (vertical bars) or lengths (horizontal bars)
represent the frequencies of each category.
Bins
Quantitative or qualitative categories. Bins are also known as classes.
Box-and-whisker plot
A graph of a data set in which the five-number summary is plotted. 50 percent of
the data values are in the box, and the remaining 50 percent are divided equally
on the whiskers.
Broken-line graph
A graph that is used when it is necessary to show change over time. A line is
used to join the values, but the line has no defined slope.
Continuous data
Data for which the plotted points can be joined.
Continuous variable
A variable that can assume all values between 2 consecutive values of a data
set.
Correlation
A statistical method used to determine whether or not there is a linear
relationship between 2 variables.
Data set
A collection of observations of a variable.

Dependent variable
The variable represented by the values that are plotted on the y-axis.
Discrete data
Data for which the plotted points cannot be joined.
Discrete variable
A variable that can only assume values that can be counted.
Five-number summary
5 values for a data set that include the smallest value, the lower quartile, the
median, the upper quartile, and the largest value.
Frequency distribution
A table that lists all of the classes and the number of data values that belong to
each of the classes.
Frequency polygon
A graph that uses lines to join the midpoints of the tops of the bars of a histogram
or to join the midpoints of the classes.
Histogram
A graph in which the classes, or bins, are on the horizontal axis and the
frequencies are plotted on the vertical axis. The frequencies are represented by
vertical bars that are drawn adjacent to each other.
Independent variable
The variable represented by the values that are plotted on the x-axis.
Interquartile range (IQR)
The difference between the third quartile and the first quartile.
Left-skewed distribution
A distribution in which most of the data values are located to the right of the
mean.
Line of best fit

A straight line drawn on a scatter plot such that the sums of the distances to
points on either side of the line are approximately equal and such that there are
an equal number of points above and below the line.
Midpoint
The value obtained by adding the lower and upper limits of a class and dividing
the sum by 2.
Outliers
Extremely high values or extremely low values compared to the rest of the data
values.
Pie chart
A circle that is divided into sections (slices) according to the percentage of the
frequencies in each class.
Qualitative variable
A variable that can be placed into specific categories according to some defined
characteristic.
Quantitative variable
A variable that is numerical in nature and that can be ordered.
Right-skewed distribution
A distribution in which most of the data values are located to the left of the mean.
Scatter plot
A graph used to investigate whether or not there is a relationship between 2 sets
of data. The data is plotted on a graph such that one quantity is plotted on the
x-axis and one quantity is plotted on the y-axis.
Stem-and-leaf plot
A method of organizing data that includes sorting the data and graphing it at the
same time. This type of graph uses the stem as the leading part of the data value
and the leaf as the remaining part of the value.

Symmetric histogram
A histogram for which the values of the mean, median, and mode are all the
same and are all located at the center of the distribution.
Variable
A characteristic that is being studied.
