
Chapter 3

Graphical and Numerical Summaries of Categorical Data
UNIT OBJECTIVES
At the conclusion of this unit you should be able to:
1) Construct graphs that appropriately describe data.
2) Calculate and interpret numerical summaries of a data set.
3) Combine numerical methods with graphical methods to analyze a data set.
Displaying Qualitative Data
“Sometimes you can see a lot just by looking.”
Yogi Berra, Hall of Fame Catcher, NY Yankees
The three rules of data analysis won’t be difficult to remember:
1. Make a picture: it reveals aspects not obvious in the raw data and enables you to think clearly about the patterns and relationships that may be hiding in your data.
2. Make a picture: it shows important features of and patterns in the data. You may also see things that you did not expect, such as extraordinary (possibly wrong) data values or unexpected patterns.
3. Make a picture: the best way to tell others about your data is with a well-chosen picture.
Bar Charts: show counts or relative frequencies for each category
• Example: Titanic passenger/crew distribution
[Bar chart: Titanic Passengers by Class. Bar heights (counts): Crew 885, First 325, Second 285, Third 706.]
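The relative frequencies that a bar chart can show in place of counts follow directly from the counts. A minimal Python sketch, using the four counts from the Titanic chart (the variable names are illustrative and not part of the original slides):

```python
# Counts read off the Titanic bar chart.
counts = {"Crew": 885, "First": 325, "Second": 285, "Third": 706}

total = sum(counts.values())

# Relative frequency = category count / total count.
rel_freq = {category: n / total for category, n in counts.items()}

for category, share in rel_freq.items():
    print(f"{category:<8}{counts[category]:>5}{share:>8.1%}")
```

Plotting `rel_freq` instead of `counts` changes only the scale of the vertical axis; the shape of the bar chart stays the same.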
Pie Charts: show proportions of the whole in each category
• Example: Titanic passenger/crew distribution
[Pie chart: Titanic Passengers by Class. Crew 40%, First 15%, Second 13%, Third 32%.]
Example: Top 10 causes of death in the United States, 2001

Rank  Cause of death        Count     % of top 10  % of total deaths
1     Heart disease         700,142   37%          28%
2     Cancer                553,768   29%          22%
3     Cerebrovascular       163,538   9%           6%
4     Chronic respiratory   123,013   6%           5%
5     Accidents             101,537   5%           4%
6     Diabetes mellitus     71,372    4%           3%
7     Flu and pneumonia     62,034    3%           2%
8     Alzheimer’s disease   53,852    3%           2%
9     Kidney disorders      39,480    2%           2%
10    Septicemia            32,238    2%           1%

      All other causes      629,967                25%

For each individual who died in the United States in 2001, the cause of death was recorded. The table above is a summary of that information.
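The two percentage columns use different denominators: deaths from the top 10 causes only, and deaths from all causes. The sketch below recomputes both, assuming that total deaths equal the top 10 counts plus the "All other causes" row (the helper `pct` is a hypothetical name, not from the slides):

```python
# Counts from the table of causes of death, United States, 2001.
top10 = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}
other = 629_967

top10_total = sum(top10.values())
all_deaths = top10_total + other  # assumption: top 10 + all other causes


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


# One row per cause: (cause, count, % of top 10, % of total deaths).
rows = [(cause, n, pct(n, top10_total), pct(n, all_deaths))
        for cause, n in top10.items()]

for cause, n, of_top10, of_all in rows:
    print(f"{cause:<20}{n:>9,}{of_top10:>5}%{of_all:>5}%")
```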
Top 10 causes of death: bar graph
Each category is represented by one bar. The bar’s height shows the count (or
sometimes the percentage) for that particular category.

[Figure: Top 10 causes of death in the United States, 2001. Two bar graphs; vertical axis: Counts (x1000), from 0 to 800.]
Reading a bar: the number of individuals who died of an accident in 2001 is approximately 100,000.
Bar graph sorted by rank: easy to analyze.
Bar graph sorted alphabetically: much less useful.
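The difference between the two orderings can be tried out with a rough text version of the bar graph. This is only a sketch: the function `text_bars` and the scale of one `#` per 25,000 deaths are illustrative choices, not part of the slides.

```python
# Counts for the top 10 causes of death, United States, 2001.
counts = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}

# Sorted by rank: largest count first.
by_rank = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Sorted alphabetically by cause name.
alphabetical = sorted(counts.items())


def text_bars(rows, per_mark=25_000):
    """Return one text bar per (cause, count) row; each '#' is about per_mark deaths."""
    return [f"{cause:<20} {'#' * round(n / per_mark)}" for cause, n in rows]


print("\n".join(text_bars(by_rank)))
print()
print("\n".join(text_bars(alphabetical)))
```

In the first printout the bars shrink steadily from top to bottom, so the ranking is visible at a glance; in the second the reader has to compare bar lengths one by one.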
Top 10 causes of death: pie chart
Each slice represents a piece of one whole. The size of a slice depends on what
percent of the whole this category represents.

[Figure: two pie charts of causes of death in the United States in 2001. One shows the percent of deaths from the top 10 causes; the other shows the percent of deaths from all causes.]
Make sure your labels match the data.
Make sure all percents add up to 100.
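The "add up to 100" rule can be checked before drawing a pie chart. The sketch below uses the rounded percentages of total deaths from the table; the function name `is_whole` and the one-point tolerance for rounding are assumptions made for illustration.

```python
# Rounded "% of total deaths" for the top 10 causes, from the table.
pct_of_total = {
    "Heart disease": 28,
    "Cancer": 22,
    "Cerebrovascular": 6,
    "Chronic respiratory": 5,
    "Accidents": 4,
    "Diabetes mellitus": 3,
    "Flu and pneumonia": 2,
    "Alzheimer's disease": 2,
    "Kidney disorders": 2,
    "Septicemia": 1,
}


def is_whole(slices, tolerance=1.0):
    """True if the slice percentages sum to 100, allowing for rounding error."""
    return abs(sum(slices.values()) - 100) <= tolerance


# The top 10 alone do not make a whole, so they cannot form a pie chart
# labeled "percent of deaths from all causes".
print(sum(pct_of_total.values()), is_whole(pct_of_total))

# Adding the "All other causes" slice completes the whole.
with_other = {**pct_of_total, "All other causes": 25}
print(sum(with_other.values()), is_whole(with_other))
```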
Child poverty before and after government intervention (UNICEF, 1996)

What does this chart tell you?

• The United States has the highest rate of child poverty among developed nations (22% of those under 18).
• Its government does the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).

Could you transform this bar graph to fit in one pie chart? In two pie charts? Why?

The poverty line is defined as 50% of national median income.
Contingency Tables: Categories for Two Variables
• Example: Survival and class on the Titanic, with marginal distributions

        Crew      First     Second    Third     Total   Marg. dist. of survival
Alive   212       202       118       178       710     710/2201 = 32.3%
Dead    673       123       167       528       1491    1491/2201 = 67.7%
Total   885       325       285       706       2201

Marg. dist. of class:
        885/2201  325/2201  285/2201  706/2201
        40.2%     14.8%     12.9%     32.1%
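The marginal distributions are the row and column totals divided by the grand total. A minimal sketch of that calculation, with the table stored as a nested dictionary (the layout and names are illustrative assumptions):

```python
# Survival by class on the Titanic (counts from the contingency table).
table = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

classes = list(table["Alive"])

# Row totals give the margin for survival; column totals give the margin for class.
row_totals = {status: sum(row.values()) for status, row in table.items()}
col_totals = {c: sum(row[c] for row in table.values()) for c in classes}
grand_total = sum(row_totals.values())

marginal_survival = {s: t / grand_total for s, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

for status, share in marginal_survival.items():
    print(f"{status:<7}{share:.1%}")
for c, share in marginal_class.items():
    print(f"{c:<7}{share:.1%}")
```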
[Figure: marginal distribution of class, bar chart.]
[Figure: marginal distribution of class, pie chart.]
Contingency Tables: Categories for Two Variables (cont.)
• Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

                                Class
Survival             Crew    First   Second  Third   Total
Alive   Count        212     202     118     178     710
        % of col.    24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count        673     123     167     528     1491
        % of col.    76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count        885     325     285     706     2201
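Each "% of col." entry is a cell count divided by its column total, so every column becomes its own distribution of survival. A short sketch under the same assumed dictionary layout (names are illustrative):

```python
# Survival by class on the Titanic (counts from the contingency table).
table = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

classes = list(table["Alive"])
col_totals = {c: sum(row[c] for row in table.values()) for c in classes}

# Conditional distribution of survival given class:
# divide each cell by the total of its own column.
conditional = {
    c: {status: table[status][c] / col_totals[c] for status in table}
    for c in classes
}

for c in classes:
    alive = conditional[c]["Alive"]
    dead = conditional[c]["Dead"]
    print(f"{c:<7} alive {alive:.1%}  dead {dead:.1%}")
```

Because the class sizes differ so much, comparing these column percentages is more informative than comparing the raw counts of survivors.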
[Figure: conditional distributions, segmented bar chart.]
Contingency Tables: Categories for Two Variables (cont.)
Questions:
• What fraction of survivors were in first class? 202/710
• What fraction of passengers were in first class and survivors? 202/2201
• What fraction of the first class passengers survived? 202/325
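All three questions share the numerator 202 (first-class survivors); only the denominator changes with the group being described. A small sketch that makes the three denominators explicit (the labels are informal paraphrases of the questions):

```python
# Counts from the Titanic contingency table.
first_and_alive = 202   # first-class passengers who survived
survivors = 710         # everyone who survived
first_class = 325       # everyone in first class
everyone = 2201         # all passengers and crew

# (numerator, denominator) for each question.
questions = {
    "first class, among survivors": (first_and_alive, survivors),
    "first class and survived, among everyone": (first_and_alive, everyone),
    "survived, among first class": (first_and_alive, first_class),
}

shares = {label: num / den for label, (num, den) in questions.items()}

for label, (num, den) in questions.items():
    print(f"{label}: {num}/{den} = {shares[label]:.1%}")
```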
3-Way Tables
• Example: Georgia death-sentence data

                                 Race of Defendant
                         Black                   White
                     Race of Victim          Race of Victim
                     Black     White         Black     White     Totals
Death Sentence  Yes  18        50            2         58        128
                No   1420      178           62        687       2347
Totals               1438      228           64        745       2475
% Death Sentence     1.2       21.9          3.1       7.8
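The bottom row gives the death-sentence rate within each combination of defendant and victim race. The sketch below recomputes those rates and also collapses the table over race of victim; that collapsed comparison is not shown on the slide and is included here only as an illustration of how a third variable can change the picture.

```python
# counts[defendant][victim] = (death sentences, total cases), from the 3-way table.
counts = {
    "Black": {"Black": (18, 1438), "White": (50, 228)},
    "White": {"Black": (2, 64), "White": (58, 745)},
}


def rate(sentenced, total):
    """Death-sentence rate as a percentage."""
    return 100 * sentenced / total


# Rates within each defendant/victim combination (the slide's bottom row).
by_victim = {
    defendant: {victim: rate(*cell) for victim, cell in victims.items()}
    for defendant, victims in counts.items()
}

# Rates by race of defendant only, ignoring race of victim.
overall = {
    defendant: rate(sum(s for s, _ in victims.values()),
                    sum(t for _, t in victims.values()))
    for defendant, victims in counts.items()
}

for defendant in counts:
    for victim in counts[defendant]:
        print(f"{defendant} defendant, {victim} victim: "
              f"{by_victim[defendant][victim]:.2f}%")
    print(f"{defendant} defendant, all victims: {overall[defendant]:.2f}%")
```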
UC Berkeley Lawsuit

                    MEN     WOMEN
No. of applicants   2691    1835
Admitted            1199    557
% admitted          44.6    30.4
LAWSUIT (cont.)

                 MEN                        WOMEN
MAJOR   No. of       No.            No. of       No.
        Applicants   Admitted       Applicants   Admitted
A       825          512 (62%)      108          *89 (82%)
B       560          353 (63%)      25           *17 (68%)
C       325          120 (37%)      593          202 (34%)
D       417          138 (33%)      375          *131 (35%)
E       191          53 (28%)       393          94 (24%)
F       373          23 (6%)        341          *24 (7%)
TOTAL   2691         1199           1835         557

* Majors in which the percentage of women admitted is higher than the percentage of men admitted.
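The per-major and overall comparisons can be recomputed from the counts in the table. A minimal sketch, assuming the six majors account for all applicants as the TOTAL row indicates (the data layout and names are illustrative):

```python
# admissions[major] = ((men applicants, men admitted), (women applicants, women admitted))
admissions = {
    "A": ((825, 512), (108, 89)),
    "B": ((560, 353), (25, 17)),
    "C": ((325, 120), (593, 202)),
    "D": ((417, 138), (375, 131)),
    "E": ((191, 53), (393, 94)),
    "F": ((373, 23), (341, 24)),
}


def pct(admitted, applicants):
    """Admission rate as a percentage."""
    return 100 * admitted / applicants


# Majors in which women were admitted at a higher rate than men.
women_ahead = [
    major
    for major, ((m_app, m_adm), (w_app, w_adm)) in admissions.items()
    if pct(w_adm, w_app) > pct(m_adm, m_app)
]

# Overall rates after combining all six majors.
men_overall = pct(sum(m[1] for m, _ in admissions.values()),
                  sum(m[0] for m, _ in admissions.values()))
women_overall = pct(sum(w[1] for _, w in admissions.values()),
                    sum(w[0] for _, w in admissions.values()))

print("Women ahead in majors:", women_ahead)
print(f"Overall: men {men_overall:.1f}%, women {women_overall:.1f}%")
```

Women have the higher rate in four of the six majors, yet the combined rate favors men, because most men applied to the majors with high admission rates (A and B) while most women applied to majors with low ones.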
Simpson’s Paradox
• The reversal of the direction of a comparison or association when data from several groups are combined to form a single group.
Fly Alaska Airlines, the on-time airline!

               Alaska Airlines           America West
Destination    % Arrivals   No. of       % Arrivals   No. of
               On Time      Arrivals     On Time      Arrivals
L. A.          88.9%        559          85.6%        811
Phoenix        94.8%        233          92.1%        5,255
San Diego      91.4%        232          85.5%        448
San Fran.      83.1%        605          71.3%        449
Seattle        85.8%        2,146        76.7%        262
Total                       3,775                     7,225
America West Wins!
You’re a Hero!

               Alaska Airlines           America West
Destination    % Arrivals   No. of       % Arrivals   No. of
               On Time      Arrivals     On Time      Arrivals
L. A.          88.9%        559          85.6%        811
Phoenix        94.8%        233          92.1%        5,255
San Diego      91.4%        232          85.5%        448
San Fran.      83.1%        605          71.3%        449
Seattle        85.8%        2,146        76.7%        262
Total          86.7%        3,775        89.1%        7,225
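The Total row is an arrival-weighted average of the city percentages, which is why it can reverse the city-by-city comparison. The sketch below recomputes it from the table; since the slides give only rounded percentages and not on-time counts, the weighted average is an approximation, and the names `flights` and `overall` are illustrative.

```python
# flights[city] = ((Alaska % on time, Alaska arrivals),
#                  (other airline % on time, other airline arrivals))
flights = {
    "L. A.": ((88.9, 559), (85.6, 811)),
    "Phoenix": ((94.8, 233), (92.1, 5255)),
    "San Diego": ((91.4, 232), (85.5, 448)),
    "San Fran.": ((83.1, 605), (71.3, 449)),
    "Seattle": ((85.8, 2146), (76.7, 262)),
}


def overall(rows):
    """Arrival-weighted average of per-city on-time percentages."""
    total_arrivals = sum(n for _, n in rows)
    return sum(p * n for p, n in rows) / total_arrivals


alaska = [a for a, _ in flights.values()]
other = [w for _, w in flights.values()]

# Alaska has the higher on-time percentage in every single city...
alaska_better_everywhere = all(a[0] > w[0] for a, w in flights.values())

# ...but the lower percentage once the cities are combined.
print(alaska_better_everywhere)
print(f"Alaska overall: {overall(alaska):.1f}%")
print(f"Other airline overall: {overall(other):.1f}%")
```

The reversal comes from the weights: most of Alaska's arrivals are in Seattle, where on-time rates are low for both airlines, while most of the other airline's arrivals are in Phoenix, where they are high.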
End of Chapter 3
