
DISPLAYING AND DESCRIBING CATEGORICAL DATA

Chapter 3

The goal of statistics is to uncover the story told by a data set

What's the first thing you should always do with data?

Make a picture! Make a picture! Make a picture!

Titanic Data (p. 15)


What did you find out from your reading about the Titanic data? (Especially once the data were organized in a table?)

Early Main Ideas

Frequency Table: Records totals and categories.

Relative Frequency Table: Gives percents for each category.

Distribution: Names categories and tells how frequently each occurs.

Area Principle: Area on graph should correspond to magnitude.

Bar Chart: Displays distribution of a categorical variable, showing counts for each category next to each other for easy comparison.

Pie Chart: Shows whole group of cases as a circle, cut into slices where each slice is proportional to the fraction of the whole in each category.

Comparing Two Categorical Variables

Often, we want to investigate whether there's a relationship between two categorical variables. For example, our authors want to determine whether there's a relationship between the kind of ticket a passenger held and the passenger's chance of survival. One method for investigating this relationship is to use a two-way table called a contingency table.

Class vs. Survival

                         Class
Survival    First   Second   Third    Crew   Total
Alive         203      118     178     212     711
Dead          122      167     528     673    1490
Total         325      285     706     885    2201

The frequency distribution of one of the variables is called its marginal distribution.
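As a rough illustration, the marginal distributions can be read off the contingency table by summing across rows and down columns. The sketch below uses the Titanic counts from the table; the variable names and the as_percents helper are made up for this example.

```python
# Counts from the Class vs. Survival contingency table.
counts = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}
classes = ["First", "Second", "Third", "Crew"]

# Table total: everyone on board.
grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of Survival: the row totals.
survival_margin = {status: sum(row.values()) for status, row in counts.items()}

# Marginal distribution of Class: the column totals.
class_margin = {c: sum(counts[s][c] for s in counts) for c in classes}

def as_percents(totals, whole):
    """Convert a dict of counts to percentages of `whole`, rounded to 0.1."""
    return {k: round(100 * v / whole, 1) for k, v in totals.items()}

print(survival_margin, as_percents(survival_margin, grand_total))
print(class_margin, as_percents(class_margin, grand_total))
```

The totals in the margins of the table (711 and 1490 for survival; 325, 285, 706, and 885 for class) are exactly these sums.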

                                      Class
Survival                  First   Second    Third     Crew    Total
Alive    Count              203      118      178      212      711
         % of Row         28.6%    16.6%    25.0%    29.8%     100%
         % of Column      62.5%    41.4%    25.2%    24.0%    32.3%
         % of Table        9.2%     5.4%     8.1%     9.6%    32.3%
Dead     Count              122      167      528      673     1490
         % of Row          8.2%    11.2%    35.4%    45.2%     100%
         % of Column      37.5%    58.6%    74.8%    76.0%    67.7%
         % of Table        5.6%     7.6%    24.0%    30.6%    67.7%
Total    Count              325      285      706      885     2201
         % of Row         14.8%    12.9%    32.1%    40.2%     100%
         % of Column       100%     100%     100%     100%     100%
         % of Table       14.8%    12.9%    32.1%    40.2%     100%

Conditional Distribution: Class by Survival


                                      Class
Survival                  First   Second    Third     Crew    Total
Alive    Count              203      118      178      212      711
         % of Column      62.5%    41.4%    25.2%    24.0%    32.3%
Dead     Count              122      167      528      673     1490
         % of Column      37.5%    58.6%    74.8%    76.0%    67.7%
Total    Count              325      285      706      885     2201
         % of Column       100%     100%     100%     100%     100%

Percent of what?

You must be careful when answering similar-sounding questions:

1. What percent of the survivors were in second class?
2. What percent were second-class passengers who survived?
3. What percent of the second-class passengers survived?

Always ask yourself: percent of what? That will help you to know the Who and whether to use row, column, or table percentages.
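The three questions share the same numerator (the 118 second-class survivors) and differ only in the denominator. A small sketch of that arithmetic, using counts from the Titanic table; the variable names and the percent helper are illustrative only.

```python
second_alive = 118   # second-class passengers who survived
all_alive = 711      # everyone who survived (row total)
all_second = 285     # everyone in second class (column total)
everyone = 2201      # everyone on board (table total)

def percent(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

# 1. Of the survivors, what percent were in second class? (row percent)
row_pct = percent(second_alive, all_alive)

# 2. Of everyone, what percent were second-class survivors? (table percent)
table_pct = percent(second_alive, everyone)

# 3. Of the second-class passengers, what percent survived? (column percent)
column_pct = percent(second_alive, all_second)

print(row_pct, table_pct, column_pct)
```

The results, 16.6%, 5.4%, and 41.4%, match the row, table, and column percentages shown for second-class survivors in the full table.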

Conditional Distributions

A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable. For example, we could look at the conditional distribution of ticket class, conditional on having survived:

                       Class
Survival    First   Second   Third    Crew   Total
Alive         203      118     178     212     711
            28.6%    16.6%   25.0%   29.8%    100%
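One way to think about a conditional distribution is "keep only one row, then convert it to percents of that row's total." The sketch below does this for the Titanic counts; conditional_on is a hypothetical helper name, not something from the text.

```python
# Counts from the Class vs. Survival contingency table.
counts = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

def conditional_on(status):
    """Distribution of class among only the people with this survival status."""
    row = counts[status]
    total = sum(row.values())  # the condition restricts the Who to this row
    return {cls: round(100 * n / total, 1) for cls, n in row.items()}

class_given_alive = conditional_on("Alive")
class_given_dead = conditional_on("Dead")

print("Alive:", class_given_alive)
print("Dead: ", class_given_dead)
```

Conditioning on "Alive" reproduces the percentages in the table above (28.6%, 16.6%, 25.0%, 29.8%); conditioning on "Dead" gives a visibly different distribution of class.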

Segmented Bar Charts

One way to compare the conditional distributions for survival by class is to look at a new type of bar chart called a segmented bar chart. A segmented bar chart treats each bar as the whole and divides it proportionally into segments corresponding to the percentage in each group.

Segmented Bar Chart for Class by Survival

[Figure: segmented bar chart with one bar for Alive and one for Dead, each divided into segments for First, Second, Third, and Crew; vertical axis from 0% to 100%.]

Does it appear that survival may have depended on class? Do you think there is an association between these variables?

Independence

In a contingency table, when the distribution of one variable is the same for all categories of another, we say that the variables are independent. We'll see a way to check for independence formally later in the book. For now, we'll just compare the distributions.
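As an informal comparison only (not the formal check mentioned above), one can compute the percent who survived within each class and set those figures beside the overall survival percent. The sketch below assumes the Titanic counts from the contingency table; the names are illustrative, and the "spread" is just a convenient summary, not a test statistic.

```python
# Counts copied from the Class vs. Survival table: (alive, dead) per class.
counts = {
    "First": (203, 122),
    "Second": (118, 167),
    "Third": (178, 528),
    "Crew": (212, 673),
}

def percent(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

# Conditional distribution of survival within each class (column percents).
survival_rate = {
    cls: percent(alive, alive + dead) for cls, (alive, dead) in counts.items()
}

# Overall survival rate, ignoring class (from the marginal distribution).
total_alive = sum(alive for alive, _ in counts.values())
total_people = sum(alive + dead for alive, dead in counts.values())
overall_rate = percent(total_alive, total_people)

# Under independence, every class rate would sit near the overall rate.
spread = round(max(survival_rate.values()) - min(survival_rate.values()), 1)

for cls, rate in survival_rate.items():
    print(f"{cls:<7}{rate:>6}% survived (overall {overall_rate}%)")
print(f"Spread between highest and lowest class rate: {spread} points")
```

Here the class rates range from 24.0% (crew) to 62.5% (first class) against an overall 32.3%, so the conditional distributions are clearly not the same across classes.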

Class Survey Data

Let's investigate our own data and see what questions we can answer.

What Can Go Wrong?

Don't violate the area principle.
Keep it honest.
Don't confuse similar-sounding percentages.
Don't forget to look at the variables separately, too.
Be sure to use enough individuals.
Don't overstate your case.
Don't use unfair or silly averages.

Consider this

It's the last inning of an important game. Your team is a run down with the bases loaded and two outs. The pitcher is due up, so you'll be sending in a pinch-hitter. There are 2 batters available on the bench. Whom should you send in to bat?

Player    Overall       Vs. LHP      Vs. RHP
A         33 for 103    28 for 81    5 for 22
B         45 for 151    12 for 32    33 for 119

What's going on here?

The baseball problem is an example of Simpson's paradox.

This occurs when someone uses unfair averaging over different groups.

The moral of Simpson's paradox is to be careful when you average across different levels of a second variable. It's always better to compare percentages or other averages within each level of the other variable. The overall average may be misleading.
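To see the paradox in the numbers, the sketch below computes each player's batting average within each pitcher type and then overall, using the hits and at-bats from the pinch-hitter table. The dictionary layout and helper names are assumptions made for this example.

```python
# (hits, at_bats) for each player, split by pitcher handedness.
records = {
    "A": {"LHP": (28, 81), "RHP": (5, 22)},
    "B": {"LHP": (12, 32), "RHP": (33, 119)},
}

def average(hits, at_bats):
    """Batting average rounded to three decimals."""
    return round(hits / at_bats, 3)

# Averages within each level of the second variable (pitcher type).
by_pitcher = {
    player: {hand: average(*line) for hand, line in splits.items()}
    for player, splits in records.items()
}

# Overall averages pool the at-bats across both pitcher types.
overall = {
    player: average(
        sum(h for h, _ in splits.values()),
        sum(ab for _, ab in splits.values()),
    )
    for player, splits in records.items()
}

for player in records:
    print(player, by_pitcher[player], "overall", overall[player])
```

Player B has the higher average against left-handers and against right-handers, yet Player A has the higher overall average. The reversal arises because A batted mostly against left-handers (81 of 103 at-bats), where both players hit better, while B batted mostly against right-handers (119 of 151).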

Homework

Pgs. 21-26: Choose 1 from numbers 1-4, then complete problems 5, 7, 12, 16, 22-24, 26, and 31.
Read the Investigative Task.
