Categorical Data
Summarizing categorical data
Two-way tables
Marginal distributions
Conditional distributions
Simpson's paradox
Summary of Categories
Count
Each category has a number of occurrences (frequency tables)
Percentages are useful (relative frequency tables)
Sex      Count  Percentage
Male       157       46.4%
Female     181       53.6%
Total      338        100%
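As a minimal sketch of how the relative frequency table above is obtained, the snippet below divides each count by the total. The variable names (`counts`, `percentages`) are illustrative choices, not part of the slides.

```python
# Counts taken from the Sex frequency table.
counts = {"Male": 157, "Female": 181}

# The total number of observations across all categories.
total = sum(counts.values())

# Relative frequency = category count / total, expressed as a percentage
# and rounded to one decimal place as in the table.
percentages = {sex: round(100 * n / total, 1) for sex, n in counts.items()}

for sex, n in counts.items():
    print(f"{sex:<8}{n:>5}{percentages[sex]:>8.1f}%")
print(f"{'Total':<8}{total:>5}{100.0:>8.1f}%")
```

The percentages are only as reliable as the total they are divided by, so it is worth checking that the counts add up to the stated total (338 here).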
Two Categories
What is the relationship between two categories?
Cross-classification table is a good summary (contingency tables or two-way tables)

               Freshman  Sophomore  Junior  Senior(+)  Row Totals
Male                 43         67      30         17         157
Female               48         62      25         46         181
Column Totals        91        129      55         63         338
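Column Percentages

               Freshman  Sophomore  Junior    Senior(+)  Row Totals
Male           43        67         30        17         157 (46%)
Female         48 (53%)  62 (48%)   25 (45%)  46 (73%)   181 (54%)
Column Totals  91        129        55        63         338

Row Percentages

               Freshman  Sophomore  Junior    Senior(+)  Row Totals
Male           43 (27%)  67         30        17         157
Female         48 (27%)  62         25        46         181
Column Totals  91 (27%)  129 (38%)  55 (16%)  63 (19%)   338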
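The column and row percentages can be recomputed from the cell counts. The following is a small sketch using plain Python dictionaries; the names `table`, `col_pct`, and `row_pct` are assumptions made for illustration.

```python
years = ["Freshman", "Sophomore", "Junior", "Senior(+)"]

# Cell counts from the two-way table (rows: sex, columns: year in school).
table = {
    "Male":   [43, 67, 30, 17],
    "Female": [48, 62, 25, 46],
}

# Column totals: number of students in each year, regardless of sex.
col_totals = [sum(col) for col in zip(*table.values())]

# Row totals: number of students of each sex, regardless of year.
row_totals = {sex: sum(row) for sex, row in table.items()}

# Column percentage = cell count / column total.
col_pct = {
    sex: [round(100 * n / t) for n, t in zip(row, col_totals)]
    for sex, row in table.items()
}

# Row percentage = cell count / row total.
row_pct = {
    sex: [round(100 * n / row_totals[sex]) for n in row]
    for sex, row in table.items()
}

for sex in table:
    print(sex, "column %:", dict(zip(years, col_pct[sex])))
    print(sex, "row %:   ", dict(zip(years, row_pct[sex])))
```

Column percentages answer "of the students in this year, what share is female?", while row percentages answer "of the female students, what share is in this year?". The two questions use the same cell count but a different denominator.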
Response variable
Predicted, effect, interesting variable
Bar Graph
Year in School
My Pet Peeves
Graphs should
Be clear
Allow comparisons
Tell a story
Graphs should not
Have uninformative aspects
Obscure
Arrrgh!
Arrrgh! Arrrgh!
Pie Charts
Or
More examples
Number of days read
Frequency: 44, 24, 18, 16, 20, 22, 26, 30
Total: 200
Current Investment Portfolio

Investment  Amount (in thousands $)  Percentage
Stocks                         46.5       42.27
Bonds                          32.0       29.09
CD                             15.5       14.09
Savings                        16.0       14.55
Total                           110         100

Pie chart (qualitative variables): Stocks 42%, Bonds 29%, Savings 15%, CD 14%
Percentages are rounded to the nearest percent
Must equal 100%
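The portfolio percentages can be reproduced with a short calculation. This is a sketch only; the dictionary names are illustrative and the amounts are the ones listed in the table.

```python
# Amounts in thousands of dollars, from the portfolio table.
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(amounts.values())

# Percentage of the portfolio, to two decimals (as in the table).
pct_table = {k: round(100 * v / total, 2) for k, v in amounts.items()}

# Percentage rounded to the nearest percent (as on the pie chart).
pct_chart = {k: round(100 * v / total) for k, v in amounts.items()}

for name in amounts:
    print(f"{name:<8}{amounts[name]:>6.1f}{pct_table[name]:>8.2f}{pct_chart[name]:>5d}%")
print("Sum of rounded percentages:", sum(pct_chart.values()))
```

Here the rounded slices happen to add up to exactly 100. In general, rounding each slice separately can leave the sum a point above or below 100, so it is worth checking before labeling a pie chart.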
Marginal distributions
We can look at each categorical variable separately in a two-way table by studying the row totals and the column totals.
They represent the marginal distributions, expressed in counts or percentages. (They are written as if in a margin.)
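As an illustration, the marginal distributions of the year-in-school table from the earlier slides can be computed from its row and column totals. This is a sketch under the assumption that the table is stored as a dictionary of rows; the names are hypothetical.

```python
years = ["Freshman", "Sophomore", "Junior", "Senior(+)"]
table = {
    "Male":   [43, 67, 30, 17],
    "Female": [48, 62, 25, 46],
}

grand_total = sum(sum(row) for row in table.values())

# Marginal distribution of sex: row totals as a share of the grand total.
sex_marginal = {
    sex: round(100 * sum(row) / grand_total) for sex, row in table.items()
}

# Marginal distribution of year: column totals as a share of the grand total.
year_totals = [sum(col) for col in zip(*table.values())]
year_marginal = {
    year: round(100 * t / grand_total) for year, t in zip(years, year_totals)
}

print("Sex: ", sex_marginal)
print("Year:", year_marginal)
```

Each marginal distribution ignores the other variable entirely, which is why it says nothing about the relationship between the two.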
Conditional distribution
Music and wine purchase decision
cell total / column total = 30 / 84 = 35.7%
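The conditional percentage on this slide is one cell divided by its column total. A minimal sketch follows; the function name `conditional_percentage` is a hypothetical helper, and only the two numbers shown on the slide are used.

```python
def conditional_percentage(cell_count, column_total):
    """Share of a column that falls in one cell, as a percentage."""
    return 100 * cell_count / column_total

# Values from the slide: a cell count of 30 in a column totaling 84.
pct = round(conditional_percentage(30, 84), 1)
print(f"{pct}%")
```

Repeating this for every cell in the same column gives the conditional distribution for that column, and those percentages add up to 100 (apart from rounding).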
Simpson's paradox
On the surface, Hospital B would seem to have a better record.