Lecture 15 Crosstabs 1
Sociology 5811
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Final Project Assignment Handed out
• Proposal due November 15
• Final Project due December 13
• Today’s class:
• New Topic: Crosstabulation
• Also called “crosstabs”
• Coming Soon: correlation, regression
Crosstabulation: Introduction
• T-Test and ANOVA look to see if groups differ
on a continuous dependent variable
• Groups are actually a nominal variable
• Example: Do different ethnic groups vary in wages?
• Difference in means for two groups indicates a
relationship between two variables
• Null hypothesis (means are the same) suggests that there is
no relationship between variables
• Alternate hypothesis (means differ) is equivalent to saying
that there is a relationship.
Crosstabulation: Introduction
• T-test and ANOVA determine whether there is a
statistical relationship between a nominal variable
and a continuous variable in your data
• But, we may be interested in two nominal variables
• Examples: Class and unemployment; gender and drug use
• Crosstabulation: used for nominal/ordinal variables
• Tools to descriptively examine variables
• Tools to identify whether there is a relationship between two
variables.
Crosstabulation: Introduction
• What is bivariate crosstabulation?
• Start with two nominal variables in a dataset:
• Example: gender (Male/Female) and political party
(Democrat, Republican)
• Crosstabulation is simply counting up the number
of people in each combined category
• How many Democratic women? Democratic men?
Republican women? Republican men?
• It is similar to computing frequencies
• But, for two variables jointly, rather than just one.
Crosstabulation: Introduction
• Example: Female = 1, Democrat = 1
ID   Gender   Political Party
1    0        1
2    1        0
3    1        1
4    0        0
5    0        0
6    1        0
7    1        1
8    0        1
Question: How many Republican women are in the dataset?
Answer: 2
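The counting step can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the names `records` and `joint_counts` are made up here, and the data are the eight rows from the slide, coded Female = 1 and Democrat = 1.

```python
from collections import Counter

# Records copied from the slide: (id, gender, party),
# coded Female = 1 and Democrat = 1.
records = [
    (1, 0, 1), (2, 1, 0), (3, 1, 1), (4, 0, 0),
    (5, 0, 0), (6, 1, 0), (7, 1, 1), (8, 0, 1),
]

def joint_counts(rows):
    """Count how many records fall in each (gender, party) combination."""
    return Counter((gender, party) for _id, gender, party in rows)

counts = joint_counts(records)

# Republican women are coded gender = 1 (female), party = 0 (not Democrat).
republican_women = counts[(1, 0)]
print(republican_women)
```

The same `counts` object holds all four combined categories, which is exactly what a 2 x 2 crosstab displays.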
Crosstabulation: Introduction
• Example: Dataset of 68 people
• Look and count up the number of people in each
combined category
• Or, determine frequency along the first variable:
• Frequency: 43 women, 25 men
• Then break out groups by the second variable
• Of 43 women, 27 = democrat, 16 = republican
• Of 25 men, 10 = democrat, 15 = republican.
Crosstabulation: Introduction
• Crosstab: a table that presents joint frequencies
• Also called a “joint contingency table”
             Women   Men
Democrat     27      10
Republican   16      15
• Each box with a value is a “cell”
• Each horizontal line of cells (e.g., Democrat: 27, 10) is a table row
• Each vertical line of cells (e.g., Men: 10, 15) is a table column
Crosstabulation: Introduction
• Tables may also have additional information:
• Row and column marginals (i.e., totals)
         Women   Men   Total
Dem      27      10    37    (27 + 10 = 37)
Rep      16      15    31
Total    43      25    68    (37 + 31 = 68; this is the total N)
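As a rough sketch, the marginals can be computed by summing across rows and down columns. The nested-dictionary layout and the function name `marginals` are assumptions made for this example; the counts are the ones from the gender and party table.

```python
# Observed counts: outer keys are parties (rows), inner keys are genders (columns).
observed = {
    "Dem": {"Women": 27, "Men": 10},
    "Rep": {"Women": 16, "Men": 15},
}

def marginals(table):
    """Return (row totals, column totals, grand total N) for a nested-dict table."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    return row_totals, col_totals, sum(row_totals.values())

row_totals, col_totals, n = marginals(observed)
print(row_totals, col_totals, n)
```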
Crosstabulation: Introduction
• Tables can also reflect percentages
• Either of total N, or of row or column marginals
• This table shows percentage of total N:
         Women        Men          N
Dem      27 (39.7%)   10 (14.7%)   37
Rep      16 (23.5%)   15 (22.1%)   31
N        43           25           68
Just divide each cell value by the total N to get a proportion. Multiply by 100 for a percentage: (10/68)(100) = 14.7
Crosstabulation: Introduction
• In addition, you can calculate percentages with
respect to either row or column marginals
• Here is an example of column percentages
         Women        Men          N
Dem      27 (62.8%)   10 (40.0%)   37
Rep      16 (37.2%)   15 (60.0%)   31
N        43           25           68
Just divide each cell by the column marginal to get a proportion. Multiply by 100 for a percentage: (10/25)(100) = 40%
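Both kinds of percentage can be sketched with two small helper functions. The names `total_percentages` and `column_percentages` are invented for this illustration, and the table is stored as a list of rows (Dem, Rep) with columns in the order Women, Men.

```python
observed = [
    [27, 10],  # Dem: Women, Men
    [16, 15],  # Rep: Women, Men
]

def total_percentages(table):
    """Each cell as a percentage of the total N, rounded to one decimal."""
    n = sum(sum(row) for row in table)
    return [[round(100 * cell / n, 1) for cell in row] for row in table]

def column_percentages(table):
    """Each cell as a percentage of its column marginal, rounded to one decimal."""
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [round(100 * cell / col_totals[j], 1) for j, cell in enumerate(row)]
        for row in table
    ]

pct_of_total = total_percentages(observed)
pct_of_column = column_percentages(observed)
print(pct_of_total)
print(pct_of_column)
```

Row percentages would follow the same pattern, dividing each cell by its row marginal instead.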
Crosstabulation: Independence
• Question: How can we tell if there is a
relationship between the two variables?
• Answer: If a category on one variable appears to be linked to
a category on the other:
Women Men N
Dem 43 0 43
Rep 0 25 25
N 43 25 68
Crosstabulation: Independence
• If there is no relationship between two variables,
they are said to be “independent”
• Neither “depends” on the other
• If there is a relationship, the variables are said to
be “associated” or to “covary”
• If individuals in one category also consistently
fall in another (women=dem, men=rep), you may
suspect that there is a relationship between the
two variables
• Just as when the mean of a certain sub-group is much
higher or lower than another (in T-test/ANOVA).
Crosstabulation: Independence
• Relationships aren’t always very clearly visible
• Widely differing numbers of people in categories make
comparisons difficult (e.g., if there were 200 men and only
15 women in the sample)
• And, large tables become more difficult to interpret
(Example: Knoke, p. 157)
• Looking at row or column percentages can make
visual interpretation a bit easier
• Calculate the percentages within the category you think is
the “independent” variable
• If you think that political party affiliation depends on
gender (column variable), look at column percentages.
Crosstabulation: Independence
• Here, column percentages highlight the
relationship among variables:
[Table of column percentages; cell values not preserved. Column marginals: Women 43, Men 25, N 68]
Expected Cell Values
• If two variables are independent, cell values will
depend only on row & column marginals
– Marginals reflect frequencies… And, if frequency is
high, all cells in that row (or column) should be high
• The formula for the expected value in a cell is:
$$\hat{f}_{ij} = \frac{(f_i)(f_j)}{N}$$
• $f_i$ and $f_j$ are the row and column marginals
• N is the total sample size
Expected Cell Values
• Expected cell values are easy to calculate
– Expected = row marginal * column marginal / N
         Women   Men    N
Dem      23.4    13.6   37
Rep      19.6    11.4   31
N        43      25     68
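The rule "row marginal * column marginal / N" can be sketched as a short function. The name `expected_counts` is an assumption for this example; the observed counts are the gender and party data used throughout the lecture.

```python
def expected_counts(table):
    """Expected cell values under independence: row marginal * column marginal / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [
    [27, 10],  # Dem: Women, Men
    [16, 15],  # Rep: Women, Men
]

expected = expected_counts(observed)
rounded = [[round(cell, 1) for cell in row] for row in expected]
print(rounded)
```

For example, the Democrat/Women cell is 37 * 43 / 68, which rounds to 23.4.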
Expected Cell Values
• Question: What makes these values “expected”?
• A: They simply reflect percentages of marginals
• Look at column %’s based on expected values:
Women Men N
N 43 25 68
Expected Cell Values
• Expected values are “expected” because they
mirror the properties of the sample.
• If the sample is 63% women, you’d expect:
– 63% of democrats would be women and
– 63% of republicans would be women
• If not, the variables (gender & political view)
would not be “independent” of each other
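A quick check of this idea, as a sketch: build the expected table from the marginals alone and confirm that the share of women within each party equals the share of women in the whole sample. The variable names are illustrative only.

```python
import math

row_totals = [37, 31]  # Dem, Rep
col_totals = [43, 25]  # Women, Men
n = 68

# Expected cell values under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Share of women in the whole sample.
share_women_overall = col_totals[0] / n

# Share of women within each party, computed from the expected cells.
share_women_by_party = [row[0] / sum(row) for row in expected]

same_everywhere = all(
    math.isclose(share, share_women_overall) for share in share_women_by_party
)
print(round(100 * share_women_overall, 1), same_everywhere)
```

The observed table does not have this property (27 of 37 Democrats are women, about 73%), and that gap between observed and expected is what a test of independence evaluates.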
Chi-Square Test of Independence
• The Chi-square test is a comparison of expected
and observed values
• For each cell, compute:
$$\frac{(\text{Expected} - \text{Observed})^2}{\text{Expected}}$$
• Then, sum this up for all cells
• If cells all deviate a lot from the expected values,
then the sum is large
• Maybe we can reject H0
Chi-square Test of Independence
• The actual Chi-square formula:
$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}$$
• R = total number of rows in the table
• C = total number of columns in the table
• Eij = the expected frequency in row i, column j
• Oij = the observed frequency in row i, column j
• Question: Why square E – O ?
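One way to explore the question is to compute the sum with and without squaring. In this sketch (the function name `chi_square` is an assumption), the raw deviations E - O cancel out across the table because observed and expected counts share the same marginals, so only the squared version can measure how far the table is from independence.

```python
def chi_square(observed, expected):
    """Sum of (E - O)^2 / E over all cells of the table."""
    return sum(
        (e - o) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[27, 10], [16, 15]]
expected = [[23.4, 13.6], [19.6, 11.4]]  # rounded expected values from the lecture

# Without squaring, positive and negative deviations cancel.
raw_sum = sum(
    e - o
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

statistic = chi_square(observed, expected)
print(round(raw_sum, 6), statistic)
```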
Chi-square Test of Independence
• Assumptions required for the Chi-square test:
• Only one: Sample size is large, N > 100
• Hypotheses
– H0: Variables are statistically independent
– H1: Variables are not statistically independent
• The critical value can be looked up in a Chi-square table
– See Knoke, p. 509-510
– Calculate degrees of freedom: (#Rows-1)(#Col-1)
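The degrees-of-freedom rule is simple to sketch in code. For the special case of one degree of freedom (a 2 x 2 table), the critical value can also be obtained without a printed table, because a Chi-square variable with df = 1 is the square of a standard normal variable. The function names below are invented for this illustration, the default alpha of .05 is an assumption, and the shortcut does not apply when df is greater than 1.

```python
from statistics import NormalDist

def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a crosstab: (#Rows - 1)(#Col - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def critical_value_df1(alpha=0.05):
    """Chi-square critical value for df = 1 ONLY.

    Uses the fact that a Chi-square variable with 1 degree of freedom
    is the square of a standard normal variable.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2

df = degrees_of_freedom(2, 2)
critical = critical_value_df1(0.05)
print(df, round(critical, 3))
```

For larger tables, look up the critical value in a Chi-square table as described above.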
Chi-square Test of Independence
• Example: Gender and Political Views
– Let’s pretend that N of 68 is sufficient
              Women         Men
Democrat      O11: 27       O12: 10
              E11: 23.4     E12: 13.6
Republican    O21: 16       O22: 15
              E21: 19.6     E22: 11.4
Chi-square Test of Independence
• Compute (E – O)^2 / E for each cell
Women Men
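The whole calculation for the gender and party example can be sketched end to end. This is an illustration under stated assumptions rather than the lecture's own worked answer: expected values are computed from the marginals without rounding, alpha = .05 is assumed, and 3.841 is the standard tabled critical value for df = 1 at that level.

```python
observed = [
    [27, 10],  # Democrat: Women, Men
    [16, 15],  # Republican: Women, Men
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected values under independence (unrounded).
expected = [[r * c / n for c in col_totals] for r in row_totals]

# (E - O)^2 / E for each cell.
contributions = [
    [(e - o) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi_sq = sum(sum(row) for row in contributions)

CRITICAL_DF1_ALPHA_05 = 3.841  # assumed alpha = .05, df = (2 - 1)(2 - 1) = 1
reject_h0 = chi_sq > CRITICAL_DF1_ALPHA_05

print([[round(c, 2) for c in row] for row in contributions])
print(round(chi_sq, 2), reject_h0)
```

Under these assumptions the statistic comes out near 3.31, which is below 3.841, so H0 (independence) would not be rejected at the .05 level.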
Absolute value of Q     Strength of Association
0 to .24                “virtually no relationship”