Lecture 15 Crosstabs 1

Lecture 15: Crosstabulation 1
Sociology 5811
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Final Project Assignment Handed out
• Proposal due November 15
• Final Project due December 13
• Today’s class:
• New Topic: Crosstabulation
• Also called “crosstabs”
• Coming Soon: correlation, regression
Crosstabulation: Introduction
• T-Test and ANOVA look to see if groups differ
on a continuous dependent variable
• Groups are actually a nominal variable
• Example: Do different ethnic groups vary in wages?
• Difference in means for two groups indicates a
relationship between two variables
• Null hypothesis (means are the same) suggests that there is
no relationship between variables
• Alternate hypothesis (means differ) is equivalent to saying
that there is a relationship.
• T-test and ANOVA determine whether there is a
statistical relationship between a nominal variable
and a continuous variable in your data
• But, we may be interested in two nominal variables
• Examples: Class and unemployment; gender and drug use
• Crosstabulation: used for nominal/ordinal variables
• Tools to descriptively examine variables
• Tools to identify whether there is a relationship between two
variables.
• What is bivariate crosstabulation?
• Start with two nominal variables in a dataset:
• Example: gender (Male/Female) and political party
(Democrat, Republican)
• Crosstabulation is simply counting up the number
of people in each combined category
• How many Democratic women? Democratic men?
Republican women? Republican men?
• It is similar to computing frequencies
• But, for two variables jointly, rather than just one.
• Example: Female = 1, Democrat = 1
ID   Gender   Political Party
1    0        1
2    1        0
3    1        1
4    0        0
5    0        0
6    1        0
7    1        1
8    0        1
Question: How many Republican women are in the dataset?
Answer: 2 (IDs 2 and 6, where Gender = 1 and Political Party = 0)
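The counting step can be sketched in a few lines of Python. This is only an illustration: the `records` list re-types the eight rows above as (gender, party) pairs, and the variable names are assumptions, not part of the lecture.

```python
from collections import Counter

# (gender, party) for IDs 1-8, coded as in the example: Female = 1, Democrat = 1
records = [(0, 1), (1, 0), (1, 1), (0, 0), (0, 0), (1, 0), (1, 1), (0, 1)]

# Count how many people fall in each combined (gender, party) category
joint_counts = Counter(records)

# Republican women are coded gender = 1, party = 0
republican_women = joint_counts[(1, 0)]
print(republican_women)
```

The four entries of `joint_counts` are exactly the four cells of a 2x2 crosstab.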
• Example: Dataset of 68 people
• Look and count up the number of people in each
combined category
• Or, determine frequency along the first variable:
• Frequency: 43 women, 25 men
• Then break out groups by the second variable
• Of 43 women, 27 = democrat, 16 = republican
• Of 25 men, 10 = democrat, 15 = republican.
• Crosstab: a table that presents joint frequencies
• Also called a “joint contingency table”
             Women   Men
Democrat     27      10
Republican   16      15
Each box with a value is a “cell”. Cells running across (e.g., Democrat: 27, 10) form a table row; cells running down (e.g., Men: 10, 15) form a table column.
• Tables may also have additional information:
• Row and column marginals (i.e., totals)
        Women   Men   Total
Dem     27      10    37     (27 + 10 = 37)
Rep     16      15    31
Total   43      25    68     (68 is the total N)
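As a small sketch, the marginals can be computed by summing across rows and down columns. The nested-list layout and the names `observed`, `row_totals`, and `col_totals` are illustrative assumptions.

```python
# Observed counts: rows = party (Dem, Rep), columns = gender (Women, Men)
observed = [[27, 10],
            [16, 15]]

# Row marginals: sum across each row
row_totals = [sum(row) for row in observed]

# Column marginals: sum down each column (zip(*observed) transposes the table)
col_totals = [sum(col) for col in zip(*observed)]

# Total N: the sum of either set of marginals
total_n = sum(row_totals)

print(row_totals, col_totals, total_n)
```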
• Tables can also reflect percentages
• Either of total N, or of row or column marginals
• This table shows percentage of total N:
        Women        Men          N
Dem     27 (39.7%)   10 (14.7%)   37
Rep     16 (23.5%)   15 (22.1%)   31
N       43           25           68
Just divide each cell value by the total N to get a proportion. Multiply by 100 for a percentage: (10/68)(100) = 14.7
• In addition, you can calculate percentages with
respect to either row or column marginals
• Here is an example of column percentages
        Women        Men          N
Dem     27 (62.8%)   10 (40.0%)   37
Rep     16 (37.2%)   15 (60.0%)   31
N       43           25           68
Just divide each cell by the column marginal to get a proportion. Multiply by 100 for a percentage: (10/25)(100) = 40%
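Both kinds of percentages follow the same pattern and differ only in the denominator. The sketch below assumes the same nested-list layout for the observed counts (rows = party, columns = gender); the variable names are illustrative.

```python
# Observed counts: rows = party (Dem, Rep), columns = gender (Women, Men)
observed = [[27, 10],
            [16, 15]]

col_totals = [sum(col) for col in zip(*observed)]
total_n = sum(col_totals)

# Percentage of total N: divide each cell by N
total_pct = [[round(100 * cell / total_n, 1) for cell in row]
             for row in observed]

# Column percentages: divide each cell by its own column marginal
col_pct = [[round(100 * cell / col_totals[j], 1) for j, cell in enumerate(row)]
           for row in observed]

print(total_pct)
print(col_pct)
```

Row percentages would work the same way, dividing each cell by its row marginal instead.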
Crosstabulation: Independence
• Question: How can we tell if there is a
relationship between the two variables?
• Answer: If a category on one variable appears to be linked to
a category on the other:
Women Men N
Dem 43 0 43
Rep 0 25 25
N 43 25 68
• If there is no relationship between two variables,
they are said to be “independent”
• Neither “depends” on the other
• If there is a relationship, the variables are said to
be “associated” or to “covary”
• If individuals in one category also consistently
fall in another (women=dem, men=rep), you may
suspect that there is a relationship between the
two variables
• Just as when the mean of a certain sub-group is much
higher or lower than another (in T-test/ANOVA).
• Relationships aren’t always very clearly visible
• Widely differing numbers of people in categories make
comparisons difficult (e.g., if there were 200 men and only
15 women in the sample)
• And, large tables become more difficult to interpret
(Example: Knoke, p. 157)
• Looking at row or column percentages can make
visual interpretation a bit easier
• Calculate the percentages within the category you think is
the “independent” variable
• If you think that political party affiliation depends on
gender (column variable), look at column percentages.
• Here, column percentages highlight the
relationship among variables:
Women Men N
Dem 62.8% 40.0% 37
Rep 37.2% 60.0% 31
N 43 25 68
• It appears as though women tend to be more
Democratic, while men tend to be Republican
Chi-square Test of Independence
• In the sample, women appear to be more
democratic, men republican
• How do we know if this difference is merely due
to sampling variability? (Thus, there is no
relationship in the population?)
• Or, is it indicative of a relationship at the population level?
• Answer: A new kind of statistical test
• The chi-square (χ²) test
• Pronunciation: “chi” rhymes with “sky”
• Chi-square tests: Similar to T-tests, F-tests
• Another family of distributions with known properties.
• Chi-Square test is a test of independence
• Asks “is there a relationship between variables or not?”
• Independence = no relationship
• ANOVA, T-Test do this too (same means = independent)
• Null hypothesis: the two variables are
statistically independent
• H0: Gender and political party are independent
• There is no relationship between them
• Alternate hypothesis: the variables are related,
not independent of each other
• H1: Gender and political party are not independent.
• How does a chi-square test of independence
work?
• It is based on comparing the observed cell values
with the values you’d expect if there were no
relationship between variables
• Definitions:
• Observed values = values in the crosstab cells based on
your sample
• Expected values = crosstab cell values you would expect if
your variables were unrelated.
Crosstabs: Notation
• The value in a cell is referred to as a frequency
– Math symbol = f
• Cells are referred to by row and column numbers
– Ex: women republicans = 2nd row, 1st column
– In general, rows are numbered from 1 to i, columns
are numbered from 1 to j
• Thus, the value in any cell of any table can be
written as:
– $f_{ij}$
Expected Cell Values
• If two variables are independent, cell values will
depend only on row & column marginals
– Marginals reflect frequencies… And, if frequency is
high, all cells in that row (or column) should be high
• The formula for the expected value in a cell is:
$\hat{f}_{ij} = \frac{(f_{i\cdot})(f_{\cdot j})}{N}$
• $f_{i\cdot}$ and $f_{\cdot j}$ are the row and column marginals
• N is the total sample size
• Expected cell values are easy to calculate
– Expected = row marginal * column marginal / N
        Women   Men    N
Dem     23.4    13.6   37
Rep     19.6    11.4   31
N       43      25     68
RowM * ColM / N, e.g., Dem/Men: (25*37)/68 = 13.6
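The expected-value table can be reproduced from the marginals alone. This is a minimal sketch under the same assumed layout (rows = party, columns = gender); the names are illustrative.

```python
# Observed counts: rows = party (Dem, Rep), columns = gender (Women, Men)
observed = [[27, 10],
            [16, 15]]

row_totals = [sum(row) for row in observed]        # row marginals
col_totals = [sum(col) for col in zip(*observed)]  # column marginals
total_n = sum(row_totals)

# Expected count for each cell = row marginal * column marginal / N
expected = [[r * c / total_n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 1) for value in row])
```

The expected counts add up to the same row and column marginals as the observed counts; only the distribution across cells changes.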
• Question: What makes these values “expected”?
• A: They simply reflect percentages of marginals
• Look at column %’s based on expected values:
Women Men N
Dem 54% 54% 37 (54%)
Rep 46% 46% 31 (46%)
N 43 25 68
• Expected values are “expected” because they
mirror the properties of the sample.
• If the sample is 63% women, you’d expect:
– 63% of democrats would be women and
– 63% of republicans would be women
• If not, the variables (gender & political view)
would not be “independent” of each other
Chi-Square Test of Independence
• The Chi-square test is a comparison of expected
and observed values
• For each cell, compute:
$\frac{(\text{Expected} - \text{Observed})^2}{\text{Expected}}$
• Then, sum this up for all cells
• If cells all deviate a lot from the expected values,
then the sum is large
• Maybe we can reject H0
• The actual Chi-square formula:
$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}$
• R = total number of rows in the table
• C = total number of columns in the table
• Eij = the expected frequency in row i, column j
• Oij = the observed frequency in row i, column j
• Question: Why square E – O ?
• Assumptions required for the Chi-square test:
• Only one: Sample size is large, N > 100
• Hypotheses
– H0: Variables are statistically independent
– H1: Variables are not statistically independent
• The critical value can be looked up in a Chi-
square table
– See Knoke, p. 509-510
– Calculate degrees of freedom: (#Rows-1)(#Col-1)
• Example: Gender and Political Views
– Let’s pretend that N of 68 is sufficient
             Women        Men
Democrat     O11: 27      O12: 10
             E11: 23.4    E12: 13.6
Republican   O21: 16      O22: 15
             E21: 19.6    E22: 11.4
• Compute (E – O)²/E for each cell
             Women                       Men
Democrat     (23.4 – 27)²/23.4 = .55     (13.6 – 10)²/13.6 = .95
Republican   (19.6 – 16)²/19.6 = .66     (11.4 – 15)²/11.4 = 1.14
Chi-Square Test of Independence
• Finally, sum up to compute the Chi-square
• χ² = .55 + .95 + .66 + 1.14 = 3.30
• What is the critical value for α = .05?
• Degrees of freedom: (R-1)(C-1) = (2-1)(2-1) = 1
• According to Knoke, p. 509: Critical value is 3.84
• Question: Can we reject H0?
• No. χ² of 3.30 is less than the critical value
• We cannot conclude that there is a relationship between
gender and political party affiliation.
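The whole test can be sketched end to end in Python. This is an illustration under stated assumptions: the variable names are invented here, the critical value 3.84 is the one cited from Knoke for α = .05 with 1 degree of freedom, and the expected values are left unrounded. Note that each cell's squared deviation is divided by that cell's own expected value.

```python
# Observed counts: rows = party (Dem, Rep), columns = gender (Women, Men)
observed = [[27, 10],
            [16, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total_n = sum(row_totals)

# Sum (E - O)^2 / E over every cell of the table
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total_n  # expected count
        chi_square += (e - o) ** 2 / e

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value for alpha = .05, df = 1 (taken from the lecture, not computed)
CRITICAL_VALUE = 3.84
reject_h0 = chi_square > CRITICAL_VALUE

print(round(chi_square, 2), df, reject_h0)
```

With unrounded expected values the statistic comes out near 3.31, still below 3.84, so H0 is not rejected. Small differences from a hand calculation come from rounding the expected values to one decimal.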
• Weaknesses of chi-square tests:
• 1. If the sample is very large, we almost always
reject H0.
• Even tiny covariations are statistically significant
• But, they may not be socially meaningful differences
• 2. It doesn’t tell us how strong the relationship is
• It doesn’t tell us if it is a large, meaningful difference or a
very small one
• It is only a test of “independence” vs. “dependence”
• Measures of Association address this shortcoming.
Measures of Association
• Separate from the issue of independence,
statisticians have created measures of association
– They are measures that tell us how strong the
relationship is between two variables
• Weak Association              Strong Association
        Women   Men                     Women   Men
Dem.    51      49              Dem.    100     0
Rep.    49      51              Rep.    0       100

Crosstab Association: Yule’s Q
• #1: Yule’s Q
– Appropriate only for 2x2 tables (2 rows, 2 columns)
• Label cell frequencies a through d:
    a   b
    c   d
• Formula: $Q = \frac{bc - ad}{bc + ad}$
• Recall that extreme values along the “diagonal”
(cells a & d) or the “off-diagonal” (b & c)
indicate a strong relationship.
• Yule’s Q captures that in a measure
• 0 = no association. -1, +1 = strong association
• Rule of Thumb for interpreting Yule’s Q:
• Bohrnstedt & Knoke, p. 150
Absolute value of Q    Strength of Association
0 to .24               “virtually no relationship”
.25 to .49             “weak relationship”
.50 to .74             “moderate relationship”
.75 to 1.0             “strong relationship”

• Example: Gender and Political Party Affiliation
        Women     Men
Dem     a = 27    b = 10
Rep     c = 16    d = 15
Calculate “bc”: bc = (10)(16) = 160
Calculate “ad”: ad = (27)(15) = 405
$Q = \frac{bc - ad}{bc + ad} = \frac{160 - 405}{160 + 405} = \frac{-245}{565} = -.43$
• -.43 = “weak association”
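Yule's Q and the rule-of-thumb labels are easy to sketch as two small functions. The function names `yules_q` and `strength_label` are hypothetical, chosen for this illustration; the formula and the cutoffs are the ones given in the lecture.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 table with cells a b / c d: (bc - ad) / (bc + ad).

    Undefined (division by zero) when bc + ad is 0.
    """
    return (b * c - a * d) / (b * c + a * d)


def strength_label(q):
    """Rule-of-thumb label for the absolute value of Q."""
    size = abs(q)
    if size < 0.25:
        return "virtually no relationship"
    if size < 0.50:
        return "weak relationship"
    if size < 0.75:
        return "moderate relationship"
    return "strong relationship"


# Gender and political party: a = 27, b = 10, c = 16, d = 15
q = yules_q(27, 10, 16, 15)
label = strength_label(q)
print(round(q, 2), label)
```

Applied to the two illustrative tables from the Measures of Association slide, `yules_q(51, 49, 49, 51)` is close to 0 and `yules_q(100, 0, 0, 100)` is exactly -1; the sign depends only on which diagonal holds the larger counts.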
Association: Other Measures
• Phi (φ)
• Very similar to Yule’s Q
• Only for 2x2 tables, ranges from –1 to 1, 0 = no assoc.
• Gamma (G)
• Based on a very different method of calculation
• Not limited to 2x2 tables
• Requires ordered variables
• Tau c (τc) and Somers’ d (dyx)
• Same basic principle as Gamma
• Several Others discussed in Knoke, Norusis.
