
Clarifying relationships using thought experiments

Shoaib Ul-Haq LUMS

1) Measuring Income Inequality

Source: Nicholson (Dec 2001)

Theories

Theories involve relationships between concepts.

Types of Variables

Continuous variables:

Always numeric
Can be any number, positive or negative
Examples: age in years, weight, blood pressure readings, temperature, concentrations of pollutants and other measurements

Categorical variables:

Information that can be sorted into categories
Types of categorical variables: ordinal, nominal and dichotomous (binary)

Categorical Variables:
Ordinal Variables

Ordinal variable: a categorical variable with some intrinsic order or numeric value

Examples of ordinal variables:

Education (no high school degree, HS degree, some college, college degree)
Agreement (strongly disagree, disagree, neutral, agree, strongly agree)
Rating (excellent, good, fair, poor)
Frequency (always, often, sometimes, never)
Any other scale ("On a scale of 1 to 5...")

Categorical Variables:
Nominal Variables

Nominal variable: a categorical variable without an intrinsic order

Examples of nominal variables:

Where a person lives in the U.S. (Northeast, South, Midwest, etc.)
Sex (male, female)
Nationality (American, Mexican, French)
Race/ethnicity (African American, Hispanic, White, Asian American)
Favorite pet (dog, cat, fish, snake)

Categorical Variables:
Dichotomous Variables

Dichotomous (or binary) variable: a categorical variable with only 2 levels of categories

Often represents the answer to a yes or no question
Anything with only 2 categories

For example:

Did you attend the church picnic on May 24?
Did you eat potato salad at the picnic?

Process vs. variable oriented research

Process

How do you go from point A to point B?
Example: the process an organization goes through in growing from a small firm to a large corporation.

Variable

Relationship between variables

Rational Approach


Example of variable-oriented research

Y = Lung Capacity (cc), X1 = Gender, X2 = Height, X3 = Smoker, X4 = Exercise, X5 = Age

Lung Capacity (cc)   Gender   Height   Smoker   Exercise   Age
5673                 1        69.5     0        25         47
5632                 1        70.1     0        24         67
5712                 1        68.2     0        26         36
5723                 1        70.9     0        26         68
5484                 1        71.9     1        20         58
5308                 1        69.2     1        15         19
5133                 1        71.9     1        0          40
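The descriptive statistics on the next slide can be sketched in a few lines. This is a minimal illustration that uses only the seven example rows shown above, so its results will differ from statistics computed on the full data set. The name `describe` and the choice of the sample standard deviation are assumptions made for the example.

```python
import statistics

# The seven example rows from the slide (not the full data set).
rows = {
    "lung_capacity": [5673, 5632, 5712, 5723, 5484, 5308, 5133],
    "gender": [1, 1, 1, 1, 1, 1, 1],
    "height": [69.5, 70.1, 68.2, 70.9, 71.9, 69.2, 71.9],
    "smoker": [0, 0, 0, 0, 1, 1, 1],
    "exercise": [25, 24, 26, 26, 20, 15, 0],
    "age": [47, 67, 36, 68, 58, 19, 40],
}


def describe(values):
    """Return mean, sample standard deviation, minimum and maximum."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # assumption: sample (n - 1) formula
        "min": min(values),
        "max": max(values),
    }


summary = {name: describe(values) for name, values in rows.items()}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

For a 0/1 variable such as Smoker, the mean is simply the share of cases coded 1.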

Preliminary Analyses
The table below shows some descriptive statistics for each variable. What basic statements about our data can we make from this?

Variable             Mean      Stdev    Min       Max
Lung Capacity (cc)   5325.60   410.48   4233.71   6261.00
Gender               0.50      0.50     0.00      1.00
Height               68.23     3.45     58.93     76.61
Smoker               0.39      0.49     0.00      1.00
Exercise             21.35     8.91     0.00      40.29
Age                  46.42     13.98    19.00     82.14

Capacity by Gender, Smoking

Lung Capacity (cc)         Female    Male      Total
Non-smoker   Average       5427.67   5662.22   5546.87
             StdDev        256.41    284.71    293.75
             Count         30        31        61
Smoker       Average       4837.45   5129.05   4979.51
             StdDev        273.74    297.51    318.12
             Count         20        19        39
Total        Average       5191.58   5459.61   5325.60
             StdDev        391.51    387.93    410.48
             Count         50        50        100

Does there appear to be a relationship between smoking, gender, and lung capacity?
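One way to approach the question is to compare cell means directly. The sketch below takes the four cell averages and counts from the pivot table and computes the smoking gap within each gender and the gender gap within each smoking status. The function names are hypothetical, and a difference in means is only a first look, not a significance test.

```python
# Cell averages of lung capacity (cc) and cell counts from the pivot table.
cells = {
    ("Female", "Non-smoker"): (5427.67, 30),
    ("Female", "Smoker"): (4837.45, 20),
    ("Male", "Non-smoker"): (5662.22, 31),
    ("Male", "Smoker"): (5129.05, 19),
}


def smoking_gap(gender):
    """Non-smoker mean minus smoker mean, within one gender."""
    return cells[(gender, "Non-smoker")][0] - cells[(gender, "Smoker")][0]


def gender_gap(status):
    """Male mean minus female mean, within one smoking status."""
    return cells[("Male", status)][0] - cells[("Female", status)][0]


def pooled_mean(keys):
    """Count-weighted average of the cell means for the given cells."""
    total = sum(cells[key][0] * cells[key][1] for key in keys)
    count = sum(cells[key][1] for key in keys)
    return total / count


for gender in ("Female", "Male"):
    print(gender, "smoking gap:", round(smoking_gap(gender), 2))
for status in ("Non-smoker", "Smoker"):
    print(status, "gender gap:", round(gender_gap(status), 2))
```

The count-weighted average of the two female cells reproduces the female total in the table, which is a quick check that the table has been read correctly.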

Distributions

[Figures: four histograms, each showing frequency bars and a cumulative-percentage curve]
Lung Capacity (cc.): Capacity in cc, up to number shown
Height Distribution: Height in Inches
Distribution of Exercise Time: Minutes of exercise per day
Distribution of Age: Age in years

Bivariate Analysis Matrix Plot

[Figure: Matrix Plot of Lung Capacity (cc), Height, Exercise, Age]

Capacity distribution by Gender, Smoking

[Figure: Histogram of Lung Capacity (cc) by Gender, with fitted normal curves]

Gender   Mean   StDev   N
0        5192   391.5   50
1        5460   387.9   50

Men have a larger lung capacity than women, on average.

[Figure: Histogram of Lung Capacity (cc) by Smoker, with fitted normal curves]

Smoker   Mean   StDev   N
0        5547   293.7   61
1        4980   318.1   39

Non-smokers have a larger lung capacity than smokers, on average. What about the variance?

Simple Regression
How well can exercise time alone predict lung capacity?

[Figure: scatter plot "Lung Capacity and Exercise Time"; vertical axis: Lung Capacity in cc.; horizontal axis: Minutes of exercise per day]

Fitted line: y = 28.71x + 4712.5
R² = 0.3881
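To make the fitted line concrete, the sketch below plugs a few exercise times into the equation reported on the slide. The function name `predict_capacity` and the chosen input values are illustrative assumptions. Note that R² = 0.3881 means exercise time alone accounts for roughly 39% of the variation in lung capacity, so individual predictions can be far off.

```python
# Fitted line from the slide: y = 28.71x + 4712.5
SLOPE = 28.71       # change in predicted capacity (cc) per extra minute of daily exercise
INTERCEPT = 4712.5  # predicted capacity (cc) at zero minutes of exercise


def predict_capacity(exercise_minutes):
    """Predicted lung capacity in cc for a given daily exercise time."""
    return INTERCEPT + SLOPE * exercise_minutes


for minutes in (0, 20, 40):
    print(minutes, "min/day ->", round(predict_capacity(minutes), 1), "cc")
```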

Multiple Regression
How do all the Xs together help predict y?

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8798341
R Square            0.7741081
Adjusted R Square   0.7620926
Standard Error      200.21
Observations        100

            Coefficients   Standard Error   t Stat         P-value
Intercept   1662.3965      475.1456634      3.498709192    0.000716253
Gender      202.3282       41.86861042      4.832456809    5.23607E-06
Height      50.3468        7.08207335       7.109058989    2.24959E-10
Smoker      -278.9711      52.71395448      -5.292169492   7.88193E-07
Exercise    11.2949        2.991170972      3.776112614    0.000279023
Age         -0.1174        1.462303258      -0.080303367   0.936166702
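The summary output can be read as a prediction equation. The sketch below applies the reported coefficients to the first row of the example data (Gender 1, Height 69.5, non-smoker, 25 minutes of exercise, Age 47; observed capacity 5673 cc) and recomputes each t statistic as coefficient divided by standard error. The function name `predict` is an assumption for illustration, and standard errors are rounded here.

```python
# Coefficients and standard errors as reported in the summary output.
COEF = {
    "Intercept": 1662.3965,
    "Gender": 202.3282,
    "Height": 50.3468,
    "Smoker": -278.9711,
    "Exercise": 11.2949,
    "Age": -0.1174,
}
STD_ERR = {
    "Intercept": 475.1456634,
    "Gender": 41.86861042,
    "Height": 7.08207335,
    "Smoker": 52.71395448,
    "Exercise": 2.991170972,
    "Age": 1.462303258,
}


def predict(gender, height, smoker, exercise, age):
    """Predicted lung capacity (cc) from the fitted multiple regression."""
    return (
        COEF["Intercept"]
        + COEF["Gender"] * gender
        + COEF["Height"] * height
        + COEF["Smoker"] * smoker
        + COEF["Exercise"] * exercise
        + COEF["Age"] * age
    )


# First row of the example data; the observed value was 5673 cc.
predicted = predict(gender=1, height=69.5, smoker=0, exercise=25, age=47)
residual = 5673 - predicted
print("predicted:", round(predicted, 1), "residual:", round(residual, 1))

# t statistic = coefficient / standard error
t_stats = {name: COEF[name] / STD_ERR[name] for name in COEF}
for name, t in t_stats.items():
    print(name, round(t, 3))
```

Age has a t statistic close to zero and a P-value of about 0.94, so in this model it adds little once the other variables are included.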

Thought experiment for relationships between categorical variables

          PTI   PML   Total
Females   70    30    100
Males     30    70    100
Total     100   100   200

Females are more likely than males to be PTI members.

Thought experiment for relationships between categorical variables

          PTI   PML   Total
Females   51    49    100
Males     49    51    100
Total     100   100   200

Females are more likely than males to be PTI members, but now the relationship is weaker.

Thought experiment for relationships between categorical variables

          PTI   PML   Total
Females   95    5     100
Males     5     95    100
Total     100   100   200

The effect of gender on political party affiliation is very strong.

How to use a thought experiment

Create a contingency table, listing the values of the cause as rows and the values of the effect as columns.
Think of 100 hypothetical people for each row of the table; that is, set the marginal frequency of each row to equal 100.
Of these 100 people, specify how many you think will fall into each column category; this represents the percentage of people in each category.
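The procedure can be sketched as a small data structure: one dictionary per value of the cause, each summing to 100 hypothetical people. The helper names `make_table` and `percentage_gap` are hypothetical, and using the gap in percentage points as a rough indicator of strength is an assumption of this sketch; the slides describe strength only verbally.

```python
def make_table(rows):
    """Validate a thought-experiment table: every row must total 100 people."""
    for cause_value, cells in rows.items():
        if sum(cells.values()) != 100:
            raise ValueError(f"row {cause_value!r} must sum to 100")
    return rows


def percentage_gap(table, effect_value, row_a, row_b):
    """Difference in percentage points between two rows for one effect category."""
    return table[row_a][effect_value] - table[row_b][effect_value]


# The three gender-by-party tables from the preceding slides.
moderate = make_table({"Females": {"PTI": 70, "PML": 30}, "Males": {"PTI": 30, "PML": 70}})
weak = make_table({"Females": {"PTI": 51, "PML": 49}, "Males": {"PTI": 49, "PML": 51}})
very_strong = make_table({"Females": {"PTI": 95, "PML": 5}, "Males": {"PTI": 5, "PML": 95}})

for label, table in (("moderate", moderate), ("weak", weak), ("very strong", very_strong)):
    print(label, percentage_gap(table, "PTI", "Females", "Males"))
```

A gap of zero would mean the hypothesized cause makes no difference to the effect.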

Categorical variables with more than 2 levels

          PML (N)   PPP   PTI   Total
Punjabi   50        25    25    100
Sindhi    25        50    25    100
Balochi   50        25    25    100
Total     125       100   75    300

Categorical variables with more than 2 levels

Proposition 1: Punjabis, Sindhis and Balochis are equally likely to be members of PTI.
Proposition 2: Sindhis are more likely to be PPP members than either Punjabis or Balochis; Punjabis and Balochis are equally likely to be PPP members.
Proposition 3: Punjabis and Balochis are both more likely than Sindhis to be PML (N) members.
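Each proposition is a statement about one column of the table, so it can be checked mechanically against the hypothetical row percentages. The sketch below encodes the three rows from the preceding slide; the helper name `share` is an assumption for illustration.

```python
# Hypothetical percentages per 100 people in each ethnic group.
table = {
    "Punjabi": {"PML (N)": 50, "PPP": 25, "PTI": 25},
    "Sindhi": {"PML (N)": 25, "PPP": 50, "PTI": 25},
    "Balochi": {"PML (N)": 50, "PPP": 25, "PTI": 25},
}


def share(group, party):
    """Percentage of the group expected to belong to the party."""
    return table[group][party]


# Proposition 1: all three groups are equally likely to be PTI members.
prop1 = share("Punjabi", "PTI") == share("Sindhi", "PTI") == share("Balochi", "PTI")

# Proposition 2: Sindhis lead on PPP; Punjabis and Balochis are equal.
prop2 = (
    share("Sindhi", "PPP") > share("Punjabi", "PPP")
    and share("Sindhi", "PPP") > share("Balochi", "PPP")
    and share("Punjabi", "PPP") == share("Balochi", "PPP")
)

# Proposition 3: Punjabis and Balochis both exceed Sindhis on PML (N).
prop3 = (
    share("Punjabi", "PML (N)") > share("Sindhi", "PML (N)")
    and share("Balochi", "PML (N)") > share("Sindhi", "PML (N)")
)

print(prop1, prop2, prop3)
```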

1. Linear & Non-linear relationships between variables

Often of greatest interest in social science is investigation into relationships between variables:

Is social class related to political perspective?
Is income related to education?
Is worker alienation related to job monotony?

We are also interested in the direction of causation, but this is more difficult to prove empirically: our empirical models are usually structured assuming a particular theory of causation.

Relationships between scale variables

The most straightforward way to investigate evidence for a relationship is to look at scatter plots. It is traditional to:

put the dependent variable (i.e. the effect) on the vertical axis, or y axis
put the explanatory variable (i.e. the cause) on the horizontal axis, or x axis

Scatter plot of IQ and Income:

[Figure: scatter plot of INCOME (vertical axis) against IQ (horizontal axis)]

We would like to find the line of best fit: y = a + bx

INCOME = a + b × IQ

where a = y-intercept and b = slope of the line

[Figure: scatter plot of INCOME (vertical axis) against IQ (horizontal axis)]
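The line of best fit is usually found by ordinary least squares, which chooses a and b to minimize the sum of squared vertical distances from the points to the line. The sketch below implements the standard formulas in plain Python. The four IQ and income pairs are invented purely for illustration and are not the data behind the scatter plot.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The fitted line always passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data, invented for this example only.
iq = [80, 100, 120, 140]
income = [12000, 18000, 22000, 28000]

a, b = fit_line(iq, income)
print("intercept a =", a, "slope b =", b)
print("predicted income at IQ 110:", a + b * 110)
```

The slope b is read as the expected change in income for a one-point increase in IQ; the intercept a is the predicted income at IQ 0, which usually lies outside the observed range and has no substantive meaning.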

What does the output mean?

Coefficients (a)

Model 1      Unstandardized Coefficients
             B           Std. Error
(Constant)   -8236.836   12...
IQ           258.523

a. Dependent Variable: INCOME

Sometimes the relationship appears non-linear:

[Figure: scatter plot of INCOME (vertical axis) against IQ2 (horizontal axis)]

and so a straight line of best fit is not always very satisfactory:

[Figure: scatter plot of INCOME against IQ2 with a straight line of best fit]

Could try a quadratic line of best fit:

[Figure: scatter plot of INCOME against IQ2 with a quadratic line of best fit]

Or could try two linear lines:

[Figure: scatter plot of INCOME against IQ2 with two linear segments meeting at a structural break]

Illustration of Curvilinear Regression

[Figure: curvilinear regression plot; horizontal axis: X]

Inverted-U Theory
This theory describes the relationship between market structure and technological advances.

Technological Discontinuities
Discontinuity

Categorical Cause & Quantitative Effect
Hypothetical means

          Mean annual income
Private   45,000
Public    40,000

Private university faculty have higher salaries than public university faculty.

Quantitative Cause & Categorical Effect
Hypothetical Probabilities

                  PTI   PML (N)   Total
Liberalness = 5                   100
Liberalness = 4                   100
Liberalness = 3                   100
Liberalness = 2                   100
Liberalness = 1                   100

The degree to which a person is liberal affects his political party affiliation.

Quantitative Cause & Categorical Effect
Hypothetical Probabilities

                  PTI   PML (N)   Total
Liberalness = 5   70    30        100
Liberalness = 4   60    40        100
Liberalness = 3   50    50        100
Liberalness = 2   40    60        100
Liberalness = 1   30    70        100

The degree to which a person is liberal affects his political party affiliation.

Quantitative Cause & Categorical Effect
Hypothetical Probabilities

                  PTI    PML (N)   Total
Liberalness = 5   0.70   0.30      1.00
Liberalness = 4   0.60   0.40      1.00
Liberalness = 3   0.50   0.50      1.00
Liberalness = 2   0.40   0.60      1.00
Liberalness = 1   0.30   0.70      1.00

The probability of being a PTI member is a direct linear function of liberalness.
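The linear function implied by the table can be written out explicitly. Each one-point increase in liberalness raises the PTI probability by 0.10, and extending the line back to liberalness 0 gives 0.20. Both numbers are inferred from the hypothetical table values rather than stated on the slide, and the function name `p_pti` is an assumption for the sketch.

```python
# Hypothetical PTI probabilities from the table, keyed by liberalness score.
table = {5: 0.70, 4: 0.60, 3: 0.50, 2: 0.40, 1: 0.30}

INTERCEPT = 0.20  # inferred: value of the line at liberalness 0
SLOPE = 0.10      # inferred: change in probability per liberalness point


def p_pti(liberalness):
    """Probability of PTI membership as a linear function of liberalness."""
    return INTERCEPT + SLOPE * liberalness


def p_pml(liberalness):
    """With only two parties in the table, the PML (N) probability is the complement."""
    return 1.0 - p_pti(liberalness)


for score in sorted(table, reverse=True):
    print(score, round(p_pti(score), 2), round(p_pml(score), 2))
```

A straight line like this only makes sense as a probability over the range of scores in the table; outside that range it can fall below 0 or rise above 1.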

Thought experiments for moderated relationships

Moderated relationships involve 3 variables.
Focus on cases where the strength of a relationship between two variables changes depending on the value of a third variable.

Examples:

Inflation has a bigger influence on the economies of underdeveloped countries than on those of developed countries.
Higher levels of education are more likely to translate into job opportunities for Punjabis than for Balochis.

Hypothetical Factorial Design

          BBA   MBA
Females   6.0   6.0
Males     5.0   4.0

Gender differences in university satisfaction are larger during the MBA than during the BBA.

Thought experiments for moderated relationships

Create a factorial table with the moderator variable (MV) as columns and the focal IV as rows.
Fill in plausible hypothetical mean values on the outcome variable for each cell of the table.
Calculate the effect of the focal IV at each level of the MV and then calculate the interaction contrast to determine if there is a moderated relationship.

Interaction Contrast

          BBA   MBA
Females   a     c
Males     b     d

Interaction contrast = (a - b) - (c - d). If this value is non-zero, then a moderated relationship is present.
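As a worked instance, the sketch below applies the formula to the hypothetical BBA and MBA satisfaction means used earlier (Females 6.0 and 6.0, Males 5.0 and 4.0). The function name `interaction_contrast` is an assumption for illustration.

```python
def interaction_contrast(a, b, c, d):
    """(a - b) - (c - d): the focal-IV effect at MV level 1 minus its effect at MV level 2."""
    return (a - b) - (c - d)


# Hypothetical satisfaction means: a, b = females, males in BBA; c, d = females, males in MBA.
a, b = 6.0, 5.0
c, d = 6.0, 4.0

effect_bba = a - b  # gender difference during BBA
effect_mba = c - d  # gender difference during MBA
contrast = interaction_contrast(a, b, c, d)

print("BBA gender gap:", effect_bba)
print("MBA gender gap:", effect_mba)
print("interaction contrast:", contrast)
```

The contrast is non-zero, so the hypothetical means describe a moderated relationship; its negative sign reflects that the gender gap is larger in the second column (MBA) than in the first (BBA).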

Factorial design with more than 2 levels

          FSc   BBA   MBA
Females   6.0   6.0   6.0
Males     5.0   4.0   4.0

More than 2 levels

          FSc   BBA
Females   6.0   6.0
Males     5.0   4.0

          FSc   MBA
Females   6.0   6.0
Males     5.0   4.0

          BBA   MBA
Females   6.0   6.0
Males     4.0   4.0

Gender differences in university satisfaction are larger during the BBA than during FSc.
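With three moderator levels, the 2 x 2 interaction contrast can be computed once for every pair of levels. The sketch below does this for the hypothetical FSc, BBA and MBA means; the helper name `pairwise_contrasts` is an assumption for illustration.

```python
from itertools import combinations

# Hypothetical satisfaction means: level of study -> (females, males).
means = {
    "FSc": (6.0, 5.0),
    "BBA": (6.0, 4.0),
    "MBA": (6.0, 4.0),
}


def pairwise_contrasts(cell_means):
    """Interaction contrast (gender gap at level 1 minus gender gap at level 2) for every pair of levels."""
    results = {}
    for level_1, level_2 in combinations(cell_means, 2):
        females_1, males_1 = cell_means[level_1]
        females_2, males_2 = cell_means[level_2]
        results[(level_1, level_2)] = (females_1 - males_1) - (females_2 - males_2)
    return results


contrasts = pairwise_contrasts(means)
for pair, value in contrasts.items():
    print(pair, value)
```

A zero contrast for a pair, as for BBA versus MBA here, means the gender gap is the same at both levels, so the moderation lies entirely between FSc and the two later programmes.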

Factorial design with quantitative variables

         Male   Female
High     3.0    4.0
Medium   3.0    5.0
Low      3.0    6.0

IV = How much time a teacher spends with students.

Quantitative variables

         Male   Female
High     3.0    4.0
Medium   3.0    5.0

         Male   Female
High     3.0    4.0
Low      3.0    6.0

         Male   Female
Medium   3.0    5.0
Low      3.0    6.0

The effect of spending a large versus a moderate amount of time is stronger for females than for males.
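The comparison in the closing statement can be sketched by computing, for each gender, the change in the outcome between two levels of teacher time, and then taking the difference between genders. The hypothetical means come from the High, Medium and Low table; the function names are assumptions for illustration.

```python
# Hypothetical outcome means: time level -> gender -> mean.
means = {
    "High": {"Male": 3.0, "Female": 4.0},
    "Medium": {"Male": 3.0, "Female": 5.0},
    "Low": {"Male": 3.0, "Female": 6.0},
}


def level_effect(gender, level_a, level_b):
    """Change in the outcome for one gender when moving from level_b to level_a."""
    return means[level_a][gender] - means[level_b][gender]


def contrast(level_a, level_b):
    """Female effect minus male effect for the same pair of time levels."""
    return level_effect("Female", level_a, level_b) - level_effect("Male", level_a, level_b)


for pair in (("High", "Medium"), ("High", "Low"), ("Medium", "Low")):
    female = level_effect("Female", *pair)
    male = level_effect("Male", *pair)
    print(pair, "female effect:", female, "male effect:", male, "contrast:", contrast(*pair))
```

For males every effect is zero, while for females each step changes the mean by one point, so all three contrasts are non-zero and gender moderates the effect of teacher time in these hypothetical data.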
