
STA303H5S - Winter 2014: Data Analysis II
LECTURE 2: One Way ANOVA and Linear Regression
Ramya Thinniyam

January 9th, 2014


The Spock Conspiracy Trial
Q: Is there evidence of gender bias in the jury selection of Spock’s trial?
A1: Last Class: Used a two-sample t-test to answer the question of interest.
$H_0: \mu_{spock} = \mu_{other}$ vs. $H_a: \mu_{spock} \neq \mu_{other}$

t-test Method                              Test Statistic   p-value
Pooled t-test (assuming equal variances)   5.67             < 0.0001
Satterthwaite Approximation                7.16             < 0.0001

Concluded that there is very strong evidence of a difference in the mean percentage of women on Spock’s judge’s venires and that of the other judges.
Spock Conspiracy Trial
A2: Use a Linear Model approach / ANOVA
Recall: A Multiple Linear Regression model (with $p$ predictors)
$$Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \dots + \beta_p X_{p,i} + e_i, \quad i = 1, 2, \dots, n$$
$Y_i$: response for the $i$th case (quantitative variable)
$X_{1,i}, X_{2,i}, \dots, X_{p,i}$: predictors for the $i$th case (quantitative or categorical)
$e_i$: error term for the $i$th case, where $e_i \overset{iid}{\sim} N(0, \sigma^2)$
$\beta_0, \beta_1, \dots, \beta_p$: regression coefficients/parameters; $\beta_0$: intercept
$n$: number of cases / sample size

If we are interested in using a factor/categorical variable with $\ell$ levels, then we model it with $\ell - 1$ indicator/dummy variables. Choose one level as the default (it has no indicator variable); every other level gets its own indicator.
Q: Why do we use $\ell - 1$ indicator variables instead of $\ell$?
A: If a factor has $\ell$ levels and a case does not belong to level $1, 2, 3, \dots,$ or $\ell - 1$, then by default it belongs in level $\ell$. An $\ell$th indicator would add no new information: the $\ell$ indicators would sum to 1 for every case, duplicating the intercept column.
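A minimal R sketch of this point, using a made-up three-level factor `g` (hypothetical, not the Spock data). Under R's default treatment coding, `model.matrix` builds an intercept plus $\ell - 1$ indicator columns; appending an indicator for the default level as well makes the columns linearly dependent.

```r
# Hypothetical factor with l = 3 levels (illustrative data only)
g <- factor(c("A", "A", "B", "B", "C", "C"))

# Default treatment coding: intercept + (l - 1) indicator columns
X <- model.matrix(~ g)
print(X)

# Append an indicator for the default level "A" as well
X_full <- cbind(X, gA = as.integer(g == "A"))

# The l indicators sum to the intercept column, so the rank does not increase
rank_X <- qr(X)$rank
rank_full <- qr(X_full)$rank
cat("columns:", ncol(X), "rank:", rank_X, "\n")
cat("columns:", ncol(X_full), "rank:", rank_full, "\n")
```

The extra column adds a parameter but no new information, which is why one level is left as the default.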

Using Indicator Variables
Suppose a factor has $\ell$ levels. We can define $\ell - 1$ indicator variables as follows. For $k = 1, 2, \dots, \ell - 1$:
$$I_{k,i} = \begin{cases} 1, & \text{if $i$th case belongs in factor level } k \\ 0, & \text{otherwise} \end{cases}$$

Then, in Spock example:
$$I_{spock,i} = \begin{cases} 1, & \text{if $i$th venire has Spock’s judge} \\ 0, & \text{otherwise} \end{cases}$$

Fit the model:
$$Y_i = \beta_0 + \beta_1 I_{spock,i} + e_i \quad \text{for } i = 1, 2, \dots, 46$$
where $Y_i$ = % women on $i$th venire
→ Simple Linear Regression Model

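Least Squares Estimates of Regression Parameters
$$\hat{\beta}_0 \equiv b_0 = \bar{y} - b_1 \bar{x}$$
$$\hat{\beta}_1 \equiv b_1 = \frac{SSXY}{SSXX} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$

Q: In Spock example, what are the following quantities?
$x_i =$
$\sum_{i=1}^{n} x_i =$
$\bar{x} =$
$\sum_{i=1}^{n} x_i^2 =$
$\sum_{i=1}^{n} x_i y_i =$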
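One way to explore these quantities numerically is a small R sketch on made-up data. The vectors `x` and `y` below are hypothetical, not the Spock percentages, and the sketch assumes $x_i = I_{spock,i}$, the 0/1 indicator for Spock's judge.

```r
# Hypothetical data: 3 venires from Spock's judge, 5 from other judges
x <- c(1, 1, 1, 0, 0, 0, 0, 0)            # indicator x_i = I_spock,i
y <- c(12, 15, 18, 30, 28, 33, 25, 29)    # made-up % women per venire

n <- length(x)
n_spock <- sum(x == 1)

sum_x  <- sum(x)        # counts the Spock venires
x_bar  <- mean(x)       # proportion of venires from Spock's judge
sum_x2 <- sum(x^2)      # same as sum(x), since 0^2 = 0 and 1^2 = 1
sum_xy <- sum(x * y)    # adds up y over the Spock venires only

cat(sum_x, x_bar, sum_x2, sum_xy, "\n")
```

Because the predictor only takes the values 0 and 1, every sum reduces to a count or a group total for the Spock venires.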


Parameter Interpretation
For the model $Y_i = \beta_0 + \beta_1 I_{spock,i} + e_i$:
$$E(Y_i) = \begin{cases} \beta_0 + \beta_1, & \text{if $i$th venire has Spock’s judge} \\ \beta_0, & \text{if $i$th venire has another judge} \end{cases}$$

So,
$\beta_0$ is the mean % of women in other judges’ venires
$\beta_1$ is the difference in the mean % of women (response) between Spock’s and other judges’ venires
$\beta_1 = 0$: no difference between mean % women in Spock’s and other judges’ venires
$\beta_1 > 0$: mean % women is higher for Spock’s than other judges
$\beta_1 < 0$: mean % women is lower for Spock’s than other judges

Caution: If the factor has more than two levels, the interpretation is slightly different: each coefficient is relative to the default factor level. Write out the model using indicators and take expectations to correctly interpret the parameters.
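As an illustration of the caution above, consider a hypothetical factor with three levels A, B, C, taking C as the default level (the level names and the choice of default are assumptions for this sketch only):

```latex
Y_i = \beta_0 + \beta_1 I_{A,i} + \beta_2 I_{B,i} + e_i
\quad\Longrightarrow\quad
E(Y_i) =
\begin{cases}
\beta_0 + \beta_1, & \text{if case } i \text{ is in level A} \\
\beta_0 + \beta_2, & \text{if case } i \text{ is in level B} \\
\beta_0,           & \text{if case } i \text{ is in level C (default)}
\end{cases}
```

Here $\beta_1 = \mu_A - \mu_C$ and $\beta_2 = \mu_B - \mu_C$. Neither coefficient compares A with B directly; that difference is $\beta_1 - \beta_2$.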

Regression Parameter Estimates
The parameter estimates in Spock’s example simplify to:
$$b_1 = \bar{y}_{spock} - \bar{y}_{other}$$
Proof:
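A sketch of one way to complete the proof, writing $n_s$ and $n_o = n - n_s$ for the numbers of Spock and other venires (notation introduced here, not from the slides) and using $x_i = I_{spock,i}$ in the least squares formula $b_1 = (\sum x_i y_i - n\bar{x}\bar{y}) / (\sum x_i^2 - n\bar{x}^2)$:

```latex
\begin{align*}
\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i^2 &= n_s, \qquad
\bar{x} = \frac{n_s}{n}, \qquad
\sum_{i=1}^{n} x_i y_i = n_s \bar{y}_{spock}, \qquad
n\bar{y} = n_s \bar{y}_{spock} + n_o \bar{y}_{other} \\[4pt]
\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
  &= n_s \bar{y}_{spock} - \frac{n_s}{n}\left(n_s \bar{y}_{spock} + n_o \bar{y}_{other}\right)
   = \frac{n_s n_o}{n}\left(\bar{y}_{spock} - \bar{y}_{other}\right) \\[4pt]
\sum_{i=1}^{n} x_i^2 - n\bar{x}^2
  &= n_s - \frac{n_s^2}{n} = \frac{n_s n_o}{n} \\[4pt]
b_1 &= \frac{\frac{n_s n_o}{n}\left(\bar{y}_{spock} - \bar{y}_{other}\right)}{\frac{n_s n_o}{n}}
     = \bar{y}_{spock} - \bar{y}_{other}
\end{align*}
```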

Homework Exercise: Show $b_0 = \bar{y}_{other}$.


Testing using a Linear Regression Model
$H_0: \beta_1 = \beta_{10}$ vs $H_a: \beta_1 \neq \beta_{10}$
$$t = \frac{b_1 - \beta_{10}}{se(b_1)} \sim t_{n-2} \text{ under } H_0$$

Assuming the following hold:
Correct form of the model
Gauss-Markov Conditions:
1. $E(e_i) = 0$
2. $Var(e_i) = \sigma^2$ (constant)
3. $E(e_i e_j) = 0$ for $i \neq j$ (uncorrelated errors)
$e_i$ are Normal

Testing if the means differ is equivalent to testing if the $\beta_1$ parameter is significant in the regression.
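The equivalence can be checked numerically. The R sketch below uses made-up data (the vectors `x` and `y` are hypothetical, not the Spock percentages) and compares the pooled two-sample t-test with the t-test for the slope of a regression on the 0/1 indicator, taking $\beta_{10} = 0$.

```r
# Hypothetical data (made up for illustration, not the Spock percentages)
y <- c(12, 15, 18, 30, 28, 33, 25, 29)
x <- c(1, 1, 1, 0, 0, 0, 0, 0)   # indicator: 1 = group of interest

# Pooled two-sample t-test (equal variances assumed)
tt <- t.test(y[x == 1], y[x == 0], var.equal = TRUE)

# Simple linear regression on the indicator; test H0: beta1 = 0
fit <- lm(y ~ x)
slope <- summary(fit)$coefficients["x", ]

t_pooled <- unname(tt$statistic)
t_slope  <- unname(slope["t value"])
p_pooled <- tt$p.value
p_slope  <- unname(slope["Pr(>|t|)"])

cat("pooled t:", t_pooled, " slope t:", t_slope, "\n")
cat("pooled p:", p_pooled, " slope p:", p_slope, "\n")
```

Both tests use $n - 2$ degrees of freedom. The sign of the t statistic depends only on which group is subtracted from which.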
Connection to ANOVA
When $\beta_{10} = 0$ (like in Spock example), using the linear model is the same as One-Way Analysis of Variance (ANOVA): 1 factor - testing if the means of the groups are different.
In general, it can be extended to multiple factors and factors with more than two levels: testing if all the factor level means are equal or if any of them differ.
We will discuss ANOVA next class and use it to answer the questions of interest in Spock Conspiracy case study:

Question of Interest 1: Is there evidence of a difference in mean percent of women on Spock’s judge’s venires when compared to other judges?
→ One-Way ANOVA with 2 factor levels (Spock and other)
Question of Interest 2: Is there evidence that there are differences in women’s representation in venires of the other 6 judges?
→ One-Way ANOVA with 6 factor levels (A,B,C,D,E,F)
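For a factor with two levels, the one-way ANOVA F-test and the regression t-test agree, with $F = t^2$. A minimal R sketch on made-up data (the `y` values and group labels are hypothetical, not the case study data):

```r
# Hypothetical data (made up for illustration)
y <- c(12, 15, 18, 30, 28, 33, 25, 29)
group <- factor(c("spock", "spock", "spock",
                  "other", "other", "other", "other", "other"))

fit <- lm(y ~ group)

# One-way ANOVA table for the single factor
aov_table <- anova(fit)
f_stat <- aov_table[["F value"]][1]
p_anova <- aov_table[["Pr(>F)"]][1]

# t-test for the single indicator coefficient in the same fit
coefs <- summary(fit)$coefficients
t_stat <- coefs[2, "t value"]
p_t <- coefs[2, "Pr(>|t|)"]

cat("F:", f_stat, " t^2:", t_stat^2, "\n")
cat("ANOVA p:", p_anova, " t-test p:", p_t, "\n")
```

With more than two levels the factor needs several indicators, so no single t-test matches the F-test.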


Spock Linear Model in ‘R’
> I_spock=rep(0,46)
> I_spock
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[23] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> for(i in 1:length(judge1)) {
+   if (judge1[i]=="SPOCK"){
+     I_spock[i]=1
+   }
+ }
> I_spock
 [1] 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[23] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
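The same indicator can be built without a loop. A minimal sketch, assuming `judge1` is a character vector of judge labels as the loop suggests; the short `judge1` below is a hypothetical stand-in, not the real 46-entry vector.

```r
# Hypothetical stand-in for judge1 (the real vector has 46 entries)
judge1 <- c("SPOCK", "SPOCK", "A", "B", "SPOCK", "F")

# Loop version, mirroring the slide
I_loop <- rep(0, length(judge1))
for (i in seq_along(judge1)) {
  if (judge1[i] == "SPOCK") {
    I_loop[i] <- 1
  }
}

# Vectorised version: the comparison gives TRUE/FALSE, as.numeric maps to 1/0
I_spock <- as.numeric(judge1 == "SPOCK")

print(I_spock)
print(identical(I_loop, I_spock))
```

`lm` would also accept a factor directly and build the indicator columns itself.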


> spock_linearreg = lm(percentwomen ~ I_spock)
> summary(spock_linearreg)

Call:
lm(formula = percentwomen ~ I_spock)

Residuals:
     Min       1Q   Median       3Q      Max
-12.9919  -4.6669   0.2581   3.7854  19.4081

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   29.492      1.160   25.42  < 2e-16 ***
I_spock      -14.870      2.623   -5.67 1.03e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.056 on 44 degrees of freedom
Multiple R-squared:  0.4222,	Adjusted R-squared:  0.409
F-statistic: 32.15 on 1 and 44 DF,  p-value: 1.03e-06
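A short R sketch of how the reported numbers fit together, using only values copied from the `summary()` output. The 95% confidence interval at the end is an addition for illustration and does not appear in the slides.

```r
# Values copied from the summary() output
b0    <- 29.492   # (Intercept) estimate
b1    <- -14.870  # I_spock estimate
se_b1 <- 2.623    # standard error of b1
df    <- 44       # residual degrees of freedom (n - 2 = 46 - 2)

mean_other <- b0          # estimated mean % women, other judges
mean_spock <- b0 + b1     # estimated mean % women, Spock's judge

t_stat  <- b1 / se_b1                  # test statistic for H0: beta1 = 0
p_value <- 2 * pt(-abs(t_stat), df)    # two-sided p-value from t with 44 df
f_stat  <- t_stat^2                    # compare with the reported F-statistic

# 95% confidence interval for beta1 (not part of the original output)
ci <- b1 + c(-1, 1) * qt(0.975, df) * se_b1

cat("mean other:", mean_other, " mean Spock:", mean_spock, "\n")
cat("t:", round(t_stat, 2), " p:", signif(p_value, 3), " F:", round(f_stat, 2), "\n")
cat("95% CI for beta1:", round(ci, 2), "\n")
```

Apart from the sign, the slope's t value matches the pooled t-test statistic of 5.67 from last class.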

Example: Spock Conspiracy
Q: Answer the first question of interest using a linear model approach. Include all the necessary elements, assumptions, and make a conclusion.
A:
