
Linear models

Alex Homer
When might we use linear models?
• Response variable: continuous data
- Should be normally distributed, but this is hard to check before fitting
• Explanatory variables: continuous or categorical data
- Sometimes a linear model is overcomplicated, e.g. a single categorical variable with two levels
Three types

• ANOVA: all explanatory variables categorical
• Linear regression: all explanatory variables continuous
• ANCOVA: at least one of each
• These are basically all the same model
Choosing a response

• In a designed experiment (especially ANOVA), the response is usually obvious

• Otherwise, think about the research question


- “Do <explanatory> and <explanatory> have an effect on <response>?”
- Remember correlation does not mean causation!


Linear regression

• Continuous explanatory variable(s)
• Example: iris dataset in R
• “Does sepal length increase linearly with sepal width?”
The iris dataset
First data points
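Since iris ships with base R, the first data points shown on this slide can be inspected directly (a minimal sketch):

```r
# iris is bundled with R: 150 observations of four measurements plus species
head(iris)  # first six rows
str(iris)   # four numeric columns and one factor (Species)
```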
The iris dataset
Exploratory data analysis
The model

• Y = α + βx + ε
- Data are known: x = sepal width, Y = sepal length
- Parameters are unknown: α = intercept, β = slope (with respect to sepal width)
- ε ∼ N(0, σ²): random error, with σ² another unknown parameter
The model

• Using subscripts to denote different observations:
  Yi = α + βxi + εi
- α and β are the same for each observation
- εi are independent
• We want to estimate α and β from the data

Estimating parameters
• We can estimate α and β from the data: call the estimates α̂ and β̂
• [Big cloud of maths omitted]
• Best estimates minimise ∑i (Yi − α̂ − β̂xi)²
- “Least squares regression”

Fitting in R
Summary output
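The summary output on this slide can be reproduced with a couple of lines of R (a sketch; the slide's exact printout is not shown here):

```r
# Simple linear regression of sepal length on sepal width
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
summary(fit)  # coefficients table, R-squared, residual standard error
```

With all three species pooled, the estimated slope is in fact negative; the relationship looks rather different within each species (see the ANCOVA section).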
Interpreting summaries
• Mostly interested in the “Coefficients” table
- Estimate: gives α̂ and β̂
- Std. Error: gives the standard error of these estimates (remember they’re random)
- t value: gives the test statistic for a (two-tailed) t-test with null hypothesis α = 0 or β = 0
- Pr(>|t|): gives the p-value for this test


Interpreting summaries
• Multiple R-squared: measure of how well the model fits the data
- “Proportion of total variance explained by the model”
- Good value in biology: at least 0.6
• Residual standard error: estimate of σ from the data


Fitting in R
ANOVA table
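The ANOVA table for the same fit comes from anova() (a sketch):

```r
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
anova(fit)  # sequential sums of squares: one row per term, plus residuals
```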
ANOVA tables
• Used for model comparison
• Read from the bottom up: performs tests for sequentially dropping each variable (only one here)
• Null hypothesis is always the model with fewer variables
• Mean Sq: Sum Sq/Df
• F value: test statistic for an F-test (ratio of Mean Sq)
• Pr(>F): corresponding p-values
F values
• “Signal-to-noise” ratio; always positive
• Under the null hypothesis, has mean approximately 1
- Value less than 1: “experimental error” (though this doesn’t have much mathematical meaning)
• Large values: reject the null hypothesis
• Critical value depends on two numbers of degrees of freedom
More explanatory variables
• “Does sepal length increase linearly with petal length and petal width?”
• Now we need to consider possible interaction terms:
  Yi = α + βxi + γzi + δxizi + εi
- xi = ith petal length, zi = ith petal width; α, β, γ, δ unknown constants; Yi and εi as before
Fitting in R
Summary output
Fitting in R
ANOVA table
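A sketch of the two-variable fit with an interaction term (in R's formula notation, `*` expands to both main effects plus their interaction):

```r
# Yi = alpha + beta*xi + gamma*zi + delta*xi*zi + eps_i
fit2 <- lm(Sepal.Length ~ Petal.Length * Petal.Width, data = iris)
summary(fit2)  # four coefficients: intercept, two main effects, interaction
anova(fit2)    # sequential tests, read from the bottom up
```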
ANOVA tables
• Order matters (which is why we read from the bottom up)
• We don’t remove “simple” terms without removing interaction terms, nor ever remove the intercept, unless there’s a good a priori reason to
Predicting
• What does the model predict as the sepal length of a plant with petal length 1.1 cm and petal width 0.3 cm?
• 4.75 cm (3 s.f.):
  4.57717 + 0.44168 × 1.1 − 1.23932 × 0.3 + 0.18859 × 1.1 × 0.3
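Rather than plugging the coefficients in by hand, predict() does this calculation (a sketch):

```r
fit2 <- lm(Sepal.Length ~ Petal.Length * Petal.Width, data = iris)
new_plant <- data.frame(Petal.Length = 1.1, Petal.Width = 0.3)
predict(fit2, newdata = new_plant)  # ~4.75, matching the hand calculation
```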


ANOVA
• Categorical explanatory variable(s)
• One-way: one explanatory variable
• Two-way: two explanatory variables
One-way ANOVA
• Example: iris dataset (“Does species have an effect on sepal length?”)
• Species is a factor with three levels: setosa, versicolor and virginica
• Choose one as a baseline: by default the first one in the dataset (so setosa here)
The iris dataset
Exploratory data analysis
One-way ANOVA
• Model: Yi = α + βvi + γwi + εi
- Yi, α, εi: all as before
- vi = 1 if the ith observation is versicolor; vi = 0 otherwise
- wi = 1 if the ith observation is virginica; wi = 0 otherwise
- R does this for you


Fitting in R
Summary output
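A sketch of the one-way ANOVA fit; because Species is a factor, R constructs the dummy variables vi and wi automatically:

```r
fit_sp <- lm(Sepal.Length ~ Species, data = iris)
summary(fit_sp)  # intercept = setosa baseline; other rows = differences from it
```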
Summary data in ANOVA
• Very similar to the linear regression case
• First column: note that the levels of the factor appear
• t-tests: test for a significant difference between that species and the baseline
- So less useful: there is a way around this, beyond the scope of this course
Predicting
• What does the model predict as the sepal length of a plant of species versicolor?
• 5.94 (3 s.f.): 5.0060 + 0.9300
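Again, predict() does the arithmetic (a sketch):

```r
fit_sp <- lm(Sepal.Length ~ Species, data = iris)
predict(fit_sp, newdata = data.frame(Species = "versicolor"))  # 5.0060 + 0.9300
```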


Fitting in R
ANOVA table
ANOVA tables in ANOVA
• Lets us consider removing an entire factor variable at once, not just individual coefficients
- e.g. here it performs a test with null hypothesis β = γ = 0, and alternative that at least one of them is non-zero
• Species sum of squares: fraction of the total sum of squares explained by variation between groups*
• Residual sum of squares: fraction of the total sum of squares explained by variation within groups*
ANOVA tables in ANOVA
• *This doesn’t really make sense with continuous explanatory variables; the sums of squares are harder to interpret
• The total sum of squares will increase with more data points: not a good estimate of variance
• Mean squares: estimates of the variance between/within groups (obtained by dividing SS by the number of df)
• Species df: number of levels, minus 1
• Residual df: number of data points, minus number of levels
Two-way ANOVA
• Example: ToothGrowth data
• Continuous variable: (tooth) length
• Two factors: supplement (OJ, VC) and dose (0.5, 1, 2)
- Note we treat dose as a factor, even though its values are numbers
• “Does length depend on supplement and dose, and is there an interaction effect?”
The ToothGrowth dataset
Exploratory data analysis
Fitting in R
Summary output
Fitting in R
ANOVA table
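A sketch of the two-way fit; note that dose is coerced to a factor first, as the slide says:

```r
tg <- transform(ToothGrowth, dose = factor(dose))
fit_tg <- lm(len ~ supp * dose, data = tg)  # main effects plus interaction
summary(fit_tg)
anova(fit_tg)  # rows for supp, dose, then the supp:dose interaction
```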
Two-way ANOVA
• In a balanced design, the order in the ANOVA table doesn’t matter
- Meaning of “balanced design”: see later
• If there is exactly one observation in each combination of factors: the interaction model fits perfectly
- Overfitting: can’t test for interaction effects
- Solution: interaction plot

Diagnostics
Interaction plot
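An interaction plot like the one on this slide can be drawn with base R's interaction.plot() (a sketch):

```r
tg <- transform(ToothGrowth, dose = factor(dose))
# One line per supplement; roughly parallel lines suggest no interaction,
# while crossing or converging lines suggest one
with(tg, interaction.plot(dose, supp, len))
```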
ANCOVA
• Combination of factor and continuous explanatory variables
• Back to the iris data: “Does sepal length depend on sepal width and species?”
• Can include interaction terms as well


The iris dataset
Exploratory data analysis
Fitting in R
Summary output
Fitting in R
ANOVA table
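A sketch of the ANCOVA fit for the iris question, with interaction terms included:

```r
# Different intercept and slope for each species
fit_anc <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
summary(fit_anc)  # six coefficients
anova(fit_anc)    # sequential tests; read from the bottom up as before
```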
Residuals
• Recall the simple linear regression: Yi = α + βxi + εi
• We don’t know the εi values, but want to check the assumptions on them
• Use residuals instead: the vertical distance of each point from the fitted line
  ei = Yi − α̂ − β̂xi
• Typically smaller than the “true” errors
• In a residuals vs. fitted plot, the residuals are “standardised”: rescaled so their sample variance is equal to 1
Residuals vs. fitted values plot
• Lets us find outliers (but we can’t just remove them!)
- Standardised residual bigger than 2 or less than −2, by convention
• Tests the equal-variance and uncorrelated-errors assumptions
- A funnel shape suggests unequal variance
- If the red line spends a long period on one side of the zero line, this suggests correlation between residuals
• Also tests for linearity (a curved red line suggests non-linearity)


Diagnostics
Residuals vs. fitted values for ANCOVA with iris data
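The diagnostic plots on these slides come from R's plot() method for fitted models (a sketch):

```r
fit_anc <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
plot(fit_anc, which = 1)  # residuals vs. fitted values
plot(fit_anc, which = 2)  # normal Q-Q plot of the standardised residuals
```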
Normal Q–Q plot
• Tests whether the residuals are normally distributed (which supports the assumption that the error ε is normally distributed)
• Points lying on the dashed line suggest normality
• If a few points at either end deviate from the line, that’s fine
Diagnostics
Residual Q–Q plot for ANCOVA with iris data
Model diagnostics
• Residuals vs. fitted plot works less well for ANOVA: can’t test independence as easily
- Fails because multiple points have the same fitted value
• Can still do other checks


Diagnostics
Residuals vs. fitted values for ANOVA with ToothGrowth data
Diagnostics
Residual Q–Q plot for ANOVA with ToothGrowth data
Factorial designs
• Factorial design: some number of factors at some number of levels
• Full factorial design: all possible combinations of factor levels occur at least once
• Completely randomised design: treatments assigned to individuals completely at random
Dealing with non-linearity
• The solution is usually to transform the data, though other techniques are available
• Exactly how you do that is beyond the scope of this course!


Replication and balance
• Replication: applying every treatment to multiple, independent experimental units
- Advantage: reduces standard error, increases power of tests
• Balance: giving all treatments the same sample size
- Advantage: reduces standard error
Blocking
• Blocks are groups with common features (e.g. different chambers/fields)
• Randomised block design: treatment levels assigned randomly within a block; each appears once in each block
- Advantage: taking blocks into account can increase the power of tests/reduce the effect of environmental variation on conclusions
Randomisation
• Treatments assigned at random (either within a block, or overall)
- Advantage: reduces potential bias by equalising factors that are unaccounted for
