
Linear models

Alex Homer
When might we use linear models?
• Response variable: continuous data
- Should be normally distributed, but this is hard to check before fitting
• Explanatory variables: continuous or categorical data
- Sometimes a linear model is overcomplicated, e.g. a single categorical variable with two levels
Three types

• ANOVA: all explanatory variables categorical
• Linear regression: all explanatory variables continuous
• ANCOVA: at least one of each
• These are basically all the same model
Choosing a response

• In a designed experiment (especially ANOVA), the response is usually obvious

• Otherwise, think about the research question


- “Do <explanatory> and <explanatory> have an effect on <response>?”
- Remember correlation does not mean causation!


Linear regression

• Continuous explanatory variable(s)
• Example: iris dataset in R
• “Does sepal length increase linearly with sepal width?”
The iris dataset
First data points
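Since iris ships with base R, the first data points shown on this slide can be inspected directly (a minimal sketch):

```r
# iris is bundled with R: 150 observations of four measurements plus species
head(iris)  # first six rows
str(iris)   # four numeric columns and one factor (Species)
```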
The iris dataset
Exploratory data analysis
The model

• Y = α + βx + ε
- Data are known: x = sepal width, Y = sepal length
- Parameters are unknown: α = intercept, β = slope (with respect to sepal width)
- ε ∼ N(0, σ²): random error, with σ² another unknown parameter
The model

• Using subscripts to denote different observations:
  Yi = α + βxi + εi
- α and β are the same for each observation
- εi are independent
• We want to estimate α and β from the data

Estimating parameters
• We can estimate α and β from the data: call the estimates α̂ and β̂
• [Big cloud of maths omitted]
• Best estimates minimise ∑i (Yi − α̂ − β̂xi)²
- “Least squares regression”

Fitting in R
Summary output
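The summary output on this slide can be reproduced with a couple of lines of R (a sketch; the slide's exact printout is not shown here):

```r
# Simple linear regression of sepal length on sepal width
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
summary(fit)  # coefficients table, R-squared, residual standard error
```

With all three species pooled, the estimated slope is in fact negative; the relationship looks rather different within each species (see the ANCOVA section).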
Interpreting summaries
• Mostly interested in the “Coefficients” table
- Estimate: gives α̂ and β̂
- Std. Error: gives the standard error of these estimates (remember they’re random)
- t value: gives the test statistic for a (two-tailed) t-test with null hypothesis α = 0 or β = 0
- Pr(>|t|): gives the p-value for this test


Interpreting summaries
• Multiple R-squared: measure of how well the model fits the data
- “Proportion of total variance explained by the model”
- Good value in biology: at least 0.6
• Residual standard error: estimate of σ from the data


Fitting in R
ANOVA table
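The ANOVA table for the same fit comes from anova() (a sketch):

```r
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
anova(fit)  # sequential sums of squares: one row per term, plus residuals
```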
ANOVA tables
• Used for model comparison
• Read from the bottom up: performs tests for sequentially dropping each variable (only one here)
• Null hypothesis is always the model with fewer variables
• Mean Sq: Sum Sq/Df
• F value: test statistic for an F-test (ratio of Mean Sq)
• Pr(>F): corresponding p-values
F values
• “Signal-to-noise” ratio; always positive
• Under the null hypothesis, has mean approximately 1
- Value less than 1: “experimental error” (though this doesn’t have much mathematical meaning)
• Large values: reject the null hypothesis
• Critical value depends on two numbers of degrees of freedom
More explanatory variables
• “Does sepal length increase linearly with petal length and petal width?”
• Now we need to consider possible interaction terms:
  Yi = α + βxi + γzi + δxizi + εi
- xi = ith petal length, zi = ith petal width; α, β, γ, δ unknown constants; Yi and εi as before
Fitting in R
Summary output
Fitting in R
ANOVA table
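A sketch of the two-variable fit with an interaction term (in R's formula notation, `*` expands to both main effects plus their interaction):

```r
# Yi = alpha + beta*xi + gamma*zi + delta*xi*zi + eps_i
fit2 <- lm(Sepal.Length ~ Petal.Length * Petal.Width, data = iris)
summary(fit2)  # four coefficients: intercept, two main effects, interaction
anova(fit2)    # sequential tests, read from the bottom up
```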
ANOVA tables
• Order matters (which is why we read from the bottom up)
• We don’t remove “simple” terms without removing interaction terms, nor ever remove the intercept, unless there’s a good a priori reason to
Predicting
• What does the model predict as the sepal length of a plant with petal length 1.1 cm and petal width 0.3 cm?
• 4.75 cm (3 s.f.):
  4.57717 + 0.44168 × 1.1 − 1.23932 × 0.3 + 0.18859 × 1.1 × 0.3
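Rather than plugging the coefficients in by hand, predict() does this calculation (a sketch):

```r
fit2 <- lm(Sepal.Length ~ Petal.Length * Petal.Width, data = iris)
new_plant <- data.frame(Petal.Length = 1.1, Petal.Width = 0.3)
predict(fit2, newdata = new_plant)  # ~4.75, matching the hand calculation
```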


ANOVA
• Categorical explanatory variable(s)
• One-way: one explanatory variable
• Two-way: two explanatory variables
One-way ANOVA
• Example: iris dataset (“Does species have an effect on sepal length?”)
• Species is a factor with three levels: setosa, versicolor and virginica
• Choose one as a baseline: by default the first one in the dataset (so setosa here)
The iris dataset
Exploratory data analysis
One-way ANOVA
• Model: Yi = α + βvi + γwi + εi
- Yi, α, εi: all as before
- vi = 1 if the ith observation is versicolor; vi = 0 otherwise
- wi = 1 if the ith observation is virginica; wi = 0 otherwise
- R does this for you


Fitting in R
Summary output
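A sketch of the one-way ANOVA fit; because Species is a factor, R constructs the dummy variables vi and wi automatically:

```r
fit_sp <- lm(Sepal.Length ~ Species, data = iris)
summary(fit_sp)  # intercept = setosa baseline; other rows = differences from it
```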
Summary data in ANOVA
• Very similar to the linear regression case
• First column: note that the levels of the factor appear
• t-tests: test for a significant difference between that species and the baseline
- So less useful: there is a way around this, beyond the scope of this course
Predicting
• What does the model predict as the sepal length of a plant of species versicolor?
• 5.94 (3 s.f.): 5.0060 + 0.9300
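Again, predict() does the arithmetic (a sketch):

```r
fit_sp <- lm(Sepal.Length ~ Species, data = iris)
predict(fit_sp, newdata = data.frame(Species = "versicolor"))  # 5.0060 + 0.9300
```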


Fitting in R
ANOVA table
ANOVA tables in ANOVA
• Lets us consider removing an entire factor variable at once, not just individual coefficients
- e.g. here it performs a test with null hypothesis β = γ = 0, and alternative that at least one of them is non-zero
• Species sum of squares: fraction of the total sum of squares explained by variation between groups*
• Residual sum of squares: fraction of the total sum of squares explained by variation within groups*
ANOVA tables in ANOVA
• *This doesn’t really make sense with continuous explanatory variables; the sums of squares are harder to interpret
• The total sum of squares will increase with more data points: not a good estimate of variance
• Mean squares: estimates of the variance between/within groups (obtained by dividing SS by the number of df)
• Species df: number of levels, minus 1
• Residual df: number of data points, minus number of levels
Two-way ANOVA
• Example: ToothGrowth data
• Continuous variable: (tooth) length
• Two factors: supplement (OJ, VC) and dose (0.5, 1, 2)
- Note we treat dose as a factor, even though its values are numbers
• “Does length depend on supplement and dose, and is there an interaction effect?”
The ToothGrowth dataset
Exploratory data analysis
Fitting in R
Summary output
Fitting in R
ANOVA table
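A sketch of the two-way fit; note that dose is coerced to a factor first, as the slide says:

```r
tg <- transform(ToothGrowth, dose = factor(dose))
fit_tg <- lm(len ~ supp * dose, data = tg)  # main effects plus interaction
summary(fit_tg)
anova(fit_tg)  # rows for supp, dose, then the supp:dose interaction
```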
Two-way ANOVA
• In a balanced design, the order in the ANOVA table doesn’t matter
- Meaning of “balanced design”: see later
• If there is exactly one observation in each combination of factors: the interaction model fits perfectly
- Overfitting: can’t test for interaction effects
- Solution: interaction plot

Diagnostics
Interaction plot
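An interaction plot like the one on this slide can be drawn with base R's interaction.plot() (a sketch):

```r
tg <- transform(ToothGrowth, dose = factor(dose))
# One line per supplement; roughly parallel lines suggest no interaction,
# while crossing or converging lines suggest one
with(tg, interaction.plot(dose, supp, len))
```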
ANCOVA
• Combination of factor and continuous explanatory variables
• Back to the iris data: “Does sepal length depend on sepal width and species?”
• Can include interaction terms as well


The iris dataset
Exploratory data analysis
Fitting in R
Summary output
Fitting in R
ANOVA table
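A sketch of the ANCOVA fit for the iris question, with interaction terms included:

```r
# Different intercept and slope for each species
fit_anc <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
summary(fit_anc)  # six coefficients
anova(fit_anc)    # sequential tests; read from the bottom up as before
```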
Residuals
• Recall the simple linear regression: Yi = α + βxi + εi
• We don’t know the εi values, but want to check the assumptions on them
• Use residuals instead: the vertical distance of each point from the fitted line
  ei = Yi − α̂ − β̂xi
• Typically smaller than the “true” errors
• In a residuals vs. fitted plot, the residuals are “standardised”: rescaled so their sample variance is equal to 1
Residuals vs. fitted values plot
• Lets us find outliers (but we can’t just remove them!)
- Standardised residual bigger than 2 or less than −2, by convention
• Tests the equal-variance and uncorrelated-errors assumptions
- A funnel shape suggests unequal variance
- If the red line spends a long period on one side of the zero line, this suggests correlation between residuals
• Also tests for linearity (a curved red line suggests non-linearity)


Diagnostics
Residuals vs. fitted values for ANCOVA with iris data
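The diagnostic plots on these slides come from R's plot() method for fitted models (a sketch):

```r
fit_anc <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
plot(fit_anc, which = 1)  # residuals vs. fitted values
plot(fit_anc, which = 2)  # normal Q-Q plot of the standardised residuals
```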
Normal Q–Q plot
• Tests whether the residuals are normally distributed (which supports the assumption that the error ε is normally distributed)
• Points lying on the dashed line suggest normality
• If a few points at either end deviate from the line, that’s fine
Diagnostics
Residual Q–Q plot for ANCOVA with iris data
Model diagnostics
• Residuals vs. fitted plot works less well for ANOVA: can’t test independence as easily
- Fails because multiple points have the same fitted value
• Can still do other checks


Diagnostics
Residuals vs. fitted values for ANOVA with ToothGrowth data
Diagnostics
Residual Q–Q plot for ANOVA with ToothGrowth data
Factorial designs
• Factorial design: some number of factors at some number of levels
• Full factorial design: all possible combinations of factor levels occur at least once
• Completely randomised design: treatments assigned to individuals completely at random
Dealing with non-linearity
• The solution is usually to transform the data, though other techniques are available
• Exactly how you do that is beyond the scope of this course!


Replication and balance
• Replication: applying every treatment to multiple, independent experimental units
- Advantage: reduces standard error, increases power of tests
• Balance: giving all treatments the same sample size
- Advantage: reduces standard error
Blocking
• Blocks are groups with common features (e.g. different chambers/fields)
• Randomised block design: treatment levels assigned randomly within a block; each appears once in each block
- Advantage: taking blocks into account can increase the power of tests/reduce the effect of environmental variation on conclusions
Randomisation
• Treatments assigned at random (either within a block, or overall)
- Advantage: reduces potential bias by equalising factors that are unaccounted for
