Week 6 – Model assumptions in linear regression
Laura Brannelly, Saras Windecker, and Patrick Baker
02 April 2022

Objectives
In the previous prac we introduced linear regression, some of the assumptions that underpin it, and how to
implement it in R using lm() . In this week’s prac we will focus on what to do when the assumptions of linear
regression are violated.

Upon completion of this prac, you should be able to:

list the four key assumptions that are necessary to make valid inference from linear regression analyses;

identify when one or more of those assumptions has been violated; and

modify your analyses to address the violated assumption.

Assessment
This practical will be assessed. There are 10 questions in the Problem Sets section at the end of the prac
handout. You will need to provide answers for each of the 10 questions. Your answers may require written text,
R code and R output, R graphics, or some combination of these. We will provide an R Markdown answer
template for you to use.

Model Assumptions in Linear Regression


Valid inference from linear regression rests on several assumptions about the data being analysed. If
these assumptions are not met, then there is the possibility that the parameter estimates generated in the
analysis will be biased and that subsequent inference will be incorrect or inappropriate. To avoid this issue, it is
important to test whether your dataset meets these assumptions. We used exploratory data analysis and model
diagnostics to do this last week.

In this prac we will touch on the four key assumptions of linear regression, show some examples of them being
violated, and consider how to address the violations. It is worth repeating that not all assumption violations can
be fixed during the analysis stage with fancy data-wrangling. Many of the problems that arise with violated model
assumptions are problems of experimental design—which is why it is important to understand the core assumptions and be
thinking about them from the very beginning of your research.

Model Assumptions
There are four core assumptions of linear regression models that must be met to ensure proper inference.
Recall that the linear regression model is represented as:

Y = β₀ + β₁X + ϵ

This is equivalent to the equation that we used last week: yᵢ = mxᵢ + b + ϵᵢ

Each of the model assumptions concerns the error term (ϵᵢ) of the regression model. These are:

Individual observations are independent
Response data are normally distributed
Variance is homogeneous across the range of the predictor
Data are linear

In the previous prac and in this week’s mini-lectures we discussed these assumptions in detail. If you need to
reacquaint yourself with them, have a look at the Week 5 prac manual and the Week 6 lecture slides.

Let’s have a look at a few examples.

A worked example – I. Algal growth and light intensity


Background, data, and hypotheses
To help illustrate some of these issues it is helpful to work through an example in which we walk through the
process of assessing each of the model assumptions. The example that we’ll use examines how light intensity
(μE per m² per second) influences the maximum growth rate (rmax) of laboratory populations of the green alga
Chlorella vulgaris. The expectation is that with increasing amounts of light, the alga, which depends on light for
energy and growth, will increase in abundance.

light <- c(20,20,20,20,21,24,44,60,90,94,101)

rmax <- c(1.73,1.65,2.02,1.89,2.61,1.36,2.37,2.08,2.69,2.32,3.67)

alga <- data.frame(light, rmax)

We will start by asking two questions. First, what is the biological hypothesis that we are testing? Second, what
is the statistical hypothesis that we are asking?

The biological hypothesis is that light intensity influences the maximum growth rate of Chlorella vulgaris.
The statistical hypothesis is that the slope parameter of the linear model is non-zero.

Now, let’s see if the assumptions of linear regression are met with this dataset. This will allow us to decide if it
is appropriate to apply a linear regression analysis to this dataset. Here we will look at three of the four
assumptions:

Residuals are normally distributed
Variance of residuals is homogeneous across the range of the predictor
Relationship between predictor and response variables is linear

[Note: We don’t include here the first assumption about data independence, as it isn’t readily diagnosed once
the data have been collected.]

Exploratory data analyses


As always, let’s start by looking at the data:

library(ggplot2)  # plotting

p1 <- ggplot(data=alga, aes(x=light, y=rmax)) +
  geom_point() +
  geom_smooth(method='lm') +
  labs(x=expression(paste("Light intensity", ~"(", mu, "E/", m^2, "/s)")),
       y="Maximum growth rate (g/day)")

p1

## `geom_smooth()` using formula 'y ~ x'

We’ve added a linear smoothing function geom_smooth(method='lm') to show the linear relationship between
the predictor and response. It can also indicate if the data seem to be skewed away from the mean
relationship.

It would appear that the relationship is more or less linear (Assumption #4) and that the spread of points
around the line is homogeneous across the range of the predictor (Assumption #3).

We should then look at the response variable to see if it is normally distributed.

hist(alga$rmax)

qqnorm(alga$rmax)
qqline(alga$rmax)

The histogram and the qq plot suggest that the response data are normally distributed. However, we will want
to confirm this when we look at the residuals in the linear model output (Assumption #2).
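If you would like a numerical check to complement these plots, base R’s stats package provides shapiro.test(), a Shapiro–Wilk test of normality. This is an optional extra rather than part of the prac workflow above; a large p-value indicates no evidence against normality (it does not prove normality).

# Optional complementary check: Shapiro-Wilk test of normality on the
# response variable (a large p-value = no evidence of non-normality)
shapiro.test(alga$rmax)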

So, we conclude that there is NO evidence that:

the response variable is non-normal (but, again, we’ll confirm with the residuals from the model)
the spread of values around the trendline shows a trend
the relationship between the predictor and response is non-linear

Therefore, this seems like a reasonable dataset for linear regression.

Fitting the model


So, let’s fit a linear regression model to these data using the lm() function and then look at a summary of the
resultant model. This should be familiar to you from last week, but we’ll just go through it again to make sure
everybody is on the same page.

alga_lm <- lm(rmax ~ light, data = alga)

summary(alga_lm)

##
## Call:
## lm(formula = rmax ~ light, data = alga)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.5478 -0.2607 -0.1166  0.1783  0.7431
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.580952   0.244519   6.466 0.000116 ***
## light       0.013618   0.004317   3.154 0.011654 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4583 on 9 degrees of freedom
## Multiple R-squared:  0.5251, Adjusted R-squared:  0.4723
## F-statistic: 9.951 on 1 and 9 DF,  p-value: 0.01165

There’s a lot of information packed into this output and it is always a useful place to begin to evaluate the
model. Here is a quick reminder of the model summary outputs:

1. The Estimates in the Coefficients table are the mean parameter values in your linear model
rmax = β₀ + β₁·light + ϵ. The Intercept in the output (1.581) is β₀; the slope on light (0.0136) is β₁.

2. The Standard Error in the Coefficients table is an indication of how variable the parameter estimate is.
If the Standard Error is small relative to the Estimate, the parameter is estimated precisely, which
suggests that we should have confidence in the Estimate as an indication of the true, but unknown,
parameter value.

3. The t-value and Pr(>|t|) are the results of a statistical test to determine if each parameter value is
different from zero. The asterisks at the end indicate that the tests are significant (ie, the parameter
estimates are different from zero). We’ll discuss the details of statistical significance tests next week.

4. The Residual standard error is a measure of unexplained variance. It is an indication of the spread of
data points around the linear model.

5. The Multiple R-squared and Adjusted R-squared give an indication of how much variability in the
dataset is explained by the predictor variables. Values closer to one mean that the model has good
predictive skill; values closer to zero mean that the model has relatively little predictive skill.
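The quantities described above can also be extracted from the fitted model object programmatically, which is handy when you want to reuse them later (for example, in the back-transformation in the next worked example). A quick sketch using the alga model:

# Pull quantities out of the fitted model object
coef(alga_lm)                    # parameter estimates (intercept and slope)
confint(alga_lm)                 # 95% confidence intervals for the parameters
summary(alga_lm)$adj.r.squared   # adjusted R-squared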

Model diagnostics
There are also important diagnostic plots for linear models. Here we return to the performance package to
show them.

library(performance)  # provides check_model()

check_model(alga_lm, check=c("qq", "normality", "ncv", "outliers"))

Each of these provides some indication of how individual points relate to or depart from the regression model
that best fits these data. In this case none of those figures shows any patterns that would make us worry.
Specifically, the qq-plot ( qq , top) suggests that the residuals are normally distributed. The homoscedasticity
plot ( ncv , middle) shows no evidence of uneven spread around a line through zero on the y-axis. So, as we
noted above, there is no reason to believe that these data violate the core assumptions of the linear
regression model.

It is also worth looking at outliers when evaluating a regression model. These points can have undue influence
on a regression model without violating the model assumptions. They may be due to coding errors (the
decimal point was put in the wrong place when copying the data from a notebook onto a spreadsheet) or they
may be real (sometimes you get extreme values). Identifying them allows you to check their origin and
consider their impact on the model. In this dataset data point 11 is a bit of an outlier, as shown in the bottom
panel of the figure above.
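Beyond eyeballing the outlier panel, the performance package also offers check_outliers(), which flags potentially influential observations using standard distance measures. A minimal sketch (the default method may vary between package versions):

# Flag potentially influential observations in the fitted model
check_outliers(alga_lm)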

A worked example – II. Body size and metabolic rate

Nagy, Girard & Brown (1999) investigated the allometric scaling relationships for mammals (79 species),
reptiles (55 species), and birds (95 species). The observed relationships between body size and metabolic
rates of organisms have attracted a great deal of discussion amongst scientists from a wide range of
disciplines. Many of these data sets have non-linear, exponential relationships between predictor and response
variables. These violate the assumptions of linear regression models and so provide a useful example for
addressing this problem. Nagy et al. (1999) compiled a large data set of body mass measurements (measured
in grams) and field metabolic rates (FMR, measured in kiloJoules (kJ)/day) to explore the relationship between
the two. Let’s load the allometry data and look at the variables:

head(allometry)

##                     Species                Common Mass  FMR  Class
## 1 Pipistrellus pipistrellus           Pipistrelle  7.3 29.3 Mammal
## 2          Plecotus auritus  Brown long-eared bat  8.5 27.6 Mammal
## 3          Myotis lucifugus      Little brown bat  9.0 29.9 Mammal
## 4         Gerbillus henleyi Northern pygmy gerbil  9.3 26.5 Mammal
## 5        Tarsipes rostratus          Honey possum  9.9 34.4 Mammal
## 6           Anoura caudifer   Flower-visiting bat 11.5 51.9 Mammal

table(allometry$Class)

##
##     Aves  Mammal Reptiles
##       95      79       55

The variable Class is the categorical variable that identifies whether a species is a mammal, a reptile, or a
bird. We’re going to work with a subset of the data that just includes the mammals. So, we’ll use subset()
to extract those rows and create a new dataframe called mammals for the rest of this example.

mammals <- subset(allometry, Class == "Mammal")

Now let’s plot the data as a scatterplot:

m1 <- ggplot(mammals, aes(x=Mass, y=FMR)) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(x="Body mass (g)", y="Field metabolic rate (kJ/day)")

m1

Instead of looking at histograms for the response and the predictor variables individually, we can use a
package called ggExtra to look at the marginal distributions of the data. This should give us an indication if
the data are strongly skewed, which is often indicative of problems that we’ll have with the residuals once we
run the linear model.

library(ggExtra)  # provides ggMarginal()

ggMarginal(m1, type="histogram", size=7) # Try replacing "histogram" with "densigram"

What you should immediately notice when you look at the histograms/density profiles in the margins is that the
data are very skewed with lots of small values and a few very large values. Most of the data points are in the
lower left hand corner and a few are scattered towards the upper right-hand side of the plot. This is a major red
flag that there may be a problem with our assumption of homogeneity of variance (Assumption #3) and
possibly the normality in the residuals (Assumption #2). On the positive side, the relationship looks more or
less linear (Assumption #4)…

For the purposes of illustration, let’s fit a linear model to these data:

mammals_lm1 <- lm(FMR ~ Mass, data=mammals)

summary(mammals_lm1)

##
## Call:
## lm(formula = FMR ~ Mass, data = mammals)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13551.2    159.9    181.2    239.3  11004.1
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -135.8648   332.3814  -0.409    0.684
## Mass           0.4938     0.0156  31.649   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2728 on 77 degrees of freedom
## Multiple R-squared:  0.9286, Adjusted R-squared:  0.9277
## F-statistic: 1002 on 1 and 77 DF,  p-value: < 2.2e-16

The first thing we’ll want to look at are the residuals. Recall that linear regression assumes that the residuals
are normally distributed with a mean of zero. The median is 181.2, which is quite different from zero. But the
min and max are more than 10,000 in absolute value, so perhaps not too bad relative to that spread. The
quartiles are a bit of a worry, though. The 1st quartile is 159.9 (about 21 less than the median) and the 3rd
quartile is 239.3 (about 58 more than the median). That suggests that the distribution of residuals may be quite
asymmetric. And then there are the min and max, which are two orders of magnitude greater than the median.
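If you want to look at these residual quantiles directly rather than reading them off summary(), you can compute them yourself. A quick sketch:

# Inspect the distribution of the residuals directly
summary(residuals(mammals_lm1))
hist(residuals(mammals_lm1), breaks=20,
     main="Residuals from mammals_lm1", xlab="Residual (kJ/day)")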

The estimate of the coefficient β₁ (the slope) on Mass is highly significant and the adjusted R-squared is
very high. These are good, but the residuals are worrying. Let’s have a look at the model diagnostics using
check_model() :

check_model(mammals_lm1, check=c("qq", "normality", "ncv", "outliers"))

These panels suggest all sorts of problems with the residuals. In the upper left panel, the qq-plot shows some
very serious departures from normality. In the upper right panel, the large spike of points is not centered on
zero. So, it appears Assumption #2 of normality has been violated. Similarly, in the lower left hand figure, which
plots the residuals against the fitted values, the variance is not constant—the spread of the data increases with
the fitted values (ie, from left to right). This violates the assumption of homogeneity of variance (Assumption
#3). So, we conclude that this model (despite having a good R²) is not appropriate given the data.

So, what can we do about this?

The standard way of treating both of these problems (ie, non-normality, non-constant variance) is to transform
the data onto a different scale. There are a variety of ways to do this, but the most common, which we will use
here, is to log-transform the data. That is, to take the natural logarithm (logₑ) of the data. Taking logarithms can
reduce departures from normality associated with exponential scaling issues and stabilise the variance of the data.

To do this we create two new variables logMass and logFMR in the mammals dataset:

mammals$logMass <- log(mammals$Mass)

mammals$logFMR <- log(mammals$FMR)

We can now plot them:

m2 <- ggplot(mammals, aes(x=logMass, y=logFMR)) + geom_point() + geom_smooth(method="lm")

ggMarginal(m2, type="densigram", size=7)

That certainly looks much better and the data are spread across the scale of observations much more evenly.
So, log-transforming the data has removed the extreme skew. Now we need to see if this improves the quality
of the model:

mammals_lm2 <- lm(logFMR ~ logMass, data=mammals)

summary(mammals_lm2)

##
## Call:
## lm(formula = logFMR ~ logMass, data = mammals)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.38778 -0.30330 -0.01868  0.33249  0.96233
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.57269    0.12287   12.80   <2e-16 ***
## logMass      0.73412    0.01924   38.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.488 on 77 degrees of freedom
## Multiple R-squared:  0.9498, Adjusted R-squared:  0.9491
## F-statistic: 1456 on 1 and 77 DF,  p-value: < 2.2e-16

check_model(mammals_lm2, check=c("qq", "normality", "ncv", "outliers"))

Now the qq-plot looks much better! There are a few values that are a bit distant from the expected values
under a normal distribution, but with 79 data points that shouldn’t matter too much. The distribution of residuals
(upper right figure) looks pretty good, too. So, that means we’re much more comfortable with Assumption #2
(normality).

In terms of homogeneity of variance (Assumption #3), the lower left plot shows a more consistent spread of
data across the full range of fitted values. So, we are also more comfortable with meeting that assumption. It is
worth noting that there is a bit of a bend in the residuals, which suggests that the log-transformed data may be
slightly non-linear. But this is a major improvement over the untransformed data and certainly meets the
assumptions required for linear regression.

The last thing that we need to do is think about the model. The original model was:

FMR = β₀ + β₁·Mass

In log-transforming the data, we have created a slightly different model:

log(FMR) = β₀ + β₁·log(Mass)

To transform this back to a linear scale, we need to exponentiate both sides. This will give us:

FMR = e^β₀ · Mass^β₁

It’s also worth asking how to interpret log-transformed parameters. We can just plug values into the back-
transformed equation above. For example, when body mass is 1 gram:

FMR = e^1.573 · 1^0.734
    = 4.82 · 1
    = 4.82

If body mass is 10 grams:

FMR = e^1.573 · 10^0.734
    = 4.82 · 5.422
    = 26.13

And so on…
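You can also do this back-transformation in R using the fitted coefficients, which avoids rounding errors from re-typing the estimates. A minimal sketch:

# Back-transform predictions from the log-log model to the original scale
b <- coef(mammals_lm2)        # b[1] = intercept, b[2] = slope
mass <- c(1, 10, 100)         # body masses in grams
exp(b[1]) * mass^b[2]         # FMR = e^b0 * Mass^b1

# Equivalently, predict on the log scale and exponentiate
exp(predict(mammals_lm2, newdata=data.frame(logMass=log(mass))))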

A worked example – III. Tree diameter and height


Disturbances are important for regenerating forests. After a disturbance such as fire, tree seeds will germinate,
develop into seedlings, and grow rapidly in the resource-rich environment. During the early phase of stand
development the diameter and height of individual trees are often linearly related. Over time height growth
slows and often reaches a maximum or asymptotic height determined by site quality (ie, how fertile the soils
are). However, diameter growth usually continues for decades or centuries. As a consequence the relationship
between tree diameter and height is not always linear. Here we explore the relationship between the diameters
and heights of Jack Pine (Pinus banksiana) in northern Ontario, Canada using the trees data set. The data
set includes measurements of diameter at breast height (in centimetres) and height (in metres) for 80 trees.
Let’s have a quick look at the dataset:

head(trees)

##        dbh    height
## 1  5.72632  1.181644
## 2 24.47893 18.530510
## 3 88.38823 24.288027
## 4 16.33033 15.040940
## 5 63.23994 23.343080
## 6 20.65425 14.472922

Now let’s plot the data and add the marginal distributions as histograms:

t1 <- ggplot(data=trees, aes(x=dbh, y=height)) +
  geom_point() +
  labs(x='DBH (cm)', y='Height (m)')

ggMarginal(t1, type="histogram", size=7)

There is clearly a strong non-linear relationship between stem diameter and tree height. Let’s see what the
linear model shows us:

trees_lm <- lm(height ~ dbh, trees)

summary(trees_lm)

##
## Call:
## lm(formula = height ~ dbh, data = trees)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.3226  -2.4824   0.9134   2.9536   5.8930
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.25443    0.96336   11.68   <2e-16 ***
## dbh          0.17506    0.01611   10.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.005 on 78 degrees of freedom
## Multiple R-squared:  0.6021, Adjusted R-squared:  0.597
## F-statistic: 118 on 1 and 78 DF,  p-value: < 2.2e-16

The residuals are centered close to zero and the 1st and 3rd quartiles are similar in magnitude. The absolute
value of the min is about twice that of the max, so perhaps some issues with the residuals. The parameter
estimate for the slope of the model is very significant (ie, different from zero), but we would expect that from
looking at the data. The adjusted R² isn’t fantastic, but it isn’t terrible either.

Now let’s check the model diagnostics to see how the assumptions hold up:

check_model(trees_lm, check=c("qq", "normality", "ncv", "outliers"))

The upper panels suggest that there are some issues with normality. The tail ends of the qq-plot are outside
the confidence interval and the middle points look like they may be above it, too. The distribution of residuals in
the upper right panel is a bit odd because it is cut off abruptly at about 6, but tails out to the left quite typically.

The big give-away, though, is the lower left panel, which looks at the pattern of the residuals against the fitted
values. These are clearly not evenly distributed along the horizontal line at zero. The smaller fitted values are
well below zero, the middle values are well above zero, and the larger values are all below zero. This suggests
– as is quite obvious from the original plot of the data – that these data are not well fit by a straight line using
linear regression. In this situation, we would next explore predictor variable transformations. And if that still
does not produce data that fit the model assumptions, then we could try using a non-linear regression model
to fit these data – which we will not attempt today, but know that it is a possibility.
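As a hedged illustration of what a predictor transformation might look like here (one candidate, not necessarily the best model for these data), you could log-transform dbh only, since height appears to rise steeply at small diameters and then flatten out:

# One candidate fix: log-transform the predictor only, then re-check diagnostics
trees_lm2 <- lm(height ~ log(dbh), data=trees)
summary(trees_lm2)
check_model(trees_lm2, check=c("qq", "normality", "ncv", "outliers"))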

Problem sets
Questions 4 and 9 are worth 2 points each; all other questions are worth 1 point each.

Fertilizer data
An agriculturalist was interested in the effects of fertilizer application on the yield of grass. Grass seed was
sown uniformly across a field. Different amounts of commercial fertilizer (in g/m²) were applied to each of ten
1 m² plots located randomly within the field. Two months later the grass from each plot was harvested, dried,
and weighed (in g/m²). The data are in fertilizer .

The biological hypothesis that we are testing is that fertilizer application increases grass productivity.
The statistical hypothesis that we are testing is that the slope parameter (β₁) on fertilizer application is non-zero.

Question 1: Test the assumptions of linear regression using a scatterplot of YIELD against FERTILIZER. Is
there any evidence of violations of the simple linear regression assumptions?

If there is no evidence that the regression assumptions have been violated, fit the linear model

YIELD = β₀ + β₁·FERTILIZER + ϵᵢ

Question 2: Examine the regression diagnostics. Do any of the diagnostic plots indicate potential problems
with the data?

Question 3: If there is no evidence that any of the assumptions have been violated, examine the regression
model output. Identify and interpret the following:

1. y-intercept
2. slope (including the statistical significance)
3. R-squared value

Question 4: What conclusions can you draw from your analysis? Consider both the biological and statistical
hypotheses being tested. What is the final equation for the model? (2 points)

Question 5: Make a plot of the data and the linear model. Make sure that the axes are labelled correctly with
the name of the relevant variables and the units of measurement.

Mussel data
An ecologist was interested in investigating the relationship between the number of individuals of invertebrates
living in clumps of mussels on a rocky intertidal shore and the area of those mussel clumps. The data are in
mussels .

The biological hypothesis that we are testing is that the number of invertebrates increases with the size of
the mussel clumps.
The statistical hypothesis that we are testing is that the slope parameter (β₁) on clump area is non-zero.

Question 6: Test the assumptions of linear regression using a scatterplot of INDIV against AREA. Is there any
evidence of violations of the simple linear regression assumptions? If so, which assumption is being violated?
Describe how you made that determination.

Question 7: If there is evidence of violated assumptions, what is your proposed solution? Run your proposed
linear regression model and use it to answer the following questions.

Question 8: Examine the regression diagnostics from your model. Do any of the diagnostic plots indicate
potential problems with the data? If not, examine the regression model output. Identify and interpret the
following:

1. y-intercept
2. slope (including the statistical significance)
3. R-squared value

Question 9: What conclusions can you draw from your analysis? Consider both the biological and statistical
hypotheses being tested. What is the final equation for the model? If you transformed the model, make sure
that you address that. (2 points)

Question 10: Make a plot of the data and the linear model. Make sure that the axes are labelled correctly with
the name of the relevant variables and the units of measurement.
