
Topic 7

Linear regression models

Departamento de Estadística e Investigación Operativa


Universitat de València
Topic 7. Linear regression models
7.1 Description of the linear relationship between two numerical
variables
Dispersion diagram
Covariance and linear correlation coefficient

7.2 Regression line

7.3 Statistical inference for regression

7.4 Confidence interval for the prediction

7.5 Other regression models: multiple linear regression

Chapter 12, Samuels et al. (2012)


7.1 Description of the linear relationship
between two numerical variables
We are interested in describing and/or quantifying the possible
linear relationship between two numerical variables X and Y.
Data: n pairs of observations (x_i, y_i) with the values of both
variables for each one of the n individuals in the sample.
 Example 1
A study has been conducted in the Valencian Community with
4-year-old children, with the objective of studying whether there is
a relationship between the time spent in physical activities (TAF,
minutes/day) and other factors such as:
- The number of daily hours of sleep (Case 1)
- The number of daily hours the child watches television (Case 2)
- The number of daily hours the child spends playing with consoles (Case 3)
n = 55
Description of the linear relationship
between two numerical variables

 Example 2
A study was conducted in Valencia aimed at finding determinants of
birth weight. In that study, a multitude of socio-demographic and
lifestyle variables were collected for a total of 160 newborns.
Specifically, we want to predict the weight of the newborn
according to his size.
n = 160
7.1.1 Dispersion diagram
 It is a two-dimensional graphical representation of the data (a scatter plot)
 It allows us to confirm (visually) the existence of a linear relationship between the
variables X and Y, depending on the resemblance of the point cloud to a
straight line

[Three example scatter plots: direct linear relationship, inverse linear relationship, no linear relationship]
Dispersion Diagram
Gráficas -> Diagrama de dispersión

Command: scatterplot(TAF ~ sueño, data=Datos)
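
In script form, the same plot can be obtained with base R. This is a minimal sketch, assuming the data frame Datos with columns TAF and sueño from Example 1:

# Base-R dispersion diagram for Example 1 (case 1)
plot(TAF ~ sueño, data = Datos,
     xlab = "Sleep (hours/day)", ylab = "TAF (minutes/day)")
abline(lm(TAF ~ sueño, data = Datos))  # superimpose the least-squares line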


7.1.2 Covariance and linear correlation coefficient
Definitions:

$$SS_X = \sum_{i=1}^{n}(x_i - \bar{x})^2 \;\rightarrow\; S_X^2 = \frac{SS_X}{n-1} \quad \text{(sample variance of } X)$$

$$SS_Y = \sum_{i=1}^{n}(y_i - \bar{y})^2 \;\rightarrow\; S_Y^2 = \frac{SS_Y}{n-1} \quad \text{(sample variance of } Y)$$

$$SP_{XY} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \;\rightarrow\; S_{XY} = \frac{SP_{XY}}{n-1} \quad \text{(covariance)}$$

 The sign of the covariance indicates the direction of the relationship:
- if it is positive, the relationship is direct
- if it is negative, the relationship is inverse
 If the covariance is approximately 0, there is no linear relationship
 Weakness: the covariance depends on the units of X and Y, and so it
does not allow us to quantify the strength of the linear association
Covariance and linear correlation coefficient
Definition:

$$r = \frac{SP_{XY}}{\sqrt{SS_X \cdot SS_Y}} \quad \text{(correlation coefficient)}$$

 It does not depend on the units of the variables: $r \in [-1, 1]$, and so
it allows us to quantify the strength of the linear association
 Interpretation:
 If r > 0, the association is direct. The closer the correlation is to 1, the stronger
the linear relationship. In particular, if r = 1, the association is perfect.
 If r < 0, the association is inverse. The closer the correlation is to -1, the stronger
the linear relationship. In particular, if r = -1, the association is perfect.
 If r = 0, there is no linear relationship.

 We can consider statistical inference based on r. In some investigations,
we are interested in testing the hypothesis H0: ρ = 0, where ρ is the
correlation coefficient in the population.
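
A minimal base-R sketch of these definitions and of the test of H0: ρ = 0, on illustrative (hypothetical) data; cor.test implements the t test on r:

# Covariance and correlation from their definitions, checked against
# cov()/cor(), plus the test of H0: rho = 0 (hypothetical data)
x <- c(8, 9, 7, 10, 9, 8, 11, 10)
y <- c(150, 180, 120, 210, 170, 140, 230, 200)
n <- length(x)
Sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # covariance
r   <- Sxy / (sd(x) * sd(y))                          # correlation coefficient
all.equal(c(Sxy, r), c(cov(x, y), cor(x, y)))         # TRUE
cor.test(x, y)                                        # two-sided test of rho = 0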
Covariance and linear correlation coefficient
H0 : ρ = 0
HA : ρ ≠ 0

[Four example scatter plots of Y1, Y2, Y3 and Y4 against X, each with n = 149:
- Y1 vs X: r = 0.961, p-value < 0.0001 (strong direct linear relationship)
- Y3 vs X: r = 0.452, p-value = 0.0003 (moderate direct linear relationship)
- Y2 vs X: r = −0.995, p-value < 0.0001 (strong inverse linear relationship)
- Y4 vs X: r = 0.025, p-value = 0.761 (no linear relationship)]
Covariance and linear correlation coefficient
Estadísticos -> Resúmenes -> Matriz de correlaciones

 We must select all the variables that we want to study
 We must select the option 'p-valores pareados'
 Example 1:
Pearson correlations:
consolas sueño TAF television
consolas 1.0000 -0.0201 -0.2336 0.3135
sueño -0.0201 1.0000 0.5634 -0.2619
TAF -0.2336 0.5634 1.0000 -0.5393
television 0.3135 -0.2619 -0.5393 1.0000

Number of observations: 55

Pairwise two-sided p-values:


consolas sueño TAF television
consolas 0.8839 0.0860 0.0198
sueño 0.8839 <.0001 0.0534
TAF 0.0860 <.0001 <.0001
television 0.0198 0.0534 <.0001
Covariance and linear correlation coefficient

- Consolas and TAF: r = −0.234 (p = 0.086): inverse association; the more the child plays with
consoles, the less physical activity he performs. The association is weak (not significant).

- TV and TAF: r = −0.539 (p < 0.001): inverse association; the more hours the child watches TV,
the less physical activity he performs. The association is strong (significant).

- Sleep and TAF: r = 0.563 (p < 0.001): direct association; the more hours the child sleeps, the
more physical activity he performs. The association is strong (significant).
Covariance and linear correlation coefficient
Example 2:
Dispersion diagram and correlation coefficient for the weight of the newborn
according to his size

> rcorr.adjust(Dataset[,c("peso","talla")],
type="pearson", use="complete")

Pearson correlations:
peso talla
peso 1.0000 0.7542
talla 0.7542 1.0000

Number of observations: 160

Pairwise two-sided p-values:


peso talla
peso <.0001
talla <.0001

There is a direct linear association between the size and the weight of the newborn
(r = 0.75), and the association is significant (p < 0.0001).
7.2. Regression line
Given two numerical variables X and Y that are linearly related,
we intend to obtain the best line to explain Y from X:

$$\hat{Y} = b_0 + b_1 X$$

Definitions:
 X: explanatory (independent) variable
 Y: dependent (response) variable

 This line is known as the regression line.
 The classical criterion to define this regression line is the least-squares criterion.
Regression line
Least-squares criterion: the best straight line is the one that minimizes
the residual sum of squares.

$$\hat{y}_i = b_0 + b_1 x_i \ \text{(estimated value)}, \qquad e_i = y_i - \hat{y}_i \ \text{(error: real value minus estimated value)}$$

$$SS_e = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2 \quad \text{(residual sum of squares)}$$

The values of $b_0$ and $b_1$ that minimize $SS_e$ are:

$$b_1 = \frac{SP_{xy}}{SS_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
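
A minimal R sketch of these formulas, checked against lm() on hypothetical data:

# Least-squares coefficients from the formulas above, compared with lm()
x <- c(48, 50, 52, 54, 49, 51, 53, 55)
y <- c(2900, 3100, 3350, 3600, 3000, 3250, 3500, 3800)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # SPxy / SSx
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))  # same values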
Regression line
Interpretation:

[Illustration: regression line showing the intercept $b_0$ (value of $\hat{y}$ at x = 0) and the slope $b_1$ (change in $\hat{y}$ per unit increase in x)]

- $b_1$ is the slope: rate of change of Y with respect to X
- $b_0$ is the intercept: estimation of Y when X = 0. Sometimes, it does not make sense
- The regression line passes through the joint mean $(\bar{x}, \bar{y})$
Regression line
Example 2:
Estadísticos -> Ajuste de modelos -> Regresión lineal
lm(formula = peso ~ talla, data=pesonac)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5277.59 593.92 -8.886 1.35e-15 ***
talla 169.87 11.77 14.437 < 2e-16 ***

Weight = −5277.59 + 169.87 × size

For each additional centimeter of height at birth, an increase of 169.87 grams in the
weight of the newborn is estimated.
A newborn with a size of 0 centimeters would correspond to a weight of −5277.59 grams.

This is an example where the intercept does not make sense.
Regression line: Linear model
The values of the coefficients of the regression line depend on the data (x_i, y_i)
observed in the experiment; that is, each sample gives a different equation for
the regression line. However, the model (the biological association that we intend to
characterize) is invariable and universal, even though it is unknown.

A universal linear relationship for variables y_i and x_i would be formulated as:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

 Coefficients $\beta_0$, $\beta_1$ are unknown parameters, and so they must be estimated
 The $\varepsilon_i$ term represents random error. We include this term in the model to
reflect the fact that y varies even when x is fixed

To make inference about the parameters of the model $\beta_0$, $\beta_1$, the following
conditions are necessary: linearity, independence, normality, and homoscedasticity.
7.3. Statistical inference for regression
Confidence intervals:

 For the intercept: $IC_{1-\alpha}(\beta_0)$

 For the slope: $IC_{1-\alpha}(\beta_1)$

Hypothesis testing:

 For the intercept: H0: β0 = 0

 For the slope: H0: β1 = 0

The test on β1 is known as the linearity test, and it is very relevant: note that accepting
the null hypothesis (that is, the possibility that the slope is 0) indicates the uselessness
of the model.

The test on the intercept is not commonly used.

Statistical inference for regression
The estimate of the slope can be formulated as a function of the sample
correlation coefficient:

$$\hat{\beta}_1 = b_1 = \frac{S_{xy}}{S_x^2} = \frac{S_{xy}}{S_x S_y} \cdot \frac{S_y}{S_x} = r \, \frac{S_y}{S_x},$$

where r estimates the correlation coefficient in the population, ρ (which
measures the real linear association between x and y).

So H0: β1 = 0 is equivalent to H0: ρ = 0, and HA: β1 ≠ 0 is equivalent to HA: ρ ≠ 0.

If we fail to reject the null hypothesis, it does not mean that there is no
association between the two variables: there may be a nonlinear association.
If we reject the null hypothesis, we conclude that there is a linear association,
but it may be weak.
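
This equivalence can be checked in R: in simple linear regression, the two-sided p-value for the slope in lm() coincides with the p-value of the correlation test (a sketch on hypothetical data):

# H0: beta1 = 0 and H0: rho = 0 give the same two-sided p-value
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30, sd = 4)
fit <- lm(y ~ x)
summary(fit)$coefficients["x", "Pr(>|t|)"]  # p-value of the linearity test
cor.test(x, y)$p.value                      # identical value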
Statistical inference for regression
Estadísticos -> Ajuste de modelos -> Regresión lineal
Example 2:
>lm(formula = peso ~ talla, data=pesonac)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5277.59 593.92 -8.886 1.35e-15 ***
talla 169.87 11.77 14.437 < 2e-16 ***

Weight = −5277.59 + 169.87 × size

For each additional centimeter of height at birth, an increase
of 169.87 grams in the weight of the newborn is estimated.
This increase is significant.

The command Confint allows us to obtain confidence intervals for the
coefficients:

> Confint(RegModel.1, level=0.95)
              Estimate      2.5 %     97.5 %
(Intercept) -5277.5916 -6450.6399 -4104.5433
talla         169.8742   146.6347   193.1136
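
In script form, base R's confint() gives the same intervals (assuming RegModel.1 is the fitted model from the slides):

# Base-R equivalent of the Rcmdr/car Confint command
# RegModel.1 <- lm(peso ~ talla, data = pesonac)
confint(RegModel.1, level = 0.95)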
Statistical inference for regression
Example 1 (case 2): Relationship between the number of daily hours of
television and TAF
lm(formula = TAF ~ television, data = datosTAF)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 662.39 54.85 12.076 < 2e-16 ***
television -55.32 11.86 -4.662 2.15e-05 ***

$$\hat{y} = 662.39 - 55.32\,x$$

- For each additional hour that the child spends watching television, a decrease
of 55.3 minutes in the time spent in physical activities is estimated. The
decrease is significant.
Statistical inference for regression
 In linear regression, we can identify the elements of the ANOVA table:

$$SS_y = \sum_i (y_i - \bar{y})^2 = \sum_i \bigl((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\bigr)^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2 = SS_e + \sum_i (\hat{y}_i - \bar{y})^2$$

- $SS_y$ is the total sum of squares: the numerator of the variance of y; it measures the initial variation
- $SS_e$ is the error sum of squares: the numerator of the residual variance; it measures the dispersion with respect to the prediction
- $\sum_i (\hat{y}_i - \bar{y})^2$ is the model sum of squares; it measures the variation in the mean response

The ANOVA table allows us to test whether the model is explanatory.
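
A short R sketch of this decomposition on hypothetical data:

# SSy = SSe + SSmodel for a fitted simple regression
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)
SSy <- sum((y - mean(y))^2)             # total sum of squares
SSe <- sum(residuals(fit)^2)            # error sum of squares
SSm <- sum((fitted(fit) - mean(y))^2)   # model sum of squares
all.equal(SSy, SSe + SSm)               # TRUE
anova(fit)                              # the same quantities in table form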


Statistical inference for regression
 Coefficient of determination: statistic that represents the proportion of the
variance of Y that is explained by the linear regression model.

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{SS_y} = 1 - \frac{SS_e}{SS_y}; \qquad 0 \le R^2 \le 1$$

 If $R^2 = 0$, then $\sum_i (\hat{y}_i - \bar{y})^2 = 0$; that is, the model explains nothing about Y.
A value of $R^2$ close to 0 indicates a low explanatory capacity of the model.

 If $R^2 = 1$, then $\sum_i (\hat{y}_i - \bar{y})^2 = SS_y$; that is, the model provides a perfect fit.
A value of $R^2$ close to 1 indicates a high explanatory capacity of the model.
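
In R, the formula can be verified directly against the value reported by summary() (hypothetical data):

# R^2 = 1 - SSe/SSy, checked against summary(fit)$r.squared
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)
R2 <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
all.equal(R2, summary(fit)$r.squared)   # TRUE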


Statistical inference for regression
Example 2: Summarized interpretation
> lm(peso~talla, data=pesonac)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5277.59 593.92 -8.886 1.35e-15 ***
talla 169.87 11.77 14.437 < 2e-16 ***

Residual standard error: 295.5 on 158 degrees of freedom


Multiple R-squared: 0.5688, Adjusted R-squared: 0.5661
F-statistic: 208.4 on 1 and 158 DF, p-value: < 2.2e-16

- The variance of Y that is not explained from X is 295.5²; that is, the standard
deviation of variable Y | X is 295.5 (a quantity in absolute units, the same units as Y)

- The % of the variance of Y that is explained from X is 56.88% (a quantity in relative
units, a percentage)

- The model is significant; that is, the percentage of the variance of weight that is
explained by the size of the newborn is not null
Statistical inference for regression
Example 1 (case 2): Summarized interpretation

lm(formula = TAF ~ television, data = datosTAF)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 662.39 54.85 12.076 < 2e-16 ***
television -55.32 11.86 -4.662 2.15e-05 ***
---

Residual standard error: 114.4 on 53 degrees of freedom


Multiple R-squared: 0.2909, Adjusted R-squared: 0.2775
F-statistic: 21.74 on 1 and 53 DF, p-value: 2.152e-05

- The variance of Y that is not explained from X is 114.4²; that is, the standard
deviation of variable Y | X is 114.4 (this is not very informative if we do not
know the sd of Y, here 134.6)

- The % of the variance of Y that is explained from X is 29.09%: the number of daily
hours watching television explains 29.09% of TAF

- The model is significant; that is, the % of TAF explained from the variable television
is not null
7.4. Confidence interval for the prediction
 One of the main objectives of linear regression is prediction, i.e.
to "approximate" the value of Y for a given value of X (X = x0)
 The model hypotheses (linearity, normality, homoscedasticity and independence)
allow us to calculate confidence intervals
 In particular, we want to predict the mean of Y when X = x0
 That is, we want to predict the random variable Y | X = x0, whose mean is
estimated by $\hat{y}_0 = b_0 + b_1 x_0$

Example 2: Let us assume that we know the size of the newborn, 54 centimeters:

Weight = −5277.59 + 169.87 × 54 = 3895.6 grams
Confidence interval for the prediction
Modelos -> Predecir usando el modelo activo-> Introducir los datos y predecir

 We get an empty matrix and we must introduce the value of variable X in the
corresponding column and close the window. For Example 1 (case 2), entering
television = 3.5 hours:

> predict(RegModel.1, .data)
       1
468.7794

Modelos -> Intervalo de confianza método delta

 It allows us to obtain a CI for the mean of Y | X = x0. We must formulate the model:

> DeltaMethod(RegModel.1, "b0+b1*3.5", level=0.95)
   parameter name
(Intercept)    b0
television     b1

               Estimate       SE    2.5 %   97.5 %
b0 + b1 * 3.5  468.7794 19.01361 431.5134 506.0454
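
In script form, base R's predict() gives the same confidence interval for the mean of Y at X = x0 (assuming RegModel.1 is lm(TAF ~ television, data = datosTAF) from the slides):

# Confidence interval for the mean of Y | X = 3.5, without the delta method
predict(RegModel.1, newdata = data.frame(television = 3.5),
        interval = "confidence", level = 0.95)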
7.5. Multiple linear regression
 Linear regression with m (m > 1) explanatory variables {x_j}, j = 1, …, m
 The hypotheses of the model are the same: linearity, independence,
normality and homoscedasticity
 The model is expressed as:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_m x_{mi} + \varepsilon_i$$

 Now, there are m + 1 coefficients that we have to estimate: $\beta_0, \beta_1, \dots, \beta_m$.

 Applicability conditions: the explanatory variables have to be continuous.
However, because of the CLT, the explanatory variables can also be
dichotomous. This allows for the introduction of categorical variables using
several dichotomous variables
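
A small sketch of how R codes a dichotomous factor as a 0/1 dummy variable in the design matrix (illustrative data; the variable names follow the slides' examples):

# The factor 'sexo' becomes the 0/1 column 'sexoniño' in the design matrix
d <- data.frame(peso  = c(3200, 2900, 3400, 3100),
                talla = c(50, 48, 52, 49),
                sexo  = factor(c("niña", "niño", "niño", "niña")))
model.matrix(peso ~ talla + sexo, data = d)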
Multiple linear regression
Why? If the model has been fitted with the objective of predicting the response
variable, multiple linear regression may better explain the data and, as a
consequence, improve the predictive capability.

> lm(formula = peso ~ sexo, data = pesonac)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3214.01      50.94  63.094   <2e-16 ***
sexo[T.niño]   145.57      70.30   2.071     0.04 *

Residual standard error: 444.1 on 158 degrees of freedom
Multiple R-squared: 0.02642, Adjusted R-squared: 0.02026
F-statistic: 4.287 on 1 and 158 DF, p-value: 0.04003

Compared with this simple model, the multiple regression below increases the
variance that is explained, and the model is more significant.

> lm(formula = peso ~ sexo + paridad + pesom + preterm + talla,data = pesonac)


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4528.214 585.887 -7.729 1.31e-12 ***
sexo[T.niño] 38.662 45.401 0.852 0.3958
paridad 76.602 28.074 2.729 0.0071 **
pesom 2.208 1.635 1.351 0.1787
preterm[T.sí] -520.850 107.887 -4.828 3.30e-06 ***
talla 151.485 11.829 12.807 < 2e-16 ***

Residual standard error: 276.5 on 154 degrees of freedom


Multiple R-squared: 0.6322, Adjusted R-squared: 0.6203
F-statistic: 52.95 on 5 and 154 DF, p-value: < 2.2e-16
Multiple linear regression
Why? If the model has been fitted with the objective of explaining the effect of
one variable, multiple regression removes the effect of other related factors.

In the two models above, the relationship between the gender of the newborn and
his weight may be explained by the relationship between the gender and the size:
once talla enters the model, sexo is no longer significant (p = 0.3958).
Multiple linear regression
Example 2 (multiple regression): Explain the weight of the newborn
according to his gender, his condition of prematurity (≤ 37 weeks) and the
weight of his parents.

 We must select the variables, starting with the response variable, which will
appear before the ~ symbol in the equation
 We cannot select the menu "Regresión lineal" unless the dichotomous
variables are numerical and are not identified as factors.
Not recommended!
Multiple linear regression
> summary(LinearModel.1)
Call:
lm(formula = peso ~ pesom + pesop + sexo + preterm, data = pesonac2)

Residuals:
Min 1Q Median 3Q Max
-1234.64 -283.65 7.56 236.78 1204.17

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2871.159 258.758 11.096 < 2e-16 ***
pesom 5.280 2.413 2.188 0.03014 *
pesop 0.409 2.757 0.148 0.88224
sexo[T.niño] 178.927 65.248 2.742 0.00682 **
preterm[T.sí] -785.960 151.354 -5.193 0.000000642 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 409.8 on 155 degrees of freedom


Multiple R-squared: 0.1866, Adjusted R-squared: 0.1656
F-statistic: 8.892 on 4 and 155 DF, p-value: 0.000001724

Interpretation of the coefficients:

- For each additional kilogram of the mother, the newborn weighs 5.3 grams more, and this increase
is significant
- For each additional kilogram of the father, the newborn weighs 0.4 grams more, and this increase
is not significant
- Male newborns weigh 178.9 grams more than female newborns, and this difference is significant
- Premature babies weigh 785.9 grams less than full-term babies, and this decrease is significant
Multiple linear regression

Interpretation of the coefficients (continued):

- "Significant" means that the corresponding coefficient is significantly different from 0. The
non-significant variables, in the presence of the others, do not provide any information
- So, they could be removed from the model (but never in blocks: remove them one at a time,
refitting the model after each removal)
- The order of the p-values allows us to rank the variables according to their relevance
(smaller p-value, more relevant)
Multiple linear regression
Confidence intervals for the coefficients

> Confint(LinearModel.1, level=0.95)


Estimate 2.5 % 97.5 %
(Intercept) 2871.1593917 2360.0128245 3382.305959
pesom 5.2798947 0.5137154 10.046074
pesop 0.4090015 -5.0364105 5.854414
sexo[T.niño] 178.9271781 50.0363552 307.818001
preterm[T.sí] -785.9602712 -1084.9434952 -486.977047

Note that the value 0 is not included in the
confidence intervals associated with the
significant variables (if the confidence level
corresponds to the significance level α).
Multiple linear regression
Interpretation of the overall fit

 According to the ANOVA test, the model is significant
 However, the % of explained variance is only 18.66%
 The % of variance that is not explained by the model, that is, 81.34%, corresponds
to 409.8²

To consider:
- According to the principle of parsimony, it is convenient to check the adjusted R², which
penalizes the original coefficient for each added explanatory variable.
- That the model explains "something" (significant ANOVA test) does not necessarily imply
that it explains "a lot" (R² may be small).
Multiple linear regression
Prediction

Modelos -> Predecir usando el modelo activo -> Introducir datos y predecir

> predict(LinearModel.1, .data)
       1        2
2434.713 3220.673

 We must introduce the values for which we want to predict.
 Here, we have predicted the weight for pesom=60, pesop=80, sexo="niña" and both
possibilities of preterm ("sí" and "no"). The result is obtained by substituting the
corresponding values of the explanatory variables in the model equation:

$$\widehat{peso}(1) = 2871.16 + 5.28 \cdot 60 + 0.41 \cdot 80 + 178.93 \cdot 0 - 785.96 \cdot 1 = 2434.7$$
$$\widehat{peso}(2) = 2871.16 + 5.28 \cdot 60 + 0.41 \cdot 80 + 178.93 \cdot 0 - 785.96 \cdot 0 = 3220.7$$

Note that the difference between the two predictions is exactly the coefficient of "preterm".
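
The same two predictions can be obtained in script form with predict() and a new data frame (assuming LinearModel.1 is the fitted model from the slides; the factor levels must match those of the original data):

# Predictions for pesom = 60, pesop = 80, sexo = "niña", preterm = "sí"/"no"
nd <- data.frame(pesom = 60, pesop = 80,
                 sexo = "niña", preterm = c("sí", "no"))
predict(LinearModel.1, newdata = nd)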
Multiple linear regression
Confidence interval for the prediction

 We must enter the formula based on the concrete values of the predictors and the name
assigned by R to each parameter.
 Here, a 95% confidence interval for the mean birth weight corresponding to
pesom=60, pesop=80, sexo="niña" and preterm="no":

> DeltaMethod(LinearModel.1, "b0+b1*60+b2*80+b3*0+b4*0", level=0.95)
                                          Estimate       SE   2.5 %   97.5 %
b0 + b1 * 60 + b2 * 80 + b3 * 0 + b4 * 0  3220.673 47.54866 3127.48 3313.867
Multiple linear regression
Graphics of the effects
We can represent the estimated effects together with their confidence intervals for the
values in the sample:

Modelos -> Gráficas -> Gráfica de los efectos
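
In script form, this menu relies on the effects package (assuming LinearModel.1 is the fitted multiple regression model):

# Effect plots with confidence bands, one panel per explanatory variable
library(effects)
plot(allEffects(LinearModel.1))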


Multiple linear regression
Example 1 (multiple regression): Explain the time the child spends in physical activities
(TAF) according to the variables: sleep (daily hours of sleep of the child), television (number
of daily hours the child spends watching television), consoles (daily hours the child spends
playing with consoles) and gender. In particular, answer the following questions:
- Is the model significant? What percentage of the variance is explained?
- Which is the most explanatory variable? Do you think we can remove some of the
variables?
- What is the effect of one additional hour of sleep on the TAF?
- With the selected model, get the prediction and its confidence interval for a girl with:
consoles = 3, sleep = 9 and television = 5
