Topic 7
Example 2
A study conducted in Valencia aimed to identify determinants of birth weight. In that study, a multitude of socio-demographic and lifestyle variables were collected for a total of 160 newborns (n=160). Specifically, we want to predict the weight of the newborn from its length.
7.1.1 Dispersion diagram (scatter plot)
A scatter plot is a two-dimensional graphical representation of the data. It allows us to confirm (visually) the existence of a linear relationship between the variables X and Y, depending on how closely the point cloud resembles a straight line.
[Figure: four scatter plots of Y1, Y2, Y3 and Y4 against X, n=149 in each panel]
- Y1 vs X: r=0.961, p-value<0.0001 (strong direct linear association)
- Y3 vs X: r=0.452, p-value=0.0003 (moderate direct linear association)
- Y2 vs X: r=-0.995, p-value<0.0001 (strong inverse linear association)
- Y4 vs X: r=0.025, p-value=0.761 (no linear association)
Covariance and linear correlation coefficient
Estadísticos -> Resúmenes -> Matriz de correlaciones (Statistics -> Summaries -> Correlation matrix)
Number of observations: 55
- Consolas and TAF: r=-0.234 (p=0.086). Inverse association: the more the child plays with consoles, the less physical activity he performs. The association is weak (not significant).
- TV and TAF: r=-0.539 (p<0.001). Inverse association: the more hours the child watches TV, the less physical activity he performs. The association is strong (significant).
- Sleep and TAF: r=0.563 (p<0.001). Direct association: the more hours the child sleeps, the more physical activity he performs. The association is strong (significant).
Example 2:
Scatter plot and correlation coefficient for the weight of the newborn according to its length
> rcorr.adjust(Dataset[,c("peso","talla")],
type="pearson", use="complete")
Pearson correlations:
peso talla
peso 1.0000 0.7542
talla 0.7542 1.0000
There is a direct linear association between the length and the weight of the newborn (r=0.75), and the association is significant (p<0.0001).
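The Pearson correlation and its p-value, as reported by rcorr.adjust, can be reproduced outside R. A minimal sketch in Python using scipy.stats.pearsonr; the length/weight values below are made-up illustrative numbers, not the Valencia sample:

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements (length in cm, weight in g);
# illustrative values only, not the study data.
talla = [47.5, 48.0, 49.0, 50.0, 50.5, 51.0, 52.0, 53.0]
peso = [2800, 2900, 3050, 3200, 3300, 3400, 3600, 3750]

# pearsonr returns the correlation coefficient r and the two-sided
# p-value of the test H0: rho = 0
r, p = pearsonr(talla, peso)
print(f"r = {r:.4f}, p-value = {p:.4g}")
```

With strongly co-varying data like these, r is close to 1 and the p-value is far below 0.05, matching the interpretation pattern used in the slides.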
7.2. Regression line
Given two numerical variables X and Y that are linearly related,
we intend to obtain the best line to explain Y from X
Ŷ = b0 + b1·X
Definitions:
X: explanatory (independent) variable.
Y: dependent variable.
yᵢ: real (observed) value
ŷᵢ: estimated value
eᵢ = yᵢ − ŷᵢ: error (residual)
SSe = Σ eᵢ² = Σ (yᵢ − (b0 + b1·xᵢ))²   (sum over i = 1, …, n)

The least-squares estimates are the values of b0 and b1 that minimize SSe:

b1 = SPxy / SSx      b0 = ȳ − b1·x̄
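These formulas can be checked with a few lines of code. The sketch below computes b1 = SPxy/SSx and b0 = ȳ − b1·x̄ directly from their definitions, on invented data (not the course dataset), and then forms the residuals and SSe:

```python
import numpy as np

# Illustrative data, not the course dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

SPxy = np.sum((x - x.mean()) * (y - y.mean()))  # sum of cross-products
SSx = np.sum((x - x.mean()) ** 2)               # sum of squares of x

b1 = SPxy / SSx                 # slope
b0 = y.mean() - b1 * x.mean()   # intercept: the line passes through (x̄, ȳ)

# Residuals and the error sum of squares that least squares minimizes
e = y - (b0 + b1 * x)
SSe = np.sum(e ** 2)
print(b0, b1, SSe)
```

The same estimates are returned by any least-squares routine (e.g. np.polyfit(x, y, 1)), which is a quick way to verify the hand computation.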
Interpretation:
[Figure: regression line showing the fitted values ŷ0 and ŷ1; the slope b1 is the estimated change in ŷ when x increases by one unit]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5277.59 593.92 -8.886 1.35e-15 ***
talla 169.87 11.77 14.437 < 2e-16 ***
For each additional centimeter of height at birth, an increase of 169.87 grams in the
weight of the newborn is estimated.
A newborn with length 0 centimeters would correspond to a weight of −5277.59 grams.
This is an example where the intercept does not make sense.
Regression line: Linear model
The values of the coefficients of the regression line depend on the data (xi,yi)
observed in the experiment; that is, each sample gives a different equation for
the regression line. However, the model (biological association that we intend to
characterize) is invariable and universal, even though it is unknown.
A universal linear relationship for variables yᵢ and xᵢ would be formulated as:
yᵢ = β0 + β1·xᵢ + εᵢ
Coefficients β0, β1 are unknown parameters, and so they must be estimated.
The εᵢ term represents random error. We include this term in the model to reflect the fact that y varies, even when x is fixed.
To make inference about the parameters of the model β0, β1, the following conditions are necessary: linearity, independence, normality, and homoscedasticity.
7.3. Statistical inference for regression
Confidence intervals:
Hypothesis testing:
The test on β1 is known as the linearity test, and it is very relevant. Note that accepting the null hypothesis (that is, the possibility that the slope is 0) indicates that the model is useless.
H0: β1 = 0 vs HA: β1 ≠ 0    (equivalently, H0: ρ = 0 vs HA: ρ ≠ 0)
If we fail to reject the null hypothesis, it does not mean that there is no association between the two variables: there may be a nonlinear association.
If we reject the null hypothesis, we conclude that there is a linear association, but it may be weak.
Estadísticos -> Ajuste de modelos -> Regresión lineal (Statistics -> Fit models -> Linear regression)
Example 2:
> lm(formula = peso ~ talla, data = pesonac)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5277.59 593.92 -8.886 1.35e-15 ***
talla 169.87 11.77 14.437 < 2e-16 ***
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 662.39 54.85 12.076 < 2e-16 ***
television -55.32 11.86 -4.662 2.15e-05 ***
ŷ = 662.39 − 55.32·x
- For each additional hour that the child spends watching television, a decrease
of 55.3 minutes in the time spent in physical activities is estimated. The
decrease is significant.
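The t-test on the slope that R reports can be reproduced with scipy.stats.linregress, which returns the slope, its standard error, and the two-sided p-value of the linearity test. The data below are simulated (a linear signal plus noise, loosely shaped like the television example), not the course dataset:

```python
import numpy as np
from scipy.stats import linregress

# Simulated illustrative data: a true linear relationship plus noise
rng = np.random.default_rng(0)
x = np.arange(30, dtype=float)                            # hypothetical predictor
y = 660.0 - 55.0 * x + rng.normal(0, 40.0, size=x.size)   # signal + noise

res = linregress(x, y)
print(f"slope = {res.slope:.2f}, SE = {res.stderr:.2f}, p = {res.pvalue:.3g}")
# A small p-value rejects H0: beta1 = 0, i.e. the linear term is useful.
```

Because the simulated slope is strongly negative relative to the noise, the p-value is tiny and H0: β1 = 0 is rejected, mirroring the significant decrease reported in the slides.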
In linear regression, we can identify the elements of the ANOVA table:

SSy = Σᵢ (yᵢ − ȳ)² = Σᵢ ((yᵢ − ŷᵢ) + (ŷᵢ − ȳ))² = Σᵢ (yᵢ − ŷᵢ)² + Σᵢ (ŷᵢ − ȳ)² = SSe + SSm

- SSy: total sum of squares (numerator of the variance of y); it measures the initial variation.
- SSe = Σᵢ (yᵢ − ŷᵢ)²: error sum of squares (numerator of the residual variance); it measures the dispersion with respect to the prediction.
- SSm = Σᵢ (ŷᵢ − ȳ)²: model sum of squares; it measures the variation in the mean response.
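The decomposition SSy = SSe + SSm holds exactly for any least-squares fit, and can be verified numerically. A short sketch on invented data:

```python
import numpy as np

# Illustrative data, not the course dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8, 13.1])

b1, b0 = np.polyfit(x, y, 1)  # least-squares slope and intercept
y_hat = b0 + b1 * x

SSy = np.sum((y - y.mean()) ** 2)       # total sum of squares
SSe = np.sum((y - y_hat) ** 2)          # error sum of squares
SSm = np.sum((y_hat - y.mean()) ** 2)   # model sum of squares

print(np.isclose(SSy, SSe + SSm))  # True: the decomposition holds
```

This identity is what allows the ANOVA table to split the total variation of Y into an explained part and a residual part.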
- The variance of Y that is not explained from X is 295.5²; that is, the standard deviation of variable Y|X is 295.5 (a quantity in absolute units, the same units as Y).
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 662.39 54.85 12.076 < 2e-16 ***
television -55.32 11.86 -4.662 2.15e-05 ***
---
- The variance of Y that is not explained from X is 114.4²; that is, the standard deviation of variable Y|X is 114.4 (this is not very informative if we do not know the sd of Y, here 134.6).
An empty matrix appears; we must enter the value of variable X in the corresponding column and close the window.
predict(RegModel.1, .data)
1
468.7794
R name        parameter
(Intercept)   b0
television    b1
Now, there are m+1 coefficients that we have to estimate: β0, β1, …, βm.
We must select the variables, starting with the response variable, which will appear before the ~ symbol in the formula.
We cannot select the menu “Regresión lineal” unless the dichotomous variables are numerical and not identified as factors. Not recommended!
Multiple linear regression
> summary(LinearModel.1)
Call:
lm(formula = peso ~ pesom + pesop + sexo + preterm, data = pesonac2)
Residuals:
Min 1Q Median 3Q Max
-1234.64 -283.65 7.56 236.78 1204.17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2871.159 258.758 11.096 < 2e-16 ***
pesom 5.280 2.413 2.188 0.03014 *
pesop 0.409 2.757 0.148 0.88224
sexo[T.niño] 178.927 65.248 2.742 0.00682 **
preterm[T.sí] -785.960 151.354 -5.193 0.000000642 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To consider:
- According to the principle of parsimony, it is convenient to check the adjusted R², which penalizes the original coefficient for each added explanatory variable.
- The fact that the model explains “something” (significant ANOVA test) does not necessarily imply that it explains “a lot” (R² may be small).
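The mechanics of the multiple model, a design matrix with an intercept column and m+1 estimated coefficients, can be sketched with numpy's least-squares solver. The data below are simulated, only the structure (mother's weight, father's weight, a dummy-coded sex variable) mirrors the R model:

```python
import numpy as np

# Simulated illustrative data, not the course dataset
rng = np.random.default_rng(1)
n = 100
pesom = rng.normal(60, 8, n)    # hypothetical mother's weight (kg)
pesop = rng.normal(80, 10, n)   # hypothetical father's weight (kg)
sexo = rng.integers(0, 2, n)    # dummy coding: 0 = niña, 1 = niño
y = 2900 + 5.0 * pesom + 0.5 * pesop + 180 * sexo + rng.normal(0, 30.0, n)

# Design matrix: a column of ones (intercept) plus the m explanatory variables
X = np.column_stack([np.ones(n), pesom, pesop, sexo])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates of b0, b1, b2, b3
```

With the small noise used here, the estimates land close to the true simulated coefficients (5.0 for pesom, 180 for the sex dummy), illustrating how each bj is interpreted holding the other predictors fixed.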
Prediction
Modelos -> Predecir usando el modelo activo -> Introducir datos y predecir (Models -> Predict using active model -> Enter data and predict)
We must enter the formula based on the concrete values of the predictors and the name
assigned by R to each parameter.
We have calculated a 95% confidence interval for the mean birth weight corresponding to
pesom=60, pesop=80, sexo=“niña” and preterm=“no”.
> DeltaMethod(LinearModel.3,"b0+b1*60+b2*80+b3*0+b4*0", level=0.95)
Estimate SE 2.5 % 97.5 %
b0 + b1 * 60 + b2 * 80 + b3 * 0 + b4 * 0 3220.673 47.54866 3127.48 3313.867
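The point estimate in the DeltaMethod output can be reproduced by plugging the printed coefficient estimates into b0 + b1·60 + b2·80 (the small discrepancy with 3220.673 comes from rounding in the printed coefficient table):

```python
# Coefficients from the summary(LinearModel.1) output above
b0, b1, b2, b3, b4 = 2871.159, 5.280, 0.409, 178.927, -785.960

# pesom = 60, pesop = 80, sexo = "niña" (dummy 0), preterm = "no" (dummy 0)
estimate = b0 + b1 * 60 + b2 * 80 + b3 * 0 + b4 * 0
print(round(estimate, 3))  # 3220.679, vs 3220.673 from unrounded coefficients
```

The DeltaMethod call additionally propagates the coefficient covariances to obtain the standard error and the 95% confidence interval around this point estimate.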
Effect plots
We can represent the estimated effects, together with their confidence intervals, for the values in the sample: