
MULTIPLE REGRESSION

Slides prepared by: Leilani Nora (Assistant Scientist) and Violeta Bartolome (Senior Associate Scientist), PBGB-CRIL

REGRESSION ANALYSIS

• Used for explaining or modelling the relationship between a single dependent variable Y and one or more independent variables, X1, X2, …, Xp.
• Simple linear regression if p = 1
  $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
• Multiple regression if p > 1
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
Y (response or dependent variable): continuous
X1 and X2 (predictor, independent, or explanatory variables): can be continuous, discrete, or categorical.

Selection of Independent Variables

• Extent to which the variable contributes in explaining the variation in Y.
• Importance of the variable as causal agent in the process under study.

Scope of the Model

• Need to restrict the coverage of the model to some interval or region of values of the independent variable/s.
• The scope is determined either by the design of the investigation or by the range of data at hand.
EXAMPLE
• Y is family disbursement (in thousand pesos)
    total family income (X1 = income)
    food (X2 = food)
    household operation (X3 = house)
    fuel and light expenditure (X4 = fulyt)
    total number of family members (X5 = numem)
• All amounts are annual and expressed in thousand pesos
• Linear Model
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon$

ANOVA APPROACH TO REGRESSION ANALYSIS

SV          SS    DF     MS                 Fc
Regression  SSR   p-1    MSR = SSR/(p-1)    MSR/MSE
Error       SSE   n-p    MSE = SSE/(n-p)
TOTAL       SST   n-1
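The ANOVA decomposition can be checked by hand. The sketch below is only an illustration: it uses R's built-in mtcars data as a stand-in for the survey data (an assumption, since Fies.csv is not available here), and takes p as the number of estimated parameters, intercept included.

```r
# Assumption: built-in mtcars stands in for the survey data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))            # number of parameters, intercept included
y <- mtcars$mpg

SST <- sum((y - mean(y))^2)       # total sum of squares
SSE <- sum(resid(fit)^2)          # error (residual) sum of squares
SSR <- SST - SSE                  # regression sum of squares

MSR <- SSR / (p - 1)
MSE <- SSE / (n - p)
Fc  <- MSR / MSE

anova_tab <- data.frame(
  SV = c("Regression", "Error", "Total"),
  SS = c(SSR, SSE, SST),
  DF = c(p - 1, n - p, n - 1),
  MS = c(MSR, MSE, NA),
  Fc = c(Fc, NA, NA)
)
print(anova_tab)
```

The value of Fc computed this way should agree with the overall F statistic reported by summary(fit).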

COEFFICIENT OF DETERMINATION

• One measure of model fit is R2, the coefficient of determination, which measures the reduction of the total variation in Y associated with the predictors in the model.
  $R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
• Values closer to 1 indicate better fit
• For simple linear regression $R^2 = r^2$, the squared correlation between X and Y
• An alternative measure of fit is $\hat{\sigma}$, which is directly related to the standard errors of the estimates of β

DATAFRAME: Fies.csv

• Dataframe with 50 observations, 1 dependent variable (disburse) and 5 independent variables (income, food, house, fulyt, numem)
DATAFRAME: Fies.csv
Read data file Fies.csv
> DIS <- read.table("Fies.csv", header=TRUE, sep=",", row.names="OBS")
> str(DIS)
'data.frame': 50 obs. of 6 variables:
 $ disburse: num 158 227 335 181 141 ...
 $ income  : num 212 251 277 174 137 ...
 $ food    : num 69.8 96.8 75.8 60.7...
 $ house   : num 6.01 1.52 23.88 6.89...
 $ fulyt   : num 10.86 9.87 20.08 5.1 ...
 $ numem   : int 4 5 3 5 7 3 8 5 5 3 ...

MLREG using lm()
> gfit <- lm(disburse ~ ., data=DIS)
> summary(gfit)

ANOVA Table
> anova(gfit)

Response: disburse
          Df Sum Sq Mean Sq  F value    Pr(>F)
income     1 635639  635639 809.0628 < 2.2e-16 ***
food       1   7651    7651   9.7391  0.003182 **
house      1  21618   21618  27.5158 4.268e-06 ***
fulyt      1     46      46   0.0583  0.810301
numem      1   2632    2632   3.3496  0.074002 .
Residuals 44  34569     786
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

INTERPRETING PARAMETER ESTIMATES

• Naïve interpretation: “A unit change in X1 will produce a change of β1 in the response.”
• β1 is the effect of X1 when all other (specified) predictors are held constant.
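The "held constant" interpretation of a coefficient can be illustrated numerically. This is a minimal sketch on the built-in mtcars data (an assumption; the variable names wt and hp are not part of the survey example): raising one predictor by one unit while fixing the other changes the fitted value by exactly that predictor's coefficient.

```r
# Assumption: built-in mtcars stands in for the survey data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cases that differ by one unit of wt, with hp held constant
base  <- data.frame(wt = 3, hp = 120)
plus1 <- data.frame(wt = 4, hp = 120)

change <- unname(predict(fit, plus1) - predict(fit, base))

change                      # change in fitted response
unname(coef(fit)["wt"])     # estimated coefficient of wt
```

The two printed numbers coincide because the model is linear in its predictors.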
CONFIDENCE INTERVALS (CI) FOR β

• A (1-α)*100% confidence interval for β:
> confint(object, level)
# object – a fitted model object
# level – the confidence level required

CONFIDENCE INTERVALS FOR β
> CI.Beta <- confint(gfit)
> round(CI.Beta, digits=4)
               2.5 %  97.5 %
(Intercept) -16.1704 40.9888
income        0.5364  0.7169
food          0.4063  1.1491
house         2.5068  6.0633
fulyt        -2.5573  1.6359
numem        -9.2633  0.4461
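For an lm fit, the interval returned by confint() is the estimate plus or minus a t quantile times its standard error. The sketch below reproduces that by hand on the built-in mtcars data (an assumption, used only because the survey file is not available here).

```r
# Assumption: built-in mtcars stands in for the survey data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

alpha <- 0.05
est   <- coef(fit)                       # estimates of beta
se    <- sqrt(diag(vcov(fit)))           # their standard errors
tcrit <- qt(1 - alpha / 2, df = df.residual(fit))

ci_manual  <- cbind(lower = est - tcrit * se,
                    upper = est + tcrit * se)
ci_builtin <- confint(fit, level = 1 - alpha)

round(ci_manual, 4)
round(ci_builtin, 4)
```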

CI FOR PREDICTIONS : predict()

• predict() is a direct method for computing predicted response from the results of model fitting functions.
> predict(object, interval, level…)
# object – a fitted model object
# interval – the type of interval calculation, which can be "none", "confidence", or "prediction"
# level – the confidence level, default is 0.95

CI FOR PREDICTIONS

• A (1-α)*100% confidence interval for the future average response:
> xo <- data.frame(income=750, food=170, house=25, fulyt=30, numem=11)
> predict(gfit, xo, interval="confidence")
       fit      lwr      upr
1 659.4134 609.8933 708.9334
• A (1-α)*100% prediction interval for a single future response:
> predict(gfit, xo, interval="prediction")
       fit      lwr      upr
1 659.4134 584.2914 734.5353
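In the output above, both intervals share the same fitted value, but the prediction interval (584.29 to 734.54) is wider than the confidence interval (609.89 to 708.93), because a single future response carries its own error variance on top of the uncertainty in the mean. A small sketch on the built-in mtcars data (an assumption, with a hypothetical new case x0) shows the same pattern.

```r
# Assumption: built-in mtcars stands in for the survey data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
x0  <- data.frame(wt = 3, hp = 120)      # hypothetical new case

cint <- predict(fit, x0, interval = "confidence", level = 0.95)
pint <- predict(fit, x0, interval = "prediction", level = 0.95)

ci_width <- unname(cint[, "upr"] - cint[, "lwr"])
pi_width <- unname(pint[, "upr"] - pint[, "lwr"])

rbind(confidence = cint[1, ], prediction = pint[1, ])
c(ci_width = ci_width, pi_width = pi_width)
```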
USEFUL COMMANDS
> names(gfit)
 [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
 [8] "df.residual" "xlevels" "call" "terms" "model"

MULTICOLLINEARITY

• Exists when 2 or more independent variables are correlated
• Effects of multicollinearity
  1. Imprecise estimates of β
  2. The usual interpretation of a regression coefficient (the effect of one predictor with the others held constant) no longer applies.
  3. t-tests may fail to reveal significant factors

VARIANCE INFLATION FACTOR (VIF)

• Check the correlation matrix of all independent variables
> library(agricolae)
> Xs <- DIS[-1]
> corr.Xs <- correlation(Xs)

VARIANCE INFLATION FACTOR (VIF)

• Most reliable way to examine multicollinearity
• Rule of thumb: If VIF > 10 (or > 5 to be very conservative) there is a multicollinearity problem.
• vif() is one of the functions in package ‘faraway’ which computes the variance inflation factor
> vif(object, …)
# object – a data matrix of X’s or a model object
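The VIF of a predictor is 1/(1 - R2), where R2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on the built-in mtcars data (an assumption), using cor() as a plain substitute for the correlation matrix step; it is not how any particular package implements vif().

```r
# Assumption: four mtcars columns stand in for the survey predictors.
X <- mtcars[, c("wt", "hp", "disp", "qsec")]

# Correlation matrix of the predictors
round(cor(X), 3)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing X_j on the other predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 3)

# Apply the rule of thumb from the slides
vif_manual[vif_manual > 10]
```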


VARIANCE INFLATION FACTOR (VIF) : vif()

• vif() calculates the VIF
> library(faraway)
> vif(Xs) # or vif(gfit)
  income     food    house    fulyt    numem
2.691149 2.327383 1.237047 1.900161 1.380973

REMEDIAL MEASURES OF MULTICOLLINEARITY

1. Drop predictors which are correlated with other predictors
2. Add some observations to break the collinearity pattern
3. Ridge regression – variant of multiple regression whose objective is to solve multicollinearity
4. Regression using principal components

VARIABLE SELECTION

• Intended to select the “best” subset of predictors
• To explain the data in the simplest way – redundant predictors are removed.

CRITERION-BASED PROCEDURES

• Fit the 2^p possible models and choose the best one according to some criterion.
• Also known as the “all-possible-regressions procedure”

Criteria for Comparing Regression models
1. Akaike Information Criterion (AIC) and Bayes Information Criterion (BIC)
   $AIC = -2\,\text{log-likelihood} + 2p$
   $BIC = -2\,\text{log-likelihood} + p \log n$
   where, for a linear model, $-2\,\text{log-likelihood} = n \log(SSE/n)$ up to an additive constant, also known as the deviance
   Rule: Lowest AIC or BIC is the best model
CRITERION-BASED PROCEDURES
Criteria for Comparing Regression models
2. Adjusted R2 or Ra2:
   $R_a^2 = 1 - \left(\dfrac{n-1}{n-p}\right)\dfrac{SSE}{SST} = 1 - \dfrac{MSE}{SST/(n-1)}$
   Rule: The model with the highest adjusted R2 value is considered best.
3. Mallows' Cp statistic:
   $C_p = \dfrac{SSE_p}{MSE} + 2p - n$
   - MSE is from the model with all predictors and $SSE_p$ is from the model with p parameters
   - For the full model with p parameters, Cp = p. A model that fits well has Cp close to p, while a model with bad fit will have Cp much bigger than p.
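The criteria above can be computed directly from SSE, n and p. The sketch below does so on the built-in mtcars data (an assumption); crit() is a hypothetical helper written for this illustration, and p counts all parameters including the intercept.

```r
# Assumption: built-in mtcars stands in for the survey data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)   # all predictors
sub  <- lm(mpg ~ wt + hp, data = mtcars)                 # candidate submodel

# Hypothetical helper: criteria from the slide formulas
crit <- function(model, full) {
  y   <- model.response(model.frame(model))
  n   <- length(y)
  p   <- length(coef(model))              # parameters, intercept included
  SSE <- sum(resid(model)^2)
  SST <- sum((y - mean(y))^2)
  MSE_full <- sum(resid(full)^2) / df.residual(full)
  c(p     = p,
    AIC   = n * log(SSE / n) + 2 * p,
    BIC   = n * log(SSE / n) + p * log(n),
    adjR2 = 1 - (n - 1) / (n - p) * SSE / SST,
    Cp    = SSE / MSE_full + 2 * p - n)
}

round(rbind(sub = crit(sub, full), full = crit(full, full)), 4)
```

The AIC computed this way is on the same scale as extractAIC() and the values printed by step(); the AIC() function includes additional constants, so its numbers differ while the ranking of models fitted to the same data is unchanged.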

BACKWARD ELIMINATION

1. Fit a model with all predictors in the model
2. Remove the predictor with the highest p-value greater than the significance level αcrit
3. Refit the model and repeat Step 2.
4. Stop when all the p-values < αcrit
• This approach is known as the “saturated” model approach, where we saturate the model with all terms and remove those that are insignificant relative to the presence of the others.

FORWARD SELECTION

1. Fit the intercept or the null model, without variables in the model
2. For all predictors not in the model, check their p-values if they are added to the model. Choose the one with the lowest p-value less than αcrit.
3. Continue until no new predictors can be added.
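The backward elimination steps can be sketched as a short loop. This is an illustration only: it uses the built-in mtcars data (an assumption), takes the p-values from drop1() with an F test, and uses a hypothetical threshold alpha_crit of 0.05.

```r
# Assumption: built-in mtcars stands in for the survey data.
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)   # Step 1: all predictors
alpha_crit <- 0.05                                      # hypothetical threshold

repeat {
  tab   <- drop1(fit, test = "F")                       # p-value of each term
  pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])
  # Step 4: stop when every remaining p-value is below alpha_crit
  if (length(pvals) == 0 || max(pvals) < alpha_crit) break
  # Step 2: remove the predictor with the highest p-value
  worst <- names(which.max(pvals))
  # Step 3: refit without it and repeat
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

formula(fit)
final_p <- drop1(fit, test = "F")[["Pr(>F)"]][-1]
final_p
```

Forward selection follows the same pattern with add1() in place of drop1(), starting from the intercept-only model.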
STEPWISE REGRESSION

• Combination of backward elimination and forward selection
• At each stage a variable may be added or removed, and there are several variations on exactly how this is done.
• Some drawbacks
  1. Possible to miss the “optimal model”
  2. The procedures are not directly linked to the final objectives of prediction or explanation and may not help solve the problem
  3. Tends to pick smaller models than desirable for prediction or explanation.

regsubsets()

• One of the functions in package ‘leaps’, which is used for regression subset selection.
> regsubsets(x, data, method = c("exhaustive", "backward", "forward", "seqrep"), force.in=n[,n...])
# x – design matrix or model formula
# method – method of variable selection such as exhaustive search, forward, backward or sequential replacement. (Note: for a small number of independent variables, the exhaustive method is recommended)
# force.in – specifies one or more variables to be included in all models.

subsets()

• One of the functions in package ‘car’, which contains mostly functions for applied regression, linear models, and GLM.
• subsets() finds optimal subsets of predictors. This function plots a measure of fit against subset size.
> subsets(object, statistic=c("bic", "cp", "adjr2", "rsq", "rss"))
# object – a regsubsets object produced by regsubsets() in the leaps package
# statistic – statistic to plot for each predictor subset, such as bic, cp, adjusted R2, unadjusted R2 and residual sum of squares.

BACKWARD ELIMINATION : regsubsets()
> library(leaps)
> bwd <- regsubsets(disburse ~ ., data=DIS, method="backward")
> bwds <- summary(bwd); bwds
Subset selection object
5 Variables (and intercept)
       Forced in Forced out
income     FALSE      FALSE
food       FALSE      FALSE
house      FALSE      FALSE
fulyt      FALSE      FALSE
numem      FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: backward
         income food house fulyt numem
1  ( 1 ) "*"    " "  " "   " "   " "
2  ( 1 ) "*"    " "  "*"   " "   " "
3  ( 1 ) "*"    "*"  "*"   " "   " "
4  ( 1 ) "*"    "*"  "*"   " "   "*"
5  ( 1 ) "*"    "*"  "*"   "*"   "*"
BACKWARD ELIMINATION CRITERION
> names(bwds)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
> STAT.bwd <- rbind(bwds$rss, bwds$adjr2, bwds$cp, bwds$bic)
> rownames(STAT.bwd) <- c("rss", "adjr2", "cp", "bic")
> # column labels list the predictors in each model (i=income, h=house, fd=food, n=numem, fl=fulyt)
> colnames(STAT.bwd) <- c("i", "i-h", "i-fd-h", "i-fd-h-n", "i-fd-fl-h-n")
> round(STAT.bwd, 4)
               i        i-h     i-fd-h   i-fd-h-n i-fd-fl-h-n
rss   66515.2057 48606.4305 37245.9663 34722.6207  34568.5604
adjr2     0.9033     0.9278     0.9435     0.9462      0.9452
cp       38.6627    17.8679     5.4079     4.1961      6.0000
bic    -110.0121  -121.7838  -131.1824  -130.7780   -127.0883

subsets()
> library(car)
> par(mfrow=c(2,2))
> subsets(bwd, statistic="rss")
> subsets(bwd, statistic="adjr2")
> subsets(bwd, statistic="cp")
> subsets(bwd, statistic="bic")
> par(mfrow=c(1,1))

FORWARD SELECTION CRITERION

> fwd <- regsubsets(disburse ~ ., data=DIS, method="forward")
> fwds <- summary(fwd)
> STAT.fwd <- rbind(fwds$rss, fwds$adjr2, fwds$cp, fwds$bic)
> rownames(STAT.fwd) <- c("rss", "adjr2", "cp", "bic")
> colnames(STAT.fwd) <- c("i", "i-h", "i-fd-h", "i-fd-h-n", "i-fd-fl-h-n")
> round(STAT.fwd, 4)
               i        i-h     i-fd-h   i-fd-h-n i-fd-fl-h-n
rss   66515.2057 48606.4305 37245.9663 34722.6207  34568.5604
adjr2     0.9033     0.9278     0.9435     0.9462      0.9452
cp       38.6627    17.8679     5.4079     4.1961      6.0000
bic    -110.0121  -121.7838  -131.1824  -130.7780   -127.0883
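The all-possible-regressions idea behind an exhaustive search can be sketched in base R by fitting every non-empty subset of predictors and ranking the fits by a criterion. The example uses the built-in mtcars data with four predictors (an assumption); it is not the algorithm used by the leaps package, and the BIC values from BIC() may be on a different scale from those reported by regsubsets(), so only the ranking is of interest.

```r
# Assumption: built-in mtcars stands in for the survey data.
preds <- c("wt", "hp", "qsec", "drat")

# Every non-empty subset of the predictors (2^4 - 1 = 15 models)
subsets_list <- unlist(
  lapply(seq_along(preds), function(k) combn(preds, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each model and collect the comparison statistics
res <- do.call(rbind, lapply(subsets_list, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  data.frame(model = paste(vars, collapse = " + "),
             size  = length(vars),
             rss   = sum(resid(fit)^2),
             adjr2 = summary(fit)$adj.r.squared,
             bic   = BIC(fit))
}))

# Rule from the slides: lowest BIC is the best model
res  <- res[order(res$bic), ]
best <- res[1, ]
print(res, row.names = FALSE)
```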
STEPWISE REGRESSION : step()

• Used to select a formula-based model by AIC
> step(object, direction=c("both", "forward", "backward"))
# object – an lm object
# direction – mode of stepwise search, can be one of "both", "backward", or "forward"

STEPWISE REGRESSION : step()
> library(MASS)
> stepw <- step(gfit, direction="both")
Start: AIC=338.93
disburse ~ income + food + house + fulyt + numem

         Df Sum of Sq    RSS    AIC
- fulyt   1       154  34723 337.16
<none>                 34569 338.93
- numem   1      2632  37200 340.60
- food    1     13991  48560 353.93
- house   1     18529  53098 358.39
- income  1    153873 188441 421.73

Step: AIC=337.16
disburse ~ income + food + house + numem

         Df Sum of Sq    RSS    AIC
<none>                 34723 337.16
- numem   1      2523  37246 338.66
+ fulyt   1       154  34569 338.93
- food    1     13845  48568 351.93
- house   1     18562  53284 356.57
- income  1    192301 227023 429.04

STEPWISE REGRESSION : step()

> bward <- step(gfit, direction="backward")

STEPWISE REGRESSION : step()

> # no scope is given, so forward selection starting from the full model has nothing to add
> fward <- step(gfit, direction="forward")
Start: AIC=338.93
disburse ~ income + food + house + fulyt + numem
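A forward search normally starts from the intercept-only model and needs a scope that names the largest model allowed. The sketch below shows the three directions with step() on the built-in mtcars data (an assumption); trace = 0 only suppresses the step-by-step printout.

```r
# Assumption: built-in mtcars stands in for the survey data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward: start from the full model and drop terms
bward <- step(full, direction = "backward", trace = 0)

# Forward: start from the null model; scope gives the candidate terms
fward <- step(null,
              scope = list(lower = ~ 1, upper = formula(full)),
              direction = "forward", trace = 0)

# Both: terms may be dropped and re-added at each stage
stepw <- step(full, direction = "both", trace = 0)

formula(bward)
formula(fward)
formula(stepw)
```

The selected models need not coincide across directions, which is one reason the slides list "possible to miss the optimal model" as a drawback.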
Manual Variable Selection

• Start with the Null model
  Null <- lm(disburse ~ 1, data=DIS)
• Add one term to the Null model at a time
  addterm(Null, scope = gfit, test="F")

Manual Variable Selection

• Choose term to add to the Null model
  newmodel <- update(Null, .~. + income)
  AIC(newmodel)
• Add one term to the new model at a time
  addterm(newmodel, scope = gfit, test="F")

Manual Variable Selection

• Choose term to add to the new model
  newmodel2 <- update(newmodel, .~. + food)
  AIC(newmodel2)
• Continue the process until no more terms can be added to the model
• If a term should be deleted from the model
  newmodel <- update(newmodel2, .~. - food)

Final Model

  gfit2 <- lm(disburse ~ income+food+house+numem, data=DIS)
  summary(gfit2)
DIAGNOSTICS

ASSUMPTIONS

• The MLR model is based on several assumptions. Provided that the assumptions are satisfied, the regression estimators are optimal in the sense that they are unbiased, efficient, and consistent.
• Basic assumptions for the regression model
  1. The error terms are independent and normally distributed
  2. Linearity – the relationship between Y and the predictors is linear.
  3. The variance of the residuals is constant: $E(e_i^2) = \sigma^2$

RESIDUAL ANALYSIS

• Useful means of examining the aptness of a model
• Residual plot is a graphical presentation relating the residuals with the fitted values or individual predictors
• Non-Linearity
  [Figure: residual (ei) plots; panels: Non-Linear, Linear]

RESIDUAL ANALYSIS

• Non-Constancy of Variance
  [Figure: residual (ei) plots; panels: Increasing trend, Constant Variance, Decreasing trend]
RESIDUAL ANALYSIS

• Non-Independence of Errors
  [Figure: residual (ei) plots; panels: Independence of Errors, Positive Correlation, Negative Correlation]

RESIDUAL ANALYSIS

• Outlier detection
  [Figure: residual (ei) plots illustrating outliers]

RESIDUAL PLOT

> # residuals of the final model against its fitted values
> plot(fitted(gfit2), resid(gfit2), ylab="Residuals",
       xlab="Fitted", main="Residual Plot")
> abline(h=0)
RESIDUAL PLOT VS. PREDICTORS
> par(mfrow=c(2,2))
> plot(DIS$income, resid(gfit2), ylab="Residuals", xlab="Income")
> plot(DIS$house, resid(gfit2), ylab="Residuals", xlab="House")
> plot(DIS$food, resid(gfit2), ylab="Residuals", xlab="Food")
> plot(DIS$numem, resid(gfit2), ylab="Residuals", xlab="No. of family members")
> par(mfrow=c(1,1))

REMEDIAL MEASURES FOR VARIANCE HETEROGENEITY

• The most common remedy is transformation. But sometimes the “best” transformation is still not good enough. Also, sometimes a transformation only improves things a tiny bit, in which case it may be better to go with the untransformed variable, because it is much easier to interpret.
• Weighted regression (beyond the scope of the course)

ASSESSING NORMALITY OF RESIDUALS
> par(mfrow=c(2,2))
> qqnorm(trans2$res, ylab="Raw Residuals")
> qqline(trans2$res)
> qqnorm(rstudent(trans2), ylab="Studentized residuals")
> abline(0,1)
> hist(trans2$res, 10)
> boxplot(trans2$res, main="Boxplot of residuals")
> par(mfrow=c(1,1))
Outlier

• An observation that is unconditionally unusual in either its Y or X value
• Can cause us to misinterpret patterns in plots. Outliers can affect visual resolution of the remaining data in plots (they force observations into “clusters”)
• Can have a strong influence on statistical models. Deleting outliers from a regression model can sometimes give completely different results

OUTLIER TESTS

• An outlier test is useful to enable us to distinguish between truly unusual points and residuals which are large but not exceptional
• Before performing a test for outliers, check if the extreme observations are due to errors in computation or some other explainable cause.

IDENTIFYING INFLUENTIAL CASES

• The popular measure of an influential point is Cook’s Distance statistic (Di).
• Used to ascertain whether or not outliers are influential cases
• An index plot of Di can be used to identify influential points.
• Cook’s original criterion is 1.0 (too liberal)
• Fox proposed 4/(n-p-1)
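Both cutoffs can be applied to the values returned by cooks.distance() in the stats package. The sketch below uses the built-in mtcars data (an assumption) and takes p in 4/(n-p-1) to be the number of predictors, which is also an assumption since the slides do not define it at this point.

```r
# Assumption: built-in mtcars stands in for the survey data.
fit  <- lm(mpg ~ wt + hp, data = mtcars)
cook <- cooks.distance(fit)                 # one Di per observation

n <- nrow(mtcars)
p <- length(coef(fit)) - 1                  # assumption: p = number of predictors

cutoff_cook <- 1.0                          # Cook's original criterion
cutoff_fox  <- 4 / (n - p - 1)              # cutoff proposed by Fox

# Observations flagged by each cutoff, largest Di first
flag_cook <- sort(cook[cook > cutoff_cook], decreasing = TRUE)
flag_fox  <- sort(cook[cook > cutoff_fox],  decreasing = TRUE)

round(cutoff_fox, 4)
round(flag_fox, 4)
names(flag_cook)
```

Because the cutoff proposed by Fox is smaller than 1.0 for any realistic sample size, every case flagged by Cook's criterion is also flagged by it.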
COOK’S DISTANCE : cooks.distance()
• cooks.distance() is one of the functions of the package ‘stats’ that computes the Cook’s distance statistic
> cooks.distance(model, …)
# model – an object returned by lm or glm

• Illustration
> cook <- cooks.distance(gfit2)
> plot(cook,ylab="Cooks distances",
xlim=c(0,60))
> OBS <- rownames(DIS)
> identify(1:50,cook,OBS)

REMEDIAL MEASURES FOR INFLUENTIAL CASES

When an outlying observation is accurate but no explanation can be found for it, the following may be considered:
• Check for any independent variable that may have an unusual distribution that produces extreme values.
• Check if the dependent variable has properties that may lead to potential or occasional large residuals
• Check appropriateness of model
• Apply appropriate transformation.
• Use robust approaches (refer to regression books)

Summary

• Use of lm() to fit multiple linear regression
• Different methods in selecting independent variables
• How to handle any violation in the assumptions of regression
  – Multicollinearity
  – Variance heterogeneity
  – Non-normality of residuals
• How to handle outliers
Thank you!