
MULTIPLE REGRESSION

Slides prepared by: Leilani Nora, Assistant Scientist, and Violeta Bartolome, Senior Associate Scientist, PBGB-CRIL

• Regression analysis models the relationship between a single dependent variable Y and one or more independent variables, X1, X2, …, Xp.
• Simple linear regression if p = 1:
  Y = β0 + β1X1 + ε
• Multiple regression if p > 1:
  Y = β0 + β1X1 + β2X2 + ε
• Y (response or dependent variable) is continuous.
• X1 and X2 (predictor, independent, or explanatory variables) can be continuous, discrete, or categorical.

• In choosing predictors, consider how much of the variation in Y a variable explains and its importance as a causal agent in the process under study.
• Need to restrict the coverage of the model to some interval or region of values of the independent variable(s).
• The scope is determined either by the design of the investigation or by the range of the data at hand.

EXAMPLE

• Y is family disbursement (in thousand pesos), with predictors:
  total family income (X1 = income)
  food (X2 = food)
  household operation (X3 = house)
  fuel and light expenditure (X4 = fulyt)
  number of family members (X5 = numem)
• All amounts are annual and expressed in thousand pesos.
• Dataframe with 50 observations, 1 dependent variable (disburse) and 5 independent variables (income, food, house, fulyt, numem).
• Linear model:
  Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ε

ANOVA APPROACH TO REGRESSION ANALYSIS

  SV          SS    DF     MS                 Fc
  Regression  SSR   p−1    MSR = SSR/(p−1)    MSR/MSE
  Error       SSE   n−p    MSE = SSE/(n−p)

• One measure of model fit is R², the coefficient of determination, which measures the reduction of the total variation in Y associated with the predictors in the model:

  R² = SSR/SST = 1 − SSE/SST

• Values closer to 1 indicate a better fit.
• For simple linear regression, R² = r².
• Another measure of fit, the MSE, is related to the standard errors of the estimates of β.
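The identity R² = SSR/SST = 1 − SSE/SST can be checked numerically. A minimal sketch on simulated data (the variables here are illustrative, not the Fies.csv ones):

```r
## Sketch: recover R^2 from the sums of squares on simulated data
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- 2 + 0.8*x1 - 0.5*x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)

SST <- sum((y - mean(y))^2)   # total sum of squares
SSE <- sum(resid(fit)^2)      # error (residual) sum of squares
SSR <- SST - SSE              # regression sum of squares

R2 <- SSR / SST               # equals 1 - SSE/SST
all.equal(R2, summary(fit)$r.squared)   # TRUE
```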

DATAFRAME: Fies.csv

Read data file Fies.csv:

> DIS <- read.table("Fies.csv", header=T, sep=",", row.names="OBS")
> str(DIS)
'data.frame': 50 obs. of 6 variables:
 $ disburse: num 158 227 335 181 141 ...
 $ income  : num 212 251 277 174 137 ...
 $ food    : num 69.8 96.8 75.8 60.7 ...
 $ house   : num 6.01 1.52 23.88 6.89 ...
 $ fulyt   : num 10.86 9.87 20.08 5.1 ...
 $ numem   : int 4 5 3 5 7 3 8 5 5 3 ...

MLREG using lm()

> gfit <- lm(disburse ~ ., data=DIS)
> summary(gfit)

ANOVA Table

> anova(gfit)
Response: disburse
          Df Sum Sq Mean Sq  F value    Pr(>F)
income     1 635639  635639 809.0628 < 2.2e-16 ***
food       1   7651    7651   9.7391  0.003182 **
house      1  21618   21618  27.5158 4.268e-06 ***
fulyt      1     46      46   0.0583  0.810301
numem      1   2632    2632   3.3496  0.074002 .
Residuals 44  34569     786
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

INTERPRETING PARAMETER ESTIMATES

• Naïve interpretation: "A unit change in X1 will produce a change of β1 in the response."
• β1 is the effect of X1 when all other specified predictors are held constant.

CONFIDENCE INTERVALS (CI) FOR β

> confint(object, level)
# object – a fitted model object
# level – the confidence level required

> CI.Beta <- confint(gfit)
> round(CI.Beta, digits=4)
               2.5 %  97.5 %
(Intercept) -16.1704 40.9888
income        0.5364  0.7169
food          0.4063  1.1491
house         2.5068  6.0633
fulyt        -2.5573  1.6359
numem        -9.2633  0.4461

• predict() is a direct method for computing the predicted response from the results of model-fitting functions.

> predict(object, interval, level, …)
# object – a fitted model object
# interval – the type of interval calculation: "none", "confidence", or "prediction"
# level – the confidence level; default is 0.95

• A (1−α)*100% confidence interval for the future average response:

> xo <- data.frame(income=750, food=170, house=25, fulyt=30, numem=11)
> predict(gfit, xo, interval="confidence")
       fit      lwr      upr
1 659.4134 609.8933 708.9334

• A (1−α)*100% prediction interval for a single future response:

> predict(gfit, xo, interval="prediction")
       fit      lwr      upr
1 659.4134 584.2914 734.5353
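Both intervals share the same fitted value, but the prediction interval is always the wider of the two, since it also accounts for the variance of a single new observation. A small sketch on simulated data (illustrative variables, not the Fies.csv ones):

```r
## Sketch: a prediction interval is wider than a confidence interval
set.seed(2)
x <- rnorm(40); y <- 1 + 2*x + rnorm(40)
fit <- lm(y ~ x)
new <- data.frame(x = 0.5)

conf_int <- predict(fit, new, interval = "confidence")  # for the mean response
pred_int <- predict(fit, new, interval = "prediction")  # for a single new response

conf_int[, "fit"] == pred_int[, "fit"]   # same point estimate
(pred_int[, "upr"] - pred_int[, "lwr"]) >
  (conf_int[, "upr"] - conf_int[, "lwr"])   # prediction interval is wider
```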

USEFUL COMMANDS

> names(gfit)
 [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
 [8] "df.residual" "xlevels" "call" "terms" "model"

MULTICOLLINEARITY

• Exists when two or more independent variables are correlated.
• Effects of multicollinearity:
  1. Imprecise estimates of β.
  2. The usual interpretation of the regression coefficients no longer applies.
  3. t-tests may fail to reveal significant factors.

• Check the correlation matrix of all independent variables:

> library(agricolae)
> Xs <- DIS[-1]
> corr.Xs <- correlation(Xs)

VARIANCE INFLATION FACTOR (VIF)

• The most reliable way to examine multicollinearity.
• Rule of thumb: if VIF > 10 (or > 5 to be very conservative), there is a multicollinearity problem.
• vif() is one of the functions in package 'faraway' which computes the variance inflation factor:

> vif(object, …)

VARIANCE INFLATION FACTOR (VIF): vif()

• vif() calculates the VIF:

> library(faraway)
> vif(Xs)   # or vif(gfit)
  income     food    house    fulyt    numem
2.691149 2.327383 1.237047 1.900161 1.380973

REMEDIAL MEASURES OF MULTICOLLINEARITY

1. Drop predictors which are correlated with other predictors.
2. Add some observations to break the collinearity pattern.
3. Ridge regression – a variant of multiple regression whose objective is to overcome multicollinearity.
4. Regression using principal components.
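For intuition, the VIF of predictor Xj equals 1/(1 − Rj²), where Rj² is the R² from regressing Xj on the remaining predictors. A sketch of that definition on simulated data (illustrative variables; the value should agree with faraway's vif()):

```r
## Sketch: VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the others
set.seed(3)
x1 <- rnorm(60)
x2 <- 0.7*x1 + rnorm(60)   # deliberately correlated with x1
x3 <- rnorm(60)

r2_1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_1 <- 1 / (1 - r2_1)    # manual VIF for x1

# faraway::vif(data.frame(x1, x2, x3)) should report the same value for x1
```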

VARIABLE SELECTION

• Intended to select the "best" subset of predictors.
• Aims to explain the data in the simplest way – redundant predictors are removed.

CRITERION-BASED PROCEDURES

• Fit the 2^p possible models and choose the best one according to some criterion.
• Also known as the "all-possible-regressions procedure".

Criteria for comparing regression models:

1. Akaike Information Criterion (AIC) and Bayes Information Criterion (BIC)

   AIC = −2 log-likelihood + 2p
   BIC = −2 log-likelihood + p log n

   where −2 log-likelihood = n log(SSE/n), also known as the deviance.

   Rule: the model with the lowest AIC or BIC is best.
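The deviance form above can be checked against R's extractAIC(), which for lm objects computes exactly n·log(SSE/n) + 2p. A sketch on simulated data (note extractAIC() omits an additive constant relative to AIC(), which does not affect model comparisons):

```r
## Sketch: AIC = n*log(SSE/n) + 2p, the form used by extractAIC() for lm
set.seed(4)
x <- rnorm(30); y <- 1 + x + rnorm(30)
fit <- lm(y ~ x)

n   <- length(y)
SSE <- sum(resid(fit)^2)
p   <- length(coef(fit))    # number of parameters, including the intercept

aic_manual <- n * log(SSE / n) + 2 * p
all.equal(aic_manual, extractAIC(fit)[2])   # TRUE
```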

CRITERION-BASED PROCEDURES

Criteria for comparing regression models:

2. Adjusted R² (Ra²):

   Ra² = 1 − [(n−1)/(n−p)] (SSE/SST) = 1 − MSE/(SST/(n−1))

   Rule: the model with the highest adjusted R² is considered best.

3. Mallow's Cp statistic:

   Cp = SSEp/MSE + 2p − n

   where MSE is from the model with all predictors and SSEp is from the model with p parameters.

   When the number of parameters equals p, Cp = p; a model with bad fit will have a Cp much bigger than p.
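One consequence worth verifying: for the model containing all predictors, SSEp/MSE = n − p, so Cp collapses to exactly p. A sketch (simulated data, illustrative variables):

```r
## Sketch: for the full model, Cp = SSE_p/MSE + 2p - n reduces to p
set.seed(5)
x1 <- rnorm(40); x2 <- rnorm(40)
y  <- 1 + x1 + 0.5*x2 + rnorm(40)
full <- lm(y ~ x1 + x2)

n    <- length(y)
p    <- length(coef(full))   # parameters in the full model
SSEp <- sum(resid(full)^2)
MSE  <- SSEp / (n - p)       # MSE from the model with all predictors

Cp <- SSEp / MSE + 2*p - n   # equals p exactly for the full model
```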

BACKWARD ELIMINATION

1. Fit a model with all predictors in the model.
2. Remove the predictor with the highest p-value greater than the significance level αcrit.
3. Refit the model and repeat Step 2.
4. Stop when all p-values are less than αcrit.

• This approach is known as the "saturated" model approach: we saturate the model with all terms and remove those that are insignificant relative to the presence of the others.

FORWARD SELECTION

1. Fit the intercept-only (null) model, without variables in the model.
2. For all predictors not in the model, check their p-values if they are added to the model. Choose the one with the lowest p-value less than αcrit.
3. Continue until no new predictors can be added.

STEPWISE REGRESSION

• A combination of backward elimination and forward selection.
• At each stage a variable may be added or removed, and there are several variations on exactly how this is done.
• Some drawbacks:
  1. Possible to miss the "optimal" model.
  2. The procedures are not directly linked to the final objectives of prediction or explanation and may not help solve the problem.
  3. Tends to pick smaller models than desirable for prediction or explanation.

regsubsets()

• One of the functions in package 'leaps', which is used for regression subset selection.

> regsubsets(x, data, method=c("exhaustive", "backward", "forward", "seqrep"), force.in=NULL)
# x – design matrix or model formula
# method – method of variable selection: exhaustive search, forward, backward, or sequential replacement. (Note: for a small number of independent variables, the exhaustive method is recommended.)
# force.in – specifies one or more variables to be included in all models.

> library(leaps)
> bwd <- regsubsets(disburse~., data=DIS, method="backward")
> bwds <- summary(bwd); bwds
Subset selection object
5 Variables (and intercept)
       Forced in Forced out
income     FALSE      FALSE
food       FALSE      FALSE
house      FALSE      FALSE
fulyt      FALSE      FALSE
numem      FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: backward
         income food house fulyt numem
1  ( 1 ) "*"    " "  " "   " "   " "
2  ( 1 ) "*"    " "  "*"   " "   " "
3  ( 1 ) "*"    "*"  "*"   " "   " "
4  ( 1 ) "*"    "*"  "*"   " "   "*"
5  ( 1 ) "*"    "*"  "*"   "*"   "*"

subsets()

• One of the functions in package 'car', which contains mostly functions for applied regression, linear models, and GLMs.
• subsets() finds optimal subsets of predictors and plots a measure of fit against subset size.

> subsets(object, statistic=c("bic", "cp", "adjr2", "rsq", "rss"))
# object – a regsubsets object produced by regsubsets() in the leaps package
# statistic – statistic to plot for each predictor subset: bic, cp, adjusted R², unadjusted R², or residual sum of squares.

BACKWARD ELIMINATION CRITERION

> names(bwds)
[1] "which"  "rsq"  "rss"  "adjr2"  "cp"  "bic"  "outmat"  "obj"

> STAT.bwd <- rbind(bwds$rss, bwds$adjr2, bwds$cp, bwds$bic)
> rownames(STAT.bwd) <- c("rss", "adjr2", "cp", "bic")
> colnames(STAT.bwd) <- c("i", "i-h", "i-fd-h", "i-fd-h-n", "i-fd-fl-h-n")
> round(STAT.bwd, 4)
               i        i-h     i-fd-h   i-fd-h-n i-fd-fl-h-n
rss   66515.2057 48606.4305 37245.9663 34722.6207  34568.5604
adjr2     0.9033     0.9278     0.9435     0.9462      0.9452
cp       38.6627    17.8679     5.4079     4.1961      6.0000
bic    -110.0121  -121.7838  -131.1824  -130.7780   -127.0883

subsets()

> par(mfrow=c(2,2))
> subsets(bwd, statistic="rss")
> subsets(bwd, statistic="adjr2")
> subsets(bwd, statistic="cp")
> subsets(bwd, statistic="bic")
> par(mfrow=c(1,1))

> fwd <- regsubsets(disburse~., data=DIS, method="forward")
> fwds <- summary(fwd)
> STAT.fwd <- rbind(fwds$rss, fwds$adjr2, fwds$cp, fwds$bic)
> rownames(STAT.fwd) <- c("rss", "adjr2", "cp", "bic")
> colnames(STAT.fwd) <- c("i", "i-h", "i-fd-h", "i-fd-h-n", "i-fd-fl-h-n")
> round(STAT.fwd, 4)
               i        i-h     i-fd-h   i-fd-h-n i-fd-fl-h-n
rss   66515.2057 48606.4305 37245.9663 34722.6207  34568.5604
adjr2     0.9033     0.9278     0.9435     0.9462      0.9452
cp       38.6627    17.8679     5.4079     4.1961      6.0000
bic    -110.0121  -121.7838  -131.1824  -130.7780   -127.0883

STEPWISE REGRESSION: step()

• Used to select a formula-based model by AIC.

> step(object, direction=c("both", "forward", "backward"))
# object – an lm object
# direction – the mode of stepwise search: "both", "backward", or "forward"

> library(MASS)
> stepw <- step(gfit, direction="both")
Start: AIC=338.93
disburse ~ income + food + house + fulyt + numem
          Df Sum of Sq    RSS    AIC
- fulyt    1       154  34723 337.16
<none>                  34569 338.93
- numem    1      2632  37200 340.60
- food     1     13991  48560 353.93
- house    1     18529  53098 358.39
- income   1    153873 188441 421.73

Step: AIC=337.16
disburse ~ income + food + house + numem
          Df Sum of Sq    RSS    AIC
<none>                  34723 337.16
- numem    1      2523  37246 338.66
+ fulyt    1       154  34569 338.93
- food     1     13845  48568 351.93
- house    1     18562  53284 356.57
- income   1    192301 227023 429.04

> bward <- step(gfit, direction="backward")
> fward <- step(gfit, direction="forward")
Start: AIC=338.93
disburse ~ income + food + house + fulyt + numem

Manual Variable Selection

• Start with the null model:
Null <- lm(disburse ~ 1, data=DIS)

• Add one term to the null model at a time:
addterm(Null, scope=gfit, test="F")

• Choose a term to add to the null model:
newmodel <- update(Null, .~. + income)
AIC(newmodel)

• Add one term to the new model at a time:
addterm(newmodel, scope=gfit, test="F")

• Choose a term to add to the new model:
newmodel2 <- update(newmodel, .~. + food)
AIC(newmodel2)

• Repeat until no further term could be added to the model.

• If a term should be deleted from the model:
newmodel <- update(newmodel2, .~. - food)

Final Model

gfit2 <- lm(disburse ~ income+food+house+numem, data=DIS)
summary(gfit2)

ASSUMPTIONS

• The MLR model is based on several assumptions. Provided that the assumptions are satisfied, the regression estimators are optimal in the sense that they are unbiased, efficient, and consistent.

1. The error terms are independent and normally distributed.
2. Linearity – the relationship between Y and the predictors is linear.
3. The variance of the residuals is constant: E(ei²) = σ².

RESIDUAL ANALYSIS

• A useful means of examining the aptness of a model.
• A residual plot is a graphical presentation relating the residuals ei to the fitted values or to a predictor.

[Schematic residual plots (figures omitted): non-linear vs. linear pattern; constant variance vs. increasing or decreasing trends (non-constancy of variance); independence vs. non-independence of errors; outlier detection.]

> plot(gfit2$fit, gfit2$res, ylab="Residuals", xlab="Fitted", main="Residual Plot")
> abline(h=0)

RESIDUAL PLOT VS. PREDICTORS

> par(mfrow=c(2,2))
> plot(DIS$income, gfit2$res, ylab="Residuals", xlab="Income")
> plot(DIS$house, gfit2$res, ylab="Residuals", xlab="House")
> plot(DIS$food, gfit2$res, ylab="Residuals", xlab="Food")
> plot(DIS$numem, gfit2$res, ylab="Residuals", xlab="No. of family members")
> par(mfrow=c(1,1))

REMEDIAL MEASURES FOR VARIANCE HETEROGENEITY

• The most common remedy is transformation. But sometimes the "best" transformation is still not good enough. Also, sometimes a transformation only improves things a tiny bit, in which case it may be better to go with the untransformed variable, because it is much easier to interpret.
• Weighted regression (beyond the scope of the course).

> par(mfrow=c(2,2))
> qqnorm(trans2$res, ylab="Raw Residuals")
> qqline(trans2$res)
> qqnorm(rstudent(trans2), ylab="Studentized residuals")
> abline(0,1)
> hist(trans2$res, 10)
> boxplot(trans2$res, main="Boxplot of residuals")
> par(mfrow=c(1,1))

OUTLIER

• An observation that is extreme or unusual in its Y or X value.
• Can cause us to misinterpret patterns in plots. Outliers can affect the visual resolution of the remaining data in plots (forcing observations into "clusters").
• Can have a strong influence on statistical models. Deleting outliers from a regression model can sometimes give completely different results.
• An outlier test is useful to enable us to distinguish between truly unusual points and residuals which are large but not exceptional.
• Before performing a test for outliers, check whether the extreme observations are due to errors in computation or some other explainable cause.

• The popular measure of influential points is the Cook's distance statistic (Di).
• Used to ascertain whether or not outliers are influential cases.
• An index plot of Di can be used to identify influential points.
• Cook's original criterion is 1.0 (too liberal); Fox proposed 4/(n−p−1).

COOK'S DISTANCE: cooks.distance()

• cooks.distance() is one of the functions of the package 'stats' that computes the Cook's distance statistic.

> cooks.distance(model, …)
# model – an object returned by lm or glm

• Illustration:

> cook <- cooks.distance(gfit2)
> plot(cook, ylab="Cooks distances", xlim=c(0,60))
> OBS <- rownames(DIS)
> identify(1:50, cook, OBS)
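cooks.distance() can be cross-checked against the definition Di = ei²·hii / (p·MSE·(1 − hii)²), where hii is the leverage of observation i. A sketch on simulated data (not the gfit2 model from the slides):

```r
## Sketch: Cook's distance from its definition,
## D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2)
set.seed(6)
x <- rnorm(30); y <- 1 + x + rnorm(30)
fit <- lm(y ~ x)

e   <- resid(fit)
h   <- hatvalues(fit)            # leverages
p   <- length(coef(fit))
MSE <- sum(e^2) / df.residual(fit)

D_manual <- e^2 * h / (p * MSE * (1 - h)^2)
all.equal(unname(D_manual), unname(cooks.distance(fit)))   # TRUE
```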

INFLUENTIAL CASES

When an outlying observation is accurate but no explanation can be found for it, the following may be considered:

• Check for any independent variable that may have an unusual distribution that produces extreme values.
• Check if the dependent variable has properties that may lead to potential or occasional large residuals.
• Check the appropriateness of the model.
• Apply an appropriate transformation.
• Use robust approaches (refer to regression books).

SUMMARY

• Use of lm() to fit multiple linear regression.
• Different methods of selecting independent variables.
• How to handle violations of the regression assumptions:
  – Multicollinearity
  – Variance heterogeneity
  – Non-normality of residuals
• How to handle outliers.

Thank you!
