
STATA: The Red tutorial

This tutorial presentation is prepared by

Mohammad Ehsanul Karim ehsan.karim@gmail.com


**Contents: Linear Regression Analysis**

1. Introduction to Linear Regression
2. Tests for Normality of Residuals
3. Tests for Heteroscedasticity
4. Tests for Multicollinearity
5. Tests for Autocorrelation
6. Detecting Unusual and Influential Data
7. Tests for Model Specification

1. Introduction to Linear Regression

Linear Regression: The command regress is used to perform linear regression. The first variable after the regress command is always the dependent variable (left-hand-side variable), and the list of independent variables that we choose to include in the estimation model follows (right-hand-side variables).

. use hs1, clear
. regress write read female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
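The same least-squares fit can be sketched outside Stata. Below is a minimal numpy illustration on synthetic stand-ins for write, read, and female (the hs1 dataset itself is not included here, so the numbers will not match the output above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the hs1 variables (illustrative, not the real data)
read = rng.normal(52, 10, n)
female = rng.integers(0, 2, n).astype(float)
write = 20 + 0.57 * read + 5.5 * female + rng.normal(0, 7, n)

# Design matrix with an intercept column, mirroring `regress write read female`
X = np.column_stack([np.ones(n), read, female])
beta, _, _, _ = np.linalg.lstsq(X, write, rcond=None)

# R-squared = 1 - RSS/TSS, the same quantity Stata reports
resid = write - X @ beta
r2 = 1 - (resid @ resid) / ((write - write.mean()) @ (write - write.mean()))
print(beta, r2)
```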

2. Tests for Normality of Residuals

We use the predict command with the resid option to generate the residuals, and we name the residuals r.

. predict r, resid

Shapiro-Wilk W test for Normality

For verifying that the residuals are normally distributed, which is a very important assumption for regression, we use the Shapiro-Wilk W test for normal data.

. swilk r

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+-------------------------------------------------
           r |    200    0.98714     1.919     1.499    0.06692
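For comparison, the same test is available outside Stata. A minimal sketch using scipy's shapiro on synthetic stand-in residuals (assumes scipy is installed; drawn normal here, so the test should not reject):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
r = rng.normal(0, 1, 200)  # stand-in for the regression residuals

# Shapiro-Wilk W test for normality; a small p-value rejects normality
W, p = stats.shapiro(r)
print(W, p)
```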

In verifying that the residuals are normally distributed, which is a very important assumption for regression, the kdensity command with the normal option displays a density graph of the residuals with a normal distribution superimposed on the graph.

. kdensity r, normal

The pnorm command produces a normal probability plot; it is another method of testing whether the residuals from the regression are normally distributed.

. pnorm r

The qnorm command produces a normal quantile plot. It is yet another method for testing whether the residuals are normally distributed.

. qnorm r

Summary of Tests for Normality of Residuals:
- swilk performs the Shapiro-Wilk W test for normality.
- kdensity produces a kernel density plot with a normal distribution overlaid.
- pnorm graphs a standardized normal probability (P-P) plot.
- qnorm plots the quantiles of varname against the quantiles of a normal distribution.

3. Tests for Heteroscedasticity

One of the basic assumptions for ordinary least squares regression is the homogeneity of variance of the residuals. There are graphical and non-graphical methods for detecting heteroscedasticity.

Cook-Weisberg test for heteroskedasticity

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of write
     Ho: Constant variance
     chi2(1)      =     5.79
     Prob > chi2  =   0.0161

We use the rvfplot command with the yline(0) option to put a reference line at y=0.

. rvfplot, yline(0)

Summary of Tests for Heteroscedasticity:
- hettest performs the Cook and Weisberg test.
- rvfplot graphs a residual-versus-fitted plot.

4. Tests for Multicollinearity

Multicollinearity is a concern for multiple regression, not for its existence but for its degree. For a severe degree of multicollinearity, the regression model estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated.

We can use the vif command after the regression to check for multicollinearity; vif stands for variance inflation factor.

. vif

    Variable |      VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00

Tolerance = 1/VIF is used to check on the degree of collinearity. A VIF value greater than 10 may merit further investigation; a tolerance value lower than 0.1 is comparable to a VIF of 10.
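The VIF computation itself is simple, so it can be sketched directly from its auxiliary-regression definition, VIF = 1/(1 - R²). A numpy illustration on synthetic, nearly uncorrelated predictors (so both VIFs land close to 1, as in the output above):

```python
import numpy as np

def vif(X, j):
    """VIF for column j: 1/(1 - R^2) from regressing X[:, j] on the
    remaining predictors (plus an intercept)."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta
    r2 = 1 - (e @ e) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(3)
n = 200
read = rng.normal(size=n)
female = rng.integers(0, 2, n).astype(float)  # nearly uncorrelated predictors
X = np.column_stack([read, female])
vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)  # both values should be close to 1
```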

Summary of Tests for Multicollinearity: vif calculates the variance inflation factor for the independent variables in the linear model.

5. Tests for Autocorrelation

. tsset id
        time variable:  id, 1 to 200

. dwstat

Durbin-Watson d-statistic(  3,   200) =  1.93992
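The d statistic is easy to compute directly from its formula. A numpy sketch on synthetic independent residuals, for which d should land near 2 (no first-order autocorrelation):

```python
import numpy as np

def durbin_watson(e):
    # d = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest
    # no first-order autocorrelation
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(4)
e = rng.normal(size=200)  # independent stand-in residuals
d = durbin_watson(e)
print(d)
```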

6. Detecting Unusual and Influential Data

• Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.
• Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.
• Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Here we summarize the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

    Measure        Value
    leverage       > (2k+2)/n
    abs(rstu)      > 2
    Cook's D       > 4/n
    abs(DFITS)     > 2*sqrt(k/n)
    abs(DFBETA)    > 2/sqrt(n)
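These cutoffs are plain formulas, so they can be computed directly. A small Python sketch with k and n taken from the running example (k = 2 predictors, n = 200 observations):

```python
import math

# k predictors, n observations; k=2, n=200 matches the write ~ read female model
k, n = 2, 200
cutoffs = {
    "leverage":    (2 * k + 2) / n,
    "abs(rstu)":   2.0,
    "Cook's D":    4 / n,
    "abs(DFITS)":  2 * math.sqrt(k / n),
    "abs(DFBETA)": 2 / math.sqrt(n),
}
print(cutoffs)
```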

We use the predict command with the rstudent option to generate studentized residuals, and we name the residuals r. Studentized residuals are a type of standardized residual that can be used to identify outliers.

. predict r, rstudent

. stem r

Stem-and-leaf plot for r (Studentized residuals)
r rounded to nearest multiple of .01, plot in units of .01
[stems run from -2** to 2**; the leaf display is not reproduced here]

. sort r
. list r in 1/10

              r
  1.  -2.503566
  2.  -2.421219
  3.  -2.255832
  4.  -2.210221
  5.  -2.178212
  6.  -1.916192
  7.  -1.848524
  8.  -1.843611
  9.  -1.831068
 10.  -1.750652

. list r in -10/l

              r
191.   1.551833
192.   1.602682
193.   1.677923
194.   1.726393
195.   1.730591
196.   1.749522
197.   1.774811
198.   1.798141
199.   1.840841
200.   2.160904

We should pay attention to studentized residuals that exceed +2 or -2, and get even more concerned about residuals that exceed +2.5 or -2.5, and even yet more concerned about residuals that exceed +3 or -3.

. list r if r<-2 | r>2

              r
  1.  -2.503566
  2.  -2.421219
  3.  -2.255832
  4.  -2.210221
  5.  -2.178212
200.   2.160904

. list r if r<-2.5 | r>2.5

              r
  1.  -2.503566

To get leverage points, we use the predict command with the leverage option, and we name them lev.

. predict lev, leverage

Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar except that they scale differently, but they give us similar answers.

. predict d, cooksd
. list female read d if d>4/_N

       female   read          d
 13.     male     50   .0202435
 39.     male     47   .0234054
123.   female     57   .0327483
142.     male     76   .0212312

. predict dfit, dfits
. list dfit if abs(dfit)>2*sqrt(3/51)

The above measures are general measures of influence.

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. Apparently this is more computationally intensive than summary statistics such as Cook's D. In Stata, the dfbeta command will produce the DFBETAs for each of the predictors.

. dfbeta
                DFread:  DFbeta(read)
              DFfemale:  DFbeta(female)

. list DFread DFfemale in 1/5

         DFread    DFfemale
  1.   .0492348    .1802994
  2.   .1971976   -.0887463
  3.  -.1374498    .0717626
  4.  -.0434659    .1617497
  5.   .1740918    .0915453

There are also several graphs that can be used to search for unusual and influential observations. The avplot command graphs an added-variable plot.

The avplot command not only works for the variables in the model, it also works for variables that are not in the model, which is why it is called an added-variable plot. We can do an avplot on the variable grade.

. avplot grade

rvpplot is another convenience command which produces a plot of the residual versus a specified predictor; it is also used after regress or anova.

. rvpplot read

lvr2plot stands for leverage-versus-residual-squared plot.

. lvr2plot

Summary of Detecting Unusual and Influential Data:
- predict creates predicted values, residuals, and measures of influence.
- rvfplot graphs a residual-versus-fitted plot.
- rvpplot graphs a residual-versus-predictor plot.
- dfbeta calculates DFBETAs for all the independent variables.
- avplot graphs an added-variable plot.
- lvr2plot graphs a leverage-versus-squared-residual plot.

7. Tests for Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model, or one or more irrelevant variables are included in the model.

There are several methods to detect specification errors. The linktest command performs a model specification link test for single-equation models.

. linktest

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   79.86
       Model |  8005.11739     2  4002.55869           Prob > F      =  0.0000
    Residual |  9873.75761   197   50.120597           R-squared     =  0.4477
-------------+------------------------------           Adj R-squared =  0.4421
       Total |   17878.875   199   89.843593           Root MSE      =  7.0796

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   2.807497   1.052071     2.67   0.008     .7327302    4.882264
      _hatsq |  -.0170281   .0098827    -1.72   0.086    -.0365176    .0024615
       _cons |  -47.29516   27.77544    -1.70   0.090    -102.0705    7.480201
------------------------------------------------------------------------------

The ovtest command performs a regression specification error test (RESET) for omitted variables.

. ovtest

Ramsey RESET test using powers of the fitted values of write
       Ho:  model has no omitted variables
                 F(3, 194) =      1.95
                  Prob > F =      0.1233
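The RESET test can also be computed by hand: augment the model with powers of the fitted values and F-test them jointly. A numpy/scipy sketch on synthetic data, using the same powers (squared, cubed, fourth) as Stata's default:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

def rss(A, y):
    """Residual sum of squares from an OLS fit of y on A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta
    return float(e @ e)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta

# RESET: add powers of the fitted values and F-test their joint significance
Xa = np.column_stack([X, yhat ** 2, yhat ** 3, yhat ** 4])
q, df2 = 3, n - Xa.shape[1]
F = ((rss(X, y) - rss(Xa, y)) / q) / (rss(Xa, y) / df2)
p = float(stats.f.sf(F, q, df2))
print(F, p)
```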

Summary of Tests for Model Specification:
- linktest performs a link test for model specification.
- ovtest performs a regression specification error test (RESET) for omitted variables.

STATA: The Red tutorial
