
Outline Regression

Outline: Regression analysis in Python


1 Introduction to statsmodels module
2 Ordinary least squares (OLS) regression
3 Presenting regression results


Regression analysis

The statsmodels module provides a large range of cross-sectional models as
well as some time-series models.
The documentation is available at
http://www.statsmodels.org/stable/index.html.
We will use the statsmodels.api and statsmodels.formula.api modules to
run regressions.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

There are two options for running OLS regressions: (i) smf.ols() and (ii)
sm.OLS(). In the first option, statsmodels allows users to fit statistical
models using R-style formulas.


Wage Regression

The data set is wage1000.csv with headers.

# Importing data
# Use read_csv() to import data
wage_data = pd.read_csv('wage1000.csv')
type(wage_data)
wage_data.head()

In [114]: wage_data.columns
Out [114]:
Index(['wage', 'female', 'nonwhite', 'unionmember', 'education', 'experience',
       'age', 'const'],
      dtype='object')

It is wage data with 1000 observations from the US Bureau of the Census Current
Population Survey, March 1995.
The underlying population is the employed labor force, age 18-65. The
variables are as follows:
1 wage (hourly wage)
2 female (1 = worker is female)
3 nonwhite (1 = worker is non-white)
4 unionmember (1 = worker is a union member)
5 education (years of education)
6 experience (years of work experience)
7 age (age in years)


Wage Regression: using smf.ols()

Our goal is to run the following linear regression model

wage = β0 + β1 female + β2 nonwhite + β3 unionmember + β4 education
       + β5 experience + ε,

where the β's are unknown coefficients and ε is the disturbance term.

The smf module hosts many of the same functions found in the sm module
(e.g. OLS, GLM). Use dir(smf) to list available models.
model1 = smf.ols(formula='wage ~ female + nonwhite + unionmember + education +\
experience', data=wage_data)
# We need to use .fit() to obtain parameter estimates
result1 = model1.fit()
# We now have the fitted regression model stored in result1
# To view the OLS regression results, we can call the .summary() method
result1.summary()

Note that we do not need to specify an intercept term in smf.ols(). The
function will include an intercept term by default.
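If the intercept is not wanted, it can be suppressed by appending "- 1" to the formula. A short sketch on synthetic data (illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 3.0 * df["x"] + rng.normal(size=100)

with_const = smf.ols("y ~ x", data=df).fit()    # intercept included by default
no_const = smf.ols("y ~ x - 1", data=df).fit()  # '- 1' drops the intercept

print(with_const.model.exog_names)  # ['Intercept', 'x']
print(no_const.model.exog_names)    # ['x']
```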


Wage Regression: using smf.ols()

The result1.summary() function prints the following output.


In [175]: print(result1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.345
Method:                 Least Squares   F-statistic:                     106.0
Date:                Wed, 19 Jun 2019   Prob (F-statistic):           9.45e-90
Time:                        15:34:18   Log-Likelihood:                -3314.2
No. Observations:                1000   AIC:                             6640.
Df Residuals:                     994   BIC:                             6670.
Df Model:                           5
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -8.5786      1.161     -7.388      0.000     -10.857      -6.300
female         -3.0985      0.424     -7.313      0.000      -3.930      -2.267
nonwhite       -1.6072      0.603     -2.664      0.008      -2.791      -0.423
unionmember     0.8212      0.583      1.408      0.159      -0.323       1.966
education       1.4983      0.075     19.948      0.000       1.351       1.646
experience      0.1697      0.018      9.197      0.000       0.133       0.206
==============================================================================
Omnibus:                      370.409   Durbin-Watson:                   1.899
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2099.721
Skew:                           1.598   Prob(JB):                         0.00
Kurtosis:                       9.339   Cond. No.                         141.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


Wage Regression: using smf.ols()

We can use print(dir(result1)) and print(dir(result1.model)) to see the
available attributes.


Wage Regression: attributes

Some important attributes are listed in the following table.


Table 1: Some important attributes
nobs                returns the number of observations used in the estimation
params              returns the estimated parameters (a pandas Series)
resid               returns the residuals (a pandas Series)
predict             returns predicted values (an array)
model.exog          returns the exogenous variables (an array)
model.exog_names    returns the names of the exogenous variables (a list)
model.endog / model.endog_names   returns the endogenous variable values / name
model.loglike       returns the log-likelihood function evaluated at params
rsquared / rsquared_adj           returns the unadjusted / adjusted R²

Try the following:


# Some attributes
result1.nobs
result1.params
result1.resid
result1.model.endog_names
result1.model.exog_names
result1.rsquared
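These attributes satisfy simple identities that are worth checking once. A sketch on synthetic data (illustrative only, not the wage data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=50)
res = smf.ols("y ~ x", data=df).fit()

print(res.nobs)         # 50.0, the number of observations
print(len(res.params))  # 2, one coefficient each for Intercept and x
# Fitted values plus residuals reproduce the dependent variable
fitted = res.predict()
print(np.allclose(fitted + res.resid, res.model.endog))  # True
```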


Wage Regression: using smf.ols()


Run some alternative models.

## Run some alternative models
# Add squared experience as an exogenous variable
model2 = smf.ols(formula='wage ~ female + nonwhite + unionmember + education +\
experience + I(experience**2)', data=wage_data)
result2 = model2.fit()
result2.summary()
# Normality of the residuals
JB, JBpv, Skew, Kurtosis = sms.jarque_bera(result2.resid)
print('Test statistic is', JB, 'with a p-value of', JBpv)
# Heteroskedasticity tests
In [143]: test = sms.het_breuschpagan(result2.resid, result2.model.exog)
     ...: print('LM statistic is', np.round(test[0], 3), 'with a p-value of',\
     ...:       np.round(test[1], 3))
LM statistic is 46.249 with a p-value of 0.0
# Using the option cov_type='HC0': White's heteroskedasticity-consistent
# covariance estimator.
result3 = model2.fit(cov_type='HC0')
print(result3.summary())
# Compare standard errors
In [181]: sde = pd.concat([result2.bse, result3.bse], axis=1)
In [182]: sde.columns = ['No option', 'HC0']
In [183]: print(sde)
                    No option       HC0
Intercept            1.184103  1.286034
female               0.419360  0.418680
nonwhite             0.596781  0.488253
unionmember          0.577060  0.490091
education            0.075091  0.095138
experience           0.061721  0.060487
I(experience ** 2)   0.001389  0.001464
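Note that cov_type='HC0' changes only the covariance estimator, not the coefficient estimates themselves. A sketch on synthetic heteroskedastic data (illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data whose error variance grows with |x|
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(size=500)})
df["y"] = 1.0 + 2.0 * df["x"] + np.abs(df["x"]) * rng.normal(size=500)

model = smf.ols("y ~ x", data=df)
res_nonrobust = model.fit()
res_hc0 = model.fit(cov_type="HC0")

# Same point estimates, different standard errors
print(np.allclose(res_nonrobust.params, res_hc0.params))  # True
print(res_nonrobust.bse.values, res_hc0.bse.values)
```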


Wage Regression: using sm.OLS()

Next, we describe how to use sm.OLS() for regression analysis.

## An alternative approach based on sm.OLS
# Define endogenous and exogenous variables
wage_data['const'] = 1  # add a constant column to wage_data
y = wage_data['wage']
X = wage_data[['const', 'female', 'nonwhite', 'unionmember', 'education',\
               'experience']]
model1 = sm.OLS(endog=y, exog=X)
type(model1)
# We need to use .fit() to obtain parameter estimates
result1 = model1.fit()
type(result1)
# We now have the fitted regression model stored in result1
# To view the OLS regression results, we can call the .summary() method
result1.summary()
# Add experience**2 to X and form a new model
X['I(experience**2)'] = wage_data['experience']**2
model2 = sm.OLS(endog=y, exog=X)
result2 = model2.fit()
result2.summary()
# Use cov_type='HC0'
result3 = model2.fit(cov_type='HC0')
result3.summary()

Note that we need to explicitly specify the intercept term:
wage_data['const'] = 1 adds a column of ones to the dataframe wage_data.


Using summary_col() for reporting regression results

Next, we show how to use summary_col() to report regression results.


# Print results in table form
info_dict = {'R-squared': lambda x: x.rsquared,\
             'No. observations': lambda x: x.nobs}

regressors = ['const', 'female', 'nonwhite', 'unionmember', 'education', 'experience',\
              'I(experience**2)']

models = ['Model 1', 'Model 2', 'Robust Model']  # column labels for the table

results_table = summary_col(results=[result1, result2, result3],\
                            float_format='%0.3f', stars=True,\
                            model_names=models, info_dict=info_dict,\
                            regressor_order=regressors)
results_table.add_title('Table 1: Wage Regressions')  # add a title
results_table


Using summary_col() for reporting regression results

The resulting table is in the following form.


In [196]: results_table
Out [196]:
<class 'statsmodels.iolib.summary2.Summary'>
"""
Table 1: Wage Regressions
=================================================
                  Model 1    Model 2   Robust Model
-------------------------------------------------
const            -8.579***  -9.960***  -9.960***
                 (1.161)    (1.184)    (1.286)
female           -3.099***  -3.026***  -3.026***
                 (0.424)    (0.419)    (0.419)
nonwhite         -1.607***  -1.553***  -1.553***
                 (0.603)    (0.597)    (0.488)
unionmember       0.821      0.741      0.741
                 (0.583)    (0.577)    (0.490)
education         1.498***   1.446***   1.446***
                 (0.075)    (0.075)    (0.095)
experience        0.170***   0.452***   0.452***
                 (0.018)    (0.062)    (0.060)
I(experience**2)            -0.007***  -0.007***
                            (0.001)    (0.001)
R-squared         0.3478     0.3626     0.3626
No. observations  1000.0000  1000.0000  1000.0000
=================================================
Standard errors in parentheses.
* p<.1, ** p<.05, *** p<.01
"""
