
Outline Regression

Outline: Regression analysis in Python


1 Introduction to statsmodels module
2 Ordinary least squares (OLS) regression
3 Presenting regression results


Regression analysis

The statsmodels module provides a large range of cross-sectional models as
well as some time-series models.
The documentation is available at
http://www.statsmodels.org/stable/index.html.
We will use the statsmodels.api and statsmodels.formula.api modules to
run regressions.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

There are two options for running OLS regressions: (i) smf.ols() and (ii)
sm.OLS(). In the first option, statsmodels allows users to fit statistical
models using R-style formulas.


Wage Regression

The data set is wage1000.csv with headers.

# Importing data
# Use read_csv() to import data
wage_data = pd.read_csv('wage1000.csv')
type(wage_data)
wage_data.head()

In [114]: wage_data.columns
Out [114]:
Index(['wage', 'female', 'nonwhite', 'unionmember', 'education', 'experience',
       'age', 'const'],
      dtype='object')

It is wage data with 1000 observations from the US Bureau of the Census Current
Population Survey, March 1995.
The underlying population is the employed labor force, age 18-65. The
variables are as follows:
1 wage (hourly wage)
2 female (1 = worker is female)
3 nonwhite (1 = worker is non-white)
4 unionmember (1 = worker is a union member)
5 education (years of education)
6 experience (years of work experience)
7 age (age in years)


Wage Regression: using smf.ols()

Our goal is to run the following linear regression model

wage = β0 + β1 female + β2 nonwhite + β3 unionmember + β4 education
       + β5 experience + ε,

where the β's are unknown coefficients and ε is the disturbance term.

The smf module hosts many of the same functions found in the sm module
(e.g. OLS, GLM). Use dir(smf) to list available models.
model1 = smf.ols(formula='wage ~ female + nonwhite + unionmember + education +\
experience', data=wage_data)
# We need to use .fit() to obtain parameter estimates
result1 = model1.fit()
# We now have the fitted regression model stored in result1
# To view the OLS regression results, we can call the .summary() method
result1.summary()

Note that we do not need to specify an intercept term in smf.ols(). The
function will include an intercept term by default.
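If the intercept is not wanted, it can be suppressed by appending "- 1" to the formula. A short sketch on synthetic data (illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 3.0 * df["x"] + rng.normal(size=100)

with_const = smf.ols("y ~ x", data=df).fit()    # intercept included by default
no_const = smf.ols("y ~ x - 1", data=df).fit()  # '- 1' drops the intercept

print(with_const.model.exog_names)  # ['Intercept', 'x']
print(no_const.model.exog_names)    # ['x']
```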


Wage Regression: using smf.ols()

The result1.summary() function prints the following output.


In [175]: print(result1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.345
Method:                 Least Squares   F-statistic:                     106.0
Date:                Wed, 19 Jun 2019   Prob (F-statistic):           9.45e-90
Time:                        15:34:18   Log-Likelihood:                -3314.2
No. Observations:                1000   AIC:                             6640.
Df Residuals:                     994   BIC:                             6670.
Df Model:                           5
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -8.5786      1.161     -7.388      0.000     -10.857      -6.300
female         -3.0985      0.424     -7.313      0.000      -3.930      -2.267
nonwhite       -1.6072      0.603     -2.664      0.008      -2.791      -0.423
unionmember     0.8212      0.583      1.408      0.159      -0.323       1.966
education       1.4983      0.075     19.948      0.000       1.351       1.646
experience      0.1697      0.018      9.197      0.000       0.133       0.206
==============================================================================
Omnibus:                      370.409   Durbin-Watson:                   1.899
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2099.721
Skew:                           1.598   Prob(JB):                         0.00
Kurtosis:                       9.339   Cond. No.                         141.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


Wage Regression: using smf.ols()

We can use print(dir(result1)) and print(dir(result1.model)) to see the
available attributes.


Wage Regression: attributes

Some important attributes are listed in the following table.


Table 1: Some important attributes
nobs                returns the number of observations used in the estimation
params              returns the estimated parameters (a pandas Series)
resid               returns the residuals (a pandas Series)
predict             returns predicted values (an array)
model.exog          returns the exogenous variables (an array)
model.exog_names    returns the names of the exogenous variables (a list)
model.endog / model.endog_names   returns the endogenous variable values / name
model.loglike       returns the log-likelihood function evaluated at params
rsquared / rsquared_adj           returns the unadjusted / adjusted R²

Try the following:


# Some attributes
result1.nobs
result1.params
result1.resid
result1.model.endog_names
result1.model.exog_names
result1.rsquared
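These attributes satisfy simple identities that are worth checking once. A sketch on synthetic data (illustrative only, not the wage data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=50)
res = smf.ols("y ~ x", data=df).fit()

print(res.nobs)         # 50.0, the number of observations
print(len(res.params))  # 2, one coefficient each for Intercept and x
# Fitted values plus residuals reproduce the dependent variable
fitted = res.predict()
print(np.allclose(fitted + res.resid, res.model.endog))  # True
```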


Wage Regression: using smf.ols()


Run some alternative models.

## Run some alternative models
# Add squared experience as an exogenous variable
model2 = smf.ols(formula='wage ~ female + nonwhite + unionmember + education +\
experience + I(experience**2)', data=wage_data)
result2 = model2.fit()
result2.summary()
# Normality of the residuals
JB, JBpv, Skew, Kurtosis = sms.jarque_bera(result2.resid)
print('Test statistic is', JB, 'with a p-value of', JBpv)
# Heteroskedasticity tests
In [143]: test = sms.het_breuschpagan(result2.resid, result2.model.exog)
     ...: print('LM statistic is', np.round(test[0], 3), 'with a p-value of',\
     ...:       np.round(test[1], 3))
LM statistic is 46.249 with a p-value of 0.0
# Using the option cov_type='HC0': White's heteroskedasticity-consistent
# covariance estimator.
result3 = model2.fit(cov_type='HC0')
print(result3.summary())
# Compare standard errors
In [181]: sde = pd.concat([result2.bse, result3.bse], axis=1)
In [182]: sde.columns = ['No option', 'HC0']
In [183]: print(sde)
                    No option       HC0
Intercept            1.184103  1.286034
female               0.419360  0.418680
nonwhite             0.596781  0.488253
unionmember          0.577060  0.490091
education            0.075091  0.095138
experience           0.061721  0.060487
I(experience ** 2)   0.001389  0.001464
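Note that cov_type='HC0' changes only the covariance estimator, not the coefficient estimates themselves. A sketch on synthetic heteroskedastic data (illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data whose error variance grows with |x|
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(size=500)})
df["y"] = 1.0 + 2.0 * df["x"] + np.abs(df["x"]) * rng.normal(size=500)

model = smf.ols("y ~ x", data=df)
res_nonrobust = model.fit()
res_hc0 = model.fit(cov_type="HC0")

# Same point estimates, different standard errors
print(np.allclose(res_nonrobust.params, res_hc0.params))  # True
print(res_nonrobust.bse.values, res_hc0.bse.values)
```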


Wage Regression: using sm.OLS()

Next, we describe how to use sm.OLS() for regression analysis.

## An alternative approach based on sm.OLS
# Define endogenous and exogenous variables
wage_data['const'] = 1  # add a constant column to wage_data
y = wage_data['wage']
X = wage_data[['const', 'female', 'nonwhite', 'unionmember', 'education',\
               'experience']]
model1 = sm.OLS(endog=y, exog=X)
type(model1)
# We need to use .fit() to obtain parameter estimates
result1 = model1.fit()
type(result1)
# We now have the fitted regression model stored in result1
# To view the OLS regression results, we can call the .summary() method
result1.summary()
# Add experience**2 to X and form a new model
X['I(experience**2)'] = wage_data['experience']**2
model2 = sm.OLS(endog=y, exog=X)
result2 = model2.fit()
result2.summary()
# Use cov_type='HC0'
result3 = model2.fit(cov_type='HC0')
result3.summary()

Note that we need to explicitly specify the intercept term:
wage_data['const'] = 1 adds a column of ones to the dataframe wage_data.


Using summary_col() for reporting regression results

Next, we show how to use summary_col() to report regression results.


# Print results in table form
info_dict = {'R-squared': lambda x: x.rsquared,\
             'No. observations': lambda x: x.nobs}

regressors = ['const', 'female', 'nonwhite', 'unionmember', 'education', 'experience',\
              'I(experience**2)']

models = ['Model 1', 'Model 2', 'Robust Model']  # column labels for the table

results_table = summary_col(results=[result1, result2, result3],\
                            float_format='%0.3f', stars=True,\
                            model_names=models, info_dict=info_dict,\
                            regressor_order=regressors)
results_table.add_title('Table 1: Wage Regressions')  # add a title
results_table


Using summary_col() for reporting regression results

The resulting table is in the following form.


In [196]: results_table
Out [196]:
<class 'statsmodels.iolib.summary2.Summary'>
"""
Table 1: Wage Regressions
=================================================
                  Model 1    Model 2   Robust Model
-------------------------------------------------
const            -8.579***  -9.960***  -9.960***
                 (1.161)    (1.184)    (1.286)
female           -3.099***  -3.026***  -3.026***
                 (0.424)    (0.419)    (0.419)
nonwhite         -1.607***  -1.553***  -1.553***
                 (0.603)    (0.597)    (0.488)
unionmember       0.821      0.741      0.741
                 (0.583)    (0.577)    (0.490)
education         1.498***   1.446***   1.446***
                 (0.075)    (0.075)    (0.095)
experience        0.170***   0.452***   0.452***
                 (0.018)    (0.062)    (0.060)
I(experience**2)            -0.007***  -0.007***
                            (0.001)    (0.001)
R-squared         0.3478     0.3626     0.3626
No. observations  1000.0000  1000.0000  1000.0000
=================================================
Standard errors in parentheses.
* p<.1, ** p<.05, *** p<.01
"""
