
Semester II

Application of Econometric Theory with STATA

STATA Instruction Manual

By

TANIMA BANERJEE

UNIVERSITY OF CALCUTTA

Email id: tanima.bnrj@gmail.com


Contents
Unit 1: Data Management

Importing data from Excel file

Joining data sets

Generating variables

Describing data

Tabulation and graphical presentation

Unit 2: Multiple Regression Analysis and hypothesis testing

T test

OLS estimation

Goodness of fit

Tests for heteroskedasticity

Tests for Multicollinearity


Unit 1: Data Management

1. Importing data from Excel

Open Stata → Go to the Menu Bar → Click on File → Select Import → Select Excel
Spreadsheet (*.xls;*.xlsx) → In the Import box, browse to the file name → Choose the option
'Import first row as variable names' → Select the worksheet, e.g. Sheet 1 → Click OK.
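Alternatively, the import can be done directly from the command window with import excel. A
minimal sketch, assuming the file is named wage_data.xlsx (an illustrative name) and the data are
in Sheet1 with variable names in the first row:

import excel "D:\Stata_work_files\wage_data.xlsx", sheet("Sheet1") firstrow clear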

2. Joining data sets

There are two ways of joining data sets. First, we can merge two data sets, where the data sets
are joined horizontally with one another. Second, we can append two data sets, where the data
sets are joined vertically with one another.

2.1 Merging

In case of merging, we horizontally join two data sets. When we merge two data sets, the number
of variables increases in the merged file, while the number of observations may remain the same.
Suppose we have two files, e.g. education_employees.dta and wage_employees.dta. The
education_employees.dta data file offers education-related information for a group of employees,
while the wage_employees.dta data file offers wage-related information for the same group of
employees. To join these two data sets, we can use the merge command in Stata. When we
merge two data sets, there should be one common identity variable in both files that provides an
identity number for each observation within the data sets. In our example, say psid (person serial
identity number) is the common variable in both files that offers the person serial number of each
employee within the data sets. Then, we can merge these two files using psid.

Merge two files using the following steps:

Open File 1, say education_employees.dta

Sort the data set in terms of psid, so type in the command box: sort psid

Save the sorted data set: click on the save icon in the tool bar

Open File 2, say wage_employees.dta


Sort the data set in terms of psid, so type in the command box: sort psid

Type in the command box: merge psid using “path and file name of File 1”

Suppose, the data sets are located in D drive and within the folder “Stata_work_files”, then, type:

merge psid using “D:\Stata_work_files\education_employees.dta”

To save the merged data file, click on File in the Menu Bar and select Save as option, and then,
within the Save Stata Data File window type a name for the merged file and click the Save
button.
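Note: the merge syntax above is the older style. In Stata 11 and later, the preferred syntax names
the match type explicitly and does not require the data sets to be sorted first; a sketch for a
one-to-one match on psid:

use "D:\Stata_work_files\wage_employees.dta", clear
merge 1:1 psid using "D:\Stata_work_files\education_employees.dta"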

2.2 Appending

In case of appending, we vertically join two data sets. When we append two data sets, the
number of variables may remain the same in the appended file, but the number of observations
increases. Suppose we have two files, e.g. wage_employees_rural.dta and
wage_employees_urban.dta. The wage_employees_rural.dta data file offers wage-related
information for employees located in rural areas, while the wage_employees_urban.dta data file
offers wage-related information for employees located in urban areas. To join these two data sets,
we can use the append command in Stata. Variable names should be the same in both files that
we intend to append. Append two files using the following steps:

Open File 1, say wage_employees_rural.dta

Type in the command box: append using “path and file name of File 2”

Suppose, the data sets are located in D drive and within the folder “Stata_work_files”, then, type:

append using “D:\Stata_work_files\wage_employees_urban.dta”

To save the appended data file, click on File in the Menu Bar and select the Save as option, and
then, within the Save Stata Data File window, type a name for the appended file and click the
Save button.
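The appended file can also be saved from the command window with save; the file name below is
illustrative:

save "D:\Stata_work_files\wage_employees_all.dta", replace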
3. Generating Variables and Renaming Variables

In a given data set saved in Stata file format, we can generate new variables if they are required
for estimation purposes. Suppose in a given data set, say the wage.dta data file, we have a
variable named ‘age’, and we need to have squared values of age of the respondents for running
a regression. Then, we need to create a new variable ‘age_square’. To generate ‘age_square’,
type:

g age_square = age*age

We can also generate log variables; note that Stata's log() function returns the natural
logarithm. For example, suppose we need log values of wage, and there is a variable named
'wage' within the given data set. Then, to generate a new variable, 'log_wage', type:

g log_wage = log(wage)

We can also rename existing variables in a given data set. When we rename a variable, the
variable's values remain the same; only its name changes. Suppose there is a variable named
'wage_total' in the data set and we intend to rename it as 'wage'; then type:

rename wage_total wage
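Before moving on, note that generate can also use logical conditions to create indicator
variables. A minimal sketch, assuming we want a hypothetical indicator 'adult' for respondents
aged 18 or above:

g adult = (age >= 18) if !missing(age)
* the if-qualifier guards against missing ages, which Stata otherwise treats as very large values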

4. Describing data

To learn more about a data set saved in Stata in .dta format, we can type the following command:

describe

This will provide a non-statistical description of the data set. Suppose we have a wage data
file, wage.dta, with 25,521 observations offering wage information for 25,521 individuals and
also containing a few more variables. Then, by typing describe in the command window, we get
the following output:
. describe

Contains data from H:\Data_Practical_Sem_II_2017\wage.dta


obs: 25,521
vars: 47 9 Jan 2017 13:05
size: 3,904,713

storage display value


variable name type format label variable label

hhd_size float %9.0g


hhd_type byte %8.0g
hhd_religion float %8.0g
relation_head byte %8.0g
sex byte %8.0g
age float %9.0g
marital_stat byte %8.0g
gen_edu float %9.0g
tech_edu float %9.0g
wage_cash float %9.0g
wage_kind float %9.0g
wage_total float %9.0g
prin_us_act_s~t float %9.0g
prin_us_in~1dig str1 %9s
prin_us_in~2dig str2 %9s
prin_us_indus~e str5 %9s
prin_us_oc~1dig str1 %9s
prin_us_oc~2dig str3 %9s
prin_us_occup~e str3 %9s
subsi_us_act_~t float %9.0g
hhid str9 %9s
psid str12 %12s
sector byte %8.0g
state float %9.0g
whether_regul~i byte %8.0g
hhd_gr float %8.0g
hhd_land_owned float %9.0g
total_hhd_lan~d float %9.0g
total_hhd_lan~i float %9.0g
mpce float %9.0g
state_region float %9.0g
year float %9.0g
district float %9.0g
enter_type_prin byte %8.0g
no_worker_prin byte %8.0g
whether_union byte %8.0g
member_union float %8.0g
weight float %9.0g
enterprise_type byte %8.0g
no_workers_en~e byte %8.0g
type_job_cont~t byte %8.0g
social_sec_be~t byte %8.0g
voc_training float %9.0g
hhd_land_owbed float %9.0g
age_sqaure float %9.0g
log_wage float %9.0g
d_female float %9.0g
According to this result, the data set contains 25,521 observations and 47 variables, where some
variables are non-numeric (storage type: str), e.g. psid and the prin_us_* codes, and the rest are
numeric (storage type: byte, float), such as sex, age, hhd_religion, wage_total, etc.

In a given dataset, if there are too many variables, we can drop a few variables if they are not
needed for our estimation purposes. For example, in the wage.dta data file, if we don’t need a
few variables, like relation_head, hhd_land_owned and mpce, for some estimation purpose, then
we can just drop them from the data set by using the following command:

drop relation_head hhd_land_owned mpce

Alternatively, for any particular estimation, if we want to keep only a few variables, for example
log_wage, age and sex only, then we can keep only these three variables and drop the others by
using the following command:

keep log_wage age sex
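If the dropped variables may be needed again later in the same session, the preserve and restore
commands offer a convenient safeguard; a sketch:

preserve
keep log_wage age sex
* ... run the estimation on the reduced data set ...
restore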

5. Tabulation
Using a given data set saved in .dta format, we can generate frequency tables for categorical
variables and descriptive statistics for continuous variables.

To generate a simple frequency table for a categorical variable, type:

tabulate variable_name

For example, if we want to get the frequency table for sex in the dataset wage.dta, type:

tabulate sex

We can get a frequency table like the following:

. tabulate sex

sex Freq. Percent Cum.

1 13,006 50.96 50.96


2 12,515 49.04 100.00

Total 25,521 100.00

In the table above, Sex = 1 stands for male and Sex = 2 stands for female, and we find that 50.96
percent of the individuals in the data set are male and 49.04 percent are female.
We can also generate a frequency table for more than one categorical variable. For example, if
we want to get a frequency table for sex and hhd_gr (household group, which takes the four
values 1, 2, 3 and 9, representing four groups), then type:

tabulate sex hhd_gr

We will get a frequency table like the following:

. tabulate sex hhd_gr

hhd_gr
sex 1 2 3 9 Total

1 504 3,258 1,387 7,857 13,006


2 494 3,116 1,275 7,630 12,515

Total 998 6,374 2,662 15,487 25,521
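The two-way table can also report percentages: the standard row and col options of tabulate add
row and column percentages to each cell. Type:

tabulate sex hhd_gr, row col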

For continuous variables, we can generate descriptive statistics by using the summarize or
tabstat command.

Use of the summarize command: The command summarize helps us get a simple numerical
summary of all the variables in the data set. It calculates the mean, standard deviation, minimum
and maximum values. Using our data set, wage.dta, and typing the command

summarize

in the command window, we get the following output:

Variable Obs Mean Std. Dev. Min Max

hhd_size 25,521 4.968732 2.34132 1 23


hhd_type 25,513 2.597852 1.817047 1 9
hhd_religion 25,521 1.269425 0.564959 1 9
relation_h~d 25,519 3.485756 2.043578 1 9
sex 25,521 1.49038 0.499917 1 2

age 25,521 30.87167 19.49218 0 110


marital_stat 25,515 1.640447 0.601579 1 4
gen_edu 25,488 5.828508 3.378419 1 13
tech_edu 25,519 1.096046 0.85878 1 12
wage_cash 4,120 1966.407 2580.637 0 36000
wage_kind 978 121.6902 162.2977 0 2100
wage_total 4,120 1995.294 2573.5 30 36000
prin_us_ac~t 25,521 69.94083 32.72926 11 99
~c_code_1dig 0
~c_code_2dig 0

prin_us_in~e 0
~o_code_1dig 0
~o_code_2dig 0
prin_us_oc~e 0
subsi_us_a~t 0

hhid 0
psid 0
sector 25,521 1.401748 0.490261 1 2
state 25,521 19 0 19 19
whether_re~i 0

hhd_gr 25,521 6.313036 3.358038 1 9


hhd_land~ned 22,506 217.3888 489.4045 0 6360
total_hhd_~d 24,695 202.2302 635.1051 0 34197
total_hhd_~i 7,434 552.4342 1008.58 0 33388
mpce 25,521 7544.566 6421.782 235 109815

state_region 25,521 193.1337 1.253593 191 195


year 25,521 2011 0 2011 2011
district 25,521 11.044 5.049557 1 19
enter_type~n 0
no_worker_~n 0

whether_un~n 24,577 1.96452 1.43207 1 9


member_union 7,053 1.174394 0.379475 1 2
weight 25,521 3300.049 4780.511 4.1 74753.24
enterprise~e 6,940 2.606052 2.441857 1 9
no_workers~e 6,926 2.152469 2.038297 1 9

type_job_c~t 3,577 2.023204 1.40081 1 4


social_sec~t 3,578 7.329514 1.55547 1 9
voc_training 16,826 6.563057 1.244437 1 7
hhd_land~bed 0
In this table, categorical variables are also included. But, in practice, mean values of these
categorical variables make no sense. So, to get a numerical summary for the continuous
variables only, we just need to type the variable names after summarize in the command box. For
example, if we want to get a numerical summary for the variables age and wage_total, type:

summarize age wage_total

We will get a table like the following:

Variable Obs Mean Std. Dev. Min Max

age 5547 33.68505 12.28535 15 59


wage_total 1321 2466.345 2395.037 160 17500

Now, including the option detail offers additional statistics, including percentiles and the median. Type the command

summarize age wage_total, detail

in the command window. We will get a table like this:


. summarize age wage_total, detail

age

Percentiles Smallest
1% 1 0
5% 3 0
10% 7 0 Obs 25,521
25% 15 0 Sum of Wgt. 25,521

50% 29 Mean 30.87167


Largest Std. Dev. 19.49218
75% 45 98
90% 59 99 Variance 379.9452
95% 65 99 Skewness .4738753
99% 79 110 Kurtosis 2.505431

wage_total

Percentiles Smallest
1% 100 30
5% 229 35
10% 350 40 Obs 4,120
25% 625 40 Sum of Wgt. 4,120

50% 910 Mean 1995.294


Largest Std. Dev. 2573.5
75% 2250 24500
90% 5500 25000 Variance 6622901
95% 7000 27000 Skewness 3.301218
99% 11200 36000 Kurtosis 22.42779

We can also get summary statistics for a continuous variable for different sub-groups. For
example, we can get summary statistics for the variable age by sex. If we want to get
summary statistics of age for males, then type

summarize age if sex == 1

in the command box. We will get a table like the following:

. summarize age if sex == 1

Variable Obs Mean Std. Dev. Min Max

age 13,006 30.88059 19.57041 0 95
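As an alternative to the if qualifier, the bysort prefix produces the same summary for every
group at once:

bysort sex: summarize age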

Use of tabstat command: We can use tabstat to generate descriptive statistics for
continuous variables. For a single variable, the command should be like the following:

tabstat variable_name, stat(mean median sd var count range min max)

Example 1: If we intend to find out descriptive statistics for age, then type:

tabstat age, stat(mean median sd var count range min max)

We will get the output as follows:

. tabstat age, stat(mean median sd var count range min max)

variable mean p50 sd variance N range min max

age 30.87167 29 19.49218 379.9452 25521 110 0 110

Example 2: If we intend to find out descriptive statistics for more than one variable, say age and
wage_total, then type:

tabstat age wage_total, stat(mean median sd var count range min max)

We will get a table as the following:


. tabstat age wage_total, stat(mean median sd var count range min max)

stats age wage_t~l

mean 30.87167 1995.294


p50 29 910
sd 19.49218 2573.5
variance 379.9452 6622901
N 25521 4120
range 110 35970
min 0 30
max 110 36000

Example 3: If we intend to find out descriptive statistics for a set of continuous variables, say age
and wage_total, for different subgroups defined based on some categorical variable, say sex, then
type:

tabstat age wage_total, stat(mean median sd var count range min max) by(sex)

We will get the following output table:


. tabstat age wage_total, stat(mean median sd var count range min max) by(sex)

Summary statistics: mean, p50, sd, variance, N, range, min, max


by categories of: sex

sex age wage_t~l

1 30.88059 2023.056
29 910
19.57041 2531.863
383.0008 6410329
13006 2185
95 24970
0 30
95 25000

2 30.86241 1963.945
29 910
19.41134 2620.023
376.8 6864521
12515 1935
110 35960
0 40
110 36000

Total 30.87167 1995.294


29 910
19.49218 2573.5
379.9452 6622901
25521 4120
110 35970
0 30
110 36000

6. Graphical Presentation

Stata allows us to draw a large range of graphs by using the following basic command:

graphtype varname(s)

The type of graph is specified by graphtype within the command. We can choose from a wide
range of graphs based on our requirements: histograms, box plots, bar charts, scatter plots, etc.

Example 1: To make a histogram with wage_total, type:

histogram wage_total

Example 2: To make a scatter plot to see the relationship between two variables, type:

scatter variable_1 variable_2

This produces a scatter plot with variable_1 on the y-axis (vertical) and variable_2 on the x-axis
(horizontal).
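A couple of further sketches for the graph types mentioned above, grouping wage_total by sex:

graph box wage_total, over(sex)
graph bar (mean) wage_total, over(sex)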
Unit 2: Multiple Regression Analysis and hypothesis testing

1. T test

A t test is conducted for hypothesis-testing purposes. Generally, in regression analysis, a t test
is performed to check whether the regression coefficients are statistically significant at a
given level of significance (1%, 5% or 10%). We can also conduct a t test separately to
compare mean values of some variables across groups.

Example: We can perform a t test to compare mean values of one or more continuous variables
between two groups within a sample. Let us consider the case where we intend to compare mean
values of wage between male and female workers in the wage.dta file. We can do this by
performing the steps described below:

Open the specific file, say wage.dta

Go to Statistics in the menu bar

Select the option Summaries, tables, and tests

Select the option Classical tests of hypotheses

Select t test (mean-comparison test)

Select Two-sample using groups

The t-test window will open up. Select wage_total in the Variable name box and sex in the Group
variable name box

Click Ok
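Equivalently, the test can be run directly from the command window (this is the command echoed
at the top of the output below):

ttest wage_total, by(sex)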

We will get the output like the following:


. ttest wage_total, by(sex)

Two-sample t test with equal variances

Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

1 2,185 2023.056 54.16446 2531.863 1916.837 2129.275


2 1,935 1963.945 59.56136 2620.023 1847.134 2080.756

combined 4,120 1995.294 40.09364 2573.5 1916.689 2073.899

diff 59.11062 80.33979 -98.39878 216.62

diff = mean(1) - mean(2) t= 0.7358


Ho: diff = 0 degrees of freedom = 4118

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0


Pr(T < t) = 0.7690 Pr(|T| > |t|) = 0.4619 Pr(T > t) = 0.2310

Three alternative results are presented under three alternative hypotheses, where:

H0: diff = 0 stands for the null hypothesis that the mean wage of male workers is not statistically different
from the mean wage of female workers.

Ha: diff < 0 stands for the one-tailed alternative hypothesis that the mean wage of male workers is
smaller than the mean wage of female workers.

Ha: diff > 0 stands for the one-tailed alternative hypothesis that the mean wage of male workers is
greater than the mean wage of female workers.

Ha: diff != 0 stands for the two-tailed alternative hypothesis that the mean wage of male workers is
significantly different from the mean wage of female workers.

We can infer about the mean difference in wage_total by looking at the P values under each
alternative hypothesis. On the basis of the results presented above, we can infer that the mean wage of
male workers is not statistically significantly different from the mean wage of female workers at the 5
percent level of significance, or even at the 10 percent level, as the P value of the t statistic under
Ha: diff != 0 (0.4619) is greater than both 0.05 and 0.10. The one-tailed alternative hypotheses are
likewise not supported, so we fail to reject the null hypothesis.
Note: Stata allows us to do some additional exercises with the t-test. To be more specific, we can
restrict the t-test to some specific case, say a specific sector, industry, occupation, and so on.
For example, suppose we intend to compare mean values of wage between male and female workers only
in rural areas. Here, in our example, the variable sector takes the value 1 for rural areas. In this
case, we will repeat the same steps and then add a few more, as listed below (a command-line
equivalent follows the list):

Click on the by/if/in bar in the ttest window

Under Restrict observations option, type the restriction, say sector == 1

Click Ok
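The command-line equivalent simply adds the restriction as an if qualifier:

ttest wage_total if sector == 1, by(sex)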

2. OLS estimation

Let us consider the following regression model,

log_wage_i = β1 + β2 age_i + β3 sex_i + u_i

Given the data set, we can run the ordinary least squares (OLS) regression by typing the
following command:

regress dependent_variable independent_variable(s)

Here, log_wage is the dependent variable, and age and sex are the independent variables. Sex is a
categorical variable that consists of two categories, male and female. The easiest way to include
a categorical variable as an independent variable in a regression is by using the factor-variable
prefix 'i.'. By default, the first category (i.e. the lowest value) is used as the reference
group/category. Here, the variable sex assumes the value 1 for male and the value 2 for female.
Hence, if we add the prefix 'i.' to sex, then male will automatically be considered the reference
group. So, to perform OLS, we need to type:

regress log_wage age i.sex

We will get the regression output like the following:


. regress log_wage age i.sex

Source SS df MS Number of obs = 4,120


F(2, 4117) = 31.57
Model 68.0065914 2 34.0032957 Prob > F = 0.0000
Residual 4434.62284 4,117 1.0771491 R-squared = 0.0151
Adj R-squared = 0.0146
Total 4502.62943 4,119 1.09313655 Root MSE = 1.0379

log_wage Coef. Std. Err. t P>|t| [95% Conf. Interval]

age .0068111 .0008812 7.73 0.000 .0050834 .0085388


2.sex -.0581914 .0323988 -1.80 0.073 -.1217105 .0053276
_cons 6.836737 .03611 189.33 0.000 6.765941 6.907532

Alternatively, we can create a dummy variable from sex. For example, we can create a dummy
variable for female, say d_female (the detailed method of generating dummy variables will be
discussed in the next unit; a minimal sketch appears after the output below). Then, we need to
type the following command to run the least squares regression:

regress log_wage age d_female

We will get the result as the following:

. regress log_wage age d_female

Source SS df MS Number of obs = 4,120


F(2, 4117) = 31.57
Model 68.0065914 2 34.0032957 Prob > F = 0.0000
Residual 4434.62284 4,117 1.0771491 R-squared = 0.0151
Adj R-squared = 0.0146
Total 4502.62943 4,119 1.09313655 Root MSE = 1.0379

log_wage Coef. Std. Err. t P>|t| [95% Conf. Interval]

age .0068111 .0008812 7.73 0.000 .0050834 .0085388


d_female -.0581914 .0323988 -1.80 0.073 -.1217105 .0053276
_cons 6.836737 .03611 189.33 0.000 6.765941 6.907532
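As promised above, a minimal sketch of one way such a dummy could be generated:

g d_female = (sex == 2) if !missing(sex)
* equals 1 for female (sex == 2), 0 for male, and missing where sex is missing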

Although the commands are different, in both cases the results are the same, as can be seen from
the tables above, because both commands run the same regression.
In the regression output, the important things to take into account when interpreting the
regression result are the value of R-squared or adjusted R-squared, the P value of the F statistic
(Prob > F), the values of the coefficients (Coef.) and the P values of the t statistics of the
coefficients (P>|t|).

We need to look at R-squared and the F statistic to check the overall significance of the
regression model. It is often useful to compute a number that summarizes how well the OLS
regression line fits the data. We generally use R-squared to measure the goodness of fit of a
regression model. R-squared measures the fraction of the sample variation in the dependent
variable that is explained by the independent variables in the regression model. R-squared can
range from 0 to 1; the closer R-squared is to 1, the better the fit of the regression model. We
generally multiply the value of R-squared by 100 to present it in percentage form. However, for
cross-sectional analysis, a low R-squared does not necessarily mean that the OLS regression is
useless. In fact, with large cross-sectional data it is quite common to have a low value of
R-squared. In this case, to judge the overall significance of the regression model, we need to
check the F statistic.

It is possible that the explanatory variables are not at all useful in predicting the dependent
variable in the given data set. We can formulate this as the null hypothesis that all regression
parameters (except the intercept) are zero:

H0 : β2 = β3 = … = βk = 0

Under this null hypothesis, the true regression line does not depend on the chosen explanatory
variables at all. Under the alternative hypothesis, HA, at least one of the coefficients is
significantly different from zero. We can test this hypothesis through an F test. Unlike t tests,
which can only assess one regression coefficient at a time, the F test evaluates multiple
regression coefficients simultaneously. If the result of the F test in an estimated regression
model is statistically significant, then we can infer that the fit of the regression model is
better than the fit of an intercept-only model in which all other regression coefficients are
zero. In other words, if the F statistic is significant, then we reject the null hypothesis just
mentioned. This implies that there is at least one explanatory variable in our regression model
that can explain the dependent variable well. If the value of Prob > F in the regression output
obtained through running the regression in Stata is smaller than 0.05, then we can say that
the regression model is overall significant at the 5 percent level of significance, i.e. the
predictors can jointly significantly explain the dependent variable in the model.

However, the F test cannot tell us which predictors significantly influence the dependent
variable, i.e. which βi is significantly different from zero. For this we need t tests. From the
regression output presented above, we look at the P value of the t statistic to find out whether
βi is different from zero for each predictor. If the P value of the t statistic corresponding to a
particular predictor is less than 0.05, then we can infer that βi for that predictor is different
from zero at the 5 percent level of significance, i.e. that particular predictor significantly
influences the dependent variable. In the present example, we can find that the P value of the t
statistic corresponding to age is less than 0.05. Thus, we can infer that age significantly
influences the wage earnings of individuals. However, sex (d_female) is found to have no
significant impact on wage earnings, as the corresponding P value of the t statistic is greater
than 0.05.

Now, to find out the magnitude of the impact of a predictor on the dependent variable, we look at
the value of the regression coefficient (Coef.). We generally look at the coefficient values only
of significant predictors. Here, only age is significant. Since the dependent variable is in log
form, from the coefficient corresponding to age we can infer that a one-year increase in age
increases the wage earnings of individuals by about 0.68 percent, as the coefficient value is
.0068 (we multiply the coefficient value by 100 to get the percentage form), holding other
factors constant.

3. Tests for Heteroskedasticity


A standard non-graphical test for heteroskedasticity is the Breusch–Pagan test. The null
hypothesis is homoskedasticity. To conduct the Breusch–Pagan test, type:

estat hettest
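Note that estat hettest is a post-estimation command and must follow a regression; for example,
using the model from the previous section:

regress log_wage age i.sex
estat hettest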

If, in the test result, the P value of the chi2 statistic is greater than 0.05, then we fail to
reject the null hypothesis at the 5 percent level of significance (i.e. we accept
homoskedasticity), while we reject the null where the P value is less than 0.05. If the residuals
are found to be heteroskedastic, i.e. the null is rejected, then we may have wrong estimates of
the standard errors of the coefficients and therefore of their t values.

To deal with this problem, we can use heteroskedasticity-robust standard errors. Hence, we
should use the option robust in the regression command. For example, we should write:

regress log_wage age d_female, robust

4. Tests for Multicollinearity


An important assumption of the multiple regression model is that the independent variables in
the regression model are not perfectly collinear, i.e. one regressor should not be a linear
function of another regressor. In the presence of multicollinearity, the standard errors will be
inflated. In Stata, after running a regression, we can check for multicollinearity by typing:

vif

A VIF greater than 10 (equivalently, a tolerance 1/VIF below 0.10) indicates trouble. We can drop the problematic variable.
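Like hettest, vif is a post-estimation command (estat vif is the modern synonym); for example:

regress log_wage age d_female
estat vif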

However, a high VIF is not a problem and can safely be ignored in the following cases:

• If the variable for which we find a very high VIF is just a control variable in the
regression model, and the VIFs of the variables of interest are not high.

• If the high VIF is caused by the inclusion of powers and/or products of other independent
variables as separate independent variables in the model.

• If the variables with high VIFs are categorical variables with three or more categories.
