By
TANIMA BANERJEE
UNIVERSITY OF CALCUTTA
Generating variables
Describing data
T test
OLS estimation
Goodness of fit
Open Stata. Go to the Menu Bar, click on File, select Import, and select Excel Spreadsheet (*.xls, *.xlsx). In the Import box, browse to the file name, choose the option ‘Import first row as variable names’, select the worksheet, e.g. Sheet1, and click OK.
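The same import can also be done from the command line; the following is a sketch, where the file path, file name and sheet name are only illustrative:

```stata
* Command-line equivalent of the menu steps above
* (file path and sheet name here are hypothetical)
import excel "D:\Stata_work_files\wage_data.xlsx", ///
    sheet("Sheet1") firstrow clear
```

The firstrow option plays the role of ‘Import first row as variable names’ in the dialog, and clear replaces any data currently in memory.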
2. Joining Data Sets
There are two ways of joining datasets. First, we can merge two datasets, where the datasets are joined horizontally with one another. Second, we can append two datasets, where the datasets are joined vertically with one another.
2.1 Merging
In case of merging, we horizontally join two data sets. When we merge two data sets, the number of variables increases in the merged file, while the number of observations may remain the same. Suppose we have two files, e.g. education_employees.dta and wage_employees.dta. The education_employees.dta data file offers education-related information for a group of employees, while the wage_employees.dta data file offers wage-related information for the same group of employees. To join these two datasets, we can use the merge command in Stata. When we merge two data sets, there should be one common identity variable in both files that offers an identity number for each observation within the data sets. In our example, say psid (person serial identity number) is the common variable in both files that offers the person serial number of each employee. Then, we can merge these two files using psid.
Sort the data set in terms of psid, so type in the command box: sort psid
Save the sorted data set: click on the save icon in the tool bar
Type in the command box: merge psid using “path and file name of File 1”
Suppose the data sets are located in the D drive within the folder “Stata_work_files”; then type:
merge psid using "D:\Stata_work_files\education_employees.dta"
To save the merged data file, click on File in the Menu Bar and select Save as option, and then,
within the Save Stata Data File window type a name for the merged file and click the Save
button.
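The steps above can be collected into a short do-file; the following is a sketch, assuming wage_employees.dta is the file in memory and both files sit in D:\Stata_work_files (in Stata 11 and later, merge 1:1 psid using … does the same job without requiring pre-sorted files):

```stata
* Sketch of the merge workflow described above
use "D:\Stata_work_files\wage_employees.dta", clear
sort psid
save "D:\Stata_work_files\wage_employees.dta", replace

* the using file must also be sorted by psid for this older syntax
merge psid using "D:\Stata_work_files\education_employees.dta"
tabulate _merge     // 3 = observation found in both files

save "D:\Stata_work_files\merged_employees.dta"
```

The _merge variable created by the command records, for each observation, whether it came from the master file only (1), the using file only (2), or both (3).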
2.2 Appending
In case of appending, we vertically join two data sets. When we append two data sets, the number of variables may remain the same in the appended file, but the number of observations increases. Suppose we have two files, e.g. wage_employees_rural.dta and wage_employees_urban.dta. The wage_employees_rural.dta data file offers wage-related information for employees located in rural areas, while the wage_employees_urban.dta data file offers wage-related information for employees located in urban areas. To join these two datasets, we can use the append command in Stata. Variable names should be the same in both files that we intend to append. Append two files using the following steps:
Type in the command box: append using “path and file name of File 2”
Suppose the data sets are located in the D drive within the folder “Stata_work_files”; then, with wage_employees_rural.dta open in memory, type:
append using "D:\Stata_work_files\wage_employees_urban.dta"
To save the appended data file, click on File in the Menu Bar and select the Save As option, and then, within the Save Stata Data File window, type a name for the appended file and click the Save button.
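Similarly, the append workflow as a do-file sketch (the output file name is only illustrative):

```stata
* Sketch of the append workflow described above
use "D:\Stata_work_files\wage_employees_rural.dta", clear
append using "D:\Stata_work_files\wage_employees_urban.dta"

* the appended file now stacks rural and urban employees vertically
save "D:\Stata_work_files\wage_employees_all.dta"
```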
3. Generating Variables and Renaming Variables
In a given data set saved in Stata file format, we can generate new variables if they are required
for estimation purposes. Suppose in a given data set, say the wage.dta data file, we have a
variable named ‘age’, and we need to have squared values of age of the respondents for running
a regression. Then, we need to create a new variable ‘age_square’. To generate ‘age_square’,
type:
g age_square = age*age
We can also generate log variables. For example, we need to have log values of wage. Suppose,
there is a variable named ‘wage’ within the given data set. Then to generate a new variable,
‘log_wage’, type:
g log_wage = log(wage)
We can also rename existing variables in a given data set. When we rename a variable, then
variable values remain same and the name of the variable only gets changed. Suppose, there is a
variable named ‘wage_total’ in the data set and we intend to rename it as ‘wage’; then type:
rename wage_total wage
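The commands of this section can be collected into one do-file sketch; here we assume the raw file carries only wage_total, so the rename comes before the log transformation:

```stata
* Sketch: generating and renaming variables in wage.dta
use "wage.dta", clear
rename wage_total wage
g age_square = age*age
g log_wage   = log(wage)   // log() returns missing where wage <= 0
```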
4. Describing data
To learn more about a data set saved in Stata in .dta format, we can type the following command:
describe
This will provide a non-statistical description of the data set. Suppose we have a wage data file, wage.dta, with 25,521 observations offering wage information for 25,521 individuals and also containing a few more variables. Then, typing describe in the command window lists the number of observations and variables, together with each variable’s name, storage type, display format and label.
. describe
(output omitted)
In a given dataset, if there are too many variables, we can drop a few variables if they are not
needed for our estimation purposes. For example, in the wage.dta data file, if we don’t need a
few variables, like relation_head, hhd_land_owned and mpce, for some estimation purpose, then
we can just drop them from the data set by using the following command:
drop relation_head hhd_land_owned mpce
Alternatively, for any particular estimation, if we want to keep only a few variables, for example log_wage, age and sex only, then we can keep only these three variables and drop the others by using the following command:
keep log_wage age sex
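If the dropped variables may be needed again later in the same session, preserve and restore offer a safer route; a sketch:

```stata
* Sketch: drop variables temporarily with preserve/restore
use "wage.dta", clear
preserve
keep log_wage age sex      // trimmed data for estimation
* ... run the estimations here ...
restore                    // the full set of variables comes back
```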
5. Tabulation
Using a given data set saved in .dta format, we can tabulate frequency tables for categorical
variables and descriptive statistics for continuous variables.
tabulate variable_name
For example, if we want to get the frequency table for sex in the dataset wage.dta, type:
tabulate sex
. tabulate sex

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     13,006       50.96       50.96
          2 |     12,515       49.04      100.00
------------+-----------------------------------
      Total |     25,521      100.00
In the table above, Sex = 1 stands for male and Sex = 2 stands for female, and we find that 50.96
percent individuals in the data set are male and 49.04 percent individuals are female.
Now, we can generate frequency table for more than one categorical variable as well. For
example, if we want to get a frequency table for sex and hhd_gr (household group, consisting of 4 values, 1, 2, 3 and 9, representing four groups), then type:
tabulate sex hhd_gr
This produces a two-way table with sex in the rows and the categories of hhd_gr (1, 2, 3, 9) in the columns, along with row and column totals.
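tabulate also accepts options controlling the cell contents; for instance, row and col add row and column percentages to the two-way table:

```stata
* Two-way table of sex against hhd_gr with percentages
tabulate sex hhd_gr, row col
```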
For continuous variables, we can generate descriptive statistics by using summarize or tabstat
command.
Use of summarize command: The command summarize helps to get a simple numerical
summary of all the variables in the data set. It calculates mean, standard deviation, minimum and
maximum values. Using our data set, wage.dta, and typing the command
summarize
we get the number of observations, mean, standard deviation, minimum and maximum for every variable in the data set; variables stored as strings (here the identifiers, e.g. hhid and psid) show zero observations. An excerpt of the output:

    Variable |     Obs        Mean   Std. Dev.    Min    Max
-------------+-----------------------------------------------
      sector |  25,521    1.401748   0.490261       1      2
       state |  25,521          19          0      19     19
Now, including the option detail offers additional statistics, including the median and other percentiles. Type the command
summarize age wage_total, detail
The output reports, for each variable, the percentiles alongside the smallest and largest values: age (25,521 observations) has a 1st percentile of 1, a 5th percentile of 3, a 10th percentile of 7 and a 25th percentile of 15, while wage_total (4,120 observations) has a 1st percentile of 100, a 5th percentile of 229, a 10th percentile of 350 and a 25th percentile of 625.
We can also get summary statistics for a continuous variable for different sub-groups. For
example, we can get summary statistics for the variable age for different sex. If we want to get
summary statistics of age for male, then type
summarize age if sex == 1
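To obtain the same summary for every group at once, rather than one if condition at a time, the by prefix can be used; a sketch:

```stata
* Summary statistics of age separately for each sex
bysort sex: summarize age
```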
Use of tabstat command: We can use tabstat to generate descriptive statistics for
continuous variables. For a single variable, the command should be like the following:
tabstat variable_name, stat(mean median sd var count range min max)
Example 1: If we intend to find out descriptive statistics for age, then type:
tabstat age, stat(mean median sd var count range min max)
Example 2: If we intend to find out descriptive statistics for more than one variable, say age and
wage_total, then type:
tabstat age wage_total, stat(mean median sd var count range min max)
Example 3: If we intend to find out descriptive statistics for a set of continuous variables, say age
and wage_total, for different subgroups defined based on some categorical variable, say sex, then
type:
tabstat age wage_total, stat(mean median sd var count range min max) by(sex)
Summary statistics: mean, p50, sd, variance, N, range, min, max
  by categories of: sex

     sex |       age  wage_total
---------+----------------------
       1 |  30.88059    2023.056
         |        29         910
         |  19.57041    2531.863
         |  383.0008     6410329
         |     13006        2185
         |        95       24970
         |         0          30
         |        95       25000
---------+----------------------
       2 |  30.86241    1963.945
         |        29         910
         |  19.41134    2620.023
         |     376.8     6864521
         |     12515        1935
         |       110       35960
         |         0          40
         |       110       36000
6. Graphical Presentation
Stata allows us to produce a large range of graphs by using the following basic command:
graphtype varname(s)
The type of graph is specified by graphtype within the command. We can choose from a wide
range of graphs based on the requirements: histogram, box plot, bar chart, scatter plot, etc.
Example 1: To make a histogram of a continuous variable, say wage_total, type:
histogram wage_total
Example 2: To make a scatter plot to see the relationship between two variables, type:
scatter variable_1 variable_2
This produces a scatter plot with variable_1 on the y-axis (vertical) and variable_2 on the x-axis (horizontal).
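A few more graphtype examples, as a sketch using the variables of the wage.dta file (the options shown are standard Stata graph options):

```stata
* Histogram with a normal density overlaid
histogram wage_total, normal

* Box plot of wage_total separately for each sex
graph box wage_total, over(sex)

* Scatter plot of log wage against age
scatter log_wage age
```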
Unit 2: Multiple Regression Analysis and hypothesis testing
1. T test
A t test is conducted for hypothesis testing purposes. Generally, in regression analysis a t test is performed to check whether the regression coefficients are statistically significant or not at a given level of significance (1%, 5% or 10%). Besides, we can also conduct a t test separately for comparing mean values of some variables across groups.
Example: We can perform a t test for comparing mean values of one or more continuous variables between two groups within a sample. Let us consider the case where we intend to compare mean values of wage between male and female workers in the wage.dta file. We can do this by performing the steps described below:
Go to the Menu Bar, click on Statistics, select Summaries, tables, and tests, then Classical tests of hypotheses, and then t test (mean-comparison test)
The t-test window will open up: select Two-sample using groups, put wage_total in the Variable name box and sex in the Group variable name box
Click OK
Stata reports the number of observations, mean, standard error, standard deviation and 95% confidence interval for each group, together with the t statistic and its p-values.
Three alternative results are presented under three alternative hypotheses, where:
H0: diff = 0 stands for the null hypothesis that the mean wage of male workers is not statistically different
from the mean wage of female workers.
Ha: diff < 0 stands for the one tailed alternative hypothesis that the mean wage of male workers is
smaller than mean wage of female workers.
Ha: diff >0 stands for the one tailed alternative hypothesis that the mean wage of male workers is greater
than mean wage of female workers.
Ha: diff != 0 stands for the two tailed alternative hypothesis that the mean wage of male workers is
significantly different from the mean wage of female workers.
We can infer about the mean difference in wage_total by looking at the p values reported under each hypothesis. On the basis of the results presented above, we can infer that the mean wage of male workers is not statistically significantly different from the mean wage of female workers at the 5 percent level of significance, or even at the 10 percent level, as the p value of the t statistic under Ha: diff != 0 is greater than 0.05 and even greater than 0.10. Consequently, we fail to reject the null hypothesis, and the one tailed alternative hypotheses are not supported either.
Note: Stata allows us to do some additional exercises with t-test. To be more specific, we can
restrict the t-test for some specific case, say for a specific sector, industry, occupation, and so on.
For example, we intend to compare mean values of wage between male and female workers only
in rural areas. Here, in our example, the variable sector takes the value 1 for rural areas. In this
case, we will repeat the same steps, and then add a few more steps. The additional steps are as follows:
In the t-test window, click on the by/if/in tab and type sector == 1 in the ‘If: (expression)’ box
Click OK
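The menu steps above have command-line equivalents; a sketch, using the variable names of the example:

```stata
* Mean-comparison t test of wage_total between the sexes
ttest wage_total, by(sex)

* The same test restricted to rural areas (sector == 1)
ttest wage_total if sector == 1, by(sex)
```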
2. OLS estimation
Given the data set, we can run the ordinary least squares (OLS) regression by typing the following command:
regress log_wage age i.sex
Here, log_wage is the dependent variable, and age and sex are independent variables. Sex is a
categorical variable that consists of two categories, male and female. The easiest way to include
a categorical variable as independent variable in a regression is by using the prefix ‘i’. By
default, the first category (or the lowest value) is used as the reference group/category. Here, the
variable sex assumes the value 1 for male and the value 2 for female. Hence, if we add the prefix
‘i’ to sex, then male will automatically be considered as the reference group. So, to perform OLS, we need to type:
regress log_wage age i.sex
Alternatively, we can create a dummy variable using sex. For example, we can create a dummy
variable for female, say d_female (the detail method of generation of dummy variables will be
discussed in the next unit). Then, we need to type the following command to run the least squares regression:
regress log_wage age d_female
Even though the commands are different, in both cases the results are the same, as both commands run the same regression.
In the regression output, the important things to take into account at the time of explaining the
regression result are the value of R-squared or adjusted R-squared, P value of F statistics (Prob
>F), the values of the coefficients (Coef.) and the P values of the t statistics of the coefficients
(P> |t|).
We need to look at R-squared and F statistics to check the overall significance of the regression
model. It is often useful to compute a number that summarizes how well the OLS regression line
fits the data set. We generally use R-squared to measure the goodness of fit of a regression
model. R-squared measures the fraction of the sample variation in the dependent variable that is
explained by the independent variables in the regression model. R-squared can range from 0 to 1.
The closer R-squared is to 1, the better the fit of the regression model. We generally multiply the
value of R-squared by 100 to present it in percentage form. However, for cross-sectional
analysis, a low R-squared does not necessarily mean that the OLS regression is useless. In
fact, in case of large cross-sectional data, it is quite common to have low value of R-squared. In
this case, to find out the overall significance of the regression model, we need to check the F
statistics.
It can be possible that the explanatory variables are not at all useful in predicting the dependent
variable in the given data set. We can formulate this as a null hypothesis that all regression
parameters (except the intercept) are zero.
H0 : β1 = β2 = … = βk = 0
Under this null hypothesis, the true regression line does not at all depend on the chosen
explanatory variables. Under the alternative hypothesis HA, at least one of the coefficients is significantly different from zero. We can test this hypothesis through an F test. Unlike t tests, which can only assess one regression coefficient at a time, the F test evaluates multiple regression coefficients simultaneously. If the F test in an estimated regression model is statistically significant, then we can infer that the fit of the regression model is better than the fit of an intercept-only model where all other regression coefficients are zero. In other words, if the F statistic is significant, then we reject the null hypothesis just mentioned above. It implies that
there is at least one explanatory variable in our regression model that can well explain the
dependent variable. If the value of Prob > F in the regression output obtained through
running the regression in Stata is smaller than 0.05, then we can say that the regression
model is overall significant at 5 percent level of significance, i.e. the predictors can jointly
significantly explain the dependent variable in the model.
However, the F test cannot tell us which predictors significantly influence the dependent variable, i.e. which βi is significantly different from zero. For this we need the t test. From the regression output, we need to look at the p value of the t statistic to find out whether βi is different from zero for each predictor. If the p value of the t statistic corresponding to a particular predictor is less than 0.05, then we can infer that βi for that predictor is different from zero at the 5 percent level of significance, i.e. that particular predictor significantly influences the dependent variable. In the present example, we find that the p value of the t statistic corresponding to age is less than 0.05. Thus, we can infer that age significantly influences the wage earnings of individuals. However, sex (d_female) is found to have no significant impact on wage earnings, as the corresponding p value of the t statistic is greater than 0.05.
Now, to find out the magnitude of impact of a predictor on the dependent variable, we need to
look at the value of regression coefficients (Coef.). We generally look at the value of
coefficients only of significant predictors. Here, only age is significant. Now, as the dependent variable is given in log form, from the value of the coefficient corresponding to age we can infer that a one year increase in age increases the wage earnings of individuals by 0.6 percent, as the coefficient value is .006 (we multiply the coefficient value by 100 to express it in percentage form), holding other factors constant.
To check whether the residuals are homoskedastic, we can perform the Breusch-Pagan test after running the regression by typing:
estat hettest
If, in the test result, the p value of chi2 is greater than 0.05, then we accept the null hypothesis (i.e. we accept homoskedasticity), while we reject the null where the p value is less than 0.05. If the residuals are found to be heteroskedastic, i.e. the null is rejected, then we may have wrong estimates of the standard errors of the coefficients and therefore of their t values.
To deal with this problem, we can use heteroskedasticity robust standard errors. Hence, we
should use the option robust in the regression command. For example, we should write:
regress log_wage age i.sex, robust
To check for multicollinearity among the predictors, after running the regression we can type:
vif
A VIF > 10 (or, equivalently, a tolerance 1/VIF < 0.10) indicates trouble. We can drop the problematic variable.
However, a high VIF is not a problem and can easily be ignored in the following cases:
If the variable for which we find a very high VIF is just a control variable in the
regression model, and the VIFs of the variables of interest are not high.
If high VIF is caused by the inclusion of powers and/or products of other independent
variables as separate independent variables in the model.
If the variables with high VIFs are categorical variables with three or more categories.
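As a sketch, the check can be run right after a regression; here we use the d_female dummy from the earlier example (estat vif is the post-estimation form of the same command, and bare vif is older syntax for the same check):

```stata
* Sketch: multicollinearity diagnostics after OLS
regress log_wage age d_female
estat vif          // reports VIF and 1/VIF for each predictor
```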