Guide to Stata
Niels-Hugo Blunch
and
Carol Hansen Karsch
Table of Contents
Stata Resources
Components of Stata
Stata Basics
  Opening a Stata File
    . use
    . set memory – no longer needed
  Command Syntax
  = Versus ==
  . set more off
  Making Stata Stop
  “Help” in Stata
Interactive vs. Batch Mode
  Interactive Mode
  Batch Mode Using Do Files
  Best of Both Approach
  Documenting A Stata Session – The Log - File
Importing Data into Stata
Descriptive Statistics
  . describe
  . codebook varname
  . summarize varname, detail
  . tabulate varname
  . tabulate varone vartwo
  . tabstat varone vartwo, etc.
  . histogram varname
  . graph box varname
  . tsline varname
  . correlate - Checking For Correlation
  . scatterplot
Verifying the Data
  . assert
Modifying & Creating Variables
  MODIFYING VARIABLES
    . destring
    . tostring
    . encode
    . rename
    . replace
    . recode
  CREATING NEW VARIABLES
    Rules for Variable Names
  . generate
    Addition/Subtraction
    Division
    Multiplication (Interaction)
    Exponentiation
    Lag
    Log
  . egen
  Dummy/Indicator Variables
    The Easy Way
    Second Method
    Third Method
  Dates
Modifying and Combining Files
  Modifying a File
    . sort
    . keep & . drop
    . duplicates list & duplicates drop
    . move and . order: reordering the list of variables
    . reshape (wide to long or long to wide)
  Combining Files
    . append
    . merge
  Time Series and Panel Data
    Time Series Operators
Regression Analysis
  OLS (Studenmund, pp. 35 – 40)
    Restricting Stata commands to observations with certain conditions satisfied
    Ensuring Consistency Among the Number of Observations in the various estimation samples
Check for Violations of the Classical Assumptions
  Omitted Variable Test (Studenmund, pp. 202 - 204)
  Comparing Alternative Specifications for Model (Studenmund, p. 204 - 206)
  Detection of Multicollinearity – VIF Score (Studenmund, p. 259 - 261)
  Detecting Serial Correlation
    Durbin-Watson D Test for First-Order Autocorrelation (Studenmund, p. 315)
    Dealing With Autocorrelation
      o GLS Using Cochrane-Orcutt Method (Studenmund, p. 322)
      o Correct Serial Correlation Using Newey-West Standard Errors (Studenmund, p. 324)
  Check for Homoscedasticity
    Park Test for Heteroskedasticity (Studenmund, p. 356)
    White test for Heteroskedasticity (Studenmund, p. 360)
    Weighted-Least Squares (WLS) (Studenmund, p. 363)
    Heteroskedasticity-corrected (HC) standard errors (Studenmund, p. 365)
Hypothesis Testing (Studenmund p.559)
  t-tests (Studenmund p.561)
  f-tests
    Tests for joint statistical significance of explanatory variables
    Testing for equality of coefficients
Making “Nice” Tables of Regression Results
  . outreg2
  The “estimates store” and “estimates table” Commands
Other Recently Used Regression Models
  Models with Binary Dependent Variables
    Logit with marginal effects reported
    Probit with marginal effects reported
    Multinomial logit
Appendix A
  Do File Example
Appendix B
  Importing an Excel file into Stata
Appendix C
  Downloading Data from ICPSR
  Using a Stata Setup File to Import ASCII Data into Stata
STATA RESOURCES
Stata’s website is the natural starting place – the following link leads you to a wealth of info on Stata, including
Tutorials, FAQs, and textbooks:
http://www.stata.com/links/resources.html
One of the most helpful links found on the Stata site is UCLA’s Stata Resources website. It includes a set of
learning modules:
http://www.ats.ucla.edu/stat/stata/modules/default.htm
COMPONENTS OF STATA:
Stata opens several windows when it launches. A window can be opened and made active either by clicking on it
with the mouse or by selecting it from the “Window” menu. Right-clicking on a window displays options for that
window.
1) Results are displayed in the Results window. The results can be searched by pressing Ctrl+F.
2) Commands are entered into the command line in the Command window.
3) Previously entered commands are displayed in the Review window. Click the magnifying glass to
filter the commands.
4) The variables for the current dataset are displayed in the Variables window.
a) The Variables window can be used to enter variable names into the Command window (simply
double click on a variable name)
b) Right-clicking on a variable in the window gives options to keep or drop selected variables
(Ctrl-click to select non-adjacent variables).
5) The Properties window contains two areas: one pertains to Variables and the other to the Data file as a
whole. The Variables section displays attributes of the variable highlighted in the Variables window.
If the area is unlocked, you can edit the attributes, e.g. rename the variable, create or edit a label,
change its format, apply a value label or add notes. The Data section allows you to label the dataset
and add notes. It also tells you the number of variables and the number of observations in the file and
the amount of memory being allocated to the data.
The Stata toolbar provides access to commonly used features:
1) Holding the mouse pointer over an icon displays a description of the icon.
2) A window can be opened by clicking its icon on the toolbar.
Menus and dialogs provide access to commands in Stata until the time comes when you want to directly enter the
command in the command window.
The working directory is displayed on the status bar in the lower left-hand corner.
Another important component of Stata is the Data Editor. It allows you to see the data that you are working on. It
is available in two different modes: the edit mode and the browse mode. The edit mode allows you to manipulate
the data. It can be accessed by clicking on the toolbar icon that looks like a pencil on a spreadsheet. All work
done while in the Data Editor that changes the data is documented by commands sent to the Review window.
These commands can be copied and pasted into a do file, ensuring that all your work is documented and
reproducible. They are also captured in an open log file.
The browse mode allows you to see the data, but the data cannot be changed. It can be accessed by clicking on the icon
in the toolbar that looks like a magnifying glass examining a spreadsheet. The browse mode is a convenient way to
keep an eye on the results of your work without the danger of accidentally editing something.
STATA BASICS:
In a lot of Stata documentation, commands are preceded by a “.” and that convention is followed in this guide. Do
NOT type the period!!!!
Stata is case sensitive. Commands are written in lower-case. Variable and file names can be in upper and/or lower
case, but consistency is necessary. Stata interprets GDP and gdp as two different variables that can co-exist in the
same file.
There are three basic variable formats in Stata: string, numeric and date. Strings are alphanumeric. They may
consist solely of numerals, but if a variable is declared a string, mathematical operations cannot be performed on
it. In Stata’s Data Editor, the default color for string variables is red. The default color for numeric data and dates
is black. Numeric variables with value labels, which display text such as “male” and “female” while the underlying
value is really a number, are blue.
. use
The easiest way to open a Stata file is to double click on it in the directory. If you are already in the program, the
menu system is the next easiest: click on the File menu, select Open and then browse to your file. Alternatively,
type the command: . use filename in the command window or in a batch file. Stata will not open a file while a
dataset with unsaved changes is in memory. Either save the first file, type . clear in the command line, or use the
clear option in the . use command.
. use filename, clear
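A minimal sketch of what the clear option does, written as a do file (so without the leading periods). It assumes the auto dataset that ships with Stata; the macro name mycopy is an arbitrary choice for illustration.

```stata
sysuse auto, clear               // load the example dataset shipped with Stata
tempfile mycopy                  // reserve a temporary file name
save `mycopy'                    // write a copy to disk

replace price = price + 1 in 1   // the data in memory now differ from the saved copy
capture use `mycopy'             // refused: the unsaved change would be lost
display "return code without clear: " _rc

use `mycopy', clear              // clear discards the data in memory, then loads the file
```

The capture prefix is only there so the do file keeps running after the refused . use command.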
COMMAND SYNTAX
Stata commands have a common syntax, which is written in the following manner:
. prefix_cmd: command varlist if exp in range, options
(Note: in a lot of Stata documentation, commands are preceded by a “.” and that convention is followed in this
guide. Do NOT type the period when entering a command!!!!)
Commands can be extremely simple. The command:
. list
instructs Stata to list the entire file that is currently in memory. No need to use a menu to issue that command!
Obviously, commands can be more complex and powerful. Following the standard syntax, the list command
presented below instructs Stata to break the list up into groups based on the rep78 variable, include only the mpg,
price and weight variables, and only list observations where the price is greater than 10,000. An in qualifier, such
as in 1/100 for observations 1 through 100, restricts a command to a range of observations, but it cannot be
combined with the by prefix. The option “clean” instructs the program to list the observations without a box
around the variables, which makes the listing nice and compact.
. by rep78, sort: list mpg price weight if price > 10000, clean
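The pieces of the syntax can be tried one at a time. The following sketch assumes the auto dataset shipped with Stata and is written in do-file form (no leading periods).

```stata
sysuse auto, clear

* in range: the first five observations only
list make price mpg in 1/5, clean

* if exp: only cars priced above 10,000
list make price mpg if price > 10000, clean

* by prefix with the sort option: one listing per repair-record group
by rep78, sort: list make price if price > 10000, clean
```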
The if qualifier is extremely useful and works with many commands. It restricts a command to the observations
for which an expression is true, most often a test for equality. To specify a particular string value, enclose it in
double quotes. Stata is case-sensitive and treats leading and trailing blanks as characters, so each of the following
is evaluated as a unique string. Also note the use of the double equal sign. See the next heading for details.
if gender == "male"
if gender == "Male"
if gender == " male"
if gender == "male "
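If such variants are only data-entry noise, one possible approach is to normalize the string before testing it. The sketch below builds a four-row toy dataset; the variable name gender_clean is hypothetical, and trim() and lower() are used because they work in older and newer Stata releases alike.

```stata
clear
input str8 gender
"male"
"Male"
" male"
"male "
end

* strip leading/trailing blanks and fold to lower case
generate gender_clean = lower(trim(gender))

count if gender == "male"         // exact match on the raw values
count if gender_clean == "male"   // match after normalizing
```

Keeping the cleaned values in a new variable leaves the raw data untouched for later checking.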
= VERSUS ==
A single equal sign, =, is used in assignment expressions. Use = with the . generate and . egen commands.
A double equal sign, ==, is a logical operator that returns a 0 if an expression is false and a 1 if it is true. Use ==
with an if qualifier to test for equality. If you could substitute a >= (greater than or equal to) operator or a <= (less
than or equal to) operator in place of the == operator, then the double equal sign is probably what you want to use.
Examples include:
. regress price mpg foreign if foreign == 1
. su price if region == "West"
(Note: su, sum, summ are all equally valid abbreviations for the . summarize command.)
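Both operators can appear in one command. A short sketch, assuming the auto dataset; the variable name is_foreign is made up for the example.

```stata
sysuse auto, clear

* "=" assigns; the "==" test on the right evaluates to 1 (true) or 0 (false)
generate byte is_foreign = (foreign == 1)

* "==" in an if qualifier restricts the command to matching observations
summarize price if is_foreign == 1
```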
“HELP” IN STATA
Stata’s “Help” utility is very good. Use it to see how to use commands. Typing . help commandname, for
example, . help summarize, makes Stata search through all the official Stata files. Examples given within the
help files use the common Stata command syntax. Stata 12 has a new Viewer window. It is now possible to have
multiple help files open in one window. Tabs allow you to switch between files. There are also tabs for “Also
See” and “Jump To.” So now, instead of scrolling to find the examples, you can jump right to them.
Alternatively, . findit search_term, makes Stata search the Web in addition to the official Stata files. This is
particularly useful if searching for a keyword, concept and/or a Stata command that is NOT part of official Stata
(there are many user-written Stata commands on the Web that can be downloaded into official Stata this way, by
following the appropriate links resulting from the search.)
INTERACTIVE MODE
In interactive mode, commands are typed “as you go” into the command window. The drawback is that when you
end your Stata session, all your work is gone… (EXCEPT if you created a do file or a log file.)
DESCRIPTIVE STATISTICS
The following are a few commands that will familiarize you with a data file. The next several examples use the
auto.dta dataset that comes installed with Stata. Type . sysuse auto to load it.
. describe
gives an overview of a file. The describe command gives a count of the total number of observations in the file,
and lists each variable by: name, type, display format, value label and label information. If the file has been
sorted, the sort variable(s) is/are identified.
. describe
. codebook varname
gives much the same information as summarize, detail, except it includes the number of missing values, which is
extremely valuable information. Observations with missing values for variables used in a regression equation will
not be included. This is important to know early in a project and could have a huge impact on your analyses. To
get this information for ALL the variables in the file, simply issue the codebook command without specifying the
variable(s).
. codebook mpg
mean: 21.2973
std. dev: 5.7855
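Because missing values matter so much for later regressions, it can help to count them directly. A sketch assuming the auto dataset, where rep78 is the variable with gaps; misstable is available in Stata 11 and later.

```stata
sysuse auto, clear
codebook rep78             // the report includes the number of missing values
count if missing(rep78)    // observations where rep78 is missing
misstable summarize        // overview of every variable that has missing values
```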
. summarize varname, detail
gives mean, standard deviation, variance, skewness, kurtosis, and percentiles for the specified “varname”
variable. Can be abbreviated as: . su, . sum, . summ, etc.
. summ mpg, detail
                 Mileage (mpg)
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of Wgt.          74
The above example calculates statistics based on the full sample. To limit the calculations to particular portions of
the data, foreign or domestic cars, for example, use the “if” qualifier.
. summarize mpg if foreign==0
. summarize mpg if foreign==1
Notice the use of two equality signs, rather than one. We are testing for equality, not assigning a value.
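As an alternative to issuing one command per group, the by prefix can produce all group summaries at once, and the stored results can be reused afterwards. A sketch assuming the auto dataset:

```stata
sysuse auto, clear
bysort foreign: summarize mpg      // one summary per value of foreign

summarize mpg if foreign == 1
display "mean mpg, foreign cars: " r(mean)   // summarize leaves its results in r()
```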
. tabulate varname
creates a frequency table of occurrences. This is useful for categorical variables or continuous variables with a
limited range of values. The by prefix is permitted. If the missing option is specified, missing values are included
in the frequency counts. To ask for more than one frequency table in a single command, use the . tab1 command:
. tab1 varone vartwo, etc.
. by foreign: tab mpg
-> foreign = Domestic

    Mileage
      (mpg)      Freq.     Percent        Cum.
         12          2        3.85        3.85
         14          5        9.62       13.46
         15          2        3.85       17.31
         16          4        7.69       25.00
         17          2        3.85       28.85
         18          7       13.46       42.31
         19          8       15.38       57.69
         20          3        5.77       63.46
         21          3        5.77       69.23
         22          5        9.62       78.85
         24          3        5.77       84.62
         25          1        1.92       86.54
         26          2        3.85       90.38
         28          2        3.85       94.23
         29          1        1.92       96.15
         30          1        1.92       98.08
         34          1        1.92      100.00
      Total         52      100.00

-> foreign = Foreign

    Mileage
      (mpg)      Freq.     Percent        Cum.
         14          1        4.55        4.55
         17          2        9.09       13.64
         18          2        9.09       22.73
         21          2        9.09       31.82
         23          3       13.64       45.45
         24          1        4.55       50.00
         25          4       18.18       68.18
         26          1        4.55       72.73
         28          1        4.55       77.27
         30          1        4.55       81.82
         31          1        4.55       86.36
         35          2        9.09       95.45
         41          1        4.55      100.00
      Total         22      100.00
. tabulate varone vartwo
creates a crosstab table. This is useful for categorical variables.
   Repair
   Record          Car type
     1978    Domestic    Foreign      Total
        1           2          0          2
        2           8          0          8
        3          27          3         30
        4           8          9         17
        5           2          9         11
    Total          47         21         68
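A table like the one above can be requested as follows. This is a sketch assuming the auto dataset, with rep78 (repair record) as the row variable and foreign (car type) as the column variable.

```stata
sysuse auto, clear
tabulate rep78 foreign            // two-way table of repair record by car type
tabulate rep78 foreign, missing   // also counts cars with no repair record
```

Comparing the two totals shows how many observations the default table leaves out because of missing values.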
. histogram varname
Histograms are useful for checking whether data are approximately normally distributed and for looking for outliers. By default, the
histogram is scaled to density units, i.e., the areas of the bars sum to 1. Alternatively, frequency, fraction or
percent scales can be specified. The discrete option tells Stata to give each value of varname its own bar. The by
option is permitted.
. histogram mpg, frequency discrete by(foreign, col(1))
[Figure: frequency histograms of Mileage (mpg), one panel for Domestic and one for Foreign cars. Graphs by Car type.]
. tsline varname
plots time series data. (Type . sysuse tsline2 to load the example dataset.) This command must be preceded by
the . tsset command, which declares the data to be time-series. (See “Time Series” in “Modifying and Combining
Files” for more information.) Like the histogram and the box plot graphs, the time series graph can be used to
look for anomalies and outliers. In the graph below, the spikes in calorie consumption at the end of the year are
probably holiday eating, but the data should be checked.
[Figure: time series line plot of Calories consumed.]
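The command sequence for such a plot might look like the following sketch. It assumes the tsline2 example dataset, in which day is the time variable and calories the series.

```stata
sysuse tsline2, clear    // example dataset with daily calorie consumption
tsset day                // declare day as the time variable
tsline calories          // line plot of calories over time
```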
Another way to check for outliers is to see if there are values that lie more than three standard deviations away
from the mean. This involves the use of the r(mean) and r(sd) scalars. To load the scalars with the mean and
standard deviation for the variable that is being checked, submit the . summarize varname, detail command, then
list the observations that exceed ± three standard deviations. Here are the commands for the calories consumed
datafile. The excessive calorie consumption did occur during the holidays and therefore is probably not a clerical
error.
. su calories, detail
. l if calories >=r(mean) + 3*r(sd) & calories < . | calories <=r(mean)-3*r(sd)
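One caveat: r(mean) and r(sd) are overwritten by the next command that stores results in r(). A slightly more defensive variant, sketched here with the tsline2 data and the hypothetical macro names m and sd, copies the values into local macros first.

```stata
sysuse tsline2, clear
summarize calories
local m  = r(mean)     // copy the results before another command replaces r()
local sd = r(sd)

* observations more than three standard deviations from the mean
list day calories if abs(calories - `m') > 3*`sd' & !missing(calories)
```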
. correlate - CHECKING FOR CORRELATION
Checking pairwise correlations gives an overview of the relationship between numeric variables in a study.
Looking at the data at this level can provide useful information for regression model specification and identifying
possible sources of multicollinearity in the independent variables.
. correlate varone vartwo
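For instance, with the auto dataset (an assumption of this sketch), the two correlation commands differ in how they treat missing values:

```stata
sysuse auto, clear

* correlate drops any observation with a missing value in a listed variable
correlate mpg price weight rep78

* pwcorr uses all available pairs; obs and sig add counts and p-values
pwcorr mpg price weight rep78, obs sig
```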
. scatterplot
Scatterplots, especially with a linear prediction line, are extremely useful for exploring the relationship between
variables. The format is tw scatter y x. See: http://www.ats.ucla.edu/stat/stata/faq/scatter.htm for more examples.
. tw (scatter mpg price) (lfit mpg price)
[Figure: scatterplot of mpg against price with a fitted line.]
. assert
is useful for checking data, particularly in do files, where many asserts can be listed back to back to check that
the data really are as expected. If assertions are true, Stata quietly goes through the do file executing the
commands. If an assertion is false, Stata aborts operation and gives an error message that includes the number of
instances where the assertion is false.
. assert gender==0 | gender== 1
2 contradictions in 365 observations
assertion is false
r(9);
. list if gender !=0 & gender !=1
. assert is also helpful after two files have been merged to see if there are non-matches. A merge value equal to 1
(_merge==1) means that the data came from the “master” file only; _merge==2 means the data came from the
“using” file only; _merge==3 means both files contributed data to the observation.
. assert _merge==3
More information about merging files can be found in the section on “Modifying & Combining Files.”
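A small self-contained sketch of the merge check, run as a do file. The two toy files and all names (id, x, y, the macros master and using_file) are invented for illustration; the merge 1:1 syntax requires Stata 11 or later.

```stata
* master file: three ids
clear
input id x
1 10
2 20
3 30
end
tempfile master
save `master'

* using file: only two of the ids
clear
input id y
1 100
2 200
end
tempfile using_file
save `using_file'

use `master', clear
merge 1:1 id using `using_file'
tabulate _merge                      // id 3 comes from the master file only
capture noisily assert _merge == 3   // reports the contradiction without stopping
display "assert return code: " _rc
```

Without the capture prefix, the failed assertion would stop the do file at that line, which is usually what you want in production code.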
MODIFYING & CREATING VARIABLES
MODIFYING VARIABLES
. destring
If a variable was mistakenly imported as a string variable when it should have been numeric, the . destring
command will convert it. (The Properties window gives information on how a variable is stored, and by default,
string variables are displayed as red in the Data Editor window.) Before converting the variable to a numeric
format, the input error that caused the variable to be stored as a string must be fixed. For example, if the following
set of numbers was imported as the variable bat_avg, the data will be stored as a string because of the comma in
the third observation.
.308
.272
,215
.299
Correct the error using either the Data Editor, or the . replace command.
. replace bat_avg = ".215" in 3
After the data entry error is fixed, the . destring command will convert the string to a numeric variable. The safest
approach is to create a new variable using the generate option, and to check the results. The original variable can
be dropped and the new one renamed, once the accuracy of the conversion has been verified.
. destring bat_avg, gen(bat_avg_num)
// Check your results, if correct, continue
. drop bat_avg
. rename bat_avg_num bat_avg
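The whole sequence can be tried on a toy dataset. This sketch recreates the four batting averages from the example above with . input and is meant to be run as a do file.

```stata
clear
input str4 bat_avg
".308"
".272"
",215"
".299"
end

replace bat_avg = ".215" in 3            // fix the data entry error first
destring bat_avg, generate(bat_avg_num)  // new numeric variable, original kept
list, clean                              // compare the two versions side by side

drop bat_avg
rename bat_avg_num bat_avg
```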
If there are many instances of a specific problem, such as the use of a thousands comma separator or “n/a” rather
than a “.” for missing values, then the ignore option can fix the problem. You must be extremely cautious when
using this option because it can have unforeseen, undesired consequences. Use of the generate option to create a
new variable is much safer than writing over the existing variable. Check your work, then drop the old varname
and rename the new variable to match the old name.
. destring varname, ignore("," "n/a") gen(varname_new)
When two periods “..” are used to represent missing values and the variable does contain decimal values, DO
NOT USE ignore("..") to fix the problem. The ignore option removes each listed character wherever it occurs, so
Stata will drop the decimal points in that variable and 30.5 becomes 305, even though you specified two
periods!!! Not Good !!!
Another way to fix the problem is to use the real() function. Unless you create a loop, new variables have to be
created one at a time, but it may be a safer method than the ignore option described above. The not_num
variable created in the command below retains the value of the varname variable if the datum is numeric and
assigns a missing value if it is not numeric. The edit command opens the Data Editor with just the observations
containing non-numeric values listed, ready for correction.
. gen not_num = real(varname)
. edit if not_num == .
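The whole repair sequence can be tried on a small made-up dataset. The sketch below is only an illustration: the four values are the ones shown above, and check_num is a hypothetical helper name. It flags the non-numeric entry with real(), fixes it, and then converts with destring into a new variable.

```stata
* Toy data: bat_avg is a string because of the comma in observation 3
clear
input str6 bat_avg
".308"
".272"
",215"
".299"
end

* real() returns missing when a string cannot be read as a number
generate double check_num = real(bat_avg)
list if missing(check_num)          // shows only the problem observation

* Fix the typo, then convert into a NEW variable and verify
replace bat_avg = ".215" in 3
destring bat_avg, generate(bat_avg_num)
assert !missing(bat_avg_num)        // stops with an error if any value failed to convert
```

Only after the assert passes would the original variable be dropped and the new one renamed.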
. tostring
A string variable can consist solely of numbers, but mathematical operations cannot be performed with it.
Therefore it can be a good idea to format numbers for which a mathematical operation is inappropriate, such as
identification numbers, as string variables. See Stata’s help file for more information on the following two
approaches:
. tostring or
. generate as_str
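As a rough sketch of the two approaches, assuming a hypothetical numeric identifier called id, the commands might look like this. The "%12.0f" display format is an assumption chosen so that long identification numbers are not written in scientific notation.

```stata
* Toy data: a numeric identification number
clear
input long id
10234
10235
end

tostring id, generate(id_str)            // id_str is a string copy of id
generate as_str = string(id, "%12.0f")   // same idea using the string() function
describe id id_str as_str                // storage types show which are strings
```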
. encode
If you have a string variable, such as country, where the values are “USA” and “Canada” and “Mexico” and you
want to assign numeric values to them (so that you can create dummies or . tsset or . xtset the data to declare it as
time-series or panel data) the encode command will assign numeric values and create value labels using the
values of the string variable.
. encode country, generate(cntryid)
If you list the data or look at it in the Data Editor, it will appear that the original variable and the newly generated
variable are the same. But the new variable, cntryid in this example, only displays the label; the underlying
value is numeric. Use the nolabel option to see the underlying values.
. list country cntryid, nolabel
. rename
Click on a variable in the Variables window and its properties are displayed in the Properties window. If the
Properties window is unlocked, the variable name can be edited. The command can also be typed directly in the
Command window or included in a do file.
. rename sex gender
Groups of variables can be renamed in one command, and the wildcards “*”, “?” and “#” can be used. The “*”
matches zero or more characters; “?” matches exactly one character, though multiple question marks can be used;
the “#” acts like the asterisk but matches one or more digits. The example below renames variables gdp90 through
gdp99 to gdp1990 through gdp1999.
. rename gdp9# gdp199#
. replace
This command is used to change the contents of a variable when the variable already exists. Because it is so
powerful, Stata makes you type out the entire command; there are no abbreviations. It is often used with the if
qualifier. The first example corrects typos in a string variable. The second changes missing values that were coded
as -999 to Stata’s “.” missing value.
. replace gender = "male" if gender == "mael"
. replace price = . if price == -999
Note, the . replace command does NOT ignore missing values. Careful consideration must be paid to missing
values in a datafile. By default, Stata uses a “.” for numeric missing values. Internally, “.” is stored as a very,
very large number. Most commands, like . regress, drop observations with missing values, which can cause
problems if too many observations are dropped from the analysis. Other commands, like . replace, do NOT
ignore them. Because a missing value is treated as larger than any number, the command . replace group=3 if group>=3
will also set every missing value of group to 3. To exclude missing values from the replacement, the command must be written as:
. replace group=3 if group>=3 & group < .
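The difference between the two versions of the command can be seen on a tiny invented dataset. In this sketch group_bad and group_ok are hypothetical copies of group, and !missing() is used as an equivalent way of writing the "< ." condition.

```stata
* Toy data: the fourth observation is missing
clear
input group
1
3
5
.
end

generate group_bad = group
replace group_bad = 3 if group_bad >= 3                     // also overwrites the missing value

generate group_ok = group
replace group_ok = 3 if group_ok >= 3 & !missing(group_ok)  // missing value is left alone

list
assert group_bad == 3 in 4
assert missing(group_ok) in 4
```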
The . egen rowmiss() function (called rmiss() in older versions of Stata) can be used to create a new variable that
stores a count of the number of missing numeric values in each observation. A user-written extension to the egen
command, rmiss2(), counts the number of missing values in both numeric and string variables. It can be downloaded
by typing the command: . findit rmiss2.
The tabmiss program creates a frequency table of the number of missing values by variable. The program
works for both numeric and string variables. It can be downloaded by typing: . findit tabmiss.
See: http://www.ats.ucla.edu/stat/Stata/faq/nummiss_stata.htm for more information about tabmiss and rmiss2().
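For numeric variables, a similar check can be made with built-in commands only. The sketch below uses made-up variables x1 to x3; nmiss is a hypothetical name. It counts missing values per observation with rowmiss() and then per variable with misstable, which ships with Stata 11 and later.

```stata
* Toy data with some missing values
clear
input x1 x2 x3
1 2 3
. 5 6
. . 9
end

egen nmiss = rowmiss(x1 x2 x3)   // number of missing values in each observation
list
misstable summarize x1 x2 x3     // missing-value counts by variable
```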
. recode
This is a useful command to collapse a continuous variable into categorical groups. While the . replace command
can be used, it is quicker with . recode. To be safe, generate a new variable with the same values as the variable to
be recoded, then work with the NEW variable, just in case a mistake is made. In the following commands,
categories for the continuous variable age are being created:
. gen AgeGrps=age
. recode AgeGrps (min/18=1) (19/21=2) (22/29=3) (30/49=4) (50/66=5) (67/max=6)
The recode command is also used to swap the values of a categorical variable. This is often done to match the
coding of similar variables so an index can be created. For example, if two questions have a value of “1” for
“Very Dissatisfied” and a value of “5” for “Very Satisfied,” but one question, quest3, has the coding reversed,
recode can be used to change the values in quest3 to match those in questions 1 and 2. In the example code, the
value of “3” remains unchanged, so it does not need to be specified. Again, for safety’s sake, work is being done
with a copy of the variable.
. gen quest3_recode=quest3
. recode quest3_recode (1=5) (2=4) (4=2) (5=1)
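An alternative to copying the variable first is the generate() option of recode, which writes the recoded values to a new variable in one step. This is a sketch on invented ages, not a replacement for the approach above. Note that a value falling between two ranges (18.5, for example) would be left unchanged, so the cross-tabulation at the end is worth running.

```stata
* Toy data: four ages
clear
input age
17
20
35
70
end

recode age (min/18=1) (19/21=2) (22/29=3) (30/49=4) (50/66=5) (67/max=6), generate(AgeGrps)
tabulate age AgeGrps, missing    // cross-check each age against its group
```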
. generate
To create a new variable, use the . generate (abbreviated as . gen) or . egen command. Spacing is not important;
operators can have spaces before and/or after or none at all. Constants and variables can both be used to create
new variables. For a complete list of operators, type: . help operators.
Addition/Subtraction
. gen sum = varone + vartwo
. gen sum2 = varone + 10 // constants can also be used to create variables
. gen net_pay = gross - deductions
Division
. gen mpg = mileage/gallons // mpg is equal to mileage divided by gallons
Multiplication (Interaction)
. gen total = price * quantity // total is equal to price times quantity
Exponentiation
. gen x_sq = x^2 OR . gen x_sq = x*x
. gen x_cubed = x^3 OR . gen x_cubed=x*x*x
Lag
Useful with time-series data.
. gen price_lag1 = price[_n-1] // price_lag1 equals the price in the previous observation
. gen price_lag2 = price[_n-2] // lag two observations, and so on
_n is a system variable that holds the number of the current observation, so price[_n-1] refers to the value in the
previous observation. For a complete list of system variables, type: . help _variables in the Command window.
If the dataset is panel data, then a by prefix will be needed. This is so the first observation in one group does not
use a previous group’s value when creating the lag.
. bysort country: gen price_lag1=price[_n-1]
Note: If the dataset has been declared to be time-series with the . tsset command or panel data with the . xtset
command, then the L. operator (written as L.varname) can be used in a regression equation. This method works without
creating new variables. See “Time Series” in “Modifying and Combining Files” for more information.
Log:
Often used when the relationship between the dependent variable and the independent variable is not linear.
The ln function returns the natural log.
. gen log_x = ln(x)
. egen
The . egen command provides extensions to the generate command and offers very powerful capabilities for
creating new variables with summary statistics. The . egen command works by observation (row) or by variable
(column). In the next two commands the new variables will have values based upon other variables in the same
row/observation.
. egen AvgScore=rowmean(test1 test2 test3 test4)
. egen answered=rownonmiss(question1-question25)
The following code works across observations. It generates a unique id number for each country in a panel
dataset.
. sort country
. egen cntryid=group(country) // This assigns the values of 1, 2, 3, etc to the various countries.
In the next command, a summary statistic for a column of data will be added to each observation. The same value
will be added, unless subsets are created using the by option.
. egen TotalPrice = total(price)
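To illustrate the subsets mentioned above, the following sketch uses the auto dataset that ships with Stata. GroupTotal and GroupMean are hypothetical variable names. It shows both the by prefix and the by() option.

```stata
sysuse auto, clear
egen TotalPrice = total(price)                    // same value in every observation
bysort foreign: egen GroupTotal = total(price)    // one total per value of foreign
egen GroupMean = mean(price), by(foreign)         // by() option form of the same idea
tabulate foreign, summarize(GroupTotal)           // confirm one value per group
```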
DUMMY/INDICATOR VARIABLES
There are a number of different methods that can be used to create dummy (indicator) variables which take on a
value of 0 if the condition evaluates to false, and 1 if it is true.
The easiest way ONLY works if there are no coding errors, so it is important to check first using either the
. tab varname, missing (tabulate with the missing option specified) or . assert commands.
Second Method
This way is more tedious but it is also more foolproof than the first approach. It’s foolproof because it explicitly
takes into account the EXACT definition of the underlying variable from which the dummy variables are
created. The dummy variables are created one at a time. Assuming that the coding of “educat” follows the first
convention explained in the previous method, issue the following commands to create the dummy for “Noedu”:
generate Noedu = .
replace Noedu = 0 if educat >=0 & educat <= 4
replace Noedu = 1 if educat==0
The dummies for “Pri” through “Voc” are created using the same approach.
The first command creates the new variable, Noedu, initially with “missing” values.
The second command “fills up” the missing values with zeros – note: the if qualifier explicitly takes the valid
range of the “educat” variable into account.
Finally, the third command replaces the zeros with ones when the if qualifier evaluates to 1 (true.) Otherwise, the
zeros are left intact.
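A quick way to see why this approach is safe is to run it on invented data that contains a coding error and a missing value. The sketch assumes, as the commands above do, that valid educat codes run from 0 to 4; the value 9 is a deliberate miscoding.

```stata
* Toy data: 9 is a coding error, "." is missing
clear
input educat
0
1
4
9
.
end

generate Noedu = .
replace Noedu = 0 if educat >= 0 & educat <= 4
replace Noedu = 1 if educat == 0

tabulate educat Noedu, missing
* Miscoded and missing observations stay missing instead of becoming 0
assert missing(Noedu) if educat == 9 | missing(educat)
```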
Third Method
The third method uses Stata operators to make virtual variables. The virtual dummies are automatically generated
and their coefficients are reported in the regression output, but no new variables are actually created. Stata
determines the group(s) to drop. You may see examples that use these operators, so it may be important to know
what they do.
Operators include:
i. unary operator for indicator/categorical dummies
c. unary operator for continuous variables
# binary operator for creating an interaction
## binary operator for factorial interactions
Get more information and see examples of usage at: . help fvvarlist.
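A few examples of how these operators might appear in practice, using the auto dataset that ships with Stata. The model specifications are arbitrary and only meant to show the syntax.

```stata
sysuse auto, clear
regress price i.rep78 c.weight     // dummies for rep78; Stata omits a base level
regress price c.mpg##i.foreign     // main effects plus the mpg-by-foreign interaction
regress price ib3.rep78 weight     // ib3. makes level 3 of rep78 the base group
```

No dummy variables are added to the dataset by any of these commands.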
DATES
Any model that involves time-series data or date manipulation, such as generating an elapsed time variable, needs
a date variable(s) in Stata date format. The base date used by Stata is January 1, 1960, which is assigned a value
of 0. Dates prior to 1/1/1960 have negative values while dates after 1/1/1960 have positive values. Two functions
are provided that will do the conversions; which one to use depends upon how the dates in the file are formatted.
If there are separate variables for month, day and year, then use the mdy() function:
. gen birthday=mdy(month, day, year)
If the date is a string variable, such as: Jan 1, 2008, then use the date function:
. gen birthday=date(datevar, "MDY")
Once created, the date variable can be formatted to look more “intelligible” to humans. The format command
causes a Stata date such as 17532 (17532 being the number of days since Jan 1, 1960) to be displayed as 01jan2008.
. format birthday %td
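Both conversions, and an elapsed-time calculation, can be tried on a small invented dataset. In this sketch the variable names d_from_string, d_from_parts and days_elapsed are hypothetical, and January 1, 2010 is an arbitrary reference date.

```stata
* Toy data: the same dates stored as a string and as separate parts
clear
input str12 datevar month day year
"Jan 1, 2008" 1 1 2008
"Mar 15, 1995" 3 15 1995
end

generate d_from_string = date(datevar, "MDY")
generate d_from_parts  = mdy(month, day, year)
format d_from_string d_from_parts %td
assert d_from_string == d_from_parts          // both routes give the same Stata date

* Elapsed time is simple subtraction once dates are in Stata format
generate days_elapsed = mdy(1, 1, 2010) - d_from_string
list
```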
Handling dates can be a bit tricky. Two sites that can be helpful are: Princeton’s
(http://dss.princeton.edu/online_help/stats_packages/stata/time_series_data.htm) and UCLA’s
(http://www.ats.ucla.edu/stat/stata/modules/dates.htm).
New in Stata 12 are business calendars which are user definable. The user can specify when the business is
closed. This has great potential for market analysis type work. For more information about this subject, see
Stata’s . help date.
MODIFYING AND COMBINING FILES
MODIFYING A FILE
. sort
By default, . sort arranges observations in ascending order, from smallest to biggest or from A to Z, based upon the
variables specified in the varlist. . gsort sorts in descending or ascending order. The stable option of . sort ensures
that observations with equal sort values stay in the same relative order that they held before the sort. Without this
option, the order of observations that have equal sort values is randomized.
/* Sorts observations in ascending order of mpg. Relative order of observations prior to the sort will be
maintained for observations with equal mpg values. */
. sort mpg, stable
//
. gsort +mpg // also sorts mpg in ascending order; note that gsort does not accept the stable option.
//
/* Sorts observations in descending order based on mpg. Order of observations with equivalent mpg
values is randomized. */
. gsort -mpg
//
/* Sorts the variable make in ascending order (from A to Z) and then within each make, mpg is sorted in
descending order. */
. gsort make -mpg
To reshape the data, the identifying variable (i) and the stub name must be determined. In this case, country is the
i variable, and gdp is the “stub name” (gdp1980, gdp1981, etc.) The j name is the name of the new variable. The
command becomes:
. reshape long gdp, i(country) j(year)
and it issues this result:
(note: j = 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
> 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008)
The number of observations went from 172 to 4988 and there are now 3 variables as opposed to the 30 variables
there were when the data were “wide.” Open the Data Editor to see the results. Here’s an image of a few of the
observations:
Here’s the command that would change the data back from “long” to “wide”:
. reshape wide gdp, i(country) j(year)
and the result:
It is possible to have multiple identifying variables, e.g. state and city, and the j variable can be a string. If Stata
detects any mistakes in the data that affect its ability to execute the reshape command, an error message instructs
you to type “reshape error” to get the details.
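The round trip from wide to long and back can be tested on a three-country toy dataset before it is applied to a real file. The gdp figures below are invented for illustration only.

```stata
* Toy wide dataset: one row per country, one gdp column per year
clear
input str6 country gdp1980 gdp1981 gdp1982
"Canada" 10 11 12
"Mexico" 20 21 22
"USA"    30 31 32
end

reshape long gdp, i(country) j(year)   // 3 rows become 9; year is created from the suffixes
list, sepby(country)
assert _N == 9

reshape wide gdp, i(country) j(year)   // back to one row per country
assert _N == 3
```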
COMBINING FILES
There are three ways to combine files. Additional observations from one datafile can be added to the end of
another with the . append command. Additional variables contained in one file can be added to corresponding
observations in another with the . merge command. The . joinby command forms all pairwise combinations
within groups. It keeps only the observations that are found in both files. All three methods use a similar
approach. The datafile in memory (the one that is currently open) is referred to as the “master” file. The file that is
to be joined with the “master” is known as the “using” datafile. Both files must be Stata files.
. append
Append adds observations from the using file to the end of the master file. The files are stacked vertically.
. append using AddObsFile // the master file is already in memory, so only the using file is named
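A self-contained sketch of an append, using invented scores and a temporary file in place of AddObsFile. The generate() option, available in Stata 11 and later, is an optional extra that records where each observation came from; source is a hypothetical variable name. Because it relies on a local macro, the block should be run as a whole from a do file.

```stata
* Build and save the "using" file (a temporary stand-in for AddObsFile)
clear
input id score
3 70
4 65
end
tempfile AddObs
save `AddObs'

* Build the master file in memory, then stack the using file underneath it
clear
input id score
1 90
2 85
end
append using `AddObs', generate(source)   // source = 0 for master rows, 1 for appended rows
list
assert _N == 4
```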
. merge
Merge adds additional variables to observations. If the two datafiles are in EXACTLY the same order, a
matching variable contained in both files is not needed (. merge 1:1 _n matches by observation number). But usually,
both files have a matching variable (or variables) that is used to associate an observation from the master file with
an observation in the using file. With the . merge 1:1, 1:m and m:1 syntax introduced in Stata 11, Stata sorts the files
by the matching variable(s) itself; only the older merge syntax required both files to be SORTED beforehand.
Merge creates a system variable named _merge that is added to the bottom of the list in the variable window. The
_merge variable has five possible values, but in most cases, unless the using file is being used to update the
master file, only the first three are of interest:
_merge==1 observation found in master file only
_merge==2 observation found in using file only
_merge==3 observation built from both master & using files. Normally, this is the desired value.
Before another merge can be executed the system generated _merge variable must be dropped or renamed.
Otherwise, Stata will not allow you to do another merge because the _merge variable already exists.
There are a few different types of merge. In a one to one match (. merge 1:1) each observation in the master file
has a corresponding observation in the using file. In a one to many merge (. merge 1:m) the using file has
multiple observations per each unique key variable in the master file. There are also many to 1 (m:1) and many to
many (m:m) merges.
Options available with the merge command are update and replace. Update replaces missing values in the master
file with values from the using file. Replace, which is used in conjunction with update, replaces missing and non-
missing values in the master file with values found in the using file.
Here is some example code for a 1:m merge:
* Load, sort and save master file
use patient.dta, clear
sort id //Note, each patient has a unique id, there are no duplicates.
save patient.dta, replace // replace is required because patient.dta already exists
//
/* Load, sort and save “using” file
Note: patients may have made many visits to the doctor’s office, so there might be multiple observations
for each id. */
use visits, clear
sort id, stable // stable option maintains the relative order of observations
save visits, replace
//
* Reload master file then merge with using file by id
use patient.dta, clear
merge 1:m id using visits // joins observations in the patient file with obs in the visit file by id.
// Stata will give you a frequency table for the system generated _merge variable.
drop _merge // If the merge worked as planned, the variable could either be dropped or renamed.
save AllInfo, replace
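The same 1:m merge can be rehearsed on invented data to see what each _merge value looks like. In this sketch the patient names and visit dates are made up, and a temporary file stands in for visits.dta. Patient 3 has no visits and so ends up with _merge==1.

```stata
* "Using" file: several visits per patient id
clear
input id visitdate
1 100
1 130
2 110
end
tempfile visits
save `visits'

* Master file: one row per patient
clear
input id str5 name
1 "Ann"
2 "Bob"
3 "Cy"
end

merge 1:m id using `visits'
tabulate _merge          // 1 = master only, 3 = matched
list, sepby(id)
drop _merge              // required before any further merge
```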
To declare a file with yearly data to be a time-series dataset and then to plot the data, issue these commands:
. sort year // Data must be sorted before the . tsset command can be used.
. tsset year, yearly // Declares data to be time series.
. tsline gdp // Plots the time series data. See “Descriptive Statistics.”
Stata’s response would be:
. tsset year
time variable: year, 1980 to 2008
delta: 1 year
To . xtset a panel data file, that contains gdp values for multiple countries where each country is identified by an
identification number, the commands would be:
. sort cntryid year
. xtset cntryid year, yearly
. xtline gdp // Plots a time series graph for each country in the dataset.
Note, the panel variable comes before the time variable and it must be numeric! So the above commands would
not work if the country variable is a string variable. First, a numeric id number for each country must be
generated.
This code will generate an id number for each country and then declare that the file contains panel data.
. sort country year
. egen cntryid=group(country) // This assigns the values of 1, 2, 3, etc to the various countries.
/* or
. encode country, generate(cntryid) Use this OR the egen method-not both!!! */
. xtset cntryid year, yearly
Here are the results:
Frequently, in time series analysis, it is desirable to “lag,” “lead,” or compute the difference between the value of
a variable and adjacent observations. Since the data have been declared to be a time series or panel data, the time
series operators (L., F., and D.) can be used. There are several advantages to this approach, the most important
being that it is less error prone than other techniques. The operators are aware of changes in panel. So, for
example, L. will not lag to the previous observation if the country has switched from Mexico to Venezuela.
Another advantage is that this method uses temporary variables, so new variables do not have to be created.
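The behavior at panel boundaries can be checked on a two-country toy panel. The gdp values below are invented, and the new variables are created only so the results can be listed; inside a regression the operators could be used directly (for example, regress gdp L.gdp).

```stata
* Toy panel: two countries, three years each
clear
input cntryid year gdp
1 2000 100
1 2001 110
1 2002 121
2 2000 50
2 2001 55
2 2002 66
end

xtset cntryid year, yearly
generate gdp_lag1  = L.gdp    // value one year earlier, within the same country
generate gdp_lead1 = F.gdp    // value one year later
generate gdp_diff  = D.gdp    // gdp minus its own lag

list, sepby(cntryid)
* The first year of each country has no lag; it never borrows from the previous panel
assert missing(gdp_lag1) if year == 2000
```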
REGRESSION ANALYSIS
Ensuring Consistency Among the Number of Observations in the various estimation samples
In descriptive analyses, the number of observations often differs across the variables. Similarly, the number of
observations will likely differ for regressions using various specifications of explanatory variables. This is due to
some observations having missing values for some variables but not for others, thus creating “holes” in the
number of observations.
You want:
1) The number of observations for each variable within the descriptive analysis to match up.
2) The number of observations used in the different regression specifications to match up.
3) The number of observations between (1) and (2) to match up.
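One common way to meet these three goals, offered here as a sketch rather than as the only method, is to estimate the most complete specification first and flag the observations it used with e(sample). The example uses the auto dataset that ships with Stata, where rep78 has some missing values; insample is a hypothetical variable name.

```stata
sysuse auto, clear

* Estimate the largest model first; it determines the common sample
regress price mpg weight rep78
generate byte insample = e(sample)     // 1 if the observation was used, 0 otherwise

* Descriptive statistics and smaller models restricted to the same observations
summarize price mpg weight rep78 if insample
regress price mpg weight if insample
```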
. regress Y X1 X2 X3
. estat ovtest // estat => postEstimation Statistics ovtest => omitted variable test
. regress Y X1 X2 X3
. estat ic // estat => postEstimation Statistics ic => information criterion
Look for patterns in the graph; there shouldn’t be any if the residuals are homoskedastic. If you see a trumpet
shaped pattern, try to determine which of the predictor variable(s) is the cause. One way of doing this is the
residual-versus-predictor plot, which plots the residuals against a predictor variable of your choosing.
. rvpplot X3 // Plots residuals vs. a predictor variable (X3) in the model
Another, perhaps more flexible, method to check the residuals against predictor variables is to calculate the
residuals and save the results as a variable called “Residuals” (or any other name you like). The Residuals
variable can be used in subsequent tests.
. predict Residuals, resid
Next create a scatter plot to check for the possibility of “trumpet-shape” residuals. Here are the steps to plot the
residuals against a suspected proportionality factor. First run the regression of interest. Then estimate the residuals
and save as “Residuals”. Finally, create the scatterplot.
. regress Y X1 X2 X3
. predict Residuals, resid
. twoway scatter Residuals X3 // Residuals on the vertical axis; “X3” is the suspected proportionality factor
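The graphical checks can be paired with a formal test. The sketch below uses the auto dataset that ships with Stata and an arbitrary model; estat hettest runs the Breusch-Pagan / Cook-Weisberg test, which is one of several possible tests and is an addition to the plots described above.

```stata
sysuse auto, clear
regress price mpg weight
rvfplot, yline(0)           // residuals versus fitted values
rvpplot weight, yline(0)    // residuals versus one predictor
estat hettest               // formal test; a small p-value suggests heteroskedasticity
```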
HYPOTHESIS TESTING (Studenmund p.559)
F-tests
The t-test is a partial test, i.e., it tests for the statistical significance of each individual regressor/explanatory
variable in turn. If you want to perform tests that involve a group of variables, you need something else; one
possibility is the F-test.
There are two “flavors” to consider:
MAKING “NICE” TABLES OF REGRESSION RESULTS
The default in Stata is to report results in a “wide” format, i.e., the estimated parameters in the first column, the
estimated standard errors in the second column, etc. There are two different approaches to creating “nice” tables –
similar to how results are reported in journal articles, i.e., with the standard errors below the parameter estimates,
along with fit measures (R2, adjusted R2, Akaike Criterion, etc.,) and the number of observations, etc., at the
bottom.
The first gives you the brackets around the standard errors, as in most journal articles. The second is more
comprehensive, allowing you to include t-statistics, p-values, several different fit measures (rather than merely
adj-R2), etc., but you have to make the brackets around the standard errors yourself. The two methods work as
follows:
. outreg2
1) Install the command by typing:
. ssc install outreg2, all replace
2) Run the first regression that you want included in the table.
. regress Y X1 X2 X3
3) Next run the . outreg2 command replacing “filename” with the name you want to give your file.
The file will be saved to your current working directory, or you can specify a different location by
including the entire path.
. outreg2 using filename, word bdec(3) bracket addstat(Adj R-Squared, `e(r2_a)') replace
The word option saves your file in rich text format (rtf) which can be opened directly in MS Word. If
you specify excel instead of word, the file will be saved as an xml file which can be directly opened
in Excel. The replace option causes the program to overwrite any existing file with the same name.
4) Run the additional regressions that you want to include in the table, appending them to the previous
results:
. regress Y X1 X2 if X3 == 1 // X3 is constant within each subsample, so it is left out of the regressor list
. outreg2 using filename, word bdec(3) bracket addstat(Adj R-Squared, `e(r2_a)') append
. regress Y X1 X2 if X3 == 0
. outreg2 using filename, word bdec(3) bracket addstat(Adj R-Squared, `e(r2_a)') append
5) Stata responds each time outreg2 is run with your filename in quotes in blue text. Click on it, and if
you have included the word option, your file will automatically open in a copy of MS Word.
2) Create the table, by calling the model results, that you saved previously, and add the statistics, etc,
you want to include (here, we ask for coefficients, standard errors, t-statistics, p-values + R2,
adjusted-R2, the Akaike Information Criterion, the Bayesian Information Criterion, and the number
of observations) :
. estimates table model1 model2 model3, stats(r2 r2_a aic bic N) b(%7.3g) se(%6.3g) t(%6.3g) p(%4.3f)
NOTE: if there are parts of the previous table, you don’t want to include, just modify the previous
Stata command accordingly. Say, you are not really interested in the t-statistics or p-values but only
want to include the coefficients and their standard errors (+ the fit measures and N from before):
. estimates table model1 model2 model3, stats(r2 r2_a aic bic N) b(%7.3g) se(%6.3g)
Note, if you want “stars” to indicate the level of statistical significance, you CANNOT combine this
with the “se”, “t”, and/or “p” options from above. An example of getting “stars” in the fashion used
in economic journals, etc, similar to the previous, is:
. estimates table model1 model2 model3, stats(r2 r2_a aic bic N) b(%7.3g) star(0.1 0.05 0.01)
3) Highlight the table in the Results window with the cursor (i.e., from the top edge of the table down to and
including the bottom line of the table; do NOT highlight the legend, because that will mess up the
formatting in Excel). Right-click on it and choose “Copy Table” (NOT “Copy Text”), paste into Excel
to create the table, then copy and paste into Word.
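The table commands above assume that results have already been saved under the names model1, model2 and model3. A minimal sketch of the full sequence, using the auto dataset that ships with Stata and arbitrary specifications, might look like this; estimates store is the standard command for saving results under a name.

```stata
sysuse auto, clear

regress price mpg
estimates store model1        // save the results under the name model1

regress price mpg weight
estimates store model2

regress price mpg weight foreign
estimates store model3

* Coefficients with standard errors, plus fit measures and N
estimates table model1 model2 model3, stats(r2 r2_a aic bic N) b(%7.3g) se(%6.3g)
```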
OTHER RECENTLY USED REGRESSION MODELS
Multinomial logit
Relevant whenever the dependent variable is a qualitative variable with more than two outcomes that cannot be
ordered/ranked (if it could be ordered/ranked, we would use an ordered probit – and if it had only two outcomes,
we would use instead the (simple) probit and/or logit model, discussed previously.) Examples include
transportation choice (car, bus, train, bike, etc), health provider (doctor, healer, etc).
NOTE: Consult an econometrics textbook for details.
In a do file, type:
mlogit Y X1 X2 X3 X4 X5, basecategory(#)
mfx, predict(outcome(#))
mfx, predict(outcome(#))
mfx, predict(outcome(#))
mfx, predict(outcome(#))
where Y is the dependent variable and the Xs are the explanatory variables.
The "#" after “basecategory” sets the category you want to be the reference category (all results are relative to this
category). Which one you choose matters mostly for the interpretation, although it is better to have a reference
group with a relatively high number of observations. Because the coefficients are relative to the base category, a
larger reference group yields relatively more precise estimates of the coefficients.
Again we have a problem with the estimated coefficients not being interpretable as marginal effects (since the
multinomial logit model is non-linear in the estimated parameters – as were also the (simple) logit and probit
models from before) – and again the “mfx” command calculates the marginal effects (again, here these are to be
interpreted as the marginal probability, ceteris paribus, for the outcome in question – since the dependent variable
is qualitative). Again, we will typically set all the other explanatory variables at their mean value (which is also
the default for the “mfx” command).
Aside: because the predicted probabilities add up to one, the marginal effects across outcomes sum to zero, and,
unlike what was the case for the coefficients, we can calculate marginal effects for ALL outcomes. The "#" after
“outcome” refers to the particular outcome of the dependent variable for which you want to calculate the marginal effects.
If you want to know more about the command, type . help mlogit in Stata.
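In Stata 11 and later, the . margins command has superseded . mfx, and baseoutcome() is the documented name of the option that sets the reference category. The sketch below shows what the equivalent commands might look like. It uses the auto dataset that ships with Stata, with rep78 restricted to three categories purely so there is a small multi-outcome example; the model itself is not meaningful.

```stata
sysuse auto, clear

* Three outcomes (rep78 = 3, 4, 5); outcome 3 is the reference category
mlogit rep78 mpg price if rep78 >= 3, baseoutcome(3)

* Marginal effects on the probability of each outcome, other variables at their means
margins, dydx(*) atmeans predict(outcome(3))
margins, dydx(*) atmeans predict(outcome(4))
margins, dydx(*) atmeans predict(outcome(5))
```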
APPENDIX A
DO FILE EXAMPLE
* Note: actual commands are in bold face type and all 3 types of comment markers are used as examples.
/* Three good commands to include no matter what the rest of the program is doing. */
version 12.0
/* Set the version of Stata (you can see the current version number by typing “version” in Stata’s command
window.) (NOTE: Sometimes user-written commands may work only under an older
version of Stata, say, version 9 – in that case, one would type “version 9.0”, instead.)
*/
capture log close // Close any open log-files.
clear // Clear memory allocated to Stata in order to start with a clean slate.
/*
The next two commands may be necessary when working with large datasets and an older version of Stata.
The amount of memory that can be assigned depends upon the memory of the computer being used and the
number of other programs that are running.
*/
set memory 300m // increase memory assigned to Stata.
set matsize 200 // increase the maximum matrix size (the number of variables allowed in an estimation)
/*
Open a log file to document the session. Specify the relevant path rather than the “X”s, “Y”s, and “Z”s. The
replace option permits overwriting of a file with the same name.
*/
log using "C:\XXXX\YYYYY\ZZZZZ.log", replace
/*
Load the Stata file by specifying the complete path rather than the “X”s, “Y”s, and “Z”s.
There are many methods for creating a Stata datafile: copying and pasting, data entry in the Data Editor, and the .
insheet command are a few examples.
*/
use "C:\XXXX\YYYYY\ZZZZZ.dta", clear
/*
Get descriptive statistics, run a regression, save residuals and plot residuals to look for heteroskedasticity.
*/
summarize // Get basic descriptive statistics for all variables. Specify specific variables, if desired.
histogram X1 // Check variable X1 for outliers.
graph box X2 // Create a box plot for variable X2 to check for outliers.
twoway scatter X1 X2 // Create scatter plot with X1 on Y axis and X2 on X axis
correlate X1 X2 // Estimate the (simple pairwise) correlation between X1 and X2
regress Y X1 X2 X3 // Estimate an OLS regression of Y on X1, X2 and X3
predict Residuals, resid // Predict residuals and save as a new variable called, “Residuals”
twoway scatter Residuals X3 /* Plot residuals against the suspected proportionality factor, X3,
(to detect heteroskedasticity)
*/
/*
Wrap up work: save file with new name, close log file and exit do file.
*/
save "C:\XXXX\YYYYY\newfile.dta" // Save file with Residuals variable under new name, if desired.
capture log close // Close log file.
exit // Stop the do file and return control to Stata; not necessary, but good practice.
APPENDIX B
APPENDIX C
Step 1:
If there is a Stata System File option, use that column to select the desired file(s). These are Stata *.dta
files.
If there isn’t a Stata system file, select the ASCII Data File + Stata Setup Files option if one exists.
If neither a *.dta nor a Stata setup file is available, ask Carol Karsch, the Data & Statistical Support
Specialist at Leyburn Library. She can work with both SAS and SPSS and transform the data to Stata.
Step 2: Login or register, if asked. (It’s free. The University Library maintains a subscription.)
Step 3: Select “Save File” option
You’ll be sent a “zipped” folder that needs to be decompressed before you can work with it. Usually, in addition to
the data and setup files, there is a “Codebook”, a “descriptioncitation” file which gives a brief synopsis of the
study’s methodology, a manifest, which gives technical details, and a related_literature document.
The codebook should be carefully read before work with the data begins. A codebook describes the study. It
gives details about the data that are extremely useful. If the variables in the dataset are not labeled, the codebook
is indispensable. The codebook describes the variables, their locations in the file, what the values mean and
any other information about the variables that is necessary to use them correctly. It helps you verify that the data
will be useful for the purpose you intend. It may not! The data might not have the level of detail required. If the
analysis is a time series, the questions important to the research may not have been asked in the years under
consideration. Groups important to the analysis may be dropped because of privacy issues. A careful reading of
the codebook(s) can prevent a lot of wasted effort and much frustration.
This ICPSR webpage gives thorough instructions for using Stata setup files:
https://www.icpsr.umich.edu/cocoon/ICPSR/FAQ/0127.xml
Basically, the *.do file must be edited with a text editor, e.g., Notepad. The paths and filenames of the ASCII data
file (*.txt), the Stata dictionary file (*.dct) and the output file, i.e., the Stata dataset, have to be specified with
COMPLETE paths, including extensions. If there is an embedded space in the filename, it must be enclosed in
double quotes. Here is an example for ICPSR Study # 07644.
[Screenshot of the setup do file. Callout: This is the section that needs to be edited. The program will use the filenames supplied in the sections below.]
The code to replace missing values is usually commented out, so it isn’t executed. To change missing values
to a “.” which Stata recognizes as a missing value, delete the comment markers ( /* and */) at the beginning
and end of the section.
[Screenshot callout: Deleting these comment markers will change all missing values to “.”]
Finally, it is good practice to type the . exit command at the bottom of the file.
Save and close your edited version of the *.do file. Opening the file will launch Stata, run the do file, and if all
goes well, the Stata dataset will be created!