
Takashi Yamano Fall Semester 2005

Lecture Notes on Advanced Econometrics

Lecture A2: All you need to know about STATA

In this lecture note, I explain the STATA commands that you typically need to do homework in my class. All I can do here, however, is introduce some STATA commands to you. To master STATA, you need to consult the STATA manuals and practice with actual data. For now, I assume that you will be typing each command in the STATA Command window. But I urge you to learn how to use STATA do-files. In a do-file, you can keep your commands in a file and execute all of them at once. In this way, you keep a record of what you have done for as long as you have the file. You will find do-files very useful. Instructions about do-files are presented later. But, for now, let's start with some very important commands. First, I open a STATA file and close it.
. clear
. use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA
. clear

STATA can hold only one data file in its memory. So before you open a STATA data file, you need to clear the STATA memory. You can open a data file by typing use followed by a file name with its directory. Alternatively, you can open a file by pulling down the File menu and choosing Open. After opening a file, you can simply discard the data from the STATA memory by typing clear again. Note that the original file is still in the same folder, so you can open the same file again if you like. Next, I open the same STATA file and save it into a different folder.
. clear
. use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA
. save C:/Docs/tmp/WAGE1.DTA

Note that I am saving this file in a different folder so that I do not replace the original data file. My advice to you is: Do not replace original files! If you create new variables (such as a squared variable) and want to save them, save the file in a different folder or use a different file name. There is always a danger of replacing an original file with a new file which has fewer observations or variables.

Descriptive Statistics

Next, there are some commands to obtain descriptive information: describe and summarize. describe shows the types and definitions of the variables. This is especially helpful when you use a data file for the first time. summarize provides descriptive statistics of the variables: means, standard deviations, minimums, and maximums. If you type summarize x, detail (or d for short), you get detailed information about the variable x. Here is how they work:
. describe

Contains data from C:\Docs\FASID\Classes\Econometrics\wooldridge_data\WAGE1.DTA
  obs:           526
 vars:            24                          16 Sep 1996 15:52
 size:        18,936 (97.8% of memory free)
---------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------
wage            float  %8.2g                  average hourly earnings
educ            byte   %8.0g                  years of education
exper           byte   %8.0g                  years potential experience
lwage           float  %9.0g                  log(wage)
expersq         int    %9.0g                  exper^2
tenursq         int    %9.0g                  tenure^2


. summarize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        wage |     526    5.896103    3.693086       .53      24.98
        educ |     526    12.56274    2.769022         0         18
       exper |     526    17.01711    13.57216         1         51
       lwage |     526    1.623268    .5315382 -.6348783   3.218076
     expersq |     526    473.4354    616.0448         1       2601
     tenursq |     526    78.15019    199.4347         0       1936

. summarize wage, d

                 average hourly earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1.67            .53
 5%         2.75           1.43
10%         2.92            1.5       Obs                 526
25%         3.33            1.5       Sum of Wgt.         526

50%         4.65                      Mean           5.896103
                        Largest       Std. Dev.      3.693086
75%         6.88          21.86
90%           10           22.2       Variance       13.63888
95%           13          22.86       Skewness       2.007325
99%           20          24.98       Kurtosis       7.970083

To obtain the frequency of each value of a categorical variable, you can use table. table can also provide descriptive statistics of other variables for each value of the categorical variable.
. table educ

----------------------
 years of |
education |      Freq.
----------+-----------
        0 |          2
        2 |          1
      ... |
       12 |        198
       13 |         39
      ... |
       18 |
----------------------



. table educ, c(mean wage sd wage min wage max wage n wage)

----------------------------------------------------------------------
 years of |
education | mean(wage)    sd(wage)   min(wage)   max(wage)     N(wage)
----------+-----------------------------------------------------------
        0 |       3.53    .9050967        2.89        4.17           2
        2 |       3.75                    3.75        3.75           1
      ... |
       12 |    5.37136    3.092932         .53       22.20         198
       13 |    5.59897    3.026567        2.00       15.38          39
      ... |
       18 |
----------------------------------------------------------------------
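In more recent STATA releases (version 17 and later, as far as I am aware) the table command uses a different syntax, so the c( ) option shown above may not run as written there. As a hedged alternative, the sketch below uses tabulate for frequencies and tabstat with by( ) for grouped statistics. It assumes the auto data set that ships with STATA (loaded with sysuse) rather than WAGE1, so price and foreign are stand-ins for wage and educ.

```stata
* Assumption: the built-in auto data stand in for WAGE1
sysuse auto, clear

* Frequency of each value of a categorical variable
tabulate foreign

* Mean, sd, min, max, and count of price for each value of foreign
tabstat price, by(foreign) statistics(mean sd min max n)
```

With WAGE1 in memory, the analogous call would be tabstat wage, by(educ) statistics(mean sd min max n).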







Creating Variables

You can create variables by using generate, or gen for short:
. gen educsq=educ*educ

or, equivalently,

. gen educsq=educ^2

If you want to drop (or delete) a variable, use drop.

. drop educsq

If you want to modify an existing variable, you need to use replace.
. replace female=2 if female==0

Here I have replaced zeros in female by 2. So now, female is one for female workers and two for male workers, instead of zero for male workers. In STATA, you need to type == (two equal signs) to test whether the value of a variable is equal to something; a single = assigns a value. Other comparison operators are >, >=, <=, and <. These are respectively larger than, equal to or larger than, equal to or smaller than, and smaller than.

Now, because female is no longer a zero-one dummy variable, I create a new dummy variable by typing:
. gen women=0
. replace women=1 if female == 1
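The generate and replace steps above can be tried on a small made-up data set. This is only a sketch: the four observations are hypothetical, and the last one has a missing value for female to show what happens to it.

```stata
* Hypothetical four-observation data set; the last value of female is missing
clear
input female
0
1
1
.
end

* Recode 0 to 2, as in the text
replace female = 2 if female == 0

* Two-step dummy: start at 0, then switch to 1 for women
gen women = 0
replace women = 1 if female == 1

* Optional: keep the dummy missing where female itself is missing
replace women = . if missing(female)

list
```

Without the last replace, the observation with missing female would be counted as a man. A related point to keep in mind is that STATA treats a missing value as larger than any number, so a condition such as if female > 1 is also true for missing values.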

Or STATA can create a dummy variable automatically by typing:

. gen women=(female==1)

Neat, isn't it?

OLS estimations

It is very easy to estimate OLS models in STATA. You just need to type:
. regress y x1 x2 x3 x4 x5

You can obtain a predicted variable by typing:

. predict yhat

Then, a predicted variable called yhat is created. (The new variable needs a name that is not already in use, so it cannot be called y here.) If you want a residual variable, then you need to type:

. predict e, residual

Here is an example:

. regress lwage educ exper expersq female married northcen south west

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  8,   517) =   45.95
       Model |  61.6387993     8  7.70484991           Prob > F      =  0.0000
    Residual |  86.6909521   517  .167680758           R-squared     =  0.4156
-------------+------------------------------           Adj R-squared =  0.4065
       Total |  148.329751   525   .28253286           Root MSE      =  .40949

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0808207   .0070101    11.53   0.000      .067049    .0945925
       exper |   .0363615   .0052269     6.96   0.000     .0260929    .0466301
     expersq |   -.000645   .0001128    -5.72   0.000    -.0008665   -.0004235
      female |  -.3345661   .0364315    -9.18   0.000     -.406138   -.2629941
     married |   .0711934   .0417725     1.70   0.089    -.0108712     .153258
    northcen |   -.070182   .0519674    -1.35   0.177    -.1722752    .0319111
       south |  -.1162238   .0486825    -2.39   0.017    -.2118637   -.0205839
        west |     .04643   .0576308     0.81   0.421    -.0667894    .1596494
       _cons |   .4625799   .1074952     4.30   0.000     .2513988     .673761
------------------------------------------------------------------------------

. predict y
(option xb assumed; fitted values)
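The regress and predict steps can be checked on any data set. The sketch below assumes the auto data that ships with STATA instead of WAGE1; the variable names price_hat, e, and check are my own choices for illustration. It confirms that the residual is simply the outcome minus the fitted value.

```stata
* Assumption: the built-in auto data stand in for WAGE1
sysuse auto, clear

* OLS of price on mileage and weight
regress price mpg weight

* Fitted values (xb is the default) and residuals
predict price_hat
predict e, residuals

* The residual should equal the outcome minus the fitted value
gen check = price - price_hat
summarize e check
```

Because the model includes a constant, the mean of e reported by summarize should be zero up to rounding.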

Graphs (this section is for version 8)

It is sometimes a good idea to examine the data visually. Here, I just explain two types of graphs: histograms and two-way graphs. A histogram is useful to see frequencies, and a two-way graph is useful to examine a relationship between two variables.
. twoway histogram wage

[Figure: histogram of wage. Horizontal axis: average hourly earnings; vertical axis: Density.]



When you have a discrete variable, you can add the discrete option to get one column for each value of the variable:

. twoway histogram educ, discrete

[Figure: histogram of educ with one column per value. Horizontal axis: years of education; vertical axis: Density.]



When you want to examine a relationship between two variables, you can create a two-way graph by typing:
. graph twoway scatter wage educ

[Figure: scatter plot of wage against educ. Horizontal axis: years of education; vertical axis: average hourly earnings.]



Or you can omit graph and type

. twoway scatter wage educ

to get the same graph. You can also include a fitted line by adding lfit wage educ. But because there are now two plot types, you need to put each one in parentheses:
. twoway (scatter wage educ) (lfit wage educ)
[Figure: scatter plot of wage against educ with a fitted line. Horizontal axis: years of education; vertical axis: average hourly earnings/Fitted values; legend: average hourly earnings, Fitted values.]

You can learn more about graphs in a STATA manual called Graphics.

All you need to know about using Do-files in STATA

There are three types of files in STATA: data-files (.dta), log-files (.log), and do-files (.do). (There is one more type called an ado-file, but I ignore this type of file in this note.) Data-files contain data. Most kinds of data files can be converted into STATA data files by using Stat/Transfer, a separate commercial program (developed by Circle Systems, not by StataCorp). Log-files record the commands and results displayed in the STATA Results window. I will discuss log-files later.

Do-files execute commands recorded in them. By recording all of your commands in a do-file, you can keep a history of your work. This way, you can execute exactly the same commands days or years later. You do not need to remember what you have done. You just need to remember the file names. (Actually this is not easy either. Occasionally, I spend many hours looking for old do-files. I recommend descriptive file names.)

Why do you need to use do-files?

Even though the advantages of using do-files become clear as you get used to them, you may think do-files are cumbersome at the beginning because you have to type every single command in them. There are three major reasons for using do-files: (i) it is easy to use do-files, (ii) you will be able to reproduce your results (even after many years), and (iii) you can communicate with your colleagues by exchanging do-files.

(i) You may not like typing all of your commands in do-files instead of pointing and clicking in the STATA menus. However, once you remember some of the important commands, you can do most of your work. When necessary, you can look up the manuals or use the help command in STATA to learn about commands.

(ii) You will need to reproduce your results even after many months. For instance, your adviser may want you to modify your models. With do-files you can just make small changes and produce results according to your adviser's comments; you do not need to start from scratch every time you change specifications.

(iii) When you work with your colleagues, it is useful to share the same data sets and exchange do-files. As long as the data sets are the same, the same do-files will produce the same results. This way, your colleagues can check your work and make adjustments. So let's start using do-files!

How to open a do-file

Just click File-Do. You can open existing do-files. Or click the icon with a note and pencil, and a new do-file will show up.

How to execute do-files

After typing commands in a do-file, you can just click the icon with a lined note. For instance, type the following commands in a do-file:

clear
use c:\docs\fasid\econometrics\homework\wage1.dta
sum wage
sum wage, d
table female
table female, c(mean wage)

Then click the icon with a lined note. You will probably see an error message:

file c:\docs\fasid\econometrics\homework\wage1.dta not found

This is because you don't have the wage1.dta data-file in the specified directory. But at least you know that the do-file has tried to execute your commands. So correct the directory and execute the do-file again. If you did not face any problems, you should find:
. sum wage

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
    wage |     526    5.896103    3.693086       .53      24.98

. table female

----------------------
   female |      Freq.
----------+-----------
        0 |        274
        1 |        252
----------------------


You have run a do-file. We will learn these two commands (sum and table) later. But for now, you should save the do-file by clicking File-Save As.

Commands You Need to Know

There is a note made by Wooldridge called Rudiments of STATA. This note explains most of the important commands, so I do not repeat them here. Instead, I will show you an example of a do-file:
Example 6-1

*This is a do-file, called how_to_STATA, for Lecture 6
clear
use c:\docs\fasid\econometrics\homework\wage1.dta
*log close
log using c:\docs\fasid\econometrics\homework\wage1.log, replace

*Describe the data
des
sum wage
sum wage, d
table female
table female, c(mean wage min wage max wage)

*Generate a wage variable in log
gen logwage=ln(wage)

*Generate a squared variable of experience
gen expersq=exper*exper

*Run OLS, predict logwage, and do F-test
reg logwage female educ exper expersq
predict yhat
test exper expersq

End of Example 6-1

One very useful command is this: *. This is called a star. It is not exactly a command, because a star (*) does not execute any work. Instead, a star (*) at the beginning of a line prevents that line from executing. For instance, in the above do-file, the second star (*) prevents the command log close from executing. I have left a star in front of log close because I do not want to execute this command yet. At this point there is no log-file open. If I try to close a log-file (by saying log close), STATA will give me an error message and will not execute the remaining commands. Thus I leave the second star in place. After I run this do-file once, a log-file will be open and will keep recording all the results shown in the STATA Results window. Thus, from the second run on, I delete the star in front of log close. As you can see, the star (*) is very useful for temporarily preventing some commands from executing.

Another way of using a star (*) is to put notes in do-files. Sometimes, you want to leave some notes in do-files to remind yourself or to explain things to your colleagues. Remember that you may need to open your do-files after many months or years. You may not remember all the details about your do-files at that time. From my experience, it is a good idea to leave some notes in your do-files, as I have done in this do-file.

Using log-files

As I mentioned above, a log-file records all the results displayed on the STATA screen. You can open a log-file in a word processor, such as Word. A fixed-width font such as Courier works best with STATA output.

When you need to replace an old log-file under the same name, you need to add replace after a comma:

log using c:\docs\fasid\econometrics\homework\wage1.log, replace

If you want to add new results at the end of an old log-file, you need to add append after a comma:

log using c:\docs\fasid\econometrics\homework\wage1.log, append

As I mentioned before, you can close a log-file by using

log close
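A common alternative to commenting log close in and out is to put capture in front of it. capture runs the command and suppresses any error, so the line is harmless when no log-file is open. The sketch below is a minimal illustration under my own assumptions: the log name example_session.log is hypothetical and is written to the current working directory, the text option is added to make sure a plain-text log is produced, and the built-in auto data are used instead of WAGE1.

```stata
* Close a log-file if one is open; do nothing (and raise no error) otherwise
capture log close

* Open a plain-text log in the current working directory (hypothetical name)
log using example_session.log, replace text

* Assumption: the built-in auto data stand in for WAGE1
sysuse auto, clear
summarize price

* Close the log so that the file is complete on disk
log close
```

With this pattern, the same do-file can be run the first time and every later time without editing it.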

All you need to know about managing data in STATA

Sorting the data

sort arranges the observations into ascending order of the values of the variable. For instance, assume that income contains income; then

sort income

arranges the observations from the lowest income observation to the highest. You can sort observations according to more than one variable. For instance, if you type

sort female_head income

STATA sorts observations first by female_head and then by income, separately for male and female headed households. To see the sorted data, you can look into the data window, or you can use list. list shows the specified variables on the screen. For instance,

list income

shows income values from the lowest.

list income in 1/20

shows income values from the lowest to the 20th observation. Although sort is a useful command, it can only sort the observations in ascending order. Sometimes, you may want to sort observations in descending order, from the largest to the smallest. For this purpose, you can use gsort:

gsort -income

This will sort observations from the largest to the smallest. You can also use more than one variable.

gsort female_head -income

This will sort observations from the largest to the smallest income for male and female headed households separately.
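The difference between sort and gsort is easy to see on a small example. The four households below are made up for illustration only.

```stata
* Hypothetical data: four households
clear
input female_head income
0 300
1 150
0 120
1 410
end

* Ascending order of income
sort income
list

* Descending order of income
gsort -income
list

* Ascending by female_head, then descending by income within each group
gsort female_head -income
list
```

After the last command, the male headed households (female_head equal to 0) come first, each group running from its highest income to its lowest.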

Aggregating the data

In surveys, information is collected at different units. For instance, a typical household survey not only collects information at the household level (e.g., How much does this household use?) but also at the individual level (e.g., How old is this person?). To combine information collected at different units, we need to either aggregate data up to a higher unit or merge data from a higher unit to data at a lower unit. For instance, we may need to create an aggregated data set from the individual level up to the household level. In STATA, we can use collapse to create an aggregated data set. For instance, assume that we have demographic information at the individual level:

HHID   PersonID   Age   Gender
1      1          42    Male
1      2          37    Female
1      3          10    Female
2      1          28    Male
2      2          24    Female

HHID indicates the ID number of the household to which each individual belongs; PersonID indicates the ID number of each individual; and Age and Gender indicate personal information. Suppose that we want to create a variable called HHsize that indicates the household size. To create HHsize, I would first create a variable HHsize that is one for all individuals:

gen HHsize = 1

HHID   PersonID   Age   Gender   HHsize
1      1          42    Male     1
1      2          37    Female   1
1      3          10    Female   1
2      1          28    Male     1
2      2          24    Female   1

Then, I would aggregate up the data to the household level.

collapse (sum) HHsize, by(HHID)

collapse aggregates the data up to the level identified by the variable specified in by( ). In this example, I am aggregating the data up to the HHID level. We will get an aggregated data set that looks like:

HHID   HHsize
1      3
2      2

Notice that all the other variables are eliminated. In addition to summing up, you can also calculate means, standard deviations, maximums, minimums, medians, etc. For instance, you can calculate the average age and find the maximum age within each household by typing:

collapse (sum) HHsize (mean) Age (max) Agemax=Age, by(HHID)

HHID   HHsize   Age    Agemax
1      3        29.7   42
2      2        26     28
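The whole aggregation can be reproduced by typing the five individuals in with input. This is a sketch that only re-enters the example data shown above; nothing else is assumed.

```stata
* The individual-level example data from the text
clear
input HHID PersonID Age str6 Gender
1 1 42 "Male"
1 2 37 "Female"
1 3 10 "Female"
2 1 28 "Male"
2 2 24 "Female"
end

* One for every person, so that the sum is the household size
gen HHsize = 1

* Household size, mean age, and maximum age by household
collapse (sum) HHsize (mean) Age (max) Agemax=Age, by(HHID)
list
```

Note that collapse replaces the data in memory, so the individual-level file should already be saved somewhere before it is run.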

After creating an aggregated data set, you can combine it with another data set using an identifying variable. In the example, the identifying variable is HHID. Before merging this file with other data files at the household level, you need to sort the data according to the identifying variable. Thus,

sort HHID
save c:/data/tmp/hhsize, replace

Merging data files

To combine data from different files, we need to merge files. Files must be sorted by the same identifying variable in the same order before merging. For instance, suppose that we have a data set of household income at the household level and want to bring in HHsize from a different file to create a per capita income variable, called PCincome. First, we need to open a base file. In this example, this is a file with household income:

HHID   income
1      302
2      189

Then, we merge this file with a file that contains HHsize:

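sort HHID
merge HHID using c:/data/tmp/hhsize

HHID   income   HHsize   Age    Agemax   _merge
1      302      3        29.7   42       3
2      189      2        26     28       3

The variable _merge is created by STATA automatically; it equals 3 for observations that were found in both files.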

Thus, we have merged two data files at the household level (HHID).
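The merge, followed by the PCincome calculation mentioned earlier, can be sketched as below. Two things here are assumptions and not part of the original notes: the household-size file is kept in a temporary file created with tempfile instead of c:/data/tmp/hhsize, and the merge uses the merge 1:1 syntax, which needs STATA 11 or later. In older versions the form merge HHID using filename shown above applies instead. Because it uses a local macro, the block should be run as one do-file.

```stata
* Household-size file, saved to a temporary file (an assumption)
clear
input HHID HHsize
1 3
2 2
end
sort HHID
tempfile hhsize
save `hhsize'

* Base file with household income
clear
input HHID income
1 302
2 189
end
sort HHID

* One-to-one merge on HHID (syntax for STATA 11 or later)
merge 1:1 HHID using `hhsize'
tabulate _merge

* Per capita income
gen PCincome = income / HHsize
list
```

Checking _merge before using the merged variables is a good habit: values of 1 or 2 would indicate households that appear in only one of the two files.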