An Introduction to Stata

F. PERACCHI

Faculty of Economics, Tor Vergata University, Rome, Italy

Contents

Introduction

1 Getting Started
  1.1 STARTING AND STOPPING STATA
    1.1.1 THE STATA WINDOWS
    1.1.2 THE STATA TOOLBAR
    1.1.3 ALLOCATING MEMORY TO STATA
  1.2 STATA DOCUMENTATION AND UPDATES
    1.2.1 THE HELP SYSTEM
    1.2.2 THE REFERENCE MANUAL
    1.2.3 THE STATA TECHNICAL BULLETIN
    1.2.4 TUTORIALS
    1.2.5 STATA UPDATES
  1.3 VARIABLES AND OBSERVATIONS
    1.3.1 VARIABLES
    1.3.2 OBSERVATIONS
  1.4 INPUTTING DATA
    1.4.1 DIRECT TYPING
    1.4.2 THE DATA EDITOR
    1.4.3 LOADING AN ASCII (TEXT) DATA FILE
    1.4.4 LOADING A STATA DATA FILE
  1.5 BASIC DATA MANIPULATION
    1.5.1 DISPLAYING DATA
    1.5.2 LABELING DATA
    1.5.3 SUMMARIZING DATA
    1.5.4 CREATING NEW VARIABLES
    1.5.5 CHANGING AND RENAMING VARIABLES
    1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS
    1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET
  1.6 OUTPUTTING DATA
  1.7 LOG FILES

2 Stata Commands
  2.1 GENERAL SYNTAX
    2.1.1 BY
    2.1.2 WEIGHTS
    2.1.3 IF AND IN
    2.1.4 QUIETLY AND NOISILY
  2.2 BASIC DATA COMMANDS
    2.2.1 DESCRIBE
    2.2.2 LIST
    2.2.3 DROP AND KEEP
    2.2.4 GENERATE AND EGEN
    2.2.5 REPLACE
    2.2.6 SORT AND GSORT
  2.3 BASIC SAMPLE STATISTICS
    2.3.1 COUNT
    2.3.2 SUMMARIZE
    2.3.3 MEANS
    2.3.4 CENTILE
    2.3.5 CUMUL
    2.3.6 CORRELATE
    2.3.7 REGRESS
  2.4 TABLES
    2.4.1 TABLE
    2.4.2 TABULATE
    2.4.3 TABSUM
  2.5 COMBINING DATA
    2.5.1 APPEND
    2.5.2 MERGE
  2.6 RESHAPING DATA
    2.6.1 COLLAPSE
    2.6.2 CONTRACT
    2.6.3 EXPAND
    2.6.4 FILLIN
    2.6.5 RESHAPE

3 Graphics
  3.1 BASIC SYNTAX AND GRAPHIC STYLES
  3.2 COMMON GRAPH OPTIONS
  3.3 HISTOGRAMS
  3.4 TWO-WAY SCATTERPLOTS
  3.5 TWO-WAY SCATTERPLOT MATRICES
  3.6 BOX PLOTS

4 Programming and Matrix Commands
  4.1 PROGRAMMING STATA
    4.1.1 MACROS
    4.1.2 SYSTEM MACROS
    4.1.3 LOOPING
    4.1.4 BRANCHING
    4.1.5 PROGRAM ARGUMENTS
    4.1.6 TEMPORARY OBJECTS
    4.1.7 EXCHANGING RESULTS BETWEEN PROGRAMS
  4.2 DO FILES
  4.3 ADO FILES
  4.4 MATRIX COMMANDS
    4.4.1 ROW AND COLUMN NAMES
    4.4.2 SUBSCRIPTING AND SUBMATRICES
    4.4.3 MATRIX OPERATORS AND FUNCTIONS
    4.4.4 CROSS-PRODUCT MATRICES
    4.4.5 DATA TO MATRIX CONVERSION
    4.4.6 GETTING SYSTEM MATRICES
    4.4.7 MATRIX DECOMPOSITION

5 Statistical Inference Using Stata
  5.1 ESTIMATION
    5.1.1 GENERAL SYNTAX OF ESTIMATION COMMANDS
    5.1.2 WEIGHTED ESTIMATION
    5.1.3 CONSTRAINED ESTIMATION
    5.1.4 ROBUST VARIANCE ESTIMATES
  5.2 POST-ESTIMATION COMMANDS
    5.2.1 ACCESSING COEFFICIENTS AND STANDARD ERRORS
    5.2.2 DISPLAYING THE VARIANCE ESTIMATES
    5.2.3 PREDICTIONS AND RESIDUALS
    5.2.4 HYPOTHESIS TESTING
  5.3 BOOTSTRAPPING AND MONTE CARLO SIMULATIONS
    5.3.1 BOOTSTRAP
    5.3.2 MONTE CARLO SIMULATION

6 Statistical Models in Stata
  6.1 LINEAR MODELS
    6.1.1 ORDINARY LEAST SQUARES
    6.1.2 CONSTRAINED LINEAR REGRESSION
    6.1.3 LINEAR INSTRUMENTAL VARIABLES
  6.2 GENERALIZED LINEAR MODELS
    6.2.1 GLM
    6.2.2 LOGIT AND PROBIT
    6.2.3 POISSON AND NBREG
  6.3 OTHER LIMITED DEPENDENT VARIABLES MODELS
    6.3.1 GROUPED BINARY RESPONSES
    6.3.2 ORDERED CATEGORICAL RESPONSES
    6.3.3 NESTED LOGIT
    6.3.4 MULTINOMIAL LOGIT
    6.3.5 BIPROBIT
    6.3.6 CENSORED AND TRUNCATED REGRESSION
  6.4 DURATION DATA
    6.4.1 PARAMETRIC DURATION MODELS
    6.4.2 COX PROPORTIONAL HAZARD MODEL
  6.5 PANEL DATA
    6.5.1 LINEAR PANEL DATA MODELS
    6.5.2 DYNAMIC PANEL DATA MODELS
    6.5.3 SEEMINGLY UNRELATED REGRESSION EQUATIONS
    6.5.4 GEE FOR PANEL DATA
    6.5.5 LOGIT AND PROBIT FOR PANEL DATA
    6.5.6 POISSON AND NEGATIVE BINOMIAL MODELS
  6.6 TIME SERIES
    6.6.1 LINEAR MODELS WITH AUTOCORRELATED ERRORS
    6.6.2 ARIMA MODELS
    6.6.3 ARCH-TYPE MODELS
  6.7 ROBUST AND QUANTILE REGRESSION
    6.7.1 ROBUST REGRESSION
    6.7.2 QUANTILE REGRESSION
  6.8 NONPARAMETRIC ESTIMATION
    6.8.1 DENSITY ESTIMATION
    6.8.2 REGRESSION SMOOTHERS
  6.9 GENERAL NONLINEAR METHODS

References

Introduction

Why use Stata? In my view, it has three main advantages over other statistical packages. The first is portability: Stata runs on several platforms (Macintosh, Unix, Windows), and Stata programs written for one of them run with (almost) no change on any other one. The second is speed: Stata is fast because all data manipulations are carried out in the RAM. The only limit is the amount of RAM available. With 100MB of RAM, one can work with a dataset containing 5 million observations on 4 real-valued variables or, equivalently, with one million observations on 20 real-valued variables. The third advantage is that Stata contains "state-of-the-art" statistical procedures, is programmable, and is fully integrated with a matrix language.

The latest release is Stata 7.0. In what follows I focus on Stata 7.0 for Windows 98/95/NT.

This introduction to Stata is organized as follows. Chapter 1 describes the main features of the program. Chapter 2 introduces the syntax of a Stata command and presents some of the most used commands. Chapter 3 describes Stata graphic capabilities. Chapter 4 introduces the elements of Stata programming and the Stata matrix language. Chapter 5 shows how to carry out statistical inference (estimation, prediction, hypothesis testing) using Stata. Finally, Chapter 6 reviews the main classes of statistical models implemented in Stata.
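The memory figures quoted in the Introduction can be checked with a rough calculation. The sketch below assumes 4 bytes per stored value (the float storage type) and ignores any overhead Stata may add, so it is only an approximation of the true requirement.

```stata
* Rough memory check, assuming 4 bytes per value (float storage)
* and ignoring any overhead.
* 5 million observations on 4 variables, in megabytes:
display 5000000*4*4/1024^2
* 1 million observations on 20 variables, in megabytes:
display 1000000*20*4/1024^2
* Both layouts hold 20 million values and come to roughly 76 MB,
* which fits within 100MB of RAM.
```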

1 Getting Started

This chapter introduces the main aspects of Stata, namely how to start and stop the program (Section 1.1), where to look for documentation and updates (Section 1.2), the definition of variables and observations (Section 1.3), how to input data (Section 1.4), basic data manipulation (Section 1.5), how to output data (Section 1.6), and how to open and close log files (Section 1.7).

I adopt the following typographic conventions: the typewriter-style typeface is used for Stata commands or options that have to be typed in (e.g. describe or generate), italics is used for things that must be substituted for by some other word (e.g. varname or varlist), small caps is used for keyboard keys (e.g. Enter or Ctrl+Break) and boldface is used for Windows commands or switches (e.g. Exit or Help).

1.1 STARTING AND STOPPING STATA

To start Stata, click on the Stata icon. To test the installation of Stata, type verinst. To test that the supplied ado files (see Section 4.3) are correctly installed, type crc. To make Stata stop what it is doing and return to the Stata prompt, click the Break button or press Ctrl+Break. To exit Stata, type exit or choose Exit from the File menu. To exit Stata when there are data in memory which have not been saved, type exit, clear.

1.1.1 THE STATA WINDOWS

The Stata windows consist of:
• the Stata Command window (where commands are typed in and then issued by pressing Enter),
• the Stata Results window (where results are displayed),
• the Review window (it shows the past commands),
• the Variables window (it shows the list of variables).
All these windows may be resized and rearranged. The windowing preferences may be saved by choosing Prefs from the main menu bar.

The Stata Command window follows standard Windows editing style. The keys for editing in the Command window are Delete, Backspace, Esc, Home, End, Page Up and Page Down. Clicking once on a command in the Review window copies the command to the Command window, where it can be edited before being entered. One can copy one line at a time from the Results window into the clipboard and paste it into the Command window.

1.1.2 THE STATA TOOLBAR

Going from left to right, the Stata toolbar contains the following buttons (holding the mouse pointer over each button, a box with a brief description will appear):
1. Open (opens a Stata dataset).
2. Save (saves to disk the Stata dataset currently in memory).
3. Print (prints a graph or log).
4. Begin Log (starts a new log, appends to an existing log, and stops or suspends the current log).
5. Start Viewer (opens the Stata viewer for help on Stata).
6. Bring Results Window to Front (brings the Results window to the front of the other Stata windows).
7. Bring Graph Window to Front (brings the Graph window to the front of the other Stata windows).
8. Bring Dialog Window to Front (brings the Dialog window to the front of the other Stata windows).
9. Do-file Editor (opens the Do-file editor or brings the Do-file Editor window to the front of the other Stata windows).
10. Data Editor (opens the data editor or brings the Data Editor window to the front of the other Stata windows).
11. Data Browser (opens the data browser or brings the Data Browser window to the front of the other Stata windows).
12. Clear -more- Condition (tells Stata to continue when it has paused in the middle of a long output).
13. Break (stops the current task in Stata).

1.1.3 ALLOCATING MEMORY TO STATA

Initially, Stata allocates 1MB of memory to each session. Under Windows, to permanently change the amount of memory used every time Stata is invoked, click on the Stata icon, pull down File and choose Properties, click on the Shortcut tab and put k# (in kilobytes) or m# (in megabytes) after the call to wstata.exe in the Target line. The Start in line specifies the initial working directory. This may be changed by editing the line or by using the cd drive:/directory_name command from the Command line.

Memory allocation may also be changed within a given Stata session (although not permanently) by using the set memory #k command (in kilobytes) or the set memory #m command (in megabytes). This command requires that no data be present in memory. If more memory is used than is physically available on the computer, Stata slows down. In this case, it is recommended to set virtual memory on by typing
. set virtual on

1.2 STATA DOCUMENTATION AND UPDATES

The main documentation comes from the help system and the Stata reference manual. Additional documentation on Stata developments and updates is available through the Stata Technical Bulletin and the Stata Web site http://www.stata.com.

1.2.1 THE HELP SYSTEM

On-line help can be accessed by opening the Stata viewer from the toolbar or by choosing Help from the main menu bar. In either case, selecting Contents opens the table of contents for on-line help. Selecting Search and then entering keyword searches for keyword in the list of help entries. On-line help can also be accessed from the Command line by typing help keyword or lookup keyword. Try help, or help contents, or help list.

1.2.2 THE REFERENCE MANUAL

It consists of the introductory booklet Getting Started with Stata, plus seven volumes: the User's Guide, the Graphics Manual, the Programming Manual and the Reference Manual in four volumes.

1.2.3 THE STATA TECHNICAL BULLETIN

The Stata Technical Bulletin (STB) is a printed and electronic journal with corresponding software. It contains articles written by Stata Corp., Stata users, and others. Articles have included enhancements to Stata (ado-files), tutorials on programming strategies, illustrations of data analysis techniques, discussions on teaching statistics, debates on appropriate statistical techniques, reports on other programs, along with interesting datasets, questions, and suggestions. The STB is published every two months (in January, March, May, July, September and November). Every year, the 6 issues are bound into a volume.
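Returning to the session-level commands of Sections 1.1.3 and 1.2.1, a minimal sketch of how they might be typed in a Stata 7 session is given below. The 32m figure is an arbitrary illustration, not a recommendation, and more recent Stata releases are understood to manage memory automatically, so set memory may have no effect there.

```stata
* Sketch for a Stata 7 session; 32m is an arbitrary illustrative size.
* set memory requires that no data be present in memory, so clear first.
clear
* Allocate 32 megabytes for the current session only.
set memory 32m
* Table of contents of the on-line help.
help contents
* Help entry for a specific command.
help summarize
```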

1.2.4 TUTORIALS

Stata provides tutorials on a variety of aspects: introduction to Stata (intro.tut), data input, graphics, tables, and procedures for statistical modeling. To run a tutorial, type tutorial tutname, where tutname is any of the following:
• contents (lists the available official Stata tutorials),
• intro (introductory tutorial),
• ourdata (description of the data provided by Stata),
• yourdata (how to input data),
• graphics (how to make graphs),
• tables (how to make tables),
• regress (estimating regression models, including 2SLS),
• anova (estimating one-, two- and N-way ANOVA and ANCOVA models),
• logit (estimating maximum-likelihood logit and probit models),
• survival (estimating maximum-likelihood survival models),
• factor (estimating factor and principal component models).

1.2.5 STATA UPDATES

The Web site http://www.stata.com contains, among other things, answers to frequently asked questions (FAQs), free additions to Stata ("Cool ado-files") and the latest official updates to Stata, which are released fairly frequently (every 3 to 4 weeks). The latter can be downloaded directly using the update command. Another useful command is net, which fetches and installs additions to Stata obtained from the Internet or from media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. The net search keywords command searches the Internet for user-written additions to Stata that contain the specified keywords.

1.3 VARIABLES AND OBSERVATIONS

In Stata, variables are associated with the columns of a data matrix, observations with its rows.

1.3.1 VARIABLES

Variables come in two types: alphabetic (strings) or numeric (real or integer valued). Variables are called by their name, which must be 1 to 8 characters long. The first character must be a letter or an underscore, the other characters can be letters, digits or underscores (spaces or other characters are not allowed). Stata is case sensitive (xx, xX, Xx and XX are all different names). There are a few reserved names that cannot be used: e.g. if, in, int, with. It is better to avoid using the name e for variables and beginning variable names with an underscore (all Stata built-in variables begin with an underscore). Variables can be renamed using the rename command. For example:
. rename x y
renames the variable x as y.

Associated with each type of variable is a storage type. The available storage types are:
• String variables: str#, where # is an integer between 1 and 80 specifying the number of characters in the string. The maximum length of a string is 80 characters.
• Integer valued variables: long (integers between -2,147,483,648 and 2,147,483,646), int (integers between -32,768 and 32,766) and byte (integers between -127 and 126).
• Real valued variables: double (double precision or about 16 digits of accuracy) and float (single precision or about 7 digits of accuracy).
The default for storing numeric variables is float, but Stata performs all internal calculations in double. A double occupies twice as much space as a float, a long occupies the same space as a float, an int occupies half the space of a float, and a byte occupies half the space of an int. Thus, a byte occupies 1/4 of the space of a float and 1/8 of the space of a double, which allows categorical and indicator variables to be stored very efficiently. The compress command may be used to automatically optimize the storage type of the data in memory. Notice that Stata uses '.' (a period) to denote both the decimal symbol and missing numerical values.

1.3.2 OBSERVATIONS

Observations correspond to the rows of a data matrix. Stata automatically creates and updates the built-in system variable _n, which is a counter containing the number of the current observation, and the system macro _N, which contains the total number of observations in the dataset. Notice that when the sorting of the data changes, so does the counter _n.

1.4 INPUTTING DATA

Data can be inputted into Stata by direct typing, with the data editor or from a file (an ASCII file or a Stata data file).
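As a brief recap of Section 1.3 before data are read from external sources, the sketch below tries the naming, storage-type and observation-numbering rules on a tiny dataset. The variable names (id, score, flag, obsno, mark) and all values are hypothetical and chosen only for illustration.

```stata
* Hypothetical toy data: names and values are illustrative only.
clear
input id score
1 2.5
2 3.5
3 4.0
end

* Numeric variables are stored as float unless a type is given.
describe

* Request a 1-byte storage type explicitly for an indicator variable.
generate byte flag = 1

* _n is the number of the current observation, _N the total number.
generate obsno = _n
display _N

* Rename score as mark; the contents are unchanged.
rename score mark

* Let Stata pick the smallest storage type that loses no information.
compress
describe
```

Comparing the two describe listings shows which variables compress was able to store in a smaller type.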

1.4.1 DIRECT TYPING

The input command allows typing data directly into the dataset in memory. Example:
. input x1 x2 x3
4 3.5 "J. Neyman"
7 ...
... .01 "R.A. Fisher"
2 11.5 "J. Tukey"
3 21.1 "D.R. Cox"

1.4.2 THE DATA EDITOR

The data editor corresponds to the edit command. It may be accessed by clicking the Data Editor button on the Stata toolbar. The data editor is like a standard spreadsheet with columns corresponding to variables and rows to observations. Data may be entered or modified by choosing the cell, typing the value and then pressing Enter or Tab. With the data editor, quotes around strings are unnecessary. Missing numeric values are recorded as '.' (a period), missing string values are just empty strings. The data editor initially names variables var1, var2, . . . . Variables may be renamed by double-clicking anywhere in the variable's column, thus bringing up the Variable information dialog. The data editor allows copying and pasting data created by other spreadsheet or database programs. It is important to always check the numeric format of a spreadsheet before copying data to Stata. Unless otherwise instructed, Stata will interpret the number 1,314 as a string.

1.4.3 LOADING AN ASCII (TEXT) DATA FILE

Stata offers three basic commands for loading an ASCII (text) data file. The first command, infile, is a very flexible way of reading an ASCII (text) datafile from disk into memory. The data can be in either free- or fixed-format, and a single observation may span any number of input lines. The basic syntax for data in free-format (data may be separated by spaces, tabs or commas) is:
infile varlist using filename [, clear]
where varlist is a list of variable names with blanks in between (that is, varname1 varname2 . . . ), filename is the name of the disk datafile (including the path, if necessary) and clear is an option that clears data loaded in memory without saving them (I follow the convention of denoting items that are optional by enclosing them in square brackets). If the file name is specified without an extension, .raw is assumed.
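The direct-typing and free-format commands above can be sketched as follows. All values, the variable names and the file name mydata.raw are hypothetical. One detail is added here as an assumption about how input and infile are normally used: a string variable is declared with a str# storage type in front of its name, and an input block is closed with end.

```stata
* Hypothetical values; str20 declares x3 as a string of up to 20 characters.
clear
input x1 x2 str20 x3
1 3.5 "first label"
2 11.5 "second label"
3 . ""
end
* The third observation has a missing numeric value (.) and an empty string.
list

* Free-format ASCII file; mydata.raw is a hypothetical file name.
* No extension is typed, so .raw is assumed.
infile y1 y2 str20 y3 using mydata, clear
```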

If the data are in fixed-format, a dictionary (.dct) file is necessary. A dictionary is an ASCII (text) file which describes the contents of a datafile. The data may be in the same file as the dictionary or in another file. The basic syntax is:
infile using filename [, using(filename2) clear]
where filename is the name of a dictionary file and filename2 is the name of the file containing the data. If the using() option is not specified, the data is assumed to follow the dictionary in filename or, if the dictionary specifies the name of some other file, that file is assumed to contain the data. Notice that if using(filename2) is specified, filename2 is used to obtain the data even if the dictionary itself says otherwise. The basic syntax of a dictionary file is the following:
[infile] dictionary [using filename] {
* comments may be included freely
[type] varname
}
(data might appear here)

The second is the infix command, which reads ASCII files in fixed-column format. It is somewhat easier but less flexible than infile. Its basic syntax is:
infix using filename [, using(filename2) clear]
where filename is the name of the dictionary file and filename2 is the name of the file containing the data. Again, a single observation may span any number of input lines.

The third is the insheet command, which reads ASCII files created by a spreadsheet or database program. Regardless of the creator, this command reads ASCII files where there is one observation per line and the values are separated by tabs or commas. The first line of the file may contain the variable names. The basic syntax is:
insheet [varlist] using filename [, {comma|tab|delimiter("char")} clear]
The {comma|tab|delimiter("char")} option tells Stata how values are separated in the file (I follow the convention of denoting the available alternatives by enclosing them in curly brackets, with a vertical bar as a separator). Specifying tab or comma is not necessary because insheet can determine the separation character for itself when the character is a tab or comma. The insheet command can also determine for itself whether the file includes variable names.

1.4.4 LOADING A STATA DATA FILE

Stata data files have the default extension dta. Stata data files on disk may be loaded using the use command. The basic syntax is:
use filename [, clear]
where filename contains the full path to the data.
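The fixed-format, spreadsheet and Stata-file loaders described above might be used as in the sketch below. All file names, variable names and column positions are hypothetical. The infix line uses the form in which the column specifications are typed directly in the command rather than in a dictionary file; that form is not covered in the text above and is shown here as an assumption. In recent Stata releases insheet is understood to be superseded by import delimited, although the syntax below follows the text.

```stata
* All file names, variable names and column positions are hypothetical.

* Fixed-column ASCII file read without a separate dictionary:
* id in columns 1-4, a string name in columns 5-24, income in columns 25-32.
infix id 1-4 str name 5-24 income 25-32 using survey.raw, clear

* Spreadsheet export with one observation per line, comma separated.
* The comma option could be omitted, since insheet can detect it.
insheet using survey.csv, comma clear

* Stata dataset; no extension is typed, so the default dta is used.
use survey, clear
describe
```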

The content of a Stata data file on disk may be described without actually loading it by using the describe using filename command.

1.5 BASIC DATA MANIPULATION

I now introduce some basic facilities for manipulating data.

1.5.1 DISPLAYING DATA

Strings and values of scalar expressions may be displayed using the display command. This command may also be used interactively as a substitute for a hand calculator. Examples:
. display "this is a string"
. display 5+exp(ln(10))
. display "the value of f(x) is" 5+exp(ln(10))

The content of the dataset in memory may be displayed by using the describe command. The complementary ds command lists variable names in a compact format, whereas lookfor string helps in finding variables by searching for string among all variable names and labels. The list [varlist] command displays the values of variables. If no varlist is specified, the values of all the variables are displayed.

The display format of a variable may be specified using the command
format varlist %fmt
where %fmt is the chosen format for varlist. For example, format x %9.0g displays the variable x in g (generic numeric) format, whereas format x %8.3f displays x in f (fixed numeric) format with three decimals. Changing the display format does not affect the internal precision with which variables are stored and manipulated. The set dp comma command may be used to display numerical values using comma as the decimal character. To switch the decimal character back to period, type set dp period.

1.5.2 LABELING DATA

Stata contains a number of commands for manipulating labels. Variables are labeled by using the command
label variable varname "string"
where string (typed in quotes) is up to 80 characters long. If no label is specified, any existing variable label is removed. The values of a variable are labeled by using the command
label value varname [lblname]
where lblname is the name of a value label defined through the command
label define lblname # "string" [# "string" ...]
Example:
. label define sexlbl 0 "male" 1 "female"
. label value sex sexlbl
The label dir command lists the names of value labels stored in memory, whereas label list lists the names and contents of value labels stored in memory. label drop lblnames eliminates the value labels lblnames, label drop _all eliminates all value labels.

Data files are labeled by using the command
label data "string"
where string is up to 80 characters long. If no label is specified, any existing label is removed. Data labels are displayed when the data are used or described.

1.5.3 SUMMARIZING DATA

The command summarize [varlist] calculates and displays a variety of univariate summary statistics (number of nonmissing observations, mean, standard deviation, minimum and maximum value). If no varlist is specified, summary statistics are calculated for all the variables in the data. The command summarize [varlist], detail produces additional statistics including skewness, kurtosis, the four smallest and four largest values, along with various percentiles.

1.5.4 CREATING NEW VARIABLES

New variables are created using the generate command. The basic syntax of this command is:
generate newvar = exp [options]
where exp is an expression and options are optional instructions that may restrict the application of exp. For example, to generate the new variable y using positive values of an existing variable x, type:
. generate y = x*x + log(x) if x>0
where *, + and > are examples of Stata operators, log(x) is an example of a Stata function, and if x>0 is a qualifier that restricts the scope of the command to the observations for which x > 0.
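The display, labeling, summarizing and generate commands of Sections 1.5.1 to 1.5.4 can be tried together on a small dataset. The sketch below uses hypothetical values; the variable names sex and x and the label name sexlbl follow the examples in the text, and the label text for x is invented for illustration.

```stata
* Hypothetical toy data: sex is coded 0/1 and x holds a few positive
* and non-positive values.
clear
input sex x
0 2
1 0.5
0 -1
1 4
end

* Attach a variable label to x and a value label to sex.
label variable x "illustrative measurement"
label define sexlbl 0 "male" 1 "female"
label values sex sexlbl

* Show x with three decimals; the stored values are not changed.
format x %8.3f
list

* Basic and detailed summary statistics for x.
summarize x
summarize x, detail

* y is computed only where x>0 and is set to missing elsewhere,
* so log() is never evaluated at a non-positive value.
generate y = x*x + log(x) if x>0
list x y
```

In the final listing, y should be missing for the observation with x equal to -1, which is the effect of the if qualifier.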

| (or). * (multiplication). they are separated by commas. Functions are used in expressions. for example exp(x). distribution functions and quantile functions of various probability distributions.10 To generate the new variable y containg lagged values of x.x) (chi-square distribution with df degrees of freedom). whereas typing generate y = x[_n] and y = x are equivalent.L.x. • Statistical functions: density. If X denotes the name of a continuous distribution. including other functions. ˜= (not equal). display binorm(x. == (equal). Examples: . display normden(x) (standard normal density) ..5) . > (greater than).25) (cumulative binomial with parameters n = 5 and π = .. Stata has built in a number of functions: • Mathematical functions. generate y = x[_n-1] Typing generate y = x[1] sets every observation of y equal to the ﬁrst observation in x.25) . . unit variances and correlation ρ = . The arguments of a function are enclosed in parentheses.df2 . display invnorm(p) (standard normal quantile function) . ˜ (not). The currently available distributions include chi2(df .x) (t distribution with df degrees of freedom). If there are multiple arguments. display Binomial(5. log(x) or ln(x). • string operators: + (string concatenation). then Stata usually provides X() (cumulative distribution function). The Stata operators are: • arithmetic operators: + (addition). Parentheses may be used to force a diﬀerent order of evaluation. norm(x) (standard Gaussian). invX() (quantile function) and invXtail() (upper quantile function). for example the expression "abc" + "def" produces the string "abcdef". <= (less or equal). ˆ (power).y. F(df1 . nchi2(df . • relational operators: < (less than). Xtail() (upper tail cumulative distribution function). sqrt(x). / (division). type: . both discrete and continuous. display norm(x) (cumulative standard normal) . >= (greater or equal).x) (noncentral chi-square distribution with df degrees of freedom and noncentrality parameter L) and t(df . 
The argument(s) of a function may be any expression.5) (cumulative bivariate Gaussian with zero means.(subtraction). • logical operators: & (and). abs(x) and the main trigonometric functions. Functions return missing when the value of the function is undeﬁned. The order of evaluation follows the standard rules.x) (F distribution with df1 and df2 degrees of freedom).

• Pseudo-random number generator: uniform(), which generates uniformly distributed pseudo-random numbers on the interval [0,1). It takes no arguments, and is George Marsaglia's KISS (Keep It Simple Stupid) generator. Pseudo-random numbers according to any other continuous distribution may be generated through the inverse probability integral transform. For example, pseudo-random numbers according to the standard normal distribution may be generated with the expression invnorm(uniform()), where the function invnorm evaluates the quantile function of the standard normal.
• String functions (which apply to string variables), for example lower(s) (returns the lowercased variant of s), real(s) (converts s into a numeric value), string(n) (converts n into a string), substr(s,n1,n2) (returns the substring of s starting at n1 for a length of n2; if n2 = ., the remaining portion of the string is returned), and upper(s) (returns the uppercased variant of s).
• Special functions, for example float(x) (returns the value of x rounded to float storage type), int(x) (returns the integer part of x), max(x1,x2,...,xn) and min(x1,x2,...,xn) (return respectively the maximum and the minimum of the arguments, ignoring missing values), round(x,y) (returns x rounded into units of y), sign(x) (returns -1 if x < 0, 0 if x = 0, 1 if x > 0, and . if x = .), and sum(x) (returns the running sum of x, treating missing values as zero).
A variety of date and time-series functions are also available, as well as matrix functions returning scalars (see Section 4.3).

1.5.5 CHANGING AND RENAMING VARIABLES

The content of an existing variable may be changed by using the replace command, whereas the name of an existing variable may be changed (its contents remain unchanged) by using the rename command.

The recode varname command changes the values of varname according to the rules specified. For example, recode x 1=2 3=4 changes 1 in x to 2 and 3 to 4, whereas recode x 1 3/5 = 6 changes 1, 3, 4 and 5 to 6.

Given a string variable named varname, the command

    encode varname, generate(newvar)

generates a new numeric variable named newvar based on varname, creating at the same time (or just using as necessary) the value label newvar. Do not use encode if

varname contains numbers that merely happen to be stored as strings (e.g., the number '1.314'). In this case use instead

    generate newvar = real(varname)

The decode command creates a new string variable named newvar based on the "encoded" numeric variable varname and its value label.

1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS

The drop command eliminates variables or observations from the data in memory. To eliminate variables use

    drop varlist

To eliminate observations use

    drop in range [if exp]

The drop _all command eliminates all variables and observations in memory. The keep command works the same as drop except that we specify the variables or observations to be kept rather than those to be deleted. The clear command essentially resets Stata and is equivalent to the set of commands:

    . version 7.0
    . drop _all
    . label drop _all         (drop all labels in memory)
    . scalar drop _all        (drop all scalar variables in memory)
    . matrix drop _all        (drop all matrices in memory)
    . eq drop _all            (drop all equations in memory)
    . constraint drop _all    (drop all constraints in memory)
    . discard                 (drop all programs in memory)

1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET

The set obs # command changes the number of observations in the current dataset to #, where # is an integer at least as large as the current number _N of observations. If there are variables in memory, the values of all new observations are set to missing. For example,

    . drop _all
    . set obs 100
    . gen x = _n

clears memory, makes 100 observations and assigns the variable x the values from 1 to 100.

1.6 OUTPUTTING DATA

Stata offers three basic commands for outputting data, corresponding to the use, infile and insheet commands discussed in Section 1.4.

The first command

    save [filename] [, options]

stores the dataset currently in memory on disk in Stata format under the name filename. If filename is not specified, the name under which the data was last known to Stata is used. If filename is specified without an extension, .dta is assumed. The available options are nolabel old replace all. The old option enables a dataset to be readable by someone with Stata 6.0, the option replace permits save to overwrite an existing dataset.

The second command

    outfile [varlist] using filename [, options]

writes data to a disk file in ASCII (text) format. The data saved by outfile can be read back by infile. If filename is specified without an extension, .raw is assumed unless the dictionary option is specified, in which case .dct is assumed.

The third command

    outsheet [varlist] using filename [, options]

writes data in tab- or comma-separated ASCII format into a file. This is the format that most spreadsheet programs prefer. If filename is specified without an extension, .out is assumed.

1.7 LOG FILES

The log command echoes a copy of a Stata session to a file or a device. More precisely:

    log using filename [, options]

opens the file filename and echoes a copy of the Stata session to the file. If filename is specified without an extension, .smcl is assumed (SMCL is Stata's output language). The available options are noproc append replace. The log close command stops logging the session and closes the file, log off temporarily stops logging the session leaving the file open, while log on resumes logging to the file.

The set log command controls the dimensions of output sent to the log. Its format is:

    set {display|log} {linesize|pagesize} #

where # is the line or page length, for example set linesize 120 or set pagesize 40.
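The commands of Sections 1.5.7, 1.6 and 1.7 can be combined into one short logged session. The following is only a sketch: the file names mylog and mydata are hypothetical, and the seed value is arbitrary.

```stata
* Sketch of a logged session (mylog and mydata are hypothetical file names)
log using mylog, replace
drop _all
set seed 12345
set obs 100
* x takes the values 1 to 100, z holds standard normal pseudo-random numbers
generate x = _n
generate z = invnorm(uniform())
* Stata-format copy (mydata.dta) and spreadsheet-style copy (mydata.out)
save mydata, replace
outsheet x z using mydata, replace
log close
```

Because no extensions are given, the log should be written to mylog.smcl, the dataset to mydata.dta and the text file to mydata.out, following the defaults described above.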

2

Stata Commands

In this chapter I describe the syntax of some frequently used Stata commands. My selection is of course subjective.

2.1 GENERAL SYNTAX

The general syntax of a Stata command is:

    command [varlist] [= exp] [weight] [if exp] [in range] [, options]

If no varlist appears, the command assumes a varlist of _all, that is, the command is applied to all the variables in the data. The option = exp specifies the value to be assigned to a variable. It is most often used with generate and replace. For example:

    . replace newvar = oldvar+2

Many commands take command-specific options. A single comma separates a command's options from the rest of the command.

Most commands can be abbreviated. For example, one may type gen or simply g instead of generate, summ or simply su instead of summarize, des or simply d instead of describe, l instead of list, etc. See the on-line help or the Reference Manual for the shortest allowable abbreviation of a command.

The F-keys may be used to create shortcuts to some commands. For example, the F3-key comes defined as describe Enter.

2.1.1 BY

Most Stata commands allow the by varlist: prefix. This causes command to be repeated for each subset of the data for which the values of the variables in varlist are equal. The use of by requires the data to be sorted by varlist beforehand. Example:

    . sort x
    . by x: summarize y

Not all commands allow the by varlist: prefix. Some replace it with by(groupvar) in the options. For example, the syntax of the ttest command is:

    ttest varname [if exp] [in range], by(groupvar) [unequal welch level(#)]

2.1.2 WEIGHTS

The option weight indicates the weight to be attached to each observation. The syntax of weight is [weightword = exp], where weightword is either weight (the default treatment of weights) or one of fweight, pweight, aweight and iweight, which correspond to the four kinds of weights that Stata understands (although not every command supports all four of them):
1. frequency weights (fweight) are integer-valued and indicate multiple observations,
2. probability or sampling weights (pweight) are inversely proportional to the sample inclusion probabilities,
3. analytic weights (aweight) are inversely proportional to the variance of an observation,
4. importance weights (iweight) indicate the relative "importance" of an observation.
The default treatment (weight) is each command's idea of what the "natural" weights are and is one of the above weight types.

2.1.3 IF AND IN

The if exp qualifier restricts the scope of the command to those observations for which the value of the expression is true. Example:

    . replace y = x+2 if x>0

The in range qualifier restricts the scope of the command to a specific observation range, where range is any of #, #/#, #/l or f/#. For example, to summarize the values of x and y for the first 10 observations:

    . summarize x y in 1/10
    . summarize x y in f/10

whereas

    . summarize x y in 10

refers to the 10th observation only.

2.1.4 QUIETLY AND NOISILY

Typing quietly command suppresses all terminal output for the duration of command. noisily command turns back on terminal output, if appropriate, for the duration of command. For example:

    . quietly by x: generate y = sum(z)
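The by prefix, the if and in qualifiers and quietly can be tried together on a small artificial dataset. This is a minimal sketch; the variable names g, y and cumy are invented for the illustration.

```stata
* Artificial data: 20 observations in two groups of 10 (names are illustrative)
drop _all
set obs 20
generate g = 1 + (_n > 10)
generate y = _n
* Sorting on g and y makes the order within each group deterministic
sort g y
* Repeat summarize separately for g==1 and g==2
by g: summarize y
* if and in combined: only observations 11 to 15 satisfy both restrictions
summarize y if g == 2 in 1/15
* Running sum of y within each group, with terminal output suppressed
quietly by g: generate cumy = sum(y)
list g y cumy in f/3
```

Sorting on both g and y is a deliberate choice here: by g: only requires the data to be sorted by g, but adding y fixes the order of the running sum within each group.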

2.2 BASIC DATA COMMANDS

I discuss nine basic commands: describe, list, drop and keep, generate and its extension egen, replace, sort and gsort.

2.2.1 DESCRIBE

This command displays a summary of the contents of either the data in memory or the data stored in a Stata-format dataset. Its syntax is

    describe [varlist] [, short detail fullnames numbers]

in the first case, and

    describe using filename [, short detail]

in the second case, where short suppresses the specific information about each variable, detail includes more detailed information (the width of a single observation, the maximum number of observations holding the number of variables constant, the maximum number of variables holding the number of observations constant, the maximum width for an observation, and the maximum size of the dataset), fullnames displays the full names of the variables (the default is to present an abbreviation when the variable name is longer than 15 characters), and numbers presents the variable number along with the variable name. The numbers and fullnames options may not both be specified together.

2.2.2 LIST

This command displays the values of variables. Its syntax is:

    list [varlist] [if exp] [in range] [, [no]display nolabel noobs]

where [no]display forces the format into display or tabular (nodisplay) format (if one of these two options is not specified, then Stata chooses one based on its judgment), nolabel causes the numeric codes rather than label values to be displayed, and noobs suppresses printing of the observation numbers. Examples:

    . list in 1/10
    . list x y
    . list x y in 1/10
    . list if x>20
    . list x y if z>20
    . list x y z if z>20 in 1/10

2.2.3 DROP AND KEEP

The drop command eliminates variables or observations from the data in memory. Its syntax is:

    drop varlist

    drop if exp
    drop in range [if exp]

The keep command works exactly the same as drop except that one specifies the variables or observations to be kept. Examples:

    . drop in 1/33      (drop first 33 observations)
    . drop in -10/l     (drop last 10 observations)
    . drop if x<21
    . keep in 34/l
    . keep if x>=21
    . sort y
    . by y: keep if _n==_N

2.2.4 GENERATE AND EGEN

The generate command creates a new variable. Its syntax is:

    generate [type] newvar[:lblname] = exp [if exp] [in range]

If type is not specified, float is the default (the default type may be changed using the set type command). To prevent Stata from returning an error when string variables are generated, type must be set to str#. If missing values are generated, the number of missing values in newvar is always reported. Examples:

    . generate x2 = x*x
    . generate bigz = z>100000 & z~=.
    . gen double w = x/y
    . gen xlag = x[_n-1]
    . gen u = uniform()             (U(0,1) pseudo-random numbers)
    . gen z = invnorm(uniform())    (N(0,1) pseudo-random numbers)

The egen command provides an extension to generate. Its syntax is:

    egen [type] newvar = fcn(stuff) [if exp] [in range] [, options]

egen creates newvar equal to fcn(stuff). Depending on fcn(), stuff refers to an expression, a list of variables, or a list of numbers. The options are similarly function dependent. Note that egen may change the sort order of the data. Important examples of egen functions include:
• count(exp) [, by(varlist)] creates a constant (within varlist) containing the number of nonmissing observations of exp.
• diff(varlist) creates an indicator variable equal to 1 where the variables in varlist are not equal and 0 otherwise. It may not be combined with by.

• group(varlist) [, missing label truncate(num)] creates a single variable taking on values 1, 2, ... for the groups formed by varlist. The label option returns integers from 1 up according to the distinct groups of varlist in sorted order. The integers are labeled with the values of varlist, or the value labels if they exist. The truncate() option truncates the values contributed to the label from each variable in varlist to the length specified by the integer argument num. It may not be combined with by.
• iqr(exp) [, by(varlist)] creates a constant (within varlist) containing the interquartile range of exp. The same syntax holds for a number of other functions with argument exp, such as kurt (coefficient of kurtosis), mad (median absolute deviation from the median), max (maximum value), mdev (mean absolute deviation from the mean), mean (mean), median (median), min (minimum value), sd (standard deviation), skew (coefficient of skewness), sum (sum).
• ma(exp) [, t(#) nomiss] creates a #-period moving average of exp. If t() is not specified, t(3) is assumed. Notice that # must be odd and exp must not produce missing values. It may not be combined with by.
• pctile(exp) [, p(#) by(varlist)] creates a constant (within varlist) containing the #-th percentile of exp. If p() is not specified, 50 is assumed, meaning medians.
• rmax(varlist) gives the maximum value in varlist for each observation (row). The same syntax holds for a number of other functions with argument varlist, such as rmean (row mean), rmin (row minimum), rmiss (row number of missing values).
• std(exp) [, mean(#) std(#)] creates the standardized value of exp using the specified mean and standard deviation. The default is mean(0) and std(1), producing a variable with zero mean and unit variance. It may not be combined with by.
Examples:

    . egen avgx = mean(x)
    . gen dev = x-avgx
    . egen x = median(x2-x1)           (expression, - means subtraction)
    . egen y = rmean(x1 x2 x3)
    . egen y = rmean(x1-x3)            (varlist, - means through)
    . egen sdx = sd(x)
    . egen stdx = std(x), mean(100) std(10)
    . egen sumx = sum(x), by(y)
    . egen xy = group(x y)

2.2.5 REPLACE

This command changes the contents of an existing variable. Its syntax is:

    replace oldvar = exp [if exp] [in range] [, nopromote]

where nopromote prevents replace from promoting the variable type to accommodate the change. Examples:

    . replace y = 25 in 1007
    . replace z=. if z<=0
    . sort z
    . by z: gen avgx = sum(x)/sum(x~=.)
    . by z: replace avgx = avgx[_N]

2.2.6 SORT AND GSORT

The sort command arranges the observations of the current data in ascending order of the values of the variables in varlist. Its syntax is

    sort varlist [in range]

There is no limit to the number of variables in varlist and each variable can be numeric or string. Missing values are interpreted as being larger than any other number and are thus placed last (there is an exception: when sorting on a string variable, null strings are placed first). The dataset is marked as being sorted by varlist unless in range is specified. Examples:

    . sort personid
    . sort lstname frstname midinitl

Unlike sort, which can produce only ascending-order arrangements, gsort may arrange the observations in either ascending or descending order. Its syntax is

    gsort [+|-]varname [[+|-]varname [...]] [, generate(newvar) mfirst]

The observations are placed in ascending order of varname if + or nothing is typed in front of the name and in descending order if - is typed. The generate(newvar) option creates newvar containing 1, 2, 3, ... for each of the groups denoted by the ordered varnames. This is useful when one wishes to use the ordering with a subsequent by. The mfirst option specifies that missing values are to be placed first in descending orderings rather than last. Examples:

    . gsort x          (same as sort x)
    . gsort +x         (same as gsort x)
    . gsort -x         (reverse sort)
    . gsort -name      (reverse alphabetical)
    . gsort x y        (ascending x, ascending y)
    . gsort x -y       (ascending x, descending y)

    . gsort -x, gen(revx)
    . quietly by revx: gen rcum = _N if _n==1
    . replace rcum = sum(rcum)
    . replace rcum = rcum/rcum[_N]

2.3 COMBINING DATA

I discuss two commands: append and merge.

2.3.1 APPEND

This command appends a Stata-format dataset stored on disk to the end of the dataset in memory. Its syntax is:

    append using filename [, nolabel]

where nolabel prevents copying the value label definitions from the disk dataset. Even if this option is not specified, label definitions from the disk dataset never replace definitions already in memory. If filename is specified without an extension, .dta is assumed.

2.3.2 MERGE

This command joins corresponding observations from the dataset currently in memory (called the master dataset) with those from the Stata-format dataset stored as filename (called the using dataset) into single observations (if filename is specified without an extension, .dta is assumed). It can perform both one-to-one and match merges. Its syntax is:

    merge [varlist] using filename [, nolabel update replace nokeep _merge(varname)]

where nokeep causes merge to ignore observations in the using data that have no corresponding observation in the master (the default is to add these observations to the merged result and mark them with _merge==2) and _merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _merge(_merge), which adds a new variable _merge to the data whose values are:

    _merge==1    (obs. from master data)
    _merge==2    (obs. from using data)
    _merge==3    (obs. from both master and using data)

Examples:

    . use data1          (one-to-one merge)
    . merge using data2
    . tab _merge
    . use data2          (match merge)
    . sort x

    . save data2, replace
    . use data1
    . sort x
    . merge x using data2
    . tab _merge

2.4 RESHAPING DATA

I discuss five commands: collapse, contract, expand, fillin and reshape.

2.4.1 COLLAPSE

This command replaces the data in memory with a new dataset consisting of the means, medians, etc., of the specified variables. Its syntax is:

    collapse clist [weight] [if exp] [in range] [, by(varlist) cw fast]

where clist is either

    [(stat)] varlist [[(stat)] ...]
    [(stat)] target_var=varname [target_var=varname ...] [[(stat)] ...]

or any combination of the varlist or target_var forms, and stat is one of the following: mean (means), sd (standard deviations), sum (sums), rawsum (sums ignoring optionally specified weights), count (number of nonmissing observations), max (maxima), min (minima), median (medians), p# (#th percentile), iqr (interquartile range). If stat is not specified, mean is assumed. The by(varlist) option specifies the groups over which the means, etc., are to be calculated, cw specifies casewise deletion (if not specified, all observations possible are used for each calculated statistic) and fast specifies that collapse not go to extra work so that it can restore the original data should the user press Break.

2.4.2 CONTRACT

This command makes datasets of frequencies. It replaces the data in memory with a new dataset consisting of all combinations of varlist that exist in the data together with a new variable that contains the frequency of each combination. Its syntax is:

    contract varlist [weight] [if exp] [in range] [, freq(varname) zero nomiss]

where freq(varname) specifies a name for the frequency variable (if not specified, _freq is used; the name must be new), zero specifies that combinations with frequency zero are wanted, and nomiss specifies that observations with missing values on any of the variables in varlist will be dropped (if not specified, all observations possible are used).
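To see what collapse and contract do to the data in memory, it may help to run them on a tiny artificial dataset. The sketch below uses invented variable names (g, h, y, avgy, n) and relies on preserve and restore to bring the original data back between the two commands.

```stata
* Six artificial observations (all names are illustrative)
drop _all
set obs 6
generate g = 1 + (_n > 3)
generate h = mod(_n, 2)
generate y = _n

* collapse: one observation per value of g, holding the mean and count of y
preserve
collapse (mean) avgy=y (count) n=y, by(g)
list
restore

* contract: one observation per existing (g,h) combination, with its frequency
contract g h
list
```

After collapse the data should contain two observations (one per group), and after contract four observations, one for each combination of g and h, with the frequencies stored in _freq.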

2.4.3 EXPAND

This command replaces each observation in the current dataset with n copies of the observation, where n is equal to the integer part of the required expression (if the expression is less than one or equal to missing, then it is interpreted as if it were one, and the observation is retained but not duplicated). Its syntax is:

    expand [=]exp [if exp] [in range]

Example:

    . expand 2

2.4.4 FILLIN

This command rectangularizes a dataset by adding observations with missing data so that all interactions of the variables in varlist exist. It also adds the variable _fillin to the data (with value 1 for created observations and 0 for previously existing observations). Its syntax is:

    fillin varlist

2.4.5 RESHAPE

This command converts data from wide to long form and vice versa. Its basic syntax is:

    reshape wide varnames, i(varlist) [j(varname) string]
    reshape long varnames, i(varlist) [j(varname) string]

where i(varlist) specifies the variable(s) whose unique values denote a logical observation, j(varname) specifies the variable whose unique values denote a subobservation, and string specifies that the j() may contain string values. Examples:

    . reshape long x, i(y) j(z)            (converts from wide to long)
    . reshape wide                         (converts back to wide)
    . reshape ..., i(z)                    (single i() variable)
    . reshape long x1 x2, i(z1 z2)         (two i() variables)
    . reshape long x, i(y) j(z 1-3 5)      (specifying j() values)
    . reshape ..., i(y) j(z) string        (allow string variables in j())

2.5 BASIC SAMPLE STATISTICS

I discuss seven commands: count, summarize, means, centile, cumul, correlate and regress.

2.5.1 COUNT

This command counts observations satisfying the specified conditions. Its syntax is:

    count [if exp] [in range]

If no condition is specified, count displays the number of observations in the dataset. Examples:

    . count if y<0
    . by x: count if y<0

2.5.2 SUMMARIZE

This command reports a variety of univariate summary statistics. Its syntax is:

    summarize [varlist] [weight] [if exp] [in range] [, {detail|meanonly} format]

where detail produces additional statistics (including skewness, kurtosis, the four smallest and four largest values, along with various percentiles), meanonly suppresses display of the results and calculation of the variance (it is allowed only when detail is not specified) and format requests that the summary statistics be displayed using the display format associated with the variables rather than the default g format.

2.5.3 MEANS

This command reports the arithmetic, geometric, and harmonic means, along with their respective confidence intervals, for the specified variables. Its syntax is:

    means [varlist] [if exp] [in range] [, add(#) only level(#)]

where add(#) adds the value # to each variable in varlist before computing the means and confidence intervals (this may be useful when analyzing variables with nonpositive values), only modifies the action of the add() option (if specified, the add() option only adds # to variables with at least one nonpositive value) and level(#) specifies the percentage confidence level for confidence intervals. The ci command may be used if one simply wants arithmetic means and corresponding confidence intervals.

2.5.4 CENTILE

This command reports the (per)centiles of the specified variables and their confidence intervals. The syntax is:

    centile [varlist] [if exp] [in range] [, centile(numlist) cci normal meansd level(#)]

where centile(numlist) specifies the centiles to be reported, for example centile(25 50 75) (if not specified, medians are reported). By default, confidence intervals are obtained using a binomial method that makes no assumptions as to the underlying distribution of the variable. cci (conservative confidence interval) prevents centile from interpolating when calculating the distribution-free (binomial-based) confidence limits, normal specifies that confidence intervals are to be obtained assuming that both the data and the centiles are normally distributed, meansd

calculates confidence intervals assuming that the estimated centiles themselves are normally distributed. The related command pctile newvar = exp creates a new variable containing the percentiles of exp, where exp is typically just another variable.

2.5.5 CUMUL

This command creates a new variable containing the empirical distribution function (edf) of a variable. Its syntax is:

    cumul varname [weight] [if exp] [in range], gen(newvar) [freq by(varlist)]

where gen(newvar) specifies the name of the new variable to be created (it is not optional), freq requests the edf to be in frequency units, otherwise it is normalized so that newvar is 1 for the largest value of varname, and by(varlist) specifies that edf's be generated separately for each by-group.

2.5.6 CORRELATE

This command reports the covariance or correlation matrix of the specified variables. The syntax is:

    correlate [varlist] [weight] [if exp] [in range] [, means noformat covariance wrap]

where means causes summary statistics (means, standard deviations, minima and maxima) to be displayed along with the matrix, noformat displays the summary statistic requested by the means option in g format regardless of the display formats associated with the variables, covariance displays the covariances rather than the correlation coefficients, and wrap requests that no action be taken on wide matrices to make them readable. Observations are excluded from the calculation due to missing values on a casewise basis.

2.5.7 REGRESS

This command estimates linear regression models with a single response or dependent variable. Estimation is carried out by least squares (either ordinary least squares or weighted least squares). Its basic syntax is:

    regress yvar [xvars] [weight] [if exp] [in range] [, level(#) noconstant regress_options]

where level(#) specifies the confidence level (in percent) for the regression parameters (the default is 95%), noconstant suppresses the constant term (intercept) in the regression, and the additional regress_options are described in more detail in Section 6.
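The commands of this section can be tried on simulated data, generated with the invnorm(uniform()) device of Chapter 1. The following is a sketch only: the variable names, the seed and the coefficients 1 and 2 are arbitrary choices made for the illustration.

```stata
* Simulated data: y = 1 + 2x + noise (all names and values are illustrative)
drop _all
set seed 12345
set obs 200
generate x = invnorm(uniform())
generate y = 1 + 2*x + invnorm(uniform())

count if y < 0
summarize y x, detail
correlate y x, means

* Empirical distribution function of x, normalized to end at 1
cumul x, gen(edfx)
summarize edfx

* Least squares with the default options, then without the intercept
* and with 90% confidence intervals
regress y x
regress y x, noconstant level(90)
```

With 200 observations the estimated slope should be close to 2 in both regressions, although the exact figures depend on the seed.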

2.6 TABLES

Stata offers three basic commands for producing tables: table, tabulate and tabulate, summarize.

2.6.1 TABLE

This command provides tables of summary statistics. Its syntax is a little involved:

    table rowvar [colvar [supercolvar]] [weight] [if exp] [in range] [, contents(clist) by(superrow_varlist) cw row col scol format(%fmt) center left concise missing replace name(string) cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#)]

where contents(clist) specifies the content of the table's cells (up to 5 statistics may be specified, clist is as in collapse, if contents() is not specified it is assumed to be contents(freq)), row specifies a row is to be added to the table reflecting the total across rows, col specifies a column is to be added to the table reflecting the total across columns, format(%fmt) specifies the display format for presenting numbers in the table's cells, center specifies results are to be centered in the table's cells (the default is to right align), left specifies that column labels are to be left aligned (the default is to right align), missing specifies that missing statistics are to be shown in the table as periods (the default is to leave them blank). See the on-line help or the Reference Manual for a description of the other options.

2.6.2 TABULATE

This command provides one- and two-way tables of frequency counts along with various measures of association, including the common Pearson chi-squared, the likelihood ratio chi-squared, Cramer's V, Fisher's exact test, Goodman and Kruskal's gamma, and Kendall's tau-b. The syntax for one-way tables is:

    tabulate varname [weight] [if exp] [in range] [, generate(varname) matcell(matname) matrow(matname) missing nofreq nolabel plot subpop(varname)]

The syntax for two-way tables is:

    tabulate varname1 varname2 [weight] [if exp] [in range] [, all cell chi2 column exact gamma lrchi2 matcell(matname) matcol(matname) matrow(matname) missing nofreq nolabel row taub V wrap]

where all is equivalent to specifying chi2 lrchi2 V gamma taub, cell displays the relative frequency of each cell in a two-way table, chi2 calculates and displays Pearson's chi-squared for the hypothesis that the rows and columns in a two-way table are independent, column displays in each cell of a two-way table the relative frequency of that cell within its column, exact displays the significance calculated by Fisher's exact test, gamma displays Goodman and Kruskal's gamma along with

its asymptotic standard error, generate(varname) creates a set of indicator variables reflecting the observed values of the tabulated variable, lrchi2 displays the likelihood-ratio chi-squared statistic (the request is ignored if any cell of the table contains no observations), matcell(matname) saves the reported frequencies in the matrix matname, matcol(matname) saves the numeric values of the column stub in the vector matname, matrow(matname) saves the numeric values of the row stub in the vector matname, missing requests that missing values of varname1 and varname2 be treated as categories rather than as observations to be omitted from analysis, nofreq suppresses printing the frequencies, nolabel causes the numeric codes to be displayed rather than the value labels, plot produces a bar chart of the relative frequencies in a one-way table, replace indicates that the immediate data specified as arguments to the command are to be left as the current data in place of whatever data was there, row displays in each cell of a two-way table the relative frequency of that cell within its row, subpop(varname) excludes observations for which varname = 0 in tabulating frequencies, taub displays Kendall's tau-b along with its asymptotic standard error, and V (note capitalization) displays Cramer's V.

2.6.3 TABSUM

The tabulate, summarize() command produces one- and two-way tables of summary statistics. Although table is better, tabulate, summarize() is faster. Its syntax is:

    tabulate varname1 [varname2] [weight] [if exp] [in range], summarize(varname3) [[no]means [no]standard [no]freq [no]obs wrap nolabel missing]

where summarize(varname3) identifies the name of the variable for which summary statistics are to be reported (if this option is not specified, then a table of frequencies is produced), [no]means includes only or suppresses only the means from the table (the summarize() table normally includes the mean, standard deviation, frequency, and, if the data is weighted, the number of observations), [no]standard includes only or suppresses only the standard deviations from the table, [no]freq includes only or suppresses only the frequencies from the table, [no]obs includes only or suppresses only the reported number of observations from the table, missing requests that missing values be treated like other values in calculations of counts, percentages, and other statistics. Examples:

    . tabulate y, summarize(z)                        (one-way table)
    . tabulate y1 y2, summarize(z)                    (two-way table)
    . sort x                                          (n-way table)
    . by x: tabulate y1 y2, summarize(z) means nofreq
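The tabulate variants described in this section can be compared on artificial data. This is a minimal sketch, with the invented variables a (two categories), b (three categories) and z (a numeric variable to be summarized).

```stata
* Artificial categorical data (all names are illustrative)
drop _all
set obs 100
generate a = mod(_n, 2)
generate b = mod(_n, 3)
generate z = _n

* One-way table of frequencies
tabulate a
* Two-way table with row percentages and Pearson's chi-squared
tabulate a b, row chi2
* One-way and two-way tables of summary statistics of z
tabulate a, summarize(z)
tabulate a b, summarize(z) means nofreq
```

The last command shows only the cell means of z, since means restricts the display to the means and nofreq suppresses the frequencies.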


3

Graphics

Stata graphics are not very fancy, but allow considerable flexibility and are relatively simple to use.

3.1 BASIC SYNTAX AND GRAPHIC STYLES

The basic syntax of the graph command is:

    graph [varlist] [weight] [if exp] [in range] [, options]

Typed without arguments, graph recalculates and redisplays the last graph. The syntax to review a saved Stata graph is:

    graph using filename [filename] [, options]

An existing Stata graph can be translated to another format (e.g. PostScript) using the translate command and printed using the print command. Examples:

    . translate mygraph.gph mygraph.eps    (converts to Encapsulated PostScript)
    . translate mygraph.gph mygraph.prn    (converts to printer format)
    . translate mygraph.gph mygraph.wmf    (converts to Windows metafile)
    . print mygraph.gph                    (print mygraph)
    . print @Graph                         (print the graph in the Graph window)

Notice that graphs may also be produced by other Stata commands such as kdensity (nonparametric density estimation), ksm (regression smoothers) and logistic (logistic regression diagnostic plot).

Stata offers eight basic graph styles:
1. histogram,
2. twoway (two-way scatterplots),
3. matrix (two-way scatterplot matrices),
4. oneway (one-way scatterplots),
5. box (box plots),

6. star (star charts),
7. bar (bar charts),
8. pie (pie charts).

After discussing some options of the graph command that are common across all styles (Section 3.2), I shall focus on the first four styles.

3.2 COMMON GRAPH OPTIONS

In this section I briefly discuss some of the general options of the graph command. I will henceforth refer to these options as common_options.

• Saving a graph to disk: saving(filename [, replace]). If an extension is not specified, .gph is assumed.
• Printing a graph: after the graph command, use the Print button in the Stata toolbar.
• Multiple-imaging options: by(varname) is allowed for all styles except matrix and star. It requests that graphs be drawn separately for the groups defined by varname and be combined into a single image.
• Specifying titles: graph allows up to two titles on every side of the graph (top, bottom, left and right), denoted by the options t1, t2, b1 (same as title or ti), b2, l1, l2, r1, and r2. The first title (e.g. t1) is always the farther from the figure, the second (e.g. t2) is the closest. The argument of each option is some text enclosed in quotes. Quotes can be omitted if the text contains no special character. Example:

. graph y x, l1(y) b2(x) title("Figure 1: x-y scatterplot")

• Setting the gap: gap(#) sets the amount of space between the left title and the values along the y-axis. The default is gap(8).
• Labeling axes: by default, graph labels just the minimum and maximum of each variable. More aesthetically pleasing results may be obtained with the options {x|y|r|t}label[(#,...,#)]. Typed without arguments, {x|y|r|t}label chooses "round" values to be labelled.
• Adding ticks: graph automatically places tick marks on axes anywhere they are labelled. Additional ticking may be obtained with the options {x|y|r|t}tick[(#,...,#)].
• Adding lines: lines across the graph may be drawn with the options {x|y|r|t}line[(#,...,#)]. The yline and rline options draw horizontal lines, xline and tline draw vertical lines.

Example:

. graph y x, gap(4) xlabel ylabel(0,1) ytick(.5) yline(0,.5,1)

• Setting the scale: by default, graph scales each axis according to the minimum and maximum of all things that go on the axis (data, labeling or ticking). The options {x|y|r}scale(#,#) may be used to widen (but never to narrow) the scale used for drawing a graph on any style that has an axis.
• Creating log scales: the log option is used with the histogram style, the {x|y|r}log options with the twoway style.
• Setting the axes rendition: by default, graph draws an axis on any style that has an axis. The border option replaces axes with borders. The noaxis option suppresses both axes and borders.
• Plotting symbols: graph uses the following plotting symbols to specify the location of a point on a scatterplot: O (large circle, default for twoway), S (large square), T (large triangle), o (small circle, default for twoway with by, or matrix), d (small diamond), p (small plus), . (dot), i (invisible), [varname] (variable to be used as text), _n (observation number). The sequence of plotting symbols for the variables in varlist is specified with the option symbol(s...s), where s is any of the above symbols. By default, if symbol is not specified, twoway chooses the symbols O, S, T, d, p, and the remainder . (dot). Combined with by(), it chooses instead o, and the remainder . (dot).
• Connecting points: graph offers the following alternatives to connect points on a scatterplot: . (do not connect, default), l (straight lines between points), L (straight line between ascending x-points), m (connect median bands using straight lines), s (connect median bands using cubic splines), J (connect rectilinearly making steps), || (connect two variables vertically (high-low)), II (same as || but cap bottom and top of line). How the variables in varlist are connected is specified by the option connect(s...s), where s is any of the above alternatives. If connect is not specified, points are not connected. The connect option connects points in the order of the data, not the order of the x-axis. graph includes a sort option that automatically sorts the data according to the x-axis before graphing.
• Line patterns: for each line type, one can specify the pattern of the line by adding a [pattern] after the line type, where pattern is any combination of the following: l (a solid line, the default), _ (a long dash), - (a medium dash), . (a short dash, almost a dot) and # (a space). Example:

. graph y1 y2 y3 x, symbol(.) connect(ll[_]l[-])

The set textsize # command controls the size of the text used in a graph. This is not an option but a separate command that must be issued before graph. I refer to the Graphics Manual for other common options and to the remainder of this chapter for options specific to the various graph styles.

3.3 HISTOGRAMS

This is the default for graph when only one variable is specified. The basic syntax is:

graph [variable] [weight] [if exp] [in range], histogram [common_options bin(#) {freq|percent} normal[(#,#)] density(#)]

where bin(#) specifies the number of (equally spaced) bins to use for constructing the histogram (the default is bin(5)), freq and percent affect how the vertical axis is labeled (respectively, in frequency units and in percent), normal[(#,#)] overlays a normal density with specified mean and standard deviation (normal by itself uses the observed mean and standard deviation), and density(#) (only used with normal) specifies the number of points along the density to be calculated (the default is density(100)). Examples:

. graph x (draws a histogram of x)
. graph x, bin(15) (uses 15 bins for histogram)
. graph x, normal(10,3) (overdraws a normal density with mean 10 and std. dev. 3)

3.4 TWO-WAY SCATTERPLOTS

This is the default for graph when more than one variable is specified. The basic syntax is:

graph [yvars xvar] [weight] [if exp] [in range], twoway [common_options rescale rbox {y|x|r}reverse]

where rescale scales each y-variable independently (if there are two y-variables, the scale of the first is presented on the left axis and the scale for the second on the right axis; if there are more than two y-variables, no vertical scale is labeled), rbox places a rangefinder box plot on the graph, and {y|x|r}reverse reverses the indicated scale to run from high-to-low. twoway may be combined with oneway or box, but in that case, one must specify twoway explicitly. Examples:

. graph y x (graph of y against x)
. graph z y x, rescale (graph of z and y against x)

. graph y x, twoway box by(z) (graphs y against x by z)

3.5 TWO-WAY SCATTERPLOT MATRICES

A two-way scatterplot matrix is a set of two-way scatterplots arranged in a matrix. The basic syntax is:

graph [varlist] [weight] [if exp] [in range], matrix [common_options half]

where half draws only the lower half of the matrix. Example:

. graph x y z if z>0, matrix

3.6 BOX PLOTS

A box plot is a graphical procedure with the following features: (i) it combines a measure of location (the median) and a measure of spread (the interquartile range), (ii) it shows the presence of possible outliers, and (iii) it provides some indication about the shape of the distribution of the data in terms of their symmetry or skewness. The basic syntax is:

graph [varlist] [weight] [if exp] [in range], box [common_options [no]alt vwidth root]

where [no]alt forces the labeling of the groups to be on a single line (noalt) or multiple lines, vwidth makes the width of the box proportional to the number of observations, and root (only used with vwidth) makes the width of the box proportional to the square root of the number of observations. Examples:

. graph y x, box (graphs box-and-whiskers for y and x)
. graph y, box by(z) (graphs box-and-whiskers for y by z groups)
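The common options of Section 3.2 can be combined freely with the style-specific options of this chapter. The following is only a sketch, assuming the graph syntax described in these notes and a dataset containing the hypothetical variables y, x and z; the file name yx is likewise made up for illustration.

```stata
* y, x and z are hypothetical variables in the current dataset
sort x
* connected scatterplot with round axis labels, saved to yx.gph
graph y x, twoway symbol(o) connect(l) xlabel ylabel saving(yx, replace)
* box plots of y for each group defined by z
sort z
graph y, box by(z)
```

Sorting by x before the first command makes connect(l) join the points from left to right, and sorting by z before the second follows the practice used elsewhere in these notes when by() is specified.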


4 Programming and Matrix Commands

In this note I discuss the elements of Stata programming and the Stata matrix language.

4.1 PROGRAMMING STATA

The capabilities of Stata may be extended considerably by using programs. A Stata program is just a sequence of Stata commands enclosed between the commands program define progname and end. Thus, the general structure of a Stata program is

program define progname
    Stata commands
end

Programs must be defined (loaded in memory) before they can be used. The simplest way to do so is to type the commands directly from the keyboard. This is not recommended, however, unless the program is short. Alternative ways of defining programs are described in Sections 4.2 and 4.3. Programs are executed by typing progname. Displaying of the underlying commands is suppressed. Programs may call other programs. Stata allows programs to be nested 32 deep.

4.1.1 MACROS

A macro is a user-defined string of characters, called the macro name, that stands for another string of characters, called the macro content. Stata has two types of macros, local and global. Local macros are private, that is, specific to the program where they are defined; global macros are public. Their content is set respectively by the local and global commands. Their general syntax is

{local|global} mname [[`]"[string]"[']|= exp|: extended_fcn]

where the macro name mname can be up to 7 characters long for local macros and up to 8 characters for global macros, and exp may be either a numeric or a string expression. For the use of an extended macro function see the Stata manual.

To copy string to mname (the maximum length of string is 18,623 characters) use:

{local|global} mname "string"

To evaluate exp and store the result in mname (the maximum length of exp is 80 characters) use:

{local|global} mname = exp

Macros can be used everywhere in programs, do-files and ado-files (see below). The content of a local macro is accessed by enclosing the macro in `', that of a global macro by prefixing it with a $. This simply replaces the name of the macro with its content. Examples:

local options "gap(4) sy(i) xlab(10) ylin"
graph x y, `options'
sort z
global options "by(z) gap(4) sy(.) c(l) xlab ylab"
graph x y, $options

If a macro contains double quotes, compound double quotes `""' may be used to define a macro. Typing macro drop mname eliminates the global macro mname. Typing macro drop _all eliminates all global macros.

4.1.2 SYSTEM MACROS

In addition to user-defined macros, Stata has a number of built-in global system macros that begin with the characters S_. Examples:

$S_DATE : contains the current date in the format dd mon yyyy
$S_TIME : contains the current time in the format hh:mm:ss
$S_FN : contains the filename last specified with use or save

User-written programs may examine and change the content of system macros, except that the content of system macros such as S_DATE and S_TIME cannot be changed. Typing macro drop _all does not eliminate system macros.

4.1.3 LOOPING

Stata provides two commands for looping, while and forvalues. The syntax of while is simpler:

while exp {
    Stata commands
}

This command evaluates exp and, if it is true (nonzero), executes the commands enclosed in the braces. It then repeats the process until exp evaluates to false (zero). whiles may be nested within whiles. If exp refers to any variables, their values in the first observation are used unless explicit subscripts are specified. Example: The following code fragment may be used to iterate Stata commands 10 times

local i = 1
local I = 10
while `i'<=`I' {
    Stata commands
    local i = `i'+1
}

The while command may also be used interactively.

4.1.4 BRANCHING

The syntax of this programming command is:

if exp {
    Stata commands
}
else {
    other Stata commands
}

This command evaluates exp. If the result is true (nonzero), the commands inside the braces are executed. If the result is false (zero), those statements are ignored and the statements following the else, if specified, are executed. Do not confuse this command with the if qualifier at the end of a command. Example:

if x>0 {
    replace y = log(x)
}
else if x<0 {
    replace y = log(-x)
}
else {
    replace y = .
}

4.1.5 PROGRAM ARGUMENTS

Programs may take arguments, just like functions. Unlike functions, however, the arguments of a program are not enclosed in parentheses but simply follow the program name.

For example, if prog1 is a program and we type prog1 x y then x and y are the program's arguments, respectively the first and the second argument. Arguments are passed to programs via local macros: `0', `1', `2', `3', etc., where the local macro `0' is exactly what the user typed, `1' is the first argument of the program, `2' the second argument, etc. The positional macros `1', `2', ..., may be renamed to facilitate reading and understanding of a program. Thus, for example, the following two programs both produce a sequence of n equally spaced values on the interval [a, b]:

program define prog1
    drop _all
    set obs `1'
    generate x = `2'+(_n-1)/(_N-1)*(`3'-`2')
end

program define prog2
    args n a b
    drop _all
    set obs `n'
    generate x = `a'+(_n-1)/(_N-1)*(`b'-`a')
end

Sometimes programs involve a variable number of arguments, with the same thing done to each argument. An example is the summarize command, which may be applied to one, two or more variables. Programs with this feature may be coded by shifting through their arguments

program define myprog
    while "`1'" ~= "" {
        Stata commands in terms of `1'
        macro shift
    }
end

where macro shift shifts `1', `2', `3', etc., one to the left: what was `1' disappears, what was `2' becomes `1', what was `3' becomes `2', etc. The while loop continues the process until macro `1' is empty. An alternative is the following:

program define myprog
    local i = 1
    while "``i''" ~= "" {
        Stata commands in terms of ``i''

        local i = `i'+1
    }
end

4.1.6 TEMPORARY OBJECTS

Programs often require objects (variables, scalars, matrices, data, etc.) that are temporary, that is, can be discarded once the program completes. Stata provides three commands to deal with this: tempvar creates names for temporary variables, tempname creates names for temporary scalars and matrices, and tempfile creates names for temporary files. They all have the same syntax:

{tempvar|tempname|tempfile} mname [mname ...]

The command creates local macros containing names one may use. Example:

. tempvar x y
. gen `x' = exp
. gen `y' = exp
. ...
. drop `x' `y'

The drop `x' `y' command is not necessary when the program completes, because Stata automatically drops any variables with names assigned by tempvar.

4.1.7 EXCHANGING RESULTS BETWEEN PROGRAMS

Stata commands that report results save them in places where they can be subsequently used by other commands or programs. Most commonly, commands save results in one of two places:
1. r-class commands (such as summarize) save their results in r().
2. e-class (estimation) commands (such as regress) save their results in e().

Typing return list after an r-class command or estimates list after an e-class (estimation class) command summarizes what the command saved. Results saved in r() and e() come in three flavors: scalars, macros and matrices. For example, the number of observations used by a command is saved in the scalars r(N) or e(N). After an e-class command, the command name and the name of the response (dependent) variable are saved in the macros e(cmd) and e(depvar), whereas the estimated coefficients and their variance matrix are saved in the matrices e(b) and e(V). Regardless of their flavor, one may refer to saved results in two ways. One is simply to type r(name) or e(name). The other is to use macro substitution characters to produce `r(name)' or `e(name)'.

Example: After regress

. display "You can refer to " e(cmd) " or to `e(cmd)'"
You can refer to regress or to regress

Notice that after running an r-class command, running another one would change the content of r() but not the content of e(). On the other hand, running a new e-class command may change the content of both e() and r(). Thus, if one wants to access the results produced by a command, it is important to do so immediately.

User-defined programs may save their results if their class is specified on the program define line through the option rclass or eclass, depending on whether the program is intended to be r- or e-class. The code to save results in r() is

return scalar name = exp
return local name ...
return matrix name matname

while the code to save results in e() is

estimates scalar name = exp
estimates local name ...
estimates matrix name matname

4.2 DO FILES

A do-file is a standard ASCII (text) file containing a sequence of Stata commands, a separate command on each line. A do-file may be used to define one or more programs or may call programs already defined. Do-files may also call other do-files. As for programs, Stata allows do-files to be nested 32 deep. The sequence of commands is executed using the do or run commands, whose syntax is

{do|run} filename [arguments] [, nostop]

where nostop allows the do-file to continue executing even if an error occurs. If filename is specified without an extension, .do is assumed. The difference between do and run is that do echoes the commands and their output, while run is silent. A do-file completes the execution when: (i) the end of the file is reached, (ii) an exit is executed, or (iii) an error (nonzero return code) occurs (pressing Break while executing a do-file causes a nonzero return code and therefore stops the do-file).

Here are some rules and recommendations for constructing a do-file.

• Start a do-file by typing version #, where # is the Stata release under which the file was written. This allows the do-file to run under later releases.
• Blank lines and comments may be included freely. Their proper use may considerably enhance understanding of a program.

• Comments may be included either by beginning a line with a '*' (a star), or by placing the comment in /* */ delimiters. The /* */ delimiters can be put anywhere, at the end of a line or even in the middle. Example:

version 7.0 /* do-file written under Stata 7.0 */
* read in the data
use mydata, clear
* summary statistics
summarize x y z, detail

• To avoid lines wider than the screen, the end-of-line delimiter may be changed from carriage return to, say, ';' by including the #delimit ; command. The delimiter may later be changed back to carriage return by including the #delimit cr command.
• To prevent Stata from pausing when the screen is full, include the set more off command.
• The output of a do-file may be sent to a log file by including the command log using filename [, replace]. Logging stops and the log file is closed when log close is encountered. Output to the log file is suppressed if run is used to execute a do-file.

Do-files accept arguments, just like programs. Arguments are stored in local macros `1', `2', and so on. For example, to repeat the same set of instructions for different variables one could write the do-file try.do

use mydata, clear
drop if `1'==.
summarize `2', detail

and then execute it by typing

. do try x y

The second command (drop if `1'==.) would be interpreted as drop if x==. because x is the first argument typed after do try, the third command (summarize `2') would be interpreted as summarize y because y is the second argument typed after do try.

4.3 ADO FILES

An ado-file defines a Stata command, although many commands (e.g. summarize or regress) are not defined by ado-files but are built directly into Stata. An ado-file is an ASCII (text) file that contains a Stata program which defines (implements) a command. For example, the ci command produces confidence intervals and is implemented as an ado-file. This means that a file called ci.ado is stored in some directory that Stata can access.
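To make this concrete, here is a minimal sketch of what a user-written ado-file might contain. The command name mymean and the file name mymean.ado are hypothetical and chosen only for illustration; the sketch assumes the Stata 7 syntax used in these notes (args, the rclass option and return scalar).

```stata
*! mymean.ado: hypothetical example, not part of official Stata
program define mymean, rclass
    version 7.0
    args v                          /* `v' holds the first argument */
    quietly summarize `v'
    display "Mean of `v' = " r(mean)
    return scalar mean = r(mean)    /* leave results behind in r() */
    return scalar N = r(N)
end
```

If this file is saved in a directory that Stata searches, typing mymean x would run the program on the variable x, and the results could then be inspected with return list or used in expressions as r(mean) and r(N).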

Ado-files typically come with an associated help-file. Typing help ci (or pulling down Help and searching for ci) prompts Stata to look for the file ci.hlp, just as it does for the file ci.ado after the command ci is typed. Stata looks for ado-files (and the associated help-files) in several places: the official ado-directories (the base directory and the updates directory), the personal ado-directories, and the current directory.

4.4 MATRIX COMMANDS

A Stata matrix is a rectangular array of double-precision numbers, none of which can be missing, and which is bordered by a row and a column of names. A vector is a special case of a matrix. Although Stata has scalars, they could also be handled as special cases of a matrix. Matrices can be used interactively or in programs, do-files and ado-files. By default, the maximum matrix size is 40 × 40. The maximum matrix size can be increased to 800 × 800 by issuing the command set matsize 800. Thus, Stata matrices are unsuited for holding large amounts of data.

4.4.1 ROW AND COLUMN NAMES

Stata matrices always have row and column names. These names are used to produce "pretty" output. Row and column names have three parts: equation_name:ts_operator.subname. The first two parts may be blank. Row and column names may be reset using the matrix rownames and matrix colnames commands. The matrix list command displays a matrix with its row and column names.

4.4.2 SUBSCRIPTING AND SUBMATRICES

The basic syntax for subscripting is

matrix A = ...B[r,c]...

where r and c are numeric or string scalar expressions. Examples:

. matrix A = A/A[1,1]
. matrix B = A["weight","displ"]
. matrix D = G[1,"eq1:l1.gnp"]

The basic syntax for extracting submatrices is

matrix A = ...B[r0..r1,c0..c1]...

where r0, r1, c0, and c1 are numeric or string scalar expressions. Examples:

Examples: . "eq1:"] 43 The basic syntax for substituting submatrices is matrix A[r.. colnumb("mpg")] = sqrt(2) 4.. • B \ C (adds the rows of C below the rows of B). • det(B) (determinant of a square matrix). 3. • B / z (division by a scalar z). the indicated element of A is replaced.4. • B + C (addition). Examples ."mpg"] B["eq1:". . .2] = B .. matrix A[rownumb("price"). • trace(B) (trace of a square matrix). where r and c are numeric scalar expressions. • B .. including multiplication by a scalar).6] B[2. • B .C) (relative diﬀerence). • B * C (multiplication..c] = . matrix A[2. "price".3 MATRIX OPERATORS AND FUNCTIONS The matrix operators are: • -B (negation). • B’ (transposition). matrix C = (B + B’)/2 The matrix functions returning scalar are: • mreldif(B. Parentheses may be used to enforce a particular order of evaluation.. matrix matrix matrix matrix A A A A = = = = B[2.. C (adds the columns of C to the right of the columns of B). • B # C (Kronecker product). . the resulting matrix is placed in A with its upper left corner at (r.C (subtraction). If the matrix expression to the right of the equal sign evaluates to a scalar or a 1 × 1. c)...4.] B[1..PROGRAMMING AND MATRIX COMMANDS . . If the matrix expression evaluates to a matrix. 2.

• diag0cnt(B) (number of zeros on diagonal),
• rowsof(B) (number of rows),
• colsof(B) (number of columns),
• rownumb(A,s) (the first row number named s, where s is a string or string expression),
• colnumb(A,s) (the first column number named s, where s is a string or string expression),
• el(A,i,j) (the i, j element of A; this function is the same as A[i,j]).

Matrix functions returning scalar may be used in any expression context, not just matrix expression contexts.

The matrix functions returning matrix are:
• I(n) (n × n identity matrix),
• J(n,m,z) (n × m matrix containing the constant element z),
• cholesky(B) (Cholesky decomposition of a symmetric matrix),
• corr(B) (correlation transform),
• diag(V) (V is a row or column n-vector; it returns a diagonal n × n matrix with diagonal elements equal to those of V),
• get(mname) (returns the system matrix mname),
• inv(B) (inverse of a square matrix),
• sweep(B,j) (sweep of a square matrix; returns B with jth row/column swept),
• syminv(B) (inverse of a symmetric matrix; if B is not positive definite, returns a generalized inverse),
• vecdiag(B) (returns a row vector containing the diagonal of a square matrix).

Examples:

. display trace(X)
. matrix beta = syminv(X'*X)*X'*y
. matrix L = cholesky(0.1*I(rowsof(X)) + 0.9*X)

There are matrix utilities to list the currently defined matrices (matrix dir), display the contents of a matrix (matrix list), rename a matrix (matrix rename) and drop a matrix (matrix drop). The matrix drop _all command drops all matrices.

4.4.4 CROSS-PRODUCT MATRICES

Statistical computations often involve matrix operations such as $X^\top X$ or $X^\top W X$. In these cases, X usually has a large number of rows and a small to moderate number of columns, whereas W takes on a restricted form (diagonal, block diagonal, or is known in some functional form and need not be stored). Computing $X^\top X$ or $X^\top W X$ by storing the matrices and then directly performing the matrix multiplications is inefficient and wasteful. Stata has a number of commands to compute these results efficiently. The matrix accum command accumulates cross-product matrices from the data to form $A = X^\top X$. The matrix glsaccum command accumulates cross-product matrices from the data using a specified inner weight matrix to form $A = X^\top B X$, where B is a block diagonal matrix. The matrix vecaccum command accumulates the first variable against the remaining variables to form the row vector $a = X_1^\top X$, where $X = (X_2, X_3, \ldots)$.

4.4.5 DATA TO MATRIX CONVERSION

Variables can be converted into matrices and matrices into variables through the mkmat and svmat commands. Their syntax is

mkmat varlist [if exp] [in range] [, matrix(matname)]
svmat [type] A [, names(col|eqcol|matcol|string)]

where type is a storage type for new variables, A is the name of an existing matrix, and the names(col|eqcol|matcol|string) option specifies how the new variables are to be named. The mkmat command stores the k variables listed in varlist in k column vectors of the same name. Optionally, if matrix() is specified, they can be stored as a single matrix. The svmat command is the reverse of mkmat: it takes a matrix and stores its columns as new variables. The related command

matname A namelist [, rows(range) columns(range) explicit]

renames the rows and columns of a matrix. Examples:

. mkmat mpg
. mkmat foreign weight displ, matrix(X)
. matrix b = syminv(X'*X) * X'*mpg
. mkmat bvector1 if bvector1~=.
. matrix list bvector1
. matrix d = bvector1'
. matname d wei gr for _cons, c(.)
. matrix list d
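As an illustration of why the accumulation commands of Section 4.4.4 are useful, least squares coefficients can be computed without ever storing the data as a matrix. The following is a sketch only: y, x1 and x2 are hypothetical variables, and the matrix names XX, yX and b are arbitrary.

```stata
* y, x1 and x2 are hypothetical variables in the current dataset
matrix accum XX = x1 x2          /* X'X; a constant is included by default */
matrix vecaccum yX = y x1 x2     /* row vector y'X, first variable against the rest */
matrix b = yX * syminv(XX)       /* OLS coefficients, as a row vector */
matrix list b
regress y x1 x2                  /* for comparison */
```

Because the inverse of $X^\top X$ is symmetric, the row vector yX * syminv(XX) is the transpose of the usual column vector $(X^\top X)^{-1} X^\top y$, so the entries of b should agree with the coefficients reported by regress.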

4.4.6 GETTING SYSTEM MATRICES

The usual way to obtain matrices after a command that produces matrices is to refer to the returned matrix in the standard way. For example, all estimation commands (see Section 5.1) return the coefficient vector e(b) and the variance-covariance matrix e(V) of the estimates. These matrices can be referenced directly. Examples:

. matrix list e(b)
. matrix S = vecdiag(e(V))

Other matrices are returned by various commands. They are obtained in the same way. Alternatively, the matrix get command also obtains matrices after certain commands.

4.4.7 MATRIX DECOMPOSITION

The matrix symeigen command returns the eigenvectors of a symmetric n × n matrix in the columns of the n × n matrix X and the corresponding eigenvalues in the row n-vector V. The eigenvalues are sorted from largest to smallest: V[1,1] contains the largest eigenvalue and X[1...,1] its corresponding eigenvector, V[1,2] contains the second largest eigenvalue and X[1...,2] its corresponding eigenvector, and so on.

The singular value decomposition of an m × n matrix A (with m ≥ n) is carried out through the matrix svd command. This command returns an m × n matrix U, a row n-vector W and an n × n matrix V such that $A = U \,\mathrm{diag}(W)\, V^\top$. In addition, the columns of U are orthogonal, the elements of W are positive or zero, and V is orthonormal.
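A short sketch of how the two decomposition commands might be used together follows. The variables x1, x2 and x3 are hypothetical, and the matrix names are arbitrary; the cross-product matrix is used here only because it is a convenient symmetric matrix to decompose.

```stata
* x1, x2 and x3 are hypothetical variables in the current dataset
matrix accum A = x1 x2 x3, noconstant   /* symmetric 3 x 3 cross-product matrix */
matrix symeigen X v = A                 /* eigenvectors in X, eigenvalues in v */
matrix list v
matrix svd U w V = A                    /* A = U*diag(w)*V' */
matrix R = U*diag(w)*V'                 /* rebuild A from its decomposition */
display mreldif(A, R)                   /* should be close to zero */
```

The final display is a simple check: if the decomposition has been carried out correctly, the relative difference between A and the rebuilt matrix R is zero up to rounding error.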

5 Statistical Inference Using Stata

5.1 ESTIMATION

5.1.1 GENERAL SYNTAX OF ESTIMATION COMMANDS

The general syntax of an estimation command is:

command varlist [weight] [if exp] [in range] [, options]

The first variable in varlist is the response or outcome variable, denoted by yvar; the other variables are the covariates or predictors, denoted by xvars. The general syntax for multiple-equation commands, namely commands that estimate systems of equations, is similar:

command (varlist) (varlist) ... (varlist) [weight] [if exp] [in range] [, options]

All estimation commands share the following common features:

• To review the last estimates, just type the estimation command without arguments.
• In addition to the estimated parameters and their standard errors, confidence intervals for the coefficients are displayed. The confidence level may be set using the level(#) option, where # is the desired percentage level. The default is level(95).
• The estimated variance matrix of the estimators is computed under the assumption that the statistical model is correctly specified, but some commands allow for certain forms of model misspecification with the robust option.
• After estimation, one may obtain the estimated variance matrix of the estimators using the vce command (Section 5.2.2).
• After estimation, one may obtain predictions, residuals and influence statistics using the predict command (Section 5.2.3).
• After estimation, one can perform tests of hypotheses about the model parameters (Section 5.2.4).

• The command lincom computes point estimates, standard errors, t statistics, p-values and confidence intervals for a linear combination of coefficients after any estimation command except anova.

5.1.2 WEIGHTED ESTIMATION

Specifying weights allows weighted estimation. For example

. regress y x1 x2 [pweight=w]

gives a weighted least squares regression of y on x1 and x2 using the probability weights contained in the variable w.

5.1.3 CONSTRAINED ESTIMATION

Several commands (e.g. cnsreg) allow estimation subject to linear constraints on the model parameters through the constraint(clist) option, where clist is of the form #[-#][, #[-#] ...], with 1 ≤ # ≤ 999. The constraint command defines, lists and drops linear constraints. Its syntax is:

constraint define # [exp=exp|coefficientlist]
constraint dir [clist|_all]
constraint drop {clist|_all}
constraint list [clist|_all]

where coefficientlist lists the variables whose coefficients are set equal to zero. The following example estimates the linear model $E(Y) = \alpha + \beta_1 X_1 + \cdots + \beta_6 X_6$ subject to the constraints that $\beta_1 = \beta_2 = \beta_3 = \beta_6$ and $\beta_4 = -\beta_5 = \alpha/10$:

. constraint define 1 x1=x6
. constraint define 2 x2=x6
. constraint define 3 x3=x6
. constraint define 4 x4=-x5
. constraint define 5 x4=_cons/10
. cnsreg y x1-x6, constraint(1-3,4-5)

5.1.4 ROBUST VARIANCE ESTIMATES

Some commands (e.g. regress or glm) offer the option of estimating the variance matrix of the parameter estimators by relaxing the assumption that the statistical model is correctly specified and allowing for certain forms of model misspecification. The robust option relaxes the assumption that the observations are identically distributed, thus allowing for heteroskedasticity of unknown form. The following example estimates the linear model $E(Y) = \alpha + \beta_1 X_1 + \beta_2 X_2$ respectively without and with heteroskedasticity-robust variance estimates:

. regress y x1 x2
. regress y x1 x2, robust
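For reference, in the linear regression case the heteroskedasticity-robust variance estimator has the familiar "sandwich" form. The notation here is an assumption introduced only for this sketch and is not used elsewhere in these notes: $x_i$ is the vector of covariates (including the constant) for observation $i$, $X$ is the matrix with rows $x_i^\top$, $\hat u_i$ is the least squares residual, $n$ is the number of observations and $k$ the number of estimated coefficients. The factor $n/(n-k)$ is the finite-sample adjustment that, to my understanding, regress applies with the robust option.

```latex
\widehat{V}_{\text{robust}}
  = \frac{n}{n-k}\,
    (X^\top X)^{-1}
    \left( \sum_{i=1}^{n} \hat u_i^{\,2}\, x_i x_i^\top \right)
    (X^\top X)^{-1}
```

When the errors are homoskedastic, the middle term is approximately $\sigma^2 X^\top X$ and the expression reduces to the conventional estimator $\hat\sigma^2 (X^\top X)^{-1}$, which is why the two sets of standard errors in the example are close in that case.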

The robust cluster(varname) option only requires observations to be independent across clusters specified by the variable varname, thus relaxing the assumption of independence.

5.2 POST-ESTIMATION COMMANDS

5.2.1 ACCESSING COEFFICIENTS AND STANDARD ERRORS

After a (single-equation) estimation command, _b[varname] (or _coef[varname]) and _se[varname] contain the coefficient on varname and its standard error, both recorded to machine precision. In the case of a multiple-equation estimation command, use [eqno]_b[varname] (or simply [eqno][varname] or [eqno]varname) and [eqno]_se[varname], where eqno is the equation number. The command mfx produces tables displaying the marginal effects or the elasticities (and their standard errors) instead of the estimated coefficients.

5.2.2 DISPLAYING THE VARIANCE ESTIMATES

After model estimation, the estimated variance or correlation matrix of the estimators is displayed using the command

vce [, corr rho]

where corr and rho are synonyms and either displays the correlation matrix instead of the variance matrix. To obtain a copy of the estimated variance matrix for manipulations type

matrix matname = e(V)

5.2.3 PREDICTIONS AND RESIDUALS

The predict command calculates predictions, residuals and influence statistics after estimation. What predict can do depends to some extent on the previous estimation command. The general features of predict are:

1. The predict newvarname command creates newvarname containing the "predicted values" of the response. For example, after linear regression, predict newvarname creates the fitted values $\hat\beta^\top X_i$. After probit, it creates the estimated probit probabilities $\Phi(\hat\beta^\top X_i)$.
2. The predict newvarname, xb command creates newvarname containing the linear prediction $\hat\beta^\top X_i$. For linear models, this command produces the same result as predict newvarname.
3. The predict newvarname, stdp command creates newvarname containing the standard error of the linear prediction.
4. Adding the nooffset option to any of the above makes the calculation ignore any offset or exposure variable specified in the estimation command.

predict newvarname if e(sample). 6. Some statistics make sense only with respect to the estimation sample.. equation(#1) is the default when equation() is not speciﬁed. the calculation is automatically restricted to the estimation sample. One can restrict the prediction to the estimation sample by typing . one may use equation(#1. predict calculates the requested statistic for all observations possible. It computes the standard error of the diﬀerence in linear predictions between two equations. one could refer to the equations by their names. predict can be used to make in-sample or out-of-sample predictions. For example. Other statistics (for example stddp) refer to between-equation concepts. equation(hours) to the one named hours. In general. equation(income) would refer to the equation named income. 7.. • stdp calculates the standard error of the linear prediction. Alternatively. . • stddp is allowed only after multiple-equation estimation commands.. • equation(eqno[. • The nooffset option may be combined with most statistics and speciﬁes that the calculation should be made ignoring any oﬀset or exposure variable speciﬁed when the model was estimated. This option is available even if not documented for predict after a speciﬁc command. predict hat. 9. and so on. equation(#1) means that calculations are to be made for the ﬁrst equation. It speciﬁes to which equation one is referring. In such cases. use data1 model estimation commands ..hours).#2) or equation(income. equation() is not optional. Example: . /* fill in the predictions */ The options of predict are: • xb calculates the linear prediction from the estimated model.50 5. If neither the offset() . Out-of-sample predictions may be obtained by applying predict to other datasets. In those cases. use data2 /* another dataset */ . 8. . equation(#2) that they are to be made for the second. When two equations must be speciﬁed.eqno]) is only relevant after multiple-equation estimation commands. 
whether they were used in estimating the model or not.

then bstat can be run on the data in ﬁlename to view the bootstrap estimates again. and the lrtest command performs likelihood-ratio tests after ML estimation. • other_options refers to command-speciﬁc options that are documented with each command. Note that command and exp_list must both be enclosed in double quotes. and calculates conﬁdence intervals using three diﬀerent methods: normal approximation.1 BOOTSTRAP The command bstrap progname [. 5. Reproducibility of the results requires setting the random-number seed by typing set seed #.. stat(#) level(#) title(text)] displays bootstrap estimates of standard error and bias.4 HYPOTHESIS TESTING 51 The test command performs Wald-type tests of linear hypothees.3 BOOTSTRAPPING AND MONTE CARLO SIMULATIONS Bootstrap and Monte Carlo simulations rely on Stata’s uniform() random number generator. The command bs "command" "exp_list" [. The expressions in exp_list must be separated by spaces and there must be no spaces within each expression. reps(#) size(#) dots args(.2. the command bsample [exp] [. specifying nooffset does nothing. 5. percentile. the testnl command performs Wald-type tests of nonlinear (or linear) hypothees. If the user speciﬁes the saving(ﬁlename) option with bstrap or bs. and bias corrected. Finally.. cluster(varnames) idcluster(newvarname)] . This command takes the same options as bstrap except for args(). The bstrap and bs commands automatically run bstat after completing all the bootstrap replications. The command bstat varlist [. 5.STATISTICAL INFERENCE USING STATA nor the exposure() option was speciﬁed at the model estimatio stage.3. bstrap_options nowarn noesample] runs the user-speciﬁed command bootstrapping the statistics speciﬁed in exp_list.) level(#) cluster(varnames) idcluster(newvarname) saving(ﬁlename) double every(#) replace noisily] runs the user-deﬁned program progname reps(#) times on bootstrap samples of size size(#).

52 is a low-level utility for those who prefer not to use bstrap or bs. Typing "simul progname. If exp is not speciﬁed. the sample replaces the data in memory. reps(#)" iterates progname # replications and collects the results. a sample size of _N is drawn (or size n_c when the cluster() option is speciﬁed where n_c is the number of clusters). It draws a sample with replacement from the existing data. reps(#) [args(whatever) dots double saving(ﬁlename) every(#) replace noisily] where progname is the name of a program that performs a single simulation. exp speciﬁes the size of the sample and must be less than or equal to _N. reps(#) (not optional) speciﬁes the number of replications to be performed. and args(sl whatever) speciﬁes any arguments to be passed to progname.2 MONTE CARLO SIMULATION The simul command is aimed at easing the programming task of performing Monte Carlo simulations. Its syntax is: simul progname. 5.3. .
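The post-estimation and bootstrap commands of this chapter can be strung together in a short session. The following is only a sketch under stated assumptions: it uses the auto.dta example dataset (loaded here with sysuse, which is available in more recent Stata releases; with older releases load the file with use), and the model and variable names are chosen purely for illustration. The commands follow the syntax described in these notes; more recent releases document replacements for some of them (for example estat vce and the bootstrap prefix).

```stata
* Sketch only: auto.dta and the model below are illustrative assumptions
sysuse auto, clear

* Section 5.2.1: coefficient and standard error of one regressor
regress mpg weight foreign
display _b[weight] "  " _se[weight]

* Section 5.2.2: variance matrix, correlation form, and a copy for manipulation
vce
vce, corr
matrix V = e(V)
matrix list V

* Section 5.2.3: linear prediction and its standard error, estimation sample only
predict mpghat if e(sample), xb
predict mpgse if e(sample), stdp

* Section 5.2.4: Wald test that both slope coefficients are zero
test weight foreign

* Section 5.3.1: bootstrap the two slopes; the seed makes the run reproducible
set seed 12345
bs "regress mpg weight foreign" "_b[weight] _b[foreign]", reps(100)
```

Note that the two expressions in the second quoted string are separated by a space and contain no internal spaces, as the bs syntax requires.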

6 Statistical Models in Stata

A broad range of statistical models may be estimated directly using the available Stata commands. In this note, I focus on estimation of linear models (Section 6.1), generalized linear models (Section 6.2) and parametric models for duration data (Section 6.4.1), and only provide a brief description for a number of other models.

6.1 LINEAR MODELS

Stata offers several commands for estimating linear models.

6.1.1 ORDINARY LEAST SQUARES

The regress command estimates a linear model by least squares (ordinary least squares or weighted least squares). Its syntax is:

    regress yvar [xvars] [weight] [if exp] [in range] [, level(#) beta robust cluster(varname) hc2 hc3 hascons noconstant tsscons noheader eform(string) depname(varname) mse1 plus]

where beta requests that the normalized regression coefficients be reported instead of confidence intervals, robust and cluster(varname) have been discussed in Section 5.1, hc2 and hc3 specify alternative bias corrections for robust (they may not be specified with cluster()), hascons indicates that a user-defined constant or its equivalent is specified among the independent variables (some caution is recommended when using this option as resulting estimates may not be as accurate as they otherwise would be), noconstant suppresses the constant term (intercept) in the regression, tsscons forces the total sum of squares to be computed as though the model has a constant (i.e., as deviations from the mean of the dependent variable), and noheader, eform(), depname(), mse1 and plus are for ado-file writers.

The syntax of predict following regress is:

    predict [type] newvarname [if exp] [in range] [, statistic]

where, in addition to xb (the default) and stdp, statistic may be: residuals (residuals), rstandard (standardized residuals), rstudent (Studentized or jackknifed residuals), stdf (standard error of the forecast), stdr (standard error of the residual), pr(a,b) ($\Pr(a < Y < b)$), e(a,b) ($E(Y \mid a < Y < b)$), ystar(a,b) ($E[\max(a, \min(Y, b))]$), cooksd (Cook's distance), leverage|hat (diagonal elements of hat matrix), covratio (COVRATIO), dfbeta(varname) (DFBETA for varname), dfits (DFITS), or welsch (Welsch distance).

In addition to predict, the following commands can be used after regress for diagnosing sensitivity to individual observations:

• avplot (graphs an added-variable or leverage plot).
• cprplot (graphs a partial residual plot).
• dfbeta (calculates the DFBETAs).
• dwstat (computes the Durbin-Watson test statistic).
• hettest (performs the Cook-Weisberg test for heteroskedasticity).
• lvr2plot (graphs a leverage vs. squared residual or L-R plot).
• ovtest (performs Ramsey's RESET test for omitted variables).
• rvfplot (graphs a residual-versus-fitted plot).
• rvpplot (graphs a residual-versus-predictor plot).
• vif (calculates the variance inflation factors).

6.1.2 CONSTRAINED LINEAR REGRESSION

The cnsreg command estimates constrained linear regression models. Its syntax is:

    cnsreg yvar xvars [weight] [if exp] [in range], constraints(numlist) [level(#)]

where constraints(numlist) (not optional) specifies the constraint numbers of the constraints to be applied (see Section 5.3). Constraints are defined using the constraint command.

6.1.3 LINEAR INSTRUMENTAL VARIABLES

The ivreg command estimates a linear regression model using instrumental variables (or two-stage least squares) of yvar on xvars1 and xvars2 using ivars (along with xvars1) as instruments for xvars2. The syntax of this command is:

    ivreg yvar [xvars1] (xvars2=ivars) [weight] [if exp] [in range] [, level(#) beta hascons noconstant robust cluster(varname) first noheader eform(string) depname(varname) mse1]

The variables in xvars1 and ivars are the exogenous variables, those in xvars2 are the endogenous variables.
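As a rough illustration of the three linear-model commands above, the following sketch assumes the auto.dta example dataset; the choice of variables, the constraint, and the treatment of weight as endogenous are hypothetical and carry no substantive meaning. The diagnostic commands are written with the names used in these notes; more recent releases may document them under different names.

```stata
* Sketch only: auto.dta and all modelling choices are illustrative assumptions
sysuse auto, clear

* OLS, then residuals and leverage through predict
regress mpg weight foreign
predict e, residuals
predict h, hat

* two of the diagnostics listed above
hettest
vif

* constrained regression: force the price and weight coefficients to be equal
constraint define 1 price = weight
cnsreg mpg price weight, constraints(1)

* instrumental variables: weight is treated as endogenous,
* with length and displacement (plus foreign) as instruments
ivreg mpg foreign (weight = length displacement)
```

In the ivreg call, foreign plays the role of xvars1, weight of xvars2, and length and displacement of ivars.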

6.2 GENERALIZED LINEAR MODELS

Stata offers a single and very flexible command (glm) to estimate generalized linear models (McCullagh & Nelder 1989). It also offers, for selected models in this class, special commands (e.g. logit, probit, poisson) with a broader and more specific set of options, especially diagnostics and other post-estimation output.

6.2.1 GLM

The glm command fits generalized linear models. Estimation is carried out either by iteratively reweighted least squares (IRLS) or by using the Newton-Raphson (NR) method, which is the default. The basic syntax is:

    glm yvar [xvars] [weight] [if exp] [in range] [, max_options var_options output_options spec_options]

The max_options are:

    iterate(#) ltolerance(#) mu(varname) nolog search fisher(#) irls

where:

• iterate(#) specifies the maximum number of iterations allowed in estimating the model (iterate(50) is the default).
• ltolerance(#) specifies the convergence criterion for the change in deviance between iterations (ltolerance(1e-6) is the default).
• mu(varname) specifies varname as the initial estimate for the mean of yvar.
• nolog suppresses the iteration log.
• search specifies that the command should search for good starting values.
• fisher(#) specifies the number of NR steps that should use the Fisher scoring Hessian or expected information matrix before switching to the observed information matrix (both search and fisher() are only useful with NR optimization, not with IRLS).
• irls requests IRLS minimization of the deviance instead of NR maximization of the log-likelihood.

The var_options are:

    oim opg vfactor(#) robust cluster(varname) unbiased nwest(wtname [#]) jknife jknife1 bstrap brep(#) scale(x2|dev|#) disp(#) score(newvar) t(varname)

where:

• oim specifies that the variance matrix should be calculated using the observed information matrix rather than the usual expected information matrix (option ignored if irls is not specified).
• opg specifies that the variance matrix be calculated using the Berndt, Hall, Hall, and Hausman (1974) variance estimator (this option is not allowed when cluster() is specified).
• vfactor(#) specifies a scalar by which to multiply the resulting variance matrix.
• robust and cluster(varname) have already been defined.
• unbiased specifies that the unbiased sandwich estimate of variances be used (robust is implied when unbiased is used).
• nwest(wtname [#]) specifies that a heteroskedasticity and autocorrelation consistent variance estimate be used.
• jknife and jknife1 specify that jackknife estimates of variance be used.
• bstrap specifies that the bootstrap estimate of variance be used.
• brep(#) specifies the number of bootstrap samples to consider in forming the bootstrap estimate (the default is brep(199)).
• scale(x2|dev|#) overrides the default scale parameter (by default, scale(1) is assumed for discrete distributions and scale(x2) for continuous distributions). scale(x2) specifies that the scale parameter be set to the Pearson chi-squared (or generalized chi-squared) statistic divided by the residual degrees of freedom. scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom (this provides an alternative to scale(x2) for continuous distributions and over- or under-dispersed discrete distributions). scale(#) sets the scale parameter to #.
• disp(#) multiplies the variance of yvar by # and divides the deviance by #.
• score(newvar) creates the new variable newvar containing each observation's contribution to the score.
• t(varname) specifies the variable name corresponding to the time index (this option is required if nwest() is specified).

The output_options are:

    eform level(#) trace noheader nodisplay nodots

where:

• eform displays the exponentiated coefficients and corresponding standard errors and confidence intervals (for binomial models with the logit link, exponentiation results in odds ratios; for Poisson models with the log link, exponentiated coefficients are rate ratios).
• trace requests that the estimated coefficient vector be printed at each iteration.
• noheader suppresses the header information from the output (the coefficient table is still printed).
• nodisplay suppresses the output (the iteration log is still displayed).
• nodots specifies that a dot should not be printed for each fitted model when calculating jackknife or bootstrap estimates (by default, a single dot character is printed for each estimation that is performed).

The spec_options are:

    family(familyname) link(linkname) noconstant [ln]offset(varname)

where:

• family(familyname) specifies the parametric family. familyname is either a user-written program or one of: binomial (Bernoulli/binomial), gamma, gaussian, igaussian (inverse Gaussian), nbinomial (negative binomial), poisson.
• link(linkname) specifies the link function. linkname is either a user-written program or one of: cloglog (complementary log-log), identity, log, logc (log-complement), logit, loglog (log-log), nbinomial (negative binomial), opower # (odds power), power #, probit.
• noconstant specifies that the linear predictor has no intercept term.
• [ln]offset(varname) specifies an offset to be added to the linear predictor.

If family() is specified but not link(), then the canonical link for the family is obtained, namely:

• link(identity) for family(gaussian) (same as regress).
• link(power -2) for family(igaussian).
• link(logit) for family(binomial) (same as logit).
• link(log) for family(poisson) (same as poisson).
• link(log) for family(nbinomial) (same as nbreg).

• link(power -1) for family(gamma).

The syntax of predict after glm is:

    predict [type] newvarname [if exp] [in range] [, statistic nooffset standardized studentized modified adjusted]

where, in addition to xb and stdp, statistic may be: mu (predicted mean of the response, the default), eta (same as the xb option), cooksd (Cook's distance), deviance (deviance residual), hat (diagonal of the hat matrix), likelihood (likelihood residual), pearson (Pearson residual), response (response residual), score (score residual), or working (working residual).

6.2.2 LOGIT AND PROBIT

The logit command estimates logit models by maximum likelihood (ML). Its syntax is:

    logit yvar [xvars] [weight] [if exp] [in range] [, level(#) nocoef noconstant or robust cluster(varname) score(newvar) offset(varname) asis max_options]

where yvar==0 indicates a negative outcome, and yvar~=0 & yvar~=. (typically yvar==1) indicates a positive outcome. The logistic command is just the same as logit, except that it reports odds ratios rather than coefficients by default.

The syntax of predict following logit or logistic is:

    predict [type] newvarname [if exp] [in range] [, statistic rules asif nooffset]

where, in addition to xb and stdp, statistic may be: p (predicted probability of a positive outcome, the default), dbeta (Delta-Beta influence statistic, Pregibon 1981), deviance (deviance residual), dx2 (Delta chi-squared influence statistic, Hosmer & Lemeshow 1989), ddeviance (Delta-D influence statistic, Hosmer & Lemeshow 1989), hat (leverage, Pregibon 1981), number (sequential number of the covariate pattern), or residuals (Pearson residual).

The following commands can be used after both logit and logistic to explore the nature of the fit: lfit (performs goodness-of-fit tests), lroc (graphs the ROC curve), lstat (reports summary statistics including a classification table), and lsens (graphs sensitivity and specificity versus probability cutoff).

The probit command is completely analogous and estimates probit models by ML.

6.2.3 POISSON AND NBREG

The command poisson produces ML estimates of the Poisson regression model. Its syntax is:

    poisson yvar [xvars] [weight] [if exp] [in range] [, irr level(#) exposure(varname) offset(varname) robust cluster(varname) score(newvarname) noconstant constraints(numlist) nolog max_options]

where yvar is a nonnegative count variable and irr reports estimated coefficients transformed to incidence rate ratios.

The syntax of predict after poisson is:

    predict [type] newvar [if exp] [in range] [, statistic nooffset]

where, in addition to xb and stdp, statistic may be n (predicted number of events, the default), or ir (incidence rate; this is equivalent to predict ..., n nooffset).

The command nbreg produces ML estimates of the negative binomial regression model (Poisson regression with overdispersion). Its syntax is:

    nbreg yvar [xvars] [weight] [if exp] [in range] [, dispersion(mean|constant) level(#) irr exposure(varname) offset(varname) robust cluster(varname) score(newvarnames) noconstant constraints(numlist) nolrtest nolog max_options]

where yvar is a nonnegative count variable. Two different parameterizations of the negative binomial model may be estimated. The default, also given by the option dispersion(mean), has dispersion for the ith observation equal to $1 + \alpha \exp(\beta^\top X_i + \text{offset})$; that is, the dispersion is a function of the expected mean of the counts for the ith observation, $\exp(\beta^\top X_i + \text{offset})$. The alternative parameterization, given by the option dispersion(constant), has constant dispersion for all observations equal to $1 + \delta$. For the default model, $\alpha = 0$ (or $\ln(\alpha) = -\infty$) corresponds to unit dispersion and, thus, it is simply a Poisson model. For the alternative parameterization, $\delta = 0$ (or $\ln(\delta) = -\infty$) corresponds to unit dispersion, and is simply a Poisson model.

The syntax of predict after nbreg is the same as after poisson.

6.3 OTHER LIMITED DEPENDENT VARIABLES MODELS

6.3.1 GROUPED BINARY RESPONSES

The blogit and bprobit commands produce ML estimates of the logit and probit models for grouped data. The glogit and gprobit commands produce weighted least-squares (minimum chi-square) estimates.

6.3.2 ORDERED CATEGORICAL RESPONSES

The ologit and oprobit commands estimate ordered logit and probit models of ordinal variable depvar on the covariates. The actual values taken on by the response variable are irrelevant except that larger values are assumed to correspond to "higher" outcomes. Up to 50 outcomes are allowed.
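The relationship between glm and the dedicated commands of Sections 6.2 and 6.3 can be sketched as follows. This is a hedged illustration only: it assumes the auto.dta example dataset, and the variables are chosen for convenience rather than for substantive reasons (in particular, rep78 is a 1-5 repair score that is treated as a count purely to show the syntax).

```stata
* Sketch only: auto.dta and the variable choices are illustrative assumptions
sysuse auto, clear

* binomial family with its canonical logit link ...
glm foreign weight mpg, family(binomial) link(logit)

* ... fits the same model as the dedicated logit command
logit foreign weight mpg
predict phat, p

* logistic reports odds ratios instead of coefficients
logistic foreign weight mpg

* Poisson regression reporting incidence rate ratios
poisson rep78 weight foreign, irr
predict nhat, n

* ordered logit for the same score, now treated as ordinal
ologit rep78 weight foreign
```

Comparing the coefficient tables of the glm and logit calls is a quick way to check that the canonical-link default behaves as described.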

6.3.3 NESTED LOGIT

The command nlogit estimates a nested logit model by ML. The model may contain one or more levels. For a single-level model, nlogit estimates the same model as clogit.

6.3.4 MULTINOMIAL LOGIT

The mlogit command estimates multinomial logit models by ML. Constraints may be defined to perform constrained estimation.

6.3.5 BIPROBIT

The biprobit command produces ML estimates of two-equation probit models, either a bivariate probit or a system of two seemingly unrelated probit equations.

6.3.6 CENSORED AND TRUNCATED REGRESSION

The tobit command produces ML estimates of censored Gaussian regression models with a fixed censoring point. The cnreg command estimates the same class of models but allows the censoring points to vary across observations. The intreg command estimates a bivariate censored Gaussian model where the response variables $(Y_1, Y_2)$ can be point data, interval data (e.g. $a \le Y_i \le b$), left-censored (e.g. $Y_1 \le b$), or right-censored (e.g. $Y_1 \ge a$). The truncreg command produces ML estimates of a truncated Gaussian regression model. The heckman command estimates Gaussian linear models with sample selection using either Heckman's two-step estimator (Heckman 1976) or full ML. The related command heckprob estimates probit models with sample selection by ML.

6.4 DURATION DATA

6.4.1 PARAMETRIC DURATION MODELS

The ereg and weibull commands produce ML estimates respectively of the exponential and Weibull (survival time) models, with and without gamma-distributed or inverse Gaussian unobserved heterogeneity (frailty). Their syntax is:

    {ereg[het]|weibull[het]} yvar [xvars] [weight] [if exp] [in range] [, hazard hr tr dead(varname) t0(varname) frailty(gamma|invgaussian) ancillary(varlist) strata(varname) robust cluster(varname) score(newvars) constraints(numlist) level(#) nocoef noheader nolog maximize_options]

The syntax of predict following ereg and weibull is:

    predict [type] newvarname [if exp] [in range] [, statistic]

where, in addition to xb and stdp, statistic may be: median time (predicted median survival time, the default), median lntime (predicted median log survival time), mean time (predicted mean survival time), mean lntime (predicted mean log survival time), hazard (predicted hazard), hr (predicted hazard ratio), surv (predicted survival probability), csnell (partial Cox-Snell residuals), or mgale (partial martingale-like residuals).

6.4.2 COX PROPORTIONAL HAZARD MODEL

The cox command estimates proportional hazards models by ML. The covariates may be either fixed or time-varying (fixed within intervals). The failure event may be unique or recurring. The procedure allows for left truncation (delayed entry), as well as gaps and right censoring. A simplified version of cox is stcox.

6.5 TIME SERIES

The tsset timevar command declares the data to be a time series and designates that variable timevar (which must take on integer values) represents time. The tsset command must be used before time-series operators may be used. After tsset, the data will be sorted on timevar.

6.5.1 LINEAR MODELS WITH AUTOCORRELATED ERRORS

The prais command estimates a linear model with first-order autoregressive errors using the Prais-Winsten transformed regression estimator, the Cochrane-Orcutt transformed regression estimator, or a version of the Hildreth-Lu search method.

The newey command produces estimated standard errors for the OLS coefficients of linear regression models with heteroskedastic and possibly autocorrelated errors (see Newey & West 1987).

6.5.2 ARIMA MODELS

The arima command estimates a linear model with autoregressive moving-average (ARMA) errors. When no covariate is specified, the command estimates autoregressive integrated moving-average (ARIMA) models for the response variable. The response variable and the covariates may be differenced or seasonally differenced to any degree. Missing data are allowed and are handled using the Kalman filter.

6.5.3 ARCH-TYPE MODELS

The arch command estimates models with autoregressive conditional heteroskedasticity (ARCH) using conditional ML. In addition to ARCH terms, models may include multiplicative heteroskedasticity. Concerning the regression equation itself, models may also contain ARCH-in-mean and/or ARMA terms. The following options may be used to estimate a number of models in the ARCH family:

• arch() (ARCH).
• arch() garch() (GARCH).
• archm arch() [garch()] (ARCH-in-mean).
• arch() garch() ar() ma() (GARCH with ARMA terms).
• arch() tarch() [garch()] (GJR, form of threshold ARCH).
• arch() saarch() [garch()] (SAARCH, simple asymmetric ARCH).
• abarch() atarch() sdgarch() (TARCH, threshold ARCH).
• earch() egarch() (EGARCH).
• parch() [pgarch()] (PARCH, power ARCH).
• narch() [garch()] (NARCH, nonlinear ARCH).
• narchk() [garch()] (NARCHK, NARCH with a single shift).
• aparch() [pgarch()] (A-PARCH, asymmetric power ARCH).
• nparch() [pgarch()] (NPARCH, nonlinear power ARCH).

6.6 PANEL DATA

The xt series of commands provides tools for analyzing longitudinal (panel) data. Each observation in a longitudinal dataset is indexed by a unit-specific index i and a time-specific index t. The iis command or the i() option set the name of the variable corresponding to index i, while the tis command or the t() option set the name of the variable corresponding to index t. Some of the xt commands use time-series operators in their internal calculations and thus require the data to be tsset.

6.6.1 LINEAR PANEL DATA MODELS

The xtreg command estimates linear panel data models. This command can estimate fixed-effects (within-group), between-group, and random-effects models as well as population-averaged models. Which estimator is used is determined by the following options:

• be (between-group estimator).
• fe (fixed-effects estimator).
• re (GLS random-effects estimator).
• pa (GEE population-averaged estimator).

• mle (Gaussian ML random-effects estimator).

If no option is specified, re is assumed. The xttest0 and xthaus commands after xtreg, re perform, respectively, the Breusch and Pagan (1980) Lagrange multiplier test for random effects and the Hausman (1978) specification test.

6.6.2 DYNAMIC PANEL DATA MODELS

The xtabond command estimates dynamic panel data models using Arellano and Bond (1989) one-step, one-step robust or two-step estimators. The command can be used with exogenously unbalanced panels and handles embedded gaps in the time series as well as opening and closing gaps.

6.6.3 SEEMINGLY UNRELATED REGRESSION EQUATIONS

The sureg command estimates a system of seemingly unrelated linear regression equations by feasible generalized least squares. This command is not part of the xt series.

6.6.4 GEE FOR PANEL DATA

The xtgee command generalizes the glm command to panel data. It is very flexible and allows estimation of generalized linear models for panel data (see Liang & Zeger 1986) with different choices of parametric family, link function and within-group correlation structures. The allowed distribution families are the same as for glm, namely Bernoulli/binomial, gamma, Gaussian (normal), inverse Gaussian, negative binomial, Poisson and user-supplied. The allowed link functions are also the same, namely complementary log-log, identity, log, log-complement, logit, log-log, negative binomial, odds power, power, probit and user-supplied. The allowed within-group correlation structures include independence, equicorrelation, kth order autocorrelation, kth order moving average, unstructured (arbitrary nonstationary) and user-supplied.

6.6.5 LOGIT AND PROBIT FOR PANEL DATA

The xtlogit command estimates a fixed-effects (fe), a random-effects (re), or a population-averaged (pa) logit model for panel data. The xtprobit command estimates random-effects (re) and population-averaged (pa) probit models for panel data. The integrals in the individual terms of the log-likelihood of the random-effects model are computed using Gauss-Hermite quadrature. After estimating the model, the quality of the Gauss-Hermite quadrature approximation may be checked using the quadchk command.
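To make the panel-data workflow of Section 6.6 concrete, the sketch below simulates a small balanced panel and applies the linear estimators to it. Everything about the data is an assumption made for illustration: the variable names (id, year, u, x1, y), the panel dimensions and the data-generating process are hypothetical. The iis and tis commands are used as described in these notes; more recent releases document xtset for the same purpose.

```stata
* Sketch only: simulated panel with hypothetical variable names
clear
set seed 2001
set obs 500

* 100 units observed for 5 periods each
generate id = int((_n - 1)/5) + 1
bysort id: generate year = _n

* unit-specific effect, held constant within each id
bysort id: generate u = invnorm(uniform()) if _n == 1
bysort id: replace u = u[1]

* one covariate and the response
generate x1 = invnorm(uniform())
generate y = 1 + 0.5*x1 + u + invnorm(uniform())

* declare the unit index i and the time index t
iis id
tis year

* fixed-effects and random-effects estimators
xtreg y x1, fe
xtreg y x1, re

* Lagrange multiplier test for random effects after xtreg, re
xttest0
```

Because the simulated unit effect is drawn independently of x1, the fixed-effects and random-effects slope estimates would be expected to be close to each other in this example.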

Notice that the xtlogit, fe command is equivalent to the clogit command. Also notice that xtlogit, pa corresponds to xtgee, family(binomial) link(logit) corr(exchangeable), whereas xtprobit, pa corresponds to xtgee, family(binomial) link(probit) corr(exchangeable).

6.6.6 POISSON AND NEGATIVE BINOMIAL MODELS

The xtpois command estimates fixed-effects (fe), random-effects (re), or population-averaged (pa) Poisson models for panel data. For the re option, either a gamma (the default) or a normal (Gaussian) distributed random-effects model is estimated. As for xtprobit, the integrals in the individual terms of the log-likelihood of the Gaussian random-effects model are computed using Gauss-Hermite quadrature. After estimating the model, the quality of the Gauss-Hermite quadrature approximation may be checked using the quadchk command.

The xtnbreg command estimates fixed-effects (fe), random-effects (re), or population-averaged (pa) negative binomial models for panel data. For both xtpois and xtnbreg, the population-averaged model assumes equicorrelation as default (that is, corr(exchangeable)).

6.7 NONPARAMETRIC ESTIMATION

6.7.1 DENSITY ESTIMATION

The kdensity command produces kernel density estimates and graphs the result. The available options for the kernel function are: biweight, cosine, epan (Epanechnikov, the default), gauss (Gaussian), parzen, rectangle (uniform), triangle. The width(#) option specifies the halfwidth of the kernel. If width() is not specified, then Stata uses the asymptotically optimal width for Gaussian data and a Gaussian kernel.

6.7.2 REGRESSION SMOOTHERS

The ksm command carries out unweighted and locally weighted smoothing of a response variable yvar on a single covariate xvar, displays the graph, and optionally saves the smoothed variable. Among the command's capabilities is lowess (robust locally weighted regression, Cleveland 1979).

The smooth command applies resistant, nonlinear smoothers (running medians) to a single variable varname and stores the new series in newvar. Missing values at the beginning or end of the range of varname are ignored, but missing values in the middle of the series are not allowed.
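A minimal sketch of the nonparametric commands of Section 6.7 follows. It assumes the auto.dta example dataset, and the halfwidth of 2 is an arbitrary illustrative value. The option names are those given in these notes; more recent releases may document different names for the kernel and width options and for the smoothing command.

```stata
* Sketch only: auto.dta and the halfwidth are illustrative assumptions
sysuse auto, clear

* kernel density estimate with the default kernel and default width
kdensity mpg

* Gaussian kernel with an explicit halfwidth
kdensity mpg, gauss width(2)

* lowess smooth of mpg on weight
ksm mpg weight, lowess
```

Running the two kdensity calls one after the other shows how sensitive the estimated density is to the choice of kernel and width.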

8.8.8 6. Although not as fast.1 ROBUST AND QUANTILE REGRESSION ROBUST REGRESSION The rreg command estimates a linear model by iteratively reweighted least squares using a particular set of robust weights. The estimated variance matrix of the estimators is obtained by bootstrap. .2 QUANTILE REGRESSION The qreg command estimates quantile (including median) regression models. The function must be provided in a separate program. 6. The iqreg command estimates interquantile regressions (with a limit of 336 covariates). it is not limited to 336 coeﬃcients. Thus. See Gould and Scribney (1999) for details. The estimated variance matrix of the estimators is obtained by bootstrap and includes between-quantiles blocks . The sqreg command estimates simultaneous-quantile regression and produces the same coeﬃcients as qreg for each quantile.64 6. The ml series of commands allows estimation of an arbitrary model by ML. where q is the number of quantiles speciﬁed.9 GENERAL NONLINEAR METHODS The nl command ﬁts an arbitrary nonlinear function to a response variable yvar by least squares. 6. The bsqreg command is the same as sqreg. one can test and construct conﬁdence intervals comparing coeﬃcients describing diﬀerent quantiles. This command has a limit of 336/q covariates.

and West K. TX. Journal of the American Statistical Association.. (1981) Logistic Regression Diagnostics. 5: 475—492.K. Wiley.A. Cleveland W. 3/4: 653—665.R. Econometrica. (1974) Estimation and Inference in Nonlinear Structural Models.J. 55: 703—708. Econometrica. (1989) Generalized Linear Models (2nd ed. and Scribney W.References Berndt E. (2001) Econometrics. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots. (1987) A Simple.). McCullagh P. Stata Corporation. Annals of Statistics. and Hausman J. 46: 1251—1272. London. and Zeger S. Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. and Nelder J. 73: 13—22. Newey W. Biometrika. 9: 705—724. Wiley. Annals of Economic and Social Measurement.V. Hall B.Y.L. (1999) Maximum Likelihood Estimation with Stata. Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models. Hall R. Liang. Heckman J. Review of Economic Studies. (1976) The Common Structure of Statistical Models of Truncation. (1980) The Lagrange Multiplier Test and Its Applications to Model Speciﬁcation in Econometrics.S. and Lemeshow S. Chapman and Hall. K.. Chichester. and Pagan A. Hausman J.A. Gould W. College Station. (1989) Applied Logistic Regression.A. (1986) Longitudinal Data Analysis Using Generalized Linear Models. Breusch T. New York. Peracchi F. (1978) Speciﬁcation Tests in Econometrics. Positive Semi-Deﬁnite. Annals of Economic and Social Measurement.. Pregibon D. . 74: 829—836. UK. 47: 239—254.W. Hosmer D.
