An Introduction to Stata

F. PERACCHI
Faculty of Economics, Tor Vergata University, Rome, Italy

Contents

Introduction
1 Getting Started
    Starting and stopping Stata
    Stata documentation and updates
    Variables and observations
    Inputting data
    Basic data manipulation
    Outputting data
    Log files
2 Stata Commands
    General syntax
    Basic data commands
    Basic sample statistics
    Tables
    Combining data
    Reshaping data
3 Graphics
    Basic syntax and graphic styles
    Common graph options
    Histograms
    Two-way scatterplots
    Two-way scatterplot matrices
    Box plots
4 Programming and Matrix Commands
    Programming Stata
    Do files
    Ado files
    Matrix commands
5 Statistical Inference Using Stata
    Estimation
    Post-estimation commands
    Bootstrapping and Monte Carlo simulations
6 Statistical Models in Stata
    Linear models
    Generalized linear models
    Other limited dependent variables models
    Duration data
    Panel data
    Time series
    Robust and quantile regression
    Nonparametric estimation
    General nonlinear methods
References

Introduction

Why use Stata? In my view, it has three main advantages over other statistical packages. The first is portability: Stata runs on several platforms (Macintosh, Unix, Windows), and Stata programs written for one of them run with (almost) no change on any other one. The second is speed: Stata is fast because all data manipulations are carried out in the RAM. The only limit is the amount of RAM available. With 100MB of RAM, one can work with a dataset containing 5 million observations on 4 real-valued variables or, equivalently, with one million observations on 20 real-valued variables. The third advantage is that Stata contains "state-of-the-art" statistical procedures, is programmable, and is fully integrated with a matrix language.

This introduction to Stata is organized as follows. Chapter 1 describes the main features of the program. Chapter 2 introduces the syntax of a Stata command and presents some of the most used commands. Chapter 3 describes Stata graphic capabilities. Chapter 4 introduces the elements of Stata programming and the Stata matrix language. Chapter 5 shows how to carry out statistical inference (estimation, prediction, hypothesis testing) using Stata. Finally, Chapter 6 reviews the main classes of statistical models implemented in Stata.

The latest release is Stata 7.0. In what follows I focus on Stata 7.0 for Windows 98/95/NT.
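The memory figures quoted in the Introduction can be checked with a short calculation. The sketch below assumes that each real-valued variable is stored as a 4-byte float (the default storage type) and ignores whatever overhead Stata itself needs, so the result is only approximate.

```stata
* Megabytes needed for N observations on K float variables: N * K * 4 / 2^20
* (assumption: 4 bytes per float, no allowance for Stata's own overhead)
display 5000000*4*4/(1024*1024)
display 1000000*20*4/(1024*1024)
```

Both expressions involve the same 80 million bytes, a little over 76MB, which is why the two datasets are described as equivalent and why either fits within 100MB of RAM.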

I adopt the following typographic conventions: the typewriter-style typeface is used for Stata commands or options that have to be typed in (e.g. describe or generate), italics is used for things that must be substituted for by some other word (e.g. varname or varlist), small caps is used for keyboard keys (e.g. Enter or Ctrl+Break) and boldface is used for Windows commands or switches (e.g. Exit or Help).

1 Getting Started

This chapter introduces the main aspects of Stata, namely how to start and stop the program (Section 1.1), where to look for documentation and updates (Section 1.2), the definition of variables and observations (Section 1.3), how to input data (Section 1.4), basic data manipulation (Section 1.5), how to output data (Section 1.6), and how to open and close log files (Section 1.7).

1.1 STARTING AND STOPPING STATA

To start Stata, click on the Stata icon. To exit Stata, type exit or choose Exit from the File menu. To exit Stata when there are data in memory which have not been saved, type exit, clear. To make Stata stop what it is doing and return to the Stata prompt, click the Break button or press Ctrl+Break. To test the installation of Stata, type verinst. To test that the supplied ado files (see Section 4.3) are correctly installed, type crc.

1.1.1 THE STATA WINDOWS

The Stata windows consist of:
• the Stata Command window (where commands are typed in and then issued by pressing Enter),
• the Stata Results window (where results are displayed),
• the Review window (it shows the past commands),
• the Variables window (it shows the list of variables).

All these windows may be resized and rearranged. The windowing preferences may be saved by choosing Prefs from the main menu bar.

The Stata Command window follows standard Windows editing style. The keys for editing in the Command window are Delete, Backspace, Esc, Home, End, Page Up and Page Down. Clicking once on a command in the Review window copies the command to the Command window, where it can be edited before being entered. One can copy one line at a time from the Results window into the clipboard and paste it into the Command window.

1.1.2 THE STATA TOOLBAR

Going from left to right, the Stata toolbar contains the following buttons (holding the mouse pointer over each button, a box with a brief description will appear):
1. Open (opens a Stata dataset).
2. Save (saves to disk the Stata dataset currently in memory).
3. Print (prints a graph or log).
4. Begin Log (starts a new log, appends to an existing log, and stops or suspends the current log).
5. Start Viewer (opens the Stata viewer for help on Stata).
6. Bring Results Window to Front (brings the Results window to the front of the other Stata windows).
7. Bring Graph Window to Front (brings the Graph window to the front of the other Stata windows).
8. Bring Dialog Window to Front (brings the Dialog window to the front of the other Stata windows).
9. Do-file Editor (opens the Do-file editor or brings the Do-file Editor window to the front of the other Stata windows).
10. Data Editor (opens the data editor or brings the Data Editor window to the front of the other Stata windows).
11. Data Browser (opens the data browser or brings the Data Browser window to the front of the other Stata windows).
12. Clear -more- Condition (tells Stata to continue when it has paused in the middle of a long output).
13. Break (stops the current task in Stata).

1.1.3 ALLOCATING MEMORY TO STATA

Initially, Stata allocates 1MB of memory to each session. Under Windows, to permanently change the amount of memory used every time Stata is invoked, click on the Stata icon, pull down File and choose Properties, click on the Shortcut tab and put k# (in kilobytes) or m# (in megabytes) after the call to wstata.exe in the Target line. The Start in line specifies the initial working directory. This may be changed by editing the line or by using the cd drive:/directory_name command from the Command line.

Memory allocation may also be changed within a given Stata session (although not permanently) by using the set memory #k command (in kilobytes) or the set memory #m command (in megabytes). This command requires that no data be present in memory. If more memory is used than is physically available on the computer, Stata slows down. In this case, it is recommended to set virtual memory on by typing

. set virtual on

1.2 STATA DOCUMENTATION AND UPDATES

The main documentation comes from the help system and the Stata reference manual. Additional documentation on Stata developments and updates is available through the Stata Technical Bulletin and the Stata Web site http://www.stata.com.

1.2.1 THE HELP SYSTEM

On-line help can be accessed by opening the Stata viewer from the toolbar or by choosing Help from the main menu bar. In either case, selecting Contents opens the table of contents for on-line help. Selecting Search... and then entering a keyword searches for that keyword in the list of help entries. On-line help can also be accessed from the Command line by typing help keyword or lookup keyword. Try help, or help contents, or help list.

1.2.2 THE REFERENCE MANUAL

It consists of the introductory booklet Getting Started with Stata, plus seven volumes: the User's Guide, the Graphics Manual, the Programming Manual and the Reference Manual in four volumes.

1.2.3 THE STATA TECHNICAL BULLETIN

The Stata Technical Bulletin (STB) is a printed and electronic journal with corresponding software. It contains articles written by Stata Corp., Stata users, and others. Articles have included enhancements to Stata (ado-files), tutorials on programming strategies, illustrations of data analysis techniques, discussions on teaching statistics, debates on appropriate statistical techniques, reports on other programs, along with interesting datasets, questions, and suggestions. The STB is published every two months (in January, March, May, July, September and November). Every year, the 6 issues are bound into a volume.
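As a recap of Sections 1.1.3 and 1.2.1, the commands for reallocating memory within a session and for calling up on-line help might be issued as follows. This is only a sketch in the Stata 7 syntax described above; the 20MB figure is an arbitrary choice of mine, and releases other than the one covered here may handle memory differently.

```stata
* set memory requires that no data be present in memory
clear
* Ask for 20 megabytes for the current session only (arbitrary illustrative value)
set memory 20m
* Recommended only if more memory is requested than is physically available
set virtual on
* On-line help from the Command line
help memory
```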

1.2.4 TUTORIALS

Stata provides tutorials on a variety of aspects: introduction to Stata (intro.tut), data input, graphics, tables, and procedures for statistical modeling. To run a tutorial, type tutorial tutname, where tutname is any of the following:
• contents (lists the available official Stata tutorials),
• intro (introductory tutorial),
• ourdata (description of the data provided by Stata),
• yourdata (how to input data),
• graphics (how to make graphs),
• tables (how to make tables),
• regress (estimating regression models, including 2SLS),
• anova (estimating one-, two- and N-way ANOVA and ANCOVA models),
• logit (estimating maximum-likelihood logit and probit models),
• survival (estimating maximum-likelihood survival models),
• factor (estimating factor and principal component models).

1.2.5 STATA UPDATES

The Web site http://www.stata.com contains, among other things, answers to frequently asked questions (FAQs), free additions to Stata ("Cool ado-files") and the latest official updates to Stata, which are released fairly frequently (every 3-4 weeks). The latter can be downloaded directly using the update command. Another useful command is net, which fetches and installs additions to Stata obtained from the Internet or from media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. The net search keywords command searches the Internet for user-written additions to Stata that contain the specified keywords.

1.3 VARIABLES AND OBSERVATIONS

In Stata, variables are associated with the columns of a data matrix, observations with its rows.

1.3.1 VARIABLES

Variables come in two types: alphabetic (strings) or numeric (real or integer valued). Variables are called by their name, which must be 1 to 8 characters long. The first character must be a letter or an underscore, the other characters can be letters, digits or underscores (spaces or other characters are not allowed). Stata is case sensitive (xx, xX, Xx and XX are all different names). There are a few reserved names that cannot be used: e.g. if, in, int, with. It is better to avoid using the name e for variables and beginning variable names with an underscore (all Stata built-in variables begin with an underscore). Variables can be renamed using the rename command. For example:

. rename x y

renames the variable x as y.

Associated with each type of variable is a storage type. The available storage types are:
• String variables: str#, where # is an integer between 1 and 80 specifying the number of characters in the string. The maximum length of a string is 80 characters.
• Real valued variables: double (double precision or about 16 digits of accuracy) and float (single precision or about 7 digits of accuracy).
• Integer valued variables: long (integers between -2,147,483,648 and 2,147,483,646), int (integers between -32,768 and 32,766) and byte (integers between -127 and 126).

The default for storing numeric variables is float, but Stata performs all internal calculations in double. A double occupies twice as much space as a float, a long occupies the same space as a float, an int occupies half the space of a float, and a byte occupies half the space of an int. Thus, a byte occupies 1/4 of the space of a float and 1/8 of the space of a double, which allows categorical and indicator variables to be stored very efficiently. The compress command may be used to automatically optimize the storage type of the data in memory. Notice that Stata uses '.' (a period) to denote both the decimal symbol and missing numerical values.

1.3.2 OBSERVATIONS

Observations correspond to the rows of a data matrix. Stata automatically creates and updates the built-in system variable _n, which is a counter containing the number of the current observation, and the system macro _N, which contains the total number of observations in the dataset. Notice that when the sorting of the data changes, so does the counter _n.

1.4 INPUTTING DATA

Data can be inputted into Stata by direct typing, with the data editor or from a file (an ASCII file or a Stata data file).
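Before turning to the individual input methods, the points made in Section 1.3 about storage types and about the counters _n and _N can be seen on a small artificial dataset. The sketch below is mine, not part of the original notes: the variable names id, ntot, flag and pos are made up for illustration, and set obs and gsort are used only to create and reorder the five observations.

```stata
clear
set obs 5
* Stored as float by default, although the values are small integers
generate id = _n
* _N holds the total number of observations (here 5)
generate ntot = _N
* The storage type may also be chosen explicitly
generate byte flag = id > 2
describe
* compress moves id and ntot to a smaller integer storage type
compress
describe
* After re-sorting, _n refers to positions in the new order
gsort -id
generate pos = _n
list
```

Comparing the two describe listings shows the change in storage type, while the final list shows that pos counts observations in the new (descending) order of id.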

1.4.1 DIRECT TYPING

The input command allows typing data directly into the dataset in memory. Example:

. input x1 x2 str12 x3
4 3.01 "R.A. Fisher"
2 11.5 "J. Tukey"
3 21.5 "J. Neyman"
7 .1 "D.R. Cox"
end

1.4.2 THE DATA EDITOR

The data editor corresponds to the edit command. It may be accessed by clicking the Data Editor button on the Stata toolbar. The data editor is like a standard spreadsheet with columns corresponding to variables and rows to observations. Data may be entered or modified by choosing the cell, typing the value and then pressing Enter or Tab. With the data editor, quotes around strings are unnecessary. Missing numeric values are recorded as '.' (a period), missing string values are just empty strings. The data editor initially names variables var1, var2, .... Variables may be renamed by double-clicking anywhere in the variable's column, thus bringing up the Variable information dialog.

The data editor allows copying and pasting data created by other spreadsheet or database programs. It is important to always check the numeric format of a spreadsheet before copying data to Stata. Unless otherwise instructed, Stata will interpret a number written with a comma, such as 1,314, as a string.

1.4.3 LOADING AN ASCII (TEXT) DATA FILE

Stata offers three basic commands for loading an ASCII (text) data file. The first command, infile, is a very flexible way of reading an ASCII (text) datafile from disk into memory. The data can be in either free- or fixed-format, and a single observation may span any number of input lines. The basic syntax for data in free-format (data may be separated by spaces, tabs or commas) is:

infile varlist using filename [, clear]

where varlist is a list of variable names with blanks in between (that is, varname1 varname2 ...), filename is the name of the disk datafile (including the path, if necessary) and clear is an option that clears data loaded in memory without saving them (I follow the convention of denoting items that are optional by enclosing them in square brackets). If the file name is specified without an extension, .raw is assumed.
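As an illustration of the free-format syntax, suppose a text file named scores.raw exists in the working directory. Both the file and its contents are hypothetical and are shown only as comments; the sketch assumes that a string variable is declared with an explicit str# storage type in the varlist.

```stata
* Hypothetical file scores.raw, free format:
*   4  3.01 "R.A. Fisher"
*   2 11.5  "J. Tukey"
* The extension .raw is assumed because none is given
infile x1 x2 str12 x3 using scores, clear
list
```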

If the data are in fixed-format, a dictionary (.dct) file is necessary. A dictionary is an ASCII (text) file which describes the contents of a datafile. The data may be in the same file as the dictionary or in another file. The basic syntax is:

infile using filename [, using(filename2) clear]

where filename is the name of the dictionary file and filename2 is the name of the file containing the data. If the using() option is not specified, the data are assumed to follow the dictionary in filename or, if the dictionary specifies the name of some other file, that file is assumed to contain the data. Notice that if using(filename2) is specified, filename2 is used to obtain the data even if the dictionary itself says otherwise. The basic syntax of a dictionary file is the following:

[infile] dictionary [using filename] {
    * comments may be included freely
    [type] varname
}
(data might appear here)

The second is the infix command, which reads ASCII files in fixed-column format. Its basic syntax is:

infix using filename [, using(filename2) clear]

where filename is the name of a dictionary file and filename2 is the name of the file containing the data. Again, a single observation may span any number of input lines.

The third is the insheet command, which reads ASCII files created by a spreadsheet or database program. It is somewhat easier but less flexible than infile. Regardless of the creator, this command reads ASCII files where there is one observation per line and the values are separated by tabs or commas. The first line of the file may contain the variable names. The basic syntax is:

insheet [varlist] using filename [, {comma|tab|delimiter("char")} clear]

The {comma|tab|delimiter("char")} option tells Stata how values are separated in the file (I follow the convention of denoting the available alternatives by enclosing them in curly brackets, with a vertical bar as a separator). Specifying tab or comma is not necessary because insheet can determine the separation character for itself when the character is a tab or comma. The insheet command can also determine for itself whether the file includes variable names.

1.4.4 LOADING A STATA DATA FILE

Stata data files have the default extension dta. Stata data files on disk may be loaded using the use command. The basic syntax is:

use filename [, clear]

where filename contains the full path to the data.
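The three remaining ways of loading a file might be used as sketched below. All file names are hypothetical, and each command is meant to be run on its own, since each one replaces the data currently in memory.

```stata
* Fixed-format data described by a hypothetical dictionary file wages.dct
infile using wages, clear
* Hypothetical spreadsheet export, one observation per line, names in the first row
insheet using wages.csv, comma clear
* Hypothetical Stata-format dataset wages.dta (default extension dta)
use wages, clear
```

The comma option in the second command is redundant, because insheet can detect the separator for itself, but spelling it out makes the intended file layout explicit.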

The content of a Stata data file on disk may be described without actually loading it by using the describe using filename command.

1.5 BASIC DATA MANIPULATION

I now introduce some basic facilities for manipulating data.

1.5.1 DISPLAYING DATA

Strings and values of scalar expressions may be displayed using the display command. This command may also be used interactively as a substitute for a hand calculator. Examples:

. display "this is a string"
. display 5+exp(ln(10))
. display "the value of f(x) is" 5+exp(ln(10))

The content of the dataset in memory may be displayed by using the describe command. The complementary ds command lists variable names in a compact format, whereas lookfor string helps in finding variables by searching for string among all variable names and labels. The list [varlist] command displays the values of variables. If no varlist is specified, the values of all the variables are displayed.

The display format of a variable may be specified using the command

format varlist %fmt

where %fmt is the chosen format for varlist. For example, format x %9.0g displays the variable x in g (generic numeric) format, whereas format x %8.3f displays x in f (fixed numeric) format with three decimals. Changing the display format does not affect the internal precision with which variables are stored and manipulated. The set dp comma command may be used to display numerical values using comma as the decimal character. To switch the decimal character back to period type set dp period.

1.5.2 LABELING DATA

Stata contains a number of commands for manipulating labels. Variables are labeled by using the command

label variable varname "string"

where string (typed in quotes) is up to 80 characters long. If no label is specified, any existing variable label is removed. The values of a variable are labeled by using the command

label value varname [lblname]

where lblname is the name of a value label defined through the command

label define lblname # "string" [# "string" ...]

Example:

. label define sexlbl 0 "male" 1 "female"
. label value sex sexlbl

The label dir command lists the names of value labels stored in memory, whereas label list lists the names and contents of value labels stored in memory. label drop lblnames eliminates the value labels lblnames. label drop _all eliminates all value labels. Data files are labeled by using the command

label data "string"

where string is up to 80 characters long. If no label is specified, any existing label is removed. Data labels are displayed when the data are used or described.

1.5.3 SUMMARIZING DATA

The command summarize [varlist] calculates and displays a variety of univariate summary statistics (number of nonmissing observations, mean, standard deviation, minimum and maximum value). If no varlist is specified, summary statistics are calculated for all the variables in the data. The command summarize [varlist], detail produces additional statistics including skewness, kurtosis, the four smallest and four largest values, along with various percentiles.

1.5.4 CREATING NEW VARIABLES

New variables are created using the generate command. The basic syntax of this command is:

generate newvar = exp [options]

where exp is an expression and options are optional instructions that may restrict the application of exp. For example, to generate the new variable y using positive values of an existing variable x, type:

. generate y = x*x + log(x) if x>0

where *, + and > are examples of Stata operators, log(x) is an example of a Stata function, and if x>0 is a qualifier that restricts the scope of the command to the observations for which x > 0.
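The labeling, summarizing and generating commands of Sections 1.5.2 to 1.5.4 can be tried together on a small artificial dataset. The sketch below is only illustrative: the four observations and the label texts are made up, and set obs and mod() are used merely to create them.

```stata
clear
set obs 4
* Artificial data: sex alternates 1, 0, 1, 0 and x takes the values -1, 0, 1, 2
generate byte sex = mod(_n, 2)
generate x = _n - 2
label data "Artificial example"
label variable sex "Sex of respondent"
label define sexlbl 0 "male" 1 "female"
label values sex sexlbl
* y is left missing wherever the qualifier x>0 fails
generate y = x*x + log(x) if x>0
summarize x y, detail
list
```

Because of the if qualifier, y is computed only for the two observations with positive x, so summarize reports fewer nonmissing observations for y than for x.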

To generate the new variable y containing lagged values of x, type:

. generate y = x[_n-1]

Typing generate y = x[1] sets every observation of y equal to the first observation in x, whereas typing generate y = x[_n] and generate y = x are equivalent.

The Stata operators are:
• arithmetic operators: + (addition), - (subtraction), * (multiplication), / (division), ^ (power).
• string operators: + (string concatenation), for example the expression "abc" + "def" produces the string "abcdef".
• relational operators: < (less than), > (greater than), <= (less or equal), >= (greater or equal), == (equal), ~= (not equal).
• logical operators: & (and), | (or), ~ (not).

The order of evaluation follows the standard rules. Parentheses may be used to force a different order of evaluation.

Functions are used in expressions. The arguments of a function are enclosed in parentheses. If there are multiple arguments, they are separated by commas. The argument(s) of a function may be any expression, including other functions. Functions return missing when the value of the function is undefined. Stata has built in a number of functions:
• Mathematical functions, for example exp(x), log(x) or ln(x), sqrt(x), abs(x) and the main trigonometric functions.
• Statistical functions: density, distribution functions and quantile functions of various probability distributions, both discrete and continuous. Examples:
  . display Binomial(5,x,.25) (cumulative binomial with parameters n = 5 and π = .25)
  . display normden(x) (standard normal density)
  . display norm(x) (cumulative standard normal)
  . display invnorm(p) (standard normal quantile function)
  . display binorm(x,y,.5) (cumulative bivariate Gaussian with zero means, unit variances and correlation ρ = .5)
  If X denotes the name of a continuous distribution, then Stata usually provides X() (cumulative distribution function), Xtail() (upper tail cumulative distribution function), invX() (quantile function) and invXtail() (upper quantile function). The currently available distributions include chi2(df,x) (chi-square distribution with df degrees of freedom), nchi2(df,L,x) (noncentral chi-square distribution with df degrees of freedom and noncentrality parameter L), F(df1,df2,x) (F distribution with df1 and df2 degrees of freedom), norm(x) (standard Gaussian) and t(df,x) (t distribution with df degrees of freedom).

• Pseudo-random number generator: uniform(), which generates uniformly distributed pseudo-random numbers on the interval [0,1). It takes no arguments, and is George Marsaglia's KISS (Keep It Simple Stupid) generator. Pseudo-random numbers according to any other continuous distribution may be generated through the inverse probability integral transform. For example: pseudo-random numbers according to the standard normal distribution may be generated with the invnorm(uniform()) command, where the function invnorm evaluates the quantile function of the standard normal.
• String functions (which apply to string variables), for example lower(s) (returns the lowercased variant of s), real(s) (converts s into a numeric value), string(n) (converts n into a string), substr(s,n1,n2) (returns the substring of s starting at n1 for a length of n2; if n2 = ., the remaining portion of the string is returned), and upper(s) (returns the uppercased variant of s).
• Special functions, for example float(x) (returns the value of x rounded to float storage type), int(x) (returns the integer part of x), max(x1,x2,...,xn) and min(x1,x2,...,xn) (return respectively the maximum and the minimum of the arguments, ignoring missing values), round(x,y) (returns x rounded into units of y), sign(x) (returns -1 if x < 0, 0 if x = 0, 1 if x > 0, and . if x = .), and sum(x) (returns the running sum of x, treating missing values as zero).
A variety of date and time-series functions are also available, as well as matrix functions returning scalars (see Section 4.3).

1.5.5 CHANGING AND RENAMING VARIABLES
The content of an existing variable may be changed by using the replace command, whereas the name of an existing variable may be changed (its contents remain unchanged) by using the rename command.
The recode varname command changes the values of varname according to the rules specified. For example, recode x 1=2 3=4 changes 1 in x to 2 and 3 to 4, whereas recode x 1 3/5 = 6 changes 1, 3, 4 and 5 to 6.
Given a string variable named varname, the command
encode varname, generate(newvar)
generates a new numeric variable named newvar based on varname, creating at the same time (or just using as necessary) the value label newvar. Do not use encode if

varname contains numbers that merely happen to be stored as strings (e.g., the number '1.314'). In this case use instead
generate newvar = real(varname)
The decode command creates a new string variable named newvar based on the "encoded" numeric variable varname and its value label.

1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS
The drop command eliminates variables or observations from the data in memory. To eliminate variables use
drop varlist
To eliminate observations use
drop in range [if exp]
The drop _all command eliminates all variables and observations in memory. The keep command works the same as drop except that we specify the variables or observations to be kept rather than those to be deleted. The clear command essentially resets Stata and is equivalent to the set of commands:
. version 7.0
. drop _all
. label drop _all (drop all labels in memory)
. scalar drop _all (drop all scalar variables in memory)
. matrix drop _all (drop all matrices in memory)
. eq drop _all (drop all equations in memory)
. constraint drop _all (drop all constraints in memory)
. discard (drop all programs in memory)

1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET
The set obs # command changes the number of observations in the current dataset to #, where # is an integer at least as large as the current number _N of observations. If there are variables in memory, the values of all new observations are set to missing. For example,
. drop _all
. set obs 100
. gen x = _n
clears memory, makes 100 observations and assigns the variable x the values from 1 to 100.

1.6 OUTPUTTING DATA
Stata offers three basic commands for outputting data, corresponding to the use, infile and insheet commands discussed in Section 1.4.

The first command
save [filename] [, options]

stores the dataset currently in memory on disk in Stata format under the name filename. If filename is not specified, the name under which the data was last known to Stata is used. If filename is specified without an extension, .dta is assumed. The available options are nolabel old replace all. The old option enables a dataset to be readable by someone with Stata 6.0, the option replace permits save to overwrite an existing dataset.
The second command
outfile [varlist] using filename [, options]
writes data to a disk file in ASCII (text) format. The data saved by outfile can be read back by infile. If filename is specified without an extension, .raw is assumed unless the dictionary option is specified, in which case .dct is assumed.
The third command
outsheet [varlist] using filename [, options]
writes data in tab- or comma-separated ASCII format into a file. This is the format that most spreadsheet programs prefer. If filename is specified without an extension, .out is assumed.

1.7 LOG FILES

The log command echoes a copy of a Stata session to a file or a device. More precisely:
log using filename [, options]
opens the file filename and echoes a copy of the Stata session to the file. If filename is specified without an extension, .smcl is assumed (SMCL is Stata's output language). The available options are noproc append replace. The log close command stops logging the session and closes the file, log off temporarily stops logging the session leaving the file open, while log on resumes logging to the file.
The set log command controls the dimensions of output sent to the log. Its format is:
set {display|log} {linesize|pagesize} #
where # is the line or page length, for example set linesize 120 or set pagesize 40.
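The commands of Sections 1.5.7 to 1.7 can be combined into one short session. The following is only a sketch: the file names mysession and mydata are hypothetical, and the function names are those of the Stata version described in the text.

```stata
* open a log file (mysession.smcl), overwriting any earlier copy
log using mysession, replace
* build a small artificial dataset
drop _all
set obs 100
generate x = _n
generate z = invnorm(uniform())
summarize x z
* store the data in Stata format (mydata.dta)
save mydata, replace
* store the same data as ASCII text (mydata.raw)
outfile x z using mydata, replace
* stop logging and close the log file
log close
```

The replace option is used throughout so that the session can be rerun without an error caused by files that already exist.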

2

Stata Commands
In this chapter I describe the syntax of some frequently used Stata commands. My selection is of course subjective.

2.1 GENERAL SYNTAX

The general syntax of a Stata command is:
command [varlist] [= exp] [weight] [if exp] [in range] [, options]
If no varlist appears, the command assumes a varlist of _all, that is, the command is applied to all the variables in the data.
The option = exp specifies the value to be assigned to a variable. It is most often used with generate and replace. For example:
. replace newvar = oldvar+2
Many commands take command-specific options. A single comma separates a command's options from the rest of the command.
Most commands can be abbreviated. For example, one may type gen or simply g instead of generate, summ or simply su instead of summarize, des or simply d instead of describe, l instead of list, etc. See the on-line help or the Reference Manual for the shortest allowable abbreviation of a command.
The F-keys may be used to create shortcuts to some commands. For example, the F3 key comes defined as describe followed by Enter.

2.1.1 BY

Most Stata commands allow the by varlist: prefix. This causes command to be repeated for each subset of the data for which the values of the variables in varlist are equal. The use of by requires the data to be sorted by varlist beforehand. Example:
. sort x
. by x: summarize y
Not all commands allow the by varlist: prefix. Some replace it with by(groupvar) in the options. For example, the syntax of the ttest command is:

ttest varname [if exp] [in range], by(groupvar) [unequal welch level(#)]

2.1.2 WEIGHTS
The option weight indicates the weight to be attached to each observation. The syntax of weight is [weightword = exp], where weightword is either weight (the default treatment of weights) or one of fweight, pweight, aweight and iweight, which correspond to the four kinds of weights that Stata understands (although not every command supports all four of them):
1. frequency weights (fweight) are integer-valued and indicate multiple observations,
2. probability or sampling weights (pweight) are inversely proportional to the sample inclusion probabilities,
3. analytic weights (aweight) are inversely proportional to the variance of an observation,
4. importance weights (iweight) indicate the relative "importance" of an observation.
The default treatment (weight) is each command's idea of what the "natural" weights are and is one of the above weight types.

2.1.3 IF AND IN
The if exp qualifier restricts the scope of the command to those observations for which the value of the expression is true. Example:
. replace y = x+2 if x>0
The in range qualifier restricts the scope of the command to a specific observation range, where range is any of #, #/#, #/l or f/#. For example, to summarize the values of x and y for the first 10 observations:
. summarize x y in 1/10
. summarize x y in f/10
whereas
. summarize x y in 10
refers to the 10th observation only.

2.1.4 QUIETLY AND NOISILY
Typing quietly command suppresses all terminal output for the duration of command. noisily command turns back on terminal output, if appropriate, for the duration of command. For example:
. quietly by x: generate y = sum(z)
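The by prefix, the if and in qualifiers, weights and quietly can be tried together on artificial data. This is a hedged sketch rather than part of the original text: the variable names g, x and w are illustrative, and the use of the saved result r(mean) after summarize is an assumption not discussed above.

```stata
* artificial data; all variable names are illustrative
drop _all
set obs 200
generate g = mod(_n,4)
generate x = invnorm(uniform())
generate w = 1 + mod(_n,3)
* by requires the data to be sorted by the grouping variable
sort g
by g: summarize x
* restrict the command with if, with in, and with both
summarize x if x>0
summarize x in 1/50
summarize x if x>0 in 1/50
* frequency weights must be integer-valued
summarize x [fweight=w]
* quietly suppresses the output; the result is still saved
* (r(mean) is an assumption, not covered in the text)
quietly summarize x
display r(mean)
```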

2.2 BASIC DATA COMMANDS
I discuss nine basic commands: describe, list, drop and keep, generate and its extension egen, replace, sort and gsort.

2.2.1 DESCRIBE
This command displays a summary of the contents of either the data in memory or the data stored in a Stata-format dataset. Its syntax is
describe [varlist] [, short detail fullnames numbers]
in the first case, and
describe using filename [, short detail]
in the second case, where short suppresses the specific information about each variable, detail includes more detailed information (the width of a single observation, the maximum number of observations holding the number of variables constant, the maximum number of variables holding the number of observations constant, the maximum width for an observation, and the maximum size of the dataset), fullnames displays the full names of the variables (the default is to present an abbreviation when the variable name is longer than 15 characters), and numbers presents the variable number along with the variable name. The numbers and fullnames options may not both be specified together.

2.2.2 LIST
This command displays the values of variables. Its syntax is:
list [varlist] [if exp] [in range] [, [no]display nolabel noobs]
where [no]display forces the format into display or tabular (nodisplay) format (if one of these two options is not specified, then Stata chooses one based on its judgment), nolabel causes the numeric codes rather than label values to be displayed, and noobs suppresses printing of the observation numbers. Examples:
. list in 1/10
. list x y
. list x y in 1/10
. list if x>20
. list x y if z>20
. list x y z if z>20 in 1/10

2.2.3 DROP AND KEEP
The drop command eliminates variables or observations from the data in memory. Its syntax is:
drop varlist

drop if exp
drop in range [if exp]
The keep command works exactly the same as drop except that one specifies the variables or observations to be kept. Examples:
. drop in 1/33 (drop first 33 observations)
. drop in -10/l (drop last 10 observations)
. drop if x<21
. keep in 34/l
. keep if x>=21
. sort y
. by y: keep if _n==_N

2.2.4 GENERATE AND EGEN
The generate command creates a new variable. Its syntax is:
generate [type] newvar[:lblname] = exp [if exp] [in range]
If type is not specified, float is the default (the default type may be changed using the set type command). To prevent Stata from returning an error when string variables are generated, type must be set to str#. If missing values are generated, the number of missing values in newvar is always reported. Examples:
. generate x2 = x*x
. generate bigz = z>100000 & z~=.
. gen double w = x/y
. gen xlag = x[_n-1]
. gen u = uniform() (U(0,1) pseudo-random numbers)
. gen z = invnorm(uniform()) (N(0,1) pseudo-random numbers)
The egen command provides an extension to generate. Its syntax is:
egen [type] newvar = fcn(stuff) [if exp] [in range] [, options]
egen creates newvar equal to fcn(stuff). Depending on fcn(), stuff refers to an expression, a list of variables, or a list of numbers. The options are similarly function dependent. Note that egen may change the sort order of the data. Important examples of egen functions include:
• count(exp) [, by(varlist)] creates a constant (within varlist) containing the number of nonmissing observations of exp.
• diff(varlist) creates an indicator variable equal to 1 where the variables in varlist are not equal and 0 otherwise. It may not be combined with by.

• group(varlist) [, missing label truncate(num)] creates a single variable taking on values 1, 2, ..., for the groups formed by varlist. The label option returns integers from 1 up according to the distinct groups of varlist in sorted order. The integers are labeled with the values of varlist, or the value labels if they exist. The truncate() option truncates the values contributed to the label from each variable in varlist to the length specified by the integer argument num.
• iqr(exp) [, by(varlist)] creates a constant (within varlist) containing the interquartile range of exp. The same syntax holds for a number of other functions with argument exp, such as kurt (coefficient of kurtosis), mad (median absolute deviation from the median), max (maximum value), mdev (mean absolute deviation from the mean), mean (mean), median (median), min (minimum value), sd (standard deviation), skew (coefficient of skewness), sum (sum).
• ma(exp) [, t(#) nomiss] creates a #-period moving average of exp. If t() is not specified, t(3) is assumed. Notice that # must be odd and exp must not produce missing values. It may not be combined with by.
• pctile(exp) [, p(#) by(varlist)] creates a constant (within varlist) containing the #-th percentile of exp. If p() is not specified, 50 is assumed, meaning medians.
• rmax(varlist) gives the maximum value in varlist for each observation (row). The same syntax holds for a number of other functions with argument varlist, such as rmean (row mean), rmin (row minimum), rmiss (row number of missing values).
• std(exp) [, mean(#) std(#)] creates the standardized value of exp using the specified mean and standard deviation. The default is mean(0) and std(1), producing a variable with zero mean and unit variance. It may not be combined with by.
Examples:
. egen avgx = mean(x)
. gen dev = x-avgx
. egen x = median(x2-x1) (expression, - means subtraction)
. egen y = rmean(x1 x2 x3)
. egen y = rmean(x1-x3) (varlist, - means through)
. egen sdx = sd(x)
. egen stdx = std(x), mean(100) std(10)
. egen sumx = sum(x), by(y)
. egen xy = group(x y)

2.2.5 REPLACE
This command changes the contents of an existing variable. Its syntax is:

replace oldvar = exp [if exp] [in range] [, nopromote]
where nopromote prevents replace from promoting the variable type to accommodate the change. Examples:
. replace y = 25 in 1007
. replace z=. if z<=0
. sort z
. by z: gen avgx = sum(x)/sum(x~=.)
. by z: replace avgx = avgx[_N]

2.2.6 SORT AND GSORT
The sort command arranges the observations of the current data in ascending order of the values of the variables in varlist. Its syntax is
sort varlist [in range]
There is no limit to the number of variables in varlist and each variable can be numeric or string. Missing values are interpreted as being larger than any other number and are thus placed last (there is an exception: when sorting on a string variable, null strings are placed first). The dataset is marked as being sorted by varlist unless in range is specified. Examples:
. sort personid
. sort lstname frstname midinitl
Unlike sort, which can produce only ascending-order arrangements, gsort may arrange the observations in either ascending or descending order. Its syntax is
gsort [+|-]varname [[+|-]varname [...]] [, generate(newvar) mfirst]
The observations are placed in ascending order of varname if + or nothing is typed in front of the name and in descending order if - is typed. The generate(newvar) option creates newvar containing 1, 2, 3, ..., for each of the groups denoted by the ordered varnames. This is useful when one wishes to use the ordering with a subsequent by. The mfirst option specifies that missing values are to be placed first in descending orderings rather than last. Examples:
. gsort x (same as sort x)
. gsort +x (same as gsort x)
. gsort -x (reverse sort)
. gsort -name (reverse alphabetical)
. gsort x y (ascending x, ascending y)
. gsort x -y (ascending x, descending y)

. gsort -x, gen(revx)
. quietly by revx: gen rcum = _N if _n==1
. replace rcum = sum(rcum)
. replace rcum = rcum/rcum[_N]

2.3 COMBINING DATA
I discuss two commands: append and merge.

2.3.1 APPEND
This command appends a Stata-format dataset stored on disk to the end of the dataset in memory. Its syntax is:
append using filename [, nolabel]
where nolabel prevents copying the value label definitions from the disk dataset. Even if this option is not specified, label definitions from the disk dataset never replace definitions already in memory. If filename is specified without an extension, .dta is assumed.

2.3.2 MERGE
This command joins corresponding observations from the dataset currently in memory (called the master dataset) with those from the Stata-format dataset stored as filename (called the using dataset) into single observations (if filename is specified without an extension, .dta is assumed). It can perform both one-to-one and match merges. Its syntax is:
merge [varlist] using filename [, nolabel update replace nokeep _merge(varname)]
where nokeep causes merge to ignore observations in the using data that have no corresponding observation in the master (the default is to add these observations to the merged result and mark them with _merge==2) and _merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _merge(_merge), which adds a new variable _merge to the data whose values are:
_merge==1 (obs. from master data)
_merge==2 (obs. from using data)
_merge==3 (obs. from both master and using data)
Examples:
. use data1 (one-to-one merge)
. merge using data2
. tab _merge
. use data2 (match merge)
. sort x

. save data2, replace
. use data1
. sort x
. merge x using data2
. tab _merge

2.4 RESHAPING DATA
I discuss five commands: collapse, contract, expand, fillin and reshape.

2.4.1 COLLAPSE
This command replaces the data in memory with a new dataset consisting of the means, medians, etc., of the specified variables. Its syntax is:
collapse clist [weight] [if exp] [in range] [, by(varlist) cw fast]
where clist is either
[(stat)] varlist [[(stat)] ...]
[(stat)] target_var=varname [target_var=varname ...] [[(stat)] ...]
or any combination of the varlist or target_var forms, and stat is one of the following: mean (means), sd (standard deviations), sum (sums), rawsum (sums ignoring optionally specified weights), count (number of nonmissing observations), max (maxima), min (minima), median (medians), p# (#th percentile), iqr (interquartile range). If stat is not specified, mean is assumed. The by(varlist) option specifies the groups over which the means, etc., are to be calculated, cw specifies casewise deletion (if not specified, all observations possible are used for each calculated statistic) and fast specifies that collapse not go to extra work so that it can restore the original data should the user press Break.

2.4.2 CONTRACT
This command makes datasets of frequencies. It replaces the data in memory with a new dataset consisting of all combinations of varlist that exist in the data together with a new variable that contains the frequency of each combination. Its syntax is:
contract varlist [weight] [if exp] [in range] [, freq(varname) zero nomiss]
where freq(varname) specifies a name for the frequency variable (if not specified, _freq is used; the name must be new), zero specifies that combinations with frequency zero are wanted, and nomiss specifies that observations with missing values on any of the variables in varlist will be dropped (if not specified, all observations possible are used).
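Because both collapse and contract replace the data in memory, it helps to see them applied to a dataset that has first been saved. The sketch below uses artificial data; the variable names and the file name original are hypothetical.

```stata
* artificial data; names are illustrative
drop _all
set obs 120
generate g = mod(_n,3)
generate h = mod(_n,2)
generate x = uniform()
* both commands overwrite the data in memory, so keep a copy
save original, replace
* one observation per value of g, with three statistics of x
collapse (mean) mx=x (sd) sx=x (count) n=x, by(g)
list
* reload the copy and build a dataset of frequencies instead
use original, clear
contract g h, freq(n)
list
```

After collapse the dataset has three observations (one for each value of g); after contract it has one observation for each combination of g and h that occurs in the data, with the frequency stored in n.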

2.4.3 EXPAND
This command replaces each observation in the current dataset with n copies of the observation, where n is equal to the integer part of the required expression (if the expression is less than one or equal to missing, then it is interpreted as if it were one, and the observation is retained but not duplicated). Its syntax is:
expand [=]exp [if exp] [in range]
Example:
. expand 2

2.4.4 FILLIN
This command rectangularizes a dataset by adding observations with missing data so that all interactions of the variables in varlist exist. It also adds the variable _fillin to the data (with value 1 for created observations and 0 for previously existing observations). Its syntax is:
fillin varlist

2.4.5 RESHAPE
This command converts data from wide to long form and vice versa. Its basic syntax is:
reshape long varnames, i(varlist) [j(varname) string]
reshape wide varnames, i(varlist) [j(varname) string]
where i(varlist) specifies the variable(s) whose unique values denote a logical observation, j(varname) specifies the variable whose unique values denote a subobservation, and string specifies that the j() may contain string values. Examples:
. reshape long x1 x2, i(y) j(z) (converts from wide to long)
. reshape wide (converts back to wide)
. reshape ..., i(z) (single i() variable)
. reshape ..., i(z1 z2) (two i() variables)
. reshape long x, i(y) j(z 1-3 5) (specifying j() values)
. reshape long x, i(y) j(z) string (allow string variables in j())

2.5 BASIC SAMPLE STATISTICS
I discuss seven commands: count, summarize, means, centile, cumul, correlate and regress.

2.5.1 COUNT
This command counts observations satisfying the specified conditions. Its syntax is:

count [if exp] [in range]
If no condition is specified, count displays the number of observations in the dataset. Examples:
. count if y<0
. by x: count if y<0

2.5.2 SUMMARIZE
This command reports a variety of univariate summary statistics. Its syntax is:
summarize [varlist] [weight] [if exp] [in range] [, {detail|meanonly} format]
where detail produces additional statistics (including skewness, kurtosis, the four smallest and four largest values, along with various percentiles), meanonly suppresses display of the results and calculation of the variance (it is allowed only when detail is not specified) and format requests that the summary statistics be displayed using the display format associated with the variables rather than the default g format.

2.5.3 MEANS
This command reports the arithmetic, geometric, and harmonic means, along with their respective confidence intervals, for the specified variables. Its syntax is:
means [varlist] [if exp] [in range] [, add(#) only level(#)]
where add(#) adds the value # to each variable in varlist before computing the means and confidence intervals (this may be useful when analyzing variables with nonpositive values), only modifies the action of the add() option (if specified, the add() option only adds # to variables with at least one nonpositive value) and level(#) specifies the percentage confidence level for confidence intervals. The ci command may be used if one simply wants arithmetic means and corresponding confidence intervals.

2.5.4 CENTILE
This command reports the (per)centiles of the specified variables and their confidence intervals. By default, confidence intervals are obtained using a binomial method that makes no assumptions as to the underlying distribution of the variable. The syntax is:
centile [varlist] [if exp] [in range] [, centile(numlist) cci normal meansd level(#)]
where centile(numlist) specifies the centiles to be reported, for example centile(25 50 75) (if not specified, medians are reported), cci (conservative confidence interval) prevents centile from interpolating when calculating the distribution-free (binomial-based) confidence limits, normal specifies that confidence intervals are to be obtained assuming that both the data and the centiles are normally distributed, and meansd

calculates confidence intervals assuming that the estimated centiles themselves are normally distributed.

2.5.5 CUMUL
This command creates a new variable containing the empirical distribution function (edf) of a variable. Its syntax is:
cumul varname [weight] [if exp] [in range], gen(newvar) [freq by(varlist)]
where gen(newvar) specifies the name of the new variable to be created (it is not optional), freq requests the edf to be in frequency units, otherwise it is normalized so that newvar is 1 for the largest value of varname, and by(varlist) specifies that edf's be generated separately for each by-group. The related command
pctile newvar = exp
creates a new variable containing the percentiles of exp, where exp is typically just another variable.

2.5.6 CORRELATE
This command reports the covariance or correlation matrix of the specified variables. Observations are excluded from the calculation due to missing values on a casewise basis. The syntax is:
correlate [varlist] [weight] [if exp] [in range] [, means noformat covariance wrap]
where means causes summary statistics (means, standard deviations, minima and maxima) to be displayed along with the matrix, noformat displays the summary statistic requested by the means option in g format regardless of the display formats associated with the variables, covariance displays the covariances rather than the correlation coefficients, and wrap requests that no action be taken on wide matrices to make them readable.

2.5.7 REGRESS
This command estimates linear regression models with a single response or dependent variable. Estimation is carried out by least squares (either ordinary least squares or weighted least squares). Its basic syntax is:
regress yvar [xvars] [weight] [if exp] [in range] [, level(#) noconstant regress_options]
where level(#) specifies the confidence level (in percent) for the regression parameters (the default is 95%), noconstant suppresses the constant term (intercept) in the regression, and the additional regress_options are described in more detail in Section 6.1.
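The correlate and regress options described above may be tried on simulated data for which the true coefficients are known. This is a minimal sketch under that assumption; the variable names and the data-generating equation are illustrative and do not come from the text.

```stata
* simulate y = 1 + 2*x1 - x2 + noise; names are illustrative
drop _all
set obs 500
generate x1 = invnorm(uniform())
generate x2 = invnorm(uniform())
generate y = 1 + 2*x1 - x2 + invnorm(uniform())
* correlation matrix, then covariance matrix
correlate y x1 x2
correlate y x1 x2, covariance
* least squares with the default 95% confidence intervals
regress y x1 x2
* 90% confidence intervals
regress y x1 x2, level(90)
* regression without the constant term
regress y x1 x2, noconstant
```

With 500 observations the estimated coefficients in the first regression should lie close to 1, 2 and -1, although the exact values depend on the pseudo-random draws.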

2.6 TABLES
Stata offers three basic commands for producing tables: table, tabulate and tabulate, summarize.

2.6.1 TABLE
This command provides tables of summary statistics. Its syntax is a little involved:
table rowvar [colvar [supercolvar]] [weight] [if exp] [in range] [, contents(clist) by(superrow_varlist) cw row col scol format(%fmt) center left concise missing replace name(string) cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#)]
where contents(clist) specifies the content of the table's cells (up to 5 statistics may be specified, clist is as in collapse; if contents() is not specified it is assumed to be contents(freq)), row specifies a row is to be added to the table reflecting the total across rows, col specifies a column is to be added to the table reflecting the total across columns, format(%fmt) specifies the display format for presenting numbers in the table's cells, center specifies results are to be centered in the table's cells (the default is to right align), left specifies that column labels are to be left aligned (the default is to right align), missing specifies that missing statistics are to be shown in the table as periods (the default is to leave them blank). See the on-line help or the Reference Manual for a description of the other options.

2.6.2 TABULATE
This command provides one- and two-way tables of frequency counts along with various measures of association, including the common Pearson chi-squared, the likelihood ratio chi-squared, Cramer's V, Fisher's exact test, Goodman and Kruskal's gamma, and Kendall's tau-b. The syntax for one-way tables is:
tabulate varname [weight] [if exp] [in range] [, generate(varname) matcell(matname) matrow(matname) missing nofreq nolabel plot subpop(varname)]
The syntax for two-way tables is:
tabulate varname1 varname2 [weight] [if exp] [in range] [, all cell chi2 column exact gamma lrchi2 matcell(matname) matcol(matname) matrow(matname) missing nofreq nolabel row taub V wrap]
where all is equivalent to specifying chi2 lrchi2 V gamma taub, cell displays the relative frequency of each cell in a two-way table, chi2 calculates and displays Pearson's chi-squared for the hypothesis that the rows and columns in a two-way table are independent, column displays in each cell of a two-way table the relative frequency of that cell within its column, exact displays the significance calculated by Fisher's exact test, gamma displays Goodman and Kruskal's gamma along with

its asymptotic standard error, generate(varname) creates a set of indicator variables reflecting the observed values of the tabulated variable, lrchi2 displays the likelihood-ratio chi-squared statistic (the request is ignored if any cell of the table contains no observations), matcell(matname) saves the reported frequencies in the matrix matname, matcol(matname) saves the numeric values of the column stub in the vector matname, matrow(matname) saves the numeric values of the row stub in the vector matname, missing requests that missing values be treated like other values in calculations of counts, percentages, and other statistics, nofreq suppresses printing the frequencies, nolabel causes the numeric codes to be displayed rather than the value labels, plot produces a bar chart of the relative frequencies in a one-way table, replace indicates that the immediate data specified as arguments to the command are to be left as the current data in place of whatever data was there, row displays in each cell of a two-way table the relative frequency of that cell within its row, subpop(varname) excludes observations for which varname = 0 in tabulating frequencies, taub displays Kendall's tau-b along with its asymptotic standard error, and V (note capitalization) displays Cramer's V.

2.6.3 TABSUM
The tabulate, summarize() command produces one- and two-way tables of summary statistics. Although table is more flexible, tabulate, summarize() is faster. Its syntax is:
tabulate varname1 [varname2] [weight] [if exp] [in range], summarize(varname3) [[no]means [no]standard [no]freq [no]obs wrap nolabel missing]
where summarize(varname3) identifies the name of the variable for which summary statistics are to be reported (if this option is not specified, then a table of frequencies is produced), [no]means includes only or suppresses only the means from the table (the summarize() table normally includes the mean, standard deviation, frequency, and, if the data is weighted, the number of observations), [no]standard includes only or suppresses only the standard deviations from the table, [no]freq includes only or suppresses only the frequencies from the table, [no]obs includes only or suppresses only the reported number of observations from the table, and missing requests that missing values of varname1 and varname2 be treated as categories rather than as observations to be omitted from analysis. Examples:
. tabulate y, summarize(z) (one-way table)
. tabulate y1 y2, summarize(z) (two-way table)
. sort x (n-way table)
. by x: tabulate y1 y2, summarize(z) means nofreq
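The three table commands can be compared on the same artificial data. The sketch below is only illustrative: the variable names a, b and z are hypothetical, and the contents() list is written in the form mean varname, which is an assumption about the table syntax beyond what is stated above.

```stata
* artificial data; names are illustrative
drop _all
set obs 300
generate a = 1 + mod(_n,3)
generate b = uniform() > .5
generate z = invnorm(uniform())
* one-way table of frequencies
tabulate a
* two-way table with row percentages and Pearson's chi-squared
tabulate a b, row chi2
* two-way table of means and standard deviations of z
tabulate a b, summarize(z) nofreq
* the same statistics with table (contents() form is an assumption)
table a b, contents(mean z sd z)
```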


3

Graphics
Stata graphics are not very fancy, but allow considerable flexibility and are relatively simple to use.

3.1 BASIC SYNTAX AND GRAPHIC STYLES
The basic syntax of the graph command is:
graph [varlist] [weight] [if exp] [in range] [, options]
Typed without arguments, graph recalculates and redisplays the last graph. The syntax to review a saved Stata graph is:
graph using filename [filename] [, options]
An existing Stata graph can be translated to another format (e.g. PostScript) using the translate command and printed using the print command. Examples:
. translate mygraph.gph mygraph.eps (converts to Encapsulated PostScript)
. translate mygraph.gph mygraph.prn (converts to printer format)
. translate mygraph.gph mygraph.wmf (converts to Windows metafile)
. print mygraph.gph (print mygraph)
. print @Graph (print the graph in the Graph window)
Notice that graphs may also be produced by other Stata commands such as kdensity (nonparametric density estimation), ksm (regression smoothers) and logistic (logistic regression diagnostic plot).
Stata offers eight basic graph styles:
1. histogram,
2. oneway (one-way scatterplots),
3. twoway (two-way scatterplots),
4. matrix (two-way scatterplot matrices),
5. box (box plots),

6. bar (bar charts),
7. pie (pie charts),
8. star (star charts).

After discussing some options of the graph command that are common across all styles (Section 3.2), I shall focus on the first four styles.

3.2 COMMON GRAPH OPTIONS

In this section I briefly discuss some of the general options of the graph command. I will henceforth refer to these options as common_options.

• Saving a graph to disk: saving(filename [, replace]). If an extension is not specified, .gph is assumed.

• Printing a graph: after the graph command, use the Print button in the Stata toolbar.

• Multiple-imaging options: by(varname) is allowed for all styles except matrix and star. It requests that graphs be drawn separately for the groups defined by varname and be combined into a single image.

• Specifying titles: graph allows up to two titles on every side of the graph (top, bottom, left and right), denoted by the options t1, t2, b1 (same as title or ti), b2, l1, l2, r1, and r2. The argument of each option is some text enclosed in quotes. Quotes can be omitted if the text contains no special character. The first title (e.g. t1) is always the farther from the figure, the second (e.g. t2) is the closest. Example:

    . graph y x, l1(y) b2(x) title("Figure 1: x-y scatterplot")

• Setting the gap: gap(#) sets the amount of space between the left title and the values along the y-axis. The default is gap(8).

• Labeling axes: by default, graph labels just the minimum and maximum of each variable. More aesthetically pleasing results may be obtained with the options {x|y|r|t}label[(#,...,#)]. Typed without arguments, {x|y|r|t}label chooses "round" values to be labelled.

• Adding ticks: graph automatically places tick marks on axes anywhere they are labelled. Additional ticking may be obtained with the options {x|y|r|t}tick[(#,...,#)].

• Adding lines: lines across the graph may be drawn with the options {x|y|r|t}line[(#,...,#)]. The yline and rline options draw horizontal lines, xline and tline draw vertical lines.

Example:

    . graph y x, gap(4) xlabel ylabel(0,...,1) ytick(...) yline(...)

• Setting the scale: by default, graph scales each axis according to the minimum and maximum of all things that go on the axis (data, labeling or ticking). The options {x|y|r}scale(#,#) may be used to widen (but never to narrow) the scale used for drawing a graph on any style that has an axis.

• Creating log scales: the log option is used with the histogram style, the {x|y|r}log options with the twoway style.

• Setting the axes rendition: by default, graph draws an axis on any style that has an axis. The border option replaces axes with borders. The noaxis option suppresses both axes and borders.

• Plotting symbols: graph uses the following plotting symbols to specify the location of a point on a scatterplot: O (large circle, default for twoway), S (large square), T (large triangle), o (small circle, default for twoway with by, or matrix), d (small diamond), p (small plus), . (dot), i (invisible), [varname] (variable to be used as text), _n (observation number). The sequence of plotting symbols for the variables in varlist is specified with the option symbol(s...s), where s is any of the above symbols. By default, if symbol is not specified, twoway chooses the symbols O, S, T, d, p, and . for the remainder. Combined with by(), it chooses instead o, and . for the remainder.

• Connecting points: graph offers the following alternatives to connect points on a scatterplot: . (do not connect, default), l (straight lines between points), L (straight line between ascending x-points), m (connect median bands using straight lines), s (connect median bands using cubic splines), J (connect rectilinearly making steps), || (connect two variables vertically (high-low)), II (same as || but cap bottom and top of line). How the variables in varlist are connected is specified by the option connect(s...s), where s is any of the above alternatives. If connect is not specified, points are not connected. The connect option connects points in the order of the data, not the order of the x-axis. graph includes a sort option that automatically sorts the data according to the x-axis before graphing.

• Line patterns: for each line type, one can specify the pattern of the line by adding a [pattern] after the line type, where pattern is any combination of the following: l (a solid line, the default), _ (a long dash), - (a medium dash), . (a short dash, almost a dot) and # (a space). Example:

    . graph y1 y2 y3 x, symbol(...) connect(ll[_]l[-])

The set textsize # command controls the size of the text used in a graph. This is not an option but a separate command that must be issued before graph.

I refer to the Graphics Manual for other common options and to the remainder of this chapter for options specific to the various graph styles.

3.3 HISTOGRAMS

This is the default for graph when only one variable is specified. The basic syntax is:

    graph [variable] [weight] [if exp] [in range], histogram [common_options bin(#) {freq|percent} normal[(#,#)] density(#)]

where bin(#) specifies the number of (equally spaced) bins to use for constructing the histogram (the default is bin(5)), freq and percent affect how the vertical axis is labeled (respectively, in frequency units and in percent), normal[(#,#)] overlays a normal density with specified mean and standard deviation (normal by itself uses the observed mean and standard deviation), and density(#) (only used with normal) specifies the number of points along the density to be calculated (the default is density(100)). Examples:

    . graph x                  (draws a histogram of x)
    . graph x, bin(15)         (uses 15 bins for histogram)
    . graph x, normal(10,3)    (overdraws a normal density with mean 10 and std. dev. 3)

3.4 TWO-WAY SCATTERPLOTS

This is the default for graph when more than one variable is specified. The basic syntax is:

    graph [yvars xvar] [weight] [if exp] [in range], twoway [common_options rescale rbox {y|x|r}reverse]

where rescale scales each y-variable independently (if there are two y-variables, the scale of the first is presented on the left axis and the scale for the second on the right axis; if there are more than two y-variables, no vertical scale is labeled), rbox places a rangefinder box plot on the graph, and {y|x|r}reverse reverses the indicated scale to run from high-to-low. twoway may be combined with oneway or box, but in that case, one must specify twoway explicitly. Examples:

    . graph y x                (graph of y against x)
    . graph z y x, rescale     (graph of z and y against x)

    . graph y x, box by(z)     (graphs y against x by z)

3.5 TWO-WAY SCATTERPLOT MATRICES

A two-way scatterplot matrix is a set of two-way scatterplots arranged in a matrix. The basic syntax is:

    graph [varlist] [weight] [if exp] [in range], matrix [common_options half]

where half draws only the lower half of the matrix. Example:

    . graph x y z if z>0, matrix

3.6 BOX PLOTS

A box plot is a graphical procedure with the following features: (i) it combines a measure of location (the median) and a measure of spread (the interquartile range), (ii) it shows the presence of possible outliers, and (iii) it provides some indication about the shape of the distribution of the data in terms of their symmetry or skewness. The basic syntax is:

    graph [varlist] [weight] [if exp] [in range], box [common_options [no]alt vwidth root]

where [no]alt forces the labeling of the groups to be on a single line (noalt) or multiple lines, vwidth makes the width of the box proportional to the number of observations, and root (only used with vwidth) makes the width of the box proportional to the square root of the number of observations. Examples:

    . graph y x, box           (graphs box-and-whiskers for y and x)
    . graph y, box by(z)       (graphs box-and-whiskers for y by z groups)
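The four styles discussed in this chapter can be tried together on simulated data. The sketch below is written in the graph syntax described in this chapter and assumes a Stata release that accepts it; the variables x, y and z are made up for the illustration.

```stata
* Hypothetical data: z defines two groups, y depends linearly on x
clear
set obs 200
set seed 12345
generate z = mod(_n, 2)
generate x = 10*uniform()
generate y = 1 + 0.5*x + invnorm(uniform())

* Histogram of y with 10 bins and a normal density overlaid
graph y, bin(10) normal

* Scatterplot of y against x with "round" axis labels
graph y x, xlabel ylabel

* Scatterplot matrix, lower half only
graph x y z, matrix half

* Box plots of y for each group of z (data sorted by z first)
sort z
graph y, box by(z)
```

Sorting by z before using by(z) follows the example in Section 4.1.1, where sort z is issued before a graph command that uses by(z).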


4 Programming and Matrix Commands

In this note I discuss the elements of Stata programming and the Stata matrix language.

4.1 PROGRAMMING STATA

The capabilities of Stata may be extended considerably by using programs. A Stata program is just a sequence of Stata commands enclosed between the commands program define progname and end. Thus, the general structure of a Stata program is

    program define progname
        Stata commands
    end

Programs must be defined (loaded in memory) before they can be used. The simplest way to do so is to type the commands directly from the keyboard. This is not recommended, however, unless the program is short. Alternative ways of defining programs are described in Sections 4.2 and 4.3.

Programs are executed by typing progname. Displaying of the underlying commands is suppressed. Programs may call other programs. Stata allows programs to be nested 32 deep.

4.1.1 MACROS

A macro is a user-defined string of characters, called the macro name, that stands for another string of characters, called the macro content. Stata has two types of macros, local and global. Local macros are private, that is, specific to the program where they are defined; global macros are public. Their content is set respectively by the local and global commands. Their general syntax is

    {local|global} mname [[`]"[string]"[']|= exp|: extended_fcn]

where the macro name mname can be up to 7 characters long for local macros and up to 8 characters for global macros, and exp may be either a numeric or a string expression. For the use of an extended macro function see the Stata manual.
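The difference between private and public macros can be seen in a short sketch. The program name showmac and the macro names msg and GMSG are hypothetical, chosen only for this illustration.

```stata
capture program drop showmac
program define showmac
    local msg "set inside showmac"        /* private to showmac */
    global GMSG "set inside showmac"      /* visible everywhere */
    display "local inside the program : `msg'"
end

showmac
display "global outside the program: $GMSG"
display "local outside the program : `msg'"
```

Assuming no local macro msg has been defined outside the program, the last command displays nothing after the colon, because the local macro set inside showmac no longer exists once the program ends.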

To copy string to mname (the maximum length of string is 18,623 characters) use:

    {local|global} mname "string"

To evaluate exp and store the result in mname (the maximum length of exp is 80 characters) use:

    {local|global} mname = exp

Macros can be used everywhere in programs, do-files and ado-files (see below). The content of a local macro is accessed by enclosing the macro in `', that of a global macro by prefixing it with a $. This simply replaces the name of the macro with its content. Examples:

    local options "gap(4) sy(i) xlab(10) ylin"
    graph x y, `options'
    sort z
    global options "by(z) gap(4) sy(.) c(l) xlab ylab"
    graph x y, $options

If a macro contains double quotes, compound double quotes `""' may be used to define a macro. Typing macro drop mname eliminates the global macro mname. Typing macro drop _all eliminates all global macros.

4.1.2 SYSTEM MACROS

In addition to user-defined macros, Stata has a number of built-in global system macros that begin with the characters S_. Examples:

    $S_DATE : contains the current date in the format dd mon yyyy
    $S_TIME : contains the current time in the format hh:mm:ss
    $S_FN   : contains the filename last specified with use or save

User-written programs may examine and change the content of system macros. However, typing macro drop _all does not eliminate system macros, and the content of system macros such as S_DATE and S_TIME cannot be changed.

4.1.3 LOOPING

Stata provides two commands for looping, while and forvalues. The syntax of while is simpler:

    while exp {
        Stata commands
    }

This command evaluates exp and, if it is true (nonzero), executes the commands enclosed in the braces. It then repeats the process until exp evaluates to false (zero). whiles may be nested within whiles. If exp refers to any variables, their values in the first observation are used unless explicit subscripts are specified. Example: The following code fragment may be used to iterate Stata commands 10 times

    local i = 1
    local I = 10
    while `i'<=`I' {
        Stata commands
        local i = `i'+1
    }

The while command may also be used interactively.

4.1.4 BRANCHING

The syntax of this programming command is:

    if exp {
        Stata commands
    }
    else {
        other Stata commands
    }

This command evaluates exp. If the result is true (nonzero), the commands inside the braces are executed. If the result is false (zero), those statements are ignored and the statements following the else, if specified, are executed. Do not confuse this command with the if qualifier at the end of a command. Example:

    if x>0 {
        replace y = log(x)
    }
    else if x<0 {
        replace y = log(-x)
    }
    else {
        replace y = .
    }

4.1.5 PROGRAM ARGUMENTS

Programs may take arguments, just like functions. Unlike functions, however, the arguments of a program are not enclosed in parentheses but simply follow the program name.

For example, if prog1 is a program and we type

    prog1 x y

then x and y are the program's arguments, respectively the first and the second argument.

Arguments are passed to programs via local macros: `0', `1', `2', `3', etc., where the local macro `0' is exactly what the user typed, `1' is the first argument of the program, `2' the second argument, etc. The positional macros `1', `2', ... may be renamed to facilitate reading and understanding of a program. Thus, for example, the following two programs both produce a sequence of n equally spaced numbers on the interval [a, b]:

    program define prog1
        drop _all
        set obs `1'
        generate x = `2'+(_n-1)/(_N-1)*(`3'-`2')
    end

    program define prog2
        args n a b
        drop _all
        set obs `n'
        generate x = `a'+(_n-1)/(_N-1)*(`b'-`a')
    end

Sometimes programs involve a variable number of arguments, with the same thing done to each argument. An example is the summarize command, which may be applied to one, two or more variables. Programs with this feature may be coded by shifting through their arguments

    program define myprog
        while "`1'" ~= "" {
            Stata commands in terms of `1'
            macro shift
        }
    end

where macro shift shifts `1', `2', `3', etc., one to the left: what was `1' disappears, what was `2' becomes `1', what was `3' becomes `2', etc. The while loop continues the process until macro `1' is empty. An alternative is the following:

    program define myprog
        local i = 1
        while "``i''" ~= "" {
            Stata commands in terms of ``i''

            local i = `i'+1
        }
    end

4.1.6 TEMPORARY OBJECTS

Programs often require objects (variables, scalars, matrices, data, etc.) that are temporary, that is, can be discarded once the program completes. Stata provides three commands to deal with this: tempvar creates names for temporary variables, tempname creates names for temporary scalars and matrices, and tempfile creates names for temporary files. They all have the same syntax:

    {tempvar|tempname|tempfile} mname [mname ...]

The command creates local macros containing names one may use. Example:

    . tempvar x y
    gen `x' = exp
    gen `y' = exp
    ...

The drop `x' `y' command is not necessary when the program completes, because Stata automatically drops any variables with names assigned by tempvar.

4.1.7 EXCHANGING RESULTS BETWEEN PROGRAMS

Stata commands that report results save them in places where they can be subsequently used by other commands or programs. Most commonly, commands save results in one of two places:

1. r-class commands (such as summarize) save their results in r().
2. e-class (estimation) commands (such as regress) save their results in e().

Typing return list after an r-class command or estimates list after an e-class (estimation class) command summarizes what the command saved. Results saved in r() and e() come in three flavors: scalars, macros and matrices. For example, the number of observations used by a command is saved in the scalars r(N) or e(N). After an e-class command, the command name and the name of the response (dependent) variable are saved in the macros e(cmd) and e(depvar), whereas the estimated coefficients and their variance matrix are saved in the matrices e(b) and e(V).

Regardless of their flavor, one may refer to saved results in two ways. One is by simply typing r(name) or e(name). The other is to use macro substitution characters to produce `r(name)' or `e(name)'.

Example: After regress ...

    . display "You can refer to " e(cmd) " or to `e(cmd)'"
    You can refer to regress or to regress

Notice that after running an r-class command, running another one would change the content of r() but not the content of e(). On the other hand, running a new e-class command may change the content of both e() and r(). Thus, if one wants to access the results produced by a command, it is important to do so immediately.

User-defined programs may save their results if their class is specified on the program define line through the option rclass or eclass, depending on whether the program is intended to be r- or e-class. The code to save results in r() is

    return scalar name = exp
    return local name ...
    return matrix name matname

while the code to save results in e() is

    estimates scalar name = exp
    estimates local name ...
    estimates matrix name matname

4.2 DO FILES

A do-file is a standard ASCII (text) file containing a sequence of Stata commands, a separate command on each line. The sequence of commands is executed using the do or run commands, whose syntax is

    {do|run} filename [arguments] [, nostop]

where nostop allows the do-file to continue executing even if an error occurs. If filename is specified without an extension, .do is assumed. The difference between do and run is that do echoes the commands and their output, while run is silent.

A do-file completes the execution when: (i) the end of the file is reached, (ii) an exit is executed, or (iii) an error (nonzero return code) occurs (pressing Break while executing a do-file causes a nonzero return code and therefore stops the do-file).

A do-file may be used to define one or more programs or may call programs already defined. Do-files may also call other do-files. As for programs, Stata allows do-files to be nested 32 deep.

Here are some rules and recommendations for constructing a do-file.

• Start a do-file by typing version #, where # is the Stata release under which the file was written. This allows the do-file to run under later releases.

• Blank lines and comments may be included freely. Their proper use may considerably enhance understanding of a program.

• Comments may be included either by beginning a line with a '*' (a star), or by placing the comment in /* */ delimiters. The /* */ delimiters can be put anywhere, at the end of a line or even in the middle. Example:

    version 7.0    /* do-file written under Stata 7.0 */
    * read in the data
    use mydata, clear
    * summary statistics
    summarize x y z, detail

• To avoid lines wider than the screen, the end-of-line delimiter may be changed from carriage return to, say, ';' by including the #delimit ; command. The delimiter may later be changed back to carriage return by including the #delimit cr command.

• To prevent Stata from pausing when the screen is full, include the set more off command.

• The output of a do-file may be sent to a log file by including the command log using filename [, replace]. Logging stops and the log file is closed when log close is encountered. Output to the log file is suppressed if run is used to execute a do-file.

Do-files accept arguments, just like programs. Arguments are stored in local macros `1', `2', and so on. For example, to repeat the same set of instructions for different variables one could write the do-file try.do

    use mydata, clear
    drop if `1'==.
    summarize `2', detail

and then execute it by typing

    . do try x y

The second command (drop if `1'==.) would be interpreted as drop if x==., because x is the first argument typed after do try; the third command (summarize `2') would be interpreted as summarize y because y is the second argument typed after do try.

4.3 ADO FILES

An ado-file is an ASCII (text) file that contains a Stata program which defines (implements) a command. An ado-file defines a Stata command, although many commands (e.g. summarize or regress) are not defined by ado-files but are built directly into Stata. For example, the ci command produces confidence intervals and is implemented as an ado-file. This means that a file called ci.ado is stored in some directory that Stata can access.

Ado-files typically come with an associated help-file. Typing help ci (or pulling down Help and searching for ci) prompts Stata to look for the file ci.hlp, just as it does for the file ci.ado after the command ci is typed. Stata looks for ado-files (and the associated help-files) in several places: the official ado-directories (the base directory and the updates directory), the personal ado-directories, the current directory.

4.4 MATRIX COMMANDS

A Stata matrix is a rectangular array of double-precision numbers, none of which can be missing, and which is bordered by a row and a column of names. A vector is a special case of a matrix. Although Stata has scalars, they could also be handled as special cases of a matrix.

By default, the maximum matrix size is 40 × 40. The maximum matrix size can be increased to 800 × 800 by issuing the command set matsize 800. Thus, Stata matrices are unsuited for holding large amounts of data. Matrices can be used interactively or in programs, do-files and ado-files.

4.4.1 ROW AND COLUMN NAMES

Stata matrices always have row and column names. These names are used to produce "pretty" output. Row and column names have three parts: equation_name:ts_operator.subname. The first two parts may be blank. Row and column names may be reset using the matrix rownames and matrix colnames commands. The matrix list command displays a matrix with its row and column names.

4.4.2 SUBSCRIPTING AND SUBMATRICES

The basic syntax for subscripting is

    matrix A = ...B[r,c]...

where r and c are numeric or string scalar expressions. Examples:

    . matrix A = A/A[1,1]
    . matrix B = A["weight","displ"]
    . matrix D = G[1,"eq1:l1.gnp"]

The basic syntax for extracting submatrices is

    matrix A = ...B[r0..r1,c0..c1]...

where r0, r1, c0, and c1 are numeric or string scalar expressions.

Examples:

    . matrix A = B[2..., 3...]
    . matrix A = B[1..., 2..6]
    . matrix A = B[2..., "price".."mpg"]
    . matrix A = B["eq1:", "eq1:"]

The basic syntax for substituting submatrices is

    matrix A[r,c] = ...

where r and c are numeric scalar expressions. If the matrix expression to the right of the equal sign evaluates to a scalar or a 1 × 1 matrix, the indicated element of A is replaced. If the matrix expression evaluates to a matrix, the resulting matrix is placed in A with its upper left corner at (r, c). Examples:

    . matrix A[2,2] = B
    . matrix A[rownumb(A,"price"), colnumb(A,"mpg")] = sqrt(2)

4.4.3 MATRIX OPERATORS AND FUNCTIONS

The matrix operators are:

• -B (negation),
• B' (transposition),
• B , C (adds the columns of C to the right of the columns of B),
• B \ C (adds the rows of C below the rows of B),
• B + C (addition),
• B - C (subtraction),
• B * C (multiplication, including multiplication by a scalar),
• B / z (division by a scalar z),
• B # C (Kronecker product).

Parentheses may be used to enforce a particular order of evaluation. Example:

    . matrix C = (B + B')/2

The matrix functions returning scalar are:

• mreldif(B,C) (relative difference),
• trace(B) (trace of a square matrix),
• det(B) (determinant of a square matrix),

• diag0cnt(B) (number of zeros on diagonal),
• rowsof(B) (number of rows),
• colsof(B) (number of columns),
• rownumb(A,s) (the first row number named s, where s is a string or string expression),
• colnumb(A,s) (the first column number named s, where s is a string or string expression),
• el(A,i,j) (the i, j element of A; this function is the same as A[i,j]).

Matrix functions returning scalar may be used in any expression context, not just matrix expression contexts.

The matrix functions returning matrix are:

• I(n) (n × n identity matrix),
• J(n,m,z) (n × m matrix containing the constant element z),
• inv(B) (inverse of a square matrix),
• syminv(B) (inverse of a symmetric matrix; if B is not positive definite, returns a generalized inverse),
• cholesky(B) (Cholesky decomposition of a symmetric matrix),
• sweep(B,j) (sweep of a square matrix; returns B with jth row/column swept),
• corr(B) (correlation transform),
• diag(V) (V is a row or column n-vector; it returns a diagonal n × n matrix with diagonal elements equal to those of V),
• vecdiag(B) (returns a row vector containing the diagonal of a square matrix),
• get(mname) (returns the system matrix mname).

Examples:

    . display trace(X)
    . matrix beta = syminv(X'*X)*X'*y
    . matrix L = cholesky(0.1*I(rowsof(X)) + 0.9*X)

There are matrix utilities to list the currently defined matrices (matrix dir), display the contents of a matrix (matrix list), rename a matrix (matrix rename) and drop a matrix (matrix drop). The matrix drop _all command drops all matrices.
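The operators and functions listed in this section can be checked on a small matrix entered directly from the keyboard. The following is a minimal sketch; the matrix names B, C, D, E and K are arbitrary, and the values in the comments follow from hand calculation on the 2 × 2 matrix B.

```stata
matrix B = (1, 2 \ 3, 4)        /* rows separated by \, columns by , */
matrix C = (B + B')/2           /* symmetric part of B               */
matrix D = B , C                /* 2 x 4: C to the right of B        */
matrix E = B \ C                /* 4 x 2: C below B                  */
matrix K = I(2) # B             /* 4 x 4 Kronecker product           */

matrix list C
display det(B)                  /* 1*4 - 2*3 = -2 */
display trace(B)                /* 1 + 4 = 5      */
display rowsof(K)               /* 4              */
display el(C, 1, 2)             /* (2 + 3)/2 = 2.5 */
```

Because det(), trace(), rowsof() and el() return scalars, they can be used with display or inside generate and replace, not only inside matrix expressions.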

4.4.4 CROSS-PRODUCT MATRICES

Statistical computations often involve matrix operations such as $X^\top X$ or $X^\top W X$. X usually has a large number of rows and a small to moderate number of columns, whereas W takes on a restricted form (diagonal, block diagonal, or is known in some functional form and need not be stored). Computing $X^\top X$ or $X^\top W X$ by storing the matrices and then directly performing the matrix multiplications is inefficient and wasteful. Stata has a number of commands to compute these results efficiently.

The matrix accum command accumulates cross-product matrices from the data to form $A = X^\top X$. The matrix glsaccum command accumulates cross-product matrices from the data using a specified inner weight matrix to form $A = X^\top B X$, where B is a block diagonal matrix. The matrix vecaccum command accumulates the first variable against the remaining variables to form the row vector $a = X_1^\top X$, where $X = (X_2, X_3, \ldots)$.

4.4.5 DATA TO MATRIX CONVERSION

Variables can be converted into matrices and matrices into variables through the mkmat and svmat commands. Their syntax is

    mkmat varlist [if exp] [in range] [, matrix(matname)]
    svmat [type] A [, names(col|eqcol|matcol|string)]

where type is a storage type for new variables, A is the name of an existing matrix, and the names(col|eqcol|matcol|string) option specifies how the new variables are to be named.

The mkmat command stores the k variables listed in varlist in k column vectors of the same name. Optionally, if matrix() is specified, they can be stored as a single matrix. The svmat command is the reverse of mkmat: it takes a matrix and stores its columns as new variables. The related command

    matname A namelist [, rows(range) columns(range) explicit]

renames the rows and columns of a matrix. Examples:

    . mkmat mpg
    . mkmat foreign weight displ, matrix(X)
    . matrix b = syminv(X'*X) * X'*mpg
    . mkmat bvector1 if bvector1~=.
    . matrix list bvector1
    . matrix d = bvector1'
    . matname d wei gr for _cons, c(.)
    . matrix list d
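As an illustration of why the accumulation commands of Section 4.4.4 are convenient, the sketch below computes least squares coefficients from the cross-product matrix produced by matrix accum, without ever storing the data as a matrix. The data and the variable names y, x1 and x2 are simulated for the example.

```stata
* Hypothetical data
clear
set obs 50
set seed 2001
generate x1 = uniform()
generate x2 = uniform()
generate y  = 1 + 2*x1 - x2 + invnorm(uniform())

* Cross-products of (y, x1, x2, constant); matrix accum appends the
* constant as the last column unless noconstant is specified
matrix accum A = y x1 x2

matrix XX = A[2..., 2...]       /* X'X block for (x1, x2, _cons) */
matrix Xy = A[2..., 1]          /* X'y block                     */
matrix b  = (syminv(XX)*Xy)'    /* least squares coefficients    */
matrix list b

* The same coefficients from the built-in command
regress y x1 x2
```

The coefficients listed in b should agree with those reported by regress. The mkmat route shown in the examples above gives the same answer but requires the data matrix to fit within the matrix size limit.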

4.4.6 GETTING SYSTEM MATRICES

The usual way to obtain matrices after a command that produces matrices is to refer to the returned matrix in the standard way. For example, all estimation commands (see Section 5.1) return the coefficient vector e(b) and the variance-covariance matrix e(V) of the estimates. These matrices can be referenced directly. Examples:

    . matrix list e(b)
    . matrix S = vecdiag(e(V))

Other matrices are returned by various commands. They are obtained in the same way. Alternatively, the matrix get command also obtains matrices after certain commands.

4.4.7 MATRIX DECOMPOSITION

The matrix symeigen command returns the eigenvectors in the columns of the n × n matrix X and the corresponding eigenvalues in the n-vector V. The eigenvalues are sorted from largest to smallest: V[1,1] contains the largest eigenvalue and X[1...,1] its corresponding eigenvector, V[1,2] contains the second largest eigenvalue and X[1...,2] its corresponding eigenvector, and so on.

The singular value decomposition of a symmetric nonnegative definite matrix A is carried out through the matrix svd command. This command returns an m × n matrix U, a row n-vector W and an n × n matrix V such that $A = U \,\mathrm{diag}(W)\, V^\top$. In addition, the columns of U are orthogonal, the elements of W are positive or zero, and V is orthonormal.
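Both decompositions can be verified on a small symmetric matrix. The sketch below uses an arbitrary 2 × 2 matrix whose eigenvalues (3 and 1) are easy to compute by hand; the matrix names are chosen only for the illustration.

```stata
matrix A = (2, 1 \ 1, 2)

* Eigenvectors in the columns of X, eigenvalues in the row vector V
matrix symeigen X V = A
matrix list V                   /* eigenvalues 3 and 1, largest first */
matrix R = X*diag(V)*X'         /* rebuilds A from its eigensystem    */
display mreldif(R, A)

* Singular value decomposition: A = U*diag(W)*Vs'
matrix svd U W Vs = A
matrix R2 = U*diag(W)*Vs'
display mreldif(R2, A)
```

Both calls to mreldif() should return a number that is zero up to rounding error, which confirms that the returned factors reproduce A.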

5 Statistical Inference Using Stata

5.1 ESTIMATION

5.1.1 GENERAL SYNTAX OF ESTIMATION COMMANDS

The general syntax of an estimation command is:

    command varlist [weight] [if exp] [in range] [, options]

The first variable in varlist is the response or outcome variable, denoted by yvar, the other variables are the covariates or predictors, denoted by xvars. The general syntax for multiple-equation commands, namely commands that estimate systems of equations, is similar:

    command (varlist) (varlist) ... (varlist) [weight] [if exp] [in range] [, options]

All estimation commands share the following common features:

• To review the last estimates, just type the estimation command without arguments.

• In addition to the estimated parameters and their standard errors, confidence intervals for the coefficients are displayed. The confidence level may be set using the level(#) option, where # is the desired percentage level. The default is level(95).

• The estimated variance matrix of the estimators is computed under the assumption that the statistical model is correctly specified, but some commands allow for certain forms of model misspecification with the robust option.

• After estimation, one may obtain the estimated variance matrix of the estimators using the vce command (Section 5.2.2).

• After estimation, one may obtain predictions, residuals and influence statistics using the predict command (Section 5.2.3).

• After estimation, one can perform tests of hypotheses about the model parameters (Section 5.2.4).

• The command lincom computes point estimates, standard errors, t statistics, p-values and confidence intervals for a linear combination of coefficients after any estimation command except anova.

5.1.2 WEIGHTED ESTIMATION

Specifying weights allows weighted estimation. For example

    . regress y x1 x2 [pweight=w]

gives a weighted least squares regression of y on x1 and x2 using the probability weights contained in the variable w.

5.1.3 CONSTRAINED ESTIMATION

Several commands (e.g. cnsreg) allow estimation subject to linear constraints on the model parameters through the constraint(clist) option, where clist is of the form #[-#][, #[-#] ...], with 1 ≤ # ≤ 999. The constraint command defines, lists and drops linear constraints. Its syntax is:

    constraint define # [exp=exp|coefficientlist]
    constraint dir [clist|_all]
    constraint drop {clist|_all}
    constraint list [clist|_all]

where coefficientlist lists the variables whose coefficients are set equal to zero. The following example estimates the linear model $E(Y) = \alpha + \beta_1 X_1 + \cdots + \beta_6 X_6$ subject to the constraints that $\beta_1 = \beta_2 = \beta_3 = \beta_6$ and $\beta_4 = -\beta_5 = \alpha/10$:

    . constraint define 1 x1=x6
    . constraint define 2 x2=x6
    . constraint define 3 x3=x6
    . constraint define 4 x4=-x5
    . constraint define 5 x4=_cons/10
    . cnsreg y x1-x6, constraint(1-3,4-5)

5.1.4 ROBUST VARIANCE ESTIMATES

Some commands (e.g. regress or glm) offer the option of estimating the variance matrix of the parameter estimators by relaxing the assumption that the statistical model is correctly specified and allowing for certain forms of model misspecification.

The robust option relaxes the assumption that the observations are identically distributed, thus allowing for heteroskedasticity of unknown form. The following example estimates the linear model $E(Y) = \alpha + \beta_1 X_1 + \beta_2 X_2$ respectively without and with heteroskedasticity-robust variance estimates:

    . regress y x1 x2
    . regress y x1 x2, robust

The robust cluster(varname) option only requires observations to be independent across clusters specified by the variable varname, thus relaxing the assumption of independence.

5.2 POST-ESTIMATION COMMANDS

5.2.1 ACCESSING COEFFICIENTS AND STANDARD ERRORS

After a (single-equation) estimation command, _b[varname] (or _coef[varname]) and _se[varname] contain the coefficient on varname and its standard error, both recorded to machine precision. In the case of a multiple-equation estimation command, use [eqno]_b[varname] (or simply [eqno][varname] or [eqno]varname) and [eqno]_se[varname], where eqno is the equation number.

The command mfx produces tables displaying the marginal effects or the elasticities (and their standard errors) instead of the estimated coefficients.

5.2.2 DISPLAYING THE VARIANCE ESTIMATES

After model estimation, the estimated variance or correlation matrix of the estimators is displayed using the command

    vce [, corr rho]

where corr and rho are synonyms and either displays the correlation matrix instead of the variance matrix. To obtain a copy of the estimated variance matrix for manipulations type

    matrix matname = e(V)

5.2.3 PREDICTIONS AND RESIDUALS

The predict command calculates predictions, residuals and influence statistics after estimation. What predict can do depends to some extent on the previous estimation command. The general features of predict are:

1. The predict newvarname command creates newvarname containing the "predicted values" of the response. For example, after linear regression, predict newvarname creates the fitted values $\hat\beta^\top X_i$. After probit, it creates the estimated probit probabilities $\Phi(\hat\beta^\top X_i)$.

2. The predict newvarname, xb command creates newvarname containing the linear prediction $\hat\beta^\top X_i$. For linear models, this command produces the same result as predict newvarname.

3. The predict newvarname, stdp command creates newvarname containing the standard error of the linear prediction.

4. Adding the nooffset option to any of the above makes the calculation ignore any offset or exposure variable specified in the estimation command.

predict newvarname if e(sample). 6. Some statistics make sense only with respect to the estimation sample.. equation(#1) is the default when equation() is not specified. the calculation is automatically restricted to the estimation sample. One can restrict the prediction to the estimation sample by typing . one may use equation(#1. predict calculates the requested statistic for all observations possible. It computes the standard error of the difference in linear predictions between two equations. one could refer to the equations by their names. predict can be used to make in-sample or out-of-sample predictions. For example. Other statistics (for example stddp) refer to between-equation concepts. equation(hours) to the one named hours. In general. equation(income) would refer to the equation named income. 7.. • stdp calculates the standard error of the linear prediction. Alternatively. . • stddp is allowed only after multiple-equation estimation commands.. • equation(eqno[. • The nooffset option may be combined with most statistics and specifies that the calculation should be made ignoring any offset or exposure variable specified when the model was estimated. This option is available even if not documented for predict after a specific command. predict hat. 9. and so on. equation(#1) means that calculations are to be made for the first equation. It specifies to which equation one is referring. In such cases. use data1 model estimation commands ..hours).#2) or equation(income. equation() is not optional. Example: . /* fill in the predictions */ The options of predict are: • xb calculates the linear prediction from the estimated model.50 5. If neither the offset() . Out-of-sample predictions may be obtained by applying predict to other datasets. In those cases. use data2 /* another dataset */ . 8. . equation(#2) that they are to be made for the second. When two equations must be specified.eqno]) is only relevant after multiple-equation estimation commands. 
whether they were used in estimating the model or not.
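The following do-file sketch illustrates the post-estimation tools of this section on simulated data. The variable names, the seed, and the choice to estimate on the first 80 of 100 observations are assumptions made for the example.

```stata
* Simulated data (hypothetical names and values)
clear
set obs 100
set seed 2001
generate x1 = invnorm(uniform())
generate x2 = invnorm(uniform())
generate y  = 1 + 2*x1 - x2 + invnorm(uniform())

* Estimate the model on the first 80 observations only
regress y x1 x2 in 1/80

* Coefficient on x1 and its standard error
display _b[x1]
display _se[x1]

* Copy of the estimated variance matrix, for further manipulation
matrix V = e(V)
matrix list V

* Linear prediction for all 100 observations (in- and out-of-sample)
predict yhat, xb

* Standard error of the linear prediction, estimation sample only
predict se_yhat if e(sample), stdp
summarize yhat se_yhat
```

In the final summary, yhat should be available for all 100 observations, whereas se_yhat is missing for the 20 observations excluded from estimation.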

Two further remarks apply to the options of predict. If neither the offset() nor the exposure() option was specified at the model estimation stage, specifying nooffset does nothing. Moreover, other_options refers to command-specific options that are documented with each command.

5.2.4 HYPOTHESIS TESTING

The test command performs Wald-type tests of linear hypotheses, the testnl command performs Wald-type tests of nonlinear (or linear) hypotheses, and the lrtest command performs likelihood-ratio tests after ML estimation.

5.3 BOOTSTRAPPING AND MONTE CARLO SIMULATIONS

Bootstrap and Monte Carlo simulations rely on Stata's uniform() random number generator. Reproducibility of the results requires setting the random-number seed by typing set seed #.

5.3.1 BOOTSTRAP

The command

bstrap progname [, reps(#) size(#) dots args(...) level(#) cluster(varnames) idcluster(newvarname) saving(filename) double every(#) replace noisily]

runs the user-defined program progname reps(#) times on bootstrap samples of size size(#).

The command

bs "command" "exp_list" [, bstrap_options nowarn noesample]

runs the user-specified command, bootstrapping the statistics specified in exp_list. Note that command and exp_list must both be enclosed in double quotes. The expressions in exp_list must be separated by spaces and there must be no spaces within each expression. This command takes the same options as bstrap except for args().

The command

bstat varlist [, stat(#) level(#) title(text)]

displays bootstrap estimates of standard error and bias, and calculates confidence intervals using three different methods: normal approximation, percentile, and bias corrected. The bstrap and bs commands automatically run bstat after completing all the bootstrap replications. If the user specifies the saving(filename) option with bstrap or bs, then bstat can be run on the data in filename to view the bootstrap estimates again.

Finally, the command

bsample [exp] [, cluster(varnames) idcluster(newvarname)]

is a low-level utility for those who prefer not to use bstrap or bs. It draws a sample with replacement from the existing data; the sample replaces the data in memory. exp specifies the size of the sample and must be less than or equal to _N. If exp is not specified, a sample of size _N is drawn (or of size n_c when the cluster() option is specified, where n_c is the number of clusters).

5.3.2 MONTE CARLO SIMULATION

The simul command is aimed at easing the programming task of performing Monte Carlo simulations. Typing "simul progname, reps(#)" iterates progname for # replications and collects the results. Its syntax is:

simul progname, reps(#) [args(whatever) dots double saving(filename) every(#) replace noisily]

where progname is the name of a program that performs a single simulation, reps(#) (not optional) specifies the number of replications to be performed, and args(whatever) specifies any arguments to be passed to progname.
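To show what the low-level bsample utility does, here is a minimal sketch of a hand-written bootstrap of a regression slope. It assumes a Stata release that provides forvalues, and uses postfile, post and postclose to collect the results. The data, the variable names and the choice of 200 replications are hypothetical.

```stata
* Simulated data (hypothetical)
clear
set obs 200
set seed 12345
generate x = invnorm(uniform())
generate y = 1 + 2*x + invnorm(uniform())

* Collect the slope estimate from each bootstrap sample
tempname boot
tempfile results
postfile `boot' b_x using "`results'"
forvalues r = 1/200 {
    preserve
    bsample                  /* draw _N observations with replacement */
    quietly regress y x
    post `boot' (_b[x])
    restore
}
postclose `boot'

* The standard deviation of b_x is the bootstrap standard error of the slope
use "`results'", clear
summarize b_x
```

The same loop structure can serve as a simple Monte Carlo experiment if, instead of resampling with bsample, each replication generates a fresh dataset from a known model.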

6 Statistical Models in Stata

A broad range of statistical models may be estimated directly using the available Stata commands. In this note, I focus on estimation of linear models (Section 6.1), generalized linear models (Section 6.2) and parametric models for duration data (Section 6.4.1), and only provide a brief description for a number of other models.

6.1 LINEAR MODELS

Stata offers several commands for estimating linear models.

6.1.1 ORDINARY LEAST SQUARES

The regress command estimates a linear model by least squares (ordinary least squares or weighted least squares). Its syntax is:

regress yvar [xvars] [weight] [if exp] [in range] [, level(#) beta robust cluster(varname) hc2 hc3 hascons noconstant tsscons noheader eform(string) depname(varname) mse1 plus]

where beta requests that the normalized regression coefficients be reported instead of confidence intervals; robust and cluster(varname) have been discussed in Section 5.1.4; hc2 and hc3 specify alternative bias corrections for robust (they may not be specified with cluster()); hascons indicates that a user-defined constant or its equivalent is specified among the independent variables (some caution is recommended when using this option, as the resulting estimates may not be as accurate as they otherwise would be); noconstant suppresses the constant term (intercept) in the regression; tsscons forces the total sum of squares to be computed as though the model has a constant (i.e. as deviations from the mean of the dependent variable); and noheader, eform(), depname(), mse1 and plus are for ado-file writers.

The syntax of predict following regress is:

predict [type] newvarname [if exp] [in range] [, statistic]

where, in addition to xb (the default) and stdp, statistic may be: pr(a,b) (Pr(a < Y < b)), e(a,b) (E(Y | a < Y < b)), ystar(a,b) (E max(a, min(Y, b))), residuals (residuals), rstandard (standardized residuals), rstudent (Studentized or jackknifed residuals), stdf (standard error of the forecast), stdr (standard error of the residual), leverage|hat (diagonal elements of the hat matrix), and cooksd (Cook's distance).

Further influence statistics available with predict after regress are covratio (COVRATIO), dfbeta(varname) (DFBETA for varname), dfits (DFITS), and welsch (Welsch distance).

In addition to predict, the following commands can be used after regress for diagnostic purposes, including assessing sensitivity to individual observations:

• avplot (graphs an added-variable or leverage plot),
• cprplot (graphs a partial residual plot),
• dfbeta (calculates the DFBETAs),
• dwstat (computes the Durbin-Watson test statistic),
• hettest (performs the Cook-Weisberg test for heteroskedasticity),
• lvr2plot (graphs a leverage versus squared residual, or L-R, plot),
• ovtest (performs Ramsey's RESET test for omitted variables),
• rvfplot (graphs a residual-versus-fitted plot),
• rvpplot (graphs a residual-versus-predictor plot),
• vif (calculates the variance inflation factors).

6.1.2 CONSTRAINED LINEAR REGRESSION

The cnsreg command estimates constrained linear regression models. Its syntax is:

cnsreg yvar xvars [weight] [if exp] [in range], constraints(numlist) [level(#)]

where constraints(numlist) (not optional) specifies the constraint numbers of the constraints to be applied (see Section 5.1.3). Constraints are defined using the constraint command.

6.1.3 LINEAR INSTRUMENTAL VARIABLES

The ivreg command estimates a linear regression model of yvar on xvars1 and xvars2 by instrumental variables (or two-stage least squares), using ivars (along with xvars1) as instruments for xvars2. The variables in xvars1 and ivars are the exogenous variables; those in xvars2 are the endogenous variables. The syntax of this command is:

ivreg yvar [xvars1] (xvars2=ivars) [weight] [if exp] [in range] [, level(#) beta hascons noconstant robust cluster(varname) first noheader eform(string) depname(varname) mse1]
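A small simulated example may help fix ideas about the ivreg syntax. In the sketch below, the variable names and the data-generating process are assumptions: x2 is made endogenous by construction and z serves as its instrument. Newer Stata releases provide ivregress for the same purpose; the command shown is the one described in this section.

```stata
* Simulated data (hypothetical): x2 is endogenous, z is its instrument
clear
set obs 500
set seed 12345
generate z  = invnorm(uniform())
generate x1 = invnorm(uniform())
generate u  = invnorm(uniform())
generate x2 = z + u + invnorm(uniform())
generate y  = 1 + x1 + 2*x2 + u

* OLS ignores the correlation between x2 and the error term u
regress y x1 x2

* Instrumental variables: x1 and z are exogenous, x2 is endogenous
ivreg y x1 (x2 = z)
```

Since x2 is positively correlated with the error term, the OLS coefficient on x2 should be noticeably above the value 2 used to generate the data, while the instrumental variables estimate should be close to it.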

6.2 GENERALIZED LINEAR MODELS

Stata offers a single and very flexible command (glm) to estimate generalized linear models (McCullagh & Nelder 1989). It also offers, for selected models in this class, special commands (e.g. logit, probit, poisson) with a broader and more specific set of options, especially diagnostics and other post-estimation output.

6.2.1 GLM

The glm command fits generalized linear models. Estimation is carried out either by iteratively reweighted least squares (IRLS) or by using the Newton-Raphson (NR) method, which is the default. The basic syntax is:

glm yvar [xvars] [weight] [if exp] [in range] [, max_options var_options output_options spec_options]

The max_options are:

iterate(#) ltolerance(#) mu(varname) nolog search fisher(#) irls

where iterate(#) specifies the maximum number of iterations allowed in estimating the model (iterate(50) is the default); ltolerance(#) specifies the convergence criterion for the change in deviance between iterations (ltolerance(1e-6) is the default); mu(varname) specifies varname as the initial estimate for the mean of yvar; nolog suppresses the iteration log; search specifies that the command should search for good starting values; fisher(#) specifies the number of NR steps that should use the Fisher scoring Hessian or expected information matrix before switching to the observed information matrix (both search and fisher() are only useful with NR optimization, not with IRLS); and irls requests IRLS minimization of the deviance instead of NR maximization of the log-likelihood.

The var_options are:

oim opg vfactor(#) robust cluster(varname) unbiased nwest(wtname [#]) jknife jknife1 bstrap brep(#) scale(x2|dev|#) disp(#) score(newvar) t(varname)

where oim specifies that the variance matrix should be calculated using the observed information matrix rather than the usual expected information matrix (option ignored if irls is not specified); opg specifies that the variance matrix be calculated using the Berndt, Hall, Hall, and Hausman (1974) variance estimator (this option is not allowed when cluster() is specified); vfactor(#) specifies a scalar by which to multiply the resulting variance matrix; robust and cluster(varname) have already been defined; unbiased specifies that the unbiased sandwich estimate of variance be used (robust is implied when unbiased is used); nwest(wtname [#]) specifies that a heteroskedasticity and autocorrelation consistent variance estimate be used; jknife and jknife1 specify that jackknife estimates of variance be used; bstrap specifies that the bootstrap estimate of variance be used; brep(#) specifies the number of bootstrap samples to consider in forming the bootstrap estimate (the default is brep(199)); and scale(x2|dev|#) overrides the default scale parameter.

By default, scale(1) is assumed for discrete distributions and scale(x2) for continuous distributions. scale(x2) specifies that the scale parameter be set to the Pearson chi-squared (or generalized chi-squared) statistic divided by the residual degrees of freedom; scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom (this provides an alternative to scale(x2) for continuous distributions and over- or under-dispersed discrete distributions); scale(#) sets the scale parameter to #. Among the remaining var_options, disp(#) multiplies the variance of yvar by # and divides the deviance by #; score(newvar) creates the new variable newvar containing each observation's contribution to the score; and t(varname) specifies the variable name corresponding to the time index (this option is required if nwest() is specified).

The output_options are:

eform level(#) trace noheader nodisplay nodots

where eform displays the exponentiated coefficients and corresponding standard errors and confidence intervals (for binomial models with the logit link, exponentiation results in odds ratios; for Poisson models with the log link, exponentiated coefficients are rate ratios); trace requests that the estimated coefficient vector be printed at each iteration; noheader suppresses the header information from the output (the coefficient table is still printed); nodisplay suppresses the output (the iteration log is still displayed); and nodots specifies that a dot should not be printed for each fitted model when calculating jackknife or bootstrap estimates (by default, a single dot character is printed for each estimation that is performed).

The spec_options are:

family(familyname) link(linkname) noconstant [ln]offset(varname)

where family(familyname) specifies the parametric family; familyname is either a user-written program or one of: binomial (Bernoulli/binomial), gamma, gaussian, igaussian (inverse Gaussian), nbinomial (negative binomial), poisson. link(linkname) specifies the link function; linkname is either a user-written program or one of: cloglog (complementary log-log), identity, log, logc (log-complement), logit, loglog (log-log), nbinomial (negative binomial), opower # (odds power), power #, probit. noconstant specifies that the linear predictor has no intercept term, and [ln]offset(varname) specifies an offset to be added to the linear predictor.

If family() is specified but not link(), then the canonical link for the family is obtained, namely:

• link(identity) for family(gaussian) (same as regress),
• link(logit) for family(binomial) (same as logit),
• link(log) for family(poisson) (same as poisson),
• link(log) for family(nbinomial) (same as nbreg),
• link(power -2) for family(igaussian),

• link(power -1) for family(gamma).

The syntax of predict after glm is:

predict [type] newvarname [if exp] [in range] [, statistic nooffset standardized studentized modified adjusted]

where, in addition to xb and stdp, statistic may be: mu (predicted mean of the response, the default), eta (same as the xb option), cooksd (Cook's distance), deviance (deviance residual), hat (diagonal of the hat matrix), likelihood (likelihood residual), pearson (Pearson residual), response (response residual), score (score residual), or working (working residual).

6.2.2 LOGIT AND PROBIT

The logit command estimates logit models by maximum likelihood (ML). Its syntax is:

logit yvar [xvars] [weight] [if exp] [in range] [, level(#) nocoef noconstant or robust cluster(varname) score(newvar) offset(varname) asis max_options]

where yvar==0 indicates a negative outcome, and yvar~=0 & yvar~=. (typically yvar==1) indicates a positive outcome. The logistic command is just the same as logit, except that it reports odds ratios rather than coefficients by default. The probit command is completely analogous and estimates probit models by ML.

The syntax of predict following logit or logistic is:

predict [type] newvarname [if exp] [in range] [, statistic rules asif nooffset]

where, in addition to xb and stdp, statistic may be: p (predicted probability of a positive outcome, the default), dbeta (Delta-Beta influence statistic, Pregibon 1981), deviance (deviance residual), dx2 (Delta chi-squared influence statistic, Hosmer & Lemeshow 1989), ddeviance (Delta-D influence statistic, Hosmer & Lemeshow 1989), hat (leverage, Pregibon 1981), number (sequential number of the covariate pattern), or residuals (Pearson residual).

The following commands can be used after both logit and logistic to explore the nature of the fit: lfit (performs goodness-of-fit tests), lstat (reports summary statistics including the classification table), lroc (graphs the ROC curve), and lsens (graphs sensitivity and specificity versus probability cutoff).

6.2.3 POISSON AND NBREG

The command poisson produces ML estimates of the Poisson regression model. Its syntax is:

poisson yvar [xvars] [weight] [if exp] [in range] [, irr level(#) exposure(varname) offset(varname) robust cluster(varname) score(newvarname) noconstant constraints(numlist) nolog max_options]

where yvar is a nonnegative count variable and irr reports estimated coefficients transformed to incidence rate ratios.

The syntax of predict after poisson is:

predict [type] newvar [if exp] [in range] [, statistic nooffset]

where, in addition to xb and stdp, statistic may be n (predicted number of events, the default), or ir (incidence rate, which is equivalent to predict ..., n nooffset).

The command nbreg produces ML estimates of the negative binomial regression model (Poisson regression with overdispersion). Its syntax is:

nbreg yvar [xvars] [weight] [if exp] [in range] [, dispersion(mean|constant) level(#) irr exposure(varname) offset(varname) robust cluster(varname) score(newvarnames) noconstant constraints(numlist) nolrtest nolog max_options]

where yvar is a nonnegative count variable. Two different parameterizations of the negative binomial model may be estimated. The default, also given by the option dispersion(mean), has dispersion for the ith observation equal to 1 + α exp(β'Xi + offset); that is, the dispersion is a function of the expected mean of the counts for the ith observation, exp(β'Xi + offset). The alternative parameterization, given by the option dispersion(constant), has constant dispersion for all observations equal to 1 + δ. For the default model, α = 0 (or ln(α) = −∞) corresponds to unit dispersion and, thus, it is simply a Poisson model. For the alternative parameterization, δ = 0 (or ln(δ) = −∞) corresponds to unit dispersion, and is again simply a Poisson model. The syntax of predict after nbreg is the same as after poisson.

6.3 OTHER LIMITED DEPENDENT VARIABLES MODELS

6.3.1 GROUPED BINARY RESPONSES

The blogit and bprobit commands produce ML estimates of the logit and probit models for grouped data. The glogit and gprobit commands produce weighted least-squares (minimum chi-square) estimates.

6.3.2 ORDERED CATEGORICAL RESPONSES

The ologit and oprobit commands estimate ordered logit and probit models of an ordinal response variable on the covariates. The actual values taken on by the response variable are irrelevant except that larger values are assumed to correspond to "higher" outcomes. Up to 50 outcomes are allowed.
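Returning to the count-data models of Section 6.2.3, the sketch below fits the same Poisson regression with poisson and with glm, which Section 6.2.1 describes as equivalent when family(poisson) is used with its canonical link. The data are simulated; the variable names and the way the counts are generated (a rounded-down exponential, not an exact Poisson draw) are assumptions made for the example.

```stata
* Simulated nonnegative integer response (hypothetical, not exactly Poisson)
clear
set obs 500
set seed 12345
generate x = invnorm(uniform())
generate y = int(exp(0.5 + 0.3*x + 0.5*invnorm(uniform())))

* Poisson regression by ML
poisson y x

* Same model as a GLM with Poisson family and log link
glm y x, family(poisson) link(log)

* Coefficients reported as incidence rate ratios
poisson y x, irr
```

The coefficient estimates from the first two commands should coincide, and the irr option reports the exponentiated coefficients of the same fit.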

6.3.3 NESTED LOGIT

The command nlogit estimates a nested logit model by ML. The model may contain one or more levels. For a single-level model, nlogit estimates the same model as clogit.

6.3.4 MULTINOMIAL LOGIT

The mlogit command estimates multinomial logit models by ML. Constraints may be defined to perform constrained estimation.

6.3.5 BIPROBIT

The biprobit command produces ML estimates of two-equation probit models, either a bivariate probit or a system of two seemingly unrelated probit equations.

6.3.6 CENSORED AND TRUNCATED REGRESSION

The tobit command produces ML estimates of censored Gaussian regression models with a fixed censoring point. The cnreg command estimates the same class of models but allows the censoring points to vary across observations. The intreg command estimates a censored Gaussian model where the response, described by the pair of variables (Y1, Y2), can be point data, interval data (e.g. a ≤ Y ≤ b), left-censored (e.g. Y ≤ b), or right-censored (e.g. Y ≥ a). The truncreg command produces ML estimates of a truncated Gaussian regression model. The heckman command estimates Gaussian linear models with sample selection using either Heckman's two-step estimator (Heckman 1976) or full ML. The related command heckprob estimates probit models with sample selection by ML.

6.4 DURATION DATA

6.4.1 PARAMETRIC DURATION MODELS

The ereg and weibull commands produce ML estimates respectively of the exponential and Weibull (survival time) models, with and without gamma-distributed or inverse Gaussian unobserved heterogeneity (frailty). Their syntax is:

{ereg[het]|weibull[het]} yvar [xvars] [weight] [if exp] [in range] [, hazard hr tr dead(varname) t0(varname) frailty(gamma|invgaussian) ancillary(varlist) strata(varname) robust cluster(varname) score(newvars) constraints(numlist) level(#) nocoef noheader nolog maximize_options]

The syntax of predict following ereg and weibull is:

predict [type] newvarname [if exp] [in range] [, statistic]

where, in addition to xb and stdp, statistic may be median time (predicted median survival time, the default).

Other statistics available with predict after ereg and weibull are: median lntime (predicted median log survival time), mean time (predicted mean survival time), mean lntime (predicted mean log survival time), hazard (predicted hazard), hr (predicted hazard ratio), surv (predicted survival probability), csnell (partial Cox-Snell residuals), or mgale (partial martingale-like residuals).

6.4.2 COX PROPORTIONAL HAZARD MODEL

The cox command estimates proportional hazards models by ML. The procedure allows for left truncation (delayed entry), as well as gaps and right censoring. The failure event may be unique or recurring. The covariates may be either fixed or time-varying (fixed within intervals). A simplified version of cox is stcox.

6.5 TIME SERIES

The tsset timevar command declares the data to be a time series and designates that the variable timevar (which must take on integer values) represents time. After tsset, the data will be sorted on timevar. The tsset command must be used before time-series operators may be used.

6.5.1 LINEAR MODELS WITH AUTOCORRELATED ERRORS

The prais command estimates a linear model with first-order autoregressive errors using the Prais-Winsten transformed regression estimator, the Cochrane-Orcutt transformed regression estimator, or a version of the Hildreth-Lu search method. The newey command produces estimated standard errors for the OLS coefficients of linear regression models with heteroskedastic and possibly autocorrelated errors (see Newey & West 1987).

6.5.2 ARIMA MODELS

The arima command estimates a linear model with autoregressive moving-average (ARMA) errors. When no covariate is specified, the command estimates autoregressive integrated moving-average (ARIMA) models for the response variable. The response variable and the covariates may be differenced or seasonally differenced to any degree. Missing data are allowed and are handled using the Kalman filter.

6.5.3 ARCH-TYPE MODELS

The arch command estimates models with autoregressive conditional heteroskedasticity (ARCH) using conditional ML. In addition to ARCH terms, models may include multiplicative heteroskedasticity. Concerning the regression equation itself, models may also contain ARCH-in-mean and/or ARMA terms. The following options may be used to estimate a number of models in the ARCH family:

• arch() (ARCH),
• arch() garch() (GARCH),
• archm arch() [garch()] (ARCH-in-mean),
• arch() garch() ar() ma() (GARCH with ARMA terms),
• earch() egarch() (EGARCH),
• abarch() atarch() sdgarch() (TARCH, threshold ARCH),
• arch() tarch() [garch()] (GJR, form of threshold ARCH),
• arch() saarch() [garch()] (SAARCH, simple asymmetric ARCH),
• parch() [pgarch()] (PARCH, power ARCH),
• narch() [garch()] (NARCH, nonlinear ARCH),
• narchk() [garch()] (NARCHK, NARCH with a single shift),
• aparch() [pgarch()] (A-PARCH, asymmetric power ARCH),
• nparch() [pgarch()] (NPARCH, nonlinear power ARCH).

6.6 PANEL DATA

The xt series of commands provides tools for analyzing longitudinal (panel) data. Each observation in a longitudinal dataset is indexed by a unit-specific index i and a time-specific index t. The iis command or the i() option set the name of the variable corresponding to index i, while the tis command or the t() option set the name of the variable corresponding to index t. Some of the xt commands use time-series operators in their internal calculations and thus require the data to be tsset.

6.6.1 LINEAR PANEL DATA MODELS

The xtreg command estimates linear panel data models. This command can estimate fixed-effects (within-group), between-group, and random-effects models as well as population-averaged models. Which estimator is used is determined by the following options:

• be (between-group estimator),
• fe (fixed-effects estimator),
• re (GLS random-effects estimator),
• pa (GEE population-averaged estimator),

• mle (Gaussian ML random-effects estimator).

If no option is specified, re is assumed. The xttest0 and xthaus commands after xtreg, re perform, respectively, the Breusch and Pagan (1980) Lagrange multiplier test for random effects and the Hausman (1978) specification test.

6.6.2 DYNAMIC PANEL DATA MODELS

The xtabond command estimates dynamic panel data models using Arellano and Bond (1989) one-step, one-step robust or two-step estimators. The command can be used with exogenously unbalanced panels and handles embedded gaps in the time series as well as opening and closing gaps.

6.6.3 SEEMINGLY UNRELATED REGRESSION EQUATIONS

The sureg command estimates a system of seemingly unrelated linear regression equations by feasible generalized least squares. This command is not part of the xt series.

6.6.4 GEE FOR PANEL DATA

The xtgee command generalizes the glm command to panel data. It is very flexible and allows estimation of generalized linear models for panel data (see Liang & Zeger 1986) with different choices of parametric family, link function and within-group correlation structure. The allowed distribution families are the same as for glm, namely Bernoulli/binomial, gamma, Gaussian (normal), inverse Gaussian, negative binomial, Poisson and user-supplied. The allowed link functions are also the same, namely complementary log-log, identity, log, log-complement, logit, log-log, negative binomial, odds power, power, probit and user-supplied. The allowed within-group correlation structures include independence, equicorrelation, kth order autocorrelation, kth order moving average, unstructured (arbitrary nonstationary) and user-supplied.

6.6.5 LOGIT AND PROBIT FOR PANEL DATA

The xtlogit command estimates a fixed-effects (fe), a random-effects (re), or a population-averaged (pa) logit model for panel data. The xtprobit command estimates random-effects (re) and population-averaged (pa) probit models for panel data. The integrals in the individual terms of the log-likelihood of the random-effects model are computed using Gauss-Hermite quadrature. After estimating the model, the quality of the Gauss-Hermite quadrature approximation may be checked using the quadchk command.
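Returning to the linear panel data estimators of Section 6.6.1, the following sketch builds a small artificial panel and compares the fixed-effects and random-effects estimators. All names and numbers are assumptions made for the example. The panel structure is declared here with tsset id t, on the assumption that this sets both the unit and the time index; with the setup described in Section 6.6, iis and tis (or the i() and t() options) play the same role.

```stata
* Simulated panel (hypothetical): 100 units observed for 5 periods
clear
set obs 500
set seed 12345
generate id = int((_n - 1)/5) + 1
sort id
by id: generate t = _n

* Unit-specific effect a, correlated with the regressor x
by id: generate a = invnorm(uniform()) if _n == 1
by id: replace a = a[1]
generate x = a + invnorm(uniform())
generate y = 1 + 2*x + a + invnorm(uniform())

* Declare the panel structure, then compare the two estimators
tsset id t
xtreg y x, fe
xtreg y x, re
```

Because the unit effect is correlated with x by construction, the fixed-effects slope should be close to 2, whereas the random-effects slope should be pulled away from it. This is the kind of discrepancy that the Hausman specification test is designed to detect.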

Notice that the xtlogit, fe command is equivalent to the clogit command. Also notice that xtlogit, pa corresponds to

xtgee, family(binomial) link(logit) corr(exchangeable)

whereas xtprobit, pa corresponds to

xtgee, family(binomial) link(probit) corr(exchangeable)

6.6.6 POISSON AND NEGATIVE BINOMIAL MODELS

The xtpois command estimates fixed-effects (fe), random-effects (re), or population-averaged (pa) Poisson models for panel data. For the re option, either a gamma (the default) or a normal (Gaussian) distributed random-effects model is estimated. As for xtprobit, the integrals in the individual terms of the log-likelihood of the Gaussian random-effects model are computed using Gauss-Hermite quadrature. After estimating the model, the quality of the Gauss-Hermite quadrature approximation may be checked using the quadchk command.

The xtnbreg command estimates fixed-effects (fe), random-effects (re), or population-averaged (pa) negative binomial models for panel data. For both xtpois and xtnbreg, the population-averaged model assumes equicorrelation as default (that is, corr(exchangeable)).

6.7 NONPARAMETRIC ESTIMATION

6.7.1 DENSITY ESTIMATION

The kdensity command produces kernel density estimates and graphs the result. The available options for the kernel function are: biweight, cosine, epan (Epanechnikov, the default), gauss (Gaussian), parzen, rectangle (uniform), triangle. The width(#) option specifies the halfwidth of the kernel. If width() is not specified, then Stata uses the asymptotically optimal width for Gaussian data and a Gaussian kernel.

6.7.2 REGRESSION SMOOTHERS

The ksm command carries out unweighted and locally weighted smoothing of a response variable yvar on a single covariate xvar, displays the graph, and optionally saves the smoothed variable. Among the command's capabilities is lowess (robust locally weighted regression, Cleveland 1979).

The smooth command applies resistant, nonlinear smoothers (running medians) to a single variable varname and stores the new series in newvar. Missing values at the beginning or end of the range of varname are ignored, but missing values in the middle of the series are not allowed.
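For reference, a kernel density estimate of the kind produced in Section 6.7.1 has the standard textbook form shown below. The notation (observations X_1, ..., X_n, kernel function K, halfwidth h) is introduced here for illustration only and is not taken from the Stata documentation; the exact scaling Stata applies to each kernel may differ.

```latex
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right),
\qquad
K_{\mathrm{gauss}}(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}
```

A larger halfwidth h averages over more observations around x and gives a smoother but more biased estimate; a smaller h does the opposite.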

6.8 ROBUST AND QUANTILE REGRESSION

6.8.1 ROBUST REGRESSION

The rreg command estimates a linear model by iteratively reweighted least squares using a particular set of robust weights.

6.8.2 QUANTILE REGRESSION

The qreg command estimates quantile (including median) regression models. The iqreg command estimates interquantile regressions (with a limit of 336 covariates). The estimated variance matrix of the estimators is obtained by bootstrap. The sqreg command estimates simultaneous-quantile regression and produces the same coefficients as qreg for each quantile. The estimated variance matrix of the estimators is obtained by bootstrap and includes between-quantiles blocks. Thus, one can test and construct confidence intervals comparing coefficients describing different quantiles. This command has a limit of 336/q covariates, where q is the number of quantiles specified. The bsqreg command is the same as sqreg. Although not as fast, it is not limited to 336 coefficients.

6.9 GENERAL NONLINEAR METHODS

The nl command fits an arbitrary nonlinear function to a response variable yvar by least squares. The function must be provided in a separate program. The ml series of commands allows estimation of an arbitrary model by ML. See Gould and Sribney (1999) for details.
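As a rough illustration of why one might prefer the estimators of Section 6.8 to least squares, the sketch below contaminates a simulated dataset with a few gross outliers and fits the same linear model in four ways. The variable names, the seed and the form of the contamination are assumptions made for the example.

```stata
* Simulated data (hypothetical) with ten gross outliers in y
clear
set obs 200
set seed 12345
generate x = invnorm(uniform())
generate y = 1 + 2*x + invnorm(uniform())
replace y = y + 20 in 1/10

regress y x                 /* least squares, sensitive to the outliers */
rreg y x                    /* robust regression */
qreg y x                    /* median regression (the default quantile) */
qreg y x, quantile(.25)     /* first-quartile regression */
```

The intercept from regress should be pulled upward by the contaminated observations, while rreg and the median regression should stay closer to the value 1 used to generate the data.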

References

Berndt E.R., Hall B.H., Hall R.E. and Hausman J.A. (1974) Estimation and Inference in Nonlinear Structural Models. Annals of Economic and Social Measurement, 3/4: 653-665.

Breusch T.S. and Pagan A.R. (1980) The Lagrange Multiplier Test and Its Applications to Model Specification in Econometrics. Review of Economic Studies, 47: 239-254.

Cleveland W.S. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 74: 829-836.

Gould W. and Sribney W. (1999) Maximum Likelihood Estimation with Stata. Stata Corporation, College Station, TX.

Hausman J.A. (1978) Specification Tests in Econometrics. Econometrica, 46: 1251-1272.

Heckman J.J. (1976) The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models. Annals of Economic and Social Measurement, 5: 475-492.

Hosmer D.W. and Lemeshow S. (1989) Applied Logistic Regression. Wiley, New York.

Liang K.-Y. and Zeger S.L. (1986) Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73: 13-22.

McCullagh P. and Nelder J.A. (1989) Generalized Linear Models (2nd ed.). Chapman and Hall, London.

Newey W.K. and West K.D. (1987) A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55: 703-708.

Peracchi F. (2001) Econometrics. Wiley, Chichester, UK.

Pregibon D. (1981) Logistic Regression Diagnostics. Annals of Statistics, 9: 705-724.