STATA Tutorial

GVPT622: Quantitative Methods I September 4, 2002



STATA is a command-line driven statistics package. This means that much like DOS, you need to type commands into the software to make it execute any routine. While this is a bit more difficult than menu-driven packages like SPSS, it is much faster and more flexible. This document is meant to get you started working with STATA.


Getting Started
Why All those Windows?
STATA is a multiple-windowed environment. When you open STATA, you will see four windows.

1. Review window - The review window lists previously typed commands. You can retrieve them in two ways: scroll through the list with the scroll arrows and click on a command, or press the “page up” key, which places the previous command in the command window. Pressing “page up” twice brings up the next-to-last command, and so forth.
2. Variables window - The variables window lists the variables in the currently loaded dataset, along with their labels.
3. STATA Command window - The STATA command window is where the user interacts with STATA. This is where commands are typed in.
4. STATA Results window - The STATA results window shows you the results of the commands typed into STATA.

The Graphics Window displays graphs when a graph command is typed in the command window. This window will not be visible when you first open STATA; rather, it will pop up directly following a graphical command.


Log Files

Log files are a way to save all of the commands and corresponding output generated during a STATA session. It is essential to open a log file during every session to keep track of data manipulation and any analysis performed. A log file can be opened in two different ways.

First, you can open a log using the menu option File>>Log>>Begin, at which point the program will ask you to specify the name for the log file. Make sure that you put it in the directory you want. There are two formats for a log file. The .smcl format is a formatted log, and the .log format is an unformatted log that can be opened in any text editor. I find the .log files easier to work with, but this is only a personal preference. If you want a .log file, make sure to change the option in the pull-down menu of the save box.

The second way to open a log is to type the command in the command window:

log using filename [, append replace [ text | smcl ] ]


This type of log file captures everything that comes up in the results window. There is another type of log file - the command log - that captures only commands, not output. This type of file would allow you to replicate your analysis with just one command. The command log can be requested using command-line syntax as follows:

cmdlog using filename [, append replace]

The command log can then be opened as a text file, or can be opened in STATA’s do-file editor. You can open the do-file editor by clicking the envelope-looking button (the fifth one from the left in STATA’s toolbar), or you can open the file from the command line by typing:

doedit filename

The file can easily be run in STATA by typing in the command line:

do filename

or by clicking on the Tools>>Run menu option in the STATA do-file editor.

Log files can be suspended or closed. Suspending a log file is done by typing “log off”, which temporarily closes the log file. The log can then be turned back on by typing “log on”. Closing a log file is done by typing “log close”. You may then open a new log file. You could also reopen the same log file and add more information by typing:

log using logname.log, append

You can also overwrite an existing log file with a new one by replacing the append option with the replace option in the syntax above.
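As a rough sketch of how the logging commands above fit together in one session, the following sequence opens both kinds of log, suspends and resumes the full log, and closes everything at the end. The file names session1.log and session1_cmds.txt are hypothetical, and the text option assumes a STATA version that supports the [ text | smcl ] choice shown in the syntax above.

```stata
* Hypothetical file names; both files are written to the current directory
log using session1.log, text replace
cmdlog using session1_cmds.txt, replace

* ... data manipulation and analysis commands go here ...

* Temporarily stop recording output, then resume
log off
* commands typed here do not appear in session1.log
log on

* Close both logs at the end of the session
log close
cmdlog close
```

Using replace here overwrites any earlier file of the same name; append would be the choice for continuing a log from a previous session.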


Getting Data in STATA Format

There are four main ways to put your data in STATA:

1. Typing data into STATA.
2. Copy and paste from another program.
3. Infile/Insheet.
4. Stat-Transfer.

Typing in Data

You can type data into STATA directly without having it in any other format. You can type data into a spreadsheet environment either by typing “edit” in the command window or by pushing the button with the spreadsheet (without the magnifying glass) on it. The button with the spreadsheet and a magnifying glass opens a data viewing environment where the data cannot be edited. STATA’s capabilities as a spreadsheet are lacking, so typing directly into STATA is only a good idea with very small datasets.

Copy and Paste

Data can be cut and pasted from other programs into STATA. This works particularly well with spreadsheet data, but can also work with delimited text data. To copy and paste data into STATA, simply open the data editor as described above, then copy from a spreadsheet like Excel and paste into STATA. A couple of text editors allow the user to copy blocks of text, such as columns, from the middle of a document, which is particularly useful in these situations. These are Textpad and WinEdt, both essentially shareware. WinEdt costs $30 to register, and the program becomes annoying once the trial period is up. Textpad can also be registered but is not particularly annoying if you do not.




Infile/Insheet

STATA’s infile command allows the user to bring a plain-text table of data into the program. This is usually done from a .txt document. The data file should be delimited by tabs, spaces or commas. Insheet is a similar command designed specifically for data written out of a spreadsheet program; with insheet the delimiter is given as an option, whereas infile has no delimiter option. The syntax of the infile command is as follows (see footnote 1):

infile varlist [_skip[(#)] [varlist [_skip[(#)] ...]]] using filename [if exp] [in range] [, automatic byvariable(#) clear ]

The syntax of the insheet command is:

insheet [varlist] using filename [, double [no]names { comma | tab | delimiter("char") } clear ]

You can get a description of what the arguments to these and other commands mean by typing:

help infile1
help insheet

or more generally:

help <function>

Dictionaries

Dictionaries are a way to define variable types. STATA does not like to infile string variables without a dictionary. Dictionary files include not only the data you want to input, but also a dictionary command at the beginning. For an example, see “H:/GVPT622 F02/auto.dct”, which you can open in a text editor. To use a dictionary with the infile statement, just type:

infile using [filename.dct]

STATA has two basic types of variables: string and numeric.

1. String variables are those that contain at least one non-numeric character such as a letter or symbol. STATA calls these “str” variables. There is always a number after the “str” which denotes how many characters wide the variable is, so a variable that is str8 is 8 characters long.
2. Numeric variables are those containing only numbers (including possibly a decimal point). There are different kinds of numeric variables: byte, int, long, float and double. They differ in their minimum, maximum, and precision near zero. Type “help datatypes” for a more thorough discussion.

Stat-Transfer

By far the easiest way to get data into STATA, or nearly any other format for that matter, is with Stat-Transfer. This program allows the user to take data in nearly any format (including SAS, SPSS, Excel (or other spreadsheet), Access (or other database), Systat, Gauss, Limdep, Matlab, Statistica, etc.) and transfer the data into any other format. One of the benefits is that variable names and labels, as well as value labels, tend to be preserved across formats. Stat-Transfer is a Windows program that should be on the statistical software menu in the graduate lab or in LeFrak. The program works in four steps.

1. Choose the type of file you want to transfer.
2. Find the file on your computer.
3. Specify the type of file into which you want to transfer your data.
4. Hit “Transfer”.

For more advanced users, there are tabs for observations, variables and options that help the user tweak the program to produce more polished data, but specifying further options in these tabs is often not necessary.

Footnote 1: The hard brackets [ and ] shown in the command syntax are not typed; they mark optional elements and are there simply for clarity in the presentation.
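To make the infile and insheet commands described earlier in this section more concrete, here is a minimal sketch of the three ways of reading a text file. All file and variable names (survey.txt, survey.csv, survey.dct, id, age, income) are hypothetical, and the sketch assumes the text files contain only numeric columns unless a dictionary is used.

```stata
* Hypothetical files in the current directory.
* survey.txt: space-delimited, three numeric columns, no header row
infile id age income using survey.txt, clear

* survey.csv: comma-delimited, variable names in the first row
* (clear discards the data read in by the previous command)
insheet using survey.csv, comma clear

* survey.dct: a dictionary file that declares variable types,
* needed when some columns are strings
infile using survey.dct, clear

* Check what was read in
describe
```

Each command replaces the data in memory, so in practice only the one matching your file would be used.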


Saving and Loading Data in STATA Format
Loading Data
Data can be loaded in one of two ways:

1. Menu - With the menu option File>>Open, you can search for and load data. Similarly, you can press Ctrl+O or hit the open folder button, the first one on the left-hand side of the STATA toolbar.
2. Syntax - You can type the use command directly into the command window. The command is as follows:

use filename [, clear nolabel ]

The clear option allows data to be loaded in even if data are currently loaded into the program and have changed since the last save command was executed.
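A short sketch of the use command with and without the clear option follows. The dataset name mydata is hypothetical; STATA assumes the .dta extension when none is given.

```stata
* mydata.dta is a hypothetical dataset in the current directory
use mydata

* ... changes made to the data here are not yet saved ...

* Reload the file, discarding the unsaved changes in memory
use mydata, clear
```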


Saving Data

Data can be saved in a couple of different ways as well.

1. Saving with the Menu - The menu options File>>Save and File>>Save As can be used to save data in STATA format. These files end in a “.dta” extension. You can also press Ctrl+S to save.
2. Syntax - Data can be saved using the save command as follows:

save [filename] [, nolabel old replace all intercooled ]

Here old instructs the software to save the dataset in the format of the previous version of STATA. You shouldn’t need this in the lab, but you will if you’re using STATA 7 elsewhere and want to use the data in STATA 6 in the lab. Replace simply overwrites an existing dataset that has the exact same name. The other options are irrelevant to your work.
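The save options described above might be used as in the following sketch. The file names are hypothetical, and the old option follows the syntax given in this document, so it assumes a STATA version that accepts it.

```stata
* mydata.dta is a hypothetical dataset in the current directory
use mydata, clear

* Save under a new name; this stops with an error if the file already exists
save mydata_v2

* Overwrite the existing file of the same name
save mydata_v2, replace

* Save a copy readable by the previous version of STATA
save mydata_old, old replace
```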



STATA’s graphing capabilities are not the best of the statistical packages, but they are sufficient for exploratory analysis. They are, however, probably not good enough for publication. There are many possibilities. These can be broken down into two basic types - univariate and bivariate.


Univariate Graphs

Univariate graphs are usually meant to describe the distributional properties of a single variable. These include histograms, density plots, boxplots, and oneway scatterplots.

Histogram

Histograms place observations into categories (or bins), and each bin is then graphed with a height equal to the percentage of the total observations that fall in it. The command in STATA is:

graph [variable] [weight] [if exp] [in range], histogram [common_options bin(#) {freq | percent} normal[(#,#)] density(#)]

The “bin” argument allows you to set the number of categories into which the observations are placed. A normal density curve can be superimposed on the histogram with the normal option.
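As an illustration of the histogram syntax above, the sketch below draws a ten-bin histogram with a normal curve overlaid. The dataset mydata and the variable income are hypothetical, and the command follows the graph syntax given in this document.

```stata
* mydata and income are hypothetical names
use mydata, clear

* Histogram of income with 10 bins and a normal density overlaid
graph income, histogram bin(10) normal
```

To my understanding, later STATA releases replaced this syntax with a separate histogram command, so this form may only run as written in the version the tutorial was written for.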


Density Plots

A density plot is also called a “smoothed histogram”. In this graph, there are no bins. It is a single line that looks more like the population density function than the histogram does. The command in STATA for this is:

kdensity varname [weight] [if exp] [in range] [, nograph generate(newvarx newvard) n(#) width(#) {biweight|cosine|epan|gauss|parzen|rectangle|triangle} normal stud(#) at(varx) symbol(...) connect(...) title(string) graph_options ]

The gauss option is probably the one that will be most useful. The biweight, cosine, epan (Epanechnikov), gauss, parzen, rectangle and triangle options all control how observations are weighted (this is analogous to deciding which bin they are in).

Boxplots

Boxplots, sometimes called “box and whisker” plots, are particularly good at showing the spread of a distribution. The box represents the inter-quartile range (the range between the 25th and 75th percentiles). The whiskers cover most of the rest of the observations, but some extreme outliers can still lie outside the whiskers. The STATA command to make a boxplot is:

graph [varlist] [weight] [if exp] [in range], box [common_options [no]alt vwidth root]

Oneway Scatterplots

Oneway scatterplots (also called rug plots in other packages) are yet another way to visualize univariate distributions. These work particularly well with smaller datasets; with larger ones, the marks overlap so much that the shape of the distribution is hard to distinguish. The STATA command to construct a oneway scatterplot is:

graph [varlist] [weight] [if exp] [in range], oneway [common_options jitter(#)]
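The remaining univariate graphs in this section can be sketched the same way. The dataset mydata and the variable income are hypothetical, and the graph commands follow the syntax given in this document rather than that of later STATA releases.

```stata
* mydata and income are hypothetical names
use mydata, clear

* Kernel density plot using the Gaussian kernel
kdensity income, gauss

* Box-and-whisker plot of the same variable
graph income, box

* Oneway scatterplot (rug plot)
graph income, oneway
```

Each command draws a new graph in the Graphics Window, replacing the previous one.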


Bivariate Graphs

Bivariate graphs display the relationship between two variables. While there are, in theory, a number of possibilities for visualizing two variables together, such as a joint density plot, the one used almost exclusively is the bivariate scatterplot.

Bivariate Scatterplot

The bivariate scatterplot uses the values on two variables (X and Y) as coordinates graphed onto a set of coordinate axes. A number of different lines can be plotted on the graphs to further describe the relationship between the two variables. We will learn more about these later in the semester. The command to create a bivariate scatterplot in STATA is:

graph [varlist] [weight] [if exp] [in range], twoway [common_options jitter(#) rescale rbox {y|x|r}reverse]

You may consult the STATA graphing manual or help files for more specific help on any of these and many other commands for the graphical display of data.
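A minimal sketch of a scatterplot using the syntax above follows. The dataset mydata and the variables income and age are hypothetical; in this syntax the first variable named is plotted on the vertical axis and the last on the horizontal axis.

```stata
* mydata, income and age are hypothetical names
use mydata, clear

* Scatterplot with income on the Y axis and age on the X axis
graph income age, twoway
```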



There are a number of other commands that will become useful as you begin to use STATA on a regular basis.

1. Describe - describe lists the properties of the variables specified, or of all of the variables in the dataset if none are specified.

describe [varlist] [, short detail fullnames numbers ]

2. Summarize - summarize provides the mean, standard deviation, minimum and maximum for all of the variables specified, or for all variables in the data if none are specified.

summarize [varlist] [weight] [if exp] [in range] [, [detail|meanonly] format ]

3. Set Memory - the set memory command will be important when you are using large datasets.

set mem 100m

This sets the memory at 100 megabytes, which should be sufficient for nearly all of your projects. The upper bound is determined by the computer’s physical memory; if 100 megabytes is not enough and your computer has more memory available, you can set the limit higher.

4. Labelling - Labelling variables and variable values is important for keeping your dataset manageable. You will hear horror stories from many quantitative types about how they didn’t label variables and variable values because they were sure they would always remember, and then two years later, after having left the project sitting, they came back only to find they had forgotten everything about the variables and their coding. You will need three different commands to properly label your variables.

(a) Label Variable - this command simply attaches a label to the variable name. So, if for instance the name of the variable is ‘var1’ and you label it ‘party ID’, then ‘party ID’ will show up in all printed output containing that variable. The command in STATA is as follows:

label variable varname ["label"]

Here ‘varname’ is the variable name (var1 in the example above) and label is the label you want to apply to that variable name (party ID in the example above). So, to create the label party ID for var1, we would type the following:

label variable var1 "party ID"

(b) Label Define - this command defines value labels. For instance, our party ID variable may have republicans, independents and democrats. We want to make a label so that if we tabulate the variable, instead of 0, 1 and 2 as categories, it shows republicans, independents and democrats. The general STATA code is as follows:

label define lblname # "label" [# "label" ...] [, add modify nofix ]

Here lblname is the name you want to give to the label, like ‘partyid’ in this case, # is the number to which you want the label to apply, and label is the descriptor. For this example, we would type:

label define partyid 0 "republican" 1 "independent" 2 "democrat"

(c) Label Values - Finally, we can apply the new value label we defined, ‘partyid’, to the variable of interest.

label values var1 partyid

More generally, the syntax is:

label values varname [lblname] [, nofix ]
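Putting the commands in this list together, a labelling session might look like the sketch below. The dataset name mydata is hypothetical; var1 and partyid are the names used in the examples above. The memory setting is placed before the data are loaded on the assumption that STATA will not change the memory allocation while a dataset is in memory.

```stata
* Start with no data in memory, then set memory before loading
clear
set mem 100m

* mydata is a hypothetical dataset containing the variable var1
use mydata

* Inspect the variables and get detailed summary statistics
describe
summarize var1, detail

* Attach a variable label, define value labels, and apply them
label variable var1 "party ID"
label define partyid 0 "republican" 1 "independent" 2 "democrat"
label values var1 partyid

* The table now shows the value labels instead of 0, 1 and 2
tabulate var1
```

Saving the dataset afterwards keeps the labels with the data for later sessions.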



1. STATA’s website has a number of useful resources, like help files and FAQs.
2. STATA also has a listserv called STATA list. To subscribe to STATA list, you can consult the STATA list FAQs.
3. Reference manuals are also a great source of information; hopefully we will have them available to you early in the semester.

