Stata Programming II September 8, 2003 1

Eric Reinhardt (Department of Political Science, Emory University, Atlanta, GA 30322)

Getting Data into Stata

• Datasets not in Stata format vary along two main dimensions: (1) whether the file is in a proprietary
binary format associated with a spreadsheet, database, or statistical program (Microsoft Excel,
dBase, SPSS, etc.) or in raw text (ASCII) format; and (2) whether each observation occupies a single
record or multiple records. Raw text can be fixed format (each variable starts and ends in specified
columns, so the file looks like a giant rectangle, often with no spaces between digits) or delimited
(variables are separated by spaces, commas, tabs, or other characters). Since most proprietary
programs can save their datasets as raw text, Stata is designed to import only raw text files (in
addition to Stata format itself, of course).
• Single-record comma- or tab-delimited files are easiest to import. Set Stata’s memory sufficiently
high, then use the insheet command: insheet using filename.ext, clear [comma] [names],
where ext is the file’s extension (e.g., csv for many comma-delimited files). The options [comma]
and [names] tell Stata, respectively, that the file is comma-delimited (as opposed to tab-delimited)
and that its first row holds the variable names.
• For space-delimited, fixed-format, or multiple-record files, use the infile command, generally
combined with a dictionary file. Refer to Stata’s help or my class example for more information.
• Once you have a dataset in Stata’s memory, you may need to check the variables’ storage types (to
make sure numeric variables are stored as numeric rather than as strings), modify the variable
names and descriptions, etc.
• To export a Stata dataset in memory into a comma-delimited format which can be read into Excel,
etc., just use the outsheet command: outsheet using filename.csv, comma names replace.
You can optionally export only selected variables or observations, by listing the variables after
outsheet, or by using the if condition before the comma, respectively.
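
The import and export steps above can be sketched as a short round trip. This is only an illustration: the variables (id, country, gdp) and the file names example.csv and subset.csv are hypothetical, and the data are generated on the spot so nothing depends on an external file.

```stata
* Build a tiny dataset in memory (all names and values are hypothetical)
clear
set obs 3
generate id = _n
generate str10 country = "A"
replace country = "B" in 2
replace country = "C" in 3
generate gdp = 100 * _n

* Export everything to a comma-delimited file
* (variable names are written to the first row by default)
outsheet using example.csv, comma replace

* Export only selected variables and observations
outsheet id gdp using subset.csv if gdp > 100, comma replace

* Read the full file back in, replacing the data in memory
insheet using example.csv, comma names clear

* Check storage types, then tidy up names and descriptions
describe
rename gdp gdp_usd
label variable gdp_usd "GDP (hypothetical units)"
```

After insheet, describe should show country stored as a string and id and gdp stored as numeric; if a numeric column arrives as a string, the raw file usually contains stray non-numeric characters.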

Merging Stata Datasets

• To merge rows and variables from one dataset into another based on matching values of a variable or
variables, use the merge command.
• First, verify that each dataset has identically named match variable(s), with identical values
where matches should occur.
• Second, sort each dataset by the match variable and save each file.
• Third, open up the file into which you want to import the new rows/columns.
• Fourth, type merge matchvar using dataset2, where dataset2 is the dataset you want to import
from and matchvar is the match variable(s). This process creates a new variable, _merge, which
indicates (see help merge) which rows were originally in the dataset you started with and which came
solely from the dataset you imported from.
• By default, the merge command does not change the values of existing rows or columns if the ones in
the importing dataset are different. Use the , update option to replace cells that were missing in the
original dataset with filled-in values from the importing dataset. Add replace to it (i.e.,
, update replace) to also overwrite filled-in values from the original dataset with filled-in values
from the importing dataset. Stata will not replace filled-in values from the original dataset with
missing values from the importing dataset.
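
The four merge steps can be sketched as follows, in the Stata 8 syntax this handout uses (later Stata releases introduced a different merge syntax, so treat this as a sketch for the version described here). The dataset names dataset1 and dataset2 and the variables id, gdp, and pop are hypothetical.

```stata
* Hypothetical dataset to import from: ids 2-4, with a pop variable
clear
set obs 3
generate id = _n + 1
generate pop = 10 * id
sort id
save dataset2, replace

* Hypothetical starting dataset: ids 1-3, with a gdp variable
clear
set obs 3
generate id = _n
generate gdp = 100 * id
sort id
save dataset1, replace

* Open the starting dataset and merge dataset2 into it, matching on id
use dataset1, clear
merge id using dataset2

* _merge: 1 = only in the starting data, 2 = only in dataset2, 3 = in both
tabulate _merge
list

* To fill in missing values from dataset2 as well, one would instead type:
*   merge id using dataset2, update
```

Here id 1 appears only in the starting data, id 4 only in dataset2, and ids 2 and 3 in both. Drop or rename _merge before running another merge on the same data.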

Aggregating and Disaggregating Observations

• Aggregate rows sharing a common value of any given variable using the collapse command. First,
sort by the variable(s) on which you want to aggregate. Second, type collapse (function)
newname1=varname1 (function) newname2=varname2, by(aggvar), where aggvar is the variable
which is the basis of the aggregation, function is the code representing the particular aggregating
function you would like performed (e.g., mean, min, max, sum, count), varname1 is the first variable
which you would like to perform this aggregating function on, and newname1 is the name of the
variable Stata will create when it performs this aggregating function on varname1.
• The collapse command creates a new, smaller, dataset in memory, and drops the existing one, so save
if necessary before you use it.
• You can multiply rows, creating x number of rows for each existing row, by typing expand x.
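
A minimal sketch of collapse and expand, using generated data. The variable names (group, x, y, mean_x, total_y) and the file name before_collapse are assumptions made for illustration.

```stata
* Six hypothetical observations in two groups of three
clear
set obs 6
generate group = 1 + (_n > 3)
generate x = _n
generate y = 2 * _n

* collapse discards the data in memory, so save first
save before_collapse, replace

* One row per group: the mean of x and the sum of y
sort group
collapse (mean) mean_x=x (sum) total_y=y, by(group)
list

* Duplicate each remaining row, giving two copies of every observation
expand 2
list
```

With these values, group 1 has mean_x = 2 and total_y = 12, and group 2 has mean_x = 5 and total_y = 30; after expand 2 the dataset holds four rows.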

Transposing

• You can shift your dataset’s structure (e.g., from a structure with values of one variable across time
recorded in separate columns for each time period, to a structure with all values of that variable
recorded in one column, with separate rows for each time period) with the reshape command.
• See help reshape for more information.
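
The restructuring described above, from one column per time period to one row per time period, might look like this. The variable names (id, gdp2001 through gdp2003, year) are hypothetical.

```stata
* Wide structure: one row per unit, one gdp column per year
clear
set obs 2
generate id = _n
generate gdp2001 = 100 * id
generate gdp2002 = 110 * id
generate gdp2003 = 120 * id
list

* Long structure: one row per unit-year, with a single gdp column
* i() names the unit identifier; j() names the new variable holding the year
reshape long gdp, i(id) j(year)
list

* Convert back to the wide structure
reshape wide gdp, i(id) j(year)
list
```

The stub gdp tells reshape which columns to stack; the numeric suffix of each column name (2001, 2002, 2003) becomes the value of year.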

Graphing

• graph produces a wide variety of types of two-dimensional figures.


• [graph twoway] scatter y x produces a scatterplot of variable y against variable x; you don’t have
to type in the words in square brackets [] to make the command work.
• To plot multiple dependent variables against one common independent variable, e.g., year, type
scatter y1 y2 y3 year.
• To change the marker symbol used for each y-x pair from the default dot, add the msymbol() option
after the comma in the main graph command. See Stata help for the list of values msymbol() accepts.
You can even use a variable’s values as the point markers, instead of dots, etc., by typing
scatter y x, msymbol(i) mlabposition(0) mlabel(variablename).
• You can connect the dots for selected y-x pairs in the graph, with the connect() option: scatter y1
year, connect(L) clpattern(solid) will connect the dots with a solid line (change what’s in the
clpattern parentheses for a dashed line, etc.). You might want to sort the data by the x-variable
before graphing; you can do this within the graph command by adding the sort(xvar) option. You
can do different connection styles for different y variables, a la scatter y1 y2 y3 year,
connect(L L .) clpattern(solid dash .), which connects the y1 dots with a solid line, the y2
series with a dashed line, and doesn’t draw any line connecting the y3 dots.
• You can title the axes of a graph using the xtitle() and ytitle() options after the graph command’s
comma, e.g., xtitle(This is the Year Number). To choose the particular values labeled along an axis,
use the xlabel() and ylabel() options. See help graph for more information.
• You can copy the graph with the Edit-Copy menu selection, and paste into MS Word or elsewhere.
You can also save the graph using the saving() option after the comma: e.g., scatter y x,
saving(filename, replace). This saves a new file named filename.gph in your working directory.
• ksm y x, low produces a scatterplot with a smoothed Lowess-style regression line, useful for
displaying a trend in one simple command. It takes the usual graph options described above.
• The command histogram y produces a histogram of the variable y. You can add an option to set the
number of ‘bins’ or vertical bars, e.g., histogram y, bin(40).
• You can make a bar graph of group means (or medians or any other characteristic) like so: graph bar
(mean) y, over(x). Draw it with horizontal bars by typing graph hbar, with otherwise the same input.
• You can combine Stata graphs in a variety of ways, to overlay one series on top of another, for
example.
• You can produce a wide variety of different looks on Stata 8 graphs, using different fonts, labels,
legends, multiple axis scales, etc. The graphing power in Stata 8 is extraordinarily flexible, but,
starting in this version, somewhat complex. See help graph for more information.
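
Several of the graph commands above can be tried on generated data, as in the sketch below. All variable names (year, y1, y2, y3, code, decade) and the file name example_scatter are hypothetical, and each command draws a new graph that replaces the previous one in the Graph window.

```stata
* Twenty hypothetical yearly observations
clear
set obs 20
generate year = 1980 + _n
generate y1 = _n
generate y2 = 20 - _n
generate y3 = 10 + mod(_n, 4)
generate str4 code = "c" + string(_n)

* Three series against year: y1 solid line, y2 dashed line, y3 unconnected
scatter y1 y2 y3 year, connect(L L .) clpattern(solid dash .) sort

* Axis titles, with the graph saved as example_scatter.gph
scatter y1 year, xtitle("Year") ytitle("Value") saving(example_scatter, replace)

* Use the values of a variable as the point markers instead of dots
scatter y1 year, msymbol(i) mlabposition(0) mlabel(code)

* Histogram with a chosen number of bins
histogram y3, bin(4)

* Bar graphs of group means, vertical and then horizontal
generate decade = 10 * floor(year / 10)
graph bar (mean) y1, over(decade)
graph hbar (mean) y1, over(decade)
```

The sort option orders the data by the x-variable before the points are connected, which matters for connect(L) because that style only joins points while x is ascending.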