Stata Slides Day3

Federal Department of Economic Affairs,
Education and Research EAER

Agroscope
Day 3: Preparing data
Andreas Kohler and Anne Wunderlich
Data preparation | Day 3

Kohler & Wunderlich | ©Agroscope | Institute for Sustainability Sciences ISS | Tänikon 1, 8356 Ettenhausen
Agenda Day 3
• Importing datasets
• Transposing & reshaping datasets
• Collapsing & expanding datasets
• Combining datasets
• Generating & transforming variables
• Missing values
• Storage types & output format
• MATA
• Data checks

Lessons from Day 2
Yesterday, we made a strong case for using well-documented do-files
in Stata to ensure we can reproduce all our work.
A crucial step in empirical work is data preparation. In my experience,
about two-thirds of the total time spent on an empirical project is
devoted to preparing the data so that it can be used in an
empirical analysis. This is the cumbersome but (usually) unavoidable
part of empirical work. So it always pays off to use a (separate) do-file
to document your data preparation steps.

Importing datasets
File paths
• Use a global macro to define the file path where you store your
datasets: global mname "file path"
• Never work with your original dataset - always use a copy!
Stata can import a variety of file formats (see help import).
Often, datasets come in one of the following (spreadsheet) formats:
• text (tab-separated or comma-separated) format: insheet
[varlist] using filename [, options] (superseded by import
delimited since Stata 13)
• Microsoft Excel (.xls and .xlsx) format: import excel [using]
filename [, import_excel_options]
• Stata format: use filename [, clear]
Remarks:
• Stata can also export datasets to other file formats, e.g. using
export excel or outsheet (superseded by export delimited)

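The import and export commands above can be tried out with a minimal sketch like the following. It assumes Stata 13 or later (for import delimited and export delimited) and uses the auto example dataset that ships with Stata. The macro name datadir and the file names are hypothetical, and the temporary directory only stands in for your own data folder.

```stata
* Hypothetical data folder; the temp directory is used so the sketch runs anywhere
global datadir "`c(tmpdir)'"

* Work on a copy of the example data, never on the original
sysuse auto, clear
save "$datadir/auto_copy.dta", replace

* Export the copy to comma-separated text and to Excel
export delimited using "$datadir/auto_copy.csv", replace
export excel using "$datadir/auto_copy.xlsx", firstrow(variables) replace

* Re-import each format; clear replaces the data currently in memory
import delimited using "$datadir/auto_copy.csv", clear
import excel using "$datadir/auto_copy.xlsx", firstrow clear
use "$datadir/auto_copy.dta", clear
```

Defining the path once in a global macro means only one line has to change when the data folder moves.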
Transposing & reshaping datasets
Once in a blue moon, we would like to transpose our dataset, i.e.
interchange columns and rows
• xpose, clear [options]
More often, we would like to switch between “long” format (each
value of a unit, e.g. each year of a person, is in a separate row) and
“wide” format (all values of the same unit are in one row)
• reshape wide stub, i(i) j(j), where j is an existing variable
• reshape long stub, i(i) j(j), where j is a new variable

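As an illustration of the two reshape directions, here is a small sketch on made-up data. The variable names id, inc and year and all values are invented for the example.

```stata
* Made-up wide data: one row per person, one income column per year
clear
input id inc2019 inc2020
1 100 110
2 200 190
end

* Wide -> long: the stub inc becomes one variable, year is newly created
reshape long inc, i(id) j(year)
list, sepby(id)

* Long -> wide: year already exists and its values become column suffixes
reshape wide inc, i(id) j(year)
list
```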
Collapsing, expanding & contracting datasets
Sometimes we would like to collapse our dataset to statistical
measures computed across groups (e.g. means for males and females).
• collapse [(stat)] varlist [if] [in] [weight] [,
options]
Sometimes we would like to expand or contract our dataset
• replace each observation in the dataset with n (= exp) copies of the
observation: expand [=]exp [if] [in] [,
generate(newvar)]
• replace the dataset with a new dataset consisting of all combinations
of varlist that exist in the data: contract varlist [if] [in]
[weight] [, options]

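The three commands might be used as in the following sketch, which relies on the auto example dataset shipped with Stata. The new variable names mean_price, n and copy are hypothetical.

```stata
sysuse auto, clear

* collapse: one row per group with the mean price and the number of cars
preserve
collapse (mean) mean_price=price (count) n=price, by(foreign)
list
restore

* contract: one row per existing combination, with its frequency in _freq
preserve
contract foreign rep78
list
restore

* expand: two copies of every observation; copy marks the added duplicates
expand 2, generate(copy)
count
```

Wrapping collapse and contract in preserve and restore brings the full dataset back afterwards, since both commands replace the data in memory.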
Combining datasets
Combining datasets vertically (same variables, different observations)
• append datasets: append using filename [, options]
Combining datasets horizontally (same observations, different variables)
• merge datasets one-to-one: merge 1:1 varlist using
filename [, options]
• merge datasets many-to-many: merge m:m varlist using
filename [, options] (use with care; a many-to-one, m:1, or
one-to-many, 1:m, merge is usually what is needed)

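A minimal sketch of both ways of combining, again based on the auto example dataset. The temporary file names prices and first10 are hypothetical, and the code is meant to be run as one do-file because temporary files live in local macros.

```stata
sysuse auto, clear

* Build two helper files from the example data
preserve
keep make price
tempfile prices
save `prices'
restore

preserve
keep in 1/10
tempfile first10
save `first10'
restore

* Horizontal: add price to a dataset that holds only make and mpg
keep make mpg
merge 1:1 make using `prices'
tab _merge
assert _merge == 3      // every observation was matched
drop _merge

* Vertical: stack ten more observations below the current ones
append using `first10'
count
```

Checking _merge after every merge is a cheap way to notice unmatched observations early.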
Sampling
We can draw a random sample (without replacement) from our
dataset
• sample # [if] [in] [, count by(groupvars)], where # is a
percentage unless count is specified
Caution
• observations not drawn are dropped from memory
• to reproduce results based on a random sample, first set the
seed of the random-number generator to a specific but arbitrary
value using set seed #

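The following sketch shows both forms of sample on the auto example dataset. The seed value 12345 is arbitrary.

```stata
sysuse auto, clear

* Fix the seed first so the same observations are drawn on every run
set seed 12345

* Draw 50 percent of the observations, then bring the full data back
preserve
sample 50
count
restore

* Draw exactly 10 observations; all others are dropped from memory
sample 10, count
list make price
```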
Generating & dropping variables
Generate a new variable
• generate [type] newvar[:lblname] =exp [if] [in]
• extensions to generate: egen [type] newvar =fcn(arguments)
[if] [in] [, options]
Use short and meaningful variable names and label them (see help
rename and help label var). Note that some names are reserved and
not allowed as variable names (see help reserved).
Delete data (alternatively, see help keep)
• Variables are deleted from the dataset using drop varlist
• Observations are deleted using drop if exp

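A short sketch of these commands on the auto example dataset. The new names price_k, mean_mpg and repair are hypothetical.

```stata
sysuse auto, clear

* New variable with an explicit storage type, plus a label
generate double price_k = price / 1000
label variable price_k "Price (1,000 USD)"

* egen: group-wise mean of mpg for domestic and foreign cars
egen mean_mpg = mean(mpg), by(foreign)

* Give a cryptic variable a more meaningful name
rename rep78 repair

* Drop variables, then drop observations that meet a condition
drop headroom trunk
drop if missing(repair)
count
```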
Transforming variables
Instead of generating a new variable, we sometimes want to replace
an existing variable, i.e. change the contents of an existing variable
• replace oldvar =exp [if] [in] [, nopromote]
An easy way to recode categorical variables (i.e. variables that take
on a limited number of possible values)
• recode varlist (rule) [(rule) ...] [, gen(newvar)]
Recoding using by, _n and _N
• We have already seen the prefix by, which repeats a
command for each subset (group) of the data
• _n contains the number (position) of the current observation
• _N contains the total number of observations, i.e. the highest
value of _n
• with by, both _n and _N are counted within each group
Explicit subscripts
• varname[...] refers to the value of varname in the observation
given in brackets

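These pieces can be combined as in the sketch below, which uses the auto example dataset. The new names rep3, rank, groupsize and price_diff are hypothetical.

```stata
sysuse auto, clear

* Change the contents of an existing variable (Stata promotes the
* storage type from int to float automatically)
replace price = price / 1000

* Recode five repair categories into three, keeping the original variable
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* Within each group of foreign, sorted by price:
* _n is the position in the group, _N is the group size
bysort foreign (price): generate rank = _n
by foreign: generate groupsize = _N

* Explicit subscript: difference to the previous observation in the group
* (missing for the first observation of each group)
by foreign: generate price_diff = price - price[_n-1]

list foreign price rank groupsize price_diff in 1/5
```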
Missing values
Stata has 27 numeric missing values (see help missing)
• default (or system missing value) is denoted by . (dot)
• extended missing values are denoted by .a, .b, ..., .z
• functions: missing()
Numeric missing values are represented by large positive values. The
ordering is
• all nonmissing numbers < . < .a < .b < ... < .z
• Example: expression age > 60 is true if variable age is greater
than 60 or missing
Stata has one string missing value, denoted by "" (blank).
Caution:
• Make sure you know the reasons why values are missing!
• Most Stata commands ignore observations that are missing in
one or more of the variables referred to in the command

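The ordering rule is easy to overlook in conditions, so here is a small sketch on the auto example dataset, where rep78 has some missing values. The meaning attached to .a is an assumption made for the example.

```stata
sysuse auto, clear

* How many observations have a missing repair record?
count if missing(rep78)

* Pitfall: missing values count as "greater than 3"
count if rep78 > 3

* Safer: exclude missing values explicitly
count if rep78 > 3 & !missing(rep78)

* Extended missing value to record a reason (here assumed: not reported)
replace rep78 = .a if rep78 == .
tabulate rep78, missing
```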
Storage types & output format
Stata knows the following data types (see help datatypes)
• byte
• int (integer)
• long
• float
• double
• string (alphanumeric)
Data types
• differ in their precision and in the memory they use
• Stata chooses the data type for us (default for numeric variables is
float)
Output format (see help format)
• format varlist %fmt

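A brief sketch of why the storage type matters and how format changes only the display, using the auto example dataset. The variable names x and y are hypothetical.

```stata
sysuse auto, clear

* Show storage types and display formats
describe price mpg make

* Precision: 0.1 stored as float is not equal to the double constant 0.1
generate x = 0.1            // default type: float
generate double y = 0.1
count if x == 0.1           // no match because of float rounding
count if x == float(0.1)    // matches every observation
count if y == 0.1           // matches every observation

* format changes how values are displayed, not what is stored
format price %9.2fc
list price in 1/3

* compress picks the smallest storage type that loses no information
compress
```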
MATA
With MATA (see help mata), Stata also offers a matrix programming
language to perform matrix calculations. A very simple example is
shown below
mata
A = (1, 2 \ 3, 4)
B = (1, 2 \ 3, 4)
C = A*B
end
We will not cover MATA in this program (although we will use some
MATA features when we retrieve regression results stored by Stata).
Those interested can find more information here
• http://www.stata.com/features/matrix-programming-mata/

Data checks
It is important to check your Stata code and data after manipulations:
• errors: your code does not contain errors ⇒ do-file runs without
execution errors (see help trace)
• bugs: your code does not contain bugs ⇒ code does what you
want it to do (e.g. code might run without execution errors but
not produce the result you want)
Moreover, it is good practice to check the raw data you collect even
before you start manipulating it!

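A few built-in commands that could serve as such checks are sketched below on the auto example dataset. The specific conditions are assumptions about what "correct" means for these data, not rules from the course.

```stata
sysuse auto, clear

* Does make uniquely identify observations? isid stops with an error if not
isid make

* assert stops the do-file if a condition is violated
assert price > 0 & !missing(price)
assert inrange(rep78, 1, 5) | missing(rep78)

* Look for duplicates and get a compact overview of the variables
duplicates report make
codebook price rep78, compact

* Detailed summary statistics help to spot outliers
summarize price, detail
```

Placing assert statements directly after each manipulation makes a do-file fail loudly instead of silently producing wrong results.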
Outlook Day 4
Today, we have seen how to import and prepare data in Stata such
that it can be used in an empirical analysis.
Before we start running complex models, we always use summary
statistics in the form of tables and graphs to describe our data. This is
good practice because it allows us to spot anomalies (possible errors,
outliers etc.) and, more generally, gives us a “feeling” for our data.
Thus, tomorrow
• we will learn how to describe data using tables and graphs in
Stata

Thank you for your attention
Agroscope good food, healthy environment
