
Federal Department of Economic Affairs,

Education and Research EAER


Agroscope

Day 3: Preparing data

Andreas Kohler and Anne Wunderlich

Data preparation | Day 3


Kohler & Wunderlich | ©Agroscope | Institute for Sustainability Sciences ISS | Tänikon 1, 8356 Ettenhausen

Agenda Day 3
• Importing datasets
• Transposing & reshaping datasets
• Collapsing & expanding datasets
• Combining datasets
• Generating & transforming variables
• Missing values
• Storage types & output format
• MATA
• Data checks

Lessons from Day 2

Yesterday, we made a strong case for using well-documented do-files
in Stata to ensure that we can reproduce all our work.

A crucial step in empirical work is data preparation. In my experience,
about two-thirds of the total time spent on an empirical project is
devoted to preparing the data so that it can be used in the analysis.
This is the cumbersome but (usually) unavoidable part of empirical
work, so it always pays off to use a (separate) do-file to document
your data preparation steps.

Importing datasets
File paths
• Use a global macro to define the file path where you store your
datasets: global mname "file path"
• Never work with your original dataset - always use a copy!

Stata can import a variety of file formats (see help import).
Datasets often come in one of the following formats
• text (tab-separated or comma-separated) format: insheet
[varlist] using filename [, options] (superseded by import
delimited as of Stata 13)
• Microsoft Excel (.xls and .xlsx) format: import excel [using]
filename [, import_excel_options]
• Stata format: use filename [, clear]

Remarks:
• Stata can also export datasets to other file formats, e.g. using
export excel or outsheet (superseded by export delimited as of
Stata 13)
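
A minimal sketch of the file-path and import steps above. The macro name datadir and the file name auto_copy.csv are hypothetical, and the folder is simply the current working directory; import delimited and export delimited are the current replacements for insheet and outsheet.

```stata
* Hypothetical project folder: here simply the current working directory
global datadir "`c(pwd)'"

* Work on a copy: write Stata's shipped example data to a CSV file ...
sysuse auto, clear
export delimited using "$datadir/auto_copy.csv", replace

* ... and read the copy back in
import delimited using "$datadir/auto_copy.csv", clear
describe, short
```

Changing the global macro in one place is enough to run the same do-file on another computer.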
Transposing & reshaping datasets
Once in a blue moon, we would like to transpose our dataset, i.e.
interchange rows and columns
• xpose, clear [options]

More often, we would like to switch between “long” format (each
observation of a unit, e.g. each year, is in a separate row) and
“wide” format (all observations of the same unit are in one row)
• reshape wide stub, i(i) j(j), where j is an existing variable
• reshape long stub, i(i) j(j), where j is a new variable
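
As a small illustration of reshape, the sketch below uses a made-up dataset (the variable names id, inc2019 and inc2020 are hypothetical) and moves it from wide to long and back again.

```stata
* Made-up wide dataset: one row per unit, one income column per year
clear
input id inc2019 inc2020
1 100 110
2 200 210
end

* Wide -> long: the stub is inc, year is a new variable built from the suffixes
reshape long inc, i(id) j(year)
list

* Long -> wide: year now exists and supplies the suffixes again
reshape wide inc, i(id) j(year)
list
```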

Collapsing, expanding & contracting datasets

Sometimes we would like to collapse our dataset into statistical
measures computed across groups (e.g. means for males and females).
• collapse [(stat)] varlist [if] [in] [weight] [,
options]

Sometimes we would like to expand or contract our dataset
• replace each observation in the dataset with n (= exp) copies of
the observation: expand [=]exp [if] [in] [,
generate(newvar)]
• replace the dataset with a new dataset consisting of all
combinations of varlist that exist in the data, together with
their frequencies: contract varlist [if] [in] [weight] [,
options]
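
The following sketch tries out collapse, contract and expand on the auto dataset that ships with Stata. The use of preserve and restore is our own addition, so that each command starts from the full data; the variable name copy is hypothetical.

```stata
sysuse auto, clear

* collapse: one row per group with the group means
preserve
collapse (mean) price mpg, by(foreign)
list
restore

* contract: one row per existing combination, with its frequency in _freq
preserve
contract foreign rep78
list
restore

* expand: two copies of each observation; copy flags the duplicates
expand 2, generate(copy)
count
```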

Combining datasets
Combining datasets vertically (same var., different obs.)
• append datasets: append using filename [, options]

Combining datasets horizontally (same obs., different var.)
• merge datasets one-to-one: merge 1:1 varlist using
filename [, options]
• merge datasets many-to-many: merge m:m varlist using
filename [, options] (rarely what you want; the Stata manual
advises using m:1 or 1:m instead)
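
A minimal sketch of append and merge 1:1 on made-up data. All variable names and values are hypothetical, and temporary files stand in for datasets stored on disk.

```stata
* First dataset: two observations
clear
input id x
1 10
2 20
end
tempfile part1
save `part1'

* Second dataset: same variables, a different observation
clear
input id x
3 30
end
append using `part1'          // vertical: now 3 observations
sort id
tempfile stacked
save `stacked'

* Third dataset: same observations, a different variable
clear
input id y
1 5
2 6
3 7
end
merge 1:1 id using `stacked'  // horizontal: adds x
tabulate _merge
```

It is worth inspecting _merge after every merge: it records whether an observation came from the master data, the using data, or both.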

Sampling

We can draw a random sample (without replacement) from our dataset
• sample # [if] [in] [, count by(groupvars)]
• # is the percentage of observations to keep, or the number of
observations if count is specified

Caution
• observations not drawn are dropped from memory
• to reproduce results based on a random sample, first set the
starting value (seed) of the random-number generator to a
specific but arbitrary value using set seed #
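
A short sketch of a reproducible sample, using Stata's auto example data; the seed value 12345 is arbitrary.

```stata
sysuse auto, clear

* Fix the seed first so that the same observations are drawn on every run
set seed 12345

* Keep exactly 10 randomly chosen observations; all others are dropped
sample 10, count
count
```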

Generating & dropping variables
Generate a new variable
• generate [type] newvar[:lblname] =exp [if] [in]
• extensions to generate: egen [type] newvar =fcn(arguments)
[if] [in] [, options]

Use short and meaningful variable names and label them (see help
rename and help label). Note that reserved names are not allowed as
variable names (see help reserved).

Delete data (or, alternatively, keep data; see help keep)
• Variables are deleted from the dataset using drop varlist
• Observations are deleted using drop if exp
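
The commands above might be combined as in the following sketch on the auto example data. The new variable names price_k and mean_mpg are hypothetical.

```stata
sysuse auto, clear

* generate and label a new variable
generate price_k = price/1000
label variable price_k "Price in 1,000 USD"

* egen: group-level mean of mpg, repeated on every row of the group
egen mean_mpg = mean(mpg), by(foreign)

* drop a variable, then drop observations with a missing repair record
drop trunk
drop if missing(rep78)
count
```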
Transforming variables
Instead of generating a new variable, we sometimes want to change
the contents of an existing variable
• replace oldvar =exp [if] [in] [, nopromote]

An easy way to recode categorical variables (i.e. variables that take
on a limited number of possible values)
• recode varlist (rule) [(rule) ...] [, gen(newvar)]

Recoding using by, _n and _N
• We have already seen the prefix by, which allows us to repeat a
command on subsets of the data
• _n contains the number of the current observation (within the
by-group, if by is used)
• _N contains the total number of observations, i.e. the highest
value of _n (within the by-group, if by is used)

Explicit subscripts
• varname[...] refers to the value of varname in a specific
observation, e.g. varname[_n-1] is the value in the previous
observation
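
A minimal sketch of these tools on the auto example data. The new variable names (expensive, rep3, rank, groupsize, cheapest) and the cut-off of 6000 are hypothetical choices for illustration.

```stata
sysuse auto, clear

* replace: change the contents of an existing variable
generate byte expensive = 0
replace expensive = 1 if price > 6000

* recode: collapse the five repair categories into three
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* by, _n and _N: position within the group and group size
bysort foreign (price): generate rank = _n
by foreign: generate groupsize = _N

* explicit subscript: price of the cheapest car in each group
by foreign: generate cheapest = price[1]
```

Sorting by price within foreign is what makes price[1] the lowest price of each group.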
Missing values
Stata has 27 numeric missing values (see help missing)
• default (or system missing value) is denoted by . (dot)
• extended missing values are denoted by .a, .b, ..., .z
• functions: missing()

Numeric missing values are represented by large positive values. The
ordering is
• all nonmissing numbers < . < .a < .b < ... < .z
• Example: the expression age > 60 is true if the variable age is
greater than 60 or missing

Stata has one string missing value, denoted by "" (empty string).

Caution:
• Make sure you know the reasons why values are missing!
• Most Stata commands ignore observations that are missing in
one or more of the variables referred to in the command
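
The ordering of missing values is easy to overlook in if conditions. The sketch below uses four made-up ages to show the effect and one way to guard against it with missing().

```stata
* Made-up data: two valid ages, one system missing, one extended missing
clear
input age
25
70
.
.a
end

count if age > 60                   // 3: the value 70 and both missing values
count if age > 60 & !missing(age)   // 1: only the value 70
```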
Storage types & output format

Stata knows the following data types (see help datatypes)
• byte
• int (integer)
• long
• float
• double
• string (alphanumeric)

Data types
• differ in precision and in the memory they use
• Stata chooses the data type for us (the default for numeric
variables is float)

Output format (see help format)
• format varlist %fmt
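
Why precision matters can be seen in a small sketch with made-up variables x and y: a float cannot store 0.1 exactly, so comparing it with the (double-precision) literal 0.1 fails. The display format, by contrast, only changes how values are shown, not how they are stored.

```stata
clear
set obs 1

generate float x = 0.1
count if x == 0.1            // 0: the float 0.1 differs from the double 0.1
count if x == float(0.1)     // 1: compare at float precision instead

generate double y = 0.1
count if y == 0.1            // 1: a double matches the literal

* Output format: show x with two decimals (the stored value is unchanged)
format x %4.2f
list
```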
MATA

Stata also offers MATA (see help mata), a matrix programming
language for matrix calculations. A very simple example is
shown below
mata
A = (1, 2 \ 3, 4)   // 2 x 2 matrix; the backslash separates rows
B = (1, 2 \ 3, 4)
C = A*B             // matrix product
C                   // display the result
end

We will not cover MATA in this program (although we will use some
MATA features when we retrieve regression results stored by Stata).
Those interested can find more information here
• http://www.stata.com/features/matrix-programming-mata/

Data checks

It is important to check your Stata code and data after manipulations:
• errors: your code does not contain errors ⇒ the do-file runs
without execution errors (see help trace)
• bugs: your code does not contain bugs ⇒ the code does what you
want it to do (code might run without an execution error and
still not produce the result you want)

Moreover, it is good practice to check the raw data you collect even
before you start manipulating it!
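
A few checks of this kind can be written directly into the do-file. The sketch below applies them to the auto example data; which checks make sense for your own data is of course an assumption you have to make yourself.

```stata
sysuse auto, clear

* The identifier must be unique; isid stops with an error otherwise
isid make

* assert stops the do-file if the condition is false for any observation
assert price > 0 & !missing(price)

* Look for duplicates and get a compact overview of a variable
duplicates report make
codebook rep78, compact

* How many observations have a missing repair record?
count if missing(rep78)
```

Because assert and isid abort the do-file when a check fails, a violated assumption cannot go unnoticed.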

Outlook Day 4

Today, we have seen how to import and prepare data in Stata such
that it can be used in an empirical analysis.

Before we start running complex models, we always use summary
statistics in the form of tables and graphs to describe our data. This is
good practice because it allows us to spot anomalies (possible errors,
outliers etc.) and, more generally, gives us a “feeling” for our data.
Thus, tomorrow
• we will learn how to describe data using tables and graphs in
Stata

Thank you for your attention

Agroscope good food, healthy environment
