Data Cleaning: Hints and Tips
Introduction
Data cleaning is one of the most
time-consuming jobs of all!
There are many ways of attacking the
same problem in Stata
The talk will describe some common
problems and propose possible solutions
These are mostly reminders!
Felicity Clemens 18 May
2005
Contents
1) Introduction to the first datasets
2) Identifying and removing duplicates
by hand
3) Merging data and uses of the
merge command
4) Generating a moving target
variable
The study
A case-control study carried out across 3
central European countries
Exposure of interest: exposure to
chemicals in the environment
Outcome of interest: cancer
Identifying duplicates in a
dataset
This can be done automatically (using the
duplicates set of commands)
We will demonstrate a manual method of
identifying duplicates
Two different possibilities:
The same data have been entered on more
than one occasion;
Different data have been entered using the
same identifier (id numbers)
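The two possibilities above can be told apart by hand by counting records within groups. The following is a minimal sketch only: the toy data and the variable names exposure, outcome, n_id and n_same are assumptions made for illustration and are not taken from the study dataset.

```stata
* Hypothetical toy data: id 2 was entered twice with identical values,
* id 3 was used for two different records
clear
input id exposure outcome
1 10 0
2 15 1
2 15 1
3 20 0
3 25 1
4 30 0
end

* Number of records sharing each id
bysort id: gen n_id = _N

* Number of records that are identical on every variable
bysort id exposure outcome: gen n_same = _N

* Possibility 1: the same data entered on more than one occasion
list id exposure outcome if n_same > 1

* Possibility 2: different data entered under the same id
list id exposure outcome if n_id > 1 & n_same < n_id

* Keep one copy of each exact duplicate; records of type 2 are left
* in place because they need checking against the source documents
bysort id exposure outcome: drop if _n > 1
drop n_id n_same
```

The automatic route mentioned above would use commands such as duplicates report id and duplicates drop; the manual version has the advantage of showing exactly which records are affected before anything is removed.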
Identifying a moving target
Scenario: we have data for each town giving
the chemical concentration for each year
between 1982 and 2002
Problem: counting backwards from 2002, we
need to identify the year in which the chemical
concentration changed from its 2002 level
Why? We need to overwrite the 2002 value
with a new value, and keep overwriting backwards
until we reach the year in which the value changed
Identifying a moving target (2)
rescode    y1990   y1991   y1992
1010113    65      32      32
1010114    41      41      41
1010115    78      23      23
1010116    44      44      44
1010117    82      82      29
1010118    25      25      25
1010119    12      12
1010120    40      12
Identifying a moving target (3)
We will use the forval loop to examine the
relationship between each year's
observed value and the observed value
for the previous year
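One way such a loop might look is sketched below, assuming the wide layout shown on the previous slide (one variable per year) and treating 1992 as the final year in place of 2002. Only the rows with all three values are used. The names firstyr and done are hypothetical, and this is an illustration rather than the code used in the talk.

```stata
* Toy data: the complete rows from the example table
clear
input long rescode y1990 y1991 y1992
1010113 65 32 32
1010114 41 41 41
1010115 78 23 23
1010116 44 44 44
1010117 82 82 29
1010118 25 25 25
end

* firstyr: earliest year of the unbroken run of values equal to the
* final-year (1992) level; done flags rows where the run has ended
gen firstyr = 1992
gen byte done = 0

* Step backwards, comparing each year with the previous year
forvalues y = 1992(-1)1991 {
    local prev = `y' - 1
    * the run ends as soon as the previous year's value differs
    replace done = 1 if y`prev' != y`y' & !done
    * otherwise extend the run one year further back
    replace firstyr = `prev' if !done
}

list rescode y1990 y1991 y1992 firstyr
```

For the full 1982 to 2002 series the loop range would become 2002(-1)1983 and the starting value of firstyr would be 2002. The years from firstyr up to the final year are then the ones to overwrite with the new value; the year in which a different value is first met, counting backwards, is firstyr - 1.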
Summary
Identifying duplicates can be done by
hand or automatically using the
duplicates set of commands
The merge command can be used to merge
on a specific variable, and to merge
multiple datasets
Generating a moving target variable: the
use of the forval loop
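As a reminder of the merge point above, here is a minimal sketch of merging on a specific variable. The datasets, the variable names exposure and cancer, and the tempfile name exposures are all hypothetical. The merge 1:1 syntax assumes Stata 11 or later; older versions use a different form of the command.

```stata
* Hypothetical exposure data, saved to a temporary file
clear
input id exposure
1 10
2 15
3 20
end
tempfile exposures
save `exposures'

* Hypothetical outcome data, held in memory as the master dataset
clear
input id cancer
1 0
2 1
4 1
end

* Match observations on id; _merge records the source of each one
* (1 = master only, 2 = using only, 3 = matched)
merge 1:1 id using `exposures'
tabulate _merge
```

To merge a further dataset afterwards, drop or rename _merge first, since merge will not overwrite an existing _merge variable.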