
Data cleaning: hints and tips


Felicity Clemens
Stata Users Group meeting
London, 17 & 18th May 2005

Felicity Clemens 18 May 2005

Introduction
Data cleaning is one of the most time-consuming jobs of all!
There are many ways of attacking the same problem when using Stata
The talk will describe some common problems and propose possible solutions
These are mostly reminders!

Contents
1) Introduction to the first datasets
2) Identifying and removing duplicates by hand
3) Merging data and uses of the merge command
4) Generating a moving target variable

The study
A case-control study carried out across 3 central European countries
Exposure of interest: exposure to chemicals in the environment
Outcome of interest: cancer


Identifying duplicates in a dataset
This can be done automatically (using the duplicates set of commands)
We will demonstrate a manual method of identifying duplicates
Two different possibilities:
The same data have been entered on more than one occasion;
Different data have been entered using the same identifier (id numbers)
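
The two possibilities above can be sketched by hand roughly as follows. This is only an illustration, not the method shown in the talk: the toy data and the variable names age, exposure, copy and n_id are hypothetical, and only id corresponds to the identifier mentioned on the slide.

```stata
* Hypothetical toy data: id is the identifier; age and exposure are made up
clear
input id age exposure
1 54 0
2 61 1
2 61 1
3 47 0
3 52 1
4 66 1
end

* Case 1: the same data entered on more than one occasion.
* Group on every variable and number the copies within each identical group.
bysort id age exposure: generate copy = _n
list if copy > 1
drop if copy > 1
drop copy

* Case 2: different data entered under the same id.
* Once exact copies are gone, any id that still occurs more than once
* identifies conflicting records that need checking against the source.
bysort id: generate n_id = _N
list if n_id > 1
```

The automatic route uses the built-in commands duplicates report, duplicates list, duplicates tag and duplicates drop; the manual version has the advantage that every step can be inspected before anything is dropped.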

The merge command


A necessary command in the data management of most big studies
There are many different uses of the merge command. We look at two of them:
Simple merge on id
Multiple merge on id
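
A minimal sketch of both uses, under stated assumptions. The three small datasets and the variables case, exposure and age are hypothetical. "Multiple merge" is read here as merging more than one file onto the same master data, which may not be exactly what the talk demonstrated. The syntax is the merge 1:1 form introduced in Stata 11; the Stata versions current in 2005 used merge id using filename on data already sorted by id.

```stata
* Hypothetical datasets, built here so that the example is self-contained
clear
input id case
1 1
2 0
3 1
end
tempfile master
save `master'

clear
input id exposure
1 0.8
2 1.4
4 2.1
end
tempfile exposures
save `exposures'

clear
input id age
1 54
2 61
3 47
end
tempfile ages
save `ages'

* Simple merge on id: one record per id in each file.
* _merge: 1 = master only, 2 = using only, 3 = matched in both.
use `master', clear
merge 1:1 id using `exposures'
tabulate _merge

* Merging a further file: _merge must be dropped (or renamed) first
drop _merge
merge 1:1 id using `ages'
tabulate _merge
```

Tabulating _merge after every merge is a cheap way to catch ids that appear in only one file, such as id 3 (no exposure record) and id 4 (no case or age record) in this toy example.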

Identifying a moving target
Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
Problem: we need to identify the year, counting backwards from 2002, in which the chemical concentration changed from its 2002 level
Why? We need to overwrite the 2002 value with a new value, and overwrite backwards until the value changed

Identifying a moving target (2)

rescode    y1990   y1991   y1992
1010113    65      32      32
1010114    41      41      41
1010115    78      23      23
1010116    44      44      44
1010117    82      82      29
1010118    25      25      25
1010119    12      12
1010120    40      12

Identifying a moving target (3)
We will use the forval loop to examine the relationship between each year's observed value and the observed value for the previous year
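
One way such a loop might look, as a sketch rather than the code used in the talk. It uses the six complete rows of the example table, treats y1992 as the final year in place of 2002, and introduces a hypothetical variable changeyr to hold the result.

```stata
* The six complete rows from the example table
clear
input rescode y1990 y1991 y1992
1010113 65 32 32
1010114 41 41 41
1010115 78 23 23
1010116 44 44 44
1010117 82 82 29
1010118 25 25 25
end

* changeyr = the latest year whose value differs from the following year's,
* i.e. the point, counting backwards, at which the final-year level stops.
* It stays missing when the value never changes.
generate changeyr = .
forvalues y = 1991(-1)1990 {
    local next = `y' + 1
    * Fill changeyr only if no later change has been found already
    replace changeyr = `y' if missing(changeyr) & y`y' != y`next'
}
list rescode y1990-y1992 changeyr
```

For the full scenario the loop range would become 2001(-1)1982, assuming the variables are named y1982 to y2002. The years after changeyr are then the ones that would be overwritten together with the final-year value.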


Summary
Identifying duplicates can be done by hand or automatically using the duplicates set of commands
The merge command can be used to merge on a specific variable and to merge multiple datasets
Generating a moving target variable: the use of the forval loop
