
Merging in SAS

• These slides show alternatives for merging two data sets using the IN= data set option (see the SAS OnlineDoc: "Base SAS" > "SAS Language Reference: Dictionary" > "Data Set Options" > "IN=").
• In the original slides, observations that go into the merged data set are shown in red, and observations that are left out are greyed out.
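To see what the IN= option actually provides, the following minimal sketch builds two toy data sets that mirror the slides (A lacks IDs 1 and 5, B lacks IDs 3 and 7) and copies the IN= flags into ordinary variables. The data set names a_data, b_data and flags, and the variables in_a and in_b, are made up for illustration. IN= variables are temporary, so they must be copied if you want to inspect them.

```sas
/* Toy data mirroring the slides (hypothetical data set names). */
data a_data;
  input id v1 v2;
  datalines;
2 421 434
3 129 436
4 122 767
6 534 435
7 343 89
8 324 6787
;
run;

data b_data;
  input id v3 v4;
  datalines;
1 343 343
2 85 4234
4 763 234
5 229 324
6 554 324
8 895 342
;
run;

/* Both inputs are already in ID order, so no PROC SORT is needed here. */
data flags;
  merge a_data (in=a) b_data (in=b);
  by id;
  /* a and b are 1 when that data set contributed to the current */
  /* observation and 0 otherwise; copy them so they are kept.    */
  in_a = a;
  in_b = b;
run;

proc print data=flags noobs;
run;
```

With these inputs, IDs 1 and 5 should show in_a=0 and in_b=1, IDs 3 and 7 the reverse, and the remaining IDs 1 for both flags.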
The perfect merge
Dataset A           Dataset B
ID   V1    V2       ID   V3    V4
1    123   123      1    343   343
2    421   434      2    85    4234
3    129   436      3    325   434
4    122   767      4    763   234
5    232   34       5    229   324
6    534   435      6    554   324
7    343   89       7    884   34
8    324   6787     8    895   342
Not so perfect (if a or b;)
Dataset A (in=a)    Dataset B (in=b)    Kept
ID   V1    V2       ID   V3    V4
                    1    343   343      yes
2    421   434      2    85    4234     yes
3    129   436                          yes
4    122   767      4    763   234      yes
                    5    229   324      yes
6    534   435      6    554   324      yes
7    343   89                           yes
8    324   6787     8    895   342      yes
If a=b; (both datasets contribute)
Dataset A (in=a)    Dataset B (in=b)    Kept
ID   V1    V2       ID   V3    V4
                    1    343   343      no
2    421   434      2    85    4234     yes
3    129   436                          no
4    122   767      4    763   234      yes
                    5    229   324      no
6    534   435      6    554   324      yes
7    343   89                           no
8    324   6787     8    895   342      yes
If a; (must be in dataset A)
Dataset A (in=a)    Dataset B (in=b)    Kept
ID   V1    V2       ID   V3    V4
                    1    343   343      no
2    421   434      2    85    4234     yes
3    129   436      .    .     .        yes
4    122   767      4    763   234      yes
                    5    229   324      no
6    534   435      6    554   324      yes
7    343   89       .    .     .        yes
8    324   6787     8    895   342      yes
A dot (.) marks a value that is missing in the merged output.
If b; (must be in dataset B)
Dataset A (in=a)    Dataset B (in=b)    Kept
ID   V1    V2       ID   V3    V4
     .     .        1    343   343      yes
2    421   434      2    85    4234     yes
3    129   436                          no
4    122   767      4    763   234      yes
     .     .        5    229   324      yes
6    534   435      6    554   324      yes
7    343   89                           no
8    324   6787     8    895   342      yes
A dot (.) marks a value that is missing in the merged output.
Notes
• The examples assume there is a unique identifier. This can be a single variable (e.g., CRSP's PERMNO or Compustat's GVKEY) or a combination of variables (e.g., PERMNO and DATE for a panel dataset).
• Assumption: Both data sets are sorted by the
unique identifier(s).
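Both assumptions can be checked in one step before merging. The sketch below is one possible approach, not the only one: it sorts by the identifier and uses the NODUPKEY and DUPOUT= options of PROC SORT to set aside any observations that repeat a key. The output names yourdata_sorted and yourdata_dups are hypothetical.

```sas
/* Sort by the identifier and divert repeated keys to a separate data set. */
/* yourdata_sorted and yourdata_dups are placeholder names.                */
proc sort data=yourdata out=yourdata_sorted nodupkey dupout=yourdata_dups;
  by permno date;
run;

/* If the log reports 0 observations in yourdata_dups, the combination */
/* PERMNO + DATE uniquely identifies each observation.                 */
```

If yourdata_dups is not empty, the identifier is not unique and the merge would no longer be one-to-one, so the duplicates deserve a look before going further.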
Sample code
proc sort data=yourdata; by permno date; run;
proc sort data=otherdata; by permno date; run;

data newdata;
merge yourdata (in=a) otherdata (in=b);
by permno date;
/* note: the BY variables are in the same order */
/* as in the PROC SORT BY statements */
/* below this, write your control statement: */
/* keep exactly ONE of the following active */
if a; /* must be in yourdata */
/* if b; */
/* if a and b; */
/* if not a; */
/* if not b; */
run;
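Each control statement above keeps a single subset. When you want to see where every observation ended up, a possible variation (a sketch, with the hypothetical output names both, only_a and only_b) is to name several output data sets and route each observation with OUTPUT:

```sas
/* Assumes yourdata and otherdata are already sorted by permno date. */
data both only_a only_b;   /* hypothetical output data set names */
  merge yourdata (in=a) otherdata (in=b);
  by permno date;
  if a and b then output both;     /* matched in both data sets  */
  else if a then output only_a;    /* present only in yourdata   */
  else output only_b;              /* present only in otherdata  */
run;
```

The observation counts for the three data sets appear in the log, which gives a quick check on how well the two sources line up.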
Typical problems
• If both datasets were complete (that is, they both contain the same observed units), the IF statements would be unnecessary: "if a and b" would be equivalent to leaving the statement out altogether.
• If you do not have a BY statement (no identifier; you somehow know that each row of one dataset corresponds to the same row of the other dataset), the datasets are just "glued" side by side.
• Common mishaps: if the BY variables are defined differently across datasets (for example, with different lengths), SAS will merge the datasets but will put a WARNING in the log. Another common mishap is to have variables with the same name (other than the ID) in both datasets: one of them will be overwritten, typically by the value from the dataset listed last in the MERGE statement.
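One way to avoid the overwriting problem is to rename the clashing variable as it is read in. The sketch below assumes, purely for illustration, that both data sets contain a non-identifier variable called ret; the names ret and ret_other are hypothetical.

```sas
/* Assumes both inputs are sorted by permno date and that both contain */
/* a variable named ret (a hypothetical example of a name clash).      */
data newdata;
  merge yourdata  (in=a)
        otherdata (in=b rename=(ret=ret_other));
  by permno date;
  if a and b;
run;
```

After the merge, ret holds the value from yourdata and ret_other the value from otherdata, so neither is lost.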
References
Good references are:
• http://ftp.sas.com/techsup/download/technote/ts644.html
• A manual called "Combining and Modifying SAS Data Sets: Examples", which is in the RC library. It has a lot of examples. Unfortunately, it does not exist in an online version (only the code is available online, but the explanations in the manual are very good).
