You are on page 1of 23

Preventing data loss when combining SAS datasets.

John Cantrell Univ. North Carolina Chapel Hill, NC

SAS has powerful tools for dataset combination

Proc

SQL SAS data step

SAS data step

Datastep methods use SET, MERGE, and BY statements


MERGE SET statement statement no BY statement 1 to 1 merge followed with BY matched statement merge concatenation interleaving

The workhorses:
Matched

merge & concatenation. Data twothree; merge two three; run; Forgot BY statement, yields 1 to 1 merge instead of matched merge

LOG: errors, warnings & notes


Unintentional

1 to 1 merge caught with options mergenoby=error; Error messages in the LOG help to find mistakes Not all problems yield error messages: the note, repeats of BY values > many-to-many merge

LOG does not flag all mistakes


A

serious problem can occur without any message in the LOG For example, data loss by truncation due to unequal variable lengths

Example: Concatenation
Concatenation

of 2 datasets named two and three, each input dataset contains: numeric variable named id.

A A

character variable named var.

Example: input datasets:


dataset two id var
1 2 3 4 5 6 7 21 23 24 25 27 2g 2k

dataset three id var


0 1 2 3 4 8 9 3ks 31a 35a 34a 350 370 3gv

Example: variable var


two:var

and three:var are not the

same two:var has a length of $2, all values begin with the character 2 three:var has a length of $3, all values begin with the character 3

Example: the datastep


data

twothree; set two three; run; Note: the output dataset name indicates the order of input datasets in SET statement.

Dataset twothree
id 1 2 3 4 5 6 7 var 21 23 24 25 27 2g 2k id 0 1 2 3 4 8 9 var 3k 31 35 34 35 37 3g

Example: data loss


In

twothree:id=1 var =21 and var=31 In inputs: two:id=1 var=21 three:id=1 var=31a Value of var in twothree that came from three lacks the a, Why?

output:

Explanation:
Variable

length in output is determined by variable length in the first (leftmost) input dataset in the SET statement, two:var $2. 31a does not fit into a character variable of length $2, so the a is lost through truncation. LOG does not mention data loss.

Example: another datastep


What

would happen if we reverse the order of the datasets in the SET statement? data threetwo; set three two; run;

Dataset threetwo
id 0 1 2 3 4 8 9 var 3ks 31a 35a 34a 350 370 3gv id 1 2 3 4 5 6 7 var 21 23 24 25 27 2g 2k

Example: Matched merge


Matched

merge: data from last dataset overwrites data from first dataset. Usually re-name variables with same name to avoid overwriting (except BY variables). Maybe you want matched values to overwrite, so you do not rename.

Example: Matched merge


data

twothree; merge two three; by id; run;

Example: dataset twothree


id 0 var 3k id 5 var 27

1 2
3 4

31 35
34 35

6 7
8 9

2g 2k
37 3g

Length of variable:
Determined

by first, or left most, dataset in MERGE statement. Again, the LOG is silent on data loss through truncation. What to do?

Solutions:
Complain

to SAS Until SAS corrects, use macro before merge or concatenation to check for unequal character variable lengths If found, re-size variables

Macro: verifyVariables.sas
Invoke

before merge or concatenation. %verifyVariables(two,three,set); data twothree; set two three; run;

How does it work?


proc

contents on two and three, outputs results to datasets Merges datasets and uses datastep to find like named character variables It they have different lengths, puts ERROR statement into log.

Macro code and slides:


On website: http://www.unc.edu/~jcantrel/ Macro in paper. John Cantrell University of North Carolina Chapel Hill, NC 27599-7411 (919)843-6495 john_cantrell@unc.edu

You might also like