
Alternatives to Merging SAS Data Sets … But Be Careful

Michael J. Wieczkowski, IMS HEALTH, Plymouth Meeting, PA

Abstract

The MERGE statement in the SAS programming language is a very useful tool in combining or bridging information from multiple SAS data sets. If you work with large data sets the MERGE statement can become cumbersome because it requires all input data sets to be sorted prior to the execution of the data step. This paper looks at two simple alternative methods in SAS that can be used to bridge information together. It also highlights caveats to each approach to help you avoid unwanted results.

PROC SQL can be very powerful but if you are not familiar with SQL or if you do not understand the dynamics of SQL “joins” it can seem intimidating. The first part of this paper explains to the non-SQL expert the advantages of using PROC SQL in place of MERGE. It also highlights areas where PROC SQL novices can get into trouble and how they can recognize areas of concern.

The second section shows another technique for bridging information together with the help of PROC FORMAT. It uses the CNTLIN option in conjunction with the PUT function to replicate a merge. Caveats to this method are listed as well.

What’s wrong with MERGE?

Nothing is wrong with using MERGE to bridge information together from multiple files. It is usually one of the first options that come to mind. The general technique is to read in the data sets, sort them by the variable you would like to merge on, then execute the merge. Example 1A below shows a match merge between two data sets. One contains zip codes and states, the other contains zip codes and cities. The goal is to attach a city to each zip code and state.

Example 1A:

/* Read data set #1 - zip/state file */
DATA STATES;
LENGTH STATE $ 2 ZIP $ 5;
INPUT ZIP $ STATE $;
CARDS;
08000 NJ
10000 NY
19000 PA
RUN;

/* Sort by variable of choice */
PROC SORT DATA=STATES;
BY ZIP;
RUN;

/* Read data set #2 - zip/city file */
DATA CITYS;
LENGTH ZIP $ 5 CITY $ 15;
INPUT ZIP $ CITY $;
CARDS;
10000 NEWYORK
19000 PHILADELPHIA
90000 LOSANGELES
RUN;

/* Sort by variable of choice */
PROC SORT DATA=CITYS;
BY ZIP;
RUN;

/* Merge by variable of choice */
DATA CITY_ST;
MERGE STATES CITYS;
BY ZIP;
RUN;

PROC PRINT DATA=CITY_ST;
RUN;

The following output is produced:

OBS  STATE  ZIP    CITY
1    NJ     08000
2    NY     10000  NEWYORK
3    PA     19000  PHILADELPHIA
4           90000  LOSANGELES

Notice above that the merged data set shows four observations with observation 1 having no city value and observation 4
having no state value. This is because not all of the records from the state file had a zip code in the city file and not every zip code in the city file had a matching zip code in the state file. The way the merge was defined in the data step allowed for these non-matches to be included in the output data set.

To only keep matched observations between the two files, the merge step can be altered using the IN= data set option, where a variable is set to the value of 1 if the data set contains data for the current observation, i.e., a match in the BY variables. This is shown in the example below.

Example 1B:

DATA CITY_ST2;
MERGE STATES (IN=A) CITYS (IN=B);
BY ZIP;
IF A AND B;
RUN;

PROC PRINT DATA=CITY_ST2;
RUN;

The following output is produced:

OBS  STATE  ZIP    CITY
1    NY     10000  NEWYORK
2    PA     19000  PHILADELPHIA

The two observations above were the only two situations where each data set in the MERGE statement contained common BY variables, ZIP=10000 and ZIP=19000. Specifying IF A AND B is telling SAS to keep observations that have BY variables in both the STATES (a.k.a. A) and CITYS (a.k.a. B) files. Similarly you could include observations with ZIPs in one file but not in the other by saying either

IF A AND NOT B;

or

IF B AND NOT A;

Introduction to PROC SQL

A similar result is possible using PROC SQL. Before showing an example of how this is done, a brief description of SQL and how it works is necessary. SQL (Structured Query Language) is a very popular language used to create, update and retrieve data in relational databases. Although this is a standardized language, there are many different flavors of SQL. While most of the different versions are nearly identical, each contains its own “dialect” which may not work the same from one platform to another. These intricacies are not an issue when using PROC SQL in SAS. For the purpose of this paper, using PROC SQL does not require being an expert in SQL or relational theory. The more you know about these topics the more powerful things you can do with PROC SQL. In my examples it is only required that you learn a few basic things about the SQL language and database concepts.

In SQL/database jargon we think of columns and tables where in SAS we refer to them as variables and data sets. Extracting data from a SAS data set is analogous, in SQL talk, to querying a table, and instead of merging in SAS we perform “joins” in SQL.

The key to replicating a MERGE in PROC SQL lies in the SELECT statement. The SELECT statement defines the actual query. It tells which columns (variables) you wish to select and from which tables (data sets) you want them selected under certain satisfied conditions. The basic PROC SQL syntax is as follows:

PROC SQL NOPRINT;
CREATE TABLE newtable AS
SELECT col1, col2, col3
FROM table1, table2
WHERE some condition is satisfied;

The first line executes PROC SQL. The default for PROC SQL is to automatically print the entire query result. Use the NOPRINT option to suppress the printing. If you need the data from the query result in a data set for future processing you should use the CREATE TABLE option as shown on line 2 and specify the name of the new data set. Otherwise, the CREATE TABLE statement is not necessary as long as you want the output to print automatically.

Lines 2 through 5 are all considered part of the SELECT statement. Notice that the semicolon only appears at the very end. In the SELECT statement we define the variables we wish to extract separated by commas. If all variables are desired simply put an asterisk (*) after SELECT.
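As a concrete instance of the syntax skeleton above, the following minimal sketch queries a single data set. It assumes the STATES data set from Example 1A already exists; the output name PA_ZIPS and the WHERE condition are illustrative choices and do not come from the original examples.

```sas
/* Sketch only: assumes WORK.STATES from Example 1A exists.   */
/* PA_ZIPS is a hypothetical name chosen for illustration.    */
PROC SQL NOPRINT;
  CREATE TABLE PA_ZIPS AS
  SELECT *                 /* asterisk = keep every variable  */
  FROM STATES
  WHERE STATE = 'PA';      /* keep only rows meeting the condition */
QUIT;
```

QUIT ends the procedure. Without it, PROC SQL stays active until the next DATA or PROC step begins.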

Be careful if the data sets you are joining contain the same variable name. This will be explained more in the example to follow. The FROM clause tells which data sets to use in your query. The WHERE clause contains the conditions that determine which records are kept. The WHERE clause is not necessary for PROC SQL to work in general but as we are using PROC SQL to replicate a merge it will always be necessary. Also note that the RUN statement is not needed for PROC SQL.

The results from the merge in example 1B can be replicated with PROC SQL by eliminating the two PROC SORT steps, the MERGE data set step and the PROC PRINT step, and replacing them with the following code:

Example 2A:

PROC SQL;
SELECT A.ZIP, A.STATE, B.CITY
FROM STATES AS A, CITYS AS B
WHERE A.ZIP=B.ZIP;

This results in:

ZIP    STATE  CITY
10000  NY     NEWYORK
19000  PA     PHILADELPHIA

Notice that no CREATE TABLE line was used since all we did in our example was print the data set. If that’s all that is needed then there is no use in creating a new data set just to print it if PROC SQL can do it. If desired, titles can be defined on the output the same as any other SAS procedure.

Also notice in the SELECT statement that each variable is preceded by A or B and a period. All variables in the SELECT statement can be expressed as a two-level name separated by a period. The first level, which is optional if the variable name selected is unique between tables, refers to the table or data set name. The second is the variable name itself. In the example, instead of using the table name in the first level I used a table alias. The table aliases get defined in the FROM clause in line 3. So the FROM clause not only defines which data sets to use in the query but in example 2A we also use it to define an alias for each data set. We are calling the STATES data set “A” and the CITYS data set “B” (the AS keyword is optional). Aliases are useful as a shorthand way of selecting variables in your query when several data sets contain the same variable name.

In our example both data sets contain the variable ZIP. If we simply asked for the variable ZIP without specifying which data set to use, PROC SQL would report an ambiguous reference because it cannot tell which ZIP is meant. If we selected every variable with an asterisk, the printout would show both ZIP variables with no indication of which data set either one came from, and a CREATE TABLE would keep only the first ZIP and write a warning to the log, because SAS will not allow duplicate variable names on the same data set.

We could also code the SELECT statement several different ways, with and without two-level names and aliases, as shown below.

• Two-level without aliases:
SELECT STATES.ZIP, STATES.STATE, CITYS.CITY

• Two-level where necessary:
SELECT STATES.ZIP, STATE, CITY

• Two-level with alias where necessary:
SELECT A.ZIP, STATE, CITY

All accomplish the same result. It’s a matter of preference for the programmer.

Joins versus Merges

In the last example (2A) we used a type of join called an inner join. Inner joins retrieve all matching rows from the WHERE clause. You may specify up to 16 data sets in your FROM clause in an inner join. Graphically (using only two data sets) the inner join looks like this:

[Figure: Venn diagram of data sets A and B illustrating the inner join]

There are also joins called outer joins which can only be performed on two tables at a time. They retrieve all matching rows plus any non-matching rows from one or both of the tables depending on how the join is defined. There are three common types of
outer joins: left joins, right joins and full joins. Each type of join is illustrated below:

[Figure: Venn diagrams of data sets A and B illustrating the left join, the right join and the full join]

How Joins Work

In theory, when a join is executed, SAS builds an internal list of all combinations of all rows in the query data sets. This is known as a Cartesian product. So for the STATES and CITYS data sets in our examples, a Cartesian product of these two data sets would contain 9 observations (3 in STATES x 3 in CITYS) and would look conceptually like the table below.

STATE  A.ZIP  B.ZIP  CITY
NJ     08000  10000  NEWYORK
NJ     08000  19000  PHILADELPHIA
NJ     08000  90000  LOSANGELES
NY     10000  10000  NEWYORK       *
NY     10000  19000  PHILADELPHIA
NY     10000  90000  LOSANGELES
PA     19000  10000  NEWYORK
PA     19000  19000  PHILADELPHIA  *
PA     19000  90000  LOSANGELES

The rows marked with an asterisk would be the ones actually selected in our inner join example because the WHERE clause specified that the zips be equal. Conceptually, SAS would evaluate this pseudo-table row by row, selecting only the rows that satisfy the join condition. This seems to be a highly inefficient way of evaluation but in reality SAS uses some internal optimizations to make this process more efficient.

Outer joins use a slightly different syntax. An ON clause replaces the WHERE clause but both serve the same purpose in a query. On the FROM clause, the words FULL JOIN replace a comma to separate the tables. Similarly for left and right joins, the word FULL would be replaced by LEFT and RIGHT respectively.

Comparing SQL joins with merging, the following IF statements used in the merge example would be equivalent to these types of outer joins:

IF A;   Left Join
IF B;   Right Join

In the merge step from example 1A, the result was equivalent to performing a full join in PROC SQL, where we ended up with not only matching records but also all non-matching records from each data set. The PROC SQL code would look as follows:

PROC SQL;
SELECT A.ZIP, A.STATE, B.CITY
FROM STATES A FULL JOIN CITYS B
ON A.ZIP=B.ZIP;

Note that A.ZIP is missing on rows that exist only in CITYS (ZIP 90000 here). Selecting COALESCE(A.ZIP, B.ZIP) instead returns the ZIP from whichever data set has it, as the merge does.

Advantages to Using PROC SQL instead of MERGE.

There are several worthwhile advantages to using PROC SQL instead of data step merging. Using PROC SQL requires fewer lines of code. In replacing example 1B with example 2A we eliminated two PROC SORTs (6 lines of code), an entire data step (5 lines) and a PROC PRINT (2 lines). We replaced them with 4 lines of PROC SQL code for a net savings of 9 lines of code. It doesn’t sound like much but if you normally use MERGE a lot it can add up to something significant. Also, the fewer lines of code, the better the chance of error-free code.

You also don’t need to have a common variable name between the data sets you are joining. This can come in handy if you are dealing with SAS data sets created by someone who is not aware of how you are using them.
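The point about differently named key variables can be sketched as follows. The ZIPREGION data set and its ZIPCODE and REGION variables are hypothetical, invented only for this illustration; STATES is assumed to be the data set from Example 1A.

```sas
/* Hypothetical lookup data set: its key is ZIPCODE, not ZIP */
DATA ZIPREGION;
  LENGTH ZIPCODE $ 5 REGION $ 10;
  INPUT ZIPCODE $ REGION $;
  CARDS;
08000 REGION1
10000 REGION2
;
RUN;

/* No sorting and no renaming: the ON clause names each key  */
PROC SQL;
  SELECT A.ZIP, A.STATE, B.REGION
  FROM STATES A LEFT JOIN ZIPREGION B
  ON A.ZIP = B.ZIPCODE;
QUIT;
```

Because this is a left join, every row of STATES is kept, so the 19000 row would appear with a blank REGION.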

If you were merging these data sets you may need to rename one of the variables on one of the data sets. The WHERE/ON clause in PROC SQL lets you specify each different variable name from each data set in the join.

Another benefit is that you do not need to sort each input data set prior to executing your join. Since sorting can be a rather resource intensive operation when involving very large data sets, replacing a merge with a PROC SQL join can potentially save you processing time. Note that factors such as data set indexes and the size of the data sets can also have effects on performance. Analysis of these different factors was not performed because it is not within the scope of this paper. Any references to potential efficiency improvements are qualitative, not quantitative.

PROC SQL also forces you to be more diligent with variable maintenance. If you MERGE two or more data sets and these data sets have common variable names (not including the BY variable), the variable on the output data set will contain the values from the right most data set listed on the MERGE statement. This may not be what you wanted and is a common oversight when using MERGE. There is a tendency in data step merging to carry all variables and to not specify only the necessary variables. In a PROC SQL join, if you are creating a new table from two or more data sets and those data sets have common variable names, SAS will write a warning to the log and keep only the first of the like-named variables, because it will not allow more than one variable with the same name on the output data set. This prompts the programmer to either include only the necessary variables in the SELECT statement or rename them in the SELECT statement using the following syntax:

SELECT var1 as newvar1, var2 as newvar2

You can rename the variables as they are input to a MERGE but if you don’t do this and variable overlaying occurs, there are no error or warning messages.

Another advantage is that once you become accustomed to using PROC SQL as a novice it will give you the confidence to explore many more of its uses. You can use PROC SQL to perform summaries, sorts and to query external databases. It’s a very powerful procedure.

Things to be careful with in PROC SQL

If you are used to using MERGE and never had any SQL experience there are a few things you must be aware of so that you fully understand what happens in SQL joins. The entire Cartesian product concept is usually a new one for most veterans of the merge. Match merging in SAS does not work in the same manner.

Match merging is best used in situations where you are matching one observation to many, many observations to one, or one observation to one. If your input data sets contain repeated BY variables, i.e., many to many, the MERGE can produce unexpected results. SAS prints a note in the SAS log if that situation exists in your data. This is a helpful message because we sometimes think our data is clean but duplication of BY variables can exist in each data set without our knowledge.

PROC SQL, on the other hand, will always build a virtual Cartesian product between the data sets in the join. Analogously, situations of duplicate where-clause variables will not be identified in any way. It’s important to understand the data.

Introduction to PROC FORMAT

We can also use PROC FORMAT along with some help from the PUT function to replicate a merge, but first a review of PROC FORMAT.

PROC FORMAT is often used to transform output values of variables into a different form. We use the VALUE statement to create user-defined formats. For example, let’s say we have survey response data consisting of three questions and all of the answers to our questions are coded as 1 for “yes”, 2 for “no” and 3 for “maybe”. Running a frequency on these values would show how many people answered 1, 2 or 3 but it
would be much nicer if we could actually see what these values meant. We could code a PROC FORMAT step like the following:

PROC FORMAT;
VALUE ANSWER
1='YES'
2='NO'
3='MAYBE';
RUN;

This creates a format called ANSWER. You could now run a frequency on the answers to Q1, Q2 and Q3 as follows:

PROC FREQ DATA=ANSWERS;
FORMAT Q1 Q2 Q3 ANSWER.;
TABLES Q1 Q2 Q3;
RUN;

We used the FORMAT statement here to apply the conversion in the PROC step. The output would show frequencies of “YES”, “NO” and “MAYBE” as opposed to their corresponding values of 1, 2 and 3.

Using PROC FORMAT in this manner is fine if you are using a static definition for your conversion. Imagine now that you have data from another survey but this time the answer codes are totally different where 1 is “Excellent”, 2 is “Good”, 3 is “Fair” and 4 is “Poor”. You would have to re-code your PROC FORMAT statement to reflect the new conversion. Although it is not a lot of work to do here, it can become cumbersome if you have to do it every day. There is a convenient way to lessen the maintenance.

Using CNTLIN= option

There is an option in PROC FORMAT called CNTLIN= that allows you to create a format from an input control data set rather than a VALUE statement. The data set must contain three specially named variables that are recognized by PROC FORMAT: START, LABEL and FMTNAME. The variable named START would be the actual data value you are converting from, e.g., 1, 2 or 3. The LABEL variable would be the value you are converting to, e.g., YES, NO, MAYBE. The FMTNAME variable is a character string variable that contains the name of the format, e.g., ‘ANSWER’. You would simply read in the data set and appropriately assign the variable names. Then run PROC FORMAT with the CNTLIN option. See the following example:

DATA ANSFMT;
INFILE external filename;
INPUT @1 START 1.
      @3 LABEL $5.;
FMTNAME='ANSWER';
RUN;

/* The external input data set would look like:
1 YES
2 NO
3 MAYBE */

PROC FORMAT CNTLIN=ANSFMT;
RUN;

Both methods create the same format called ANSWER. In this example we would not have to go into the SAS code every time we had a new definition of answer codes. It allows the programmer more flexibility in automating their program.

Replicating a MERGE with PROC FORMAT using the PUT function.

Sometimes it may become necessary to create an entirely new variable based on the values in your format. It may not be enough to simply mask the real answer value with a temporary value such as showing the word “YES” whenever the number 1 appears. You may need to create a new variable whose permanent value is actually “YES”. This can be done using the PUT function along with a FORMAT.

Showing how the PUT function is used, we’ll stay with our survey example. Our data sets look like this:

Survey Data: ANSWERS

ID   Q1
111  1
222  2
333  3

Format Data: ANSFMT

START  LABEL  FMTNAME
1      YES    ANSWER
2      NO     ANSWER
3      MAYBE  ANSWER
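For readers who want to reproduce the survey example, the two data sets shown above could be built with in-stream data as in the sketch below. Reading the values with CARDS rather than from an external file is an assumption made here for convenience; the values themselves are the ones listed above.

```sas
/* Survey data: one answer code per respondent */
DATA ANSWERS;
  INPUT ID Q1;
  CARDS;
111 1
222 2
333 3
;
RUN;

/* Control data set with the three variables PROC FORMAT expects */
DATA ANSFMT;
  LENGTH LABEL $ 5;
  INPUT START LABEL $;
  FMTNAME = 'ANSWER';
  CARDS;
1 YES
2 NO
3 MAYBE
;
RUN;

/* Build the ANSWER format from the control data set */
PROC FORMAT CNTLIN=ANSFMT;
RUN;
```

START is read as a numeric variable here, so the resulting ANSWER format is numeric and can be applied to the numeric Q1 variable.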

We will create a new variable using the PUT function that will be the value of the transformed Q1 variable. The format of the PUT function is: PUT(variable name, format). An example of the program code would look like:

DATA ANSWERS;
SET ANSWERS;
Q1_A = PUT(Q1,ANSWER.);
RUN;

The output data set would look like this:

ID   Q1  Q1_A
111  1   YES
222  2   NO
333  3   MAYBE

We have in effect “merged” on the Q1_A variable. An advantage to using this method in place of the merge would be in a situation where we had several variables (Q1, Q2, etc.) that all required the same transformation. You could just apply another PUT statement for each variable rather than perform multiple merges and renaming of merge variables.

Limitations and things to be careful with using the PUT function.

You must be careful that all expected values of your “before” variables have a corresponding value in the format list. If a value in the data exists that has no matching value in the format data set then the new variable will retain the value of the “before” variable as a character string. There are other options in PROC FORMAT such as OTHER and _ERROR_ that can be used to highlight these types of situations. These options will not be discussed here.

This method is best used when you would only need to merge on one variable from another data set. If we had a data set containing eight different attributes that we want added to a master file, it would require creating eight CNTLIN data sets and eight PROC FORMATs before executing eight PUT functions. A MERGE or PROC SQL would be better suited in such a case.

Conclusions

As you can see, there are other alternatives in SAS to using the traditional MERGE when combining data set information. It is not too difficult to learn a few beginner lines of PROC SQL code to execute what most merges do. The PROC FORMAT / CNTLIN / PUT method is also a quick way to get limited information from other data sets. Hopefully you will try out these new methods and expand your SAS knowledge in the process.

References

SAS Language Reference, Version 6, First Edition.
SAS Procedures Guide, Version 6, Third Edition.
SAS Guide to the SQL Procedure, Version 6, First Edition.
The Practical SQL Handbook, Third Edition, Bowman, Emerson, Darnovsky.

Acknowledgements

Thanks to Perry Watts, Sr. Statistical Programmer, IMS Health and John Gerlach, SAS consultant at IMS Health, for providing various ideas, suggestions and technical assistance.

SAS is a registered trademark of the SAS Institute, Inc., Cary, NC USA.

Author Information

Michael Wieczkowski
IMS HEALTH
660 West Germantown Pike
Plymouth Meeting, PA 19462
Email: Mwieczkowski@us.imshealth.com