
Beginning Tutorials Paper 60

Changing the Shape of Your Data: PROC TRANSPOSE vs. Arrays
Bob Virgile, Robert Virgile Associates, Inc.

Overview

To transpose your data (turning variables into observations or turning observations into variables), you can use either PROC TRANSPOSE or array processing within a DATA step. This tutorial examines basic variations on both methods, comparing features and advantages of each.
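The programs in this paper all read a data set named OLD. As a minimal sketch for readers who want to run them, the step below builds a version of OLD with the layout pictured under A Simple Transposition. The calendar dates are invented stand-ins for the placeholders Date #A1 through Date #B3; they are an assumption for illustration, not values from the paper.

```sas
/* Hypothetical test data: the dates are made up for illustration. */
DATA OLD;
   INPUT NAME $ DATE :DATE9.;
   FORMAT DATE DATE9.;
   DATALINES;
Amy 05JAN1999
Amy 12FEB1999
Amy 20MAR1999
Bob 03JAN1999
Bob 17FEB1999
Bob 28MAR1999
;
RUN;

/* The paper's programs expect the data sorted BY NAME DATE. */
PROC SORT DATA=OLD;
   BY NAME DATE;
RUN;

/* List the data to confirm the "before" layout. */
PROC PRINT DATA=OLD;
RUN;
```

Deleting one of Amy's lines gives a quick way to experiment with NAMEs that have differing numbers of observations.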

A Simple Transposition

Let’s begin the analysis by examining a simple situation, where the program should transpose observations into variables. In the original data, each person has 3 observations. In the final version, each person should have just one observation. In the "before" picture, the data are already sorted BY NAME DATE:

   NAME   DATE
   Amy    Date #A1
   Amy    Date #A2
   Amy    Date #A3
   Bob    Date #B1
   Bob    Date #B2
   Bob    Date #B3

In the "after" picture, the data will still be sorted by NAME:

   NAME   DATE1      DATE2      DATE3
   Amy    Date #A1   Date #A2   Date #A3
   Bob    Date #B1   Date #B2   Date #B3

The PROC TRANSPOSE program is short and sweet:

   PROC TRANSPOSE DATA=OLD OUT=NEW PREFIX=DATE;
      VAR DATE;
      BY NAME;

The PREFIX= option controls the names for the transposed variables (DATE1, DATE2, etc.). Without it, the names of the new variables would be COL1, COL2, etc. Actually, PROC TRANSPOSE creates an extra variable, _NAME_, indicating the name of the transposed variable. _NAME_ has a value of DATE on both observations. To eliminate the extra variable, modify a portion of the PROC statement:

   OUT=NEW (DROP=_NAME_)

The equivalent DATA step code using arrays could be:

   DATA NEW (KEEP=NAME DATE1-DATE3);
   SET OLD;
   BY NAME;
   ARRAY DATES {3} DATE1-DATE3;
   RETAIN DATE1-DATE3;
   IF FIRST.NAME THEN I=1;
   ELSE I + 1;
   DATES{I} = DATE;
   IF LAST.NAME;

This program assumes that each NAME has exactly three observations. If a NAME had more, the program would generate an error message when hitting the fourth observation for that NAME. When I=4, this statement encounters an array subscript out of range:

   DATES{I} = DATE;

On the other hand, if a NAME had only two incoming observations, the transposed observation for that NAME would also include the retained value of DATE3 from the previous NAME. The following section, How Many Variables Are Needed?, will examine this issue in more detail.

Certainly the DATA step code is longer and more complex than PROC TRANSPOSE. A slightly shorter version would be:

   DATA NEW (KEEP=NAME DATE1-DATE3);
   ARRAY DATES {3} DATE1-DATE3;
   DO I=1 TO 3;
      SET OLD;
      DATES{I} = DATE;
   END;
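A quick way to check that PROC TRANSPOSE and an array-based DATA step produce the same values is to run both on the same input and compare the results. The following is only a sketch under stated assumptions: the data set names NEW_PROC and NEW_STEP and the sample dates are hypothetical, and every NAME is assumed to have exactly three observations.

```sas
/* Hypothetical test data, already in NAME DATE order. */
DATA OLD;
   INPUT NAME $ DATE :DATE9.;
   DATALINES;
Amy 05JAN1999
Amy 12FEB1999
Amy 20MAR1999
Bob 03JAN1999
Bob 17FEB1999
Bob 28MAR1999
;
RUN;

/* Method 1: PROC TRANSPOSE, dropping the extra _NAME_ variable. */
PROC TRANSPOSE DATA=OLD OUT=NEW_PROC (DROP=_NAME_) PREFIX=DATE;
   VAR DATE;
   BY NAME;
RUN;

/* Method 2: array processing, reading three observations per pass. */
DATA NEW_STEP (KEEP=NAME DATE1-DATE3);
   ARRAY DATES {3} DATE1-DATE3;
   DO I=1 TO 3;
      SET OLD;
      DATES{I} = DATE;
   END;
RUN;

/* Compare the two results variable by variable. */
PROC COMPARE BASE=NEW_PROC COMPARE=NEW_STEP;
RUN;
```

PROC COMPARE matches variables by name, so the different variable order in the two data sets does not matter. It may still report differences in attributes, such as formats, even when every value agrees.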

While unusual, it is perfectly legal to place the SET statement inside a DO loop. In that way, the program eliminates the need to retain variables, since the program never reinitializes DATE1 through DATE3 to missing until it returns to the DATA statement. For each NAME, the DO loop reads all incoming observations, assigning values to DATE1 through DATE3. Then the DATA step outputs the transposed observation, returns to the DATA statement, and resets DATE1 through DATE3 to missing.

The advantages of each method at this point:

• PROC TRANSPOSE uses a simpler program. The DATA step’s complexity varies depending on which programming style it uses.
• The DATA step has more flexibility. For example, it can perform calculations at the same time it transposes the data. If the incoming data set contains additional variables, the DATA step can calculate the lowest, highest, or mean value of those variables for each NAME, and include them in the transposed data set.

How Many Variables Are Needed?

So far, the DATA step has relied on two major assumptions:

• Each NAME has the same number of observations, and
• We know what that number is.

Often, neither assumption is true. For example, consider a slight change to the incoming data:

   NAME   DATE
   Amy    Date #A1
   Amy    Date #A2
   Bob    Date #B1
   Bob    Date #B2
   Bob    Date #B3

Since Amy now has just two observations, the "after" picture will contain a missing value for DATE3:

   NAME   DATE1      DATE2      DATE3
   Amy    Date #A1   Date #A2   .
   Bob    Date #B1   Date #B2   Date #B3

In that case, PROC TRANSPOSE has a major advantage: the program does not have to change! Automatically, PROC TRANSPOSE creates as many variables as are needed. The DATA step, on the other hand, requires significant changes. First, an initial step must calculate and capture as a macro variable the largest number of observations for any NAME. Here is one approach:

   PROC FREQ DATA=OLD ORDER=FREQ;
      TABLES NAME / NOPRINT OUT=TEMP;
   DATA _NULL_;
   SET TEMP;
   CALL SYMPUT('N', COMPRESS(PUT(COUNT,3.)));
   STOP;

The output data set TEMP contains one observation per NAME, with the variable COUNT holding the number of observations for that NAME. Because of the ORDER=FREQ option on the PROC statement, the first observation contains the largest value of COUNT. So the DATA step can read that observation, capture COUNT as a macro variable, and then stop executing.

Most of the time, the programmer will know this largest number, so the program won’t need to calculate it. In some cases, either the programmer doesn’t know the largest number of observations per NAME, or "knowing" is unreliable and the program should double-check. In that case, the program will need the extra steps.

Offsetting this disadvantage of the DATA step is the fact that PROC TRANSPOSE doesn’t reveal how many variables it created (except for notes in the SAS log about the number of variables in the output data set). The programmer may need to run a PROC CONTENTS later to discover the structure of the output data set.

Once the program has calculated the largest number of observations per NAME, the subsequent DATA step must still utilize the information. By continuing to place the SET statement inside a DO loop, the DATA step remains short:

   DATA NEW (KEEP=NAME DATE1-DATE&N);
   ARRAY DATES {&N} DATE1-DATE&N;
   DO I=1 TO &N UNTIL (LAST.NAME);
      SET OLD;
      BY NAME;
      DATES{I} = DATE;
   END;

Because of the DO UNTIL condition, the loop ends after two iterations when AMY has only two incoming observations.

Instead of this DATA step, the program could try to stick with the original, longer DATA step. In that case, allowing for different numbers of observations per NAME leads to more involved programming:

   DATA NEW (KEEP=NAME DATE1-DATE&N);
   SET OLD;
   BY NAME;
   ARRAY DATES {&N} DATE1-DATE&N;
   RETAIN DATE1-DATE&N;
   IF FIRST.NAME THEN I=1;
   ELSE I + 1;
   DATES{I} = DATE;
   IF LAST.NAME;
   IF I < &N THEN DO I=I+1 TO &N;
      DATES{I}=.;
   END;

The DO group in the last three lines handles NAMEs which don’t need all &N variables. It resets to missing retained date variables which were needed by the previous NAME.

What’s In a Name?

Sometimes, the number of incoming observations varies dramatically from one NAME to the next. For example, one NAME might contain 100 observations, while the other NAMEs contain just three. In that case, transposing the data would create 100 variables (DATE1 through DATE100), most of which are missing on most observations. (In practice, compressing the transposed data will help save storage space. When Version 7 of the software becomes available, it will supply better compression algorithms for compressing numeric data. In these types of cases, the vast majority of transpositions involve numeric variables.)

PROC TRANSPOSE cannot change its methods: the result will always be 100 new variables. A DATA step, however, can produce many different forms to the transposed data. One possibility is to create just 20 new variables (DATE1 through DATE20), but create multiple observations per NAME as needed. In the transposed data, then, NAMEs with up to 20 incoming observations would need just 1 transposed observation. NAMEs with 21 to 40 incoming observations would need 2 transposed observations, etc. The variables in the transposed data set would be:

   NAME   DATE1 - DATE20   OBSNO

OBSNO numbers the observations (1, 2, 3, 4, 5) for the current value of NAME. To create this new form to the transposed data, the DATA step requires very little change:

   DATA NEW (KEEP=NAME DATE1-DATE20 OBSNO);
   ARRAY DATES {20} DATE1-DATE20;
   OBSNO + 1;
   DO I=1 TO 20 UNTIL (LAST.NAME);
      SET OLD;
      BY NAME;
      IF FIRST.NAME THEN OBSNO=1;
      DATES{I} = DATE;
   END;

This DATA step has two ways it can exit the DO loop and output an observation. Either it fills in values for all 20 array elements, or it encounters the last observation for a NAME.

Transposing Two Variables

As the situations become more complex, a necessary skill for the programmer is the ability to visualize the data before and after transposing. Consider the case where the program transposes multiple variables. Here is the incoming data set:

   NAME   DATE       RESULT
   Amy    Date #A1   Res #A1
   Amy    Date #A2   Res #A2
   Amy    Date #A3   Res #A3
   Bob    Date #B1   Res #B1
   Bob    Date #B2   Res #B2
   Bob    Date #B3   Res #B3

PROC TRANSPOSE requires little change. This would be the program:

   PROC TRANSPOSE DATA=OLD OUT=NEW;
      VAR DATE RESULT;
      BY NAME;
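To see what PROC TRANSPOSE does with two variables on the VAR statement, the step can be run against a small test data set. This is a sketch only: the dates and RESULT values below are hypothetical stand-ins for the placeholders in the incoming data set pictured above.

```sas
/* Hypothetical test data with two variables to transpose. */
DATA OLD;
   INPUT NAME $ DATE :DATE9. RESULT;
   DATALINES;
Amy 05JAN1999 10
Amy 12FEB1999 20
Amy 20MAR1999 30
Bob 03JAN1999 40
Bob 17FEB1999 50
Bob 28MAR1999 60
;
RUN;

/* Transpose both variables in a single step. */
PROC TRANSPOSE DATA=OLD OUT=NEW;
   VAR DATE RESULT;
   BY NAME;
RUN;

/* _NAME_ identifies which incoming variable each row came from. */
PROC PRINT DATA=NEW;
RUN;
```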

The output data set would contain two observations per NAME, one for each transposed variable:

   NAME   _NAME_   COL1       COL2       COL3
   Amy    DATE     Date #A1   Date #A2   Date #A3
   Amy    RESULT   Res #A1    Res #A2    Res #A3
   Bob    DATE     Date #B1   Date #B2   Date #B3
   Bob    RESULT   Res #B1    Res #B2    Res #B3

For practical purposes, this structure is a drawback. Most analyses require one observation per NAME, with all seven variables:

   NAME   DATE1-DATE3   RESULT1-RESULT3

The DATA step can easily create the more desirable structure:

   DATA ALL7VARS (DROP=I DATE RESULT);
   ARRAY DATES {3} DATE1-DATE3;
   ARRAY RESULTS {3} RESULT1-RESULT3;
   DO I=1 TO 3 UNTIL (LAST.NAME);
      SET OLD;
      BY NAME;
      DATES{I} = DATE;
      RESULTS{I} = RESULT;
   END;

While PROC TRANSPOSE can also create the more desirable form, it cannot do so in one step. The possibilities include:

• Run PROC TRANSPOSE just once (as above), and use a more complex DATA step to recombine the results, or
• Run PROC TRANSPOSE twice, transposing a different variable each time. Afterwards, MERGE the results.

The first method might continue (after PROC TRANSPOSE) with:

   DATA ALL7VARS (DROP=_NAME_);
   MERGE NEW (WHERE=(_NAME_='DATE')
              RENAME=(COL1=DATE1 COL2=DATE2 COL3=DATE3))
         NEW (WHERE=(_NAME_='RESULT')
              RENAME=(COL1=RESULT1 COL2=RESULT2 COL3=RESULT3));
   BY NAME;

There is no shortcut for renaming variables individually. At the time of writing, the software does not support renaming a list of variables in this fashion:

   RENAME=(COL1-COL3=DATE1-DATE3)

However, the original PROC TRANSPOSE could have reduced the renaming load by adding PREFIX=DATE. If the number of variables being renamed is sufficiently large, it would pay to write a macro to generate sets of renames, based on a prefix coming in (such as COL), a prefix on the way out (such as DATE), and a numeric range (such as from 1 to 3).

The alternate approach using PROC TRANSPOSE would transpose the data twice, and merge the results. Here is an example:

   PROC TRANSPOSE DATA=OLD OUT=TEMP1 PREFIX=DATE;
      VAR DATE;
      BY NAME;
   PROC TRANSPOSE DATA=OLD OUT=TEMP2 PREFIX=RESULT;
      VAR RESULT;
      BY NAME;
   DATA ALL7VARS (DROP=_NAME_);
   MERGE TEMP1 TEMP2;
   BY NAME;

This program may be simpler; that judgment is in the eye of the beholder. However, processing large data sets three times instead of once can take a while.

Transposing Back Again

Occasionally a program must transpose data in the other direction, creating multiple observations from each existing observation. Here is the "before" picture:

   NAME   DATE1      DATE2      DATE3
   Amy    Date #A1   Date #A2   .
   Bob    Date #B1   .          Date #B3

The "after" picture should look like this:

   NAME   DATE
   Amy    Date #A1
   Amy    Date #A2
   Bob    Date #B1
   Bob    .
   Bob    Date #B3

Notice how the program must handle missing values. Missing values at the end of the stream of dates (as in Amy’s data) should be ignored. However, missing values with a valid DATE following (as in Bob’s data) should be included in the transposed data. Under these conditions, PROC TRANSPOSE cannot do the job! It can come close:

   PROC TRANSPOSE DATA=BEFORE
        OUT=AFTER (DROP=_NAME_ RENAME=(COL1=DATE));
      VAR DATE1-DATE3;
      BY NAME;

However, this program creates three observations for Amy, not two. Without the DROP= and RENAME= data set options, the output data set would have contained:

   NAME   _NAME_   COL1
   Amy    DATE1    Date #A1
   Amy    DATE2    Date #A2
   Amy    DATE3    .
   Bob    DATE1    Date #B1
   Bob    DATE2    .
   Bob    DATE3    Date #B3

A subsequent DATA step could process this output, eliminating the third observation for Amy. However, if the program is going to introduce a DATA step, it can use the same DATA step to transpose the data. For the simplified case where no missing values belong in the output data set, this program would suffice:

   DATA AFTER (KEEP=NAME DATE);
   SET BEFORE;
   ARRAY DATES {3} DATE1-DATE3;
   DO I=1 TO 3;
      IF DATES{I} > . THEN DO;
         DATE = DATES{I};
         OUTPUT;
      END;
   END;

In this situation, however, Bob requires three observations, not two. The embedded missing value (DATE2, falling between nonmissing values in DATE1 and DATE3) must also appear in the output data set. In that case, a slightly more complex DATA step would do the trick:

   DATA AFTER (KEEP=NAME DATE);
   SET BEFORE;
   ARRAY DATES {3} DATE1-DATE3;
   NONMISS = N(OF DATE1-DATE3);
   NFOUND=0;
   IF NONMISS > 0 THEN DO I=1 TO 3;
      DATE = DATES{I};
      OUTPUT;
      IF DATES{I} > . THEN NFOUND + 1;
      IF NFOUND=NONMISS THEN LEAVE;
   END;

The N function counts how many of the variables (DATE1 through DATE3) have nonmissing values. As the DO loop outputs observations, it counts the nonmissing values that it outputs. Once the DO loop has output all the nonmissing values (as well as any missing values found before that point), the LEAVE statement exits the loop.

Transpose Abuse

Some programs transpose data because the programmer finds it easier to work with variables than with observations. In many cases, this constitutes an abuse of PROC TRANSPOSE, and indicates that the programmer should become more familiar with processing groups of observations in a DATA step.

Let’s take an example of a program that transposes unnecessarily. Using our NAME and DATE data set, the program should calculate for each NAME the average number of days between DATE values. In every case, the program has prepared the data first using:

   PROC SORT DATA=OLD;
      BY NAME DATE;

With the observations in the proper order, a DATA step could easily compute the differences between DATEs:

   DATA DIFFER;
   SET OLD;
   BY NAME;
   DIFFER = DIF(DATE);
   IF FIRST.NAME=0;

Computing the average of the DIFFER values is simple. PROC MEANS could do it:

   PROC MEANS DATA=DIFFER;
      VAR DIFFER;
      BY NAME;

Or, the same DATA step could continue with additional calculations:

   IF DIFFER > . THEN DO;
      TOTAL + DIFFER;
      N + 1;
   END;
   IF LAST.NAME;
   IF N > 0 THEN MEAN = TOTAL / N;
   OUTPUT;
   TOTAL = 0;
   N = 0;

However, if you really wanted to (or didn’t have the programming skills to choose one of the methods above), you could transpose the data to work with variables instead of observations:

   PROC TRANSPOSE DATA=OLD OUT=NEW PREFIX=DATE;
      VAR DATE;
      BY NAME;

Assuming you know the maximum number of transposed variables per NAME is 20, calculate the differences in a DATA step:

   DATA DIFFER;
   SET NEW;
   ARRAY DATES {20} DATE1-DATE20;
   ARRAY DIFFS {19} DIFF1-DIFF19;
   DO I=1 TO 19;
      DIFFS{I} = DATES{I+1} - DATES{I};
   END;

At this point, computing the means is easy. Just add to the same DATA step:

   MEAN = MEAN(OF DIFF1-DIFF19);

On the other hand, if you enjoy a really long program, you could always transpose these differences back the other way:

   PROC TRANSPOSE DATA=DIFFER
        OUT=FINAL (KEEP=NAME COL1 RENAME=(COL1=DIFFER));
      VAR DIFF1-DIFF19;
      BY NAME;

Finally, the data are ready for the same PROC MEANS as in the original example:

   PROC MEANS DATA=FINAL;
      VAR DIFFER;
      BY NAME;

Of course, you wouldn’t write such an involved program. However, I didn’t make up this example out of thin air! For some individuals, the lesson is clear: learn to process groups of observations in a DATA step and you won’t have to transpose your data so much.

The author welcomes questions, comments, and bright ideas on interesting programming techniques. Feel free to call or write:

   Bob Virgile
   Robert Virgile Associates, Inc.
   3 Rock Street
   Woburn, MA 01801
   (781) 938-0307