How to import SAS data set into R

Erik Jørgensen November 7, 2007

1

Introduction

The import of data into R is still a bit complicated if the original data are stored in the SAS-data format. Though solutions have existed for a long time, see e.g. the R-documentation concerning Importing from other statistical systems, it is complicated to use these facilities. They are based on functions which need to call the SAS-program from R. The settings for doing this do not always correspond to the settings on the computers we use. Instead we will focus on the export problem from the SAS point of view. We will present three different transfer formats, and illustrate some of their benefits and short-comings. As a start we use a short SAS program to generate a dataset that will illustrate our main points. As shown in Table 1 the dataset has a very long name, some of variable are long and they even share the initial character sequence. In addition SAS formats (date and numeric) are specified for some of the variables. Both short and long labels are specified. Table 1: SAS-code for generating example data set libname sasdata "C:/SASImport"; data sasdata.RImportTest ; do i=1 to 10 ; x=i ; y=i*i ; if i>5 then y=. ; Weightincrease=y-x ; Weightincreasedate=mdy(11,i,2007) ; Dayasnumber =Weightincreasedate; output ; end; label x= 'x has a short description' y= 'A very long description that is clearly longer than allowed'; format y 8.1 Weightincreasedate date7. Dayasnumber 8.0;

1

2

Reading SAS files

To read SAS-files in R it is necessary to store them in an intermediate file format. Unfortunately, this cannot be done without some loss of information. The solutions presented here use three different intermediate formats. Each solution has its own benefits. 1. Comma separated file (csv). This format can be read into R via the read.csv function 2. Excel file format (xls). This format can be read into R via the read.xls function in the xlsReadWrite package. 3. SAS xport file (xpt). This format can be read into R via the read.xport function in the SASxport package. The most complicated part is to translate the SASs data format into the intermediate format. This should be done within SAS, and thus requires a SAS license. The second major obstacle is to transfer the specialised SAS information from the files, such as formatting (e.g. dates) and labels. In the following the export to the three different formats is shown. It seems that three different methods have to be used within SAS. We will present each of these in the following section. Actually, we have defined some R-functions that specify the SAS-program and executes SAS as a batch command automatically directly from R. The functions are presented in the appendix in section A.

2.1

Writing to csv - files

This uses the PROC EXPORT in SAS. Actually, this is called by the Export data wizard of SAS. To use the wizard, choose File|Export data from the main menu in SAS, and it guides you through the necessary specification. You can choose to save the SAS-program during the process. Then you have a template that can be used later on. The resulting lines are shown in Table 2. Table 2: SAS-code for export to csv file libname sasdata "C:/SASImport"; proc export data=sasdata.rimporttest outfile= "rimporttest.csv" DBMS=CSV REPLACE;

The resulting rimporttest.csv file, may be read from the read.csv function in R.

2.2

Writing to xls - files

In principle and according to the documentation it should be possible to us the method described in section 2.1, if we specify DBMS=EXCEL in the PROC EXPORT. However, it does not seem to work, at least on my machine. Therefore another method is employed. This method uses the new facility in SAS 9. Here you may specify the database management system directly in the libname statement. The necessary program statements are shown in Table 3.

2

Table 3: SAS-code for export to xls file libname sasdata "C:/SASImport"; libname out excel "C:/SASImport/file798b12e1.xls"; data out.rimporttest; set sasdata.rimporttest; run;

There are two irritating infelicities in this approach. Firstly, the output file may not exist before. If this is the case the program ends with an error message. Therefore use a new name. Actually the filename used in Table 3 is generated in the R-function readsas.xls (see appendix A) using a random number generator. Secondly, the xls-file is locked by SAS until you release it and cannot be used by other programs. The easiest way of doing this, is to exit SAS. Finally, the xls dataset is read into R using the read.xls function in the package xlsReadWrite.

2.3

Writing to xport - files

This is the most comprehensive transfer option. Originally it was used for copying SAS-files between SAS installed on different operating systems. Unfortunately, the facility has not been upgraded since SAS version 6. Therefore you need to downgrade your data set in two steps. First step is to reduce the name of the dataset to only 8 characters. You can do this using Windows explorer. E.g if your data is in the file longdatasetname.sas7bdat, copy the file and rename the new file to shortdat.sas7bdat. Then you can address the dataset as e.g. data a; set sasdata.shortdat; in SAS. You may also use this facility to copy the dataset, using the following program lines: data sasdata.shortdat; set sasdata.longdatasetname;. Second step is to use proc copy to reduce the length of variable names and labels, as shown in Table 4. You need initially to specify that you should restrict the valid names to those that were allowed in version 6 (options VALIDVARNAME=V6;). If the data statement is used for copying to a short dataset name, it is to specify this option after the copying. If it is specified before the dataset with the short name is generated, the copying is not allowed, because the long dataset name is not a valid name in version 6. Table 4: SAS-code for export to xport file options VALIDVARNAME=V6; libname source "C:/SASImport"; libname xportout xport "C:/SASImport/le26ca58.xpt" ; proc copy in=source out=xportout memtype=data; select le26ca58; run;

Again the strange naming of the dataset is because it is generated by the random generator in RFinally, the xpt dataset is read into R using the read.xport function in the package SASxport.

3

3

Reading the dataset

To illustrate the result of the data import, the result is presented of importing the data set via the three different intermediate formats. The function read.sas in appendix A is used. However, the same result would have been obtained by running the three different SAS-programs to generate to files in the three formats, and subsequently, using either read.csv, read.xls or read.xport. > > > > > > library(xtable) library(SASxport) library(xlsReadWrite) datcsv <- readsas("rimporttest", via = "csv") datxls <- readsas("rimporttest", via = "xls") datxport <- readsas("rimporttest", via = "xport")

3.0.1

Output from csv-file

> (datcsv) i x 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 y Weightincrease Weightincreasedate Dayasnumber 1 0 01NOV07 17471 4 2 02NOV07 17472 9 6 03NOV07 17473 16 12 04NOV07 17474 25 20 05NOV07 17475 NA NA 06NOV07 17476 NA NA 07NOV07 17477 NA NA 08NOV07 17478 NA NA 09NOV07 17479 NA NA 10NOV07 17480

> str(datcsv) 'data.frame': 10 obs. of 6 variables: $ i : int 1 2 3 4 5 6 7 8 9 10 $ x : int 1 2 3 4 5 6 7 8 9 10 $ y : num 1 4 9 16 25 NA NA NA NA NA $ Weightincrease : int 0 2 6 12 20 NA NA NA NA NA $ Weightincreasedate: Factor w/ 10 levels "01NOV07","02NOV07",..: 1 2 3 4 5 6 7 8 9 10 $ Dayasnumber : int 17471 17472 17473 17474 17475 17476 17477 17478 17479 17480 The variable names are correct. The date variable is transferred as a character string and need to be translated into the R date format, in a subsequent step. The labels of the variables are not transferred. 3.0.2 Output from xls-file

> (datxls)

4

i x 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10

y Weightincrease Weightincreasedate Dayasnumber 1 0 39387 17471 4 2 39388 17472 9 6 39389 17473 16 12 39390 17474 25 20 39391 17475 NA NA 39392 17476 NA NA 39393 17477 NA NA 39394 17478 NA NA 39395 17479 NA NA 39396 17480

> str(datxls) 'data.frame': 10 obs. of 6 variables: $ i : num 1 2 3 4 5 6 7 8 9 $ x : num 1 2 3 4 5 6 7 8 9 $ y : num 1 4 9 16 25 NA NA $ Weightincrease : num 0 2 6 12 20 NA NA $ Weightincreasedate: num 39387 39388 39389 $ Dayasnumber : num 17471 17472 17473

10 10 NA NA NA NA 39390 17474

NA NA 39391 ... 17475 ...

In the xls format the variable names are also correct. The date variable is transferred as a integer value. The date value corresponds to the representation in R’s Date representation, but you need to specify that in a subsequent step. Note that the numeric representation of date values in SAS (variable Dayasnumber) differs from those in R and Excel (variable Weightincreasedate). The labels of the variables are not transferred. A note of caution: There are some limitations to the size of an excel dataset, both concerning number of variables and number of observations. We have not yet tested whether this problem is circumvented because we do not call excel explicitly. 3.0.3 Output from xport-file

In the xport format the variable names are no longer correct, but translated into upper case names with a length of 8-characters. If the starting sequence is identical SAS uses digits to distinguish between the names. The date variable is transferred as a integer value but with the Class Date. Thus they are printed in a readable format, and you can se from data structure that it is a date value. Note that the numeric representation of date values in SAS differs from those in R and Excel. Finally, the labels of the variables are transferred, even though the long label is shortened to a length of 40 characters. > (datxport) I 1 2 3 4 5 6 X Y WEIGHTIN WEIGHTI2 DAYASNUM 1 1 0 2007-11-01 17471 2 4 2 2007-11-02 17472 3 9 6 2007-11-03 17473 4 16 12 2007-11-04 17474 5 25 20 2007-11-05 17475 6 NA NA 2007-11-06 17476 5

1 2 3 4 5 6

7 7 7 8 8 8 9 9 9 10 10 10

NA NA NA NA

NA NA NA NA

2007-11-07 2007-11-08 2007-11-09 2007-11-10

17477 17478 17479 17480

> str(datxport) 'data.frame': 10 obs. of 6 variables: $ I : int 1 2 3 4 5 6 7 8 9 10 $ X : atomic 1 2 3 4 5 6 7 8 9 10 ..- attr(*, "label")= chr "x has a short description" $ Y : atomic 1 4 9 16 25 NA NA NA NA NA ..- attr(*, "label")= chr "A very long description that is clearly" $ WEIGHTIN: atomic 0 2 6 12 20 NA NA NA NA NA ..- attr(*, "label")= chr "Weightincrease" $ WEIGHTI2:Class 'Date' atomic [1:10] 13818 13819 13820 13821 13822 ... .. ..- attr(*, "label")= chr "Weightincreasedate" .. ..- attr(*, "SASformat")= chr "DATE7." $ DAYASNUM: atomic 17471 17472 17473 17474 17475 17476 17477 17478 17479 17480 ..- attr(*, "label")= chr "Dayasnumber"

> label(datxport$X) [1] "x has a short description"

> label(datxport$Y) [1] "A very long description that is clearly"

4

Summary

In most cases the use of the xls-format is probably the most convenient. If the dataset is documented with labels the xport format may be the best, even though the variable names is mangled. Finally, the csv-format may be the best for very large datasets that cannot be handled within the limitations of R.

A

R-functions

In this section the R functions are presented. They may also be downloaded here.

6

A.1

readsas.r

"readsas" <function(dataname,libname='sasdata',dir=getwd(),via='csv'){ if (via=='csv') { expdata<-readsas.csv(dataname,libname=libname,dir=dir) } else { if (via=='xls') { expdata<-readsas.xls(dataname,libname=libname,dir=dir) } else { # via = xport assumed expdata<-readsas.xport(dataname,libname=libname,dir=dir) }} expdata }

A.2

readsas.csv.r

`readsas.csv` <function(dataname,libname='sasdata',dir=getwd()){ libnamestr<-paste('libname ',libname,' "',dir,'";',sep='') exportstr<- paste('proc export data=',libname,'.',dataname,sep='') outfilestr<-paste(' outfile= "',dataname,'.csv"',sep='') dbmsstr<' DBMS=CSV REPLACE;' sasprogname<-paste(dataname,"exportcsv.sas",sep="") sasprog<-cat(libnamestr,exportstr,outfilestr,dbmsstr, sep="\n",file=sasprogname) ProgramPath<-Sys.getenv('ProgramFiles') sascmd=paste('"',ProgramPath,'\\SAS\\SAS 9.1\\sas.exe"',sep="") #sascmd='"C:\\Program Files\\SAS\\SAS 9.1\\sas.exe"' system(paste(sascmd, sasprogname)) read.csv(paste(dataname,'.csv',sep="")) }

A.3

readsas.xls.r

"readsas.xls" <function(dataname,libname='sasdata',dir=getwd()){ newdataname <- substr(tempfile(tmpdir=''),2,13) libnamestr <- paste('libname ',libname,' "',dir,'";',sep='') outlibnamestr <- paste('libname out excel ',' "',dir,'/', newdataname,'.xls";',sep='') datastepstr <- paste('data out.',dataname,'; set ', libname,'.',dataname,'; run;',sep='')

7

sasprogname <- paste(dataname,"exportxls.sas",sep="") sasprog <-t’cat(libnamestr,outlibnamestr,datastepstr, sep="\n",file=sasprogname) ProgramPath <- Sys.getenv('ProgramFiles') sascmd = paste('"',ProgramPath,'\\SAS\\SAS 9.1\\sas.exe"',sep="") system(paste(sascmd, sasprogname)) read.xls(paste(newdataname,'.xls',sep="")) }

A.4

readsas.xport.r

`readsas.xport` <function(dataname,libname='sasdata',dir=getwd(),filetype='sas7bdat'){ if (!file.exists(paste(dataname,'.',filetype,sep=''))) { print('!warning filetype does not exist')} #newdataname<-dataname newdataname<-substr(tempfile(tmpdir=''),4,11) if (nchar(newdataname)>8) {newdataname<-substr(newdataname,1,8)} file.copy(paste(dataname,'.',filetype,sep=''), paste(newdataname,'.',filetype,sep='')) optionstr<-paste('options VALIDVARNAME=V6;') libnamestr<-paste('libname source ',' "',dir,'";',sep='') xportlibnamestr<-paste('libname xportout xport ','"',dir,"/", newdataname,'.xpt" ;',sep='') copystr<- paste('proc copy in=source out=xportout memtype=data;',sep='') selectstr<-paste('select ',newdataname,'; run;',sep='') sasprogname<-paste(dataname,"exportxport.sas",sep="") sasprog<-cat(optionstr,libnamestr,xportlibnamestr,copystr, selectstr,sep="\n",file=sasprogname) ProgramPath<-Sys.getenv('ProgramFiles') sascmd=paste('"',ProgramPath,'\\SAS\\SAS 9.1\\sas.exe"',sep="") system(paste(sascmd, sasprogname)) file.copy(paste(newdataname,'.xpt',sep=''),paste(dataname,'.xpt',sep='')) read.xport(paste(newdataname,'.xpt',sep="")) }

8