
NESUG 2008

Foundations & Fundamentals

Reading Difficult Raw Data


Matthew Cohen
Wharton Research Data Services
ABSTRACT

Raw data from outside sources is often messy and can be difficult to read into SAS. This paper will cover many of
the features and creative uses of the INFILE and INPUT statements that make this task easier. It will also discuss
using file and directory SAS I/O functions to add another level of flexibility. This is intended for beginner or
intermediate programmers, or anyone who reads raw data into SAS.
Since all of the examples in this paper are based on real situations, most require the use of multiple options.

EXAMPLE 1
This first example reads from a fixed width text file. The value for each variable can be found between the same
starting and ending columns throughout the input file. The unusual aspect of this file is that the person's name is on
one line and their age, sex, income, and education is on the next. This sample file has 6 variables and 4 records.
Below is the data followed by the code to read it into SAS.
Joe   Smith
20 M  35000  BS
Jane  Doe
24 F  50000  MS
Matt  Brown
30 M  65500  HS
Mike  Johnson
55 M  85000  PhD

data new;
infile "c:\myfile.txt";
length fname $6 lname $10 sex $1 education $3;
informat lname $10. income 7.;
input fname 1-6 @7 lname /
age 1-3 sex +1 income @14 education;
run;
There are several things to note in this code:
The default length is 8 for both character and numeric variables. I declare a length for each of the character
variables, but leave the numeric variables unchanged.
The INPUT statement actually reads in the data. In column input, the variable name is followed by the beginning
and ending columns. In formatted column input, the first column of the variable is written following an @, and the
informat tells SAS how many characters to read. SAS will accept any combination of formatted and unformatted
column input.
+1 (or +<any number>) moves the pointer one column to the right.
/ moves the pointer to the next line.
Data may also be written directly into the SAS editor for testing purposes using the DATALINES statement.

data new;
infile datalines;
length fname $6 lname $10 sex $1 education $3;
informat lname $10. income 7.;
input fname 1-6 @7 lname /
age 1-3 sex +1 income @14 education;
datalines;
Joe   Smith
20 M  35000  BS
Jane  Doe
24 F  50000  MS
Matt  Brown
30 M  65500  HS
Mike  Johnson
55 M  85000  PhD
;
run;

In rare cases, external files will behave differently than datalines. The rest of this paper reads the raw data from
external files, the method that is more common when reading large amounts of data from outside sources.

EXAMPLE 2
This example reads from a pipe-delimited ( | ) text file. A delimiter is a character used to denote the end of one
value and the beginning of the next. It is needed because the values for each variable may be in different locations
on each record. The advantage of the pipe over comma or space delimiters is that the pipe is rarely used in the
actual data values.
ID | ticker | trade date | trade time | price | sequence num
1|abc|20080101|093000|10.23|123456|
251|def.g|20080229|170000|5.02|654321|
255|bls|20080915|123000|100.00|987654|
260|xyz|20080916|091500|15.75|555132|
data new;
infile "c:\myfile.txt" dlm="|";
retain id ticker trdate trtime price seq_num;
length ticker $5 seq_num $6 trtime_temp $6;
informat trdate yymmdd8.;
format trdate yymmdds10. trtime time8.;
input id ticker trdate trtime_temp price seq_num;
trtime=hms(input(substr(trtime_temp,1,2),2.),input(substr(trtime_temp,3,2),2.),
input(substr(trtime_temp,5,2),2.));
drop trtime_temp;
run;
This is a fairly simple example, but still takes advantage of several techniques.
The INFILE statement includes the delimiter (DLM) option and sets it equal to pipe.
The RETAIN statement has many uses; in this case it orders the variables in the output dataset.
SAS needs an INFORMAT to correctly read date values, and a FORMAT to display date and time values in a
human-readable form. The informats can be written directly on the INPUT line instead of as a separate
statement, but I find the code easier to read when they are separate.
Notice that the variables on the INPUT statement are listed one after the other, in the same order as the raw
data. This is known as list input, the alternative to column input. This raw data file cannot be used with column
input.
The time values are actually read into a character variable because they do not match any SAS informats. I use
the SUBSTR function to read the hours, minutes, and seconds separately, and feed them into the HMS function
which creates a SAS time.
Even a very simple input file requires some knowledge of SAS options and functions.
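
As a sketch of the alternative mentioned above, the informats can be placed on the INPUT statement itself by using the colon (:) modifier, which applies the informat but still stops reading at the next delimiter. The dataset name inline_informats is made up for illustration, and the data is supplied through DATALINES so the step can be run without an external file.

```sas
data inline_informats;
    infile datalines dlm="|";
    /* The colon modifier applies each informat but still stops at the delimiter */
    input id
          ticker      :$5.
          trdate      :yymmdd8.
          trtime_temp :$6.
          price
          seq_num     :$6.;
    format trdate yymmdds10.;
datalines;
1|abc|20080101|093000|10.23|123456|
251|def.g|20080229|170000|5.02|654321|
;
run;

proc print data=inline_informats;
run;
```

Whether this reads better than a separate INFORMAT statement is a matter of taste; the separate statement keeps the INPUT line short, while the inline form keeps each variable next to its informat.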

EXAMPLE 3a
This example is very similar to the previous one, but introduces a small data error - the last value on the third record
is on a new line. For simplicity, the remaining examples do not transform the time variables.
1|abc|20080101|093000|10.23|123456|
251|def.g|20080229|170000|5.02|654321|
255|bls|20080915|123000|100.00|
987654|
260|xyz|20080916|091500|15.75|555132|

data new;
infile "c:\myfile.txt" dlm="|" flowover;
retain id ticker trdate trtime_temp price seq_num;
length ticker $5 trtime_temp $6 seq_num $6;
informat trdate yymmdd8.;
format trdate yymmdds10.;
input id ticker trdate trtime_temp price seq_num;
run;

The FLOWOVER option is the default for the INFILE statement, and does not have to be written in the code. In
cases where the record ends before all input values have been read, it tells SAS to look on the next line for the
remaining values.

SAS correctly reads the sequence number even though it is on the next line.

EXAMPLE 3b
Here, a new variable named flag has been added to the input data.
1|abc|20080101|093000|10.23||123456|
251|def.g|20080229|170000|5.02||654321|
255|bls|20080915|123000|100.00|y|
987654|
260|xyz|20080916|091500|15.75||555132|
data new;
infile "c:\myfile.txt" dlm="|" dsd flowover;
retain id ticker trdate trtime_temp price flag seq_num;
length ticker $5 trtime_temp $6 flag $1 seq_num $6;
informat trdate yymmdd8.;
format trdate yymmdds10.;
input id ticker trdate trtime_temp price flag seq_num;
run;

The flag variable is missing in some cases, causing two pipes to be read in a row. The DSD option tells SAS to
treat them as separate delimiters and set the value in between to missing.

With the DSD option turned on, SAS no longer reads the wrapped value correctly. It sees the pipe after the flag
variable as one delimiter and the end of the record acts as another delimiter. Two delimiters in a row cause the
sequence number on the third record to be set to missing. Even worse, SAS treats the sequence number as the first
variable in the next observation. From there, the rest of the file is incorrect. FLOWOVER can be useful, but can also
have unintended consequences.

EXAMPLE 3c
The alternatives to FLOWOVER are MISSOVER and TRUNCOVER. Both MISSOVER and TRUNCOVER will set
variables to missing if they are not found on the original input record; neither wraps to the next line. The difference is
that MISSOVER will set a variable to missing if it cannot read the entire value. TRUNCOVER will write out any part
of the value that it can read, and then stop. It is actually very difficult to find a situation where MISSOVER and
TRUNCOVER produce different results since SAS can almost always read the whole value.
MISSOVER and TRUNCOVER are always the same when using list input or formatted column input. The only time
they produce different results is with column input, without the use of informats. And even then, there is only a
difference if the record length is shorter than expected. Using the first input file as an example:
Joe   Smith
20 M  35000  BS
Jane  Doe
24 F  50000  MS
Matt  Brown
30 M  65500  HS
Mike  Johnson
55 M  85000  PhD

data new;
infile "c:\myfile.txt" <missover|truncover>;
length fname $6 lname $10 sex $1 education $3;
informat lname $10. income 7. education $3.;
input fname 1-6 @7 lname /
age 1-3 sex 4 +1 income education 14-16;
list;
run;

The education variable is now read in by stating the exact columns rather than using an informat.
MISSOVER will set Education to missing for the first three records because they only use 2 of the 3 allotted
characters. The value PhD is read correctly.
TRUNCOVER will store all four values as they appear in the raw data.
The LIST statement writes the input line to the SAS Log along with a ruler, which may help to debug programs.
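
The difference can be tried out with a small self-contained sketch. Because DATALINES records are padded with blanks, the short-record case only shows up with an external file, so this sketch first writes two records to a temporary file. The fileref demo and the dataset names with_missover and with_truncover are made up for illustration, and only the second line of each name/value pair is used.

```sas
filename demo temp;

/* Write two records with no trailing blanks: 15 and 16 characters long */
data _null_;
    file demo;
    put "20 M  35000  BS";
    put "55 M  85000  PhD";
run;

/* Column input for education asks for columns 14-16 */
data with_missover;
    infile demo missover;
    input age 1-3 sex $ 4 income 7-13 education $ 14-16;
run;

data with_truncover;
    infile demo truncover;
    input age 1-3 sex $ 4 income 7-13 education $ 14-16;
run;

proc print data=with_missover;
run;

proc print data=with_truncover;
run;
```

If the behavior described above holds, the first record should show a missing education value under MISSOVER and the value BS under TRUNCOVER, while PhD is read in both cases.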

EXAMPLE 4

Example 4 introduces a new data error: there is a blank row after the 3rd record.
1|abc|20080101|093000|10.23||123456|
251|def.g|20080229|170000|5.02||654321|
255|bls|20080915|123000|100.00|y|987654|

260|xyz|20080916|091500|15.75||555132|
data new;
infile "c:\myfile.txt" dlm="|" dsd flowover;
retain id ticker trdate trtime_temp price flag seq_num;
length ticker $5 trtime_temp $6 seq_num $6 flag $1;
informat trdate yymmdd8.;
format trdate yymmdds10.;
input id @;
if id = . then delete;
input ticker trdate trtime_temp price flag seq_num;
run;

The trailing @ option tells SAS to hold the current input record.

While the input record is being held, I perform validation of the ID variable. If it is blank, the record is deleted,
otherwise SAS continues with the input.
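
A variation on the same idea, offered only as a sketch: instead of testing the first variable, the held record itself can be tested through the automatic _INFILE_ variable. This version assumes TRUNCOVER rather than FLOWOVER so that a short record never pulls values from the next line, and it uses DATALINES so it can be run as is. The dataset name skip_blank is made up for illustration.

```sas
data skip_blank;
    infile datalines dlm="|" dsd truncover;
    length ticker $5 flag $1 seq_num $6;
    informat trdate yymmdd8.;
    format trdate yymmdds10.;
    input @;                             /* load the record and hold it        */
    if missing(_infile_) then delete;    /* an all-blank record is skipped     */
    input id ticker trdate trtime_temp :$6. price flag seq_num;
datalines;
1|abc|20080101|093000|10.23||123456|
251|def.g|20080229|170000|5.02||654321|
255|bls|20080915|123000|100.00|y|987654|

260|xyz|20080916|091500|15.75||555132|
;
run;
```

Testing the whole record is useful when the first field may legitimately be missing on a non-blank record; testing the ID, as in the example above, is the better choice when a missing ID is itself the error.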

EXAMPLE 5a
In some ways, the trailing @ is the opposite of the FLOWOVER/MISSOVER options. FLOWOVER tells SAS what to
do if there are fewer values in a raw data record than there are input variables. The @ and @@ options tell SAS
what to do if there are more raw data values than input variables.
In this example, each record has two observations. For simplicity, only 3 variables are used.
1|abc|20080101|251|def.g|20080229|
255|bls|20080915|260|xyz|20080916|
data new;
infile "c:\myfile.txt" dlm="|";
retain id ticker trdate;
length ticker $5;
informat trdate yymmdd8. id 3.;
format trdate yymmdds10.;
input id ticker trdate @;
output;
input id ticker trdate;
output;
run;

The first INPUT statement reads the first 3 values as usual. The trailing @ holds the pointer after the third value.

The first OUTPUT statement writes them out to the first observation.

The second INPUT tells SAS to read the next 3 values, which are then written to the second observation. SAS
moves on to the next input record and starts again.

EXAMPLE 5b
The input in this case is similar, but the ID is only listed once and is followed by two sets of observations.
1|abc|20080101|def.g|20080229|
255|bls|20080915|xyz|20080916|
data new;
infile "c:\myfile.txt" dlm="|";
retain id ticker trdate;
length ticker $5;
informat trdate yymmdd8. id 3.;
format trdate yymmdds10.;
input id @;
input ticker trdate @;
output;
input ticker trdate;
output;
run;

Three INPUT statements are used: one for the ID and one more for each observation.

This technique can get complicated, but it is flexible enough to handle many difficult raw data situations.
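
When the number of repeated groups is known, the repeated INPUT and OUTPUT statements can be folded into a DO loop. The following is a minimal sketch, assuming exactly two ticker/date pairs per record as in the data above; the dataset name looped and the loop counter pair are made up for illustration.

```sas
data looped;
    infile datalines dlm="|";
    length ticker $5;
    informat trdate yymmdd8.;
    format trdate yymmdds10.;
    input id @;                     /* read the ID once and hold the record  */
    do pair = 1 to 2;               /* assumed: two ticker/date pairs follow */
        input ticker trdate @;
        output;
    end;
    drop pair;
datalines;
1|abc|20080101|def.g|20080229|
255|bls|20080915|xyz|20080916|
;
run;
```

The held record is released automatically when the DATA step returns to the top for its next iteration, so no final INPUT without a trailing @ is needed.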

EXAMPLE 5c
To read streaming data, where the observations are listed on the same line, one after the other, use the trailing @@
option. Here, 4 observations appear on the same raw data record.
1|abc|20080101|251|def.g|20080229|255|bls|20080915|260|xyz|20080916
data new;
infile "c:\myfile.txt" dlm="|" dsd flowover;
retain id ticker trdate;
length ticker $5;
informat trdate yymmdd8.;
format trdate yymmdds10.;
input id ticker trdate @@;
run;

The trailing @@ option tells SAS to keep reading the current input record until all fields have been read. SAS
loops through the variables on the INPUT statement until the end of the record.
Notice that there is no ending delimiter after the last value. If one exists, SAS continues to look for more input
data. SAS still reads the data correctly, but notes a Lost Card and gives an error.

EXAMPLE 6
This example is from a different data source that tracks product sales. The input file is tab delimited, and the lines
are very long and filled with characters that have special meaning in SAS. Even so, this is a relatively easy file to
read.
Dell	225	1	17" LCD monitor
Sony	1499.50	1	Brand new 2008 Sony 50 inch plasma television; 1080i resolution; speakers on each side of the unit; 10 ports including HDMI, component, composite, and s-video located on the front and back of the unit; swivel base included; Energy Star compliant.
Canon	150	2	Canon A280 digital camera

data new;
infile "c:\myfile.txt" dlm="09"x dsd missover lrecl=1000;
retain company price quantity description;
length company $20 description $1000;
input company price quantity description;
run;

Tab delimiters are specified using a hexadecimal string literal. 09 in hexadecimal is a tab.
LRECL is the logical record length. The default was 256 when this paper was written (SAS 9.4 and later default to
32767), meaning that SAS will read the first 256 characters of a record unless directed otherwise. SAS prints the
note "One or more lines were truncated." if a record is longer than the LRECL.

If the length of a long character variable is not given in the data's documentation, it will take 2 or more passes through
the data to determine an appropriate length in SAS. If the SAS length is too short, the data will be truncated. A
variable length that is too long wastes memory. This is because SAS pads unused portions of text variables with
blanks.
To read in the data initially, set a length that is likely to be too big for the variable. Once it is in a SAS dataset, use
the LENGTH function to determine the maximum actual length used. The ideal is one character less than the defined
length. The single blank character ensures that nothing was truncated, but wastes minimal space. If the maximum
length equals the defined length, increase the defined length and read in the data again.
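
The length check described above might look like the following sketch. To keep it runnable without an external file, it uses DATALINES with a pipe as a stand-in for the tab delimiter and shortened descriptions; the dataset name products and the column name max_desc_len are made up for illustration.

```sas
data products;
    /* Deliberately generous length for the first pass */
    length company $20 description $1000;
    infile datalines dlm="|" dsd truncover;
    input company price quantity description;
datalines;
Dell|225|1|17 inch LCD monitor
Canon|150|2|Canon A280 digital camera
;
run;

/* Longest description actually stored, ignoring trailing blanks */
proc sql;
    select max(length(description)) as max_desc_len
    from products;
quit;
```

If max_desc_len comes back equal to the defined length, the value may have been truncated, and the defined length should be increased before reading the data again.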

EXAMPLE 7
The raw data files I receive are getting significantly bigger over time. I often receive them in a compressed (zipped)
format. Decompressing the file then reading it in SAS requires an extra manual step in what might otherwise be an
automated process, and also takes extra disk space. SAS can read directly from compressed files with the INFILE
statement.
infile "unzip -p /home/myfile.zip" pipe firstobs=2 obs=11;

The PIPE device type tells SAS to read the output of the quoted command rather than a file. This example reads
from a zipped file on UNIX and writes directly to SAS. This will also work on Windows with the right software.
The FIRSTOBS option specifies the first raw data record to read, while the OBS option specifies the last record
to read. FIRSTOBS=2 is useful when the variable names appear in the first line of the raw data. In this case, only
records 2-11 (10 records) are read. It is often easier to test and debug programs using a small number of records.
infile "ls /home/datadir/*.txt | perl -pe 's/\s+/\n/'" pipe;

This is a more complicated version of a piped INFILE statement. It lists the names of all the text files in the
directory and passes them to SAS on separate lines.
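
One way to use such a list of file names, sketched here under several assumptions, is to feed each name to a second INFILE statement through the FILEVAR= option. The sketch assumes a UNIX host where SAS is allowed to run shell commands, and it assumes every file has the three-field pipe-delimited layout from Example 5a. The names combined, path, source, done, and dummy are made up for illustration.

```sas
data combined;
    length path source $256 ticker $5;
    format trdate yymmdds10.;

    /* First INFILE: one file name per line from the ls command */
    infile "ls /home/datadir/*.txt" pipe truncover;
    input path $256.;

    /* Second INFILE: dummy is a placeholder; FILEVAR= supplies the real file */
    infile dummy filevar=path dlm="|" truncover end=done;
    source = path;                  /* FILEVAR= variables are not written out */
    do while (not done);
        input id ticker trdate :yymmdd8.;
        output;
    end;
run;
```

The END= variable is reset each time FILEVAR= opens a new file, so the inner loop reads one file completely before the step returns to the top and fetches the next name.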

EXAMPLE 8
Another method of obtaining directory and file information is with the SAS I/O functions. Some of the I/O functions
work on SAS datasets, others on raw files or directories.
filename testDir "c:\mydir";
filename readme "c:\mydir\readme.txt";
data _null_;
dirID = dopen("testDir");
if dirID = 0 then do;
put "Could not open directory";
stop;
end;
dirNum = dnum(dirID);
put dirNum=;
fname = dread(dirID, 15);
put fname=;
if fexist("readme") then put "Yes";
dirID = dclose(dirID);
run;

The FILENAME statement defines a raw data file or directory in SAS.


Most I/O functions require you to open the file or directory and retrieve an ID number.
The DNUM function returns the number of files in the directory.
The DREAD function returns the name of a file. 15 is a sequence number: the 15th file read from the directory.
The FEXIST function reports whether or not a specified file exists.
DCLOSE closes the directory.

This example can be used to read an entire directory of raw data files, or a subset of files if a little extra logic is
applied. I use the FEXIST function to look for new documentation from the data vendor. The I/O functions can
accomplish some of the same tasks as the piped INFILE statement, using SAS functions instead of shell commands and Perl.
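
As a sketch of reading a whole directory listing, the DNUM and DREAD calls can be combined in a loop. The directory path is the one from the example above; the dataset name dir_listing, the variable rc, and the rule of keeping only names ending in .txt are assumptions made for illustration.

```sas
filename testDir "c:\mydir";

data dir_listing;
    length fname $256;
    dirID = dopen("testDir");
    if dirID = 0 then do;
        put "Could not open directory";
        stop;
    end;
    /* Visit every member of the directory by sequence number */
    do i = 1 to dnum(dirID);
        fname = dread(dirID, i);
        /* Assumed filter: keep only names whose extension is txt */
        if lowcase(scan(fname, -1, ".")) = "txt" then output;
    end;
    rc = dclose(dirID);
    keep fname;
run;
```

The resulting dataset holds one file name per observation, which can then drive whatever step reads the individual files.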

CONCLUSION
This paper has attempted to explain, by example, many of the options on the INFILE and INPUT statements, as well
as SAS I/O functions that can be used to read difficult raw data. Below is a summary of the options and a description
of each. Please note that these are not the official SAS descriptions, and may leave out some details.
INFILE options:
DATALINES      Used to write raw data directly into the SAS editor.
DSD            1. Removes quotes surrounding values in the raw data.
               2. Treats consecutive delimiters as separate, and sets the value between them to missing.
               3. Sets the default delimiter to comma.
DLM            Sets the delimiter.
FLOWOVER       If the input record ends before all variables are read, tells SAS to look on the next record.
MISSOVER       If the input record ends before all variables are read, tells SAS to set any remaining variables
               that cannot be read completely to missing.
TRUNCOVER      If the input record ends before all variables are read, writes out any part of the remaining
               variables that have been read and sets the rest to missing.
LRECL          Logical record length. Sets the maximum number of characters to read on a raw data record.
PIPE           Reads the output of a command, rather than a file.
FIRSTOBS       Sets the first line of an input file that SAS should read.
OBS            Sets the last line of an input file that SAS should read.

INPUT options:
+n             Moves the pointer forward n spaces.
@n             Moves the pointer to column n.
/              Moves the pointer to the next line.
#n             Moves the pointer to line n.
(trailing) @   Holds the pointer at its current location on a raw data record.
(trailing) @@  Holds the pointer at its current location on a raw data record and continues until the end of
               the record.

I/O functions:
DOPEN          Opens a directory for reading.
DNUM           Returns the number of files in a directory.
DREAD          Returns the name of a given file.
FEXIST         Reports whether a given file exists.
DCLOSE         Closes an open directory.

Statements:
LIST           Writes the input line to the SAS Log along with a ruler.

CONTACT INFORMATION
Matthew Cohen
Senior IT Project Leader
Wharton Research Data Services
cohenmw@wharton.upenn.edu
http://wrds.wharton.upenn.edu
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries.
