SAS is produced by the SAS Institute in Cary, NC. It is the most powerful and comprehensive statistics software available. We should avoid calling SAS a "program," since we write "programs" in SAS. But it is also not appropriate to refer to SAS as a "language" like C++, Fortran, etc.; SAS actually contains several computer languages within it. Even "application" doesn't seem to fully describe SAS, so maybe we should just use the term SAS gives itself at the top of its output: "The SAS System."

In 1976 the SAS Institute came out with its first software, a mainframe-based statistical analysis package. Since then, SAS has enjoyed phenomenal growth. Its software is available for all the major platforms and has grown beyond statistics to a variety of data management and business applications. To read more about SAS, see their website at www.sas.com, where a history timeline is given. SAS is also a community. User groups (SUGI, NDSUG, RRVSUG) provide opportunities to interact with other SAS enthusiasts, hold conferences, and publish articles about SAS programming techniques. There is even an annual SAS ballot, in which users can vote for the changes and enhancements they would like to see SAS work on.

SAS Windows

PC SAS is a Windows GUI for running SAS. When you open the program, you will notice three windows (probably overlapping) called "Editor," "Log," and "Output." An example appears below. The windows have been arranged so you can see them all.

The editor window is used to write programs. Text output will appear in the output window, and the log window will contain error messages and other program execution information. The typical process is to type a program in the editor, then submit it (running man icon) and look at your log and output results. After a program has been submitted, it remains in the editor, so you can make modifications and submit it again. You can save your programs and open previously saved programs from the file menu. However, the "Save" command will apply to the active (top) window. You can save editor, log, or output files, accordingly. An extension of .sas, .log, or .lst (respectively) will be added automatically unless you supply some other extension. (Using .txt may be helpful if you intend to open the file with Notepad or


a word processor, but most of the time it is best to stick with SAS's defaults.) There is one little catch you should remember: in order to open a saved program, an editor window must be active when you select "Open" from the file menu. If you have closed your editor window, you must first open a new one by selecting "Enhanced Editor" from the "View" menu, then proceed to the "File-->Open" dialog.

If you have font problems (incorrect characters) when you open an output file in a word processor, try changing the font to "SAS Monospace." This should work if you are on a computer with SAS installed. If SAS is not installed, you may not be able to get all of the characters correct, but any monospace font, like Courier, will straighten out most of the formatting.

Take a look at the program statements in the editor window. Notice that each line ends with a semicolon. SAS uses semicolons to define the end of a statement. It doesn't matter how the text is arranged, whether there are extra spaces, indentions, extra lines, multiple statements on a line, or statements split across lines. Such formatting can, and should, be used to enhance readability for humans, but to SAS all that matters is where the semicolons are.

Enterprise Guide

Enterprise Guide is a new environment for running SAS. Although the same programs work and produce similar results, there are some differences in the appearance and behavior of the interface. If you purchase the "SAS Learning Edition," this is the interface that will be presented after it is installed. An example appears below. In Enterprise Guide, your work is organized into projects, which appear in a collapsible tree structure on the left. The three windows mentioned above do not open automatically. In order to type in a program, you open a code window, which replaces the editor. Output windows open as needed when programs run, but not necessarily automatically.

Output, log, and code windows (also datasets) can be opened by double-clicking icons in the project tree. The contents do not accumulate in the log and output windows as they do in PC SAS. When you submit a program, you have the choice of overwriting previous output or starting a new node in the project tree. This keeps your results more organized. Note: the "running man" icon that is used for "submit" in the old version is replaced by a sheet of paper with a down-arrow beside it. You can use the regular interface described under "SAS Windows" with Learning Edition, but you have to find the sas.exe file in the program files directory. Make a shortcut to this file and place it on your desktop to use for starting SAS.

Help

When you need more information about making SAS do what you want, there are several sources you can access. The first is the "Help" facility installed on your computer and found at the right end of the menu bar. Unfortunately, because SAS is a huge and powerful product, the help can be a challenge to navigate. The appearance and content of your help menu may vary depending on the version of SAS you are working with. The examples here use version 9.1.3. Under "Help," select "SAS Help and Documentation." Then you should see a window like this:


Click on the plus sign by SAS Products and the tree opens like this:

There is an important lesson in this list. SAS is not just one program; it is a system of interconnected modules ("products"). In some ways it is like Microsoft Office, which has a number of components like Word, Excel, PowerPoint, etc. The list you see under "SAS Products" is not necessarily all of them, and you do not necessarily have all those listed available to you. In this course, we will only be dealing with a few of them, and most of the time, we will be in "Base SAS." Click the plus sign by "Base SAS" and you will see:

Again we have a long list of choices. The challenge is to learn where to look for the information you need, because there is simply so much that it can easily cause "information overload." As you begin learning, the most helpful sources of information will be under "SAS Command Reference," "SAS Procedures," or "SAS Language Dictionary." Spend some time looking around in "Help" and familiarizing yourself with what is there. Notice that there are also index and search facilities. Unfortunately, these often return too many "hits" from modules other than those you are interested in. Thus, there is no substitute for learning to navigate the help tree! SAS also provides SAS OnlineDoc 9.1.3 for the Web, which is similar to the help but may be more up-to-date. In fact, you can download it in PDF form and print whatever you want. The online version has a search facility which allows you to restrict your search to a particular module or procedure. In some cases this can be extremely helpful for avoiding unwanted hits.

Program Organization

A statement in SAS is sort of like a complete sentence (although the analogy won't extend as far as having a subject and verb) or a single command. Every statement must end with a semicolon, which signals the end of a command. Inside a statement you will find such things as keywords, options, and user-supplied names. SAS is pretty flexible in its ability to interpret words within a statement. For example, you can separate words with spaces, tabs, or returns; any white space will be treated the same way. You can split statements across multiple lines or put multiple statements on one line. We will try to develop a style that is easy to read and follow by using indentation and comments. SAS is also forgiving, in that it tries to figure out things that might be mistakes. For example, you can put two statements together without a space, because as long as a semicolon is there, SAS can tell where one ends and the next begins.
SAS is designed for data analysis, so programs are organized into “steps” that correspond roughly to steps you go through in analyzing data. However, these steps are not just a matter of programming style. They are blocks of code that SAS treats as a whole. Some information is passed from one step to the next, but you should think of steps as independent units.


There are data steps and proc steps. The main purpose of data steps is to create data sets. Proc steps may perform analysis tasks or other actions. Every data step begins with the keyword "data" and every proc step begins with the keyword "proc." (How logical is that?) A "run" statement can signal the end of a step (we will deal with exceptions later) and triggers execution of the statements preceding it. A step always ends when another statement that begins with "data" or "proc" is encountered, signaling a new step. Most of the time there must be a "run" statement at the end of a program or the last step will not be executed. Some procs also require a "quit" statement in order to stop them completely.

Some commands are called "global statements" because they are not really part of a step. Some examples are "options," "title," and "libname" statements, which will be described later. The effects of these commands typically hold across many steps.

We should understand the flow of events after a program is submitted. In most cases, SAS first reads all the commands in a step, checks them for errors, then executes the instructions before going on to the next step. If errors are found, SAS writes error messages and warnings to the log. If there are serious errors, SAS will not execute the statements in the step. If another step follows, SAS will go on and try to run the next step. This may produce unexpected results if a later step depends on output from a previous step that had errors. So, errors do not always cause SAS to stop reading and executing code. This is why it is important to ALWAYS CHECK THE LOG. You may get output that looks fine, but was produced in spite of errors that caused the results to be wrong!

A step could be just one line of code, but it could also be many lines. The program shown below has four steps.
Each begins with a keyword, either "data" or "proc," and ends with a run statement (even though it is redundant to place a "run" before another data or proc step, it turns out to be useful in some cases). Don't worry about the meaning of the other statements right now, but study the structure of the steps. Note how the editor separates the steps with horizontal lines and adds color codes to various elements of syntax.
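A program with the same four-step structure (two data steps and two proc steps, each ending with a run statement) might look something like this; the dataset names, variables, and data are made up for illustration:

```sas
/* Step 1: a data step */
data one;
   input x y;
   cards;
1 2
3 4
;
run;

/* Step 2: a proc step */
proc print data=one;
run;

/* Step 3: another data step */
data two;
   set one;
   z = x + y;
run;

/* Step 4: another proc step */
proc means data=two;
run;
```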

Incidentally, SAS has two editors. The one that opens by default in PC SAS is the "Enhanced Editor." We may explore uses of the other editor later, but for now we'll stick to this one. If you ever need to open another editor window (such as if you have closed one, or you want to have two programs showing at once), choose "Enhanced Editor" rather than "Program Editor" in the "View" menu.

Exercises

The following may be copied into a word processing document, which may be edited to answer the questions and then submitted. There is no need to run SAS programs for these exercises.

1. How many steps are there in the following program? Mark the beginning and end of each step and label it as a data or proc step.

data one;
do x=1 to 1000;
y=int(ranuni(0)*6+1);
output;
end;
run;
proc freq;
tables y;
data two;
set one;
z=7-y;
run;
proc freq;
tables z;
run;

2. Suppose the program above was modified so that an error occurred in the second step. What would probably happen when SAS tried to process the remaining steps? (Be sure you have #1 right!)

3. Find three errors in the following program, based on the information in this lesson. (Assume there are no statements split across multiple lines.)

data thirsty;
infile "c:\drinkexp.txt"
input subject brand rating;
procprint;run;
data two;
set one;

Chapter 2 Data Sets

SAS stores data in "SAS data sets," using its own internal database format. If you are familiar with any database software (Oracle, Sybase, Dbase, MS Access, etc.), you will find that data sets correspond to tables in a database program. The rows are called observations (database "records"), and the columns are called variables (database "fields"). In the example data set below, x, y, z, and w are variables, while 1, 2, 3, and 4 are observation numbers. It is also becoming more common to use the database terms with SAS, so it is good to be aware of both terminologies. Rows=Observations=Records. Columns=Variables=Fields.

However, data sets contain more than just variables and observations. Such things as how values are to be printed, labels to use instead of variable names, and sorting and indexing information can also be included. As you learn about SAS, you will see that SAS gives you a great deal of control over almost any aspect of your work that you might think of. However, greater control is obtained by using more complex program statements. So, SAS has many "default" settings that save you from extra work and headaches as long as you are satisfied with what is specified by the default. Therefore, many standard actions can be accomplished by very short SAS commands. When there is need for more control, additional commands are available. The first of these "defaults" that we will encounter concerns where data sets are stored.

Libraries

SAS stores data sets in libraries. (They are really just computer directories or folders, but the term comes from SAS's mainframe days.) There is a default library called "work." This is where the data will be placed if you don't specify another location. However, work is a temporary library; that is,


when SAS is closed, all data sets in work are deleted. If you want a data set to be saved permanently, such as in your "My Documents" folder or anywhere else you wish, you can tell SAS to designate a permanent library using a libname statement. The libname statement will look something like this:
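A minimal sketch; the libref "myownlib" is the one discussed below, but the folder path is only an example:

```sas
libname myownlib 'C:\My Documents\sasdata';
```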

There are three parts to this command. The first part is the keyword "libname" which tells SAS you want to define a new library. The second part, "myownlib" is the name that SAS will use to refer to this library. This is called the libref. You can supply any name you like here, as long as it meets the guidelines for allowed names, which are: you can use letters, numbers, and underscores, but can't start with a number. A libref can only be eight characters long, though SAS variable names can be up to 32 characters. The third part of the statement is the pathname (DOS style) showing the actual location of the folder or directory you want to use. It is enclosed in quotes (single or double). On some Windows systems, 'Desktop' and 'My Documents' can be used and will automatically be assigned to the correct path. 'A:' usually works to assign the floppy drive, but only if there is a disk in the drive (applies to any removable volume). Note: The libname statement does not create a folder. It essentially creates a shortcut or alias to an existing folder. We call the library "permanent" because the data is not deleted when SAS closes, but the libref itself will normally have to be recreated in future SAS sessions. The illustration below shows the SAS Explorer pane (from the left side of the SAS window, note the tab at the bottom) displaying the currently defined libraries. Some of these libraries are standard in SAS. The one called "Rao" has been created with a libname statement. Note that "Work" is explicitly listed.

If you double-click the "Work" file drawer, the data sets in "Work" become visible. The spreadsheet icon in the example below represents a data set named "One." If you double-click on "One" it will open in a "Viewtable," a spreadsheet-like view.


The viewtable has two modes, "Edit" and "Browse." In browse mode, you cannot change the data. To switch modes, go to the "Edit" menu and select the mode you want. A new data set can be created by selecting "Table Editor" under the "Tools" menu. (Note: In order to go back to previous windows in the explorer pane, you press the folder with the up arrow on it. But this will disappear if the "Explorer" is not the active window. Click on the "Explorer" window to bring it back.) (Note: There is also another "Explorer" under the "View" menu. This opens a window much like a Windows Explorer.)

When you create a SAS data set, you give it a name, such as "One" in the example above. Names can contain letters, numbers, and underscores, but cannot start with a number. They can be up to 32 characters long. You can refer to a data set in the work library by its name alone, because that is the default location. But data sets are actually identified by "two-level names," where the library is given first, followed by a dot, then the dataset name. In other words, the form is libref.datasetname. Since work is the default library, datasetname alone is equivalent to work.datasetname. In order to store a dataset permanently, specify a two-level name, with the first level being a defined libref other than work.

In rare cases, you may want to change the default library to something other than work. To do this, use the special libref "user" (libname user 'C:\folder1\myfolder';). This allows you to use a one-level name with a permanent library that you specify. It does not change which libraries are permanent and temporary. To create temporary data sets when the default library has been changed, use a two-level name with "work" in the first position.

There is also a way to export data in some common file formats. See "File-->Export Data." In Enterprise Guide, the datasets appear in the project tree under the code that created them. They will be saved when you save your project.
(But when you run code you may get a choice of whether or not to replace the existing data. This may affect what is saved in the project.)

Getting Data Into SAS

Most of the time, when you begin working on a project in SAS, your data will not be in a SAS data set. In order for SAS to perform analysis tasks, the data will need to be brought into a SAS data set. Because data may come in so many different forms, SAS is very flexible and provides a variety of ways to do this. We will keep our first examples simple, but be assured that SAS can handle very complex data reading tasks! The simplest way to get data into a SAS data set is using "instream data." This means the data is included "in the stream" of the programming statements that will load it. Here is an example. An explanation of the program follows.
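A sketch of such a program, matching the line-by-line explanation that follows; the folder path and the three observations are made up for illustration:

```sas
* Read three observations instream;
* and store them in a permanent library;
libname mysaslib 'C:\mysasfolder';
data mysaslib.myfirst;
   input name $ age;
   cards;
Alice 34
Bob 29
Carol 41
;
proc print;
run;
```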


The first two lines are comments. A comment is simply text inserted into computer code that the programming language will ignore. Comments are often used to explain what is happening in the program (for others or for future reference). Sometimes they are used to temporarily disable some statements in the program. Sometimes they are used to "beautify" or enhance the readability of the program. In SAS, there are two ways to write comments. A statement that begins with an asterisk and ends with a semicolon is a comment. This type of comment will only work for one statement at a time. If a larger section of a program is to be commented, that is, multiple statements in one group, you can use "/*" to begin the comment and "*/" to close it. Even semicolons are ignored by this syntax.

The third line contains a libname definition. The library will be called "mysaslib" and the actual folder location on your computer is given in the quote marks. The fourth line begins with the keyword "data," which tells SAS this is the beginning of a data step. We can see by the two-level name that the dataset will be called "myfirst" and will be stored in the "mysaslib" library.

The fifth line, the input statement, tells SAS what variables to put into the data set. Each variable name may need to be followed by a code that tells SAS what kind of variable it is and how to read it. These codes are called informats. However, there is a default for this, called "standard numeric," which is just an ordinary number in decimal form, with no commas. Since "age" fits this description, we do not have to include an informat for it. On the other hand, "name" is a character variable. The "$" tells SAS to read "name" as a character variable eight characters long. Character variables have no numeric value, and can contain letters, numbers, and most other symbols. They are also known as "strings."
The "cards" statement tells SAS that the list of data to read is coming next ("datalines" and "lines" may be used as synonyms for "cards"). The data are organized in a straightforward way. Each observation is on one line and the variable values are separated by spaces. This is called "list input." A semicolon on a line by itself indicates the end of the data. (The data are not program statements, so there are no semicolons in the data list.)

In addition to comments, we can use indenting to make our programs more readable. For example, the statements under a data or proc statement can be indented to make the steps look like an outline. Data given in cards should be placed along the margin, though.

Next, SAS encounters a proc statement, and will therefore compile and execute the data step before going on. The dataset "myfirst" is now created and populated with three observations. SAS continues by compiling the proc step, which consists of only the "proc print" statement. Without any other commands specified, this will cause the default action of printing the most recently created data set, which, of course, is "myfirst." Print does NOT mean "print to the printer." Proc print produces a formatted "printout" of the data set in the output window. You can save or print (really) this output using File Menu commands. The result looks like this:


In case you do not want to use the default (last created) data set, or just want your program to be more obvious to the reader, you can specify the data set that proc print will use this way:
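For example, using the two-level name of the dataset created earlier:

```sas
proc print data=mysaslib.myfirst;
run;
```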

Using Titles

SAS provides several ways to modify the appearance of the output it produces. Notice that in our example the heading "The SAS System" together with the time, date, and page number appear at the top of the page. "The SAS System" is the default page title. You can supply your own titles by using title statements, as shown here:
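A sketch; the title text is made up:

```sas
title1 'My First SAS Program';
title2 'Instream Data Example';
proc print data=mysaslib.myfirst;
run;
```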

You can have multiple lines of titles. Just add more title statements with higher numbers. Title statements are global; they don't belong to a particular proc and are in effect until changed or deleted. Redefining a title deletes all previously defined titles of that number or higher. To delete a title without replacing it, just include a blank title statement, like "title3;". This will delete the old title3, as well as title4 or any other higher-numbered titles.

Producing HTML (web page) output


Both PC SAS and Enterprise Guide can produce HTML or text output. Enterprise Guide displays HTML by default:

Pretty, isn't it? To get HTML in PC SAS, you can go to "Tools-->Options-->Preferences," click the "Results" tab, then check "Create HTML."

Exercises

Submit one Word document with all answers either typed in or copied in.

1. Indicate which of the following are valid names for variables and librefs (two questions).
a. sales
b. SALES
c. Sales
d. Sales.12
e. Sales.Month.12
f. 12MonthSales
g. Month12Sales
h. More_Sales
i. More&More
j. Moore_and_Moore
k. Month12
l. _Month12_

2. If a database has a table with 500 records and 14 fields, how many observations and variables would a SAS dataset containing the same information have?

3. If a spreadsheet had 5 columns and 4 rows, how many observations and variables would a SAS dataset containing the same information have?

4. Looking at the screen print below, answer the following questions:
a. Is this a temporary or permanent data set?
b. How many observations are there?
c. How many variables are there?


d. How many records are there?
e. How many fields are there?
f. How many rows are there?
g. How many columns are there?

5. Copy the data below into the SAS editor (use copy and paste) and write a data step to read it, followed by a proc step to print it (to the output window). Make sure the printout matches the original data. The variables are Name, Age, and Grade. Age and Grade are to be read as numeric variables. Allow the data to be saved to the work directory. (Submit Program, Log, and Output.)

Marissa 13 7
Andy 7 1
Martha 9 3
John 10 4
Larry 11 6

6. Copy the following data into the SAS editor and write a data step to read it into a data set. Print the results in html format. The variables are Field, Fertilizer, and Variety. Submit the Program, Log, and the html output instead of the normal output.

1 A Magnus
2 B Arbin
3 A Carver
4 B Visser
5 A Turnip
6 B Danun

7. Create a libref for a folder on your hard drive and another for a floppy disk. (If you don't have a floppy disk, you may use a pen drive.) Modify the program in problem 5 so that the data set is saved in each location. Use the explorer window and the viewtable to verify that it is actually there. (Submit your program and log for this problem.)

8. Submit only the program statements (editor) for this problem. Note that to "reference a data set" means to tell a proc, like proc print, which data set to use.
a. Write a libname statement that assigns the name "SaleData" to a library that corresponds to a folder on the C drive with the same name. Then show how you would reference a data set called "January" in that folder.
b. Write a libname statement that makes the folder in part a the default library and show how you would reference the data set "January" in that folder. Show how you would now reference a temporary dataset called "February".


Chapter 3 Keeping Up Appearances

Suppose your output doesn't fit on the page the way you like when you print it or paste it into a word processor. An options statement can be used to adjust the number of lines and columns used in the page formatting. In the example below, "ps" stands for "pagesize" (you can also spell it out if you like). This is the number of lines on a page. Next, "ls" stands for "linesize" (you can spell this out too), which is the width of a line in characters. If you don't want the date and time displayed, you can include "nodate," and if you'd like to reset the starting page number, "pageno=n" will do that, where you replace "n" with the number you want. If the page number is not reset, the page numbers keep incrementing in the output window, even if you clear it. There are many more options available (see the help or manuals).
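A sketch combining the options just described; the particular values are only examples:

```sas
* 55 lines per page, 80 characters per line,;
* no date/time, restart page numbering at 1;
options ps=55 ls=80 nodate pageno=1;
```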

Reading from a File

Now, back to reading data. Our example data set is small, and easy enough to type in. If the data set is large, it may not be convenient to type everything into the program. In this case, the data may be saved in a text file outside of SAS. Let's say we had the same data saved in a file called F:\sample.txt. The following program would then have almost the same effect as the previous one (can you tell what will be different?):
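A sketch of the file-reading version, using the variable names from the earlier instream example:

```sas
data myfirst;
   infile 'F:\sample.txt';
   input name $ age;
run;
proc print;
run;
```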


(Since the library isn't specified, this data will be saved in work.) There are some details to take note of here. It may seem that the infile statement replaces the "cards" section, but that is not what really happens. The infile statement comes before the input statement, whereas the "cards" section comes after the input statement. (In fact, "cards" must always be placed last in the data step.) When you include a "cards" statement, SAS automatically assumes there is a default infile statement that says "infile cards;" before the input statement. In other words, SAS treats the "cards" section just like an external file. This is confirmed by the fact that it does not appear in the program statements copied to the log. You can explicitly include the infile cards statement, especially if you need some of the optional commands available with infile. This will be discussed further in a future lesson.

There is also an import wizard available from the File Menu. It will load Excel and other popular file types too. (One caveat for Excel imports: you must look under the "options" button in the wizard, where there is a check box to indicate whether the first line contains data or variable names.) In Enterprise Guide, this can be accomplished by going to the "Insert" menu and choosing "data."

Labels for Variables

SAS allows variable names to be 32 characters long. You can have upper and lower case letters, numbers, and underscores, but cannot start with a number. Programming statements do not distinguish between upper and lower case, but the cases are remembered for use in output. This flexibility takes care of most variable naming needs. However, there are times when even more flexibility is desired. Perhaps you want true spaces (not underscores) between words, or special characters that are not allowed. Or maybe you'd like to use a short name like "LEye" in your program, but want the output to say "Left Eye Acuity."
For these situations, SAS allows us to assign labels to variables. Labels can be up to 256 characters long and may include almost any text symbols. Labels are assigned in the data step and are stored with the data set. Many SAS procedures, like proc print, can use the labels in producing output. In the example below, note the syntax of the label statement in the data step: The statement begins with the keyword label, followed by a variable name, equal sign, and the label in quote marks. More labels may be assigned in the same statement. They are listed one after the other separated by spaces (no commas). To use the labels in proc print, the option label is added to the proc print statement. The example shows the results both without and with the label option.
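A sketch using the "LEye" example from above; the second variable and the data values are made up:

```sas
data vision;
   input LEye REye;
   label LEye = 'Left Eye Acuity'
         REye = 'Right Eye Acuity';
   cards;
20 30
15 20
;
proc print label;   /* the label option makes proc print show labels */
run;
```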


Character Informats

So far we've discussed reading fairly straightforward data consisting of numbers or short words. We will now explore more complex data types. Consider this example:
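A sketch of the kind of program involved, reading one character variable with only a dollar sign; "Amphitheater" is the value discussed below:

```sas
data places;
   input x $;
   cards;
Amphitheater
;
proc print;
run;
```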


No errors in the log....


Why isn't "Amphitheater" complete? The default length for character variables is eight. SAS has only read eight characters from the data even when more characters are present. We need to tell SAS to make the x variable hold more characters. Try this:
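A sketch of the corrected step, with an informat added so that all twelve characters are read:

```sas
data places;
   input x $12.;
   cards;
Amphitheater
;
proc print;
run;
```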

The "$12." expression in the input statement is called an "informat." Think of it as an "input format" that tells SAS what to expect the data to look like. The dollar sign signifies that it is a character variable, and the 12 is the length of the field to be read, and also the length of the resulting variable. All informats have a period in them. This is part of the syntax that SAS uses to recognize an informat. Let's look at four more examples to demonstrate the behavior of informats. These examples will use two character variables. In the first example, we show what happens with only dollar signs to indicate that they are character variables. Of course, the values are now cut off at eight characters, but otherwise the data are read correctly:
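A sketch with made-up place names: values longer than eight characters are cut off, but each space-delimited value still goes to the right variable.

```sas
data towns;
   input x $ y $;
   cards;
Minneapolis Minnesota
Fargo NorthDakota
;
* x and y are cut to eight characters, e.g. "Minneapo";
proc print;
run;
```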


Now suppose we attempt to fix the length problem by putting in informats. Then, SAS reads the full 12 characters for x, regardless of whether or not there are spaces included:
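With the same made-up place names, the fixed-width behavior can be sketched like this: since $12. (without a colon) reads twelve columns no matter what, a short value such as "Fargo" drags the space and part of the next word along with it.

```sas
data towns;
   input x $12. y $12.;
   cards;
Minneapolis Minnesota
Fargo NorthDakota
;
* For the second line, x gets columns 1-12: "Fargo NorthD";
proc print;
run;
```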


To fix this, we need SAS to treat a space as a delimiter just as it does when a dollar sign is used alone. The "colon modifier" placed in front of the informat will give this result. In fact, the dollar sign alone is an abbreviation for ":$8.".
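A sketch of the colon modifier in use, with made-up place names:

```sas
data towns;
   input x :$12. y :$12.;
   cards;
Minneapolis Minnesota
Fargo NorthDakota
;
* SAS scans to the delimiter first, then applies $12.,;
* so x="Fargo" and y="NorthDakota" on the second line;
proc print;
run;
```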


The character informat can also be used to create variables shorter than eight characters. For large data sets, this can result in considerable space savings. For example, the data set might contain a variable that is a one-character code (such as M or F for male or female). Using an informat of $1. would then be appropriate, and would save seven bytes of storage for every observation.

Running out of Line

An additional problem occurs when using these techniques to read from an external file. A character informat at the end of a line causes SAS to try to read all of the characters for the width of the field specified by the informat. If the line is shorter than that, errors will be reported in the log and the data will not be read correctly. One solution is to add the option "truncover" in the infile statement after the file path. The meaning of "truncover" is something like "keep reading over the whole field even if the line is truncated." (This problem does not occur with instream data.) A colon modifier with the informat can often solve this problem, too.

There is a related option, "missover," that can also be used when there are not enough variable values at the end of the line and you want the remaining ones set to missing in the SAS data set. Without "missover," SAS would go to the next line and try to continue reading the input variables for which it did not find values. The two options behave almost identically. The difference is that missover will set short values to missing when an informat (no colon) is used, while truncover will not. Truncover also does the same thing when more than one value is missing at the end of a line.

Exercises

From now on, include titles in all your output that give your name, the lesson number, and problem number. Unless otherwise directed, do not change data sets, and turn in your editor, log, and output.

1. Download the file at this link. It contains four variables, FirstName, LastName, Age, and Score.
Write a SAS program to read this data from the external file and print it to the output window. Save the data set in a library that you specify (not work). Experiment with title and options statements to change the appearance of your output.


2. Copy the data below into the SAS editor and write a data step to read it, followed by a proc step to print it (to the output window). Make sure the printout matches the original data. The variables are first and last name, sex, and age. Include labels in your data set and output. Also include a two-line title and suppress the printing of the time and date on each page.

Andy Stewart M 47
Martha Gustafson F 55
Marissa Maneschevitz F 32
John Fitzgerald M 28
Jacqueline Martin F 33

3. Download the file at this link. Similar to a previous exercise, it contains four variables, Age, Score, LastName, and FirstName. Examine the data file carefully, then write a SAS program to read this data from the external file and print it to the output window. Include an appropriate title.

Lesson 4: Numeric Formats and Informats

SAS stores all data with only two variable types, character and numeric. We have seen that character variables can have different lengths. In contrast, numeric variables (including integers and dates) are almost always stored in 8-byte floating-point form. These numbers have a precision of about 16 significant digits. Some other languages refer to this as a "real" data type (which is not mathematically correct, of course). Therefore, we do not have to be concerned about how numbers will be stored. We do, however, have to think about how to read and write them.

Let's begin with a simple example. Here we see a data set with three numbers. There are no complications in reading these numbers; the variable x in the input statement is a numeric variable by default (no informat). Observe that in the output, the first two observations were written in the same form they appeared in the data, but the third was not. Because it is such a large number, SAS defaulted to printing this number in scientific notation. The "E18" is interpreted as "times 10 to the 18th power."
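The example just described might look something like this (the data set name and values are my own; the third value is large enough to force scientific notation):

```sas
data nums;
   input x;        /* plain numeric variable, no informat needed */
   cards;
12
345.67
1234567890123456789
;
run;

proc print data=nums;   /* third observation prints in E-notation */
run;
```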


The appearance of observation 3 in the output can be changed by adding a format statement to proc print, as shown below. The actual format code is "19.". The period or decimal point is part of the code and is standard syntax for all formats and informats. Character formats and informats start with "$". In most formats and informats there is a number right before the period that indicates the field width. For numeric variables only, a number after the period indicates the width of the decimal portion of the number. Notice that 19 digits are now displayed, however, the last two are not the same as the original data. There has been some rounding error.


Commas can be included in the output. Note that the field width had to be increased to accommodate the commas.

Or perhaps you want dollars:
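Assuming a data set like the one above (called nums here, which is my own name for it), the three display styles can be requested with format statements in proc print:

```sas
proc print data=nums;
   format x 19.;        /* all digits; the last few may show rounding error */
run;

proc print data=nums;
   format x comma25.;   /* commas; field widened to make room for them */
run;

proc print data=nums;
   format x dollar26.;  /* leading dollar sign plus commas */
run;
```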


Let's turn to reading numbers in various formats. First, we should note that we cannot read numbers that are not in "standard format" without an informat, a code that tells SAS how to interpret the number it is reading. See what happens in this example:


The commaw. informat (dollarw. is identical) reads numbers with commas, as well as dollar signs and some other embedded symbols. The w stands for the field width, but be sure to count the commas (and dollar signs) when determining the number of columns needed. Also, make use of the colon modifier, just as with character informats, if the field widths vary. (If you are working with other currencies or European-style numbers, check the documentation for alternative informats.)
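A sketch of the comma informat with a colon modifier for varying field widths (the data values are illustrative):

```sas
data sales;
   input amount :comma12.;   /* strips $ and commas while reading */
   cards;
$1,499.99
$20,000
49.28
;
run;

proc print data=sales;
   format amount dollar12.2;   /* write the values back out with $ and commas */
run;
```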


Handling decimal places correctly can sometimes be tricky. Here we see a basic example with no informat. In addition, no format has been specified in proc print, so SAS chooses a format that it "thinks" is "best." Note that the third observation is rounded, but this is only because of the printing format that was used, and does not mean the number is rounded off in the data set. If you add a format statement with a 9.4 format to the proc print step, all the original digits will be displayed.


Here is an example that is incorrect. Numeric formats and informats can specify a number of decimal places by putting a number after the period. The informat below, 4.2, is saying that SAS should read a field of width 4 with two decimal places.


One problem is that the decimal point is part of the field and needs to be counted, which caused the second and third observations to be cut short. We should have 5 instead of 4 for the field width. Secondly, when a width for the decimal portion of the number is specified in an informat, it means that those decimal places are to be assumed whenever no decimal point appears in the number. It does not override an existing decimal point. The first observation is thus interpreted as 11.22; that is, the last two digits are assumed to be the decimal portion. This may or may not be correct, so great care must be taken when using this method. Normally, it would only be used when the data is known to be recorded with an implied decimal. Usually such data are not mixed (with and without decimals). The most common situation in which mixing might occur is when a variable is a percent or proportion that has been recorded inconsistently, using both notations. In that case, you might want 22 and .22 to mean the same thing, and a 3.2 informat would accomplish the correct result. Here the field width is corrected, the decimal is left off of the informat, and a format is included in proc print to display all the decimals.
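The corrected approach might look like this (data values are illustrative):

```sas
data dec;
   input x 5.;         /* width 5 covers the decimal point; no decimal width given */
   cards;
11.22
3.456
9.87
;
run;

proc print data=dec;
   format x 9.4;       /* display all the decimals */
run;
```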


Formats can be permanently stored with the data set. If this is done, the formats are then available to any proc that can use them, without writing another format statement. The program shown below uses the "5." informat, but also stores the "9.4" format in the data set. There is no need for another format statement in proc print. However, if you want to use a format other than the one stored with the data, you can still specify it in proc print (perhaps "format x 7.2;"). It should also be mentioned here that SAS syntax allows specifying a format for several variables at once. For example, "format x y z 7.2;" would apply the 7.2 format to all three variables, x, y, and z. Or, you can use "format x 7.2 y z 5.;" which will apply the 7.2 format to x and the 5. format to y and z.
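A sketch of storing a format with the data set (names and values are my own):

```sas
data test;
   input x 5.;
   format x 9.4;          /* this format travels with the data set */
   cards;
11.22
3.456
;
run;

proc print data=test;     /* no format statement needed; 9.4 is used automatically */
run;

proc print data=test;
   format x 7.2;          /* overrides the stored format for this step only */
run;
```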


While this covers the most frequently needed informats for numbers, there are many other special cases. Check the SAS documentation (under Base SAS, Language Reference: Dictionary) for other formats and informats.

Date Formats and Informats

Since SAS has only two data types, you may wonder what we do with dates. While it is possible to store dates in character form, doing so would make calculations with dates very difficult. Dates are stored as numbers: specifically, the number of days since (or before) January 1, 1960, which is "day zero."


You cannot read a date without an informat, except perhaps in the rare event that it is already coded as the number of days since 1/1/1960. Dates can be written in many ways, and SAS can read almost any of them, with the right instructions. In the example above, the dates are given in the most common American format, month/day/year. The informat uses the codes mm for month, dd for day, and yy for year. In some other countries, dates are written in the form day/month/year, so in SAS we simply switch the order accordingly, to "ddmmyyw.". Once again, w stands for the width of the field that is being read, including the delimiters. SAS does not require a specific delimiter when interpreting the date, so it does not matter if it says 1/1/1960 or 1-1-1960 or 1.1.1960. The width is usually 10 to accommodate two digits for month, two for day, and four for year, plus two delimiters. However, it could be 8 if only two year digits are used, and it could be 6 or 8 if no delimiters are used (010160 or 01011960). SAS will interpret these variations correctly as long as they are not ambiguous (1160 for January 1, 1960 would not work), as in this example:
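A sketch of those delimiter variations, all read with the same informat (values illustrative; each line below is day zero):

```sas
data dates;
   input d mmddyy10.;   /* tolerant of / - . delimiters and short lines */
   cards;
1/1/1960
01-01-1960
1.1.1960
;
run;

proc print data=dates;   /* with no format, each d prints as 0 */
run;
```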


Of course, we don't want our printout to give dates like "0" or "16604," since they are not very meaningful to human readers! Therefore, we should include a nice format to make it readable.

There are many formats to choose from. See the SAS documentation for more. Here are some more examples. In the program below, the dates appear in the data in three different formats. The first one is like those discussed above, the second is what SAS considers a "standard" date, and the third is a Julian date (used in many businesses--it's the year followed by the number of the day in the year).
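A sketch reading the three forms on one line, then printing with different formats (the data line and variable names are my own):

```sas
data ddat;
   input d1 mmddyy10. d2 :date9. d3 :julian7.;
   cards;
07/04/2005 04JUL2005 2005185
;
run;

proc print data=ddat;
   format d1 d2 d3 mmddyy10.;   /* all three print as 07/04/2005 */
run;

proc print data=ddat;
   format d1 date9. d2 worddate18. d3 yymmdd10.;
run;
```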


Three proc print steps follow, to demonstrate the dates with no format, and with six different formats. As with character and numeric formats, date formats can also be stored with the data set.

Note that SAS also has informats and formats for time values and date-time combination values. If you have need of these, or want to explore the many other possibilities, check out the SAS documentation.

Interpreting Two-digit Years

In the years leading up to 2000, the government and businesses became very concerned about the "Y2K" problem. Many programs stored years using two digits, because most business software did not need to deal with dates outside of the twentieth century. In 2000, this all changed. SAS never had a "Y2K" problem for storing dates, since a date is just the number of days before or after January 1, 1960. However, there can still be a problem when reading dates with two-digit years from a file or instream data. SAS interprets two-digit years as belonging to a specific 100-year interval. The first year of this interval is called the year cutoff, and the SAS system option "YearCutOff=n" is used to set it in an options statement. The default value of YearCutOff is set by the administrator when SAS is installed, typically 1920.

Exercises

1. Copy the raw data below into a SAS program. a) Write a data step to read these data into three variables: Invoice, Amount, and Quantity. Using proc print, display the data so that all the Amount values are formatted like the Amount value in the first observation of the raw data. b) Then, revise the data step so that a format for Amount is stored with the data set, and show the results in proc print, without using any format statement in proc print. Use appropriate titles that identify which part of the exercise the output comes from.

12244 $1,499.99 144
32189 $20,000 1
92314 49.28 3

2. Copy the data below into a SAS program. Write a data step to read them into a SAS data set. The variables are capitol, state, capitol population, and state population. Store labels with the data set. Print the data with proc print, displaying labels, using appropriate formats and a title.

Bismarck ND 56,344 633,837
Pierre SD 13,939 764,309
Helena MT 26,718 917,621
Madison WI 218,432 5,472,299

3. Download this file. The data contain the names of some of the past presidents of the United States together with their birth and death dates. The data are aligned in columns, as shown in the example below (the longest name). Save the file to your computer and use an infile statement to read the data from the file into a SAS data set. Read the entire name into one variable, using a character variable length of 23. Store formats for the dates with the data set. Write three proc print steps that result in three different sets of date formats. Use appropriate titles.
William Henry Harrison 02/09/1773 04/04/1841

Lesson 5: Input Styles

It's time to dig a little deeper into the technical details of data reading. Before we do that, it is important to make clear what we mean by "reading" and "writing" in a data step. "Reading" is the process whereby SAS interprets raw data from a file or an instream cards section of the program, or accesses an existing data set. Thus, "reading" means SAS is bringing data values into computer memory to process in the data step. "Writing" is the process of saving the finished data to a SAS data set file on a hard drive (or other computer storage device). It is NOT the same as "printing," which is a term by which we usually mean "use proc print to display text in the Output Window."

Imagine that you are a computer and are instructed to read a file. What do you actually do? Well, as a human, how do you read a book? First you have to open it. You find the starting place, and you begin reading, which at the most fundamental level means you read a character, interpret it, move on to the next one, and repeat. Upon reaching the end of a word, you interpret the word and move on to the next one. Upon reaching the end of a line, you move down a line, go back to the left, and read the first character, and so on, until you reach the end of the book, at which point you close it and stop. Well, something like that, anyway. For us, so much happens automatically that we don't have to think about it. But a computer doesn't think at all. It merely follows a sequence of instructions, and this sequence, in some ways, is similar to the processes we humans follow when reading a book.

When SAS opens a file to read it, it creates two pointers, which are nothing more than numbers for keeping track of position in the file. One pointer is a position in a line of text (column); the other is the line number in the file. (We are assuming a standard text file here. SAS can also read binary files, but that is another subject.) The pointers are initially at column 1, line 1. As SAS reads data, the pointer moves along the line (not literally, of course; the number just changes). SAS needs to know when to begin reading a value and when to stop. What is obvious to us is not necessarily obvious to a computer. Consider the following line:

12 33 42 51 24

Do you see five numbers? How about this:

1233425124

If I told you this was a string of five two-digit numbers, you would know immediately how to interpret it. We are now going to consider what must be done to tell SAS how to interpret lines of data in ways somewhat analogous to how a person would interpret them.

SAS has four input styles. It is easiest to learn about them one at a time, but they can actually be used together (mixed input) on the same file in almost any order. The input styles are: list, column, formatted, and named input.

List input is perhaps the most straightforward. The data values are simply given in a "list," one after the other, with a delimiter between them. The delimiter is often just a space (that's the default), but may be a comma, tab, or other character. List input, in the simplest case, can be recognized if all you see in the input statement is variable names and dollar signs following the character variables. Typically, the data will not be aligned in columns. To make our example easier to follow, the data is shown in a "cards" section, but we will discuss it as though it were really an external file. Some differences involved in actually using an external file will be dealt with later. Notice that the data values are separated by a single space and are not aligned in columns. The input statement contains only variable names and one dollar sign.
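A sketch of simple list input (the names and values are illustrative; "name" holds at most eight characters by default):

```sas
data people;
   input name $ age height;   /* one dollar sign marks the character variable */
   cards;
John 21 70
Jo 18 62
Mark 32 68
;
run;
```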

When SAS processes the input statement, it creates, in memory, an "input vector," which is temporary storage for all the variables in one observation. At this point, the attributes of the variables (length, character vs. numeric) are fixed. As the data step goes through its commands, it fills up the fields of the input vector, and when it gets to the end, it outputs one observation to the data set. If there is more to do, it goes back and starts over, first clearing the input vector, then filling it up again. One pass like this is known as a data step iteration. The default length for character variables is eight characters, so in our example, that is the length of the "name" variable. Age and height are numeric variables, also taking up eight bytes of memory, but that has no bearing on the number of characters to be read.

As reading begins, the pointers are set to line 1, column 1. SAS is going to read up to eight characters into the variable "name." However, in the list input style, it first checks to see if the current character is a space (or other blank character). If so, it advances to the next, and keeps going until it comes to a non-blank character. Then it starts reading characters into the input vector. When it comes to another blank, it stops. If the data value has more than eight characters, SAS will read the first eight and save them, but will then continue advancing the pointer until a space is encountered, then stop. (Longer character variables can be specified using the "colon modifier," ":$w." as has been shown in a previous section. The colon causes SAS to read in list style even when there is an informat present. There is also an "ampersand modifier" which allows use of a character informat and reads embedded spaces. Multiple spaces then signal the end of the field.)


The second variable, "age," is a number. With list input, SAS can only read numbers in "standard" format, which essentially means there can be only numerals, decimal points, and minus signs present. SAS again advances the pointer to the first non-blank character and begins reading the number until it encounters a space. The number of columns read is immaterial this time, as SAS will simply store the number with as much precision as it can, regardless of how many numerals were given. When it reaches a space, SAS saves the number to the input vector and stops. The third variable, "height," is read the same way, except that we are now at the end of a line. Upon reaching the end of the line, SAS saves the value, moves the column pointer back to 1, and advances the line pointer to the next line. As it has come to the end of the input statement as well, it considers the observation complete and writes the contents of the input vector to the output data set. It then clears the input vector and starts over, repeating until the end of the file is reached.

Missing values for character variables require special care with list input, since spaces are interpreted as delimiters and cannot then represent missing values. Missing values may have to be coded using a word or symbol (e.g., "missing"). Options that change the delimiter (discussed later) can also solve this problem. Numeric missing values are represented by a decimal point (period). If values are missing at the end of a line (without a placeholder of some kind), the infile option missover can be used to set the remaining variables to missing and go on to the next line. If this is not done, SAS will try to read the variables from the next line.

Column input requires the data to be arranged in aligned columns. The key element of column input is that you give a range of columns after each variable name, or after the dollar sign of a character variable, in the form of start-end.
Informats cannot be used in the input statement with column input. (They can be specified in a separate informat statement, introduced in the next lesson.) There is no need to worry about the column pointer in this case, as SAS changes it according to the ranges specified. The ranges need not be in order. They can even overlap or be re-read.

The next example illustrates some of the advantages of column input. Notice that the character variable will read whatever is in the associated columns and is automatically set to the length specified by the range of columns. Even spaces and special characters are correctly read and stored. Missing numeric values need not be coded with a period, as spaces will be correctly interpreted as missing.
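A sketch of column input with aligned fields (the data are my own; note the blank age in the second line is read as missing):

```sas
data people;
   input name $ 1-10 age 12-13 height 15-16;
   cards;
John Smith 21 70
Jo            62
;
run;
```

The character variable automatically gets length 10 from its column range, and the embedded space in "John Smith" is read without difficulty.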


One of the best features of column input is that no delimiters are required. Many mainframe programs produce this kind of file, which has no extra spaces or delimiters. This can greatly reduce the size of a file. As long as the column positions are known, there is no difficulty with reading this data in SAS.


Formatted input relies on informats and pointer controls to determine where and what to read into each variable. In the example below, the @n syntax gives the starting column for each variable. It repositions the column pointer to that column. The "name" variable has a default length of eight. The numeric variables are read until a space is encountered. Note that the @n comes in front of the variable name, while the informat comes in its usual position after the variable name.

Variables need not be in left-to-right order. Informats may be used as needed.
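A sketch of formatted input reading fields out of left-to-right order (the data line is my own):

```sas
data people;
   input @12 age 2.        /* jump to column 12 first */
         @1  name $10.     /* then back to column 1 */
         @15 height 2.;
   cards;
John Smith 21 70
;
run;
```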


Even though I have shown the @n for each variable here, it is not really necessary in all cases. The initial "@1" is not required, as the pointer starts out in column 1. With formatted input, the column pointer will end up at the next character past the field that was just read. If this is the start of the next variable, you are ready to go and do not need to reposition the pointer. The syntax "+n" may be used to advance the pointer a specified number of columns. Just as with column input, delimiters are not necessary when using formatted input.

Because formatted input and column input cause SAS to read a specified number of columns, a problem occurs if the end of the line varies, as in the example of the dates given below. When the data are instream, SAS does not have a problem. But if the file is external, a line that is too short will cause an error. The truncover option in the infile statement will cause SAS to treat short lines as if there were spaces in the file to fill out the remainder of the field. If there is no data at all for the field, the value will be set to missing. The syntax is demonstrated here with instream data, but is only necessary for external files.

It is often helpful with formatted input, if there are many variables, to line them up vertically in the input statement, with the @'s all in front and the informats after. I have not shown an example, but with a large number of variables, this makes it much easier to prevent and correct mistakes. The log below is the result of trying to read this same data from an external file without the truncover option. Observe that SAS is looking for the "bday" variable at the beginning of line 2, and since what is actually there is a name variable, it declares the data to be invalid. The reason for this is that the "bday" given in line 1 is only 7 characters in length. Since there are not enough characters, SAS goes to the next line to look for the variable. After failing to read "bday," the end of the input statement is reached, so SAS tries to move to the next line to start a new input cycle, but finding the end of the file, it quits and writes only one observation to the data set.


Including the truncover option in the infile statement fixes this problem.
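A sketch of the fix (the file path and variable names here are hypothetical):

```sas
data people;
   infile 'c:\mydata\birthdays.txt' truncover;   /* hypothetical path */
   input name $10. bday mmddyy10.;   /* short last field no longer causes errors */
   format bday mmddyy10.;
run;
```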

Named input is used when each variable is identified in the data with a variable name and the equals sign. This input style is not very common. While named input can be mixed with the other styles, it must be last on the line. You cannot change back to another style.
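A sketch of named input, with my own data; each value in the raw data is tagged with its variable name and an equals sign:

```sas
data people;
   input name=$ age= height=;
   cards;
name=John age=21 height=70
name=Jo age=18 height=62
;
run;
```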


However, the idea demonstrated here is very useful. In some data sets, it is difficult to identify where the variables are in the line. If they are "signaled" by some identifiable text, we can use that text to determine the position of the pointer and use formatted input. In the example below, we have given a character string in quotes with the @ sign, where we had the column number in previous examples. SAS scans from the current column until it matches the text in quotes after the @ sign. Note that a trailing space was included in the quotes in this example. An alternative would be to leave out the space and increase the field width to 3. In any case, you want to make sure the pointer ends up in the right place, which will be right after the text in quotes, and the field width matches the location of the value to be read. In cases like this, a lot of effort is required to determine what consistent properties of the data source can be reliably used. Careful study of the file and testing of various scenarios is required, or the result may be misread data, quite possibly with no warning of the errors.
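A sketch of this technique with made-up data; note the trailing space inside each quoted string, which puts the pointer directly on the value:

```sas
data temps;
   input @'high: ' hi 2.
         @'low: '  lo 2.;
   cards;
Monday high: 72 low: 48
Tuesday high: 68 low: 45
;
run;
```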


If the variable values are not in a consistent order in the data, we can deal with that too. By using "@1" to reset the pointer back to the beginning of the line, we can make sure nothing is missed:
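A sketch of the "@1" reset with made-up data in which the fields appear in different orders on different lines:

```sas
data temps;
   input @'low: '  lo 2.
         @1                 /* back to the start of the line */
         @'high: ' hi 2.;
   cards;
Monday low: 48 high: 72
Tuesday high: 68 low: 45
;
run;
```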

To finish this off, here is an example of mixed input styles. All four styles, list, column, formatted, and named, are demonstrated in one input statement.
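One possible rendering of such a statement (my own data; named input must come last):

```sas
data mixed;
   input name $              /* list input */
         sex $ 9             /* column input, column 9 only */
         @11 bday mmddyy8.   /* formatted input */
         wt=;                /* named input, last on the line */
   format bday mmddyy8.;
   cards;
John    M 02/14/97 wt=170
;
run;
```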


Exercises: For each problem below, write a SAS program that creates a data set using instream data. Copy and paste the given data segments into the SAS editor. This is important in order to keep the data in the same format as presented. DO NOT adjust the data, such as by adding spaces or aligning columns. Include a proc print step in the program, and turn in the editor, output, and log files. Use appropriate titles and formats, if necessary, in your output.

1. The variables are Name, Age, Height, Sex, and JoinDate. Assume that no values will be longer than those included here, and use the shortest possible length for the character variables. Use list input (with colon modifiers where needed).

John 21 70 M 2/14/97
Jo 18 62 F 3/27/99
Mark 32 68 M 6/22/98
Linda 25 65 F 12/14/97
Carey 27 59 F 8/20/98

2. The variables and the columns they are found in are LastName (1-15), FirstName (16-30), Address (31-55), City (56-65), State (66-67), and Zip (68-72). Assume that character variables may fill the entire field width. Use column input.

Johnson        Michael        121 1st St S             Brookings SD57006
Big Hammer     Beatrice       45031 271st Ave S        Moorhead  MN56560
Helms-Marquart Charlotte      302 N Mason-Dixon Ave    Somewhere DC01221
Cutler         George         Rural Route 2            Zap       ND58563

3. In this problem, the same variables are used as in the previous problem, but additional variables are included, namely the Social Security Number, Ownership Status, and Move-in Date. You can see where these new variables are added. The field widths of the variables that were in the previous problem are unchanged. Use formatted input to read this data, but change the order of reading to First Name, Last Name, Social Security Number, Ownership Status, Move-in Date, Address, City, State, and Zip. Use at least one example of the "+n" syntax to move the pointer.
503118596Johnson        Michael        Own 121 1st St S             01051974Brookings SD57006
471559684Big Hammer     Beatrice       Rent45031 271st Ave S        10221987Moorhead  MN56560
362995874Helms-Marquart Charlotte      Own 302 N Mason-Dixon Ave    07091991Somewhere DC01221


474843859Cutler         George         RentRural Route 2            12161996Zap       ND58563

Lesson 6: Creating Variables

Creating Variables with Assignment Statements

You can create variables that are not in the data that you are reading. In the following program, avgrowth is calculated from the other variables in the data. Second, dummy is assigned a constant value. Since the value assigned to it is a character string, indicated by the quotes around it, dummy will be a character variable of length 3, because that is the length of the first value assigned to it. Unless a length statement is used to set the length of a character variable before it is used, the length will be determined by the first value assigned to it. Numeric constants can also be assigned, simply by putting a number on the right side of the equal sign (no quotes).
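A sketch along these lines (the input variables and the formula for avgrowth are my own assumptions):

```sas
data growth;
   input tree $ h1990 h2000;
   avgrowth = (h2000 - h1990) / 10;   /* computed from the other variables */
   dummy = 'yes';                     /* character constant; length becomes 3 */
   cards;
oak 210 265
elm 180 221
;
run;
```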

Obviously, variables are created by the input statement, but they are also created if they are specified in a length, attrib, format, or informat statement (see below). They can also be created by array definitions (a later topic), or by assignment statements, such as those in the example above. An assignment statement is made up of a variable name, an equal sign, and an expression representing the value to be assigned to the variable. The variable can appear in its own assigned expression, such as x=x+1, or x=log(x). A very special form of assignment statement, called a sum statement, or an accumulator, is an exception to this syntax. In the example below, p and q are accumulators. Their values are incremented, starting from zero, by the amount specified, for each succeeding observation. The accumulator p+1, below, is essentially the same as p=p+1, except that it is initialized to zero, which does not automatically happen if you use p=p+1. (See also the retain statement.)
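A sketch of sum statements (what q accumulates is my own choice):

```sas
data counts;
   input x;
   p + 1;     /* sum statement: p starts at 0, is retained, and counts observations */
   q + x;     /* running total of x */
   cards;
5
3
8
;
run;
```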


The arithmetic operations and mathematical functions used in assignment statements for numeric values are quite intuitive. The syntax is similar to that used for formulas on a graphing calculator or spreadsheet. The arithmetic operations are "+" (add), "-" (subtract or negative), "*" (multiply), and "/" (divide). Exponents are given with a double asterisk, such as "3**2" (three to the second power). Parentheses are used in the usual manner for controlling order of operations. Many functions are available, and their names can often be guessed because of their similarity to standard mathematical notation. All functions have at least one argument enclosed in parentheses. Some examples are sqrt(x) for square root of x (where x can be a number, variable name, or other expression that evaluates to a non-negative number), log(x) for natural log, and exp(x) for the exponential function ("e to the x"). There are also some constants, such as pi, given by the function constant('pi'). For more detailed information about functions, see the SAS Documentation under "Base SAS/SAS Language Reference: Dictionary/Dictionary of Language Elements/Functions and CALL Routines." (Note: Some of the documented functions may not work in The Learning Edition.) Here are a few more examples:
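A few of these operations and functions in one sketch (variable names and values are illustrative):

```sas
data calc;
   input x;
   root = sqrt(x);
   cube = x**3;                      /* double asterisk for exponents */
   logx = log(x);                    /* natural log */
   expx = exp(x);
   area = constant('pi') * x**2;     /* treating x as a radius */
   cards;
4
9
;
run;
```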

Since dates are numbers, you can do simple things like subtract two dates to find the number of days between them, without any problem. However, for more complicated tasks, SAS has quite a few date-related functions. For example, day(x) returns the day of the month, month(x) returns the month number for a date, and qtr(x) returns the quarter number. If you have to do any serious computations with dates, check the SAS documentation for available tools. Remember, SAS also has date-time values, and functions to go along with them, as well.
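A sketch of date subtraction and these functions (the data line is my own):

```sas
data datecalc;
   input start :mmddyy10. finish :mmddyy10.;
   days  = finish - start;   /* simple subtraction gives the number of days */
   mon   = month(start);
   dy    = day(start);
   quart = qtr(start);
   cards;
01/15/2004 03/02/2004
;
run;
```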

For character variables, there is an operation called concatenation, indicated by "||" (two vertical bars), that puts two character strings together. There are many, many functions for character variables. We will just look at a few: substr(source, position, length) which extracts a substring, trim(source) which eliminates trailing blanks, length(source) which calculates the length of the value excluding trailing blanks, and upcase(source) which changes all the letters to upper case. In the program below, a length statement has been used for the city variable, to allow up to 15 letters. Note that this method would not work for city names that have more than one word, like "New York City." The st (state) variable has been given an informat for two characters, but a colon modifier is used so that the pointer will move on to the beginning of the zip code. Zip codes should always be character variables, otherwise those that start with zero will be shortened.
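A sketch of the program described below (the variable names addr1 through addr4 follow the discussion; the last two variables are my own names):

```sas
data addresses;
   length city $ 15;               /* allow up to 15 letters */
   input city $ st :$2. zip :$5.;  /* zip read as character to keep leading zeros */
   addr1 = city || st || zip;                         /* trailing blanks kept */
   addr2 = trim(city) || st || zip;                   /* blanks removed */
   addr3 = trim(city) || ', ' || st || '  ' || zip;   /* punctuation added */
   addr4 = upcase(addr3);
   abbr  = substr(city, 1, 2) || st;                  /* four-letter abbreviation */
   mid   = substr(city, ceil(length(city)/2), 1);     /* "middle" character */
   cards;
Brookings SD 57006
;
run;
```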


The first assigned variable, addr1, is created by simply concatenating all three variables. Note the (possibly undesirable) result, with the "extra" spaces between city and state, and the lack of spaces between state and zip code. The spaces are there because the variable length is, in fact, 15, and the unused positions are filled with spaces. Concatenation uses the whole variable, including spaces. In addr2 we have removed the trailing spaces from city by using a trim function. Now there are no spaces between any of the combined variables. In addr3, we have included punctuation and spaces between the variables. Notice that the concatenation operation works with constant expressions enclosed in quotes, as well as variables. Spaces are preserved just as written between the quotes, including the one space after the comma and the two spaces in front of the zip code. The upcase function is demonstrated in addr4, which converts addr3 to all uppercase characters. Following that, the substring function is used to create a four-letter abbreviation, by extracting the first two letters of city and combining them with the state code. Note the order of the three arguments, first the source variable, then the starting position, then the number of characters to extract. The last assignment statement shows how we can combine various functions to perform a specialized task. The idea here was to find the middle character of the city variable, defined to be the actual middle character for odd lengths and the letter immediately prior to the middle for even lengths. The substring function is used to extract the character, but the starting position must be calculated. The length function divided by two would be almost right, as it works fine for even lengths, but for odd lengths gives a half, like 4.5 for "Brookings." 
Since the middle position is the next higher whole number, we can use the ceiling function, one of several rounding functions available; this is the one that always rounds a decimal value up to the next integer.

Length, Informat, and Attrib Statements


An alternative to specifying informats in the input statement is to use an informat statement. The informat statement has the same syntax as the format statement. It doesn't do anything that can't be done in the input statement, but it might be convenient to keep things organized, as in this example:
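The example referred to is missing from this copy; one possible sketch (the variable names and informats are made up for illustration) is:

```sas
data payroll;
  informat hiredate mmddyy10. name $12. salary comma9.;   /* informats kept in one place */
  input name hiredate salary;
cards;
Smith 01/15/1998 52,000
Jones 06/30/2001 47,500
;
```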

Numeric variables have a default length of 8 bytes in SAS. As we have seen, character variables also default to a length of 8 when read with the simple $ informat; when an informat of the form $n. is used, the length is n. In a later section, we will see that if character variables are created using data step programming statements, they get their lengths from the first value assigned to them. The length statement can be used to override the default lengths for both character and numeric variables. It is not often that we want to change the length of a numeric variable, but space can sometimes be saved when the values are integers. The allowed lengths are from 3 to 8 for PC SAS. A length of 3 will accommodate accurate integer values from -8,192 to 8,192, and a length of 4 works to slightly over 2 million. Shortened numeric lengths are not recommended when fractions (decimals) are involved.
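The example discussed in the next paragraph does not appear in this copy. A sketch (data set and variable names are assumptions) that sets a numeric length without the dot and a character length with it:

```sas
data inventory;
  length count 4 name $ 12.;   /* dot optional: omitted for numeric, included for character */
  input name $ count;
cards;
Widget 125
Gadget 37
;
```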

In the above example, you can see that the length statement has syntax similar to the format or informat statements. However, the "dot" is not required. Here it has been left out for the numeric length and included for the character length, just for an example. The dollar sign, however, is required for character variables. The length statement must occur before the first use of the variable in the program, or it will not have any effect. Another way to use the length statement is shown below. This example sets the default numeric length to 3. Unless you specify other lengths, all numeric variables in this data set will have length 3. (This only works for numeric variables.) The character variables will have length 8.
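The second example is also missing here; a sketch of setting the default numeric length (data and names assumed):

```sas
data small;
  length default=3;   /* all numeric variables in this data set get length 3 */
  input x y name $;   /* name, a character variable, still has length 8      */
cards;
1 2 Al
3 4 Bo
;
```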


Another way to do this is with the attrib statement, which is more complicated and allows you to set the lengths, formats, informats, and labels all in one command:
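The attrib example is not reproduced in this copy; a sketch (names, formats, and labels are made up for illustration) might be:

```sas
data staff;
  attrib name   length=$12 informat=$12.    label='Employee Name'
         salary length=8   format=dollar10. label='Annual Salary';
  input name salary;
cards;
Smith 52000
Jones 47500
;
```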

Exercises: For each of the following exercises, copy and paste the data given in the problem into the SAS editor. Write a data step to read the data and create the new variables described, then print the results using proc print, using appropriate titles.

1. These numbers represent dimensions of cardboard boxes: length, width, and height, in inches.
a. Calculate the volume of each box in cubic feet.
b. Calculate the amount of cardboard needed to make each box in square feet, assuming that the top and bottom flaps meet in the middle (this results in a double layer of cardboard for both top and bottom).
c. Suppose the cost of cardboard is $.05 per square foot, and there is a fixed cost of $.25 for manufacturing each box regardless of size. Calculate the cost of manufacturing each box.
d. Calculate the cost per cubic foot of volume.

32 18 12
16 15 24
48 12 32
15 30 45
20 30 36

2. This problem will provide a little practice in writing complicated formulas in SAS, paying attention to order of operations. Use the data below, with variables a, b, and c, and apply the following formulas to create two new variables called root and trunk. The first observation's results are -1 and 2.094, respectively.


165 4 -20 2 12 22 -11 3 -15 -9

3. Read the following data into three variables, making sure to get complete names. Use the character functions and operators to extract initials from the following names so that they look like "J.F.K." Then create an abbreviation for each name that looks like "J-n F-d K-y".

John Fitzgerald Kennedy
Martha Helena Goetz
Frederich Anthony Sailer
Albert Blake Codwell

Lesson 7: Data Set Options, Set and Merge Statements

Data Set Options
"Options" are used in SAS in various places, and with various kinds of syntax, so the concept of options can be a bit confusing. An earlier lesson introduced the idea of a global system options statement, where we can set things like linesize, page numbering, and so forth. Data set options are something different. They are commands added to a basic data step statement to refine or modify the work of the data step, or they can be used to modify how a data set is used in a proc step. (The term "options" is also used for any optional command in any SAS statement; this is why we have to distinguish between system options, data set options, and other options.) We begin with the drop and keep options. The examples below show how these two options can be used to accomplish the same thing; the choice between them is purely a matter of convenience. The drop option lists the variables to be omitted from the data set, while the keep option lists those that are to remain in the data set. There could be many reasons why you might not want all of the original variables in your data set. In the next example, we use the original variables for calculations but do not include them in the data set. In other cases, we may create variables that are only used in the program and do not need to be saved.
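The drop and keep examples are missing from this copy. A sketch of the equivalent pair (data and variable names are assumptions):

```sas
data one(drop=x y);        /* x and y are used in calculations but not saved */
  input x y;
  sum  = x + y;
  diff = x - y;
cards;
5 2
8 3
;
data two(keep=sum diff);   /* equivalent: list the variables to remain */
  input x y;
  sum  = x + y;
  diff = x - y;
cards;
5 2
8 3
;
```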


Take note of the syntax used here, as it is unique for data set options. All the options to be applied to a particular data set are enclosed in parentheses, following the name of the data set to which they apply. An equal sign after each option is followed by the list of items involved. Here they are part of the data step, but they are NOT called DATA STEP options, but rather DATA SET options, and they can be used any time a data set is referenced, such as in this proc print statement:
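The proc print statement referred to is not shown in this copy; a sketch (data set and variable names assumed) is:

```sas
proc print data=one(drop=diff);   /* data set options work in procs, too */
run;
```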

(A var statement under proc print would be more appropriate, but this works. Var statements are covered later.)

The Set Statement
The set statement is used to create a new data set from an existing one. In the simplest example, it merely copies a data set. The following program will create a new data set called two that is an exact copy of one, provided one exists.
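The program is missing from this copy; it would be simply:

```sas
data two;
  set one;   /* copies data set one into the new data set two */
run;
```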

Now suppose that a data set called one has been created, as shown in the data step below, and that you later wanted to create a new data set called two, with some new variables, but leaving the old one unchanged. The set statement can be used to copy the original data set and add new variables.
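The data step pair described here is not reproduced; a sketch (data values and new variable names are assumptions):

```sas
data one;
  input x y;
cards;
1 2
3 4
;
data two;
  set one;       /* read each observation of one    */
  sum  = x + y;  /* add new variables alongside the */
  prod = x * y;  /* original x and y                */
run;
```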

SAS will now go through all the observations in one, and place in the new data set two the original variables together with the new ones that are defined. There are two data sets now. Suppose we want to print both of them. Most procs, including proc print, will use the most recently created data set by default. Thus the second statement below would print only two. To specify another data set, use the syntax shown in the first statement below. Each “proc print” statement prints one data set.
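The two statements referred to are missing here; they would look like:

```sas
proc print data=one;   /* names the data set explicitly            */
run;
proc print;            /* prints two, the most recently created set */
run;
```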


Using Data Set Options with Set Statements
Next we show how data set options can be applied to either input or output data sets when using the set statement to read another data set. In the example below, data set one is first created to serve as the input data set. The data sets two and three will be exactly the same, but the processes by which they are produced differ. When two is created, all four variables from one are read into memory, then when the data are written to two the variables x and y are eliminated. When three is created, only sumxy and prodxy are read into memory from one, and subsequently saved to three. If there are many variables and observations involved, the second method is preferred because it is more efficient (saves system resources). However, you cannot drop variables from an input data set if you intend to use them in calculations!
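The example below is missing from this copy; a sketch using the variable names from the discussion (the data values are assumptions):

```sas
data one;
  input x y;
  sumxy  = x + y;
  prodxy = x * y;
cards;
1 2
3 4
;
data two(drop=x y);            /* all four variables read; x and y dropped on output */
  set one;
run;
data three;
  set one(keep=sumxy prodxy);  /* only these two variables are ever read into memory */
run;
```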

The firstobs and obs options specify the starting and stopping observation number to read. They do not apply to output data sets, but are used with set statements and in procs where data sets are referenced. In the first example, a second data set is created using these options to make a subset of the original data. Note the observation numbers in the output.
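The subsetting example is not shown in this copy; a sketch (data set names assumed):

```sas
data two;
  set one(firstobs=2 obs=4);   /* keeps original observations 2 through 4 */
run;
proc print data=two;
run;
```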


In the next example, the option is given in the proc print statement. Compare the observation numbers. In both cases, the observation numbers correspond to the observations actually stored in the data. They are not renumbered by proc print.
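That example is missing here; it would look something like:

```sas
proc print data=one(firstobs=2 obs=4);   /* observation numbers are not renumbered */
run;
```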

Finally, we discuss the rename option. At times it will be necessary to change the names of variables. As with the other options, this can be done when reading or writing the data. When used with an output data set, it renames the variables when the data set is saved, and does not affect the names used within the data step, as shown below. Also, the rename option requires another set of parentheses for the list of variables to be renamed. The name change is specified as "old name=new name."
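The rename-on-output example is not reproduced; a sketch (variable names are assumptions):

```sas
data two(rename=(x=score));   /* note the second set of parentheses */
  set one;
  logx = log(x);              /* inside the step the old name, x, is still used */
run;
```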


When the rename statement is used with an input data set, the new names are in effect during the data step, as in this example:
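A sketch of the rename-on-input case (names assumed):

```sas
data two;
  set one(rename=(x=score));
  logscore = log(score);   /* the new name is in effect during the data step */
run;
```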

Using Firstobs and Obs in the Infile Statement
Firstobs and obs can also be used in the infile statement when reading from an external file or cards. Here the syntax is different: since the options are not connected with a data set, they are simply typed after the filename or the cards keyword, with no parentheses. There is one other difference--in this case the numbers refer to the starting and ending line in the raw data, which is not always the same as the observation number (e.g., observations might take up two lines each).
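The example is missing from this copy; a sketch using in-stream data (the data lines are assumptions):

```sas
data part;
  infile cards firstobs=2 obs=4;   /* lines 2 through 4 of the raw data */
  input name $ score;
cards;
Al 80
Bo 85
Cy 90
Di 95
;
```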


Concatenating Data Sets Using the Set Statement
The set statement can also be used to combine data sets in various ways. The first way is called concatenation, and is simply combining them in order, one after the other. The example below uses two data sets, but there can be more. When using concatenation, all of the observations of all of the listed data sets are included in the result. All variables from all the sets are included, with missing values assigned in cases where a variable does not occur in the original.
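The concatenation itself is a one-line set statement (data set names assumed):

```sas
data both;
  set one two;   /* all observations of one, followed by all observations of two */
run;
```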

Suppose the data sets have different variable names.


You could use data set options to re-align the variables in a case like this.
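The re-alignment example is not shown in this copy; a sketch (the variable names x and z are assumptions):

```sas
data both;
  set one two(rename=(z=x));   /* values of z in two now line up under x */
run;
```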

Combining Observations with the Set Statement
The second method of combining data sets is called "one-to-one reading" and may be thought of as a side-by-side version of concatenation. The programming difference is that while in concatenation the source data sets are listed in one set statement, in one-to-one reading each source data set is given in a separate set statement. If the same variable names occur in more than one source data set, the values of the later sets overwrite the earlier ones. If the data sets do not have the same number of observations, the result is cut off at the length of the shortest set. The example below shows two data sets of different lengths and the same variables. The x variable in the second set is changed using a rename option during input. Without the rename, the x values from the first set would be overwritten by those from the second; because of the name change, we instead end up with x and y coming from the two original x variables.
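The example is missing from this copy; the key part would look like:

```sas
data new;
  set one;                 /* contributes x                                   */
  set two(rename=(x=y));   /* its x is renamed to y, so it no longer overwrites */
run;
```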


Interleaving Data with the Set Statement
The third method is called interleaving. This is like concatenation, except that the observations are combined so that they are sorted, instead of having one data set placed after the other. The variable(s) on which the sort is done are called by variables, and the interleave is done by adding a by statement to the concatenation program. The by variables must be sorted before the interleave is done, so if they are not in order to begin with, we use proc sort to do the job. Proc sort will also have a by statement, listing the by variable(s), just like the data step that does the interleave.
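The interleave program is not reproduced here; a sketch (data set names and the by variable id are assumptions):

```sas
proc sort data=one;
  by id;
proc sort data=two;
  by id;
data both;
  set one two;   /* same as concatenation, except for the by statement */
  by id;         /* observations are interleaved in order of id        */
run;
```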


Merging Data Sets
Now, if you're thinking ahead, you must realize that we should be able to take this to another level: how about putting the data side by side while matching up variables? That's called a merge, and that's where we're going next. But before getting to the programming, there are some things that need to be explained. We need to be very careful about how things match up. For convenience, we'll visualize the data sets laid out side by side, so that we have a left and a right data set. The simplest case occurs when there is exactly one item on the left to match exactly one item on the right. This is called a one-to-one merge. If there is exactly one on the left to match more than one on the right, we call it a one-to-many merge, and if the roles are reversed, a many-to-one merge. Finally, if there are multiple instances of matching values on both sides, it is called a many-to-many merge. The latter should usually be avoided, but an example will be provided to show what happens. Our first example will be one-to-one reading again, but this time using a merge statement. The difference is that the data set continues building until the end of the longest data set is reached. (In the following examples the variables have been given unique names to avoid needing rename options.)
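The merge version of one-to-one reading is a single statement (data set names assumed):

```sas
data both;
  merge one two;   /* side by side; continues to the end of the longer set */
run;
```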


Next, we do a one-to-one merge. The data sets will be matched by the name variable, therefore this variable may be called the match key. The match key is identified using a by statement, and it must be the same in both source data sets. Just as in interleaving, the data sets must be sorted. To simplify the example, the data are entered in sorted order. Note that if something doesn't have a match, it is included, with missing values where the matching data should be. Also, this is a good time to mention that SAS character variables are case sensitive, so that "john" and "John" are two different values. If your data might have mixed case, the upcase function can be used to convert everything to upper case to ensure it matches.
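The matched-merge program itself is missing from this copy; the key part would be:

```sas
data both;
  merge one two;
  by name;   /* name is the match key; both sets are already sorted by name */
run;
```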


Now consider a one-to-many merge. We will have one instance of each name on the left, and multiple instances of some names on the right. There is no change to the program (only the data). The result is that one observation is produced for each observation on the "many" side, but if there is no match, again the relevant observations on either side are included with missing values.


Here is the result of a many-to-many merge. The technical details of what happens are a bit complicated, but we can say, in a simplified way, that SAS matches lines side-to-side until one side runs out of the matching observations, then it repeats the last observation from the short side until the long side runs out of the matching value. (If you go back to the previous examples, you can see that they are special cases of this more general result.) There are very few situations where this behavior is desirable. Most merging is done with one-to-one or one-to-many relationships.


As a final comment, note that while there is a useful purpose for having two or more set statements, there is nothing similar for merge statements. You can put two or more merge statements in a data step, with or without a by statement, and not get an error message, but the resulting data values may be intermingled in "unexpected" ways.

Exercises
1. Refer to the data in Exercise #1 of Lesson 6. For each of the following, include a proc print to display the results. Use appropriate titles. This can all be done in one program.
a. Create a data set with only the three given variables. Name the variables x, y, and z. (This is the only time you will use cards.)
b. Create another data set, using a set statement, that takes the SAS data set in Part a as input, renames the variables length, width, and height during reading (on the way in), and calculates the volume and cost as in Lesson 6.
c. Do as in Part b, except this time rename the variables when the data set is written (on the way out; this does not mean in proc print).
d. Do as in Part b, except this time use obs and firstobs to bring in only the third and fourth observations from the original data set.
e. Using the data set created in Part b as input, create a new data set that contains only the volume and cost variables.
2. Refer to Exercise #2 of Lesson 5. Begin with the same input statement as was used in Lesson 5.
a. Create a new name variable of the form "Lastname, Firstname" and another of the form "City, ST zip#". Use an option to eliminate the original five variables used to create these new variables.
b. Create another data set where you use obs and firstobs in the infile statement to read the second and third lines of data only.



3. Download this data. Do not change the file or copy it into your editor. Look at it with a text editor, such as Notepad. There are two sections, with headings that say "1." and "2." (these lines are not to be considered observations). Write two data steps, using this one file as the source in both cases, but using firstobs and obs to control the starting and ending line so that each section is read into its own data set. The data have a city name and state abbreviation. Read these as two variables, "City" and "State." Use a length statement to make the city variable long enough not to cut any of the names short, and make the state abbreviation of length 2. Create a new variable that forms a single five-character abbreviation using the first three letters of the city and the two letters of the state abbreviation. Create another variable that concatenates the city and state with a comma and one space immediately following the city. It should look something like this:

Obs   city       state   shortcity   longcity
  1   Bismarck   ND      BisND       Bismarck, ND

Concatenate the two data sets and print the result. (Do not sort the data before this step.) Next, interleave the two data sets. Remember they must be sorted first. Use "state" as the by variable. Print the results, but use a data set option within proc print to show only the "longcity" variable and observation number. Next, merge the two data sets using "state" as the matching variable and applying the rename option as needed (within this data step). Use an option in the data statement so that only the two original city variables and the state remain in the new data set. Print the result (with no data set options used in proc print this time).
4. Copy the SAS code below into the editor to start with. Assume that these data sets represent an inventory list that is being revised at each step. The prices change each time, but the "itemno" is revised between new1 and new2 only.
Write a program that does the following, and print each of the data sets you create.
a. Merge new2 and new3 with itemno as the match key, and show old and new prices.
b. Merge new1 and new3 with name as the match key, including the old and new values of "itemno" and price in the result.

data new1;
input itemno name $ price;
cards;
325 PrintCrd 211
276 KeyPad 37
842 PnclHldr 8
422 PaprShrd 132
523 Basket 29
;
data new2;
input itemno name $ price;
cards;
333 PrintCrd 399
277 KeyPad 25
802 PnclHldr 12
417 PaprShrd 122
515 Basket 17
;
data new3;
input itemno name $ price;
cards;
333 PrintCrd 386
277 KeyPad 25
802 PnclHldr 11
417 PaprShrd 135
515 Basket 15


;

Lesson 8: Proc Print, Proc Sort and ODS

Getting More From Proc Print
Proc print is used to display data in the output window. The output can be saved, printed, or copied into a word processor or other program. To make the formatting come out right in a word processor, a monospace font (one in which all letters take up the same amount of space) must be used. A computer with SAS installed should have the "SAS Monospace" font available; otherwise, choose something like "Courier." In previous lessons we have made frequent use of proc print, but we have introduced only a few options, namely label and data=. Actually, the output of proc print is highly customizable. Details can be found in the documentation under "Base SAS Procedures Guide/Procedures/The PRINT Procedure." We will explore some of the more common options and statements here. Recall the example from a previous lesson, where we showed that data set options can be used in a proc print statement.
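That earlier example is not reproduced in this copy; it was of this form (data set and variable names assumed):

```sas
proc print data=one(keep=name score);   /* data set options in a proc statement */
run;
```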

As a demonstration of data set options, that was good, but it is not the way this is usually done. Proc print has a var statement which is used to list the variables that are to be printed. The proc step below will produce the exact same output. The var statement can also be used to change the order of the columns, as they will be printed in the same order they are listed in the var statement.
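The equivalent var-statement version (same assumed names) would be:

```sas
proc print data=one;
  var name score;   /* also controls the order of the columns */
run;
```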

The noobs (no observation number) option can be added to the proc print statement to suppress printing the column of observation numbers.
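With the same assumed names, suppressing the observation numbers looks like:

```sas
proc print data=one noobs;
  var name score;
run;
```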


Or, perhaps there is a variable in the data set that uniquely identifies each observation and is not the same as the observation number, such as a customer number, Social Security Number, or subject number in an experiment. In that case, add an id statement (that's pronounced eye-dee). The id statement causes the specified variable to be printed at the left instead of the observation number. Note that the id variable is not included in the var statement, if one is used (including it will cause it to be printed twice).
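A sketch of the id statement (the variable subject is an assumption):

```sas
proc print data=one;
  id subject;       /* printed at the left in place of the observation number */
  var name score;   /* do not list subject here again                         */
run;
```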

The label option was introduced in an earlier lesson. Consider the following program and output, with labels made up of several words. By default, SAS fits the labels on the column headings as best it can, using spaces to split the labels onto multiple lines. In this example it is not quite satisfactory. We notice that although "First Name" occupies two lines, "Last Name" does not. (data)
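The program is missing from this copy; a sketch with multi-word labels (variable names are assumptions):

```sas
proc print data=survey label;
  label fname = 'First Name'
        lname = 'Last Name'
        q1    = 'Question 1 Score';
run;
```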

The split= option can be used to control where the breaks in the labels occur. When using the split option, the label option does not have to be specified because it is implied. Any character can be used to control the split. Often a space will do.
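A sketch using a space as the split character (same assumed names; label is implied by split=):

```sas
proc print data=survey split=' ';   /* labels break at each space */
  label fname='First Name' lname='Last Name' q1='Question 1 Score';
run;
```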


That fixes the "Last Name" problem, but now the "Question" headings take up four lines. Maybe that's not what we want. By specifying a different split character we can control where the splits occur.
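A sketch with an asterisk as the split character, so the break occurs only where we put it:

```sas
proc print data=survey split='*';
  label q1 = 'Question 1*Score';   /* breaks only at the asterisk */
run;
```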

Due to the wider headings, the variables are now printed in two sections. Note that the id variable is repeated in the second section (obs will do the same if used). Even if there are many observations, SAS will, by default, break each page into sections like this, if all the columns don't fit on one line. You can use a rows=page option to force each page to be all one section.
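The option mentioned would be used like this (data set name assumed):

```sas
proc print data=survey rows=page;   /* one section per page */
run;
```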


SAS has many rules for trying to decide what the best way to fit things on a page will be. Sometimes it prints the headings vertically. The direction can be forced using a heading= option. The arguments are v (or vertical) and h (or horizontal).
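For example (data set name assumed):

```sas
proc print data=survey heading=h;   /* h (horizontal) or v (vertical) */
run;
```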

The sum statement causes SAS to calculate the sum of one or more variables. (data)
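The example is not reproduced here; a sketch (data set and variable names assumed):

```sas
proc print data=sales;
  sum amount;   /* prints the total of amount at the bottom */
run;
```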

The by statement causes proc print to generate separate reports for each value of the by variable. (Data must be sorted by the by variables.)
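A sketch of by-group printing (names assumed), including the required sort:

```sas
proc sort data=sales;
  by region;
proc print data=sales;
  by region;   /* a separate report section for each region */
run;
```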


And you can print subtotals for the by groups like this:
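The example is missing from this copy; combining by and sum gives the subtotals (names assumed):

```sas
proc print data=sales;
  by region;
  sum amount;   /* a subtotal for each region, plus a grand total */
run;
```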

Getting More From Proc Sort
Data sets can be sorted using proc sort. A by statement lists one or more variables to use as sort keys. The keyword descending can precede any variable for which the sort order is to be reversed. A data= option specifies the data set to sort; otherwise, the most recently created data set is used. Here is a simple example that sorts by one column. The original data set is replaced by the sorted one.
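The simple example referred to is missing here; it would be (names assumed):

```sas
proc sort data=one;
  by score;   /* one is replaced by its sorted version */
run;
```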


In the example below, an option to create a new data set called two is included, along with a data set option specifying which variables to keep. This leaves the original data one as it is.
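The example below is not reproduced; a sketch (names assumed):

```sas
proc sort data=one out=two(keep=name score);   /* one is left unchanged */
  by name;
run;
```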

Any number of sort keys can be specified. The next example shows a sort by school, then score within school, in descending order.
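A sketch of that two-key sort (names assumed):

```sas
proc sort data=one;
  by school descending score;   /* score high-to-low within each school */
run;
```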


Proc sort can also eliminate duplicate observations while sorting. The noduprecs and nodupkey options can be added to the proc sort statement. Noduprecs eliminates observations that are exactly alike for all the variables, while nodupkey eliminates those that have the same values in the sort key variables, even if there are differences in other variables.

Introduction to ODS
ODS stands for "Output Delivery System." When a SAS procedure produces output, it is actually producing data, which is then passed to the Output Delivery System, which determines what should be done with the output data. ODS has "destinations," which refer to the type of output to be produced. For example, the "listing" destination is the output window. In this lesson, we will demonstrate two other destinations, "html" and "pdf." Another important one is "rtf," which produces documents you can import into word processors. Destinations are opened and closed as shown below. The ods keyword, followed by a destination name, will open the destination. To close a destination, include the keyword "close." In the example, the listing destination is closed, meaning no output will go to the output window. Then the html destination is opened, meaning html will be created. After proc print does its job, the destinations are returned to their previous state (always a good idea). We see a nice html document in the results viewer.
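The example described is missing from this copy; a sketch (data set name assumed):

```sas
ods listing close;   /* stop sending output to the output window */
ods html;            /* open the html destination                */
proc print data=one;
run;
ods html close;      /* return destinations to their previous state */
ods listing;
```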


ODS can send output to multiple destinations. (We didn't need to close the listing destination--it was just done to provide an example.) It can also create files, as shown below.
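A sketch of writing the html to a file (the filename is an assumption):

```sas
ods html file='myreport.html';   /* listing can stay open at the same time */
proc print data=one;
run;
ods html close;
```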

SAS usually displays the html or pdf document in the Results Viewer window, even if it is written to a file. Here is a pdf displayed in the Results Viewer.

If you're preparing documents for publication, presentation, or the web, you may not be satisfied with the way this output looks. ODS allows us to modify the way documents look by applying styles. For example, the journal style is designed to be compatible with the requirements of many scientific journals:
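Applying a style is done in the same statement; a sketch (filename assumed):

```sas
ods pdf file='myreport.pdf' style=journal;
proc print data=one;
run;
ods pdf close;
```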


A set of styles is installed with your SAS system. This little program will display the styles available on your system:
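The little program is missing from this copy; one common way to list the installed styles is:

```sas
proc template;
  list styles;   /* lists the styles available on your system */
run;
```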

Exercises: The file usedcars3.txt contains part of an inventory of used cars from a local car dealer. The variables are year, make, model, color, miles, inventory number, and price. Read this data into a SAS data set. Include descriptive labels for all the variables, at least some of which should be made up of several words (separated by spaces) for this exercise. Include formats for miles and price. Save this data set for the next lesson (think about what that means--hint: permanent). Don't forget to include titles that identify which question your output belongs to.
1) Print the data without labels and with the inventory number as an ID variable.
2) Print the data with labels, including inventory number as an ID variable. Use a statement (not a data set option) so that only inventory number, make, model, year, and price, in that order, are printed.
3) Look at your headings in part 2. They will be printed either vertically or horizontally. Use an option to force them to print the other way.
4) Now take the label statement you put in the data set and insert it in proc print. Revise this label statement to include a split character and print all of the variables (without using an ID statement) with the split labels. Do not just replace spaces with a split character; make sure to do something that will demonstrate the difference that using a split character makes. Continue using these labels for the rest of the parts below.
5) Sort the data in order of price from highest to lowest and print the result.
6) Print the data with columns for all the variables except color, but separate the observations into groups for each color, without any sums or subtotals.
7) Print the year, make, model, color, and price, but include subtotals and a grand total of the price for each make.
8) Sort the data in alphabetical order of make, and within make, in order of miles from lowest to highest.
Include a nice title, suppress the page number and date, and put the result in a new data set and print it using the ODS pdf destination with journal style. Save the pdf file and submit it separately.


Lesson 9: Proc Rank, Proc Contents, and Proc Format

Ranking Observations
Proc rank is used to generate rankings for observations. Ranks may be useful in their own right, or they may be needed for non-parametric statistical methods. The procedure is fairly straightforward. A var statement names the variables to be ranked, and a ranks statement names the variables that will contain the ranks. Both data= and out= options are available as in proc sort, but there is a difference in default behavior that sometimes causes confusion. Suppose we submit the following program:
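The program is missing from this copy; it was of this form (data values and variable names are assumptions):

```sas
data one;
  input name $ score;
cards;
Al 88
Bo 95
Cy 79
;
proc rank data=one;   /* note: no out= option given */
  var score;
  ranks srank;
run;
```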

Whereas proc sort would have given us a sorted data set one, proc rank didn't put the ranks in one. Where did they go? A look in the log shows us that a new data set called data1 was created. Proc rank is one of several SAS procedures that follow this convention: if you do not provide data set names for new data sets, they will be named sequentially as data1, data2, etc. Proc rank will not over-write an existing data set unless you supply a name.

If we had not specified a data set for proc print, it would simply have printed data1 since it is the most recently created data set. Specifying the data set is a good idea, though, because we can easily make


mistakes by not paying attention to which data set is being processed. Here we name the output data set (the one containing the ranks) two, and let proc print display it by default.
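That step would look like (names assumed):

```sas
proc rank data=one out=two;
  var score;
  ranks srank;
run;
proc print;   /* prints two, the most recently created data set */
run;
```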

Look at the scores and the ranks. Are they what you expected? Perhaps not, as we often think of the highest score as being "number one," but here the lowest score is "number one." This is an ascending ranking. If you want the ranks to go the other way, you need a descending option to reverse the order of the ranking. Proc rank doesn't allow ascending and descending ranks in the same proc rank step. You can overcome this by using two steps, taking the output of the first as input for the second.
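The two-step program described in the next paragraph might be sketched as follows (names assumed):

```sas
proc rank data=one out=two descending;   /* srank and grank both descending */
  var score grade;
  ranks srank grank;
run;
proc rank data=two out=three;            /* second step adds an ascending rank */
  var score;
  ranks srank2;
run;
```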

Notice that srank and grank are both produced in the first rank step, and both give lower ranks to larger numbers (descending). The second step takes the ranked data set, two, as its input, and adds srank2, an ascending rank for the scores. Did you notice the values for grank? You may wonder how or why you get ranks that are not whole numbers. This happens because some values are tied. In fact, there are two of each grade. SAS has taken all the tied observations and averaged their ranks. You can use a ties= option to specify what to do in case of ties. The possibilities are high, low, and mean. If you use high or low, it will take the highest or lowest rank of the tied cases. Be careful! The result you want may be affected by whether you are ranking in ascending or descending order.


Getting Information About the Contents of a Data Set
Proc contents displays information about the variables in a data set, as well as various characteristics of the data set. The information you are most likely to be interested in is the third section on variable attributes. The variables appear in alphabetic order. The "#" column indicates the order of the variables in the file, while the "pos" column gives the actual position in bytes from the beginning of the line. "Type" should be obvious, and "Len" is length, of course.
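The basic form (data set name assumed):

```sas
proc contents data=one;
run;
```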

Proc contents doesn't have many options, but here are a couple of them. Short gives a very short version of the output, which is actually just a list of the variables. Varnum causes the variables to be displayed in the order of their position instead of alphabetically.
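Those two options would be used like this (data set name assumed):

```sas
proc contents data=one short;    /* just a list of the variables       */
run;
proc contents data=one varnum;   /* variables in position order instead of alphabetical */
run;
```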


Custom Formats and Informats
Proc format creates custom formats and informats. As we have seen, informats are used in reading data and determine how a value will be stored, while formats determine how a value will be printed. Custom formats and informats allow grouping of values; for example, ranges of numbers could be read or printed as "Low," "Med," or "Hi" values. We will focus on formats, although similar commands can be used to produce informats. Your format will need a name, and there are some rules that must be observed in addition to the normal rules for SAS names (you can use letters, numbers, and underscores, but can't start with a number). First, the name you choose cannot be the same as that of an existing format supplied by SAS. The length cannot exceed 32 characters, and this includes the "$" that must begin a character format name and the "@" prefix that SAS automatically adds to user-defined informat names. (You may see this in the log.) Also, character format names cannot end in a number. It's not likely you'll want to make any names that long anyway, and the "$" requirement is familiar, so if you make it a practice not to end with a number, you shouldn't have too much trouble. To avoid duplicating a name SAS already has, it is a good idea to include a short, unusual character combination--perhaps your initials or business acronym--as part of the name. Suppose you have a data set like the following, with a product number (a character variable) and a price. You'd like to print a report that contains the product description and the price.

A good way to do this would be to create a format that associates a description with each product number. In proc format, the value statement is the actual command that defines a format. (A similar invalue statement defines an informat.) In the example below, the expression following the key word "value" is the name of the format. Note that there is no period at the end of the format name here, but the period is used when the format is associated with a variable, as in the format statement under proc print. The expressions after the name are called value range sets. In this case each variable value is assigned one formatted value, but there are other possibilities. The formatted values can be up to 32,767 characters long, but some procedures only use the first 8 or 16 characters.
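A minimal sketch of such a program (the data set name, product numbers, and descriptions are hypothetical; the "xx" in the format name stands in for your initials):

```sas
proc format;
   value $prodxx '101' = 'Widget'
                 '102' = 'Gadget'
                 '103' = 'Gizmo';   /* value range sets */
run;

proc print data=products;
   format prodnum $prodxx.;  /* note the period here, but not above */
run;
```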


The original data did not contain labels, so here we show that label statements can be added in proc print (and some other procs) as well. A real world application like this would probably involve thousands of items. It would not be good to rebuild the format every time it was used, so user-defined formats can be permanently saved and accessed when needed. The simplest way to do this is to use the special libname library. If you include a libname statement like that shown below, together with the library=library option in proc format when creating the format, then put the same libname statement in any program that uses the format, SAS will store the format in the specified directory and will search for it there when you want to use it.
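A sketch of the idea (the directory path and format details are hypothetical):

```sas
libname library 'c:\sasformats';  /* the special libname "library" */

proc format library=library;      /* store the format permanently  */
   value $prodxx '101' = 'Widget'
                 '102' = 'Gadget';
run;
```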

The following program will find and use the format created above.
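A sketch of what that program might look like, assuming the same hypothetical names as above:

```sas
libname library 'c:\sasformats';  /* same libname as when created */

proc print data=products;
   format prodnum $prodxx.;  /* SAS searches the library for it  */
run;
```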

In the next example, we have a list of students in various grades. An informat is created to classify the values into categories representing the school level. This is a character informat, since the resulting values will be character strings. It may be tempting to think that the numbers representing grades are numeric, but they are treated as character values too. This is important when specifying ranges, because numbers and alphabetic expressions do not sort the same way. The first value range set, which defines the Elementary category, represents grades 1, 2, and 3. Multiple values can be listed on the left side, separated by commas, or ranges can be specified, using a dash. SAS accepts these values without quote marks around them, but quotes can be included if desired, such as "1", "2", "3". There is an important reason why this range was not given as 1-3. That is because in character sort order, 10, 11, and 12 come between 1 and 2! Using 1-3 would indicate that students in grades 10-12 were to be classified as Elementary, which is incorrect. Furthermore, those values are then defined again in a later range set, which produces an error, and the informat is not created. Similarly, for the High Schl category, a range of 9-12 cannot be used. SAS complains with an error message in the log, that "Start is greater than end" in this case. If a value occurs in the data that is not defined in the format procedure, SAS uses a default informat, as occurs with the "K" grade level. You can also use other as a range for anything that does not fit what has been listed.
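A sketch of the relevant value range sets (the informat name is hypothetical; only the categories mentioned above are shown):

```sas
proc format;
   invalue $lvlxx 1, 2, 3  = 'Elementary'  /* not 1-3: see text     */
                  9, 10-12 = 'High Schl';  /* 10-12 is a valid      */
run;                                       /* character range; 9-12 */
                                           /* is not               */
```

Grade "K" is not covered by any range set here, so it would fall through to the default informat.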


The next example shows a numeric format. The ranges include the words low and high, which can be used for unspecified or infinite lower and upper bounds. Also, the less-than sign is used as a way of excluding endpoints. Each of the ranges (except the last) given here will exclude the upper endpoint. For example, a score of 79.99999 would get a value of 2, but a score of 80 would get a value of 3. If you want to exclude a lower endpoint, put the less-than sign before the minus, such as "60<-70."
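A sketch of such a format (the cutoffs and format name are invented for illustration):

```sas
proc format;
   value grdxx low -< 70 = 1    /* -< excludes the upper endpoint  */
               70  -< 80 = 2    /* 79.99999 -> 2, but 80 -> 3      */
               80  -< 90 = 3
               90 - high = 4;   /* last range keeps both endpoints */
run;
```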

Exercises: Make use of the used cars data set created in the previous lesson. There should be no data step in this lesson. 1) Rank the prices so that the highest is number 1, and the miles so that the lowest is number 1, and send the result to a new data set and print it. (Note: this has to be done in two steps because all ranks in one step will go the same direction.) 2) Run proc contents on the data set created in the previous problem, displaying the variables in the order they exist in the data set.


3) Create a format that combines the colors into three categories, "light," "dark," and "other" for those that aren't specifically assigned. Use your own judgment in classifying the colors. Be sure to leave at least one out so there is something for the "other" category. 4) Create a format to classify the miles into categories of "high," "medium," and "low." Use your own judgment to define these categories. 5) Print the data using your new formats. Keep these formats for use in later exercises. Lesson 10: Basic Statistics with Proc Univariate, More ODS Summarizing Data with Proc Univariate Proc Univariate and Proc Means are procedures in Base SAS that calculate statistics one variable at a time (they do not explore relationships between variables). The two procedures have quite different listing output but many similar capabilities. Proc univariate is the more extensive of the two. In order to demonstrate these procedures in a meaningful way, a larger data set than those we have seen previously will be needed. The data set we will use is shown below. A text file containing the data is here. The data set contains three variables, a group variable with values 1, 2, and 3, a discrete variable x with values 1-6 (a die toss) and a continuous variable y.

First, we look at a very simple proc univariate step. The "var" statement lists the variables for which analysis is to be performed. If "var" is omitted, univariate will give analyses for all numeric variables in the data set. Incidentally, this would include any for which the analysis is silly, such as the "group" variable in this example. Thus, specifying variables to be analyzed is a good idea.
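A sketch of the step (the data set name dice is hypothetical):

```sas
proc univariate data=dice;
   var x y;   /* omit this and all numeric variables are analyzed */
run;
```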


The results include a fairly detailed summary with all kinds of statistics for the variable, spread over two pages.


Proc univariate has many options and optional statements. We will explore a few of the more common ones. For more, see the documentation under Base SAS/Base SAS Procedures Guide: Statistical Procedures. In the middle of the first page of output, above, note the section titled "Tests for Location: Mu0=0." These are statistical hypothesis tests where the null hypothesis is that the mean of the random variable is equal to zero. The small p-values indicate that the null hypothesis should be rejected, and the conclusion drawn that the mean is not zero. Perhaps you would like to test whether the mean is some other value, say, 100, for example. You can add the following option to the proc univariate statement. The higher p-values indicate that this null hypothesis cannot be rejected.
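A sketch of the option (data set name hypothetical):

```sas
proc univariate data=dice mu0=100;  /* test H0: mean = 100 */
   var y;
run;
```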

Two other options in the proc univariate statement are normal and plot. The normal option produces the section on tests for normality, and the plot option gives the stem and leaf, box and whisker, and normal probability plots below.
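A sketch with both options added (data set name hypothetical):

```sas
proc univariate data=dice normal plot;  /* normality tests plus   */
   var y;                               /* stem-leaf, box, and    */
run;                                    /* normal probability plots */
```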


The notation below the stem and leaf plot, which says "Multiply Stem.Leaf by 10**+2" means that if you read the numbers like 6.9 for the first one, you should multiply that by 10 to the second power, so it is really 690. This data is badly skewed, so the box plot is not at all symmetrical. It usually has a dashed line through the middle of the box for the median. The "+" represents the mean. To interpret the normal probability plot, look at the band made up of "+" signs. The asterisks are the data, and if they mostly fall within the band, the data may be considered normal. In this case, the data are not normal, as the normality tests also show, since the low p-values indicate the assumption of normality should be rejected. Optional Statements in Proc Univariate Like proc print, proc univariate has a by statement, which will produce separate analyses for each value of the variable specified. In this case, the result is three sets of output for each value of "group" (results not shown).


The graphics shown above are somewhat rough, but proc univariate can also produce high resolution graphs, such as a histogram, which is displayed in a graph window. If a "var" statement is used, the histogram variable must be included in the listed variables. The "/normal" portion is an option to the histogram statement that superimposes a normal curve on the histogram. This demonstrates again that the normal distribution is not a good fit to this data. (Other distributions can be specified for the curve.)
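A sketch of the histogram statement (data set name hypothetical):

```sas
proc univariate data=dice;
   var y;              /* histogram variable must appear here too */
   histogram y / normal;  /* superimpose a normal curve           */
run;
```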

The "qqplot" statement produces a high resolution version of the qqplot. Here is an example with the exponential distribution. A qqplot should fall in a nice straight line if the distribution is a good fit. Obviously, we are not having too much luck fitting a distribution for y!
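A sketch of the qqplot statement with the exponential distribution (data set name hypothetical):

```sas
proc univariate data=dice;
   var y;
   qqplot y / exponential;  /* points should fall in a straight  */
run;                        /* line if the distribution fits     */
```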


Producing an Output Data Set Univariate can also produce a data set containing the statistics seen in the output. If this is the only goal, a "noprint" option in the proc statement is a good idea. This suppresses the usual listing output in the output window. A "var" statement must be used with the output statement to determine which variables will be used for the output data set. The desired statistics must also be specified. There is a long list of these; again, see the help or documentation for details. In the example here, standard deviation and mean have been requested. Note that the syntax of the output statement requires a keyword for each requested statistic, followed by an equals sign, followed by a list of variable names for the statistics, one for each variable in the "var" statement. There will be only one observation, unless a "by" statement is also given, in which case there will be one for each value of the "by" variable, as shown in this example.
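A sketch of an output statement with a by variable (data set and output variable names are hypothetical):

```sas
proc univariate data=dice noprint;   /* suppress listing output   */
   var x y;
   by group;                         /* one observation per group */
   output out=stats mean=mx my       /* one name per var variable */
                    std=sx sy;
run;
```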

Using ODS to Control Output We saw that proc univariate creates several sections, each with its own heading and a table of information (except the graphs). In ODS, each of these sections is an output object. An output object generally has two parts, the data component, and the table definition. The data component is obviously the data that will be displayed in the table, and the table definition is a set of instructions that describes how to format the data. Each output object has a name and can be accessed separately through ODS. To see information about the output objects your procedure is producing, you can issue the following ODS commands and look at the results in the log (only part shown here):
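A sketch of the trace commands (data set name hypothetical):

```sas
ods trace on;            /* object names appear in the log */
proc univariate data=dice;
   var y;
run;
ods trace off;
```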

Notice that the name of each object corresponds roughly to the label in the output. In most cases, just the spaces are eliminated. This makes it fairly easy to identify the object name. Sometimes in the output there will be more specifics, like the section on "Tests For Location" which also gives the null hypothesis value in the output, but that is not part of the name or label. In any case, having the name, we can now use the ODS select statement to choose which objects to print or send to any ODS destination. Alternatively, an ODS exclude statement can be used to eliminate unwanted objects with similar syntax.
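A sketch of selecting one object (data set name hypothetical):

```sas
ods select TestsForLocation;  /* only this object is printed */
proc univariate data=dice;
   var y;
run;
```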


ODS can also be used to save objects to SAS data sets. They are then available for use in data steps or other procedures.
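A sketch of saving an object to a data set (data set names hypothetical):

```sas
ods output Moments=mymoments;  /* object name = data set name */
proc univariate data=dice;
   var y;
run;

proc print data=mymoments;
run;
```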

Exercises: Use the data set of used cars inventory from previous lessons for the following problems: 1) Use proc univariate to analyze the price variable in the used cars data. In addition to the default output, produce tests of normality, and low-resolution plots (box plot, stem & leaf, and qqplot). 2) Use proc univariate to analyze the miles variable, change the null hypothesis value for the tests of location to 50,000, and use ODS commands to display only the tests of location in the output window.


3) Use proc univariate to analyze the price variable. Use ODS commands to print only the "Moments" object in the output window and to save the "Moments" object to a SAS data set and print it. Lesson 11: Proc Means and Proc Freq Summarizing Data with Proc Means Proc Means can provide some of the same information as proc univariate, but has different output formatting and different options. For relatively simple reporting of summary statistics, proc means provides a more compact output. The example below shows proc means with a by statement and the first section of the output. The function of the by statement is the same as in proc print or proc univariate.
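A sketch of proc means with a by statement (data set name hypothetical):

```sas
proc means data=dice;
   var y;
   by group;   /* separate statistics for each group */
run;
```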

The statistics for the listing are requested as options in the proc statement (see documentation for complete list). Those shown above are printed by default. The next example shows how some others might be requested. And, like proc univariate, proc means produces an output data set. Note that the statistics for the listing and for the output data set need not match.
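A sketch of requesting other statistics, with an output data set whose statistics differ from the listing (names hypothetical):

```sas
proc means data=dice n mean median min max;  /* listing statistics */
   var y;
   output out=stats mean=my std=sy;  /* data set statistics        */
run;
```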


Proc means allows a shortcut in the output statement if only one statistic is requested and the same variable names as the original variables are desired for the output statistics. This does not work in proc univariate.
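A sketch of the shortcut (data set names hypothetical):

```sas
proc means data=dice noprint;
   var x y;
   output out=avgs mean;  /* output variables keep the names x and y */
run;
```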

Proc means also has a class statement. This is somewhat like a by statement but the results are grouped together.
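A sketch of the class statement (data set name hypothetical):

```sas
proc means data=dice;
   class group;  /* like by, but results in one grouped table */
   var y;
run;
```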

If you have more than one class variable, there is also a way to get summarizations at more than one level of combination of the class variables. This is done with the types statement. The example below gives the overall summary, represented by the empty parentheses, a summary by group, and a summary for each combination of group and x. Only part of the output is shown. In the output window, the different levels are placed into different tables.
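A sketch of the types statement (data set name hypothetical):

```sas
proc means data=dice;
   class group x;
   var y;
   types () group group*x;  /* overall, by group, and by group-x */
run;
```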


When the class and types statements are used together with an output statement, all the different combinations go into the data set. The _type_ variable indicates which level it is, such as overall, where _type_=0, group, where _type_=2, etc. Here again, we show only part of the output.

Summarizing Categorical Data with Proc Freq Proc Freq produces frequency tables for numeric or character variables. The "tables" statement is used to specify which variables to use in the table(s). If no tables statement is given, a one-way table for each variable in the data set will be produced (this is not usually a good idea). Multiple tables can be specified in one tables statement, and multiple table statements can be given. The data for this example is here.
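A sketch of a tables statement (the data set and variable names are hypothetical):

```sas
proc freq data=cars;
   tables color make;  /* one one-way table for each variable */
run;
```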


Several options are available in the tables statement and are listed after a slash if used. The example below shows nocum and nopct options, which suppress the cumulative statistics and percents. The nofreq option will suppress the frequencies.
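A sketch of the options (names hypothetical):

```sas
proc freq data=cars;
   tables color / nocum nopct;  /* counts only */
run;
```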

Two-way tables are requested using an "*" symbol notation, as shown below. The first variable will be listed vertically. The upper left cell gives the key for the numbers in the table. Three-way and higher tables can be requested, if desired. Proc Freq then produces a collection of two-way tables, one for each of the additional values of the other variable(s).
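A sketch of a two-way table request (names hypothetical):

```sas
proc freq data=cars;
   tables color*make;  /* color vertical, make horizontal */
run;
```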


Some useful options for two-way tables are norow, nocol, nofreq, and nopct. These are used to suppress each of the four numbers in the table cells, and are especially helpful if the tables are large.

If a table is to be built from a continuous variable, proc format can be used to group the values in a suitable way.
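A sketch of grouping a continuous variable with a format before tabulating (the cutoffs, format, and variable names are invented):

```sas
proc format;
   value milxx low -< 30000   = 'low'
               30000 -< 70000 = 'medium'
               70000 - high   = 'high';
run;

proc freq data=cars;
   tables miles;
   format miles milxx.;  /* table is built on the formatted groups */
run;
```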


Proc freq has many more capabilities. It can produce output data sets and many statistical tests and measures of association. See the documentation for further information, under "Base SAS/Base SAS Procedures Guide: Statistical Procedures." Here is one example of using proc freq to conduct a chi-square test of independence.
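A sketch of the test request (names hypothetical):

```sas
proc freq data=cars;
   tables color*make / chisq;  /* chi-square test of independence */
run;
```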

And if you want the results in an output data set:
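A sketch, continuing the hypothetical example above:

```sas
proc freq data=cars;
   tables color*make / chisq;
   output out=chistats chisq;  /* test results go to a data set */
run;
```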


Exercises: Use your permanent usedcars dataset as the source for the following problems. Use your saved formats whenever you are asked to use your formats from the previous lesson. 1) Using proc means: a) Display the mean and standard deviation (only these two statistics) for the miles variable in the output window. b) Display the mean and median of the price, but using color as a class variable. c) Use your formats for color and miles and display the mean and median prices, with both color and miles as class variables, in their formatted form. d) Produce an output data set that gives the mean and standard deviation of the miles for each make of car, using a by statement. Print the result. e) Produce an output data set that gives the mean and standard deviation of the price for all the cars and for each make of car, using class and types statements. 2. Using proc freq: a) Display a frequency table of the makes. b) Display a two-way table of color by make, showing only the counts in each cell, and include tests of independence. Align colors vertically and makes horizontally in the table. c) Use the format for classifying miles that you created in Lesson 9 to make a table of make by milegroups. Align makes vertically and miles horizontally. Print the counts and row percents (for each make, percent in a mile-group). Lesson 12: Proc Chart, Proc Plot, and Proc Corr Bar Graphs with Proc Chart Proc chart is a procedure that produces text-based bar charts as well as pie and star charts.


The vbar command produces a "vertical bar chart," as seen above. The discrete option, listed after the slash on the vbar statement, indicates that the variable is to be treated as a discrete value, with one bar for each value of the variable. (This does not apply to character variables.) Without this option, SAS groups numerical values in evenly-spaced classes like a histogram. This is appropriate for the y variable, as shown below. Note that the horizontal axis now indicates that the labels are midpoints of the respective ranges.
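A sketch of the two forms (data set name hypothetical):

```sas
proc chart data=dice;
   vbar x / discrete;  /* one bar per value of x       */
   vbar y;             /* y grouped into evenly-spaced */
run;                   /* classes, like a histogram    */
```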


That this is inappropriate for variables that are actually discrete (unless there are a large number of values), is apparent from the chart for x done without the discrete option, where you can see the data have been strangely grouped with non-integer midpoints (3 and 4 are combined with the label 3.6).

Complementing the vbar statement is the hbar statement, for "horizontal bar chart," of course. In addition to the horizontal display, the hbar statement also displays some statistics for each class.
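A sketch of the hbar statement (data set name hypothetical):

```sas
proc chart data=dice;
   hbar x / discrete;  /* horizontal bars plus statistics per class */
run;
```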

Another possibility is the block statement, shown here with a subgroup= option which subdivides the columns by another variable value. First, a warning: these "3-D" charts have limits because they require a certain amount of space on the page. If SAS cannot fit your requested chart on a page, you may get an error message in the log and a substitute graph in your output. You may be able to adjust your linesize and pagesize options, or change the number of bars to display, or turn dimensions around in order to get better results.

A group= option is also available to produce side-by-side bars, and they can even be used in combination. Here is part of a block chart of y using a group option with the variable "group."

Here is a vertical bar chart using both the group option for a side-by-side arrangement and the subgroup option for vertical stacking, thus combining three variables in one chart.
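A sketch combining both options (data set name hypothetical):

```sas
proc chart data=dice;
   vbar y / group=group    /* side-by-side sets of bars  */
            subgroup=x;    /* vertical stacking within a bar */
run;
```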


Some other options you can try include "levels=n" which specifies the number of bars, "space=n" which regulates the space between bars, "midpoints=list" where you specify the midpoints you want for the bars, and "freq=variable" which is used when one of your variables already contains the frequency to be used in making the height of the bars. For complete information about the chart procedure and other available options see the SAS Online Documentation under "Base SAS/Base SAS Procedures Guide/Procedures/The Chart Procedure." And finally, here is an example of a pie chart.
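A sketch of a pie chart request (data set name hypothetical):

```sas
proc chart data=dice;
   pie x / discrete;  /* one slice per value of x */
run;
```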


Scatterplots with Proc Plot Proc plot produces text-based two-dimensional plots. Note that there are special characters used in the output window for the axes; if you copy and paste these, you need to use the SAS Monospace font to have it come out right. The data set shown below is used in these examples, and can be downloaded here.


A simple plot of y by x is produced by the following code. Note that the first named variable goes on the vertical axis, and the second on the horizontal axis. By default, SAS prints characters to represent the data points. Sometimes more than one point occurs in one place. Letters later in the alphabet indicate the number of points represented at that position (A for one point, B for two, and so on). In this example it didn't happen, but SAS prints a legend to explain it, anyway.
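A sketch of the plot (data set name hypothetical):

```sas
proc plot data=points;
   plot y*x;  /* y on the vertical axis, x on the horizontal */
run;
```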

Another thing to notice is that the title bar of the editor window says "PROC PLOT running." Proc plot is one of a few procs that does not exit when it encounters a run statement. Instead, it stays active and waits for more plot statements. It will exit if followed by another step, otherwise, a quit statement can be used at the end, as shown in the next example. (Although "run;" is not necessary in this example, it would be good practice to include it, before the "quit;".) If you want some other symbol, instead of the default A-B-C for the data points, you can assign it as shown below. In this case, if there are two or more points in one place, SAS prints a message telling how many points are hidden, but you cannot tell which ones they are. (There is an example further down.)
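A sketch with a custom plotting symbol and a quit statement (data set name hypothetical):

```sas
proc plot data=points;
   plot y*x='*';  /* every point printed as an asterisk */
run;
quit;             /* proc plot stays active until this  */
```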


You can also create a kind of three-dimensional plot by assigning symbols based on a third variable. SAS only uses the first character of the value, but if this is a problem, a custom format can be used to define more useful symbols.
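A sketch of symbols taken from a third variable (names hypothetical):

```sas
proc plot data=points;
   plot y*x=group;  /* first character of group marks each point */
run;
quit;
```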

It is also possible to display a relationship between one variable and two or more others. Usually we put the common variable on the horizontal axis. This is done by putting two or more plot expressions in one plot statement, and using an overlay option, which follows the "/". In the example below, a by statement has also been added, and the output is shown for group=2 only.
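A sketch of an overlay plot with a by statement (names hypothetical):

```sas
proc plot data=points;
   plot y*x z*x / overlay;  /* both relationships on one set of axes */
   by group;
run;
quit;
```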


Sometimes it's nice to have reference lines on a plot. Proc plot has the vref= and href= options to provide these. Now the v and h do not refer to the orientation of the line, but to the position. Thus a vref is a vertical reference, which is a horizontal line at a particular vertical height.
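A sketch of reference lines (the data set name and reference values are invented):

```sas
proc plot data=points;
   plot y*x / vref=50    /* horizontal line at vertical position 50 */
              href=10;   /* vertical line at horizontal position 10 */
run;
quit;
```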


Information about more options for proc plot can be found in the documentation under "Base SAS/Base SAS Procedures Guide." Correlations with Proc Corr Proc corr computes measures of association between two variables. The primary one, of course, is correlation (Pearson product-moment correlation), but Spearman's rank-order, Kendall's tau-b, Cronbach's Alpha, and others are also available. For details of these, see the documentation, under "Base SAS/Base SAS Procedures Guide: Statistical Procedures." Using proc corr can be as simple as writing one line, much like proc print. By default, proc corr calculates correlations for all pairs of numeric variables in the data set. Some of these may not be sensible (such as for an id variable). The output includes simple statistics (n, mean, standard deviation, sum, min, and max) and a matrix with all the variables listed vertically and horizontally, so that one finds the desired correlation by looking at the intersection of a row and column. In fact, the layout resembles that of an actual correlation matrix as used in statistical theory, if only the top number of each entry is considered. The bottom number is a p-value for the hypothesis test whose null hypothesis is that the correlation is zero (there is no correlation), and whose alternative hypothesis is that the correlation is not zero (there is correlation).
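The one-line form, as a sketch (data set name hypothetical):

```sas
proc corr data=points;  /* correlations for all pairs of */
run;                    /* numeric variables             */
```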

Page No.104

If there are variables you do not need to correlate, or if there are too many variables to look at all at once, some modifications can be made. A var statement can be added to determine which variables will be included.

In addition to the var statement, a with statement can be used to make the output more compact. The "var variables" are listed horizontally, and the "with variables" are listed vertically.
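A sketch of var and with together (names hypothetical):

```sas
proc corr data=points;
   var x;    /* listed horizontally */
   with y;   /* listed vertically   */
run;
```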


Exercises: 1. Using proc chart or proc plot and the used cars data from the previous lessons: a) Make a vertical bar chart for the makes. b) Make a histogram for price with seven bars (histograms for continuous variables should have no spaces between bars). c) Produce a plot that relates year and price, with price on the vertical axis. 2. The following numeric data are values of the variables x and y respectively. cards; 23 35 27 36 24 35 21 32 29 36 28 39 ; a) Copy this data into the SAS editor and write a data step to read it. The prediction equation (regression line) relating y to x for this data is given by yhat=20.60811 + 0.58784x, where yhat is the name for the predicted value. Define this variable and include it in the data set. Also include the natural log of x and the natural log of y in the data set. b) Use proc means to find the means of y and yhat (you might notice something interesting). c) Make an overlay plot that shows the y values and the predicted values with two different symbols, plotted against the x variable. Include a vertical reference line at the mean of y, as found in part b. d) Find the correlations between all five variables (default output of proc corr). e) Display the two by two correlation table of x and y. f) Use var and with statements to display only the correlation of the log of x with the log of y. Lesson 13: Proc Reg No doubt one of the most widely used (and therefore abused) statistical procedures is regression. We are not going to learn how to do regression here. However, because of its popularity, we will use proc reg to demonstrate some of the typical syntax of statistical programs. Proc reg is not part of Base SAS; it is part of the statistical package called SAS/STAT. Therefore the documentation is found under "SAS/STAT," then "SAS/STAT User's Guide," then "The REG Procedure." The data used for these examples is here. The file contains the variables x, z, and y and has two header lines.
The idea is to use proc reg to derive an equation for the prediction of y based on x and z. This is often called the "model" or "prediction equation." The model statement in proc reg is used to define the form of the equation. The dependent variable (the one to be predicted) is given, followed by an equal sign, and then the independent variables (those on which prediction is based) are listed after the equal sign. Technically, the statement "model y=x z;" specifies an equation of the form yhat = b0 + b1*x + b2*z, where yhat (y-hat) is the predicted value of y, and the b's (the beta-hats) are the estimated coefficients of the equation, with b0 (beta-hat-naught) being the intercept. In the SAS output the beta-hats are called "parameter estimates." Here is the program and output for this data:
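A sketch of the program (the data set name regdat is hypothetical):

```sas
proc reg data=regdat;
   model y=x z;  /* predict y from x and z */
run;
quit;
```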


We are not going to go into the interpretation of these results here. What we want to do is study how additional statements and options are used to customize the results of the regression procedure. While the options, statements, and syntax vary for different statistical procedures, learning about proc reg will give you some of the general ideas, and thus give you some background for learning other procedures when you need them. Note that the model statement specifies the form of the equation that will be fit to the data. There can be more than one model statement. If that is the case, it might be helpful to give an identifying label to each model. This label will be printed at the top of the output for each model. By default these are numbered, as in the above example, where the model is called "MODEL1." Labels can be up to 32 characters, with no spaces, followed by a colon, and placed in front of the model statement, as follows:
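A sketch of labeled models (data set name hypothetical; the labels follow the text above):

```sas
proc reg data=regdat;
   Full:    model y=x z;  /* output heading: Model: Full    */
   Reduced: model y=x;    /* output heading: Model: Reduced */
run;
quit;
```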

This produces one section of output for each model. The first is identical to that shown above, except for the heading that says "Model: Full" instead of "Model: MODEL1." The second section looks like this:


Proc reg, like proc plot, does not automatically quit running when it encounters a run statement. Unless another proc follows, it will wait for more statements to be submitted. For example, if you added the following lines to the program above, left them selected as shown, and clicked submit, SAS would produce the output for the next model, without re-running the rest of the program. (Any selected text in the Enhanced Editor is submitted without the rest of the program, a source of great irritation when done accidentally!) If you want to make proc reg quit, issue a "quit;" statement at the end of the program. One of the minor benefits of this is that it leaves the Output window on top, rather than bringing the Editor back up. Experiment with this a bit and you'll see.

Sometimes it is convenient to have the results of the regression, such as the parameter estimates and other statistics, in a data set. Proc reg uses an option in the proc statement, "outest=", to do this. Other options can be added to control what statistics are included. (You might notice that the editor has some trouble with the color coding on these, but even if they aren't blue, they still work.)
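A sketch of the outest= option (names hypothetical; the rsquare option is one example of a statistic-adding option):

```sas
proc reg data=regdat outest=est rsquare;
   model y=x z;
run;
quit;

proc print data=est;  /* one observation per model */
run;
```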


Notice the naming of some of the variables, how they begin and end with an underscore. It is important to include these underscores when referencing the variables, since they are part of the name. Next, we will give some example options for the model statement, which are placed after a slash. Some of these options control what goes in the output, and some affect the modeling process. The "noint" option is used to fit a model with no intercept (recall the intercept is automatically included in the examples above). The "VIF" option adds a "Variance Inflation" column to the parameter table, and the "P" option gives a table of "Output Statistics" that includes predicted values of y (y-hats) and the "Residual," which is the difference between y and y-hat.
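A sketch of the model statement options (data set name hypothetical):

```sas
proc reg data=regdat;
   model y=x z / noint vif p;  /* no intercept, variance inflation, */
run;                           /* predicted values and residuals    */
quit;
```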


Another statement in proc reg is the output statement. This creates a data set, but unlike the "outest=" option in the proc statement, which gives observations for each model, this data set will contain output statistics for each observation in the data, such as those printed in the example above. There is no slash in the output statement, the options simply follow the word "output." You should specify a data set name with the "out=" option, and then list the statistics you want included, such as predicted values, studentized residuals, etc. Each statistic has a keyword that requests it, then you must specify a variable name to use in the output data set. Thus "p=yhat" means to include the predicted value using the variable name "yhat." Some other examples are "r=" for residuals, "ucl=" and "lcl=" for the upper and lower confidence limits of the prediction. See the Online Documentation for the complete list.
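A sketch of the output statement (data set and new variable names hypothetical):

```sas
proc reg data=regdat;
   model y=x z;
   output out=regout p=yhat   /* predicted values as "yhat" */
                     r=resid; /* residuals as "resid"       */
run;
quit;
```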

Of course you can take this data set and make plots with proc plot. But proc reg also has its own plot statement built in. You can plot any of the variables in the original data set, plus the same new variables that are available in the output statement. These are named like the keyword that specifies them in the output statement, followed by a period. Thus, the predicted values are given by "p." and the studentized residuals are given by "student.", for example.
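A sketch of proc reg's built-in plot statement (data set name hypothetical):

```sas
proc reg data=regdat;
   model y=x z;
   plot student.*p.;  /* studentized residuals vs predicted values */
run;
quit;
```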


Exercises: Use the used cars data from previous lessons. In proc reg, do the following (This should all be done in one program, with one proc reg step): a) Compute a regression model for price based on miles and age of the car, and a second model for price based on miles alone. Use labels for the models. b) Create a data set which contains the parameter estimates and rsquare values for each model. c) Create a data set containing the predicted values and residuals (as well as the original data, which is included automatically). d) Plot price vs miles and price vs year using a plot statement in proc reg. e) Plot the residuals against the predicted values for each model using plot statements in proc reg (note: these plots use the residuals and predicted values produced by the immediately preceding model statement). f) Print the data sets created by proc reg. Lesson 14: Proc Transpose and Proc Report Proc transpose is used when you need to "turn your data on its side." Basically, it turns observations into variables and variables into observations. Suppose we had the following sales data in Excel:


Now we can clearly see that this data consists of four columns and five rows. The header row across the top and the first column are not really data. However, when we read this into SAS, we will need to treat the first column as a variable too, or we will lose the week designation that goes with each row. The following program will do the job:
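The screenshot of the program is not reproduced here, but a data step along these lines would do it (the salesperson names are made up for illustration):

```sas
data sales;
  /* the week column is read as a variable so the row
     labels are not lost */
  input week $ anne bob carl;
  cards;
wk1 10 12 9
wk2 14 11 8
wk3 9 13 10
wk4 12 10 11
;
run;
```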

And the result looks like this:

Now suppose what we really wanted was for each salesperson to be an observation, and each week to be a variable. The id statement tells proc transpose which column holds the new variable names. If no id is given, SAS will use default variable names like COL1, COL2, etc. Also notice that SAS puts the old variable names in a new column called _NAME_. This variable name can be changed using a "name=" option in the proc transpose statement.
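A sketch of such a transpose step, assuming a data set named sales with a character variable week:

```sas
proc transpose data=sales out=flipped name=salesman;
  /* the week values (wk1, wk2, ...) become the new
     variable names; the old variable names land in the
     column renamed from _NAME_ to salesman */
  id week;
run;
```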


Now suppose we have another variable, say "City," and within each city we have weekly sales data for our four salesmen. The number of weeks is reduced here for brevity and also to show how certain missing values are handled. We want to transpose the variables again, but this time within each city.
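With a by statement, the transposition is repeated within each by group. A sketch, with assumed variable names (the data must be sorted by the by variable first):

```sas
proc sort data=sales;
  by city;
run;

proc transpose data=sales out=flipped;
  by city;    /* transpose separately within each city */
  id week;
run;
```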

At first glance, proc report looks like it does the same thing as proc print. In the simplest form, the output can be very similar to proc print. But you might say proc report is like proc print on steroids--it has far more capabilities. The example below shows how the default settings of proc print and proc report compare. Proc report doesn't print observation numbers, doesn't skip a line below the headings, and aligns the columns differently. But that's just the beginning. The data for this example is here.
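The default comparison can be produced with two bare steps (data set name assumed):

```sas
proc print data=sales;
run;

proc report data=sales nowindows;
run;
```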


Notice the "nowindows" option in the above example. We will always use this option in this lesson. Proc report comes with its own window-based user interface, which will come up automatically if you do not specify "nowindows." We will not be studying the report window in this lesson. The column statement is used to specify which variables are printed and the order in which they appear. A define statement is used for each variable to specify the options associated with that variable. In the example below, name is defined as a group variable. Notice that the names are no longer repeated on each line (there will be more significance to the group variable later). For each variable, an expression in quotes has been included--this is the same as a label, and can be controlled with split characters similar to what we have seen in proc print. Notice that the last column has its label words split in an awkward way. This is due to a column width of 5, which comes from the format given in the define statement for the last column. Observe the three different formats that were used, and the effect of each.
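A sketch of a report with column and define statements (the variable names and formats here are assumptions, not the lesson's exact ones):

```sas
proc report data=sales nowindows;
  column name month cheese cream;
  define name   / group 'Sales person';
  define month  / 'Month';
  define cheese / 'Cheese sales' format=comma8.2;
  /* a narrow format such as 5. forces a column width of
     5 and can split the heading words awkwardly */
  define cream  / 'Cream sales' format=5.;
run;
```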


Now for a few more options to dress up the output. First, in the proc report statement, we add the options "headline" and "headskip". Headline draws a line under the headings, and headskip inserts a blank, or "skipped," line beneath them. The formats have all been changed to accommodate the money amounts properly, but two new options are demonstrated in the define statements. For the cheese variable, a "width=10" option is included. This expands the column for that variable to 10, so that the column is now wider than the format specifies. For the cream variable, a "spacing=5" option is added. This puts five blank spaces in front of the cream column. Notice how this affects things differently, compared with widening the whole column.

Next, we have two new statements, break and rbreak. Break works together with the "group" designation given for name. It breaks the report at each change in the value of the group variable listed in the break statement. Notice that it says "break after". This gives the location of the break information, which can be either before or after the group it refers to. After the slash come some options. The "ol" means overline--it prints a line above the break line. There is also "ul", which means underline, "dul" for double underline (drawn with equals signs), and "dol" for double overline. "Skip" means to insert a blank line after the break line, and "summarize" means to print summary statistics for the numeric columns. The default is sum, but many other statistics are available. The specific statistic is given as an option in the define statement: "sum" for sum, "min" for minimum, "max" for maximum, "mean" for mean, etc. The rbreak statement means "report break," so it provides summarization for the whole report. Similar options apply, except that no variable is specified, since it applies to the whole report and not to the grouping of any variable. This provides the grand totals in our example.
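Putting these options together, a sketch might read (variable names and formats assumed):

```sas
proc report data=sales nowindows headline headskip;
  column name month cheese cream;
  define name   / group 'Sales person';
  define cheese / 'Cheese' format=dollar8.2 width=10;
  define cream  / 'Cream'  format=dollar8.2 spacing=5;
  break after name / ol skip summarize;  /* subtotals   */
  rbreak after     / dol summarize;      /* grand total */
run;
```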


The next example shows how you can duplicate a column for the purpose of getting different summary statistics. In the column statement, new columns are introduced using the syntax "variable=newvariable". The new variables are then given their own define statements, where the summary statistic "mean" is now included. The other thing that is changed is that the columns have been grouped, inside parentheses, and each group has a label specified as the first item in the parentheses. This provides a group heading for all the variables in the parentheses. Also, the labels have a minus sign at the beginning and end. This causes a line to be drawn out to the width of the column group for each label. (The headings are incorrect)


You can also calculate columns from the other columns. The new column must be listed in the column statement, and will have a define statement as well. To actually calculate the values, a compute block is used. It begins with the statement "compute newvariable" and ends with "endcomp;". The formulas for calculation use data step syntax, for the most part. However, there is a two-level name for some variables, as shown here. The variables are given as "variable.sum" because they are used as sum variables for summarization purposes. If we did not have the group variable, this would not be necessary.
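A sketch of a compute block (variable names assumed):

```sas
proc report data=sales nowindows;
  column name cheese cream total;
  define name  / group;
  define total / 'Total' format=dollar10.2;
  compute total;
    /* two-level names are needed because cheese and
       cream are summarized within the name groups */
    total = cheese.sum + cream.sum;
  endcomp;
run;
```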


Perhaps the idea of the two-level variable names, like cream.sum, will be clearer if we leave out the detail levels and show these variables only as summary variables. Taking the month variable, which relates to the detail rows, out of the column statement and deleting its define statement will do that. We also no longer need the break statement, and we add an overline to the rbreak statement, since all the dividing lines were in the break statement before.


Proc report has many more features and capabilities. We have only introduced some of the basic ideas here. If you are ever in a situation that calls for the generation of periodic reports on the same kind of data, it is worthwhile to spend the time to create a nice report that can be used over and over.

Exercises:
1. Begin with this data step:

data one;
input trt $ s1 s2 s3 s4 s5;
cards;
Cont 4 5 5 4 6
Fast 5 5 6 6 5
Drug 7 7 6 5 6
;

Suppose this is data from a designed experiment with three treatments and five subjects. To analyze the data in SAS, we will need it rearranged so that there are 15 observations, each having the treatment, subject identifier, and just one subject response. Use proc transpose to rearrange the data in this way.

2. Begin with this data step:

data one;
input trt $ subject $ t1 t2 t3;
cards;
Cont s1 4 5 5
Cont s2 5 4 6
Cont s3 6 6 5
Fast s1 5 5 6
Fast s2 5 6 7
Fast s3 7 8 8
Drug s1 7 7 8
Drug s2 6 8 9
Drug s3 5 7 9


;

Use proc transpose to turn the subject and time around within each treatment, calling the new time variable "time" and using the subject values for the new subject variable names.

3. Refer to the used cars data from previous lessons (usedcars3.txt). Create a report that groups the cars by year and gives the average mileage and price for each year. Include nice headers and at least one example of each of the format, width, and spacing options. Give each column a heading in the define statements.

4. Again using the used cars data, create a report that summarizes the prices for each make, giving the total and mean but omitting detail information.

Lesson 15: Tabs and Other Delimiters, Controlling Observations, Variables, and Output Data Sets

There are essentially three ways that data can be identified in a raw data file. Either 1) the position of each value in the file must be known, by columns or some other scheme, 2) the values must be identified by a name or other symbol, or 3) the file is "delimited" by a symbol that tells where one variable stops and the next begins. The last case is the subject of this lesson. We also discuss some issues about missing values in list input. We have already dealt with delimited files where the delimiter is a space. This is what we were doing whenever we worked with the list input style. Now consider the program below. Notice that there are two missing values, indicated by a dot (period). Although all the data values are numbers, the second variable has been defined as character in the input statement. In the output, the missing character value is blank, but the missing numeric value is represented by a dot (i.e., the decimal point of a number that isn't there). If column input is used, blanks in the data will be correctly interpreted as missing values, but with list input they would cause a problem.
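A sketch of such a program (the names and values are made up):

```sas
data one;
  /* id is read as character even though the values are
     digits; missing values are entered as dots */
  input name $ id $ age;
  cards;
Amy 101 9
Ben . 10
Cal 103 .
;
run;

proc print data=one;
run;
```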

The data in the above program was entered using tabs between the values. The SAS editor does a funny thing with tabs--although you can move the cursor back and forth over them, and the cursor will "jump" just as you expect, when SAS executes the program, it interprets the tabs as a series of spaces. If you try reading the data above with column input, you will find that the data are located in columns 1, 5, and 9. However, when the data are read from an external file, we encounter a much different situation.


Fortunately, there is an option for the infile statement, called expandtabs, to fix this. It still requires that the "dots" be in place for the missing values, though.
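For example (the file path is assumed):

```sas
data one;
  /* expandtabs converts tab characters to spaces so that
     list input works; dots are still needed for the
     missing values */
  infile 'c:\mydata\scores.txt' expandtabs;
  input name $ id $ age;
run;
```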

Of course, our delimiters need not be tabs or spaces--they can be almost any character. Commas are probably the most common alternative, though sometimes we see hyphens or slashes used. In such cases, we can use the "dlm=" option, or spell it out, "delimiter=". Here we have replaced the tabs with commas in the external file.
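For a comma-delimited file, the infile statement might read (path assumed):

```sas
data one;
  infile 'c:\mydata\scores.txt' dlm=',';
  input name $ id $ age;
run;
```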

There are special delimited files that have more complex rules. You may have seen "csv" files, for example, which can be produced and read by Excel. That's an abbreviation for "comma separated values," which implies that they use a comma as a delimiter. But there is more to it than that. Missing values are not represented by dots or spaces, just by two delimiters in a row. There is also a feature that text with embedded commas will be saved inside quote marks, so that the embedded commas are not interpreted as delimiters. To read this kind of file, use the "dsd" option in the infile statement. The "dlm=" option can be used with it, in the rare event that the delimiter is something other than a comma. Here is a csv file, created in Excel, with the same data and empty cells for missing values.
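Reading such a file might look like this (path assumed):

```sas
data one;
  /* dsd: two commas in a row mean a missing value, and
     quoted strings may contain embedded commas */
  infile 'c:\mydata\scores.csv' dsd;
  input name $ id $ age;
run;
```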


The data import wizard will also import csv and tab-delimited files, as well as a number of other types, so you may want to check that out as well. In a previous lesson, the obs and firstobs options were used to control which observations were included in a data set. In this lesson, some ways to fine-tune the selection of observations and variables are introduced. A text file containing the data below can be downloaded here.

You can decide whether or not an observation is included based on the value of a variable.

Or you can go the other way.

You can use comparison operators, like <, >, <=, >=, and their corresponding character abbreviations, lt, gt, le, and ge. If you use a character variable for the test, enclose the value in quotes and remember that case matters.


You can use logical operators like "and" and "or," and control order of operations using parentheses.
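For example, a compound condition in a subsetting if (the data values are made up):

```sas
data pick;
  input name $ age height weight;
  /* keep an observation only when the condition is true */
  if (age ge 10 and height gt 50) or weight ge 90;
  cards;
Amy 9 48 88
Ben 11 52 95
Cat 10 51 80
;
run;
```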

When reading from another SAS data set, such as when using a set statement, the same syntax can be used, or, you can use "where" in place of "if."

The use of if and where can be summarized as follows:
IF: use in any data step, but not in proc steps.
WHERE: use only when the input is a SAS data set, in data or proc steps.

A data step can create more than one data set at a time. Perhaps you want to create a temporary data set for immediate use in the work directory, but also want to save it to a permanent library, as shown in the example below. Note that the data set names were placed on separate lines and indented in order to make it clearer to the reader what is being done.
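A sketch (the library path, file path, and variables are assumed):

```sas
libname mylib 'c:\sasdata';

data work.cars
     mylib.cars;
  infile 'c:\mydata\usedcars.txt';
  input make $ year miles price;
run;
```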


Suppose, on the other hand, that you want to create different data sets. The following program will create three data sets, each containing different variables, but the same number of observations as the complete data.
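For example, using keep= data set options (variables and values assumed), all three data sets receive every observation but different variables:

```sas
data ages  (keep=name age)
     sizes (keep=height weight)
     both  (keep=name height);
  input name $ age height weight;
  cards;
Amy 9 48 88
Ben 11 52 95
;
run;
```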

You can also send different observations to different data sets. To do this, construct a logical test involving "if-then" clauses together with the "output" keyword to specify the target data set.


Notice that the three data sets being created are still named in the data statement. Data set options, such as keep and drop, can be used here just as in the last example. The input statement comes next, then the output instructions. As each observation is read from the cards (or infile, or source data set), the three if statements are processed, and whenever the logical condition is met, the observation is output to that data set. Some observations are output to more than one data set, because more than one condition is met. If the intention is to split the data into mutually exclusive and exhaustive sets, then "if-then-else" clauses can be used to good effect. An "else statement" is only executed if the previous "if statement" is not (the logical test fails).
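A sketch of the mutually exclusive version (the cut points and values are made up):

```sas
data light medium heavy;
  input name $ weight;
  /* each else runs only when the previous test fails */
  if weight < 75 then output light;
  else if weight < 100 then output medium;
  else output heavy;
  cards;
Amy 70
Ben 95
Cal 120
;
run;
```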

Exercises: Use the example data given at the beginning of the lesson to do the following. Print all data sets.
1. In one data step, create three data sets containing the following combinations of variables: a) Name and age, b) Height and weight, and c) Name and height.
2. In one data step, create three data sets containing the following groups of observations, with all variables: a) All observations for names that come after "E" in alphabetical order, b) All observations with ages of 9 or 10, and c) All observations where the height is less than or equal to 48 and the weight is greater than or equal to 90.
3. In one data step, create three data sets containing mutually exclusive observations (use if-then-else): a) All observations where the weights are less than 75, b) All observations where the weights are at least 75 but less than 100, and c) All observations where the weights are at least 100.

Lesson 16: Multiple Lines, Multiple Observations

There are a number of situations where the lines in a raw data file do not correspond to the observations in a SAS data set. In this lesson, we explore ways to read data files that have multiple observations on one line and multiple lines for one observation. Since the latter is usually simpler, we will start with that. Consider the following data step (data here):


By default, when SAS runs out of data on one line but still has variables to read, it goes on to the next line. In past lessons, we found that sometimes this causes problems, but here, it does what we want. In fact, it doesn't matter whether some observations take up one line and some take up two, as long as each new observation starts at the beginning of a line. If an informat causes problems with a short line, the truncover option or the colon modifier can be used, as demonstrated in previous lessons. Do not use missover, though! As long as you read all of the variables (in order), the above method works fine. There are some other tricks available if you do not want to read everything on a line, or want to read things in a different order. A slash in the input statement tells SAS explicitly to go to the next line. In this case, if there were more variables on the line after first and last name, the slash would still cause SAS to go to the next line after reading Lname. Of course, now there is no flexibility in terms of having some lines broken and some whole--all observations must be on two lines.
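A sketch of the slash notation (names and values made up):

```sas
data people;
  /* the slash forces a move to the next line after the
     names are read */
  input fname $ lname $ / age height;
  cards;
John Smith
12 60
Mary Jones
11 58
;
run;
```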

Another kind of notation explicitly numbers the lines in an observation. Again, there must be a consistent pattern in the data, so that all observations occupy the same number of lines. Both of these examples give the same output as shown above.
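The line-pointer version of the same read might look like this (names and values made up):

```sas
data people;
  /* #1 and #2 point to the first and second lines of
     each two-line observation */
  input #1 fname $ lname $
        #2 age height;
  cards;
John Smith
12 60
Mary Jones
11 58
;
run;
```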


The line-number ("#n") notation allows skipping around within the lines that make up an observation, as shown here: (This example needs to be fixed; it is reading the same variables twice. This works best with column or formatted input.)

Now let's look at the opposite case--where there is more than one observation in one line of the source file. Consider the following data step:

In this case, only two observations were included in the data, because SAS automatically goes to a new line when it reaches the end of the input statement. What we need is a command to stop it from doing that. But first, a word about iterations. When a SAS data step reads from an external file or another data set, there is an automatic loop in the data step that repeats over and over until the input data runs out. One pass through this loop is called an iteration. Many things can happen during an iteration, including the creation of new variables by assignment statements, and the elimination of observations by "if" or "where" statements. In the simplest cases, one iteration reads one line from the raw data


and outputs one observation, but many modifications are possible. In general, the commands for an iteration begin with the input statement (or set or merge) and go to the end of the data step, excluding cards. An iteration ends with an implicit output, or, if there is an explicit output statement, simply with the last of the data step commands. There can be other loops inside an iteration, as we shall see momentarily. But the iteration itself is an implied loop that repeats as long as there is input data available. To hold the input line, that is, to keep the data pointer from moving to the next line, we place an "at" symbol ("@") or "double at" ("@@") at the end of the input statement. The "@" holds the line during the current iteration only, while the "@@" holds the line across multiple iterations. In the example above, the iteration consists only of reading name, ht, wt, and age, then outputting an observation. Therefore, we need to hold the line for the next iteration, which requires the "double at."
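The double-at version might be sketched as (names and values made up):

```sas
data kids;
  /* @@ holds the line across iterations, so several
     observations can be read from one data line */
  input name $ ht wt age @@;
  cards;
Al 60 110 12 Bo 58 102 11 Cy 62 120 13
;
run;
```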

In Lesson 14 we used the following example in an exercise on proc transpose. We first read the data into five variables, then, transposed it so that there was a treatment and one response for each observation (together with a "name" column).

Now, we will see how to read this directly in the form we need. In the example below, the lines from "input" to "end;" determine one iteration. We first read the "trt" variable, then hold the line for this iteration. Then we start a "do loop." The "do loop" allows us to repeat some commands according to a pattern. It begins with the keyword "do" and ends with an "end" statement. The variable s is the index of the loop. It will hold the values from one to five, in turn, as the loop is executed. Each time through the loop, the input statement directs SAS to read a value of y, hold the line (for the current iteration), after which the output statement writes an observation to the data set. When the loop is finished, the iteration is complete, and SAS moves to the next line in the data and begins another iteration.
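Using the Lesson 14 exercise data, the step described above can be sketched as:

```sas
data resp;
  input trt $ @;        /* read trt, hold the line for
                           this iteration */
  do s = 1 to 5;
    input y @;          /* read one response, keep holding */
    output;             /* one observation per response   */
  end;
  cards;
Cont 4 5 5 4 6
Fast 5 5 6 6 5
Drug 7 7 6 5 6
;
run;
```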


Next, we put together the do loop and an if-then-else construct. Note how the size values are defined. You must be careful with the length of the variables in this case, since it will be set by the first value given in the program. (Another way to handle it is to use a length statement before the input statement.)

Exercises: Write data steps to read the following data (copy and paste it into cards just as it is). Print the results.
1. A series of y values:
13 25 22 17 19 11 16 18 21 14 17 20 18 15
2. Name, age, and grade.
Kelly 9 3 John 10 3 Mark 10 4 Joan 11 4 April 8 2 Larry 9 2 Daniel 11 5
3. Treatment and four responses for each. Read in univariate style, with a treatment and one response per observation.
ctrl 77 69 72 79
trad 87 96 89 82
new 89 94 96 81
4. Treatment, gender (male), three responses, gender (female), three responses. Read in univariate style, with a treatment, gender, and one response per observation.
ctrl M 77 69 72 F 79 81 72
trad M 87 96 89 F 82 99 85
new M 89 94 96 F 81 83 87


Lesson 17: Generating Random Data

SAS provides several functions for generating pseudo-random data. The most popular are the functions that provide normal and uniform distributions. Uniformly distributed values may be generated by the uniform(seed) function (alias ranuni), which gives random numbers in the interval [0,1). Standard normal values are generated with the normal(seed) function (alias rannor). In both cases, using a seed of 0 gives a random start based on the system clock. For publication, specify a seed of your choice so that others may duplicate your values.

Note that there is no source data (either raw or other SAS data set) for this data step. There is only one iteration of the data step, therefore we control the entire process using a do loop. The next example shows how to generate a simulated die toss. The die tosses are, of course, integers from 1 to 6. So we need to convert from a uniform interval on [0,1) to a uniform discrete distribution with values from 1 to 6. Multiplying the uniform values by 6 gives the interval [0,6). The "int" function takes the integer part and discards the decimal, so now we have integers from 0 to 5. Adding one gives the desired result.
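The die-toss step can be sketched as (the number of tosses is an assumption):

```sas
data dice;
  do i = 1 to 100;
    /* [0,1) -> [0,6) -> integers 0-5 -> integers 1-6 */
    toss = int(6*uniform(0)) + 1;
    output;
  end;
  drop i;
run;
```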

If you want to generate random normal values to simulate a population, you need to know the standard deviation and mean. You multiply the standard normal values by the standard deviation, then add the mean. Say we wanted heights of male college students, and believed the mean was 70 inches and the standard deviation was 5 inches. Then, the following program would give a good simulation.


Perhaps it would have been more satisfying to write the equation above as x=70+normal(0)*5. The result is the same, of course. But we like to think of the mean as the value around which the population varies, so it makes sense to start with the mean, then add the term that creates the variation. A similar strategy can be used to obtain function values for a series of numbers. For example, suppose you wanted to make a table of the probabilities for a binomial distribution with n=10 and p=0.2. The following program gives the cumulative probabilities. Notice that the loop counter (index) is actually a variable we want to keep, and is used in the calculations.
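One way to build such a table uses the probbnml function, which returns cumulative binomial probabilities (whether the lesson used this particular function is an assumption):

```sas
data binom;
  do x = 0 to 10;
    /* cumulative P(X <= x) for n=10, p=0.2 */
    cumprob = probbnml(0.2, 10, x);
    output;
  end;
run;
```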

Or, suppose you'd like to graph a parabola in SAS. In this example, the loop counter is used in the calculations too, but this time, we don't increment it by 1, but in steps of 5 each time the loop executes.


Suppose we'd like to simulate a discrete distribution with unequal probabilities for each value, such as the following:

x   P(x)
1   0.1
2   0.2
3   0.4
4   0.2
5   0.1

This can be done by "cutting up" the uniform interval and assigning different values to different sized parts of it.
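A sketch of the cut-up approach (the sample size is an assumption):

```sas
data sim;
  do i = 1 to 1000;
    u = uniform(0);
    if u < 0.1 then x = 1;        /* piece of width 0.1 */
    else if u < 0.3 then x = 2;   /* width 0.2 */
    else if u < 0.7 then x = 3;   /* width 0.4 */
    else if u < 0.9 then x = 4;   /* width 0.2 */
    else x = 5;                   /* width 0.1 */
    output;
  end;
  drop i u;
run;
```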


Exercises:
1. Generate 1000 tosses of two dice, calculate the sums, and make a bar chart for the sums.
2. Simulate 10,000 observations from the following distribution and print a frequency table of the results.

x   P(x)
0   0.2
1   0.3
2   0.5

3. Suppose the population of male college students has a mean height of 69 inches and a standard deviation of 4.5 inches, while the population of female college students has a mean height of 64 inches and a standard deviation of 3.5 inches. Simulate heights for 50 male and 50 female college students. Each observation should include a gender variable and a height variable. Use proc means to see how close the mean and standard deviation of your simulated values come to the specified values.

Lesson 18: Using Arrays to Program With Variables

An array in SAS is a data step programming tool that allows us to reference a series of variables mathematically, most often as part of a loop. The array itself is not part of the resulting data set. It is only a temporary structure used during the data step to manipulate variables. An array essentially assigns an index number to each variable in the array. Then statements in the program can be used to calculate the index number of the variable to use or set its value. In the following example, the array definition also creates the variables that are referenced by the array. The syntax consists of the "array" keyword, followed by the name of the array (which doubles as the base of the variable names) and then the number of variables to create, in parentheses. The variable names will have their index number appended. When used in an array reference, the index (or a mathematical expression to generate the index) is enclosed in square brackets.
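A minimal sketch of such an array (the assigned values are an assumption):

```sas
data one;
  array x (5);        /* creates variables x1-x5 */
  do i = 1 to 5;
    x[i] = 2*i;       /* the index drives the values */
  end;
  drop i;
run;
```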

Notice that we have created just one observation. The do loop is cycling through the variables, rather than through observations. Study the correspondence between the variable names shown in the data set and the values of the index of the array, which was also used to calculate the variable values. An


implicit output was used in this example. Since only one observation was created, there was no need to specify an "output." In the next example, we add another loop in order to create 10 observations. This time an explicit output is needed, just before the end of the outer loop, when the values of an observation have been assigned, and the observation is ready to be saved. We also demonstrate that mathematical expressions can be used to control the index values, and that loop counters can be creatively used in assignment statements as well.

The names of variables in an array are not restricted to this form, though. Nor does the array have to create new variables. In the next example, we use a format statement as the first line in the data step. This creates the three variables, length, width, and height, while at the same time saving a format for them. Next, we define an array called "dims." Instead of giving a number of variables to create, e.g. "(3)," we actually list the names of the variables to be used in the array. The effect of this is that the array reference "dims[1]" is associated with the variable "length," "dims[2]" is associated with "width," and "dims[3]" is associated with "height." In the example, we show that both the actual variable names and the array references can be used in the program. This program shows how you might go about adding a simulated "random measurement error" to the measurements of a rectangular solid.

Suppose a teacher gives 5 quizzes and drops the lowest score. To "drop" the lowest score means to replace it with zero. Here is a SAS program that will do that, while also keeping the information about the dropped score. This example differs from the previous ones because the data step is reading from cards. Therefore, the "iteration" of the data step comes into play. The statements in the data step repeat for each observation read from the data. Here, the array statement is given first and creates the variables q1-q5. However, it can also be placed after the input statement, allowing the input statement to create the variables. Either way, there is no conflict. Two new variables are created, "low" to hold the low score, and "lowi" to hold the index of the low scoring variable. These are initially assigned the values of q1, then the do loop compares the values to each of the other scores, to see if any are lower. When the loop is finished, the proper variable is assigned a zero value.
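A sketch of the drop-the-lowest step (the names and scores are made up):

```sas
data quizzes;
  array q (5);              /* q1-q5 */
  input name $ q1-q5;
  low = q1;                 /* start with the first score */
  lowi = 1;
  do i = 2 to 5;
    if q[i] < low then do;  /* found a lower score */
      low = q[i];
      lowi = i;
    end;
  end;
  q[lowi] = 0;              /* zero out the low score */
  drop i;
  cards;
Kim 7 9 8 6 10
Lee 5 8 9 9 7
;
run;
```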


Exercises:
1. Create a data set that contains the first four powers of the numbers 1 through 10 (e.g., 2, 4, 8, and 16 are the first four powers of 2). Use an array to assign the values to each variable. Use only one assignment statement, and take advantage of a loop index to assign the right power to each variable.
2. For the following data, use arrays to find the highest score in each row, then create adjusted scores so that each is a percent of the maximum. Assume that the columns are students and the rows are quizzes.

10 9 7 8 5
5 8 4 6 8
4 2 6 4 9
3 5 8 4 5

3. Use the same data as #2, but consider each row to be the scores of one student. Use an array to move the smallest score to the last position. You will compare each pair in turn, and if the first is smaller, switch them. (Note there are four pairs to check.) You will need a temporary variable to hold one value while you do the switch (i.e., x2-->tmp, x1-->x2, tmp-->x1).

Lesson 19-20: Project Yahtzee Simulation

In the game of Yahtzee, five dice are tossed, and various combinations of numbers, similar to poker hands, are assigned point values. In the game, dice can be selected and re-tossed, but we will focus on calculating the probabilities for the first toss only. We will also deal only with the "lower half" of the score card in the game. For the interested student, continuing this project to account for the complete rules of play would be an entertaining challenge. Anyone not familiar with Yahtzee should try a web search for the rules of the game. Some sites have applets that let you play online. All you really need to know for this lesson, though, is which combinations are counted. We will call these "hands," as the combinations in poker are called.
The hands in Yahtzee are:
Three of a Kind (three of one number and two others that are different)
Full House (three of one number and two of another number)
Four of a Kind (four of one number and one other)
Yahtzee (five of the same number)
Small Straight (four consecutive numbers)
Large Straight (five consecutive numbers)
Chance (anything that does not fit the above patterns)

We begin by creating an array to hold the die tosses. There are five dice, so there will be five elements in the array. A do loop can be used to "toss" the dice.
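The toss can be sketched as:

```sas
data yahtzee;
  array x (5);                      /* five dice */
  do i = 1 to 5;
    x[i] = int(6*uniform(0)) + 1;   /* toss one die */
  end;
  drop i;
run;
```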


That gives us one toss. To simulate the game, we will want to toss the dice many times and see what the probability of getting each scoring combination, or hand. We will need another loop, surrounding this one, to give these repeated observations. Note that an explicit output is now needed. Since this program is going to get rather complicated, we will pay close attention to issues of style and readability. Putting in comments to identify the beginning and end of major loops is helpful. Care should be taken with indenting, to make sure all lines associated with a particular loop are indented at the same level. The statement that begins a loop and the corresponding end should be at the same level of indenting, and statements within the loop should be indented two spaces from the level of the loop. Statements within nested loops are indented again.

Suppose we systematically build up the identification of the hands. There are many ways to do this. Some are easier to program, and some are more efficient from a processing standpoint. At the beginning, it may not be clear what the best method is, so you should try some of your own ideas before reading on. The solution presented here is something of a compromise. It may not be the easiest to program, nor the most efficient method. To start, let's see if we can identify a Yahtzee. Now, Yahtzees are quite rare, so we can't rely on getting one by doing 10 random tosses. The best thing is to put in a temporary piece of code that will artificially give us a Yahtzee, so we are sure to have something to identify.


That was easy, huh? OK, now you should take some time and think about what is required to identify a "Four of a Kind." Consider all the possible ways that one would show up in the data. How can you check for all the possibilities in an efficient way? Is there something that can be done to make the search easier? Don't read on until you've thought about it!

Well, I hope you thought about it. Maybe you came up with the idea that it would be easier to identify the hands if the dice were sorted. In fact, that is a very big help. But, sorting between variables is not such a straightforward thing. We can do something called a "Bubble Sort." It is one of the simplest sort algorithms to program. For more information, look it up on the internet (Wikipedia has a good explanation). The sort routine can be inserted after the data are generated, and before the identification part of the program. Here we have included a set of test values that are exactly backwards. The sort routine handles these correctly, along with all the random observations. Examine the sort routine thoroughly so you understand how it works, and how it makes good use of the array structure. (A drop option has also been added to the data set to streamline the output.)
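A bubble sort over the array might be inserted as shown below. This is a sketch: the loop variable names, the backwards test values, and the drop list are assumptions.

```sas
data yahtzee (drop=i j k temp);
  array x (5);
  do i=1 to 10;
    do j=1 to 5;
      x(j)=ceil(6*ranuni(0));
    end;
    /* TEMPORARY: test values that are exactly backwards (5 4 3 2 1) */
    if i=1 then do j=1 to 5;
      x(j)=6-j;
    end;
    /* bubble sort: repeatedly swap adjacent dice that are out of order */
    do k=1 to 4;
      do j=1 to 4;
        if x(j) > x(j+1) then do;
          temp=x(j);
          x(j)=x(j+1);
          x(j+1)=temp;
        end;
      end;
    end;
    output;
  end;
run;
```

Note how the array structure lets one pair of nested loops compare and swap any adjacent pair of dice; sorting five separately named variables without an array would take far more code.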


Sorting the dice means that all dice that are equal will be next to each other. Thus, to check for a Yahtzee, all we need is to find out if x1=x5. If x1 and x5 are the same, it is not possible (in sorted order) for the numbers in between to be different. Some examples of Four of a Kind (after sorting) are 12222 and 22223. As you can see, either x1=x4 or x2=x5. If it is not a Yahtzee, then these two conditions will identify Four of a Kind. When it comes to Three of a Kind, we run into a little complication. If we follow the strategy used for Four of a Kind, we would check whether x1=x3, x2=x4, or x3=x5. Consider the following examples: 11123, 12225, 24555. These would all be correctly identified. But what about 11122 and 33555? These satisfy the first and third conditions, respectively, yet they should be classified as Full House. Therefore we also need to check for a Full House before settling on Three of a Kind. The following identification routine checks for these types of hands. At each stage, we have to be very careful that all possibilities are accounted for.
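The checks just described can be written as a chain of if/else statements, placed after the sort (this assumes a character variable, here called hand, declared earlier with a length statement):

```sas
/* identification: the dice are sorted, so equal values are adjacent */
if x1=x5 then hand="Yahtzee";
else if x1=x4 or x2=x5 then hand="Four of a Kind";
else if (x1=x3 and x4=x5) or (x1=x2 and x3=x5) then hand="Full House";
else if x1=x3 or x2=x4 or x3=x5 then hand="Three of a Kind";
```

Because the Full House test comes before the Three of a Kind test, the troublesome patterns aaabb and aabbb are classified correctly.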


Now we are down to the straights. Large straights are simpler, as the only possibilities are 1-2-3-4-5 and 2-3-4-5-6. Small straights have a number of different forms, such as 1-2-3-4-6 and 1-3-4-5-6, where none of the numbers are the same, and a number of possibilities involving numbers that are doubled, such as 1-1-2-3-4, 1-2-2-3-4, and 2-3-4-5-5, to give just a few examples.
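One compact way to finish the chain, not necessarily the approach the original code used, is to build a string of the distinct sorted values and look for runs of four in it. This assumes a character variable d (length $5) that would be added to the drop list.

```sas
else if x1=1 and x2=2 and x3=3 and x4=4 and x5=5 then hand="Large Straight";
else if x1=2 and x2=3 and x3=4 and x4=5 and x5=6 then hand="Large Straight";
else do;
  /* build a string of the distinct sorted values, e.g. 1 1 2 3 4 -> "1234" */
  d="";
  do j=1 to 5;
    if index(d, cats(x(j)))=0 then d=cats(d, x(j));
  end;
  if index(d,"1234") or index(d,"2345") or index(d,"3456")
    then hand="Small Straight";
  else hand="Chance";
end;
```

This works because the dice are already sorted, so the distinct values come out in increasing order, and a doubled number (as in 1-1-2-3-4) simply collapses to a single digit in d.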


Now we can change the number of observations to 10,000 and use proc freq to count the hands (DO NOT PRINT!). Here is a table of the theoretical proportions. These are given with 4 decimal places, which is convenient for a simulation of 10,000, because if you ignore the decimal point, it is the expected number out of 10,000.

    Yahtzee          .0008
    Four of a Kind   .0193
    Three of a Kind  .1543
    Full House       .0386
    Large Straight   .0309
    Small Straight   .1235
    Chance           .6326

Exercises:

1. Create another array called "d" (for differences) with four variables. After the sort routine, load the "d" variables with the differences between the dice. That is, d1=x2-x1, d2=x3-x2, etc. Rewrite the identification routine to use the differences rather than the original die values. Plan your strategy by writing out what the differences look like for each hand, and try to come up with an efficient method of identifying the hands. When finished, run 10,000 simulations and compare your results (use proc freq) with the theoretical values given above.

Lesson 21: Data Null

It has been the author's experience that in many job interviews where SAS programming is an important part of the job description, there are questions about "data null." It appears that employers consider this a sort of "litmus test" of the level of a candidate's ability. Therefore, it is important that we give a little attention to this topic. The idea of "null" here is that we have a data step that actually doesn't create a data set. For a data set name, we use the special name "_null_" where the underscores are part of the name. This causes SAS to carry out the commands in the data step, but as far as the output is concerned, it is, well, "null" or nothing.


Why have a data step that doesn't save anything? Actually, it doesn't save a data set, but it can save something else, in particular, a text file. Thus, a data step can be used for report writing or the creation of "raw data" files. The process is simply the reverse of reading a raw data file. Instead of an "infile" statement, there will be a "file" statement. Instead of "input," there will be "put." Instead of informats, formats. First, let's see what a put statement does in an ordinary data step. It sends lines of text to the log. You can have character expressions in quotes and variable names in the statement. The variables will have their current values printed. The following data step has three iterations, so three lines are printed to the log.
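Something along these lines (the variable names and values are placeholders):

```sas
data demo;
  input x;
  y=x*10;
  put "x is " x "and y is " y;   /* one line to the log per iteration */
  datalines;
1
2
3
;
```

The step reads three data lines, so it iterates three times and the put statement writes three lines to the log.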

This is very useful for debugging data steps with loops and conditional statements, since you can examine the values the variables take as the data step executes. Next, add a file statement. This gives the location and filename that you want to save the results of the put statements in. Note that the log shows where the file is located, as well as the number of records written. It also shows that the data step is still writing a data set.
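For example (the file path here is an assumption; use any folder you can write to):

```sas
data demo;
  file "c:\sasclass\demo.txt";    /* destination for the put statements */
  input x;
  y=x*10;
  put "x is " x "and y is " y;    /* now written to the file, not the log */
  datalines;
1
2
3
;
```

The log will note the file's location and that 3 records were written, and it will still report that the data set demo was created with 3 observations.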


The same program, with only the data set name changed to "_null_", gives the log below. There is no "NOTE" about the data set, because none was created.

Now let's revisit the used car data from previous lessons. Suppose we begin with reading the data into a SAS data set. Then, we use data null to write some of the data to another file. Notice that the data step will iterate once for each observation in the source data set. The variables, as they are listed in the put statement, are sent to the file with just a space between them. This is a list put style, similar to the list input style.
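A minimal sketch, assuming the data set is named usedcars with variables make, model, year, price, and miles, and an assumed output path:

```sas
data _null_;
  set usedcars;                       /* iterates once per observation */
  file "c:\sasclass\carlist.txt";
  put make model year price miles;    /* list put: one space between values */
run;
```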


We can use the formatted put style to control the appearance of the output.
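For instance (the particular formats and widths chosen here are assumptions):

```sas
data _null_;
  set usedcars;
  file "c:\sasclass\carlist2.txt";
  put make $10. model $12. year 4. +2 price dollar9. +2 miles comma8.;
run;
```

Each value is written with its format, so the columns line up and the numbers get dollar signs and commas.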

But we can do way more than that! Here is an example that uses the internal variable "_n_" (the underscores are part of the name), which keeps track of the observations, to control when to print a heading. So, if we are on the first observation, the first if statement puts the two header lines into the file. The second if condition is not true, so it does not execute, then the final put statement sends the first observation to the file. For the remaining observations, the first if condition is false, so the header lines are never printed again. The second if condition has two parts. The observation number must be greater than 1, and the value of make must be different from the previous observation. The lag function allows us to compare values between observations, with lag1 being the previous observation, lag2 being the second previous observation, etc. So, after the first observation, if there is a difference in makes, a blank line will be inserted, before the last put statement sends the detail information.
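A sketch of such a step (header text, formats, and path are assumptions):

```sas
data _null_;
  set usedcars;
  file "c:\sasclass\carreport.txt";
  if _n_=1 then do;                       /* header, first observation only */
    put "Make      Model        Year  Price     Miles";
    put "---------------------------------------------";
  end;
  if _n_>1 and make ne lag(make) then put " ";   /* blank line at new make */
  put make $10. model $12. year 4. +1 price dollar9. +1 miles comma8.;
run;
```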


Another Example:
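The original example is not reproduced here; the following is one plausible data _null_ report along the same lines, with all names, formats, and the path assumed. It accumulates a total price and prints it after the last observation, using the end= option on the set statement.

```sas
data _null_;
  set usedcars end=last;            /* last=1 on the final observation */
  file "c:\sasclass\cartotal.txt";
  total + price;                    /* sum statement: running total */
  put make $10. model $12. price dollar9.;
  if last then do;
    put "------------------------------";
    put "Total value: " total dollar12.;
  end;
run;
```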


Lesson 22-23: SQL

SQL stands for "Structured Query Language." SQL is an industry standard; that is, it is not something invented by SAS, but is used by all major software companies that deal with databases. Although each company has its own "enhancements," the basic language is the same for all of them. Thus, what you learn here will be applicable in Microsoft Access, Oracle, Sybase, and many others. As we have seen in an earlier lesson, in database terms, a data set is called a "table," an observation is called a "record," and a variable is called a "field." Most of the work in databases is done by means of "queries." Query, of course, is a word related to "enquire" and "question." A query, more or less, is a question asked of a database, for which the answer is usually some portion of one or more tables, or a summary of the data in the tables. SQL is used in SAS by invoking Proc SQL, then submitting queries. Proc SQL is interactive, like proc plot and proc reg, so it will continue to accept queries after the initial statements are submitted. In fact, proc SQL does not require run statements to work. Each query is executed immediately upon submission. Proc SQL continues running until another proc step or data step is encountered, or a quit statement is submitted. We will once again refer to the used cars data set for some examples. Here is a simple query to start with.
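Something like this (the field names are taken from the examples that follow):

```sas
proc sql;
select make, model, price
  from usedcars;
```

No run statement is needed; the result appears in the output window as soon as the query is submitted, and proc SQL remains running, ready for more queries.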

A basic query begins with the keyword select. The idea is that we are going to "select" something from the database that we want to display. The results of the select statement are displayed in the output window. A select statement will have, at least, a list of variables to display, and a source table, given in the from clause. Unlike most lists in native SAS language, items in SQL lists are separated by commas. This is one of the most common sources of mistakes for beginning programmers. SQL is intended to be fairly "plain English" in nature, so we may be tempted to put commas between clauses, as in English, but SQL only uses commas for list separators. The next example shows how to form a list of variables, as well as how to use a where clause to restrict the records that are selected.
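For example, with proc SQL still running (the price cutoff is an arbitrary assumption):

```sas
select make, model, year, price
  from usedcars
  where price < 10000;
```

Note the commas between the items in the select list, but none between the select, from, and where clauses.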

There is a shortcut if you want to get all the fields from a table:
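The asterisk stands for "all fields":

```sas
select *
  from usedcars;
```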

In the next example, we show that new fields can be created by using functions (or any mathematical expressions). The count function counts the number of times a value occurs. The "group by" clause is added to the end so that the counts, as well as the sums, apply to each make. The effect is similar to the by statement we have seen in other procs. The keyword "as" is used to assign a name to the new field. This syntax can also be used to assign aliases for existing field names. Also included is a format for the price. Note that the result includes one row for each value of make, the group by field. Be careful not to select any other field which has differing values within one make, because that will force the output to give multiple rows for each make. As long as the results are unique within the group by values, you will get one row for each. Also, if you leave out the group by clause, you will get one row for each row in the table, and all the "number" and "totprice" values will be the same, the grand total for the whole table.
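The query described above might look like this (the field names number and totprice come from the discussion; the format width is an assumption):

```sas
select make,
       count(make) as number,
       sum(price) as totprice format=dollar10.
  from usedcars
  group by make;
```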

You can also add an "order by" clause to sort the results differently. The example below will sort the above output by totprice. (To reverse the order, put "desc" AFTER the sort field.) As queries get more complex, writing style becomes important to keep track of the various parts of the query. It is good to start each clause on a new line, and if the lists are long, put each item on a separate line, with appropriate indenting.
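For example:

```sas
select make,
       count(make) as number,
       sum(price) as totprice format=dollar10.
  from usedcars
  group by make
  order by totprice;
```

To sort from largest to smallest instead, the last line would read "order by totprice desc;".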

The above examples show that SQL provides useful and easy ways to extract both detail and summary information from data sets (tables). But some of the most powerful uses of SQL have to do with extracting information from multiple tables. There is some similarity between what we will do here and what can be done with a merge statement in a data step, but SQL does things differently and usually more efficiently. One major difference is that SQL never requires data to be sorted in advance. In database systems, much attention is given to a process called "normalization." This essentially means that data are split up across multiple tables in order to avoid redundancy. For example, in a sales database, you might have one table that lists customers with their contact information, another table that lists the salesmen's information, and a third table that gives individual sales, including one field to link to the customer table and another field to link to the salesmen table. Then a query might be written to extract a particular sale, including the customer information and the salesman's information for that sale, getting some of the information from each of the three tables. To demonstrate how this works, we will split up our usedcars data into two tables. In this data, all of the model names are unique to one make, so model will be the linking field. We can use SQL to create these new tables, with, can you guess? "create table" queries. Here's one. (The create table query doesn't send anything to the output window.)
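The first of the two tables might be created like this (the table name makemodel is an assumption; distinct keeps one row per model):

```sas
create table makemodel as
select distinct make, model
  from usedcars;
```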


And here's the other.
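Again the table name, cardetails, is an assumption; it carries the linking field model plus the remaining detail fields:

```sas
create table cardetails as
select model, year, price, miles
  from usedcars;
```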

So now we suppose that these are the two tables we have to get our data from, and we want to print a report with the make, model, year, price, and miles. In the list of fields selected, we now use two-level names to specify which table the variables come from (actually this is only necessary for those in more than one table). The asterisk can still be used to request all the variables from one table. In the from clause, we list all the tables being used. In the where clause, we "join" the tables by specifying which fields to match up in the two tables.
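Assuming the two tables were created as makemodel (make, model) and cardetails (model, year, price, miles), the join might be written:

```sas
select makemodel.make, cardetails.*
  from makemodel, cardetails
  where makemodel.model=cardetails.model;
```

The two-level name makemodel.make says which table make comes from, and cardetails.* pulls all the fields from the detail table, giving the full report of make, model, year, price, and miles.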

The where clause can contain the equality that defines how the tables are joined as well as other conditions or restrictions, combined by using logical expressions. More than two tables can be joined, and they need not be joined by the same field. Again, logical expressions can be used to combine these conditions. In case there are values that have no match, no rows are included for them. This differs from the merge statement in a data step, where missing values are generated in such a case. However, SQL can produce the same effect, with something called an "outer join," if needed.

Exercises:

Use SQL and the School Sales data as the basis for the following exercises. Do not create any new tables except for #7. Also, there should be no other procedures used, nor any data steps beyond those that are in the program schoolsalesdata.sas which is linked above.

1. Display the complete contents of the "sales" table, sorting by name.
2. Display the name and grade of students in the 11th and 12th grades only, sorted by grade from highest to lowest.
3. Display each student's name and her goal.
4. For each student, display name, grade, and total sales.
5. Run the following five queries. In each case, explain what SQL is counting, and why the results are different (or similar).
   a) select count(grade) as numgrade from students;
   b) select count(name) as numname from students;
   c) select count(distinct grade) as numgrade from students;
   d) select count(grade) as numgrade from students group by grade;
   e) select grade, count(grade) as numgrade from students group by grade;
6. For each grade, display grade, goal, number of students, and total goal (students*goal).
7. Create a new table that contains the name and sales (only those two fields) of students in the 12th grade. Use a select statement to display the results.

Lesson 24-25: SAS Macro

You may have used macros in Excel, or at least, you have probably heard about them. In Excel, macros are basically used to carry out a series of complex tasks by using one command (perhaps linked to a button in the spreadsheet). Thus a macro command carries out other commands. In SAS, the macro language is a language "above" the regular SAS language. It essentially generates SAS code for you. When you submit a SAS program, the first thing that happens is that SAS scans it for macro code.
The macro code is interpreted, or compiled, into SAS language statements, then the statements are executed. We must distinguish between macro code in general, and "a macro" in particular. Some macro commands work anywhere in a program, but most work only inside a macro. A macro is like a subroutine or procedure that is compiled and stored in a library, and then "called" or "invoked" by a SAS program. SAS statements that are not part of a macro are called "open code." All of the programs we have written so far have consisted entirely of open code. We begin with some ways that macro language is used in open code. A fairly convenient application is to define macro variables that can be used to change values in a program. A macro variable can be defined at the top of the program, where it is easy to find and change. It then can determine what happens further down. Here we look at a data step similar to that used in the Yahtzee program. The first two lines are assignment statements for macro variables. The percent sign is always the first character in macro language keywords. The %let commands assign values to the macro variables dice and reps. The values of macro variables are always text (character). This is because they will be inserted into the program statements that the macro language is writing for us, which become part of the program before SAS executes the program statements.
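Based on the values described in the discussion that follows (dice=5, reps=10), the step would look like this:

```sas
%let dice=5;
%let reps=10;

data toss;
  array x (&dice);
  do i=1 to &reps;
    do j=1 to &dice;
      x(j)=ceil(6*ranuni(0));
    end;
    output;
  end;
run;
```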


When the macro variable is used, that is, when we want to insert its value into the program, it is preceded by an ampersand (&). Whenever the macro interpreter sees the ampersand, it tries to interpret the value of the variable that follows. SAS calls this resolving the macro variable. So, in the array definition, you can see that the value of "&dice" will be "5", so that after interpretation, the line says "array x (5);". The next line, which sets up the loop for the number of repetitions, will be interpreted as "do i=1 to 10;". Since the number of dice tossed has to match the array size, the next line uses the dice variable again, and it will be interpreted as "do j=1 to 5;". In other words, what SAS sees after the macro compiler gets done is like this:
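That is, with dice=5 and reps=10, the resolved program is:

```sas
data toss;
  array x (5);
  do i=1 to 10;
    do j=1 to 5;
      x(j)=ceil(6*ranuni(0));
    end;
    output;
  end;
run;
```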

So, the idea is, if you have numbers or names in your code that you might want to change later, especially if they occur many times and might be anywhere in a long program, it is a good idea to put them in macro variables that are defined at the top of the program. Then you can change the values in the macro variables and automatically change all the related values in the program. The macro variables are also available in title statements, which can be very useful to keep track of the settings you use. For this to work, the title must be enclosed in double quotes, not single quotes. Single quotes will prevent resolving the macro variables, and you will get a title that says exactly what is typed between the single quotes.
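For example (the wording of the title is an assumption):

```sas
title "Simulation with &dice dice and &reps repetitions";
```

With double quotes the title reads "Simulation with 5 dice and 10 repetitions"; with single quotes it would literally read "Simulation with &dice dice and &reps repetitions".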

Now let's create a macro. The macro begins with the keyword "%macro" followed by a name for the macro, and ends with the keyword "%mend," which means "macro end." You can put the name of the macro after %mend if you like, to keep track of which macro is ending. Now, what we are actually doing here is creating a macro, not executing the statements inside. When the code below is submitted, there will not be any data set created or any output from proc print. The macro will be compiled and saved in a subfolder of the work directory called "sasmacr." There is NO message in the log that this has happened. (You will only get a message if there are errors.) This macro makes use of a %do loop. Note that there is also a %to and a %end. This is macro code that creates a loop for the statements inside. That means, as the macro runs, it is writing the code inside three times. The variable in the %do loop, dsnum, is a macro variable. Its value is text, just like the other macro variables. The data statement uses this value to name the data set differently each time the loop executes: one1, one2, and one3.
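A sketch of such a macro (the macro name makedata and the contents of each data step are assumptions; the loop variable dsnum and the data set names one1-one3 come from the discussion):

```sas
%macro makedata;
  %do dsnum=1 %to 3;
    data one&dsnum;
      do i=1 to &dsnum;
        x=ranuni(0);
        output;
      end;
    run;
    proc print data=one&dsnum;
    run;
  %end;
%mend makedata;
```

Submitting this only compiles the macro; nothing runs until it is called with %makedata.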

Now we need to run the macro. To do that, issue a command in SAS using the percent sign and macro name.

In the explorer window we can see that three data sets have been created.


Here is the output from the third loop.

Now consider the data step below. Suppose this is an example of a much larger data set that we have on hand (perhaps 52 weeks worth). For some reason, we need to split it up into separate data sets for each week.
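A small stand-in for that data step (all names and values are assumptions):

```sas
data sales;
  input week salesman $ amount;
  datalines;
1 Smith 310
1 Jones 225
2 Smith 145
2 Jones 280
3 Smith 198
3 Jones 334
4 Smith 250
4 Jones 410
;
```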


We could write a data step like this, but you can see that if there were many more weeks, this could be tedious.
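Assuming a source data set named sales with a week variable, the step might read:

```sas
data week1 week2 week3 week4;
  set sales;
  if week=1 then output week1;
  else if week=2 then output week2;
  else if week=3 then output week3;
  else if week=4 then output week4;
run;
```

With 52 weeks, the data statement and the chain of if statements would each grow to 52 entries.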

So let's get a macro to do the typing for us. Did you notice that the enhanced editor's color coding changes? Inside a macro it focuses on the macro language elements. The regular SAS language keywords no longer turn blue. Another thing to notice is the usage of semicolons. Macro statements end in semicolons like any other SAS statements. However, all macro statements are separate from the code that they create. In the data step above, there is no semicolon until all four weekly data sets have been named. In the macro, there is no semicolon after the data set name. If we put one there, we would get a semicolon after each one. Furthermore, there is a "%end;" which has a semicolon, but that semicolon is not going to be part of the text the macro writes, because it is part of a macro statement. Instead we put a semicolon by itself after the loop. This semicolon will be part of the text in the final program. The macro shown below writes a program just like that shown above.
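A sketch of that macro (the macro name split is an assumption):

```sas
%macro split;
  data
  %do i=1 %to 4;
    week&i
  %end;
  ;            /* this lone semicolon ends the generated data statement */
  set sales;
  %do i=1 %to 4;
    if week=&i then output week&i;
  %end;
  run;
%mend split;

%split
```

The last line is the macro call; the generated program is exactly the four-data-set step shown earlier.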


Remember that a macro is stored in a library, and can also be stored permanently. The creation of a macro is separate from calling (using) a macro. The last line in the program above is the macro call. Once the macro is saved, one needs only to call it, not re-run the code that creates it. With that in mind, consider that we might want to pass information to the macro when we call it. Suppose in the program above we want to choose which weeks to split out when we call the macro, rather than automatically doing four of them. Let's assume for this example that there are 52 weeks in the original data set and we want to pick an arbitrary range. Then we can make our macro accept parameters (which are macro variables named in parentheses after the macro name, separated by commas), which we "pass" when we call the macro (putting the variable values in parentheses after the macro name, separated by commas). Here we show the macro with two parameters which supply the starting and ending week numbers for the %do loop. In the example, weeks 1 to 2 are requested in the macro call.
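The parameterized version might look like this (parameter names are assumptions):

```sas
%macro split(first, last);
  data
  %do i=&first %to &last;
    week&i
  %end;
  ;
  set sales;
  %do i=&first %to &last;
    if week=&i then output week&i;
  %end;
  run;
%mend split;

%split(1, 2)    /* weeks 1 to 2 are requested */
```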

The idea here is that this macro would probably be saved permanently, then whenever we needed some of these data sets, we just call the macro with the appropriate week numbers. We don't have to rewrite the program each time. Now let's turn things around. The following SQL code combines the data from four weekly datasets for one of the salesmen and calculates a total.
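A reconstruction of that SQL code, with the data set names saleswk1-saleswk4 and variables week1-week4 taken from the later discussion (the variable salesman and the name "Smith" are assumptions):

```sas
proc sql;
select saleswk1.salesman,
       saleswk1.week1, saleswk2.week2, saleswk3.week3, saleswk4.week4,
       sum(saleswk1.week1, saleswk2.week2,
           saleswk3.week3, saleswk4.week4) as total
  from saleswk1, saleswk2, saleswk3, saleswk4
  where saleswk1.salesman=saleswk2.salesman and
        saleswk1.salesman=saleswk3.salesman and
        saleswk1.salesman=saleswk4.salesman and
        saleswk1.salesman="Smith";
quit;
```

Note the repeated patterns: the select list, the arguments inside sum(), the from list, and the join conditions. These are the parts the macro will generate.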

Our goal will be to turn this into a macro in order to accomplish two things: 1. to eliminate the repetitive typing, and 2. to make it possible to pass parameters that specify the range of weeks and the salesman for the report. The way to approach this, as a beginning macro programmer, is to take an example of working code, like that shown above, put it inside a macro, then change things a little at a time, frequently checking if it still works. It is also good to separate the code in strategic places so that the repeated parts are on their own lines, being careful to note whether the punctuation is also repeated.


Now focus on the first line of repeated phrases. Note that in this case, all end with a comma and are the same except for the numbers. We put a do loop around this line, and delete all but one example.

Next, we need to replace the numbers with the macro variable i (the loop counter). But before we can do that, we need to understand a little more about how macro variables are resolved. For the following discussion, suppose that the value of macro variable num is "1" and the value of macro variable month1 is "January" (remember, macro variable values are text). In previous examples, our macro variables have either stood alone or they have come at the end of a word. Sometimes macro variables are embedded inside words. For example, suppose you wanted to replace the number in "month1sales" with the macro variable num. If you put "month&numsales" the macro compiler would look for a macro variable called "numsales", not just "num". To get around this problem, the macro language uses a period (dot) to signal the end of a macro variable. Writing "month&num.sales" would be correct. The period is part of the macro variable reference, so that "&num." would all be replaced by the value of the num macro variable, resulting in the correct resolution, "month1sales". Macro variables can be "nested" so that one variable resolves to complete the name of another. A double & is used to indicate this nesting. For example, "&&month&num" will first resolve to "&month1" and then to "January". You can think of it this way: The compiler will make two passes through this phrase (if it finds "&&&" it will make three, and so on). In one pass, it will hold off on interpreting things that are preceded by more than one "&", but it will remove one "&" in preparation for the next pass. Wherever there is only one "&" left, it resolves the variable, and the value becomes part of the text it can resolve in the next pass. If nesting is used, one dot is resolved with each macro variable. If there is supposed to be a dot in the result, make sure to include that in addition to the others. Thus "month&num..sales" resolves to "month1.sales", "&&month&num..sales" resolves to "Januarysales", but "&&month&num...sales" resolves to "January.sales". Now, to return to our example, we have a period as part of the SAS code we want to generate. If we write "saleswk&i.week&i," the macro compiler will resolve this to "saleswk1week1," with no period! Therefore, we have to put in an extra period.

Now let's move on to the next line with a repeated pattern, the one that comes from inside the parentheses of the sum function. There is one difference between this line and the previous one: this one does not have a comma at the end. That means we have to include a comma in all but the last pass through the loop. To do that, we use a %if - %then statement. The text between %then and the semicolon will be added to the program if the condition is true. The semicolon is the end of the macro statement and is not part of the generated text.

This could be considered a short form of the statement. In fact, you might find it a bit unsatisfying: don't you want it to "%do" something? It can be written that way, and must be, if there is a semicolon in the text you want included. Below is an alternate way of doing the same thing.

We can finish up the remaining repeated sections in a similar way. The completed macro looks like this:
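A reconstruction of the completed macro, still hard-coded for weeks 1 to 4 and the salesman "Smith" (the macro name and the variable salesman are assumptions; the data set and week variable names follow the earlier discussion):

```sas
%macro salesreport;
proc sql;
select saleswk1.salesman,
  %do i=1 %to 4;
    saleswk&i..week&i,          /* extra period: resolves to saleswk1.week1, */
  %end;
  sum(
  %do i=1 %to 4;
    saleswk&i..week&i
    %if &i ne 4 %then ,;        /* comma after all but the last item */
  %end;
  ) as total
  from
  %do i=1 %to 4;
    saleswk&i
    %if &i ne 4 %then ,;
  %end;
  where
  %do i=2 %to 4;
    saleswk1.salesman=saleswk&i..salesman and
  %end;
  saleswk1.salesman="Smith";
quit;
%mend salesreport;
```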


Now we have finished the first goal, which was to eliminate all repeated typing. Our second goal was to make it possible to pass parameters that specify the range of weeks to report, and the salesman to report on. The parameters are the three macro variables in parentheses after the macro name. Since our original macro had a range of 1 to 4, it is a simple matter to go through the program and replace every instance of a 1 with "&start" and every instance of 4 with "&stop", being careful to add an extra period where necessary (these numbers don't occur in the program for any other reason than to refer to the data sets and variables we want to use). The salesman's name only occurs one time at the end, inside quotes. It is important that these quotes be double and not single in order for the macro variable to be resolved. An example of calling the macro is included, with the corresponding output shown below the program.
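The parameterized version and a sample call might look like this (parameter names are assumptions; %eval is used so the join loop starts one past &start):

```sas
%macro salesreport(start, stop, salesman);
proc sql;
select saleswk&start..salesman,
  %do i=&start %to &stop;
    saleswk&i..week&i,
  %end;
  sum(
  %do i=&start %to &stop;
    saleswk&i..week&i
    %if &i ne &stop %then ,;
  %end;
  ) as total
  from
  %do i=&start %to &stop;
    saleswk&i
    %if &i ne &stop %then ,;
  %end;
  where
  %do i=%eval(&start+1) %to &stop;
    saleswk&start..salesman=saleswk&i..salesman and
  %end;
  saleswk&start..salesman="&salesman";   /* double quotes so &salesman resolves */
quit;
%mend salesreport;

%salesreport(1, 4, Smith)
```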


The process of detecting and correcting errors is more difficult when using macro language. One reason is that there are now two levels of errors: those in the macro code, and those in the program generated by the macro. For instance, if you delete one of the %if statements in the above program, which makes the SQL code incorrect, and create the macro (without calling it), this is what you get in the log:


No errors are reported, because there were no errors in the macro code. There are also no helpful notes telling us the macro was successfully created. There is nothing but a copy of the lines that were submitted. Now suppose we call the macro:

SAS reports the error, but it is not related back to the line numbers in the macro program! This is because the macro is generating the code in the background, and the lines it generates are not copied to the log. It does not identify the exact location where the error occurred. It would be very hard to figure out from this limited information where exactly our error is. For this reason, it is important to write macros in small steps, and test them often. There are two tools available for debugging. The first is the symbolgen system option (put in an options statement; use nosymbolgen to turn it off). This will cause macro variables and their values to be printed to the log when the macro is called. This can be helpful to find out if your macro variables are resolving correctly, but once again, it may not be easy to connect these messages to the code that you wrote. Here is part of the log that came from calling the macro with symbolgen turned on.


The second tool is the %put statement. This is much like the regular put statement, in that it writes messages to the log. By careful planning, we can make %put statements tell us what is going on in the program. Here, a %put statement is inserted after each %do loop just to tell us it has finished. This, in combination with the symbolgen option, gives a pretty good indication of when the error occurred: after the third %do loop, and after the macro variable i had a value of 1. In other words, it occurs because the comma is missing after saleswk1, not saleswk3 as the error message suggests!
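For example, a marker after one of the loops might look like this (a fragment, assuming the loop sketched earlier; the message text is arbitrary):

```sas
  from
  %do i=&start %to &stop;
    saleswk&i
    %if &i ne &stop %then ,;
  %end;
  %put *** from-clause loop finished ***;
```

Because %put is a macro statement, the message goes to the log as the macro executes and contributes nothing to the generated SQL text.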

Using symbolgen may give you more information than you need or want, so you can also use macro variables in your own %put statements to report just the values you want to see. Each company that implements the SQL standard can add its own enhancements. SAS has added a useful feature that allows SQL to assign query results to macro variables. The syntax uses the keyword "into" together with a macro variable name with a colon in front of it. Here we show how the means of two variables are stored in two macro variables, which are then resolved in proc plot to create reference lines in the graph.
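A sketch using the usedcars data (the variable choices and macro variable names are assumptions):

```sas
proc sql noprint;                 /* noprint: we only want the macro variables */
select mean(price), mean(miles)
  into :mprice, :mmiles
  from usedcars;
quit;

proc plot data=usedcars;
  plot price*miles / vref=&mprice href=&mmiles;   /* reference lines at the means */
run;
```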

SAS has a number of built-in or automatic macro variables. One example is "sysdate". The following title statement will include the current date in the title statement.
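Note the double quotes, which allow the automatic macro variable to resolve:

```sas
title "Report produced on &sysdate";
```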

Exercises:

1. Suppose you get a monthly sales report with the salesman's name and total sales for the month. The reports are loaded into SAS data sets with names like Month1, Month2, etc. You want to generate quarterly summary data sets that include each salesman's name, sales for the three months, and total sales for the quarter. Since this is not a one-time project, but an ongoing task that will presumably continue for years to come, you want to automate the process as much as possible. The initial SAS program creates six months' worth of data and includes SQL code that will create the first quarter's summary. As you have seen, it is best to create macro code in small steps, and run the program frequently to make sure it works as you add more components. I will try to guide you through the steps here, but you only need to turn in the log and output for the final product, NOT every step listed below.
   a) Run the program as given to create the data sets. You can then comment out the data steps or delete them, as they do not need to run again (until you start a new session, unless you save them to a permanent data set).
   b) At the top of the program, define a macro variable called "q" for the quarter number and set it equal to 1. The idea is that this is the only thing the "user" should have to change when running the program for a new quarter. We will need to generate the corresponding month numbers from q. Define macro variables m1, m2, and m3, setting them equal to a month value that is generated from the quarter number.
   c) Define a macro around the existing SQL code, with a %macro, a %mend, and then a statement to call the macro.
   d) Now start substituting macro variables into the SQL code in each place where a quarter number or a month number appears. Make sure your program runs correctly before adding any %do loops.
   e) Now "compact" the code using %do loops each time a sequence of similar terms appears in the SQL code.
   f) When you have everything running correctly, add a proc print statement to print out the table that is created by the SQL command, and submit the log and output.

2. In this problem we will create a macro that will take parameters for a data set name and two variable names, then produce a plot (using proc plot) based on the parameters that are passed, and will also automatically generate a vertical reference line at the mean of the vertical variable.
a) For purposes of this exercise, use stavwood.txt as the source data (this file has tabs in it). The variables in this data set are group, y, x1, x2, x3. Begin by getting the next two steps to work outside of a macro. b) Use proc SQL to get the mean for your vertical reference line into a macro variable. c) Now write a plot step and use the vref option with the macro variable to create a reference line. d) When you have all this working, put it inside a macro. Note, the data step that reads stavwood.txt is not part of the macro. The macro should have parameters for data set name, vertical axis variable, and horizontal axis variable, in that order. Now you must change the program inside the macro so that all references, and I do mean ALL, to these three things are replaced by macro variables. (Include the log from compiling your completed macro in your homework submission.) e) Call the macro, putting in your dataset name, y for the vertical axis variable, and one of the x variables from stavwood.txt for parameters. Make sure the plot is correct and there are no errors in the log. (Submit this output with your results. Include the log from calling your macro in your homework submission one time. ) f) Call the macro again, reversing your vertical and horizontal variables. What happens? (Submit this output with your homework.) g) Call the macro again, with another combination of variables. (Submit this output with your homework.) Lesson 24-25: SAS Macro You may have used macros in Excel, or at least, you have probably heard about them. In Excel, macros are basically used to carry out a series of complex tasks by using one command (perhaps linked to a button in the spreadsheet). Thus a macro command carries out other commands. In SAS, the macro language is a language "above" the regular SAS language. It essentially generates SAS code for you. When you submit a SAS program, the first thing that happens is that SAS scans it for macro code. 
The macro code is interpreted, or compiled, into SAS language statements, then the statements are executed.


We must distinguish between macro code in general, and "a macro" in particular. Some macro commands work anywhere in a program, but most work only inside a macro. A macro is like a subroutine or procedure that is compiled and stored in a library, and then "called" or "invoked" by a SAS program. SAS statements that are not part of a macro are called "open code." All of the programs we have written so far have consisted entirely of open code. We begin with some ways that macro language is used in open code. A fairly convenient application is to define macro variables that can be used to change values in a program. A macro variable can be defined at the top of the program, where it is easy to find and change. It then can determine what happens further down. Here we look at a data step similar to that used in the Yahtzee program. The first two lines are assignment statements for macro variables. The percent sign is always the first character in macro language keywords. The %let commands assign values to the macro variables dice and reps. The values of macro variables are always text (character). This is because they will be inserted into the program statements that the macro language is writing for us, which become part of the program before SAS executes it.
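The original shows this data step as a screenshot. A sketch of what it might look like, with the %let statements at the top (the dice-rolling assignment is illustrative, not from the original):

```sas
%let dice = 5;
%let reps = 10;

data toss;
  array x (&dice);
  do i = 1 to &reps;
    do j = 1 to &dice;
      x(j) = ceil(6 * ranuni(0));  /* illustrative: roll one six-sided die */
    end;
    output;
  end;
run;
```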

When the macro variable is used, that is, when we want to insert its value into the program, it is preceded by an ampersand (&). Whenever the macro interpreter sees the ampersand, it tries to interpret the value of the variable that follows. SAS calls this resolving the macro variable. So, in the array definition, you can see that the value of "&dice" will be "5", so that after interpretation, the line says "array x (5);". The next line, which sets up the loop for the number of repetitions, will be interpreted as "do i=1 to 10;". Since the number of dice tossed has to match the array size, the next line uses the dice variable again, and it will be interpreted as "do j=1 to 5;". In other words, what SAS sees after the macro compiler gets done is like this:
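After the macro variables are resolved, the generated code would look something like this (a sketch consistent with the values dice=5 and reps=10 described above):

```sas
data toss;
  array x (5);
  do i = 1 to 10;
    do j = 1 to 5;
      x(j) = ceil(6 * ranuni(0));
    end;
    output;
  end;
run;
```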

So, the idea is, if you have numbers or names in your code that you might want to change later, especially if they occur many times and might be anywhere in a long program, it is a good idea to put them in macro variables that are defined at the top of the program. Then you can change the values in the macro variables and automatically change all the related values in the program. The macro variables are also available in title statements, which can be very useful to keep track of the settings you use. For this to work, the title must be enclosed in double quotes, not single quotes. Single quotes will prevent resolving the macro variables, and you will get a title that says exactly what is typed between the single quotes.
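A sketch of such title statements (the wording is illustrative):

```sas
title "Yahtzee simulation: &dice dice, &reps repetitions"; /* double quotes: variables resolve */
title2 'Settings: &dice dice';  /* single quotes: prints exactly this text, ampersands and all */
```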


Now let's create a macro. The macro begins with the keyword "%macro" followed by a name for the macro, and ends with the keyword "%mend" which means "macro end." You can put the name of the macro after %mend if you like, to keep track of which macro is ending. Now, what we are actually doing here is creating a macro, not executing the statements inside. When the code below is submitted, there will not be any data set created or any output from proc print. The macro will be compiled and saved in a subfolder of the work directory called "sasmacr." There is NO message in the log that this has happened. (You will only get a message if there are errors.) This macro makes use of a %do loop. Note that there is also a %to and a %end. This is macro code that creates a loop for the statements inside. That means, as the macro runs, it is writing the code inside three times. The variable in the %do loop, dsnum, is a macro variable. Its value is text, just like the other macro variables. The data statement uses this value to name the data set differently each time the loop executes: one1, one2, and one3.
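The macro itself appears as a screenshot in the original. Based on the description (a %do loop creating data sets one1, one2, and one3, with a proc print), it would look something like this sketch; the macro name and the data step contents are assumptions:

```sas
%macro makedata;
  %do dsnum = 1 %to 3;
    data one&dsnum;
      x = &dsnum;            /* illustrative contents */
    run;
    proc print data=one&dsnum;
    run;
  %end;
%mend makedata;
```

Submitting this only compiles and stores the macro; nothing runs until the macro is called.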


Now we need to run the macro. To do that, issue a command in SAS using the percent sign and macro name.

In the explorer window we can see that three data sets have been created.

Here is the output from the third loop.

Now consider the data step below. Suppose this is an example of a much larger data set that we have on hand (perhaps 52 weeks worth). For some reason, we need to split it up into separate data sets for each week.


We could write a data step like this, but you can see that if there were many more weeks, this could be tedious.
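The screenshot is not reproduced here. Such a data step, assuming a source data set named sales with a variable week, might look like this (note that there is no semicolon until all four data sets have been named, as discussed below):

```sas
data week1 (where=(week=1))
     week2 (where=(week=2))
     week3 (where=(week=3))
     week4 (where=(week=4));
  set sales;
run;
```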

So let's get a macro to do the typing for us. Did you notice that the enhanced editor's color coding changes? Inside a macro it focuses on the macro language elements. The regular SAS language keywords no longer turn blue. Another thing to notice is the usage of semicolons. Macro statements end in semicolons like any other SAS statements. However, all macro statements are separate from the code that they create. In the data step above, there is no semicolon until all four weekly data sets have been named. In the macro, there is no semicolon after the data set option. If we put one there, we would get a semicolon after each one. Furthermore, there is a "%end;" which has a semicolon, but that semicolon is not going to be part of the text the macro writes, because it is part of a macro statement. Instead we put a semicolon by itself after the loop. This semicolon will be part of the text in the final program. The macro shown below writes a program just like that shown above.
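A sketch of such a macro, matching the semicolon placement described above (the macro and data set names are assumptions):

```sas
%macro split;
  data
  %do i = 1 %to 4;
    week&i (where=(week=&i))    /* no semicolon here */
  %end;
  ;                             /* this lone semicolon ends the generated data statement */
  set sales;
  run;
%mend split;

%split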


Remember that a macro is stored in a library, and can also be stored permanently. The creation of a macro is separate from calling (using) a macro. The last line in the program above is the macro call. Once the macro is saved, one needs only to call it, not re-run the code that creates it. With that in mind, consider that we might want to pass information to the macro when we call it. Suppose in the program above we want to choose which weeks to split out when we call the macro, rather than automatically doing four of them. Let's assume for this example that there are 52 weeks in the original data set and we want to pick an arbitrary range. Then we can make our macro accept parameters (which are macro variables named in parentheses after the macro name, separated by commas), which we "pass" when we call the macro (putting the variable values in parentheses after the macro name, separated by commas). Here we show the macro with two parameters which supply the starting and ending week numbers for the %do loop. In the example, weeks 1 to 2 are requested in the macro call.
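With parameters, the macro and the call might look like this sketch (parameter and data set names are assumptions):

```sas
%macro split(start, stop);
  data
  %do i = &start %to &stop;
    week&i (where=(week=&i))
  %end;
  ;
  set sales;
  run;
%mend split;

%split(1,2)
```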

The idea here is that this macro would probably be saved permanently, then whenever we needed some of these data sets, we just call the macro with the appropriate week numbers. We don't have to rewrite the program each time. Now let's turn things around. The following SQL code combines the data from four weekly datasets for one of the salesmen and calculates a total.
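The SQL code appears as a screenshot in the original. Based on the names used in the discussion that follows (data sets saleswk1 through saleswk4, each with a weekN sales variable and a salesman name), it was presumably something like this; the salesman's name is an assumption:

```sas
proc sql;
  select saleswk1.week1, saleswk2.week2, saleswk3.week3, saleswk4.week4,
         sum(saleswk1.week1, saleswk2.week2, saleswk3.week3, saleswk4.week4) as total
    from saleswk1, saleswk2, saleswk3, saleswk4
    where saleswk1.salesman = saleswk2.salesman
      and saleswk2.salesman = saleswk3.salesman
      and saleswk3.salesman = saleswk4.salesman
      and saleswk1.salesman = "Smith";
quit;
```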

Our goal will be to turn this into a macro in order to accomplish two things: 1. to eliminate the repetitive typing, and 2. to make it possible to pass parameters that specify the range of weeks and the salesman for the report. The way to approach this, as a beginning macro programmer, is to take an example of working code, like that shown above, put it inside a macro, then change things a little at a time, frequently checking if it still works. It is also good to separate the code in strategic places so that the repeated parts are on their own lines, being careful to note whether the punctuation is also repeated.


Now focus on the first line of repeated phrases. Note that in this case, all end with a comma and are the same except for the numbers. We put a %do loop around this line, and delete all but one example.

Next, we need to replace the numbers with the macro variable i (the loop counter). But before we can do that, we need to understand a little more about how macro variables are resolved. For the following discussion, suppose that the value of macro variable num is "1" and the value of macro variable month1 is "January" (remember, macro variable values are text). In previous examples, our macro variables have either stood alone or they have come at the end of a word. Sometimes macro variables are embedded inside words. For example, suppose you wanted to replace the number in "month1sales" with the macro variable num. If you put "month&numsales" the macro compiler would look for a macro variable called "numsales", not just "num". To get around this problem, the macro language uses a period (dot) to signal the end of a macro variable. Writing "month&num.sales" would be correct. The period is part of the macro variable reference, so that "&num." would all be replaced by the value of the num macro variable, resulting in the correct resolution, "month1sales". Macro variables can be "nested" so that one variable resolves to complete the name of another. A double & is used to indicate this nesting. For example, "&&month&num" will first resolve to "&month1" and then to "January". You can think of it this way: The compiler will make two passes through this phrase (if it finds "&&&" it will make three, and so on). In one pass, it will hold off on interpreting things that are preceded by more than one "&", but it will remove one "&" in preparation for the next pass. Wherever there is only one "&" left, it resolves the variable, and the value becomes part of the text it can resolve in the next pass. If nesting is used, one dot is resolved with each macro variable. If there is supposed to be a dot in the result, make sure to include that in addition to the others. Thus "month&num..sales" resolves to "month1.sales", "&&month&num..sales" resolves to "Januarysales", but "&&month&num...sales" resolves to "January.sales". Now, to return to our example, we have a period as part of the SAS code we want to generate. If we write "saleswk&i.week&i," the macro compiler will resolve this to "saleswk1week1," with no period! Therefore, we have to put in an extra period.

Now let's move on to the next line with a repeated pattern, the one that comes from inside the parentheses of the sum statement. There is one difference between this line and the previous one: this one does not have a comma at the end. That means we have to include a comma in all but the last pass through the loop. To do that, we use a %if - %then statement. The text between %then and the semicolon will be added to the program if the condition is true. The semicolon is the end of the macro statement and is not part of the generated text.

This could be considered a short form of the statement. In fact, you might find it a bit unsatisfying: don't you want it to "%do" something? It can be written that way, and must be, if there is a semicolon in the text you want included. Below is an alternate way of doing the same thing.
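A sketch of the two equivalent forms of the comma handling (the surrounding loop is assumed to run from 1 to 4):

```sas
/* short form */
%if &i < 4 %then ,;

/* equivalent %do form */
%if &i < 4 %then %do;
  ,
%end;
```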

We can finish up the remaining repeated sections in a similar way. The completed macro looks like this:
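The completed macro appears as a screenshot in the original. One plausible shape, built from the pieces developed above (all data set and macro names are assumptions):

```sas
%macro salesrpt;
  proc sql;
    select
    %do i = 1 %to 4;
      saleswk&i..week&i,
    %end;
    sum(
    %do i = 1 %to 4;
      saleswk&i..week&i
      %if &i < 4 %then ,;
    %end;
    ) as total
    from
    %do i = 1 %to 4;
      saleswk&i
      %if &i < 4 %then ,;
    %end;
    where
    %do i = 1 %to 3;
      saleswk&i..salesman = saleswk%eval(&i+1).salesman and
    %end;
    saleswk1.salesman = "Smith";
  quit;
%mend salesrpt;
```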


Now we have finished the first goal, which was to eliminate all repeated typing. Our second goal was to make it possible to pass parameters that specify the range of weeks to report, and the salesman to report on. The parameters are the three macro variables in parentheses after the macro name. Since our original macro had a range of 1 to 4, it is a simple matter to go through the program and replace every instance of a 1 with "&start" and every instance of 4 with "&stop", being careful to add an extra period where necessary (these numbers don't occur in the program for any other reason than to refer to the data sets and variables we want to use). The salesman's name only occurs one time at the end, inside quotes. It is important that these quotes be double and not single in order for the macro variable to be resolved. An example of calling the macro is included, with the corresponding output shown below the program.
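The parameterized version and its call are shown as screenshots in the original. A sketch of the shape they would take (macro and parameter names are assumptions):

```sas
%macro salesrpt(start, stop, name);
  /* ... body as developed in the text, with &start and &stop in   */
  /* place of 1 and 4, and the final condition in double quotes so */
  /* the macro variable resolves:                                  */
  /*   saleswk&start..salesman = "&name";                          */
%mend salesrpt;

%salesrpt(1,4,Smith)
```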


The process of detecting and correcting errors is more difficult when using macro language. One reason is that there are now two levels of errors: those in the macro code, and those in the program generated by the macro. For instance, if you delete one of the %if statements in the above program, which makes the SQL code incorrect, and create the macro (without calling it), this is what you get in the log:


No errors are reported, because there were no errors in the macro code. There are also no helpful notes telling us the macro was successfully created. There is nothing but a copy of the lines that were submitted. Now suppose we call the macro:

SAS reports the error, but it is not related back to the line numbers in the macro program! This is because the macro is generating the code in the background, and the lines it generates are not copied to the log. It does not identify the exact location where the error occurred. It would be very hard to figure out from this limited information where exactly our error is. For this reason, it is important to write macros in small steps, and test them often. Two tools are available for debugging. The first is the symbolgen system option (put it in an options statement; use nosymbolgen to turn it off). This will cause macro variables and their values to be printed to the log when the macro is called. This can be helpful to find out if your macro variables are resolving correctly, but once again, it may not be easy to connect these messages to the code that you wrote. Here is part of the log that came from calling the macro with symbolgen turned on.


The second tool is the %put statement. This is much like the regular put statement, in that it writes messages to the log. By careful planning, we can make %put statements tell us what is going on in the program. Here, a %put statement is inserted after each %do loop just to tell us it has finished. This, in combination with the symbolgen option, gives a pretty good indication when the error occurred--after the third %do loop, and after the macro variable i had a value of 1. In other words, it occurs because the comma is missing after saleswk1--not saleswk3, as the error message suggests!
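Such a tracing statement, inserted after a %do loop, might look like this (the message text is illustrative):

```sas
options symbolgen;

%end;
%put *** First loop finished;
```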

Using symbolgen may give you more information than you need or want, so you can also use macro variables in your own %put statements to report the values you want to see. Each company that implements the SQL standard can add its own enhancements. SAS has added a useful feature that allows SQL to assign query results to macro variables. The syntax uses the keyword "into" together with a macro variable name with a colon in front of it. Here we show how the means of two variables are stored in two macro variables, which are then resolved in proc plot to create reference lines in the graph.
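A sketch of this technique; the data set and variable names are borrowed from the stavwood exercise data and are assumptions:

```sas
proc sql noprint;
  select mean(y), mean(x1)
    into :ymean, :xmean        /* query results stored in macro variables */
    from stavwood;
quit;

proc plot data=stavwood;
  plot y*x1 / vref=&ymean href=&xmean;   /* reference lines at the means */
run;
```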

SAS has a number of built-in or automatic macro variables. One example is "sysdate". The following title statement will include the current date in the title.
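The title statement is shown as a screenshot in the original; it would be something like this (the wording is illustrative):

```sas
title "Report generated on &sysdate";
```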

Exercises:

1. Suppose you get a monthly sales report with the salesman's name and total sales for the month. The reports are loaded into SAS data sets with names like Month1, Month2, etc. You want to generate quarterly summary data sets that include each salesman's name, sales for the three months, and total sales for the quarter. Since this is not a one-time project, but an ongoing task that will presumably continue for years to come, you want to automate the process as much as possible. The initial SAS program creates six months' worth of data and includes SQL code that will create the first quarter's summary. As you have seen, it is best to create macro code in small steps, and run the program frequently to make sure it works as you add more components. I will try to guide you through the steps here, but you only need to turn in the log and output for the final product, NOT every step listed below.

a) Run the program as given to create the data sets. You can then comment out the data steps or delete them, as they do not need to run again (until you start a new session, unless you save them to a permanent data set).

b) At the top of the program, define a macro variable called "q" for the quarter number and set it equal to 1. The idea is that this is the only thing the "user" should have to change when running the program for a new quarter. We will need to generate the corresponding month numbers from q. Define macro variables m1, m2, and m3, setting them equal to a month value that is generated from the quarter number.

c) Define a macro around the existing SQL code, with a %macro, a %mend, and then a statement to call the macro.

d) Now start substituting macro variables into the SQL code in each place where a quarter number or a month number appears. Make sure your program runs correctly before adding any %do loops.

e) Now "compact" the code using %do loops each time a sequence of similar terms appears in the SQL code.

f) When you have everything running correctly, add a proc print statement to print out the table that is created by the SQL command, and submit the log and output.

2. In this problem we will create a macro that will take parameters for a data set name and two variable names, then produce a plot (using proc plot) based on the parameters that are passed, and will also automatically generate a vertical reference line at the mean of the vertical variable.

a) For purposes of this exercise, use stavwood.txt as the source data (this file has tabs in it). The variables in this data set are group, y, x1, x2, x3. Begin by getting the next two steps to work outside of a macro.

b) Use proc SQL to get the mean for your vertical reference line into a macro variable.

c) Now write a plot step and use the vref option with the macro variable to create a reference line.

d) When you have all this working, put it inside a macro. Note, the data step that reads stavwood.txt is not part of the macro. The macro should have parameters for data set name, vertical axis variable, and horizontal axis variable, in that order. Now you must change the program inside the macro so that all references, and I do mean ALL, to these three things are replaced by macro variables. (Include the log from compiling your completed macro in your homework submission.)

e) Call the macro, putting in your dataset name, y for the vertical axis variable, and one of the x variables from stavwood.txt for parameters. Make sure the plot is correct and there are no errors in the log. (Submit this output with your results. Include the log from calling your macro in your homework submission one time.)

f) Call the macro again, reversing your vertical and horizontal variables. What happens? (Submit this output with your homework.)

g) Call the macro again, with another combination of variables. (Submit this output with your homework.)

Lesson 26-27: SAS Graph

SAS/Graph is the high-resolution graphics package for producing plots and charts in SAS. Many basic commands are similar to those we have learned for proc plot and proc chart. In fact, you can often insert a "g" in front of "plot" and "chart" and end up with a workable result, using the same syntax. However, there are many more options and capabilities available. We will just explore a few of them here.


SAS/Graph is a separate module or package in SAS, like SAS/STAT. In the Online Documentation, you will find an entry called "SAS/GRAPH Reference" in the first set of branches. After opening that, you can click on "SAS/GRAPH Procedures," followed by the name of the procedure you want to use. We begin with proc gplot. You can get a passable plot by typing commands similar to those in proc plot: Note that the plot is displayed in a new window, a "graph" window. This window behaves differently than the output window in one important way--it does not scroll down automatically when a new plot is created. This is easy to forget! Don't be fooled when it looks like your results haven't changed!
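Those simple commands might be, for example (data set and variable names are borrowed from the earlier exercises and are assumptions):

```sas
proc gplot data=stavwood;
  plot y*x1;
run;
quit;
```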

Note that we have two series because of the "=group" in the plot statement. By default, proc gplot assigns different colors to each value of group and prints a legend. There are many things that can be done to customize this graph. The most common commands are summarized in SASGraphCommands.doc. We can customize the symbols and colors used for the two groups. To do this we write symbol statements, which are global in effect, so we usually place them above the gplot step (although they


work inside as well). The symbols are numbered and the symbol statements written accordingly, much like title statements. Like title statements, the symbol definitions remain in effect until you change them. Unlike title statements, a change to a higher number symbol does not affect the lower numbered ones. Also, individual commands inside symbol statements do not get cleared or reset by leaving them out of a subsequent symbol statement. The value= command determines the shape of the symbol, and the color= command, well, you know. These can be abbreviated v= and c=. Many common shapes and capital letters can be used for the value. Most color names that you would think of will work too.
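A sketch of symbol statements for two groups (data set and variable names assumed):

```sas
symbol1 value=dot color=blue;
symbol2 v=square c=red;        /* abbreviated forms of value= and color= */

proc gplot data=stavwood;
  plot y*x1=group;
run;
quit;
```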

If you want to connect the dots, use one of the interpolation methods, abbreviated as "interpol" or just "i". The methods are join, spline, and regression. "Join" connects the points with straight lines. "Spline" uses a polynomial function for a smooth fit. However, if the points vary too much, there can be wild peaks and valleys. Both these methods connect the points in the order they occur in the data set. With the spline method, you can add the "s" option, which stands for sort. It looks like "splines" but it means it will sort by the x variable so that the points are connected from left to right.
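Sketches of the interpolation options described above:

```sas
symbol1 v=dot c=blue i=join;      /* connect points with straight lines */
symbol2 v=square c=red i=splines; /* spline fit, sorted by x (the extra s) */
```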


The regression method has several parts to its command. It starts with r. The second character is either l for linear, q for quadratic, or c for cubic. Then, you can have either cli or clm (confidence limits of prediction or confidence limits for the mean) followed by a confidence level like 80, 90, or 95. The regression equation is also printed in a note in the log.
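For example, a linear regression with 95% confidence limits for the mean would be requested like this:

```sas
symbol1 v=dot c=blue i=rlclm95;  /* r + l (linear) + clm + 95 */
```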


A height= (h=) command controls the size of the value symbol, and a line= (l=) command selects a line style.

Now wouldn't it be nice if the symbols didn't spill over the frame of the graph? So glad you asked... The appearance of the axes is controlled by global axis statements similar to the symbol statements, except that they are not automatically applied. The axis definition must be assigned to an axis using the vaxis or haxis option in the plot statement. Here we see two of the options in the axis definition demonstrated. The order= option controls the spacing and range of the values displayed on the axis. For categorical data, a list of category names can be given in the parentheses. The label= option controls the appearance of the axis label. The options inside the parentheses apply to the text that follows, so some characteristics can be changed in the middle of a word. Similar options can also be used in title statements, which will control the appearance of the titles in graphs. Also shown here is a legend statement, and the corresponding reference in the plot statement.
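A sketch of axis and legend definitions and their assignment in the plot statement (the order= range, labels, and names are illustrative):

```sas
axis1 order=(0 to 10 by 2) label=(height=1.5 color=blue "Predictor");
axis2 label=(angle=90 "Response");
legend1 label=("Group:");

proc gplot data=stavwood;
  plot y*x1=group / haxis=axis1 vaxis=axis2 legend=legend1;
run;
quit;
```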


At this time it is appropriate to add a note about sizing graphs. The graphs that SAS produces can be resized by dragging the borders like other windows objects. However, this shrinks or expands everything in the picture (like it should), so if you make the graph smaller you might no longer be able to read the text. Although there are options that can specify the size of the graph, it is worthwhile to experiment with another method. The initial size of the graph is determined (by default) by the size of the SAS window (the application window, not the graph window). Resizing your SAS window can go a long way toward giving you the results you want. Then, once a graph is produced, you should try resizing it (the graph window this time) by small amounts in both directions. This can affect the appearance considerably, through small adjustments in spacing between objects, and even in the displayed font. When the graphs above were produced, for example, the font for the numbers on the axes (tick mark labels) was thin and hard to read. A small adjustment in the size of the graph changed it to what you see here.

Here is another type of plot, called a bubble plot. This is a three-dimensional plot, because the size of the bubble is determined by the variable on the right side of the equal sign. In this case there are only two values for group, so you only see two sizes.
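A bubble plot is requested with the bubble statement in proc gplot; a sketch (names assumed):

```sas
proc gplot data=stavwood;
  bubble y*x1=group;   /* bubble size determined by group */
run;
quit;
```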


SAS has more nice examples in the documentation. See some under "SAS/GRAPH Reference, SAS/GRAPH Procedures, The GPLOT Procedure." To find more information about symbol and axis statements, and the like, see "SAS/GRAPH Reference, SAS/GRAPH Concepts, SAS/GRAPH Statements." It is really worthwhile to browse through this documentation to get an idea of what SAS/GRAPH can do. Next we turn to proc gchart, the high-resolution version of proc chart. Again, we can begin with simple commands such as those we have learned in earlier lessons for proc chart. Here is a bar graph with continuous data.


In this case, x was treated as a continuous variable, and SAS used midpoints of 7 bin ranges that it chose according to some default rules. You can use the levels= option to specify how many bins you want SAS to create, or you can use the midpoints= option to list the midpoints you want. You can list the numbers in parentheses or use "(a to b by c)" notation. If your values are discrete, use the discrete option, as shown below. If the chart variable is character, there will be a bar for each value, and the discrete option is not used.
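Sketches of the three binning options (the data set, variable, and values are illustrative):

```sas
proc gchart data=dice;
  vbar x / levels=5;                  /* ask for 5 bins                 */
  vbar x / midpoints=(2 to 12 by 2);  /* or list the midpoints yourself */
  vbar x / discrete;                  /* one bar per distinct value     */
run;
quit;
```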


As in proc chart, you can also use a group= option.
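For example (names assumed, in the spirit of the dice exercise below):

```sas
proc gchart data=dice;
  vbar x / discrete group=die;   /* side-by-side sets of bars, one set per die */
run;
quit;
```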

With a subgroup option, the colors of the bars and a crosshatch pattern can be controlled using pattern statements. The value= or v= option uses "L" for left slanting, "R" for right slanting, and "X" for crosshatch, followed by a number for the style.
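A sketch of pattern statements with a subgroup= option (names and styles are illustrative):

```sas
pattern1 value=L1 color=blue;   /* left-slanting hatch, style 1  */
pattern2 v=R2 c=red;            /* right-slanting hatch, style 2 */
pattern3 v=X3 c=green;          /* crosshatch, style 3           */

proc gchart data=dice;
  vbar sum / discrete subgroup=die;
run;
quit;
```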


You can make block charts:
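The block chart shown in the original would be requested with a block statement; a sketch (names assumed):

```sas
proc gchart data=dice;
  block sum / discrete;
run;
quit;
```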


Try out the hbar3d and vbar3d statements.


And, of course, there are pie charts.
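A basic pie chart (names assumed):

```sas
proc gchart data=dice;
  pie sum / discrete;
run;
quit;
```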


The explode= option separates the listed slices away from the pie for emphasis. The angle= option turns the pie.
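A sketch combining the two options (names and values illustrative):

```sas
proc gchart data=dice;
  pie sum / discrete explode=7 angle=90;  /* pull out the "7" slice, rotate the pie */
run;
quit;
```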


Exercises:

Download the data program to get started. Produce a graph using proc gplot for each of the four problems below.

1. First graph y*x and z*x on the same axes, using default settings (symbols and axes).

2. Customize the graph by defining some nice symbols and modifying the axes as you think appropriate (don't add interpolating lines at this time). Add an appropriate title too, perhaps with some nice formatting or color options.

3. Use the line interpolation method, with two different line styles for y and z.

4. Do a plot of only y*x with a linear regression line and the clm90 option.

To get a little practice with proc gchart, simulate 200 tosses of a pair of dice and calculate the sums. Create a chart using proc gchart for each of the five problems below. Explore options such as coloring, patterns, etc. as you wish.

1. Create a frequency histogram (use the discrete option) for the first die.

2. Create a frequency histogram for the sum of the dice.

3. Make a pie chart for the first die.

4. Make a pie chart for the sum of the dice with an "exploded" view of "7".

5. Create a side-by-side bar chart for the two dice, with 1's grouped, 2's grouped, etc.

For number 5, the data needs to be organized differently. You can do the other problems first, using the same data, then do one of two things: You can generate new data that has a die number for one variable and the die toss result for the other, or you can figure out how to rearrange the data you have so that it is in that form (this would be better practice). In any case, you need to end up with something like:

Die  X
1    4
2    3
1    1
2    5
etc.

Optional challenge for geometry and craft fans: Write a data step that creates points on a circle, ordered in such a way that when using proc gplot with i=join, the points will be connected like this string art:

(http://www.mathcats.com/crafts/stringart.html)

Lesson 31: Introduction to IML

SAS IML is a programming language for working with matrices. IML stands for "Interactive Matrix Language." The language is invoked, or started, by issuing a "proc iml;" command. IML is interactive. You can keep submitting statements, one after the other, and IML will execute them. No "run" statement is necessary. IML stops when another step (data or proc) starts, or a "quit;" command is given. Incidentally, the Learning Edition does not have IML. We can define a matrix in IML with an assignment statement and a list of elements in curly brackets, with rows separated by commas. Just like any other SAS code, the placement of the text on the line doesn't matter, so you can string them out on one line or organize them neatly in rows and columns. In order to see what the matrix looks like, you can use the print command.
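For example, a matrix definition and print might look like this (the values are illustrative):

```sas
proc iml;
  x = {1 2 3,
       4 5 6};    /* a 2x3 matrix, rows separated by commas */
  print x;
```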

The standard matrix operations, like addition, subtraction, and multiplication, are given by the usual operators, "+", "-", and "*". The division operator can be used with scalars, but if used with two conformable matrices, it will do element-wise division too, and a double asterisk is for exponents, as usual. However, there are many more operators in IML. The number sign ("#") is element-wise multiplication, and a double number sign is element-wise exponentiation. Double vertical bars will concatenate matrices side-by-side, while double slashes will concatenate them vertically. There are a number of standard functions that are commonly used in IML. The transpose of X is given by "X`", which is X followed by a back-quote character (to the left of "1" on the keyboard). It can also be found by "t(X)", which is the function notation equivalent. The inverse of X is given by "inv(X)". Since there are many functions, the best thing to do is refer to the documentation. In the Online Documentation, IML has its own entry in the main tree. You can go to the link http://support.sas.com/onlinedoc/913/docMainpage.jsp, then find the branch that says "SAS/IML User's Guide." Click the plus sign next to it, and a long list of interesting branches opens up. Near the bottom, we find "Language Reference." Open that up, and you will find, among other things, "Operators" and "Statements, Functions, and Subroutines." In these two sections you will find information about the operators and functions. There is no need to repeat all this here; you should make use of the documentation as necessary.
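A small demonstration of these operators and functions (values illustrative):

```sas
proc iml;
  a = {1 2, 3 4};
  b = {5 6, 7 8};
  c = a * b;       /* matrix multiplication        */
  d = a # b;       /* element-wise multiplication  */
  e = a || b;      /* side-by-side concatenation   */
  f = a // b;      /* vertical concatenation       */
  g = a`;          /* transpose, same as t(a)      */
  h = inv(a);      /* inverse                      */
  print c d e f g h;
```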


Exercises: 1. Explore the available functions and give examples of the use of five different functions.

Lesson 33: Headings, Do Loops, and Sampling

In reference to the assignment in Lesson 32, look in the documentation at the section called "Working with Matrices." Then, under "Using Assignment Statements," find "Matrix-Generating Functions." Here you can see how to use the J matrix function to create a new matrix.

More options using the reset command: reset autoname; Autoname gives headings for the rows and columns of matrices that are printed. The default labels are Row1, Col1, etc. You can define a vector of headings, such as head={"Mon" "Tue" "Wed"}; Then in a print statement you can put print x[colname=head rowname=rowh]; where rowh is another vector of headings for the rows. The mattrib command provides a more sophisticated way to assign row and column names: mattrib x rowname=(rowh) colname=(head); print x; This association of the row and column headings lasts as long as the IML step runs, so you don't have to keep specifying them in the print statement.

A range assignment like a=1:10; or b=5:2; actually creates a vector of the integers specified by the range. We saw that in subscripting we can use variables like a and b to specify the rows and columns to be selected. However, any vector will work, and the numbers need not be in order. For example, you could have a={5, 3, 7, 1}; Using this variable to select rows in subscript notation would pull the 5th, 3rd, 7th, and 1st rows, in that order. In addition, once a mattrib statement has been used to define row and column headings, those headings can also be used in the subscript specification instead of the numbers. You could even make the row and column headings part of a new matrix. However, since they are character vectors (as usual), the matrix must be a character matrix. Look at the following: x2=("Row"||head)//(rowh||x); Visualize the results or try an example.
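Putting these pieces together in one hedged sketch (the data and heading values are invented):

```sas
proc iml;
   x    = {1 2 3, 4 5 6, 7 8 9};
   head = {"Mon" "Tue" "Wed"};          /* column headings (row vector) */
   rowh = {"Wk1","Wk2","Wk3"};          /* row headings (column vector) */
   mattrib x rowname=(rowh) colname=(head);
   print x;                             /* headings now print automatically */
   a = {3, 1};
   print (x[a,]);                       /* rows 3 and 1, in that order */
   x2 = ("Row"||head)//(rowh||char(x)); /* char() converts x to character form */
   print x2;
quit;
```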
The char function may be used to convert a numeric matrix to character form for this purpose.

We bring in the usedcars data set for the next example. If work.usedcars doesn't exist, run a program to load the usedcars.txt file into a SAS data set. Usedcars has both character and numeric data. use usedcars; read all into UC; print UC; This will bring in only the numeric variables (there are three). read all var _num_ into UC; is equivalent, because it specifies numeric variables, which is the default. You can include this to make your program more readable. To create a character matrix for the character variables, write: read all var _char_ into UC2;

Now we'll learn something about do loops. We begin with the J function to create a column of 1's and the following do loop: Y=J(34,1); do i=1 to 34; Y[i,1]=uniform(0); end;


Note that the editor turns the "do" red, but it is not an error; it is a quirk of the editor. Now do the following: Yr=rank(Y); UCsample=UC[Yr[1:10,1],]; Look carefully at what is happening here. Yr contains the rank values of the random numbers in the Y vector. I want to choose 10 randomly sampled rows from UC. I can get the indices from the first 10 entries of Yr. Remember they don't have to be in order. So that is what "Yr[1:10,1]" gives: in essence, a list of 10 index numbers for the rows I want to select. Now, I use that list or vector to subscript the UC matrix, take all the columns, and assign the result to UCsample. Notice that Yr is actually a column vector, but the subscript notation accepts either row or column vectors for arguments.

Now here is an alternative way to get the random numbers. Y=uniform(0); do i=1 to 33; Y=Y//uniform(0); end; This works by starting with a random scalar, then vertically concatenating 33 more random numbers to it, building up the Y vector that way.

Lesson 34: SAS Data Sets and Matrices

Let's say we have the data set called "usedcars" in the work or user library. This time, we are going to select the first 10 observations into two matrices. We can write: proc iml; reset printall; use usedcars; read point (1:10) into UC; read point (1:10) var _char_ into UCC; Read the documentation in the IML section about working with SAS data sets for details. Now we want to go backwards, sending the contents of matrices back into a SAS data set. create ucs from uc; append; use ucs; list all; The create statement only creates an empty data set. The append statement adds the data from the matrix to the data set. To specify the variable names for the data set, use create ucs from uc[colname={"year" "miles" "price"}]; This method with the from clause will only work with one matrix, and therefore one type of data, at a time. The other method is to use create with a var clause. In order to do this, each variable should be in a vector.
year=uc[,1]; miles=uc[,2]; price=uc[,3]; make=ucc[,1]; model=ucc[,2]; color=ucc[,3]; stock=ucc[,4]; create uccs var{year miles price make model color stock}; append; use uccs; list all; To delete a data set so you can replace it, use call delete("ucs"); Another useful command allows sorting a data set from within IML. sort data=uccs by miles;
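The statements above can be condensed into one round-trip sketch, assuming work.usedcars exists with numeric variables year, miles, and price and character variables make, model, color, and stock:

```sas
proc iml;
   use usedcars;
   read all var _num_  into uc;    /* numeric variables only */
   read all var _char_ into ucc;   /* character variables only */
   /* back out to a data set, one variable per vector */
   year=uc[,1]; miles=uc[,2]; price=uc[,3];
   make=ucc[,1]; model=ucc[,2]; color=ucc[,3]; stock=ucc[,4];
   create uccs var{year miles price make model color stock};
   append;
   sort data=uccs by miles;        /* sort the data set from within IML */
   use uccs; list all;
quit;
```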


We can also sort a matrix by any column: call sort(uc, {1}); sorts uc by its first column (year).

Modules

Modules are basically subroutines. A module can just be a structure that groups a set of commands executed at once, or it can be a subroutine that passes parameters, or it can be a function, which is used in an assignment statement and returns one (matrix) value.

Lesson 35: Modules

Consider the matrices X={1 1, 2 2, 3 3}; Y={"A", "B", "C"}; Z={1 2 3, 2 3 4, 3 5 9}; X1=X[,1]; X2=X1+5; Side note: the following commands give the same result: J To create data sets that have both numeric and character variables, name vectors that have the same names as the variables in your data set and put them in a create statement this way: create mix var{X1 X2 Y}; append; show contents; list all; The var list should consist of vectors. If you put in a matrix, it will list all the elements in a column. If you want to recreate this data set (change it), you need to delete it first, using call delete("mix");

Modules provide a means of saving and reusing code. They can take three forms: a simple module without parameters, a subroutine module that passes parameters, and a function.

Simple module. The name is XPX. X and XPRIMEX are global variables. This type of module depends on having the correctly named matrices available in the program, and on the program accepting the matrices it creates as global variables. start xpx; xprimex=x`*x; print xprimex "Inside Module"; finish; That just creates the module; nothing runs. So then run it and show that XPRIMEX exists outside the module: run xpx; print xprimex "Outside Module";

Errors inside modules can do strange things. A module may be created even if it has errors. If this happens and you run it, it might go into pause mode. This means that it is waiting for your input. In this case you can submit commands and it continues to wait. To get out of the module, you can issue a "stop" command.
You can also use a "stop" command inside the module in programming statements, such as in if-then conditionals. If the condition is met, the module will stop, but IML keeps running. There is also an "abort" command that you can use in programming statements which will exit out of IML under program control. You can also pause a module with programming statements so that it waits for your input. If you put a "quit" statement inside a module, IML will exit immediately and the module will not be created. Subroutine module. This type of module accepts parameters when it is called and can pass parameters back. Put parameters next to the module name in parentheses. The parameters will have local symbol table values, even if they have the same names as variables in the global symbol table. The second print command will not work. start xpx2(X); xprimex=x`*x;


print xprimex "Inside Module"; finish; run xpx2(Z); print xprimex "Outside Module"; You can pass parameters back out. This time the second print works, because the value of xprimex has been passed back out to the global variable Y. Here the global Z value is passed into the module's local symbol table, where it becomes X. start xpx2(X,xprimex); xprimex=x`*x; print xprimex "Inside Module"; finish; run xpx2(Z,Y); print Y "Outside Module";

Function module. This kind of module returns a single value and is used in assignment statements like any function. The print statements inside the module would not normally be there; they are just to show what happens inside the module. start upsquare(X); print X "Incoming value of X"; n=min(nrow(X),ncol(X)); X=X[1:n,1:n]; print X "Processed value of X"; return(X); finish; Z=upsquare(X); print Z;

Storing modules and matrices. If you want to save all your matrices and modules, a simple store command will do it. It will, by default, go to the current user library (work if you haven't changed it). There is some difficulty in keeping track of when the library goes into effect, so the best thing to do is to have a permanent library defined, then use reset storage=library.imlstor; store; where library is your library name. You can give the catalog a different name, but IMLSTOR is the default name of the IML catalog, which is a single file that holds all the matrices and modules. You can then call everything back into memory with a load command: load; But you can store and load individual matrices and modules too. For matrices, just give the name; for modules, say "module=name". store X Y module=XPX; load X;

Exercises: 1. Write a subroutine without parameters that finds the average value of each column in a matrix, then subtracts that value from each element in the corresponding column (this is called "centering" in regression). 2.
Write a subroutine module that takes two parameters, a constant and a square matrix, adds the constant value to every element of the diagonal of the matrix, and passes the result back in a new parameter (this is related to "ridge regression"). 3. Write a function module that will return whichever of three identically sized input matrices has the largest determinant (you decide what to do about ties).

Lesson 36: First Simulation

Setting a user libname while IML is running may not change the default library. It is not entirely clear to me how this works. You can define the libname user before starting IML, and then the default module storage will go there. You can save matrices by saying "store matrixname;" and you can bring them back with "load matrixname;" Modules need the syntax "store module=modulename;" You can


also use the load and store commands without any options (names), and you will load or store all matrices and modules that are available. IMLSTOR is the default name of the catalog in which matrices and modules are stored. A catalog is a single file that stores multiple objects. If you don't have a user library defined, this will go into work; otherwise it will save in the specified user library. There is another way to do this, though. You can use an option in the reset command, like libname stat510 "C:\Stat510"; reset storage=stat510.imlstore; You can use other names instead of imlstore, but it is probably best to keep the default name.

Monte Carlo simulations are done by generating data from a theoretical distribution, while bootstrap methods use actual data but resample from the existing data to calculate various statistics. Both methods work by generating thousands of samples and then studying the statistics that come from the samples.

Starting with a Monte Carlo simulation, we set up some parameters: proc iml; sims=10; *this is for the outer loop that says how many times the simulation is repeated; ssize=10; *this is the size of each simulated sample; mean=20; std=3; *these are parameters for the normal distribution we are going to simulate; Next, we set up the basic loops. We will have an outer loop that repeats the samples and an inner loop that takes one sample of size ssize. do i=1 to sims; do k=1 to ssize; end; end; *sims; My sample is going to be stored in a vector called sample, so with each simulation loop we need to clear out the old sample; add a free command just inside the simulation loop. Normal random variables are generated by taking the standard normal random function, normal(seed), multiplying it by the desired standard deviation, and then adding the desired mean. Here we create a sample by using a loop which appends a new random number to the vector with each iteration.
It is not necessary to initialize the matrix. do i=1 to sims; free sample; *clear sample vector before starting sample procedure; do k=1 to ssize; sample=sample//(normal(0)*std+mean); end; print sample; *for test only, remove later; end; *sims;

Let's study the distribution of x-bar (the sample mean). That is, if you do repeated sampling, what is the average (mean) of all the x-bars, and what is the standard deviation (or variance) of the x-bars? This is a basic question in the study of statistics. If you have a statistic, that is, a number calculated from a sample, usually intended to estimate something, what are its properties? More specifically, what is its distribution? Is it an unbiased estimate of the parameter you want to estimate? Does it have a small enough variance to be useful? Is its variance smaller than that of other candidate estimators? Sometimes there is nice theory to answer these questions, but in other cases we either don't have a solution to the needed equations, or we do not have data that meet the assumptions of the theory. This is when simulation comes in handy. What we want to do now is calculate xbar for each sample, then accumulate the xbars in a vector. There will be an xbar for each sample (each loop of sims). Then we need to calculate the mean of all the xbars and the standard deviation of all the xbars. For the standard deviation, we will implement the formula


sx = sqrt( (Σx² − n·x̄²) / (n − 1) )

free xbar; *clear xbar vector before running simulation again; do i=1 to sims; free sample; *clear sample vector before starting sample procedure; do k=1 to ssize; sample=sample//(normal(0)*std+mean); end; xbar=xbar//sample[:,]; end; *sims; print xbar; *for test only, remove later; meanxbar=xbar[:,]; sxbar=sqrt((xbar[##,]-sims*meanxbar**2)/(sims-1)); print meanxbar sxbar; truesxbar=std/sqrt(ssize); *theoretical value of sxbar; print mean truesxbar;

A better way to do the sampling is to use the matrix version of the normal function. You can take out the sampling loop and replace it with simpler commands as follows; the "free sample" command is also not needed this way. free xbar; *clear xbar vector; do i=1 to sims; sample=J(ssize,1,0); sample=normal(sample)*std+mean; xbar=xbar//sample[:,]; end; *sims; meanxbar=xbar[:,]; sxbar=sqrt((xbar[##,]-sims*meanxbar**2)/(sims-1)); print meanxbar sxbar; truesxbar=std/sqrt(ssize); *theoretical value of sxbar; print mean truesxbar; Now change sims to 10,000 and see what results you get.

Using simulation to evaluate alpha and beta (significance and power):
Alpha = P(Type I Error) = P(Reject Ho | Ho is True)
Beta = P(Type II Error) = P(Do not Reject Ho | Ho is False)
Power = 1 - Beta = P(Reject Ho | Ho is False)

Lesson 37: Simulations for Power and Significance

Simulations help us understand how statistics work. We now see exactly what is meant by xbar having a distribution, since we have calculated a batch of xbars and found their mean and standard deviation. One of the purposes of simulation is to check whether theory matches reality, so we have also compared the theoretical values of parameters with the simulated values to see if they agree. We will now continue to examine issues with hypothesis tests. Continuing with the previous program, we will add a hypothesis test. For convenience we will test whether the mean is zero; any value could be used. So we have Ho: mu=0, Ha: mu<>0.
We used xbar as the name of the vector that collected all the xbars in the simulation, so we need a different name for the mean of the current sample in the t statistic. We will call it smean. The t statistic for the test of mu=0 is t=smean/(s/sqrt(ssize)), where s=sqrt((sample[##,]-ssize*smean**2)/(ssize-1)). Now the p-value for the test is the probability in both tails: beyond the t value on one side and beyond its opposite on the other side. IML has a probt function which gives the left-tailed probability of any t


statistic, which could be on the left or right. So to get the right probability we can take probt(-abs(t),ssize-1)*2. The minus absolute value makes sure we are using the left tail to calculate the probability. The t statistic has degrees of freedom, which will be ssize-1 in our notation. proc iml; sims=10; ssize=10; mean=0; *set the mean to 0 for now; std=3; alpha=.05; *provide an alpha value for the decision; free xbar; *clear xbar vector; free pvec; *like xbar, clear before restarting simulation; do i=1 to sims; sample=J(ssize,1,0); sample=normal(sample)*std+mean; smean=sample[:,]; *new name for sample mean; p=probt(-abs(smean)*sqrt(ssize)/sqrt((sample[##,]-ssize*smean**2)/(ssize-1)),(ssize-1))*2; *calculate p-value; pvec=pvec//p; *accumulate the p-values in a vector; xbar=xbar//smean; end; *sims; numrej=sum(pvec<alpha); *summarize number of rejections; pctrej=numrej/sims*100; *and do percents; print numrej pctrej; *and print;

The pvec vector stores the p-values during the simulation; at the end we count up the rejections in the numrej variable. The sum function adds up the elements of a matrix. The expression pvec<alpha creates a matrix of ones and zeros, a one for every element where the comparison is true. When the program is all working, change sims to a larger value such as 1000 or 10,000. The rejections should be about 5%, since that is the alpha value--the probability of rejecting when Ho is true.

Now consider changing the mean. Let's try 3. I got about 36% rejections. The true mean is not zero, so the null hypothesis is false. We have rejected 36% of the time, meaning our beta is .64 and our power is .36. But if you try other means you will see that the power varies. This is a good time to think about what it means for a null hypothesis to be true or false. Suppose we put the mean at .0001. That is not zero, so the null hypothesis is false, right? Or is it close enough to be true? I ran the simulation this way and got no rejections.
Now if we say the null hypothesis is true, that means they were all correct decisions. But if we say the null hypothesis was false, then they are all Type II errors and the power of the test is 0. It is important to realize here that the standard deviation also plays a role. If you changed the standard deviation to .0001 you would get a very different result!

Lesson 38: Power and Regression

Here is our program from last time again. proc iml; sims=10; ssize=10; mean=0; *set the mean to 0 for now; std=3; alpha=.05; *provide an alpha value for the decision; free xbar; *clear xbar vector; free pvec; *like xbar, clear before restarting simulation; do i=1 to sims;


sample=J(ssize,1,0); sample=normal(sample)*std+mean; smean=sample[:,]; *new name for sample mean; p=probt(-abs(smean)*sqrt(ssize)/sqrt((sample[##,]-ssize*smean**2)/(ssize-1)),(ssize-1))*2; *calculate p-value; pvec=pvec//p; *accumulate the p-values in a vector; xbar=xbar//smean; end; *sims; numrej=sum(pvec<alpha); *summarize number of rejections; pctrej=numrej/sims*100; *and do percents; print numrej pctrej; *and print;

We can try some different sample sizes and see how they affect power. The larger the sample size, the more likely you are to detect a false null hypothesis. Now it is nice to have a graph to display power under different conditions. We can add another outer loop to our simulation so that we can repeat it with different means. We will keep track of the percent rejections for the different means in another matrix and, rather than print them individually, make a graph. proc iml; sims=10; ssize=10; std=3; *remove mean from this list; alpha=.05; free xy; *this matrix stores the result of each simulation; do mean=-5 to 5; *set up a loop that changes the mean; free xbar; free pvec; do i=1 to sims; sample=J(ssize,1,0); sample=normal(sample)*std+mean; smean=sample[:,]; p=probt(-abs(smean)*sqrt(ssize)/sqrt((sample[##,]-ssize*smean**2)/(ssize-1)),(ssize-1))*2; pvec=pvec//p; xbar=xbar//smean; end; *sims; numrej=sum(pvec<alpha); pctrej=numrej/sims*100; *stop printing this; xy=xy//(mean||pctrej); *create a matrix with columns for the means and pctrejs; end; *end of the loop for means; call pgraf(xy); *graphs a two-column matrix;

Next we will try to simulate regression. A data point in regression is assumed to be the result of a linear equation plus an error term which has a mean of zero and a constant variance. So to simulate regression data, we need a set of x's together with the parameters (coefficients) in the equation, and also a standard deviation for the errors. We can then generate y values.
The solution for the parameter estimates (referred to as the b vector) is found by matrix calculations based on a matrix called X which, in the simple linear case, consists of a column of 1's and a column of the x values. (In multiple regression there are more columns of independent variables.) The solution is given by b = (X`X)^(-1) X`Y. So the procedure here is to first simulate a regression sample, then calculate the b vector, and finally analyze the distribution of the results. proc iml; *simulating regression; *note that at the beginning of regression we assume the x values are fixed, there is a linear relationship between x and the mean


of y, and the errors have a homogeneous normal distribution with mean 0; sims=10000; ssize=10; b0=20; b1=5; std=3; xvals=(1:ssize)`; *this is the fixed set of x; x=J(ssize,1)||xvals; *the X matrix in simple linear regression is a column of 1s for the intercept and the x values; xpxi=inv(x`*x); *inverse of the "x prime x" matrix; h=xpxi*x`; *useful in later calculations; free bsave; *this matrix will hold the parameter estimates; do i=1 to sims; errors=J(ssize,1,0); errors=normal(errors)*std; y=b0+b1*xvals+errors; *note that xvals and errors are vectors; b=h*y; bsave=bsave//b`; end; bmean=bsave[:,]; *this calculates means of both columns; btruemn=b0||b1; *want a vector to compare to bmean; btruecov=std**2*xpxi; *standard formula for covariance matrix of b; *calculate variances and covariances from simulation to compare; bvar=(bsave[##,]-sims*bmean##2)/(sims-1); bcov=((bsave[,1]#bsave[,2])[+,]-sims*bmean[,1]*bmean[,2])/(sims-1); *just cov(b0,b1); bcovar=(bvar[1,1]||bcov)//(bcov||bvar[1,2]); *variance-covariance matrix; bstd=sqrt(bvar); print bmean btruemn bstd; print bcovar btruecov; quit;

Exercise: Modify the regression simulation to do a one-way ANOVA with three levels. There is actually not much to change, because the matrix solutions will be the same. What changes is the X matrix and the way you simulate the data. You will have three means for three groups or levels. The distribution of errors is considered the same for all (the simplest assumptions again). Your y vector will then consist of the group means plus the errors. The X matrix will consist of three columns which are the values of the indicator variables for the groups. It looks something like
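The indicator-variable X matrix can be sketched as follows; this is an illustrative example assuming three groups with two observations per group (your group sizes may differ):

```sas
proc iml;
   /* one-way ANOVA design matrix: each row has a 1 in its group's column */
   X = {1 0 0,
        1 0 0,
        0 1 0,
        0 1 0,
        0 0 1,
        0 0 1};
   print X;
quit;
```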


Lesson 39: Binomial Confidence Intervals

Consider a binomial distribution with parameters n and p. We want to study confidence intervals and coverage. Here is a way to simulate binomial data: %let n=10; %let p=.5; %let sims=10; data one; do i=1 to &sims; x=ranbin(0,&n,&p); *parameters are seed, n, p; output; end; proc print; run;

The usual method of calculating a confidence interval is the normal approximation. This can perform poorly, especially if p is not close to .5 and n is small. The formula is phat ± z(alpha/2)·sqrt(phat·(1−phat)/n). Now, coverage means the proportion or percent of confidence intervals that actually contain, or capture, the true parameter (p). So, ideally, if we have a 95% confidence interval, 95% of the confidence intervals should contain the true parameter. However, this may not hold if the assumptions are not met or the approximation is poor. In our simulation we will calculate a 95% CI for each sample being simulated. %let n=10; %let p=.5; %let sims=10; %let z=1.96; *z-value for a 95% CI; data one; do i=1 to &sims; x=ranbin(0,&n,&p); phat=x/&n; lb=phat-&z*sqrt(phat*(1-phat)/&n); *lower and upper bounds of CI; ub=phat+&z*sqrt(phat*(1-phat)/&n); output; end; proc print; run;

To find coverage we need to keep track of how many times the CI includes the true parameter. We could store the results of the simulation in a data set and then analyze it, but this is a simple question: we can count the correct results as the data step runs, so we don't need to save all the results.


data one(keep=cover coverage); do i=1 to &sims; x=ranbin(0,&n,&p); phat=x/&n; lb=phat-&z*sqrt(phat*(1-phat)/&n); ub=phat+&z*sqrt(phat*(1-phat)/&n); if &p>lb and &p<ub then cover+1; *count up the intervals that cover; coverage=cover/&sims; *remove the output statement, only need the final result; end;

Run this with p=.5 and n=10 with 1000 or 10,000 simulations. I got a coverage of 89%. This means we do not have a 95% CI, but only an 89% CI in this situation. Now try increasing the sample size (n) to 20. I got 95.7%. So with larger sample sizes we are getting better coverage. Increasing the sample size beyond that doesn't seem to improve it. Now try setting p to .1. It seems in this case you need a sample of at least 100 to get close to 95% coverage.

However, that is not the whole story. There are other problems with these confidence intervals. Look at this sample of 10 confidence intervals produced with a p of .1 and an n of 20. Notice that one interval is 0 to 0, so it automatically misses. But most of the intervals have negative lower bounds. This does not make sense, because 0<p<1. If p=.9, you might get the same phenomenon at the other end, with values greater than 1. Some people solve this problem by just setting the lower bound to 0 or the upper bound to 1, but that is still somewhat unsatisfactory.

-0.031481 0.23148
 0.000000 0.00000
-0.006493 0.30649
-0.006493 0.30649
 0.024692 0.37531
-0.031481 0.23148
-0.045519 0.14552
-0.031481 0.23148
-0.006493 0.30649
-0.031481 0.23148

There is an alternate formula for the confidence interval (the Agresti-Coull "plus four" interval). The technique consists of adding two successes and two failures to the results. In other words, the estimator is ptilde=(x+2)/(n+4). We will add this formula to the program and compare the results.
data one(keep=cover coverage covert coveraget); do i=1 to &sims; x=ranbin(0,&n,&p); phat=x/&n; lb=phat-&z*sqrt(phat*(1-phat)/&n); ub=phat+&z*sqrt(phat*(1-phat)/&n); if &p>lb and &p<ub then cover+1; coverage=cover/&sims; ptilde=(x+2)/(&n+4); lbt=ptilde-&z*sqrt(ptilde*(1-ptilde)/(&n+4)); ubt=ptilde+&z*sqrt(ptilde*(1-ptilde)/(&n+4)); if &p>lbt and &p<ubt then covert+1; coveraget=covert/&sims; end;

Here are the alternate confidence intervals for the same run as that above. We didn't get any 0-to-0 intervals, and we got only one negative number.

 0.03337 0.34758
-0.02220 0.21268
 0.06769 0.40850
 0.06769 0.40850
 0.10498 0.46645


 0.03337 0.34758
 0.00286 0.28286
 0.03337 0.34758
 0.06769 0.40850
 0.03337 0.34758

For n=10 and p=.1, the old method has a coverage of about 65%, while the new method has a coverage of about 92%. The new method gives about 95% with only n=20, as compared to n=100 for the old method.

Lesson 40: Bootstrap for Model Selection Frequencies

libname s "c:\stat510"; options nonotes; %let ds=s.manp; *name of dataset to be analyzed; %let n=25; *number of observations in source dataset; *Reserved variables that should not be in the original data: i, idr; *Reserved dataset names that should not be in work: temp, rands, sub; %macro boot; *This data step reads the source data and adds an idr column; data temp; set &ds; idr+1; %do j=1 %to 100; *This data step creates random numbers for the bootstrap sample; data rands(drop=i); do i=1 to &n; idr=int(uniform(0)*&n)+1; output; end; *This sql step creates the bootstrap sample; proc sql; create table sub as select temp.* from temp, rands where temp.idr=rands.idr; *Here is where the analysis goes. Depending on the task, you need to send output to an output data set, then pull it into a summary data set (probably via sql). Finally, the summary data set must be summarized to generate the desired bootstrap statistics; *Example using proc reg and adjusted r-square selection to evaluate model selection frequencies; proc reg data=sub outest=est noprint; model y=x1-x7 /selection=adjrsq best=1; %if &j=1 %then %do; *Creates the summary data set on the first iteration; data summ; set est; %end; %else %do; *Adds to the summary data set on subsequent iterations; proc sql; insert into summ select * from est; %end;


%end; *process summary data set; data summ2 (keep=modl); set summ; length modl $21.; if x1 ne . then modl=" x1-"; if x2 ne . then modl=trim(modl)||"x2-"; if x3 ne . then modl=trim(modl)||"x3-"; if x4 ne . then modl=trim(modl)||"x4-"; if x5 ne . then modl=trim(modl)||"x5-"; if x6 ne . then modl=trim(modl)||"x6-"; if x7 ne . then modl=trim(modl)||"x7-"; proc freq data=summ2; tables modl; run; %mend boot; %boot;

Lesson 41: Bootstrap II

(Difference between regression and ANOVA simulation.) Download the stateinfo data set. Use proc univariate to examine the distribution of the area variable. It is not normal, due in large part to outliers. This data will be used to demonstrate how we can use bootstrapping to estimate a population parameter, and then, in addition, to use simulation to compare the bootstrap estimate to a "normal" estimate in terms of its statistical performance. We will consider the 50 state values to be the entire population. We will be sampling from this population and studying the behavior of our estimates of the mean in terms of their ability to predict the population mean. We will use the area variable to do this. Because this data is highly skewed with serious outliers, the measure of central tendency that should be used is the median. However, for the purpose of this example, we will focus on estimating the mean. We can find the population mean (mu) from the proc univariate output: it is 75894.1. What is a confidence interval? The next part of the process is a bit confusing because we have different levels of sampling. We are going to begin by simulating taking a sample (without replacement) from the population. In this part we are simulating what happens when we really take a sample. The process is similar to what we studied the first time we did it in IML: we can assign a uniform random variable to each observation, sort on it, and take the top 20 (or whatever our sample size is).
From this sample we can calculate a traditional x-bar and confidence interval based on normal theory. Then we go one step further and do a bootstrap on this sample, which means we resample from it, with replacement, and from all these resamples we find the percentiles corresponding to the confidence level we want. (We should do 90% because it is easier to get P5 and P95.) These will be the bootstrap lower and upper bounds. We then repeat the sampling process (sample from the population again) and find a new P5 and P95. Do this many times and see what the coverage is, as well as the variability. Does it perform better than the normal-theory interval? libname s "c:\stat510"; proc print data=s.stateinfo; run; options ls=80 nonotes; *treat data set as population data. Examine distributions, particularly area; proc univariate data=s.stateinfo normal plot; var area pop hipt;


run; *note non-normality, mainly due to Alaska; proc univariate data=s.stateinfo normal plot; var area; where numenter < 49; run; *We will study the efficiency of the bootstrap technique for building confidence intervals. Copy the mean from the univariate output; *Let us use normal techniques for building confidence intervals and see what the true coverage is; %let trumean=75894.1; %let n=20; %let sims=10; %let ds=s.stateinfo; %cistates; (Note that this call works only after the macro definition below has been submitted.) %macro CIstates; %do j=1 %to &sims; *This is the simulation loop; data temp (keep=state area rand); set &ds; rand=uniform(0); *This data step creates random numbers for the simulation sample; proc sort data=temp out=sub1(drop=rand); by rand; data sub; set sub1(obs=&n); *proc print;run; *This sql step calculates the normal-theory confidence interval; proc sql; create table est as select mean(area)-tinv(.975,(&n-1))*std(area)/sqrt(&n) as LB, mean(area)+tinv(.975,(&n-1))*std(area)/sqrt(&n) as UB from sub; run; %if &j=1 %then %do; *Creates the summary data set on the first iteration; data summ; set est; %end; %else %do; *Adds to the summary data set on subsequent iterations; proc sql; insert into summ select * from est; %end; %end; *end of simulation loop; *process summary data set; data summ2; set summ; cover=0; if lb<&trumean and ub>&trumean then cover=1; proc print data=summ2(obs=10); run; proc freq data=summ2;

Page No.203

tables cover/nocum; proc means data=summ2 mean std cv; var lb ub; run; %mend CIstates;
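Outside SAS, the logic of %CIstates can be sketched in a few lines of Python. This is an illustration only: an arbitrary skewed population stands in for the state areas, and the t critical value tinv(.975,19) ≈ 2.093 is hardcoded to keep the sketch dependency-free.

```python
import random
import statistics

T_975_DF19 = 2.093  # tinv(.975, 19) in SAS terms; hardcoded for n - 1 = 19

def coverage(population, true_mean, n=20, sims=200, seed=2):
    """Draw `sims` simple random samples, build a 95% t interval
    from each, and report the fraction that cover true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        sub = rng.sample(population, n)  # like sorting on rand and keeping obs=n
        xbar = statistics.mean(sub)
        half = T_975_DF19 * statistics.stdev(sub) / n ** 0.5
        if xbar - half < true_mean < xbar + half:
            hits += 1
    return hits / sims

# Arbitrary skewed "population" (exponential), echoing the skewness of area
rng = random.Random(0)
population = [rng.expovariate(1.0) for _ in range(500)]
true_mean = statistics.mean(population)
print(coverage(population, true_mean))
```

On a skewed population, the observed coverage typically lands somewhat below the nominal 95%, which is the point the simulation is meant to reveal.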

%macro CIboot;
*This data step creates random numbers for the simulation sample;
data temp (keep=state area rand);
  set &ds;
  rand=uniform(0);
proc sort data=temp out=sub1(drop=rand);
  by rand;
data sub;
  set sub1(obs=&n);
  idr+1;
run;

%do j=1 %to &sims; *This is the simulation loop;

*This data step creates random numbers for the bootstrap sample;
data rands(drop=i);
  do i=1 to &n;
    idr=int(uniform(0)*&n)+1;
    output;
  end;

*This sql step creates the bootstrap sample;
proc sql;
  * create table sub as ;
  select sub.* from sub, rands where sub.idr=rands.idr;
quit;
*left off here;

%if &j=1 %then %do; *Creates the summary data set on the first iteration;
  data summ;
    set est;
%end;
%else %do; *Adds to the summary data set on subsequent iterations;
  proc sql;
    insert into summ select * from est;
%end;

%end; *end of simulation loop;

*process summary data set;
data summ2 (keep=modl);
  set summ;
proc freq data=summ2;
  tables modl;
run;
%mend CIboot;

Lesson 42: Bootstrap III

Page No.204

We should be using the t distribution to calculate normal-theory confidence intervals. Use tinv(p, df), but be careful about the tail probability, which differs among software packages: SAS's tinv takes a left-tail probability, while Excel's TINV takes a two-tailed one. Continue building the program from last time.

libname s "c:\stat510";
proc print data=s.stateinfo;
run;
options ls=80 nonotes;

*treat data set as population data. Examine distributions, particularly area.;
proc univariate data=s.stateinfo normal plot;
  var area pop hipt;
run;

*note non-normality, mainly due to Alaska.;
proc univariate data=s.stateinfo normal plot;
  var area;
  where numenter < 49;
run;

*We will study the efficiency of the bootstrap technique for building confidence intervals. Copy the mean from the univariate output.;
*Let us use normal techniques for building confidence intervals and see what the true coverage is.;
%macro CIstates;
%do j=1 %to &sims; *This is the simulation loop;

*This data step creates random numbers for the simulation sample;
data temp (keep=state area rand);
  set &ds;
  rand=uniform(0);
proc sort data=temp out=sub1(drop=rand);
  by rand;
data sub;
  set sub1(obs=&n);
*proc print; run;

*This sql step builds the normal-theory confidence interval (90%, to match P5 and P95);
proc sql;
  create table est as
    select mean(area)-tinv(.95,(&n-1))*std(area)/sqrt(&n) as LB,
           mean(area)+tinv(.95,(&n-1))*std(area)/sqrt(&n) as UB
    from sub;
run;

%if &j=1 %then %do; *Creates the summary data set on the first iteration;
  data summ;
    set est;
%end;
%else %do; *Adds to the summary data set on subsequent iterations;
  proc sql;
    insert into summ select * from est;
%end;

%end; *end of simulation loop;

*process summary data set;
data summ2;
  set summ;
  cover=0;
  if lb<&trumean and ub>&trumean then cover=1;
proc print data=summ2(obs=10);
run;
proc freq data=summ2;
  tables cover/nocum;
proc means data=summ2 mean std cv;
  var lb ub;
run;
%mend CIstates;

*Define the macro first, then set its parameters and invoke it;
%let trumean=75894.1;
%let n=20;
%let sims=100;
%let ds=s.stateinfo;
%cistates;

%macro CIboot;
%do k=1 %to &sims; *This is the simulation loop;

*This data step creates random numbers for the simulation sample;
data temp (keep=state area rand);
  set &ds;
  rand=uniform(0);
proc sort data=temp out=sub1(drop=rand);
  by rand;
data sub;
  set sub1(obs=&n);
  idr+1;
run;

%do j=1 %to &boots; *This is the bootstrap loop;

*This data step creates random numbers for the bootstrap sample;
data rands(drop=i);
  do i=1 to &n;
    idr=int(uniform(0)*&n)+1;
    output;
  end;

*This sql step creates the bootstrap sample and calculates xbar;
%if &j=1 %then %do;
  proc sql;
    create table summ as
      select mean(area) as xbar from sub, rands where sub.idr=rands.idr;
  quit;
%end;
%else %do;
  proc sql;
    insert into summ
      select mean(area) as xbar from sub, rands where sub.idr=rands.idr;
  quit;
%end;

%end; *end of bootstrap loop;

*P5 and P95 of the bootstrap means are this sample's 90% interval;
proc means data=summ noprint;
  var xbar;
  output out=summ2 p5=lb p95=ub;
run;

%if &k=1 %then %do;
  data summ3;
    set summ2;
  run;
%end;
%else %do;
  proc sql;
    insert into summ3 select * from summ2;
  quit;
%end;

%end; *end of simulation loop;

*process summary data set;
data summ4;
  set summ3;
  cover=0;
  if lb<&trumean and ub>&trumean then cover=1;
proc print data=summ4(obs=10);
proc freq data=summ4;
  tables cover/nocum;
proc means data=summ4 mean std cv;
  var lb ub;
run;
%mend CIboot;

%let trumean=75894.1;
%let n=20;
%let sims=100;
%let boots=100;
%let ds=s.stateinfo;
%ciboot;
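The double loop in %CIboot can be mirrored in Python as a check on one's understanding. Again, this is only a sketch with an arbitrary skewed population in place of the state areas; the SAS index trick (idr=int(uniform(0)*n)+1) becomes a 0-based random index here.

```python
import random
import statistics

def boot_ci(sub, boots, rng):
    """Bootstrap loop: resample sub with replacement `boots` times
    via random row indices, then take P5 and P95 of the means."""
    n = len(sub)
    means = sorted(
        statistics.mean(sub[int(rng.random() * n)] for _ in range(n))
        for _ in range(boots)
    )
    return means[int(boots * 0.05)], means[int(boots * 0.95) - 1]

def boot_coverage(population, true_mean, n=20, sims=50, boots=100, seed=4):
    """Simulation loop: draw a sample, bootstrap a 90% interval,
    and count how often the true mean is covered."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        sub = rng.sample(population, n)
        lb, ub = boot_ci(sub, boots, rng)
        if lb < true_mean < ub:
            hits += 1
    return hits / sims

# Arbitrary skewed "population", as in the earlier sketch
rng = random.Random(0)
population = [rng.expovariate(1.0) for _ in range(500)]
true_mean = statistics.mean(population)
print(boot_coverage(population, true_mean))
```

Comparing this coverage (and the interval variability) with the normal-theory results is exactly what the two SAS macros are set up to do.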

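A quick check of the tail convention warned about in Lesson 42: a left-tail inverse CDF, the convention SAS's tinv uses, must be given 1 - alpha/2 for a two-sided interval. Python's standard library ships only the normal distribution (the t version would need SciPy), so this sketch uses it as the analogue.

```python
from statistics import NormalDist

z = NormalDist()
# Left-tail convention, like SAS tinv: pass 0.975 for a 95% interval
# and 0.95 for a 90% interval.
z95 = z.inv_cdf(0.975)   # roughly 1.96
z90 = z.inv_cdf(0.95)    # roughly 1.645
print(round(z95, 3), round(z90, 3))
```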