
Cheat sheet: Data Processing Optimization - for Pharma Analysts & Statisticians
Karthik Chidambaram, Senior Program Director, Data Strategy, Genentech, CA

ABSTRACT
This paper provides tips and techniques for analysts and statisticians to optimize the data processing routines in
their day-to-day work. Considerable productivity is lost to slow SAS® servers and slow response times from IT teams.
However, there are tools and techniques that analysts can apply on their own end to bypass these inefficiencies.
This paper provides a list of those techniques and shares experience in utilizing the SAS GRID architecture.
Key sections of the paper:
1. Tips and techniques to optimize SAS programs and bypass bottlenecks
2. Hidden gems: quick tips to administer and optimize parameters for processing huge volumes of data
3. GRID: a quick primer on GRID (from an analyst/statistician perspective) and its advantages

TIPS AND TECHNIQUES TO OPTIMIZE SAS PROGRAMS TO BYPASS BOTTLENECKS


OPTIMIZING WINDOWS MACHINE FOR PROCESSING YOUR PROGRAMS:
In many cases, the servers or machines underperform and the blame is mostly placed on the SAS system. However,
there are instances where the back-end system could be optimized to better serve the analytics. For instance, under
Windows 7, follow these steps to optimize application performance:
• Open the Control Panel
• Click System and Security
• Select the System
• Click Advanced system settings task
• Select the Advanced tab
• In the Performance box, click Settings and then select the Advanced tab
• To optimize performance of an interactive SAS session, select Programs
• To optimize performance of a batch SAS session, select Background services
• Click OK

This optimization ensures that memory and page files are appropriately tuned for the type of SAS processing
we use, and it improves the stability and memory handling of the server/PC to a great extent. Irrespective of the
type of Windows machine used, the optimization listed above can be accomplished (even though the navigation
path may be slightly different).

USING HIGHLY RECURSIVE PROCESS WITH MODERATE SIZED DATASETS? CONSIDER MEMLIB
OR MEMCACHE
With the MEMLIB and MEMCACHE options, we can create memory-based libraries. Using memory-based
libraries reduces the I/O to and from disk. In particular, if our permanent library is on a SAN, we will see a substantial
processing improvement with the MEMLIB option. Memory-based libraries can be used in several ways:
1. As storage for the WORK library
2. For processing SAS libraries with high I/O
3. As a cache for very large SAS libraries
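As a minimal sketch, a memory-based library can be assigned with the MEMLIB libname option (Windows-specific; the library name and path below are illustrative, and sufficient RAM must be available):

/* Assign a memory-based library; 'fastlib' and the path are examples */
libname fastlib 'C:\projects\study01' memlib;

/* WORK itself can be made memory-based by starting SAS with the
   -MEMLIB invocation option; MEMMAXSZ limits the memory SAS may use */

The benefit is largest for libraries that are read and rewritten repeatedly within a session, since each pass then avoids a round trip to disk.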

CHECK THE ASSIGNMENT OF THE SAS WORK LIBRARY


Especially in server-based SAS processing, there is an ever-increasing need for additional space on the work
server. When the number of users or the size of the processing database increases, the size of the workspace must
increase correspondingly. In most cases, this impacts the performance of the system. SAS processes are I/O intensive
and use the work library to store temporary files. There are two common issues with the SAS work library setup:
1. Size of the work folder
2. Network connectivity to the work folder from the server


Workaround: Check the SAS work library assignment using PROC DATASETS. Check for I/O issues by switching on
the FULLSTIMER option. If you notice I/O issues, try defining a different location with the WORK system option at
invocation (-work on the command line) or by modifying the SAS work assignment in autoexec.sas.
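For instance, the current WORK location and detailed resource statistics can be checked with a few lines (a minimal sketch):

/* Print the physical path of the WORK library to the log */
%put WORK is at: %sysfunc(pathname(work));

/* Turn on detailed CPU, memory and I/O timing statistics */
options fullstimer;

/* List the temporary datasets currently held in WORK */
proc datasets library=work;
quit;

With FULLSTIMER on, compare real time against CPU time in the log: a large gap on WORK-heavy steps usually points to an I/O bottleneck.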

OPTIMIZE YOUR CODE


Many times, a simple change to the code can result in a huge efficiency gain. A quick look at some efficient
SAS coding options:
• If a flat file will be read multiple times, it is better to create a SAS dataset from it first. Reading a
SAS dataset is much faster than reading from a flat file.
• When using arrays in long programs, where the content generated in the DATA step is not intended for
output to the result dataset, ensure the addition of "_TEMPORARY_". This releases the memory after the
processing is complete.
• To reduce I/O, ensure that filters are applied at the beginning of the code, especially while dealing with
huge volumes of data. Even while filtering, a combination of WHERE statements and KEEP statements can
yield additional performance gains. The SAS program data vector allocates buffer space based on the
number of variables being read in and the number of variables created during DATA step processing.
Hence, if we use 4 variables out of 10 from a dataset, a KEEP= option on the SET statement is more
efficient than a KEEP statement at the end of the program, because the KEEP= option on the SET
statement avoids reading the unwanted columns into the buffer.

Less Efficient Code:


DATA sample;
SET source;
Other SAS Statements…
keep var1 var2 var3;
RUN;
Efficient Code:
DATA sample;
SET source (keep = var1 var2 var3);
Other SAS Statements…
RUN;

• Both IF and WHERE statements can be used to subset a dataset based on specified criteria. Though
they produce the same results in most cases, they differ greatly in how they operate on the data. With
the IF statement, the data is read into the program data vector before the condition is verified; thus all
records are read into the PDV irrespective of their values and the criteria. On the contrary, the WHERE
statement checks the criteria before the data is read into the PDV, so unwanted records are never read
into the buffer space at all. Thus the WHERE statement is the better option for subsetting data, especially
for datasets with a large number of variables.

Less Efficient Code:


DATA subst;
SET source;
If sales > 1000;
RUN;
Efficient Code:
DATA subst;
SET source;
Where sales > 1000;
RUN;

HIDDEN GEMS: QUICK TIPS TO ADMINISTER & OPTIMIZE PARAMETERS TO ENHANCE PROCESSING HUGE VOLUMES OF DATA
Many SAS users never adjust the SAS system options and work with the default settings. There are several
hundred such options, and it is virtually impossible to master the right setting for each of these parameters.
This section highlights a few interesting parameters that may offer a huge performance benefit.

BUFNO=, BUFSIZE=, CATCACHE=, AND COMPRESS= SYSTEM OPTIONS


BUFNO: SAS uses the BUFNO= option to adjust the number of open page buffers when it processes a SAS data set.
Increasing this option's value can improve our application's performance by allowing SAS to read more data with
fewer passes; however, memory usage increases as well. Experiment with different values for this option to determine
the optimal value for our needs.
Note: We can also use the CBUFNO= system option to control the number of extra page buffers to allocate for each
open SAS catalog.
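A minimal sketch (library and data set names are illustrative):

/* Set the number of page buffers globally for the session */
options bufno=10;

/* Or set it for a single read, via the BUFNO= data set option */
data work.copy;
set mylib.big (bufno=10);
run;

The data set option form is often preferable, since it confines the extra memory use to the one step that benefits from it.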
BUFSIZE: When the Base SAS engine creates a data set, it uses the BUFSIZE= option to set the permanent page
size for the data set. The page size is the amount of data that can be transferred in one I/O operation to one buffer.
The default value for BUFSIZE= is determined by the operating environment, and note that the default is set to
optimize the sequential access method. To improve performance for direct (random) access, we should change the
value of BUFSIZE=. Whether we use our operating environment's default value or specify a value, the engine always
writes complete pages regardless of how full or empty those pages are.
If we know that the total amount of data is going to be small, we can set a small page size with the BUFSIZE= option,
so that the total data set size remains small and we minimize the amount of wasted space on a page. In contrast, if
we know that we are going to have many observations in a data set, we should optimize BUFSIZE= so that as little
overhead as possible is needed. Note that each page requires some additional overhead.
Large data sets that are accessed sequentially benefit from larger page sizes because sequential access reduces the
number of system calls that are required to read the data set. Note that because observations cannot span pages,
typically there is unused space on a page.
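For example, a larger page size can be requested when the data set is created (the 64K value and names are illustrative; useful values depend on the data and the environment):

/* Create the data set with a 64 KB page size */
data mylib.big (bufsize=64k);
set work.staging;
run;

/* PROC CONTENTS reports the page size actually in use */
proc contents data=mylib.big;
run;

Because the page size is a permanent attribute, changing it means re-creating the data set.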
CATCACHE: SAS uses this option to determine the number of SAS catalogs to keep open at one time. Increasing its
value can use more memory, although this might be warranted if our application uses catalogs that will be needed
relatively soon by other applications. (The catalogs closed by the first application are cached and can be accessed
more efficiently by subsequent applications.)
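As a sketch, the option is set globally for the session (the value 4 is illustrative):

/* Keep up to 4 closed SAS catalogs cached in memory */
options catcache=4;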
COMPRESS: One further technique that can reduce I/O processing is to store our data as compressed data sets by
using the COMPRESS= data set option. However, storing our data this way means that more CPU time is needed to
decompress the observations, as they are made available to SAS. But if our concern is I/O, and not CPU usage,
compressing our data might improve the I/O performance of our application.
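Compression can be requested globally or per data set (names are illustrative):

/* Compress all new data sets in this session; CHAR (YES) suits
   character-heavy data, while BINARY often suits numeric-heavy data */
options compress=yes;

/* Or compress a single data set at creation */
data mylib.trial (compress=yes);
set work.raw;
run;

The log reports the percentage reduction achieved, which makes it easy to verify whether compression is worthwhile for a given data set.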

SASFILE STATEMENT
The SASFILE global statement opens a SAS data set and allocates enough buffers to hold the entire data set in
memory. Once it is read, data is held in memory, available to subsequent DATA and PROC steps, until either a
second SASFILE statement closes the file and frees the buffers or the program ends, which automatically closes the
file and frees the buffers.
Using the SASFILE statement can improve performance by
• Reducing multiple open/close operations (including allocation and freeing of memory for buffers) to process
a SAS data set to one open/close operation
• Reducing I/O processing by holding the data in memory.

If our SAS program consists of steps that read a SAS data set multiple times and we have an adequate amount of
memory so that the entire file can be held in real memory, the program should benefit from using the SASFILE
statement. Also, SASFILE is especially useful as part of a program that starts a SAS server such as a SAS/SHARE
server.
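A minimal sketch of the open/process/close pattern (library, data set, and variable names are illustrative):

/* Load the entire data set into memory */
sasfile mylib.big load;

/* Subsequent steps read from memory, not disk */
proc means data=mylib.big;
run;
proc freq data=mylib.big;
tables country;
run;

/* Close the file and free the buffers */
sasfile mylib.big close;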

IBUFSIZE SYSTEM OPTION


An index is an optional SAS file that we can create for a SAS data file in order to provide direct access to specific
observations. The index file consists of entries that are organized into hierarchical levels, such as a tree structure,
and connected by pointers. When an index is used to process a request, such as for WHERE processing, SAS does
a search on the index file in order to rapidly locate the requested records.
Typically, we do not need to specify an index page size. However, the following situations could require a different
page size:
• The page size affects the number of levels in the index. The more pages there are, the more levels in the
index. The more levels, the longer the index search takes. Increasing the page size allows more index
values to be stored on each page, thus reducing the number of pages (and the number of levels). The
number of pages required for the index varies with the page size, the length of the index value, and the
values themselves. The main resource that is saved when reducing levels in the index is I/O. If our
application is experiencing a lot of I/O in the index file, increasing the page size might help. However, we
must re-create the index file after increasing the page size.
• The index file structure requires a minimum of three index values to be stored on a page. If the length of an
index value is very large, we might get an error message that the index could not be created because the
page size is too small to hold three index values. Increasing the page size should eliminate the error.
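As a sketch, the index page size is set before the index is (re-)created (the data set and variable names are illustrative, and the 8192 value is just an example):

/* Request a larger index page size for indexes created from now on */
options ibufsize=8192;

/* Re-create the index so the new page size takes effect */
proc datasets library=mylib nolist;
modify big;
index create subjid;
quit;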

REUSE SYSTEM OPTION


If space is reused, observations that are added to the SAS data set are inserted wherever enough free space exists,
instead of at the end of the SAS data set. Specifying REUSE=NO results in less efficient usage of space if we delete
or update many observations in a SAS data set. However, the APPEND procedure, the FSEDIT procedure, and other
procedures that add observations to the SAS data set continue to add observations to the end of the data set, as they
do for uncompressed SAS data sets. We cannot change the REUSE= attribute of a compressed SAS data set after it
is created. Space is tracked and reused in the compressed SAS data set according to the REUSE= value that was
specified when the SAS data set was created, not when we add and delete observations. Even with REUSE=YES,
the APPEND procedure will add observations at the end. It may be worthwhile to check the default setting for this
option and set it to YES, especially in environments dealing with a lot of data updates.
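Since REUSE= applies only to compressed data sets, both options are set when the data set is created (names are illustrative):

/* Create a compressed data set that reuses space freed by deletions */
data mylib.events (compress=yes reuse=yes);
set work.events_raw;
run;

/* Or set the session default for all new compressed data sets */
options reuse=yes;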

SAS GRID: A QUICK PRIMER ON GRID (FROM AN ANALYST/STATISTICIAN PERSPECTIVE) AND ITS ADVANTAGES
SAS Grid Manager delivers grid computing capabilities, enabling organizations to create a managed, shared
environment for processing large volumes of data and analytic programs. The grid effectively combines several
servers, with dynamic load balancing abilities.
From an analyst's perspective, setting the IT jargon aside, the GRID manager avoids relying on a single server for a
shared pool of users: it combines a pool of CPUs and balances the load across several machines, providing better
performance and enhanced reliability.
Some key benefits include:
• Automatically tailors SAS Data Integration Studio and SAS Enterprise Miner for parallel processing and job
submission in a grid environment.
• Balances the load of many SAS Enterprise Guide users through easy submission to the grid.
• Provides load balancing for all SAS servers to improve throughput and response time of all SAS clients.
• Uses SAS Code Analyzer to analyze job dependencies in SAS programs and generates grid-ready code:
Used by SAS Data Integration Studio and SAS Enterprise Guide to import SAS programs.
• Provides automated session spawning and distributed processing of SAS programs across a set of diverse
computing resources.
• Speeds up processing of applicable SAS programs and applications, and provides more efficient computing
resource utilization.
• Enables scheduling of production SAS workflows to be executed across grid resources:
Ø Provides a process flow diagram to create SAS flows of one or more SAS jobs that can be simple or
complex to meet our needs.
Ø Uses all of the policies and resources of the grid.
• Enables many SAS solutions and user-written programs to be easily configured for submission to a grid of
shared resources.
• Integrates with all SAS Business Intelligence clients and analytic applications by storing grid-enabled code
as SAS Stored Processes.
• Provides greater resilience for mission-critical applications and high availability for the SAS environment.
• Includes a command-line batch submission utility called SASGSUB:
o Allows us to submit and forget, and reconnect later to retrieve results.
o Enables integration with other standard enterprise schedulers.


• Enables batch submission to leverage checkpoints and automatically restart jobs.
• Applies grid policies to SAS workspace servers when they are launched through the grid.

CONCLUSION
This paper has highlighted basic, easy-to-apply rules for optimizing SAS processing. With some minimal changes to
our code, we can make sure that we process our programs in an effective and efficient manner, leveraging the many
useful features in the SAS system.


ACKNOWLEDGMENTS
The author would like to thank his family, friends, peers and supervisors for their encouragement, support and
suggestions.

CONTACT INFORMATION
Karthikeyan Chidambaram, a SAS certified professional, has over 15 years of experience in SAS in a variety of roles
including SAS administration, statistical analysis and ETL programming. Your comments and questions are valued
and encouraged. Contact the author at:
Karthikeyan Chidambaram
Genentech Inc.
1 DNA Way
South San Francisco, CA 94080
Phone: 805-300-0505
Email: karthihere@hotmail.com , Chidambaram.karthikeyan@gene.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® Indicates USA registration.
Other brand and product names are trademarks of their respective companies.
