
Technical report

PASW Modeler Server Performance, Optimization, and Sizing

Table of contents
Introduction
High performance out-of-the-box
Scaling the data mining process with SPSS Predictive Enterprise Services
Performance optimization
Advanced performance optimization
Scoping and sizing PASW Modeler Server
Conclusion
About SPSS Inc.

SPSS is a registered trademark and the other SPSS Inc. products named are trademarks of SPSS Inc. All other names are trademarks of their respective owners. © 2009 SPSS Inc. All rights reserved. CSWP-0209

Introduction
Data mining offers organizations many benefits, including a more detailed view of their customers, a clearer view of current conditions, and deeper insight into future events. By choosing a high-performance data mining tool, organizations can mine their data more efficiently and gain a significant return on investment (ROI). PASW Modeler*, the leading data mining workbench from SPSS Inc., enables organizations to easily and quickly mine many types of data, including large datasets. The result: more business value than other solutions can offer.

PASW Modeler uses a scalable, three-tiered architecture to improve modeling productivity and deployment when working with large datasets. The PASW Modeler Client tier passes data mining processes to PASW Modeler Server. PASW Modeler Server** then analyzes these tasks to determine which ones should be executed within the database. After the database processes those tasks, it passes only the relevant aggregate or summary data to PASW Modeler Server. Since data pre-processing (typically 80-90 percent of the data mining effort) occurs in the database tier, users accelerate modeling, maximize resources, and minimize network traffic.

Data mining is an exploratory and interactive process requiring immediate feedback, so high-performance tools like PASW Modeler Server are essential. PASW Modeler Server provides increased productivity and faster access to results. When analytical results are deployed into operational systems, the impact of performance is even more significant because of high data volumes and real-time constraints.

Data mining is a core process involved in predictive analytics, which combines advanced analytic techniques and decision optimization to inform and direct decision making. The value of predictive analytics is that it gives your organization the ability to act on the results, and PASW Modeler Server's high performance is crucial to timely action.
This technical brief serves as a guide for understanding and maximizing PASW Modeler Server's already high performance. It focuses on PASW Modeler Server's out-of-the-box performance, scalability, and performance optimization, as well as its scoping and sizing requirements.
* PASW Modeler, formerly called Clementine, is part of SPSS Inc.'s Predictive Analytics Software portfolio.
** PASW Modeler Server, formerly called Clementine Server, is part of SPSS Inc.'s Predictive Analytics Software portfolio.


High performance out-of-the-box


PASW Modeler Server has been designed and developed to provide high performance and scalability for all data mining tasks. SQL generation and parallel processing, for example, are performed automatically. As a result, PASW Modeler users don't need to make any changes to the way they work to get consistently high performance. In our benchmark tests of PASW Modeler Server performance [1], we measured the ability of PASW Modeler to carry out the common tasks of model building, model scoring, and data preparation.

Model building: 16 million records in under five minutes
PASW Modeler Server was able to build a logistic regression model from approximately 16 million records [2] in less than five minutes (see Figure 1). This dataset is larger than those typically used for model building. Against a more modest-sized dataset of 500,000 records, all of the model types were built in less than two minutes (see Figure 2). PASW Modeler Server transforms a time-consuming process into an iterative one and vastly reduces the time required to build models and to find the best model.

Figure 1: This stream was used in tests of model building performance.

Figure 2: The elapsed time taken to build a model using different algorithms [3].

[1] Test environment: 2 x Intel Xeon 3.6GHz (hyperthreaded), 8GB RAM, 36GB RAID 1 System disk, 440GB RAID 0 Data disk, Microsoft Windows Server 2003 Enterprise x64 SP1, Microsoft SQL Server 2000 SP4, and Clementine 10.0.
[2] 21 fields used, mixture of data types.
[3] Neural network build time is affected by randomization in the selection of records to prevent overtraining.


Figure 3: This stream was used in tests of model scoring performance.

Model scoring: 32 million records in close to eight minutes
In a test scoring records against a classification model (see Figures 3 and 4), PASW Modeler Server accessed data from a table of 32 million records [4], scored the data against a decision tree model, and wrote the scores to a new database table in less than eight minutes. This scoring was achieved at a sustained rate of close to 65,000 records per second, equivalent to 225 million records per hour.

Data preparation: 16 million customer records processed against 42 million products in eight minutes
Data mining is about more than model building and scoring. A large part of the data mining process involves preparing the data. As seen in Figure 5, our tests of data preparation involved the performance of multiple, common data preparation steps, including joining customer data to a product dataset of nearly three times its size.


Figure 4: The elapsed time taken to score a C&RT decision tree model.

Figure 5: This stream was used in tests of data preparation performance.

[4] 21 fields used, mixture of data types.


PASW Modeler Server ran the stream against 16 million customer records in approximately eight minutes for an overall rate of over 33,000 customers per second (see Figure 6).
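The rates quoted in these benchmarks follow directly from record counts and elapsed times. The short sketch below shows the arithmetic; the helper name `throughput` is ours and is not part of any PASW Modeler interface, and the eight-minute figure is the approximate value given in the text.

```python
from typing import Tuple


def throughput(records: int, elapsed_seconds: float) -> Tuple[float, float]:
    """Return (records per second, records per hour) for a completed run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    per_second = records / elapsed_seconds
    return per_second, per_second * 3600


# Figures quoted in the text: 16 million customers in roughly eight minutes.
per_second, per_hour = throughput(16_000_000, 8 * 60)
print(f"{per_second:,.0f} records/s, {per_hour / 1e6:,.0f} million records/hour")
```

Because the published timings are rounded ("approximately", "less than"), rates computed this way should be read as approximations rather than exact benchmark results.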


Scaling the data mining process with SPSS Predictive Enterprise Services
Raw data processing speed is not the only factor affecting performance. Frequently, the volume of models, rather than the volume of data, is the bottleneck hampering data mining productivity. In many organizations, the number of data miners, analysts, and others involved in the process can also have a very significant impact on performance.

Generating real performance from data mining activities often depends more on an organization's ability to manage its analytical assets and complex, multi-part analytical processes than on raw data processing performance alone. For example, powerful servers are often underutilized when organizations are unable to put the right models in the right place and effectively schedule their execution. However, with SPSS Predictive Enterprise Services, organizations receive a complete, enterprise solution to the problems of analytical asset and process management.

By using PASW Modeler Server with SPSS Predictive Enterprise Services, one financial services organization optimized its operational analytics, reducing the time taken to execute a key analytical process by a factor of 80. This resulted in major, quantifiable savings.

SPSS Predictive Enterprise Services uses an advanced, service-oriented architecture to improve the management of predictive models and related analytical processes within your organization's business operations. It extends PASW Modeler's rapid model development and deployment capabilities to create more manageable predictive analytics solutions. By providing an integrated way to centralize and organize predictive models, and also automate predictive analytics processes, SPSS Predictive Enterprise Services helps organizations improve analytical asset and process management.

Analytical asset management
The resources involved in a predictive analytics process may include:
- PASW Modeler streams, models, and outputs
- Documentation
- External scripts for data preparation or report generation
- Resources from other predictive analytics tools, such as PASW Statistics syntax and outputs, and SAS code

Figure 6: The elapsed time taken to perform data preparation steps.


These are analytical assets: the tangible results of the efforts of data mining teams. SPSS Predictive Enterprise Services provides a centralized repository that offers:
- Security and access control
- Version control and labeling
- Audit and tracking capabilities
- Advanced data mining-aware organization and search facilities
- Direct integration with PASW Modeler and also with PASW Statistics tools

Managing analytical assets provides a foundation for data mining processes, enabling these processes to scale to the enterprise level.

Analytical process management
Developing robust processes for data mining activities such as model building, scoring, and validation is integral to delivering high performance on an enterprise scale. These processes often involve the combination of multiple tools and technologies. SPSS Predictive Enterprise Services provides a visual workflow user interface, Predictive Enterprise Manager, which allows a full, end-to-end process to be defined using assets stored in the repository and a mix of technologies (see Figure 7). Analytical processes are fully integrated with the repository, automatically extracting the required objects and versions, and storing the results. A scheduling service allows these processes to be executed at regular intervals, and a notification service provides e-mail tracking.

Figure 7: Predictive Enterprise Manager allows users to create and schedule multi-part, multi-tool, analytical processes via a visual workflow interface.


Performance optimization
Most of PASW Modeler Server's high performance is achieved through performance optimizations that are switched on by default. Many PASW Modeler operations can be further improved by fine-tuning performance parameters.

Maximize performance with in-database mining
One of the key benefits of PASW Modeler Server is that it allows organizations to fully utilize their investments in high-performance database systems. Many organizations have invested heavily in a database infrastructure and business intelligence systems, but these systems are often under-utilized by the analytical tools that use them.

PASW Modeler Server improves performance when mining large datasets by maximizing in-database mining. For example, you can delegate as many operations as possible to your IBM DB2 Data Warehouse database or Oracle Database 10g, taking advantage of database optimization and reducing data movement.

With PASW Modeler Server, processing is executed in the database via SQL queries. Any operation that cannot be represented using SQL queries is performed by the server itself. Only relevant results are passed back to the client; perhaps more importantly, data transfer between the database and PASW Modeler Server is minimized. Another advantage of PASW Modeler Server's in-database mining is that it minimizes, and can even eliminate, data transfer costs.

In a test measuring the impact of in-database mining (see Figure 8), the same PASW Modeler stream was executed with full SQL generation, with no SQL generation, and with scoring-only SQL generation (which executed the scoring in-database but still transferred data to and from the database).

While SQL generation of the scoring was approximately 10 percent quicker than scoring in the application, the biggest factor in performance is data transfer, which accounts for more than 85 percent of the elapsed time for scoring. The only way to manage the data transfer bottleneck is to ensure that less data is transferred.

PASW Modeler Server's SQL generation reduces data transfer to a minimum and leverages your investment in high-performance databases.
Figure 8: Scoring stream executed with full SQL generation, SQL generation of scoring only, and no SQL generation

Data transfer costs are the most significant factor affecting performance. For example, over 85 percent of the time allotted to score a model can be attributed to data transfer between the database and the scoring application.
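To make the data transfer argument concrete, the following sketch compares an aggregation performed in the application with the same aggregation performed in the database. It is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in database, the table and column names are hypothetical, and it does not reproduce the SQL that PASW Modeler Server actually generates.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

# Processing in the application: every row crosses the database boundary,
# and the aggregation happens outside the database.
rows = conn.execute("SELECT customer_id, amount FROM transactions").fetchall()
totals_app = {}
for customer_id, amount in rows:
    totals_app[customer_id] = totals_app.get(customer_id, 0.0) + amount

# Processing in the database: the database aggregates, and only one summary
# row per customer is transferred.
summary = conn.execute(
    "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"
).fetchall()
totals_db = dict(summary)

print(len(rows), "rows transferred when aggregating in the application")
print(len(summary), "rows transferred when aggregating in the database")
assert totals_app == totals_db  # same result either way
```

Both approaches produce the same totals; the difference is how many rows have to leave the database to get there.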


SQL feedback, previewing, and viewing
There will be times when analysts will want more control over the optimization of PASW Modeler streams. PASW Modeler Server supports this by providing immediate feedback: upon execution, every PASW Modeler node that can be fully translated to SQL is highlighted (see Figure 9).

Figure 9: SQL generation and highlighting in a PASW Modeler stream

In Figure 9, the PASW Modeler stream is executed using SQL generation. Many nodes are purple, rather than the usual white, during execution. Purple nodes mean that the operations represented by those nodes have been translated into SQL and executed in-database. This feedback helps an analyst ensure that as much of the stream as possible is executed in the database. Additional options allow the user to examine the SQL that is generated.

Stream optimization relies on intelligent SQL generation and stream execution
SQL generation is a powerful capability, but it depends upon analysts understanding how PASW Modeler operations can be executed on a database. Analysts, however, are focused on solving business problems rather than on optimizing their PASW Modeler streams for performance. For this reason, PASW Modeler Server features advanced optimization that intelligently re-orders operations in the PASW Modeler stream to maximize performance without altering results. Data miners can organize streams in a way that makes sense to them, and PASW Modeler Server will reorganize those same operations in a way that makes sense to the database.


In Figure 10, the derive node contains an operation that cannot be carried out in the database. PASW Modeler optimizes the process so that the select operation is performed before the derive operation, thereby reducing data transfer and improving performance.

In-database caching
One common user optimization is to set up a cache on a node. The next time data is passed through that node, the cache is filled with that data. From then on, the data is read from the cache rather than from the data source. This can be a useful way to ensure that expensive data processing is only executed once. Normally, the cache is stored as a temporary file on the file system, but PASW Modeler Server also supports the caching of this data into a temporary table in the database. When combined with SQL optimization, this may result in significant gains in performance. As illustrated in Figure 11, the output from a stream that merges multiple tables to create a data mining view may be cached and reused as needed.


Figure 10: Stream optimization
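The reordering shown in Figure 10 can be illustrated with a small sketch. It assumes a select step that does not depend on the derived field, which is the condition under which swapping the two steps leaves the results unchanged. The function and field names are hypothetical, and the code only mimics the idea; it is not how PASW Modeler Server implements its optimizer.

```python
from typing import Dict

Record = Dict[str, float]

derive_calls = {"count": 0}


def expensive_derive(record: Record) -> Record:
    """Stand-in for a derive step that cannot be expressed in SQL."""
    derive_calls["count"] += 1
    return {**record, "score": record["spend"] ** 0.5}


def is_selected(record: Record) -> bool:
    """Select step; it only reads fields that exist before the derive."""
    return record["region"] == 1


records = [{"region": i % 4, "spend": float(i)} for i in range(1000)]

# As authored: derive first, then select.
authored = [r for r in map(expensive_derive, records) if is_selected(r)]
calls_authored = derive_calls["count"]

# Reordered: select first, then derive. Safe here because the select
# does not depend on the derived field.
derive_calls["count"] = 0
reordered = [expensive_derive(r) for r in records if is_selected(r)]
calls_reordered = derive_calls["count"]

assert authored == reordered  # identical results
print(calls_authored, "derive calls as authored;", calls_reordered, "after reordering")
```

When the select can also run in the database, the reordering additionally reduces the number of records that must be transferred before the derive is applied.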

Figure 11: Setting a cache on a node that is likely to be re-executed will store the data in a temporary table on the database, when possible. Executing streams from that cached node will allow further in-database operations.

Plus, by automatically generating SQL for all downstream nodes, performance can be improved further. In Figure 11, the select operation is highlighted, indicating that the operation is being executed in the database from the filled database cache.

In-database model building
PASW Modeler Server supports integration with data mining algorithms that are available from other database vendors. Organizations can use PASW Modeler to manage the entire data mining process while modeling with the database-native algorithms provided by these vendors. Using in-database modeling ensures that data transfer is minimized, even during the model building phase. It also helps organizations leverage their existing investments in IBM DB2 Intelligent Miner, Microsoft SQL Server 2005, and Oracle Data Mining.
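As a rough illustration of the in-database caching idea described earlier, the sketch below materializes a merged "data mining view" once as a temporary table and then runs two downstream queries against it. It uses Python's `sqlite3` module as a stand-in database with a hypothetical schema; the SQL that PASW Modeler Server generates for a cached node may look quite different.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, "gold" if i % 5 == 0 else "standard") for i in range(500)],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i % 500, float(i % 90)) for i in range(5000)],
)

# Fill the cache: run the expensive merge once and keep the result in a
# temporary table inside the database.
conn.execute("""
    CREATE TEMP TABLE mining_view AS
    SELECT c.id, c.segment, SUM(o.amount) AS total_spend
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.segment
""")

# Downstream steps read from the cached table, so they can still run as SQL
# without repeating the merge.
gold = conn.execute(
    "SELECT COUNT(*) FROM mining_view WHERE segment = 'gold'"
).fetchone()[0]
high = conn.execute(
    "SELECT COUNT(*) FROM mining_view WHERE total_spend > 400"
).fetchone()[0]
print(gold, "gold customers;", high, "customers with total_spend > 400")
```

The merge runs once; every later query touches only the cached table.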


Advanced performance optimization


In addition to in-database mining, PASW Modeler Server provides a number of capabilities that allow users to optimize the performance of their streams.

Database bulk-loading
Data movement is often a bottleneck in performance, especially when writing data to a database. PASW Modeler Server provides a number of features to optimize this process for large data volumes. By default, writing data to a database is performed on a row-by-row basis. While this prevents errors and provides data security, it slows performance. Allowing PASW Modeler Server to commit multiple rows at a time is a good way to ensure more reasonable performance, and this option is available by default.

In addition to the batch committal of records, PASW Modeler Server supports two types of bulk loading, as shown in Figure 12. The first is provided through ODBC bulk loading facilities. The second type uses an external bulk loading tool to allow a database-native solution. External bulk loading scripts are provided for Microsoft SQL Server, Oracle Data Mining, IBM DB2 Intelligent Miner, Netezza Performance Server, Teradata Warehouse, and IBM Redbrick Warehouse databases. These scripts can be customized, and custom scripts may be written for other databases.

Database indexing
Indexing database tables maintains the performance of in-database options. Correct indexing significantly impacts many subsequent database operations. As shown in Figure 13, PASW Modeler Server enables users to create indexes on tables exported from PASW Modeler. Simple indexes can be created easily, and PASW Modeler also allows you to customize the SQL statement used to create the index (for instance, to create a BITMAP, UNIQUE, or FILLFACTOR index).
Figure 12: Database export advanced options allow bulk loading to database via ODBC or through an external loader.

Figure 13: Create indexes on database tables to improve database performance.
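The difference between row-by-row and batched commits, and the effect of indexing an exported table, can be sketched in a few lines. The example uses Python's `sqlite3` module as a stand-in database; the table name, index name, and batch size are hypothetical, and the PASW Modeler export options in Figures 12 and 13 are configured through the product's dialogs rather than through code like this.

```python
import sqlite3

# Hypothetical export table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (customer_id INTEGER, score REAL)")
rows = [(i, (i % 100) / 100) for i in range(50_000)]

# Row-by-row: one INSERT and one commit per record.
for row in rows[:1000]:
    conn.execute("INSERT INTO scores VALUES (?, ?)", row)
    conn.commit()

# Batched: many rows per call, one commit per batch.
BATCH_SIZE = 10_000
for start in range(1000, len(rows), BATCH_SIZE):
    conn.executemany(
        "INSERT INTO scores VALUES (?, ?)", rows[start:start + BATCH_SIZE]
    )
    conn.commit()

# Index the exported table so later in-database selects and joins on
# customer_id do not need a full table scan.
conn.execute("CREATE UNIQUE INDEX idx_scores_customer ON scores (customer_id)")
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0]
print(count, "rows written")
```

With a real client/server database, each commit typically involves a network round trip and a log flush, which is why committing in batches tends to matter far more there than in this in-memory example.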


Optimized joins and sorts
By default, PASW Modeler has to make assumptions about the state of data in the system. For example, PASW Modeler cannot assume that any data has already been sorted, so many operations perform a sort whenever one is required, even if that sort is redundant. PASW Modeler allows the user to optimize a sort or join operation by specifying any existing sort order on the data. This eliminates redundancy and improves performance, as shown in Figure 14.

Users can also optimize the performance of PASW Modeler Server through special case algorithms for joins. PASW Modeler's default join algorithm is designed to perform optimally when joining datasets of similar size. In some very common operations, such as when using a join to connect an ID in one table to a label or description from another (e.g., joining a product code in a table of transactions to a product name in a look-up table), the default join is inefficient. PASW Modeler offers an alternate join algorithm for these situations that significantly boosts performance, as can be seen in Figure 15.

High performance through parallel data processing
Multithreading is a method by which an application's process can perform more than one task at the same time. Threads share the same memory space, and must synchronize at certain points within their execution to access shared resources safely. Operating systems provide low-level mechanisms to support this synchronization. If an application uses more than one thread to execute, it is said to be multithreaded.

Symmetric multiprocessing (SMP) machines are widely used and available for all platforms supported by PASW Modeler Server. They comprise multiple CPUs sharing access to the same memory, disk, network, and other I/O resources. When a multithreaded application runs on an SMP box, threads may be distributed across the CPUs and execute truly in parallel. Application processes and individual threads can usually migrate dynamically between CPUs to balance processor load. This is generally handled transparently by the operating system.

PASW Modeler Server employs parallel processing to improve performance in both data processing and modeling operations.

Figure 14: Impact of pre-sorting optimization on sort performance

Figure 15: Impact of specialized join when joining a large table to a small table (250,000 records)
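For the look-up join case described under "Optimized joins and sorts" (a large transaction table joined to a small table of labels), one common technique is to hold the small table in memory and stream the large table past it once. The sketch below shows that technique under our own assumptions. The report does not say which algorithm PASW Modeler uses for its alternate join, so this should not be read as a description of the product's implementation, and all names and data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def lookup_join(
    transactions: Iterable[Tuple[int, str, float]],
    product_names: Dict[str, str],
) -> Iterator[Tuple[int, str, float, str]]:
    """Stream the large table once, probing an in-memory dict built from the small one."""
    for transaction_id, product_code, amount in transactions:
        name = product_names.get(product_code)
        if name is not None:  # inner join: drop codes with no match
            yield transaction_id, product_code, amount, name


# Hypothetical data: a small look-up table and a larger transaction table.
product_names = {"P1": "Kettle", "P2": "Toaster", "P3": "Blender"}
transactions = ((i, f"P{i % 4}", 9.99) for i in range(1, 9))

for row in lookup_join(transactions, product_names):
    print(row)
```

The large table is never sorted or held in memory as a whole, which is the property that makes this style of join attractive when one input is much smaller than the other.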


Parallel data processing
PASW Modeler Server uses a parallel data-sorting algorithm to improve the performance of a number of data processing operations. Sorting is used by many PASW Modeler operations, including binning, model evaluation, merge and, of course, the sort operation itself. All of these operations benefit from the parallelization of the sort operation.

The parallelized sort algorithm uses a technique called record parallelism. This technique distributes records across a number of separate sorting processes. Each process sorts its own subset of records and then the results are joined. Figure 16 shows the effect of running a parallelized sort on multiprocessor hardware. At high data volumes, sort times can be reduced by more than 30 percent.

Parallel predictive model building
Parallel processing techniques are also used by PASW Modeler's C5.0 decision tree algorithm and can improve performance in building decision trees and rule sets. The benefits depend largely on dataset size (both the number of records and the number of fields), but they can provide a useful boost to what can be a time-consuming process.
Figure 16: Impact of multiple CPUs on data sorting performance
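The record parallelism technique can be sketched as three steps: partition the records, sort each partition separately, and merge the sorted partitions. The code below is a minimal illustration of that structure, not PASW Modeler Server's implementation; the function name and worker count are our own. Note also that CPython threads do not execute CPU-bound sorts truly in parallel, so a process pool (`ProcessPoolExecutor`) would be the usual choice when an actual speed-up is the goal; a thread pool is used here only to keep the example short and portable.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def parallel_sort(records: List[int], workers: int = 4) -> List[int]:
    """Partition the records, sort each partition separately, then merge."""
    if not records:
        return []
    chunk = -(-len(records) // workers)  # ceiling division
    partitions = [records[i:i + chunk] for i in range(0, len(records), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, partitions))
    # k-way merge of the already sorted partitions
    return list(heapq.merge(*sorted_parts))


random.seed(0)
data = [random.randrange(1_000_000) for _ in range(100_000)]
assert parallel_sort(data) == sorted(data)
```

The final merge is sequential, which is one reason the gain from extra CPUs is smaller than the number of CPUs might suggest.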


Scoping and sizing PASW Modeler Server


Many factors must be considered when scoping hardware requirements for a PASW Modeler Server installation. The breadth of PASW Modeler operations and differences in data volumes make it difficult to estimate performance for any specific hardware configuration.

Impact of CPUs on performance
Obviously, the core speed of any individual CPU will impact data mining performance. Almost all data mining operations, especially modeling, are heavily processor dependent, so an increase in CPU speed will produce a proportional increase in performance for many PASW Modeler processes.

The main benefits of multiple CPUs (or multicore CPUs) occur when running multiple streams. This means that the number of users will often be the deciding factor in determining the optimum number of CPUs. Multiple CPUs will also benefit parallelized operations, but the main benefits will be from supporting multiple users.


Table 1: Recommended number of CPUs per number of users

Number of users    Number of CPUs
1-2                1
3-4                2
5-10               4
11-20              8
21+                16

For a production server running scheduled data mining via SPSS Predictive Enterprise Services, the number of CPUs should be determined by the number of separate processes to be performed simultaneously. Maximum performance can be achieved, for instance, by splitting a model scoring process across multiple CPUs or building multiple models simultaneously.

Impact of physical memory on performance
Most PASW Modeler operations can be performed on large volumes of data with minimal memory usage. Only certain operations, such as sorting, joining, and modeling, require data to be temporarily stored in memory. If not enough memory is available, these operations will store part of the data as virtual memory on disk. This can affect performance, since disk access is significantly slower than memory access.

As with CPU usage, the number of users impacts the required memory for normal operation. Memory requirements depend on data volume. Typical minimum requirements can be found in Table 2.

Table 2: Minimum RAM for number of users in normal use

Number of users    Minimum RAM
1-2                1GB
3-4                2GB
5-10               4GB
11-20              8GB
21+                16GB
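The guidance in Tables 1 and 2 amounts to a simple banded look-up, sketched below. The band boundaries and values are taken from the two tables; the names `SIZING_BANDS` and `recommend` are ours. Treat the output as a starting point only, since the text notes that data volumes and workload also affect requirements.

```python
from typing import List, Tuple

# (upper bound on users, CPUs from Table 1, minimum RAM in GB from Table 2)
SIZING_BANDS: List[Tuple[float, int, int]] = [
    (2, 1, 1),
    (4, 2, 2),
    (10, 4, 4),
    (20, 8, 8),
    (float("inf"), 16, 16),
]


def recommend(users: int) -> Tuple[int, int]:
    """Return (CPUs, minimum RAM in GB) for a given number of users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    for upper, cpus, ram_gb in SIZING_BANDS:
        if users <= upper:
            return cpus, ram_gb
    raise AssertionError("unreachable: the last band is unbounded")


for n in (1, 4, 7, 25):
    cpus, ram_gb = recommend(n)
    print(f"{n} users: {cpus} CPUs, at least {ram_gb} GB RAM")
```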

Large volume model building
Model building is one of the more memory-intensive operations in the data mining process. This is because the model-building algorithms require access to the entire modeling dataset, often making multiple passes at the data. For this reason, model building is usually performed on subsets or samples of data. It is normally more productive to build different models on a small subset of the data and then choose the best model, rather than to build a single model on a larger dataset. This type of model building can usually be performed within minimal memory requirements.


Using more data rarely improves the predictive accuracy of a model. However, if model building on larger volumes is required, additional memory can help performance.

Table 3: Estimated RAM required (GB) to avoid disk-caching during model building [5]

            Rows (millions)
Columns     0.1   0.5   1     2     4     8     16    32    64
10          0.5   0.5   0.5   0.5   0.5   1     2     4     8
20          0.5   0.5   0.5   0.5   1     2     4     8     16
50          0.5   0.5   1     2     4     8     16    32    -
100         0.5   1     2     4     8     16    32    -     -
500         2     4     8     16    32    -     -     -     -
1000        4     8     16    32    -     -     -     -     -

Table 3 provides guidance on the memory required to avoid disk-caching on model building operations, based on the memory usage of the neural network, K-means, and Kohonen modeling algorithms.

Memory configuration
PASW Modeler Server will, by default, limit the amount of physical memory used by any single process to ensure that other simultaneous processes aren't affected. A maximum of 25 percent of available memory will be allocated for model building, and approximately 10 percent will be available for sorting operations. The sorting figure is lower because there may be multiple sorts in a single stream. The PASW Modeler Server administrator can modify these settings.

Impact of disk space on performance
Before addressing disk space requirements, it is important to understand the volume of data that is likely to be used for the actual data mining. Most organizations store many terabytes of data, especially transactional data, but this amount will rarely be used. Normally the data is aggregated, selected, or sampled before it is used for analysis. While large data volumes are typically used in model scoring, the model scoring processes usually rely on operations that don't use a lot of system resources.

When trying to maximize performance, disk usage for data processing steps can be relatively high. The user often caches data to minimize execution times, and some operations will spill to disk when physical memory is unavailable. In addition, some operations may produce a dataset larger than the raw input data, further increasing disk requirements.

[5] Estimates based on neural network, Kohonen, and K-means algorithm memory requirements. Maximum physical memory may also be limited by the operating system.


To understand disk usage, a series of tests was performed based upon the PASW Modeler Application Template for customer relationship management (CRM). This template consists of streams that demonstrate data mining techniques used for CRM. The source dataset was 72MB in size, representing a sample of 140,000 customers and 360,000 transactions, plus other associated data. The data was stored in text files and all operations were carried out by PASW Modeler Server; no SQL generation was required [6].

As shown in Figure 17, the tests measured the maximum amount of disk space needed to execute over 100 separate streams. The vast majority of streams required little disk usage, but others used over four times the disk space of the source data. Given that these data preparation steps are typically executed infrequently (it's a best practice to store the results of such processing as intermediate files or tables), a conservative rule of thumb is to reserve three to five times the disk space required to store the original data.

Table 4: Estimated disk space required (GB) for data mining (15 users) [7]

            Rows (millions)
Columns     1      2      4      8      16     32     64
10          0.5    1      2      4      8      16     32
20          1      2      4      8      16     32     64
50          2.5    5      10     20     40     80     160
100         5      10     20     40     80     160    320
500         25     50     100    200    400    800    1600
1000        50     100    200    400    800    1600   3200
Figure 17: Percentage of original disk space required for data mining stream operations.


This rule holds for small numbers of users because users will rarely perform high disk-usage operations simultaneously. In addition, organizations can minimize overall disk usage by scheduling expensive data preparation steps during times of low system usage.
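The values in Table 4 can be reproduced from the two assumptions stated in footnote 7: 1 million rows of 10 columns occupy about 100MB, and a working multiplier of 5 is applied. The sketch below encodes that arithmetic; the function and constant names are ours, and the result is an estimate with the same caveats as the table.

```python
MB_PER_MILLION_ROWS_PER_10_COLUMNS = 100  # footnote 7, high estimate
WORKING_MULTIPLIER = 5                    # footnote 7, high estimate


def estimated_disk_gb(rows_millions: float, columns: int) -> float:
    """Estimate the working disk space (GB) using the footnote 7 assumptions."""
    raw_mb = rows_millions * (columns / 10) * MB_PER_MILLION_ROWS_PER_10_COLUMNS
    return raw_mb * WORKING_MULTIPLIER / 1000  # 1 GB taken as 1000 MB, matching Table 4


for rows, cols in [(1, 10), (16, 50), (64, 1000)]:
    print(f"{rows}M rows x {cols} columns: {estimated_disk_gb(rows, cols):g} GB")
```

For row or column counts that fall between the table's entries, the same function gives an interpolated estimate, for example `estimated_disk_gb(3, 30)`.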

[6] SQL generation typically reduces the disk space requirements for PASW Modeler Server since many of the data preparation steps can be carried out on the database.
[7] Estimates based on 1 million rows/10 columns requiring 100MB disk (high estimate) and a working multiplier of 5 times (high estimate for single user).


Conclusion
The ever-growing amount of data created by organizations presents opportunities and challenges for data mining. The PASW Modeler data mining solution makes it easy to use business knowledge to quickly develop, update, and deploy predictive models. Furthermore, PASW Modeler Server's combination of high performance, scalability, performance optimization options, and flexible hardware requirements enables it to handle large and complex data mining projects. With PASW Modeler Server, your organization can:
- Utilize your investment in high-performance databases for all data mining tasks, ensuring high performance and minimizing data transfer costs
- Maximize your use of multiple CPUs (or multicore CPUs) in your operating environment by using parallel processing during a number of data preparation and model-building operations
- Use in-database caching, database write-back with indexing, and optimized merging to join tables outside of the database

Scaling the entire data mining process with PASW Modeler Server makes it possible for your organization to analyze large volumes of data efficiently, shortening the time needed to turn data into better business decisions that boost your ROI.

About SPSS Inc.


SPSS Inc. (NASDAQ: SPSS) is a leading global provider of predictive analytics software and solutions. The company's predictive analytics technology improves business processes by giving organizations consistent control over decisions made every day. By incorporating predictive analytics into their daily operations, organizations become Predictive Enterprises, able to direct and automate decisions to meet business goals and achieve measurable competitive advantage. More than 250,000 public sector, academic, and commercial customers rely on SPSS Inc. technology to help increase revenue, reduce costs, and detect and prevent fraud. Founded in 1968, SPSS Inc. is headquartered in Chicago, Illinois. For additional information, please visit www.spss.com.

To learn more, please visit www.spss.com. For SPSS Inc. office locations and telephone numbers, go to www.spss.com/worldwide.