Tony Petrossian, Ann Matzou, Kwai Wong
An Introduction to System Sizing for Data Warehousing Workloads
Sizing the hardware for a data warehouse
Today’s successful enterprises have a significant interest in Business Intelligence (BI) and data warehousing (DW) because they rely on these tools to gain a better understanding of their business and to establish a competitive position within their marketplaces. When reviewing analyst reports and market research papers related to data warehousing and Business Intelligence applications, we can easily see the following common theme: “Data volumes are growing at unprecedented rates.” To accommodate huge volumes of data, we must learn to build large data warehouses that can function successfully and provide an ever-increasing return on investment. The need to effectively process large volumes of data requires abilities beyond just storing the data in a warehouse. This paper introduces the reader to the various aspects of sizing a system for data warehousing. To bring awareness to the critical issues that are frequently ignored in sizing projects, a sizing example is provided that estimates the hardware requirements of a recently published data warehousing benchmark.
© Copyright IBM Corp. 2004. All rights reserved.
The common misconception about sizing
One of the key elements contributing to the success of a data warehouse project is the hardware that stores, processes, and facilitates the movement of the data. Obviously, a large warehouse requires a large storage capacity, but the challenges of building a successful data warehouse are not limited to amassing a huge storage complex. Unfortunately, many system sizing exercises put too much emphasis on the capacity and function of the storage without considering the overall I/O subsystem and the balance of system resources needed to make efficient use of the storage investment. The ability to attach a large storage complex to a system does not mean that the system is appropriately equipped to process the large volumes of data within a reasonable window of time.
Understanding the sizing problems for a data warehouse
Sizing a system for a new data warehouse project without any experimental data can be a daunting task. Unlike more traditional OLTP workloads, only a small portion of common performance information can be carried over when sizing different data warehouse systems and applications. Most OLTP workloads have well understood units of work per transaction, so resource requirements can be scaled using transaction rates and the number of users. In contrast, a unit of work in a data warehouse application is variable and mostly unrelated to the data size. This variability makes it difficult to compare the resource utilization of different DW applications when estimating system requirements.

Many existing DW installations have made a science of capacity planning and of measuring the resource utilization of their workloads. Unfortunately, most of this information is unavailable or inapplicable to a new installation, and the ad hoc nature of a DW workload makes it difficult to compare different systems. Estimating CPU requirements for data processing in a data warehouse is similarly complex: the CPU needed to process 100 MB of data can vary greatly with the complexity of the queries. To build a knowledge base from which to estimate the processing requirements for a specific warehouse workload, be prepared to experiment, use benchmarks, seek expert opinions, and even guess. Understanding these sizing problems helps in building flexible configurations that leave room for future refinement.
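The point about 100 MB of data can be made concrete with a back-of-envelope model. This is only an illustrative sketch: the per-class processing rates below are invented assumptions, not measurements from any system.

```python
# Illustrative model: CPU time to process the same 100 MB varies widely
# with query complexity. The MB/s-per-CPU rates below are assumptions.
SCAN_RATE_MB_PER_CPU_SEC = {
    "simple filter": 400.0,        # assumed: light per-row work
    "multi-way join": 80.0,        # assumed: moderate per-row work
    "heavy aggregation": 40.0,     # assumed: many arithmetic ops per row
}

def cpu_seconds(data_mb, query_class):
    """Estimated CPU-seconds one processor needs for data_mb of data."""
    return data_mb / SCAN_RATE_MB_PER_CPU_SEC[query_class]

for qc in SCAN_RATE_MB_PER_CPU_SEC:
    print(f"{qc}: {cpu_seconds(100, qc):.2f} CPU-seconds per 100 MB")
```

Even with these invented rates, the same 100 MB costs anywhere from 0.25 to 2.5 CPU-seconds — a tenfold spread, which is why experimentation and benchmarking are needed before the rates for a specific workload can be trusted.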
Accuracy goal in sizing estimates
An overly complex sizing methodology that requires massive amounts of estimated input will most likely produce a false sense of accuracy without producing a better system sizing estimate. The goal should be to produce a sizing estimate for a flexible system configuration with room for minor adjustments in the future. The system should have a good balance of resources that scale proportionally. It is important to remember that the outcome of any sizing methodology is an estimate; the accuracy can be improved, but it will never reach one hundred percent. It is critical to recognize the point of diminishing returns when going through a sizing process: somewhere between merely knowing the size of the data and understanding the resource requirements of every possible query lies enough information to achieve a reasonable sizing estimate. Each sizing effort should include an accuracy goal and a margin of error based on the level of existing knowledge of the application. Like any other business decision, this task requires risk calculation and contingency planning.

An alternative to a sizing estimate is to run a custom benchmark using the specific application and data. Even when feasible, such efforts are usually very expensive, and in most cases an application is built after the hardware infrastructure is installed, so benchmarking the application before buying hardware is not possible.
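One simple way to carry an accuracy goal through a sizing exercise is to keep the margin of error attached to every estimate. A minimal sketch, in which the 25 percent margin and the 4,144 GB figure are purely illustrative assumptions:

```python
# Sketch: carry a sizing estimate together with an explicit margin of error.
# The 25% margin stands in for "level of existing knowledge" (assumption).
def with_margin(estimate, margin=0.25):
    """Return (low, nominal, high) bounds for a sizing estimate."""
    return estimate * (1 - margin), estimate, estimate * (1 + margin)

low, nominal, high = with_margin(4144)   # e.g. a total storage estimate in GB
print(f"{low:.0f}-{high:.0f} GB around a nominal {nominal} GB")
```

Planning procurement against the high bound, while configuring initially near the nominal value, is one way to build in the "room for minor adjustments" described above.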
Optimally balanced system
Regardless of the methodology used to establish sizing estimates for data warehouse workloads, the outcome should always be a system with balanced resources that can be used efficiently. A well balanced configuration should be capable of maximizing one or more of the most expensive system resources at any time. Quite often, poorly configured systems leave expensive processing power idle due to an inadequate I/O subsystem. Data warehousing workloads present additional challenges that are not seen in traditional OLTP systems. The volume of data moved between storage and CPU for any given OLTP transaction is very small. The aggregate data movement for a large OLTP system is minuscule when compared with data warehousing systems. The balance between system CPU power, storage capacity, and the I/O subsystem is critical when building data warehousing systems. The I/O subsystem connects the storage to the CPU and accommodates the movement of data between the two components.
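The balance argument above can be expressed as a simple feed-rate check: a system is balanced for scan-heavy work when the I/O subsystem can deliver data roughly as fast as the CPUs can consume it. The per-CPU consumption rate and the I/O ceiling below are assumptions for illustration only.

```python
# Sketch: check whether an I/O subsystem can keep the CPUs busy on a
# scan-heavy workload. Both rate figures are illustrative assumptions.
def io_bound_or_cpu_bound(cpu_count, mb_per_sec_per_cpu, io_bandwidth_mb_s):
    """Compare the rate at which CPUs can consume data with the rate at
    which the I/O subsystem can deliver it."""
    cpu_demand = cpu_count * mb_per_sec_per_cpu   # MB/s the CPUs can absorb
    if io_bandwidth_mb_s < cpu_demand:
        return "I/O bound", cpu_demand
    return "CPU bound", cpu_demand

# 8 CPUs that can each absorb an assumed 100 MB/s versus an assumed
# 720 MB/s I/O subsystem ceiling: the CPUs would sit partly idle.
verdict, demand = io_bound_or_cpu_bound(8, 100.0, 720.0)
print(verdict, demand)
```

In this invented example the CPUs could absorb 800 MB/s but the I/O subsystem tops out at 720 MB/s, so expensive processing power would go idle — exactly the imbalance the paragraph above warns against.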
Sizing methodology
This section introduces the sizing methodology using a sample sizing effort. The workload used is a benchmark from the Transaction Processing Performance Council (TPC) for data warehousing. The TPC Benchmark™ H (TPC-H) is a well recognized data warehousing benchmark and its detailed description is publicly available. "TPC-H Benchmark overview" on page 30 contains more information about the TPC and the TPC-H benchmark.

Due to the high volume of data moving through data warehousing systems, special consideration should be given to the I/O subsystem for this type of workload.

Figure 1 Data movement in systems (diagram: storage connects through the storage interconnect and I/O subsystem to the I/O bus and on to CPU & memory; networks attach through the network interconnect)

Overview of the sizing process
To effectively size a system for optimal performance, architecting a solution requires the following steps:
1. Establish business requirements and system expectations.
2. Understand the workload and resource usage characteristics.
3. Understand system performance capabilities and limits.
4. Size the system for optimal use of its resources.
With some analytical work, a reasonable configuration that meets the business requirements can be estimated. As the quality of the collected data increases, sizing estimates become more accurate. The following diagram illustrates the methodology used.
Figure 2 System sizing methodology (diagram: data collection of workload characteristics, product data, and business requirements feeds a sizing knowledge base of assumptions, facts, and requirements, which drives the sizing process and produces the sizing estimate)

This example characterizes the behavior of a data warehouse workload and sets specific performance goals for achieving the business objective. Performance data sheets on the IBM eServer® pSeries™ 655 (p655) and IBM® TotalStorage™ FAStT900 Storage Server were used to establish a system sizing estimate to meet the goals. Although not part of this example, the data was used to run and publish a TPC-H benchmark that validated our work. More details on this benchmark result can be found on the TPC Web site:
http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=103120801
The following sections describe the steps required to collect data and size the system.

1 TPC: Transaction Processing Performance Council: http://www.tpc.org

Choosing a system
Selecting a vendor, a product line, and a system for a new project is a complicated process beyond the scope of this paper. It should be noted that most selection processes are influenced by organizational preferences, historical purchasing patterns, and other non-technical issues. Regardless of the reasoning, the selection team is responsible for ensuring that a selected system is capable of meeting the technical requirements of the workload and providing the return on investment sought by the business.

For this project the following products were selected:
- Clustered configuration of IBM eServer pSeries® 655 systems
- IBM FAStT900 Storage Server
- IBM DB2 UDB
The choice of products was influenced by the project requirements, as well as the desire to highlight these products.

Understanding system capabilities
In this section we discuss system capabilities.

The eServer pSeries 655 system
The pSeries 655 Central Electronics Complex (CEC) (7039-651) is a 4U tall, 24-inch half drawer, rack-mounted device. It houses the system processors, memory, system support processor, and associated components. The p655 server includes the latest IBM POWER4+™ chip technology in a building-block approach to the requirements of high-performance, clustered technical and commercial computing. With the speed advantages provided by the powerful 1.7 GHz POWER4+ processor and its associated system architecture (a fast system bus, extremely high memory bandwidth, I/O drawer connection capability, and robust input/output (I/O) subsystems), the pSeries 655 (p655) provides a versatile solution to the most demanding client requirements. The following diagram shows the p655 system configuration.
Figure 3 p655 CEC (diagram: MCM with four memory slots; GX bus to a GX-to-RIO-2 bridge; RIO-2 to PCI-X bridge serving PCI buses, internal PCI devices and PCI slots; a second RIO-2 connection to the external I/O drawer)

The following sections describe the major components of the p655. For general description and configuration information about the p655, refer to the following IBM Web site:
http://www-1.ibm.com/servers/eserver/pseries/hardware/midrange/p655_desc.html
For more detailed information on the p655 configuration, refer to the following white paper: IBM eServer pSeries 655—Ultra-dense Cluster Server for High Performance Computing, Business Intelligence and Data Warehousing Applications, by Harry M. Mathis, John D. McCalpin, and Jacob Thomas:
http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/p655_hpc.pdf

Processing power
The p655 system is powered by a single multi-chip processor module. A Multi-Chip Module (MCM) has either four or eight 1.7 GHz POWER4+ processor cores. Each processor core contains 32 KB of data cache and 64 KB of instruction cache. Each processor chip has a 1.5 MB L2 cache on board that operates at chip frequency. On the 8-way MCM, the two cores on each processor chip share that chip's L2 cache, while on the 4-way MCM each core has a dedicated L2 cache. A 32 MB L3 cache is located between each processor chip and main memory and operates at one-third of the chip frequency.
Memory configuration
The p655 system has four memory slots that allow from 4 GB to 64 GB of memory to be installed. Memory cards are available in 4 GB, 8 GB and 16 GB sizes. The following table shows possible memory configurations.

Table 1 System memory configuration options
Total memory   Slot 1   Slot 2   Slot 3   Slot 4
4 GB           4 GB     -        -        -
8 GB           4 GB     4 GB     -        -
16 GB          4 GB     4 GB     4 GB     4 GB
16 GB          8 GB     8 GB     -        -
32 GB          8 GB     8 GB     8 GB     8 GB
32 GB          16 GB    16 GB    -        -
64 GB          16 GB    16 GB    16 GB    16 GB

I/O subsystem
The p655 has two RIO-2 (Remote I/O) buses. The first RIO-2 bus supports the service processor, two Ethernet ports, an integrated SCSI adapter, and three hot-plug/blind-swap PCI-X slots on the system board (see Figure 3 on page 7). The second RIO-2 bus can be connected to the 7040-61D I/O drawer for additional I/O adapter slots and performance. The p655 supports a maximum of one I/O drawer with two RIO-2 ports. The I/O drawer contains two PCI I/O planars. Each planar has three PCI Host Buses (PHBs); the first PHB has four 64-bit (133 MHz) PCI slots, and the second and third ones have three 64-bit (133 MHz) PCI slots each. Figure 4 on page 9 shows the detailed configuration of the I/O drawer connected to the RIO-2 bus.
Figure 4 I/O drawer configuration (diagram: RIO-2 hub with active and passive/failover links at a sustained 2100 MB/s duplex; two RIO-2 to PCI-X bridges at a sustained 1050 MB/s duplex each; three 64-bit PHBs per bridge, about 600 MB/s sustained each, behind PCI-PCI bridges; 7040-61D I/O drawer)

FAStT900 Storage Server
The FAStT900 Storage Server is a member of the IBM FAStT family of disk storage products. The FAStT900 is an enterprise-class storage server designed to provide performance and flexibility for data-intensive computing environments. It offers up to 32 TB of Fibre Channel disk capacity using 18.2, 36.4, 73.4, and 146.8 GB drives with EXP700 disk drive enclosures. Dual controllers with mirrored cache in the FAStT900 provide for the RAID functions necessary to protect data from disk failures. A FAStT900 can be connected through SAN switches or attached directly to the host.

The FAStT900 Storage Server has four host-side FC interfaces that provide an extremely high I/O bandwidth and four drive-side interfaces that accommodate a very large storage capacity. The FAStT900 Storage Server sustains an enormous I/O rate with a mixture of read and write operations. When performing sequential I/O operations, a FAStT900 can saturate the four 2 Gb FC host interfaces and deliver more than 720 MB per second of I/O to the system. For more information about the FAStT900 features and performance refer to the following IBM Web site:
http://www.storage.ibm.com/disk/fastt/fast900/index.html

4 FAStT900 Storage Server—Scalable, high-performance storage for on demand computing environments: http://www.storage.ibm.com/disk/fastt/fast900/index.html

Understanding the workload
It should be mentioned that the authors have experience with the TPC-H workload based on previous projects. To characterize a workload for sizing a system, it is useful to have experience in data warehousing, with a good understanding of the specific database products and the targeted business environment.

There are two major workload-related areas of concern in sizing a system for data warehousing projects:
- The storage of the data warehouse
- The processing of the data
Both storage and processing requirements have an impact on all system components.

Storage space requirements of the workload
Estimating the disk space to store data is the simpler aspect of system sizing, so many sizing efforts put most of the emphasis on this task. Most DW projects can easily calculate the raw data size based on information provided by the data sources. For example, when the data is extracted from an existing transactional system, its size is either known or easy to estimate. For this example, the assumption was to have 1,000 GB of raw data.

The various components requiring storage space are:
- Table data storage
- Index data storage
- Database log space
- Temporary space required by the database
- Staging space required to store raw data
Each will be considered separately.
Once the raw data size is established, it is necessary to estimate the database space requirement for storing the data in tables using the internal format of the database. For this, the schema, page size, and row density per page are required. Most database vendors provide ample documentation to help calculate the database table space requirement for any given schema, and most provide accurate estimates once the schema and data size are known. A database administrator and a data architect must be involved in this process. For this example, the information provided in the DB2 Administration Guide: Planning, SC09-4822, document was used. This manual has a specific section that can help estimate table, index, log, and temporary space requirements for the schema.

Table data storage
For each table in the database, the following information was used to calculate the space requirements for the base table that holds the warehouse data:
- Raw data size
- Data row size
- Number of rows
- Page size
- Rows per page
  - Including page overhead
  - Free slots per page
- Free space for future page additions
Considering the above items, the space requirement for storing all the base tables was estimated to be 1,350 GB. The database product documentation should be consulted for information on estimating table space requirements.

Index data storage
For each index, the following information was used to calculate the space requirements for the indices in the schema:
- Index size per row of the table
- Page size
- Index page overhead
- Rows per page
  - Including the page overhead
  - Free slots per page
- Free space for future page additions

5 DB2 Administration Guide: Planning, SC09-4822
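The inputs listed above feed a page-based calculation. The sketch below shows the general shape of such an estimate; the page overhead, free-space fraction, and row counts are illustrative assumptions, not DB2's exact formulas or the paper's schema.

```python
import math

# Simplified page-based table space estimate, in the spirit of the
# DB2 Administration Guide: Planning. Overheads and free-space figures
# are illustrative assumptions, not DB2's exact formulas.
def table_space_gb(row_count, row_bytes, page_bytes=4096,
                   page_overhead=100, free_pct=0.10):
    usable = (page_bytes - page_overhead) * (1.0 - free_pct)
    rows_per_page = max(1, int(usable // row_bytes))
    pages = math.ceil(row_count / rows_per_page)
    return pages * page_bytes / 1024**3

# e.g. 6 billion 150-byte rows (hypothetical figures, not the TPC-H schema)
print(f"{table_space_gb(6_000_000_000, 150):.0f} GB")
```

The same pattern applies to indexes, with the index entry size and index page overhead substituted for the row size and data page overhead.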
Considering the above information, the space requirement for storing all the indices was estimated to be 258 GB. Once again, the database product documentation should be consulted for information on estimating index space requirements.

Database log space
Most data warehouses have infrequent update activity in comparison with OLTP workloads. Some data warehouses are loaded periodically from operational data and are never modified between loads; these configurations have insignificant logging requirements. Other strategies involve regular maintenance of data that requires inserts and deletes from the database. The data warehouse maintenance strategy will generally dictate the database log requirements. Consult DB2 Administration Guide: Planning, SC09-4822, for more details on sizing log requirements. The following was taken into consideration:
- Data warehouse update and maintenance strategy
- Frequency of updates
- Volume of changing data per cycle
- Data recovery requirements
- Update cycles between log backups
- Transactional characteristics of the workload

In this configuration, regular updates to the data had to be accommodated, but the volume of data being changed was only 0.1 percent of the total warehouse, which adds up to 1 GB (see "TPC-H overview" on page 30 for details) per update cycle. The log space requirements were estimated to be insignificant in size relative to the data size of the warehouse. 36 GB of space was allocated to the database logs to satisfy logging needs for at least twenty update cycles between log backups.

Temporary database space
When databases are executing queries that process large join, sort, or aggregation operations, they require memory to hold intermediate or partially processed data. As the database reaches the limits of its allotted memory, it uses disk space to temporarily store the partially processed data and free up memory for further processing. Most data warehouse systems process more data than can be held in memory, and therefore they need temporary database storage space. For example, a query that attempts to sort 300 GB of data on a system with 16 GB of memory will require significant temporary storage.

6 DB2 Administration Guide: Planning, SC09-4822
Estimating the temporary storage space requirements for a DW workload is difficult because several external factors, such as system memory size and concurrency levels, impact the need for space. Underestimating the temporary space requirements of a workload can prevent the proper execution of large queries or limit concurrency levels of query execution. It usually takes an experienced data warehousing architect, with help from database vendors, to estimate temporary space needs.

The following information was considered when estimating our temporary storage needs:
- Percentage of data used in large sort and join queries
- Number of concurrent queries that can be running at any one time
  - Number of concurrent query segments per query
- Previous experience with the workload memory usage
  - Expert guesses and rules of thumb available from product vendors
- Comparative estimates provided by the database vendor
- Future growth of data and increase of concurrency levels

Based on the above information, 140 GB of temporary space was estimated for the worst-case query, and since seven query streams could run concurrently, about 1,000 GB of temporary space was needed.

Staging space
Most data warehouse projects require some space for staging raw data for loading or maintaining the warehouse. Depending on the operational procedures, the space requirement can vary drastically. The following information was considered when estimating the staging space requirement:
- Data warehouse loading procedures
- Location and storage needs for the raw data
- Data warehouse maintenance procedures
- Location and storage needs for the maintenance data
- Future growth of data

Based on the operational needs to store update data and some load data, it was estimated that 1,500 GB of space was sufficient for the project.

Minimum storage space requirement
The storage space estimate is the minimum of space requirements. There are several factors that impact the overall storage configuration, for instance:
- RAID requirements to protect the data
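The arithmetic behind the log and temporary space figures above can be sketched in a few lines. The 20-cycle retention, 1 GB per cycle, 140 GB worst-case query, and 7 concurrent streams come from the text; the safety factor and rounding granularity are assumptions added for illustration.

```python
import math

# Log space: per-cycle churn x retained cycles x a safety factor (assumed).
def log_space_gb(gb_per_cycle, cycles_between_backups, safety_factor=1.8):
    return math.ceil(gb_per_cycle * cycles_between_backups * safety_factor)

# Temporary space: worst-case query x concurrent streams, rounded up
# to a planning granularity (assumed 100 GB).
def temp_space_gb(worst_case_gb, concurrent_streams, round_to_gb=100):
    raw = worst_case_gb * concurrent_streams
    return math.ceil(raw / round_to_gb) * round_to_gb

print(log_space_gb(1.0, 20))   # matches the 36 GB log allocation above
print(temp_space_gb(140, 7))   # 980 GB rounded up -> about 1,000 GB
```

Keeping these inputs explicit makes it easy to re-run the estimate when the update volume or the planned concurrency level changes.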
- Number of disks required to meet the performance requirements
- Number of disks needed to balance the I/O performance

Adjustments to the number of disks for performance reasons may result in having more space than the minimum required. For example, suppose that the storage requirement can be satisfied with 11 disk drives, but the system has two disk controllers; it might be better to use six disks per controller to evenly distribute the I/O on both controllers. As always, one must balance the performance needs and cost based on the project priorities. The following table shows the overall storage requirements for the configuration.

Table 2 Minimum storage space requirements
TPC-H warehouse space requirement
Data                          1,350 GB
Index                           258 GB
Database log                     36 GB
Database temporary storage    1,000 GB
Staging space                 1,500 GB
Total storage space           4,144 GB

Data processing characteristics of the workload
The TPC-H workload consists of a number of queries executed on the data. Although an infinite number of queries can be formulated in an ad hoc environment, most of these queries can be put into a few general categories. For example, queries with a significant number of arithmetic operations per row of data are CPU bound, while other queries that require simple filtering operations on large volumes of data become I/O bound. It is important to understand the resource needs of the different categories of queries and to size the system to accommodate as many categories as possible. The three major system resources that are stressed by a data warehousing workload are:
- CPU resources
- Memory resources
- I/O resources
  - Network
  - Disk
    - Sequential I/O scans
    - Random I/O scans
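The CPU-bound versus I/O-bound distinction above can be framed as a comparison of two times: how long it takes to read a query's data and how long it takes to process it. The rates below are illustrative assumptions.

```python
# Sketch: decide whether a scan-style query is CPU or I/O bound by
# comparing read time with processing time. All rates are assumptions.
def bound_by(data_gb, io_gb_per_s, cpu_gb_per_s):
    """Return the dominant resource for a simple scan-style query."""
    io_time = data_gb / io_gb_per_s
    cpu_time = data_gb / cpu_gb_per_s
    return "CPU bound" if cpu_time > io_time else "I/O bound"

print(bound_by(800, io_gb_per_s=1.0, cpu_gb_per_s=0.5))  # heavy math per row
print(bound_by(800, io_gb_per_s=1.0, cpu_gb_per_s=4.0))  # simple filtering
```

Whichever time is longer dominates the elapsed time of the query, which is why a balanced configuration tries to keep the two roughly comparable for the most important query class.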
A well balanced data warehouse system maximizes one or more of the resources in the system. Unlike OLTP workloads, a single unit of DW work (a query) can be parallelized to maximize the utilization of the system and return results in a minimum amount of time. When additional concurrent queries are added, the databases start to distribute system resources among the active queries. As a result, the time to complete a query in a DW workload will vary from one execution to another depending on the mix of queries running at the same time.

The process of categorizing the various queries starts with a careful examination of the database schema, data characteristics, and queries. The goal is to determine the following:
1. Estimate the size of the data accessed by each group of queries.
2. Categorize the queries based on the dominant resource needs.
3. Select some larger queries to represent these categories.
4. Use these queries as a guide for system sizing.

Estimate the size of each query
Based on the predicates used in a query and knowledge of the data, the minimum amount of data each query would need to produce the query result can be anticipated. Different databases and schema definitions may behave differently for the same queries. For example, a missing index may force the database to execute a full table scan and result in significantly more access to data. In estimating the data set size for queries, it was assumed that the database can be optimized using indices and other schemes to minimize data access to what is necessary to calculate and produce the query results. For example, the following query should access the entire LINEITEM table (see "An Introduction to System Sizing for Data Warehousing Workloads" on page 1 for details):

select sum(l_extendedprice*l_discount) as revenue
from lineitem

For this workload, the various queries were organized in three groups:
- Small: Less than 15 percent of data is needed to produce results.
- Medium: Between 15 and 50 percent of data.
- Large: More than 50 percent of the data.

Only an expert in data warehousing workloads, with help from the database vendor and a data architect, can analyze a schema and anticipate the resource needs of potential queries. Without the ability to experiment and run test queries, even experts can have a hard time gauging the CPU needs of a query.

Since it was intended to use the largest queries for this characterization work, there was less concern about the ability of the database to optimize data access. The following chart shows the approximate minimum data size required to complete the six largest queries in the workload.

Figure 5 Query size estimates (bar chart: approximate data size in GB, on a scale of 0 to 3,000, for queries 21, 9, 17, 19, 18, and 1)

The details of the size categorization for all the queries are gathered in Table 3 on page 17.

Categorizing queries
TPC-H queries are categorized with respect to the most dominant resource they use. To do this, the intensity of the processing being applied to the data being read for the query is estimated. For example, the following segment of SQL code shows a query with a significant number of calculations for every row of data to be processed:

select l_returnflag, l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc
from lineitem
group by l_returnflag, l_linestatus

In contrast, the following query has modest processing requirements for each row being scanned. This query will run as fast as the data can be read by the database and therefore it is I/O bound.

select sum(l_extendedprice*l_discount) as revenue
from lineitem

A few of the diverse queries were compared and contrasted with each other and, based on the experience of the team with DB2 query processing, they were categorized in the following table.

Table 3 Query categorization
Query  Data set size  Memory  CPU     Sequential I/O  Random I/O  Network
21     Large          High    High    Low             Low         Low
9      Large          High    High    Low             Medium      High
17     Large          Low     Low     High            None        Low
19     Large          Low     Low     High            None        Low
1      Large          Low     Medium  Medium          None        Low
18     Large          High    High    Low             Medium      Low
7      Medium         High    High    Medium          Medium      High
5      Medium         High    High    Low             None        High
13     Medium         Medium  High    Low             Low         Low
11     Small          Medium  Low     High            None        Low
6      Small          Low     Low     High            None        Low
2      Small          Low     Medium  Medium          None        Low
22     Small          Low     High    Low             Low         Medium
16     Small          High    High    Low             Low         Medium
14     Small          Low     High    Low             Low         High
15     Small          Medium  High    Low             Low         High

Selecting a representative set of queries
For this characterization work, the assumption was that if the system is configured to meet the resource needs of the six largest queries, it can also provide for the smaller queries with similar characteristics. As can be seen in the above table, most of the queries have high CPU or sequential I/O requirements, and all the major system resource categories can be maximized by one or more of the six largest queries. Once the characteristics of the various queries are established, the top six largest queries can easily represent the entire query set with respect to resource needs.

The following chart was built using data from the analysis of the six largest queries. Each bubble on the chart represents a query. The size of the bubble is proportional to the volume of data required to obtain the query result, and the location of the bubble shows the I/O versus CPU requirements of the query. When a query is CPU intensive, its I/O requirements are lower than those of a query that requires little CPU power to process the same amount of data. To balance the system, sufficient I/O performance is needed to keep CPU resources maximally utilized.
Figure 6 Relative resource requirements (bubble chart: relative CPU resource requirements, 0.0 to 1.0, versus relative I/O throughput requirements, 0.0 to 1.0, for queries 18, 9, 1, 21, 19, and 17)

Figure 6 can be used to make system sizing decisions based on the relative information. Since it is not possible to configure the system optimally for every possible query, this type of categorization can be used to optimize for the largest class of queries within a given budget. For example, if a system is configured with just enough I/O bandwidth to maximize the CPU utilization during query 1, then all queries to the right of query 1 will be I/O bound. It can also be concluded that configuring the system with more I/O than is necessary for query 17 will not provide any benefit.

The reference queries
In this section we discuss the reference queries.

The I/O sizing query
Based on the information in "Estimate the size of each query" on page 15 and Figure 5 on page 16, query 17 reads 80 percent of the data to produce a result. Assuming this query is I/O bound, the time to complete it is equal to the time it takes to read the data. A system with 1 GB per second of I/O bandwidth would require 800 seconds to read the 800 GB of data needed to complete this query. Based on the business requirement, a reasonable response time for this class of queries can be set and the I/O system configured accordingly.
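The I/O sizing rule above is simple enough to state directly in code: for an I/O-bound query, elapsed time is data volume divided by bandwidth, and the same relation can be inverted to derive the bandwidth needed for a response-time target. The 800 GB and 1 GB/s figures come from the text; the 400-second target is an illustrative assumption.

```python
# Sketch: I/O sizing from the scan volume of the I/O-bound reference query.
# Query 17 reads about 80% of a 1,000 GB warehouse, i.e. roughly 800 GB.
def scan_time_s(data_gb, io_bandwidth_gb_s):
    """Elapsed time for an I/O-bound query that must read data_gb of data."""
    return data_gb / io_bandwidth_gb_s

def required_bandwidth_gb_s(data_gb, target_response_s):
    """Bandwidth needed to finish the scan within a response-time target."""
    return data_gb / target_response_s

print(scan_time_s(800, 1.0))              # 1 GB/s -> 800 seconds, as above
print(required_bandwidth_gb_s(800, 400))  # an assumed 400 s target -> 2 GB/s
```

The second function is the one actually used for sizing: pick the business response-time target for the I/O-bound class, and it yields the aggregate sequential bandwidth the I/O subsystem must sustain.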
The CPU sizing query

Based on the assessments in “Categorizing queries” on page 16 and the information in Figure 6 on page 19, it was determined that query 18 is the most CPU intensive query. The point representing query 18 on the top left edge of the chart in Figure 6 on page 19 indicates that this query has massive requirements for data processing power relative to all other queries in the workload; query 18 will be limited by the CPU during its executions. The point representing query 17 on the far right edge of the chart indicates that this query has an insatiable requirement to read data when compared to the other queries in the workload. If the I/O requirements of query 17 can be met, there will be no shortage of I/O bandwidth for query 18.

To complete a given query in a fixed period of time, the system needs enough I/O bandwidth to read the required data in that time and enough CPU power to keep up with the processing of the data, so the needs of the two must be balanced.

Operational characteristics of the workload

The TPC-H benchmark has two execution models: the Power Test and the Throughput Test. These two modes of operation have very different system resource requirements. When sizing a system, all modes of operation must be considered and conflicting requirements must be prioritized. This multi-mode operational requirement is also common in many DW installations, where, depending on the time of day or the class of users, system requirements can be very different. If users are scheduled to have dedicated system time with the expectation of the fastest possible response time to the queries they submit, then the system has to be optimized for single stream processing. On the other hand, if the system is mostly used by multiple users who submit queries and are not sensitive to response time, then single stream performance is less critical.

The Power Test (single-user batch jobs)

For this test the goal was to optimize single-user query processing because of the benchmark requirements. The decision to optimize for single stream query processing makes sense depending on the business needs for the system.

The Throughput Test (multi-user operation)

Considering that the queries that access large amounts of data are CPU intensive and have low I/O rate requirements (large bubbles near the top left of the chart in Figure 6 on page 19), it can be assumed that running multiple queries at the same time would result in a CPU bound system. If a system is mostly I/O bound during the execution of single stream queries, then more than one stream of queries is required to fully utilize the system resources. To get the best return on investment, the system should be configured to fully utilize CPU resources as much as possible. If a single stream of queries takes one CPU hour to complete, then seven streams of similar queries could potentially take about seven hours of CPU time. This rule of thumb provides reasonable estimates, but to be safe a margin of error should be anticipated. However, sometimes simultaneously running query streams can benefit from sharing data caches, while at other times the reduced availability of memory results in resource conflicts.

Business requirements

Obviously, the primary purpose of the data warehouse infrastructure is to solve business problems, so it is critical to collect the business requirements applicable to the sizing effort and translate them to system requirements. For example, a business unit may require that a warehouse be reloaded from transactional data once per week and that the task be completed in a six-hour window of time. This requirement must be translated to the various read, write, and process rates for the data to ensure the system has the appropriate capacity.

The following table shows the list of business requirements and expectations addressed in this sizing example. This list is by no means exhaustive, but it does capture the most critical elements for this purpose.

Table 4   Business requirements

– Raw data size: 1000 GB of raw text data. The raw data size is only the base number for calculating storage space requirements.
– Annual growth rate of data: less than 2%. Normal growth must be accommodated without major architectural changes in the system.
– Tolerance for performance decline due to growth: less than 2%. Slower response time for any operation can be tolerated if the percentage of degradation is less than or equal to that of the growth in data.
– Service life expectancy: 3 years. The system is expected to operate for at least three years without major changes.
Table 4   Business requirements (continued)

– Raw data load rate: 145 MB/sec. The 145 MB per second load rate is derived from the need to load the 1000 GB of data in less than two hours. At this rate, a DBA will have enough time to rebuild the warehouse from scratch and prepare it for query processing in less than four hours.
– Scan query response time based on total data: less than 200 seconds. This requirement is critical because a significant number of ad hoc queries in the workload perform scan and filter operations on all or part of the data. The response time of several large and small queries with simple calculations will be impacted by the scan rate. In addition, extract operations from the warehouse are bound by the scan rate of the system. Query 17 is our guide for this criterion.
– Reporting and computational query response time based on total data: less than 900 seconds. Our workload has several queries that frequently run to generate reports. These queries require several complicated computations and sort operations on all or parts of the data. The intent of this requirement is to ensure that the worst case report will complete in what is considered a reasonable time based on the business requirement. We will use query 18 as our guide for sizing the system to meet this requirement.
– Query concurrency rate: 7 query streams. The workload requires that at least 7 query streams can operate concurrently at any one time. We also intend to run multiple simultaneous queries and want to ensure the work can be completed in a reasonable time. At this rate, query response times must not degrade by more than a factor of 7 to 8 when compared to a single stream execution.
– Performance versus cost optimization priority: performance. Although the overall hardware cost had a budget ceiling, reaching the specific performance targets had the highest priority. In this exercise, price/performance was secondary to overall performance. Our goal was to achieve the performance target with an optimal configuration and price within the budget guidelines.
– Data protection: RAID level 5. Our workload requires RAID protection to prevent a single disk failure from disabling the system. RAID level 5 is most appropriate for us because it allows protection with a reasonable overhead. A 5+P RAID level 5 configuration requires 5 disks for data and one disk for parity; this adds a 17% disk overhead.
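The derived figures in Table 4 can be checked with a little arithmetic. This is an illustrative sketch using the numbers from the text (1 GB is taken as 1000 MB for the rate calculation):

```python
# Load 1000 GB in under two hours -> minimum load rate in MB/sec
load_rate_floor = 1000 * 1000 / (2 * 3600)
print(round(load_rate_floor))            # 139; the requirement rounds up to 145 MB/sec

# At the specified 145 MB/sec, the load finishes comfortably inside the window
hours_at_145 = 1000 * 1000 / 145 / 3600
print(round(hours_at_145, 2))            # ~1.92 hours

# 5+P RAID level 5: one parity disk for every five data disks
raid5_overhead = 1 / 6
print(f"{raid5_overhead:.0%}")           # 17% disk overhead
```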
The above table maps the business requirements to a set of checklist items for the system. When sizing the system, this table should be used as a boundary guide to ensure that the business needs are met.

Sizing the system for effective utilization

In data warehouse workloads, it is difficult to always maximize the utilization of CPU, I/O, or both resources. In many situations, factors such as limited memory or network resources can result in idle CPU or I/O subsystems. In addition, all databases occasionally have execution paths that fail to maximize system utilization. The goal is to configure a system that meets the workload requirements and runs as close to the system utilization limit as possible. Saturating processing power and maximizing the I/O subsystem during most of the operational hours of the system will provide the best return on investment.

Sizing the storage

After collecting all the relevant data from workload characterization, business requirements, and system data sheets, the information can be compiled into a single set of guidelines for sizing the storage configuration:

– 5,000 MB/sec scan rate, based on the query 17 I/O profile and the 200 second limit (see Table 4 on page 21) on scan query response time
– 145 MB/sec load rate
– RAID level 5 configuration using 5+P settings
– Approximately 700 MB/sec I/O rate per FAStT900 (see “FAStT900 Storage Server” on page 9)
– 4 Fibre Channel interfaces per FAStT900
– 14 disk capacity per EXP700 expansion unit
– Disk size of our choice: 36 GB
– Minimum storage space needed: 4,144 GB
– Minimum number of disks: 122 (4144 / 34 = 122, rounded up)
– Number of FAStT900 needed to meet the 5,000 MB/sec I/O rate: 8 (5000 / 700 = 8, rounded up)
– Number of FAStT900 needed to meet the load rate of 145 MB/sec: 1
– Number of EXP700 needed to fully utilize all FAStT900 disk side interfaces: 16 (2 per FAStT900)
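The checklist arithmetic above can be recomputed as follows. This is an illustrative sketch; the assumption (consistent with the 4144 / 34 division in the text) is that a 36 GB disk provides roughly 34 GB of usable space.

```python
import math

# Capacity: minimum disk count from space needs
space_needed_gb = 4144
usable_per_disk_gb = 34           # assumed usable space on a 36 GB drive
min_disks = math.ceil(space_needed_gb / usable_per_disk_gb)
print(min_disks)                  # 122

# Performance: FAStT900 count from the scan-rate requirement
scan_rate_mb = 5000
per_fastt900_mb = 700
fastt900_count = math.ceil(scan_rate_mb / per_fastt900_mb)
print(fastt900_count)             # 8

# Two EXP700 enclosures per FAStT900 to use all disk-side interfaces
print(2 * fastt900_count)         # 16
```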
Since 122 disks do not distribute evenly amongst 16 EXP700 enclosures, the number of disks per EXP700 should be rounded up from 7.6 to 8. Additionally, to configure for a 5+P RAID level 5 layout, the number of data disks in each EXP700 should be divisible by 5, so the number of data disks per EXP700 is rounded up again to 10. For every 5 data disks a parity disk must be added, so the total number of disks per EXP700 is 12. Twelve disks per EXP700, 2 EXP700 per FAStT900, and 8 FAStT900 bring the total number of disks to 192.

Obviously 192 disks is significantly more than the initial requirement of 122 disks, but this configuration provides a balanced, evenly distributed load with RAID level 5 protection that meets the performance requirements. Although a single FAStT900 could easily accommodate the space needs of the workload, it cannot possibly meet the performance requirements. Space should never be the only determining factor for a storage configuration. Random read and write requirements for this workload were relatively low when compared to the rest of the I/O needs; considering the number of disks and FAStT900 servers, the random I/O requirements of the workload are easily met.

The following diagram shows the logical configuration of a FAStT900 with the enclosures and disks attached.

Figure 7   FAStT900 disk configuration (two EXP700 enclosures of twelve disks each, arranged as 5+P RAID 5 arrays, attached to FAStT900 controllers A and B through Fibre Channel host bus adapters)
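The rounding described above can be expressed as a short sketch (figures from the text; variable names are illustrative):

```python
import math

min_disks, enclosures = 122, 16
per_enclosure = min_disks / enclosures           # 7.625 disks per EXP700
data_disks = math.ceil(per_enclosure)            # round up to 8
data_disks = math.ceil(data_disks / 5) * 5       # 5+P needs a multiple of 5 -> 10
parity_disks = data_disks // 5                   # one parity disk per 5 data disks -> 2
total_per_enclosure = data_disks + parity_disks  # 12 disks per EXP700
print(total_per_enclosure * enclosures)          # 192 disks in total
```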
Based on the storage sizing, the system has to be able to accommodate 8 storage servers, each with 4 Fibre Channel interfaces. A total of 32 Fibre Channel host bus adapters is required to match the performance of the storage subsystem. The following table sums up the storage requirements.

Table 5   Storage configuration

  FAStT900 Storage Servers                 8
  EXP700 enclosures                        16
  36 GB disk drives                        192
  2 Gb Fibre Channel host bus adapters     32

Considering the flexibility of the FAStT900, the storage can be connected to 8 or 16 nodes to satisfy the I/O bandwidth requirements. In fact, any size system can be connected that can evenly distribute access to the 32 HBAs amongst all processors, providing 175 MB/sec of bandwidth for each HBA. This system can be a single node with any number of CPUs or a multi-node cluster of systems with an aggregate I/O bandwidth of 5000 MB/sec. Each p655 can easily provide 1400 MB/sec of I/O (see “I/O subsystem” on page 8). From this it can be determined that at least four p655 nodes are required to satisfy the 5000 MB/sec I/O needs of the workload.

Sizing for CPU and memory

Sizing for CPU and memory requires extensive experience and significant knowledge of the workload. In addition to past experience, the sizing team needs to run experiments and make educated guesses. For this project, the experience of past testing with specific queries and the knowledge of the relative performance of the older systems compared to the targeted systems were beneficial.

To estimate processor requirements for CPU intensive queries, a small one-processor system can be set up with a fraction of the data to measure query performance. The test system can have any processor, so long as its performance can be related to the targeted system. For example, if it is known that the test system processor is ten times slower than the targeted system processor, and the test system completes a 1 GB query in 50 seconds, then the targeted processor can complete the same work in 5 seconds. This type of estimate can only be applied to simple compute intensive queries that are known to scale and that have stable query plans regardless of data size. Complex multi-table join queries are not good candidates for simple CPU testing because most databases apply query optimizations that may behave differently based on data size. Query 1 was known to be a much simpler query to run tests with, and it scales for all data sizes.

Depending on the CPU and memory needs of the workload, some sizing experts use the ratio of memory per CPU as a guide. A common rule of thumb for data warehouse workloads used to be 1 GB of memory per CPU, but as memory prices have gone down and processor speeds have gone up, this ratio has increased to 4 GB. Workloads that concurrently run many large queries with multi-table join and sort operations require more memory than workloads that run mostly scan/filter queries and aggregate data. Too little memory will result in increased use of database temporary storage, which will require more I/O operations and possibly more idle processing power; too much memory will fail to provide performance improvements once processing power is saturated.

The memory usage of the test system also needs to be limited to a relative size. To simplify the estimates, it is preferable to establish a fixed relation between memory size and data size during testing: it would be unreasonable to allow the 1 GB test query to consume 1 GB of memory unless it was intended to configure the target system with 1000 GB of memory. For this project, it was established that at least 100 MB of database memory was needed for each 1 GB of data, and throughout the testing a similar ratio was maintained.

Based on the business requirements, it was known that the worst case query running on 1000 GB of data needed to complete in 900 seconds. Assuming query 18 was the worst case, and knowing that it used about 80 percent of the data, it was estimated using experience with older systems and a subset of the data that it would take 720 seconds to complete a query similar to query 18.
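The test-system extrapolation and the memory rule described above amount to simple proportional scaling. This sketch uses the hypothetical figures from the text (50 seconds per GB on a test processor assumed to be ten times slower than the target):

```python
# Scale a small-system measurement to the target processor
test_seconds_per_gb = 50
speedup = 10                                     # target CPU assumed 10x faster
target_seconds_per_gb = test_seconds_per_gb / speedup   # 5 seconds per GB

# Processors needed to handle the full data set in a target window
data_gb, window_seconds = 1000, 500
processors_needed = data_gb * target_seconds_per_gb / window_seconds
print(processors_needed)                         # 10.0 of the new processors

# Memory rule used in this project: 100 MB of database memory per GB of data
print(round(1000 * 100 / 1024))                  # ~98 GB minimum for 1000 GB of data
```

The final configuration's 128 GB of memory leaves headroom above this minimum.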
The test system can be used to measure the time it takes to process a fixed amount of data by the worst case query. This ratio of time to data can be drastically different from one workload to another, which is why it must be measured rather than assumed. For this workload, query 18 was selected as a guide for sizing the system processing needs. Query 1 was used for small scale testing and measurement, and the relative resource usage graph (Figure 6 on page 19) was used to estimate the processing needs of query 18. It was estimated that query 1 used about 2.5 times less processing resources than query 18 to process the same amount of data in a fixed period of time, and from this it was determined that enough CPU processing power was needed to complete query 1 in approximately 300 seconds. For query 1 to process 800 GB of data in 300 seconds, a processing rate of 2,730 MB per second was needed. It was calculated that the 1.7 GHz processor available for the p655 system could process query 1 at the approximate rate of 180 MB per second; at that rate, at least 16 processors were needed to meet the performance target. It was further estimated that 16 1.7 GHz p655 processors could complete seven concurrently executing queries similar to query 18 in less than 90 minutes. These timing estimates were well within the business requirements.

Based on these estimates, it was assumed that a configuration with 16 processors and approximately 128 GB of memory would satisfy the requirements. Adding memory to a production system is much simpler than adding I/O or processing capacity, so this area of the configuration can be further improved later, as long as some flexibility is built into the plan.

Sizing for the network

Based on the workload characteristics, the data management schemes used by DB2, and previous experience, it was estimated that a single Gigabit Ethernet interface per node would be sufficient for the configuration. Considering that only a few queries in the workload required significant data movement between the nodes, the risk of under configuring the network was low. Relative to the total system cost, adding an additional network interface would have been insignificant, and the network switch and I/O subsystem were configured to accommodate additional network interfaces if needed.

The overall system

Based on the I/O, CPU, and memory requirements, a 4-node cluster of p655 systems was needed to meet the workload requirements. This configuration provided balanced performance for the workload and maximized the overall system utilization. The ability to provide more than 1400 MB per second of I/O bandwidth was a critical feature of the IBM eServer p655 system that made it an attractive option: a huge I/O bandwidth to feed the powerful processors and maximize their utilization helps achieve the return on the initial investment needed to be successful.
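The processor arithmetic in the CPU sizing discussion above can be recomputed directly. This is an illustrative sketch; 1 GB is taken as 1024 MB (which reproduces the 2,730 MB/sec figure), and four processors per p655 node is assumed, matching the final 16-processor, 4-node configuration.

```python
import math

# Query 1 must process 800 GB in about 300 seconds
data_mb = 800 * 1024
required_rate = data_mb / 300                 # MB/sec of processing throughput
print(round(required_rate))                   # 2731, i.e. the ~2,730 MB/sec in the text

# One 1.7 GHz p655 processor handles query 1 at about 180 MB/sec
processors = math.ceil(required_rate / 180)
print(processors)                             # 16 processors

nodes = math.ceil(processors / 4)             # assumed: four processors per node
print(nodes)                                  # a 4-node p655 cluster
```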
Note that when sizing the network requirements for a clustered data warehouse installation, the database vendor should be consulted and their recommendations followed with care: different databases have different network bandwidth requirements even when running the same workload. Having said that, the team was prepared to add a second interface if necessary.
The project was completed by building the configuration based on the sizing estimate and executing the TPC-H benchmark. The sizing effort was a success: the minimum requirements for the workload were met, and the results were within 5–10 percent of expectations. The CPU bound queries performed within ±10 percent of expectations, with about an equal number on the plus and minus side. Some of the I/O bound queries performed better than expected due to conservative performance estimates for various components.

The four node cluster was connected with a Gigabit Ethernet switch for all inter-node communications. Each node was directly connected to one-fourth of the total storage configuration. The overall configuration provided about 5,500 MB per second of read bandwidth from disk to system memory. This configuration was well balanced and flexible, and could easily be extended for larger DB2 installations. The following is a general diagram of the system configuration.

Figure 8   Overall system configuration (four IBM eServer p655 nodes connected through a Gigabit Ethernet switch; each node attached to two FAStT900 storage servers, each with two EXP700 enclosures)

The benchmark was submitted to the TPC and published in December, 2003. For further details, see the following TPC Web site:

http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=1031208017
Conclusion

System sizing for a data warehouse is a complicated task that requires some expertise to accurately estimate a configuration that can meet the needs of a business. The quality of the sizing estimate depends on the accuracy of the data put into the process. In the absence of workload characterization data, the expectation of an accurate system sizing estimate should be set appropriately. With some analysis and workload characterization, it is possible to drastically improve a sizing estimate.

To safeguard a project, build flexibility into plans by configuring systems that are well balanced in resources and are extensible. For clustered configurations, ensure that the basic building block system is well balanced and meets both the I/O and CPU requirements for its subset of the workload. The reliability, availability, and serviceability of a cluster are only as good as those of its building blocks.

7. TPC-H Result Highlights—IBM eServer p655 with DB2 UDB:
   http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=1031208017
TPC-H Benchmark overview

This section provides an overview of the TPC-H benchmark.

TPC Council

The Transaction Processing Performance Council™ (TPC) was founded in August 1988 by eight leading hardware and software companies as a non-profit organization with the objective of producing common benchmarks to measure database system performance. More than 20 companies are currently members of the council. There are four active benchmarks (TPC-C, TPC-H, TPC-R and TPC-W) for which results can be published. Prior to publication, results must be reviewed and approved by designated auditors, and a full disclosure report documenting compliance is submitted to the TPC. Published benchmarks as well as the benchmark specifications are accessible on the TPC Web site:

http://www.tpc.org

TPC-H overview

The TPC-H benchmark models a decision support system by executing ad-hoc queries and concurrent updates against a standard database under controlled conditions. TPC-H represents the information analysis of an industry that must manage, sell, or distribute a product worldwide. The purpose of the benchmark is to “provide relevant, objective performance data to industry users” according to the specifications, and all implementations of the benchmark, in addition to adhering to the specifications, must be relevant to real-world (that is, customer) implementations. The 22 queries answer questions in areas such as pricing and promotions, supply and demand management, profit and revenue management, customer satisfaction, market share, and shipping management. The refresh functions are not meant to represent concurrent online transaction processing (OLTP); they are meant to reflect the need to periodically update the database.

The TPC-H database size is determined by the scale factor (SF). A scale factor of 1 represents a database with 10,000 suppliers and corresponds approximately to 1 GB of raw data. Only a subset of scale factors is permitted for publication: 1, 10, 30, 100, 300, 1000, 3000 and 10000. IBM® was the very first company to publish a TPC-H result at the 10,000 GB scale, on December 5, 2000.

The database is populated with a TPC-supplied data generation program, dbgen, which creates the synthetic data set. The set of rows to be inserted or deleted by each execution of the update functions is also generated by using dbgen. The database consists of eight tables.

Table 6   Eight tables

  Table name     Cardinality
  REGION         5
  NATION         25
  SUPPLIER       SF * 10 K
  CUSTOMER       SF * 150 K
  PART           SF * 200 K
  PARTSUPP       SF * 800 K
  ORDER          SF * 1500 K
  LINEITEM       SF * 6000 K (approximate)

The chart below gives a more detailed view of the relationships between the tables and the number of rows per table when the scale factor is 10,000, that is, for a 10,000 GB (10 TB) database.
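The cardinalities in Table 6 scale linearly with SF, except for the two fixed-size tables. A minimal sketch of that rule (base counts from the table; the helper name is illustrative):

```python
# Base row counts per unit scale factor, from Table 6
BASE_ROWS = {
    "REGION": 5, "NATION": 25,      # fixed size, independent of SF
    "SUPPLIER": 10_000, "CUSTOMER": 150_000, "PART": 200_000,
    "PARTSUPP": 800_000, "ORDER": 1_500_000, "LINEITEM": 6_000_000,
}
FIXED = {"REGION", "NATION"}

def cardinality(table, sf):
    """Approximate row count for a table at scale factor sf."""
    base = BASE_ROWS[table]
    return base if table in FIXED else base * sf

print(cardinality("LINEITEM", 10_000))  # 60,000,000,000 rows (60,000M) at SF 10,000
print(cardinality("NATION", 10_000))    # still 25 rows
```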
Figure 9   TPC-H schema (the eight tables and their columns, with row counts at SF = 10,000: PART 2,000M, PARTSUPP 8,000M, CUSTOMER 1,500M, ORDER 15,000M, LINEITEM 60,000M, SUPPLIER 100M, NATION 25, REGION 5)

TPC-H enforces the ad-hoc model by severely restricting the implementation of auxiliary data structures such as indices and materialized query tables (sometimes known as automatic summary tables or materialized views). It also restricts how horizontal partitioning (by row) may be implemented: the partitioning column is constrained to primary keys, foreign keys, and date columns, and if range partitioning is used, the ranges must be divided equally between the minimum and maximum value. By imposing these restrictions, the TPC-H benchmark maintains the server platform as part of the performance equation and represents an ad-hoc environment. The TPC-R benchmark, which does not restrict auxiliary structures, models a reporting environment. The table below summarizes the differences between the TPC-H and TPC-R benchmarks.
Table 7   Differences between TPC-H and TPC-R benchmarks

– Auxiliary data structures. TPC-H (ad-hoc): restrictions on indices, no aggregates. TPC-R (reporting): extensive indices and aggregates OK.
– Simulated environment. TPC-H: ad-hoc queries. TPC-R: pre-planned, frequently asked queries.
– Side effects. TPC-H: heavy stress on the system, average response times of several minutes, update function times similar to query times. TPC-R: lots of tuning by the DBA, sub-second response times for several queries, much longer load time, update function times much longer than query times.

The TPC-H benchmark exercises the following areas:

Twenty-two queries with the following characteristics:
– Left outer join
– Very complex queries with nested sub queries
– Aggregates with a "HAVING" clause
– Queries with multiple "OR" predicates
– A query combining "EXISTS" and "NOT EXISTS"
– A query with multiple "SUBSTRING" operators
– Large scans with multi-table joins
– Aggregate operations with a large number of distinct values
– A large number of aggregations and sorts
– Queries relying on index access as well as table access
– Long running queries as well as short running queries, exercising all aspects of query processing

Database refresh functions to perform inserts and deletes on the database.

The benchmark specifications require that the implementation chosen for the benchmark satisfy the Atomicity, Consistency, Isolation and Durability (ACID) properties. Specific tests are designed to show:
– That the system either performs individual operations on the data completely or assures that no partially completed operations leave any effects on the data (A)
– That the execution of transactions takes the database from one consistent state to another (C)
Further tests show that concurrent database transactions are handled correctly (I), and that committed transactions and database consistency are preserved after recovery from hardware failures such as loss of power, memory, communications, and data and log disks (D).

The concurrent updates insert into and delete from the two large tables, LINEITEM and ORDER. A single update pair must be run for the power test, and a set of update pairs for each query stream is run in the multi-user throughput test. Each of the refresh functions represents 0.1 percent of the initial population of these two tables, so each pair of refresh functions alters 0.2 percent of these two tables.

There are certain rules that need to be followed for the implementation of these refresh functions, although the exact implementation is left to the vendor. The TPC-H specification states that each refresh function (RF1 or RF2) can be decomposed into any number of database transactions as long as the following conditions are met:
– All ACID properties are satisfied.
– Each atomic transaction includes a sufficient number of updates to maintain logical database consistency. For example, when adding or deleting a new order, the LINEITEM and ORDER tables are both updated within the same transaction.
– An output message is sent when the last transaction of the update function has completed successfully.

TPC-H Metrics

The benchmark specification provides details on how to report results, which consist of two performance metrics and one price/performance metric.

Composite metric:

  QphH@Size = sqrt( QppH@Size * QthH@Size )

This metric is the primary performance metric and is composed of two pieces.

Power:

  QppH@Size = (3600 * SF) / (Q1 * Q2 * ... * Q22 * RF1 * RF2)^(1/24)

where Q1, Q2, ..., Q22, RF1, and RF2 are the timing intervals, in seconds, of the queries and update functions. The geometric mean of the queries and updates is used here to give equal “weighting” to all the queries, even though some may be much longer running than others. The power metric is derived from a power run (single stream) in which all queries and update functions are run in a specified sequence.

Throughput:

  QthH@Size = (NumberOfStreams * 24 * 3600 * SF) / TotalElapsedTime

where each stream is defined as a set of the 22 queries and 2 updates in the predefined order, and the total elapsed time includes the timing interval for the completion of all query streams and the parallel update stream. The throughput metric must be derived from a throughput run (multi-stream). Each scale factor has a required minimum number of streams that must be run for the throughput run.

Price/Performance:

  Price-per-QphH@Size = $ / QphH@Size

where $ is the total hardware, software, and three-year maintenance cost for the system under test.

In addition to these TPC metrics, the number of streams is reported, which gives an indication of the amount of concurrency during the throughput run. The database load time (defined as the total elapsed time to create the tables, load data, create indices, define and validate constraints, gather database statistics, and configure the system under test) is also reported. Two consecutive runs must be executed, and the metrics for the run with the lower QphH are reported. The size of the database (or scale factor) is explicitly stated in the metric names. The TPC believes that comparisons of TPC-H results measured against different database sizes are misleading and discourages such comparisons.

Benchmark evolution

The following table shows the evolution of the TPC Decision Support benchmarks.

Table 8   Evolution of TPC Decision Support benchmarks

  Benchmark   Version      Date released    Date obsolete
  TPC-H       2.*          November 2002    Current
  TPC-H       1.*          Feb. 1999        November 2002
  TPC-D       1.* & 2.*    May 1995         April 1999
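The metric formulas above can be sketched as follows. This is an illustrative implementation of the arithmetic as given in the text; the timing values are made-up numbers, not from any published result.

```python
import math

def power_at_size(sf, timings):
    """QppH@Size: 3600*SF over the geometric mean of the 24 timing
    intervals (22 queries plus RF1 and RF2) from the power run."""
    assert len(timings) == 24
    geo_mean = math.prod(timings) ** (1 / 24)
    return 3600 * sf / geo_mean

def throughput_at_size(sf, streams, total_elapsed_seconds):
    """QthH@Size, as given in the text: streams * 24 * 3600 * SF / elapsed."""
    return streams * 24 * 3600 * sf / total_elapsed_seconds

def composite(qpph, qthh):
    """QphH@Size: geometric mean of the power and throughput metrics."""
    return math.sqrt(qpph * qthh)

# Hypothetical SF 1000 run: every interval 60 s, 7 streams finishing in 24 h
p = power_at_size(1000, [60.0] * 24)          # ~60000
t = throughput_at_size(1000, 7, 86400.0)      # 7000.0
print(round(p), round(t), round(composite(p, t)))
```

Because the power metric uses a geometric mean, one pathologically slow query drags the result down far more than it would under an arithmetic mean, which is the "equal weighting" the text describes.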
TPC-H Version 2 became effective in November of 2002. Although the TPC-H benchmark evolved from TPC-D, the two are vastly different and cannot be compared in any way. And although the basic performance aspects of TPC-H V1 and V2 are identical, the pricing methodology was changed: the price/performance metric of TPC-H V2 is based on a 3-year cost, while V1 was based on 5 years.

Performance evolution

Since the inception of the TPC Decision Support benchmarks, there have been some general trends in industry standard benchmark results. Although the incompatibilities between the four different versions of the TPC Decision Support benchmarks make it impossible to chart a continuous trend line from 1995 to 2003, the following points are undisputed:

– Price/performance has improved steadily
– Scale factor sizes, that is, database sizes, have increased, reflecting the growth pattern of the business intelligence market
– Processing power has increased
– Memory and disk requirements have increased

The TPC-H benchmark measures the server's I/O, CPU, and memory capabilities via various database operations such as full table scans, sorting, joins, and aggregation. It also measures how well a DBMS performs these basic database operations, and rewards those with efficient code paths, advanced query optimizers, and parallel technology.
Notices

This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

Any performance data contained herein was determined in a controlled environment; therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: DB2®, Eserver®, IBM®, ibm.com®, POWER4™, pSeries®, Redbooks™, Redbooks (logo)™.

The following terms are trademarks or registered trademarks of IBM in the United States and/or other countries: IBM, AIX, DB2, DB2 Universal Database, IBM Eserver, POWER, Power PC Architecture, pSeries.

TPC Benchmark, TPC-D, TPC-H, TPC-R, QppH, QppR, QppD, QphH, QphD, QphR, QthH, QthD, and QthR are trademarks of the Transaction Processing Performance Council.

Other company, product, and service names may be trademarks or service marks of others.

All TPC-H results referenced are as of March 30, 2004. Performance results described in this paper were obtained under controlled conditions and may not be achievable under different conditions. Actual system performance may vary and is dependent upon many factors including system hardware configuration and software design and configuration. All information is provided "AS IS" and no warranties or guarantees are expressed or implied by IBM.

This document was created or updated on July 23, 2004.

Send us your comments in one of the following ways:
- Use the online Contact us review redbook form found at: ibm.com/redbooks
- Send your comments in an email to: redbook@us.ibm.com
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. JN9B, Building 905, Internal Mail Drop 9053D005, 11501 Burnet Road, Austin, Texas 78758-3493 U.S.A.