Oracle, the TPC-C and RamSan®

by Michael R. Ault, Oracle Guru

Texas Memory Systems
August 2010

Introduction
Often you need to document how a particular system configuration will perform under a particular type of load. This process is called benchmarking. A benchmark provides a performance metric that can then be used to judge future performance, or the performance of other system configurations, against the baseline. If your system will be performing many small transactions with numerous inserts, updates, and deletes, then a test that measures online transaction processing (OLTP) performance is the proper benchmark. In this series of tests we use the TPC-C benchmark. This paper demonstrates using TPC-C to contrast and compare HDD-based and SSD-based I/O subsystems.

TPC-C Structure
For our tests the 1000 warehouse size was selected, corresponding to 42.63 gigabytes of data. Once the needed undo, redo, and temporary tablespaces and indexes are added, that 42.63 gigabytes of data corresponds to over 150 gigabytes of actual database size. Table 1 shows the beginning row counts for the TPC-C schema used.

Table Name      Row Count
-------------  ----------
C_WAREHOUSE         1,000
C_DISTRICT         10,000
C_CUSTOMER      3,000,000
C_HISTORY       3,000,000
C_NEW_ORDER       900,000
C_ORDER         3,000,000
C_ORDER_LINE   30,000,000
C_ITEM            500,000
C_STOCK        10,000,000

Table 1: Row Counts for TPC-C Tables
In this test we will utilize a ramp from 5 to 500 users spread over several client machines, with loads generated by the Quest Software Benchmark Factory tool. There will be two copies of the database: one on RamSan devices and one on two JBOD arrays of 45 15K RPM, 144 gigabyte hard drives. This lets us isolate the effect on transaction throughput of placing the data and indexes on RamSan versus placing them on disk arrays. We will also be doing a RAC scaling test with this setup.

The System Setup
The system we are testing consists of four dual-socket, quad-core 2 GHz Dell servers with 16 GB of memory each, using InfiniBand interconnects for an Oracle 11g (11.1.0.7) RAC database. The I/O subsystem is connected through a Fibre Channel switch to two RamSan-400s, a single RamSan-500, and two JBOD arrays of 45 10K RPM disks each. This setup is shown in Figure 1.

Figure 1: Test Rack
The database was built on the Oracle 11g Real Application Clusters (RAC) platform and consists of four major tablespaces: DATA, INDEXES, HDDATA, and HDINDEXES. The DATA and INDEXES tablespaces were placed on the RamSan-500; the HDDATA and HDINDEXES tablespaces were placed on two sets of 45 disk drives configured via ASM as a single diskgroup with a failure group (RAID10). The RamSan-400s were used for undo tablespaces, redo logs, and temporary tablespaces. In this dual configuration, one set of tables and indexes, owned by schema/user TPCC, is placed in the DATA and INDEXES tablespaces, and an identical set under a different schema/owner, HDTPCC, is placed in the HDDATA and HDINDEXES tablespaces. This lets us test the effect of having the database reside on RamSan assets versus disk drive assets while keeping the database configuration identical as far as memory and other internal resources are concerned. By placing the redo, undo, and temporary tablespaces on the RamSan-400s, we eliminate undo, redo, and temporary activity as a factor in the results and isolate the effect of the relative placement of data and indexes.

Testing Software
To facilitate the database build and the actual TPC-C protocol, we used Quest Software's Benchmark Factory to build the database and run the tests. Using a standard software package ensured that test execution and data gathering would be identical for the two sets of data.

Building the Database
Utilizing a manual script to create the needed partitioned tables and the Benchmark Factory (BMF) application with a custom build script, we built and loaded the RamSan-based tables. As each table finished loading, a simple “INSERT INTO HDTPCC.table_name SELECT * FROM TPCC.table_name” command was used to populate the HDD-based tables. Once all of the tables were built in both locations, the indexes were built using a custom script. Following the database build, statistics were gathered on each schema using the Oracle-provided DBMS_STATS.GATHER_SCHEMA_STATS() PL/SQL procedure. The normal BMF TPC-C build wizard looks for existing tables; if they exist, it assumes they are loaded and jumps straight to executing the test. Since we had prebuilt the tables as partitioned tables, this would have resulted in the test being run against an empty set of tables. Instead of using the provided wizard, Quest provided me with a custom BMF script that loads the TPC-C tables in 9 parallel streams against existing tables. The custom script could also be edited to load any subset of tables. This subsetting of the loading process is critical when, for example, a table such as C_ORDER_LINE runs out of room: by allowing a single table to be reloaded if needed, the custom script made it easier to recover from errors. The custom BMF load script did not create indexes; instead, a separate custom script was used to build the TPC-C indexes along with additional performance-related indexes.
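The per-table copy step described above lends itself to a small generator script. A minimal, illustrative sketch (table names taken from Table 1; the SQL form is the one quoted in the text):

```python
# Generate the per-table copy statements used to populate the HDD schema.
# The statement form is the one quoted in the text; running them would of
# course require an Oracle session with access to both schemas.
tables = ["C_WAREHOUSE", "C_DISTRICT", "C_CUSTOMER", "C_HISTORY",
          "C_NEW_ORDER", "C_ORDER", "C_ORDER_LINE", "C_ITEM", "C_STOCK"]

for t in tables:
    print(f"INSERT INTO HDTPCC.{t} SELECT * FROM TPCC.{t};")
```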

Scalability Results – 1 to 4 Instances
The first set of TPC-C tests was designed to measure the scalability of the RamSan against the scalability of the disk-based environment. In this set of tests the size of the SGA was kept constant for each node while additional nodes were added and the number of users was cycled from 5 to 300. The transactions per second (TPS) and peak user count at the peak TPS were noted for both the disk and RamSan-based systems.

Highest TPS and Users at Maximum
Of concern to all IT managers, DBAs, and users are the values where a system will peak in its ability to process transactions versus the number of users it will support. The first results we will examine concern these two key statistics. The results for both the RamSan and HDD tests are shown in Figure 2.
[Chart: “HD and SSD TPCC” – TPS (0–4000) versus clients (0–350); series: SSD 1 Server through SSD 4 Server and HD 1 Server through HD 4 Server.]

Figure 2: RamSan and HDD Scalability Results
As can be seen from the graph, the HDD results peak at 1051 TPS and 55 users. The RamSan results peak at 3775 TPS and 245 users. The HDD results fall off from 1051 TPS with 55 users to 549 TPS and 15 users, going from 4 down to 1 server. The RamSan results fall from 3775 TPS and 245 users down to 1778 TPS and 15 users. However, the 1778 TPS seems to be a transitory spike in the RamSan data with an actual peak occurring at 1718 TPS and 40 users. Notice that even at 1 node, the RamSan-based system outperforms the HDD based system with 4 full nodes, with the exception of 3 data points, across the entire range from 5 to 295 users. In the full data set out to 500 users for the 4 node run on RamSan, the final TPS is 3027 while for the HDD 4 node run it is a paltry 297 TPS.
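The headline speedup factors implied by these peaks can be checked with a little arithmetic. A quick, illustrative calculation using the figures quoted above:

```python
# Peak throughput figures quoted in the text (transactions per second).
hdd_peak_tps, hdd_peak_users = 1051, 55
ssd_peak_tps, ssd_peak_users = 3775, 245

# RamSan peak throughput vs. HDD peak throughput.
peak_speedup = ssd_peak_tps / hdd_peak_tps
print(f"Peak TPS speedup: {peak_speedup:.1f}x")        # ~3.6x

# At the full 500-user load the gap is far wider.
tail_speedup = 3027 / 297
print(f"Speedup at 500 users: {tail_speedup:.1f}x")    # ~10.2x
```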

Key Wait Events During Tests
Of course part of the test is to determine why the HDD and RamSan numbers vary so much. To do this we look at the key wait events for the 4-node runs for each set of tests. Listing 1 shows the top five wait events for the RamSan 4 node run.

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                       Avg wait   % DB
Event                      Waits        Time(s)            (ms)   time  Wait Class
-------------------------- ------------ ------------ ---------- ------ ----------
gc buffer busy acquire       82,716,122      689,412          8   47.5  Cluster
gc current block busy         3,211,801      148,974         46   10.3  Cluster
DB CPU                                       120,745               8.3
log file sync                15,310,851       70,492          5    4.9  Commit
resmgr:cpu quantum            6,986,282       58,787          8    4.0  Scheduler

Listing 1: Top 5 Wait Events During 4 Node RAMSAN Run
The only physical I/O related event in the RamSan 4 node top 5 events is the log file sync event with a timing of 5 ms per event. Listing 2 shows the top 5 events for the HDD 4 node run.
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                       Avg wait   % DB
Event                      Waits        Time(s)            (ms)   time  Wait Class
-------------------------- ------------ ------------ ---------- ------ ----------
db file sequential read      15,034,486      579,571         39   37.4  User I/O
free buffer waits            28,041,928      433,183         15   28.0  Configurat
gc buffer busy acquire        4,064,135      150,435         37    9.7  Cluster
gc buffer busy release          116,550      103,679        890    6.7  Cluster
db file parallel read           191,810      102,082        532    6.6  User I/O

Listing 2: Top 5 Wait Events During 4 Node HDD Run
Note in Listing 2 how the predominant event is db file sequential read, with 15,034,486 waits and an average wait of 39 ms per wait. This was with 579.3 read requests per second per instance. In the RamSan test, if we examine the full wait event list, we see substantially more db file sequential reads, as shown in Listing 3.
                                        %Time  Total Wait     Avg  Waits   % DB
Event                      Waits        -outs    Time (s) wait(ms)  /txn   time
-------------------------- ------------ ----- ----------- ------- ----- ------
gc buffer busy acquire       82,716,122     0     689,412       8   5.1   47.5
gc current block busy         3,211,801     0     148,974      46   0.2   10.3
log file sync                15,310,851     0      70,492       5   1.0    4.9
resmgr:cpu quantum            6,986,282     0      58,787       8   0.4    4.0
db file sequential read      49,050,918     0      53,826       1   3.0    3.7

Listing 3: Extended Wait List for RAMSAN 4-Node Run
Listing 3 shows that while over 3 times as many db file sequential reads were performed (since there were 3 times the TPS at peak), the total wait time for those events was less than 10 percent of the total wait time for the HDD test: only 53,826 seconds for the RamSan versus 579,571 seconds for the HDD. The RamSan-based system performed 2,510.5 read requests per second from each instance, at 1.1 ms per wait event. In addition, the HDD test shows heavy write stress: the number two wait event was free buffer waits, which indicates that users waited for buffers to be written to disk before they could be reused. In comparison, the AWR report from the SSD 4-node test shows no entry at all for free buffer waits, indicating little to no write stress on the RamSan during the test. These statistics show that the main problem for the HDD runs was latency. Since the RamSan system's latency is roughly 10 percent of the HDD system's, the physical I/O waits are lower by at least a factor of 10 overall. This reduction in latency allows more processing to be accomplished and more transactions, and users, to be supported.
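The latency arithmetic behind this comparison follows directly from the AWR figures quoted above. A quick, illustrative calculation (not part of the original test harness):

```python
# AWR figures quoted in the text for the two 4-node runs.
hdd_waits, hdd_wait_s = 15_034_486, 579_571   # db file sequential read, HDD
ssd_waits, ssd_wait_s = 49_050_918, 53_826    # db file sequential read, RamSan

# Average latency per read = total wait time / number of waits.
hdd_avg_ms = hdd_wait_s / hdd_waits * 1000
ssd_avg_ms = ssd_wait_s / ssd_waits * 1000
print(f"HDD avg read latency:    {hdd_avg_ms:.1f} ms")  # ~38.5 ms (AWR rounds to 39)
print(f"RamSan avg read latency: {ssd_avg_ms:.1f} ms")  # ~1.1 ms
print(f"Latency ratio:           {hdd_avg_ms / ssd_avg_ms:.0f}x")
```

The ratio works out to well over the factor of 10 the text claims as a minimum.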

Transaction and Response Times
Of course, to get raw throughput (IOPS) we could simply add disks to the disk array and eventually drive latency down to between 2 and 5 ms. In this situation, however, we would probably have to increase the number of disks from 90 to 900 in a RAID10 setup to achieve the minimum possible latency. To purchase that many disks, HBAs, cabinets, and controllers, and to provide power and cooling for them, all to support a database of less than 100 gigabytes, would be ludicrous, yet many of the TPC-C reports on the www.tpc.org website show just such a configuration. Throughput is also measured by looking at transaction time and response time in a given configuration. You must look at three statistics, minimum time, maximum time, and 90th percentile time, for both transactions and responses. Transaction time is the time to complete the full transaction; response time is the amount of time needed to get the first response back from the system for a transaction. Figure 3 shows the plot of transaction and response times for the 4-node HDD TPC-C run.
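A rough sanity check on the 90-to-900-disk estimate, assuming a rule-of-thumb figure of roughly 200 random IOPS per 15K RPM drive (my assumption for illustration, not a number from the paper):

```python
# Rough spindle-count estimate behind the "90 to 900 disks" claim.
# Assumes ~200 random IOPS per 15K RPM drive -- a common rule of thumb,
# not a figure taken from the paper.
target_iops = 86_000            # roughly what the RamSan sustained (see Figure 7)
iops_per_disk = 200
data_disks = target_iops / iops_per_disk   # ~430 drives just to serve the reads
raid10_disks = data_disks * 2              # RAID10 mirroring doubles the count
print(f"Approx. drives needed (RAID10): {raid10_disks:.0f}")  # ~860, i.e. near 900
```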

[Chart: “HD Transaction and Response Time” – seconds (log scale, 0.0001 to 1000) versus users (0 to 600); series: average, minimum, maximum, and 90th percentile transaction and response times.]

Figure 3: HDD Transaction and Response Times
As you can see, the transaction times and the response times for the HDD system track each other closely. Once you pass around 150 users, the 90th percentile times begin to exceed 1 second and soon reach several seconds as the user load continues to increase. The maximum response and transaction times reach over 200 seconds at around 200 users and hover there for the remainder of the test. The results for the 4-node RamSan run are shown in Figure 4.

[Chart: “SSD Transaction and Response Times” – seconds (log scale, 0.0001 to 10) versus users (0 to 600); series: average, minimum, maximum, and 90th percentile transaction and response times.]

Figure 4: RAMSAN Transaction and Response Times

For the RamSan 4-node run, at no time, even at 500 users, do the average, 90th percentile, or maximum transaction or response times ever exceed 6 seconds. In fact, the average and 90th percentile transaction and response times for the RamSan run never exceed 0.4 seconds. The RamSan serves more users at a higher transaction rate and delivers results anywhere from 5 to 10 times faster, even at maximum load.


Summary of Scalability Results
In every phase of scalability testing the RamSan outperformed the HDD-based database. In total transactions, maximum number of users served at peak transaction rate, and lowest transaction and response times, the RamSan-based system showed a factor of 5 to 10 times better performance and scalability. In fact, a single-node RamSan test outperformed the 4-node HDD test. These tests show that a single server with 8 CPUs and 16 GB of memory attached to a RamSan-500 can outperform a 4-node RAC system with 8 CPUs and 16 GB per node running against a 90-disk RAID10 array, with identical Oracle database configurations. It should also be noted that the I/O load from the SYSTEM, UNDO, TEMP, and SYSAUX tablespaces, as well as from the control files and redo logs, was offloaded to the RamSan-400s. Had this additional load been placed on the disk array, performance for the HDD tests would have been much worse.

Memory Restriction Results: 1 to 9 Gigabyte Cache Sizes
In the next set of TPC-C runs the number of 4 Gbps Fibre Channel links to the RamSan arrays was maximized. Since the disk arrays (two, each with one 2 Gbps FC link) were not bandwidth constrained, no additional FC links were added for them. The Fibre Channel and InfiniBand layout is shown in Figure 5.

Figure 5: Fibre and InfiniBand Connections
In order to properly utilize the InfiniBand interconnect, the Oracle executable must be relinked with the RDS protocol. With one 2 Gbps interface per disk array, this translates to about 512 MB/s of total bandwidth, or 64K IOPS at an 8K transfer size. Because the database is configured with an 8K blocksize and this is an OLTP workload with mostly single-block reads, this should provide sufficient bandwidth. With 90 disk drives, the maximum expected IOPS would be 18,000 if every disk could be accessed simultaneously at its maximum random I/O rate.
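The bandwidth and IOPS arithmetic can be checked quickly. An illustrative calculation; the 256 MB/s-per-link payload figure is the approximation implied by the text's 512 MB/s total, and the 200-IOPS-per-drive figure is the one implied by its 18,000-IOPS ceiling:

```python
# Fibre Channel bandwidth ceiling for the two disk arrays.
links = 2                      # one 2 Gbps FC link per disk array
mb_s_per_link = 256            # ~2 Gbps of payload per link, as the text assumes
total_mb_s = links * mb_s_per_link
block_kb = 8                   # database block size (single-block OLTP reads)
iops_ceiling = total_mb_s * 1024 // block_kb
print(f"Link bandwidth: {total_mb_s} MB/s -> {iops_ceiling} IOPS at 8K")  # 512 MB/s, 65536 IOPS

# Spindle-limited ceiling: 90 drives at their maximum random I/O rate.
disks, iops_per_disk = 90, 200
print(f"Max expected disk IOPS: {disks * iops_per_disk}")  # 18000
```

So the spindles, not the FC links, are the binding constraint for the HDD arrays.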

Red Hat multipathing software was used to unify the multiple ports presented by each device into a single virtual port on the servers. The RamSans were presented with the names shown in the diagram in Figure 5. The HDD arrays were formed into a single ASM diskgroup with one array as the primary and the other as its failure group. The HDDATA and HDINDEXES tablespaces were then placed into that ASM diskgroup. The memory sizes for the various caches (default, keep, and recycle) were then reduced proportionately to observe the effects of memory stress on the four-node cluster. The results of the memory stress test are shown in Figure 6.
[Chart: “SSD and HD TPS vs Memory, 4 Instances” – TPS (0–4000) versus users (0–600); series: RamSan at 9, 6.75, 4.5, 2.25, and 1.05 GB cache sizes and HDD at 9, 6.85, 4.5, and 3 GB cache sizes.]

Figure 6: RAMSAN and HDD Results from Memory Stress Tests
Figure 6 clearly shows that the RamSan handles the memory stress better, by a factor ranging from 3 at the top end (9 GB total cache) to a huge 7.5 at the low end (comparing a 1.05 GB cache on the RamSan to a 4.5 GB cache on the HDD run). The HDD run was limited to 3 GB at the lower end by time constraints; since performance would only get worse as the cache was further reduced, further testing was deemed redundant. The reason for the wide range between the upper and lower memory results, from a factor of 3 to 7.5 times better performance by the RamSan, can be traced to the increase in physical I/O that resulted from being unable to cache results, and the subsequent increase in physical IOPS. This is shown in Figure 7.

[Chart: “IOPS High and Low Memory” – IOPS (0–100,000) versus 15-minute intervals; series: SSD and HDD at high-memory (HM) and low-memory (LM) settings.]

Figure 7: IOPS for RAMSAN and HDD with Memory Stress
Figure 7 shows that physical IOPS for the RamSan ranged from 86,354 at a 1.05 GB cache down to 33,439 at 9 GB. The HDD array was only able to achieve a maximum of 14,158 IOPS at 1.05 GB, down to 13,500 at 9 GB of memory, using the 90-disk RAID10 ASM-controlled disk array. The relatively flat response curve of the HDD tests indicates that the HDD array was saturated with I/O requests and had reached its maximum IOPS. Figure 7 also shows that in the RamSan-based tests the IOPS was allowed to reach the natural peak for the data requirements, while in the HDD tests it was artificially capped by the limits of the hardware. The timings for db file sequential read waits in the RamSan and HDD runs are also indicative of the latency difference between RamSan and HDD. For the RamSan runs, the time spent on db file sequential reads varied from a high of 169,290 seconds for the entire measurement period at 1.05 GB down to 53,826 seconds for the 9 GB run. In contrast, the HDD runs required 848,389 seconds at 3 GB down to 579,571 seconds at 9 GB.
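The saturation argument is visible in the numbers themselves. A quick, illustrative check using the IOPS figures quoted above:

```python
# IOPS figures quoted in the text for the memory-stress runs,
# keyed by total cache size.
ssd_iops = {"1.05 GB": 86_354, "9 GB": 33_439}
hdd_iops = {"1.05 GB": 14_158, "9 GB": 13_500}

for cache in ("1.05 GB", "9 GB"):
    ratio = ssd_iops[cache] / hdd_iops[cache]
    print(f"{cache} cache: RamSan sustained {ratio:.1f}x the HDD IOPS")

# The near-flat HDD numbers (14,158 down to only 13,500 across an 8x change
# in cache size) are the signature of a saturated array: the spindles, not
# the buffer cache, set the ceiling.
```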

Memory Stress Results Summary
The tests show that the RamSan array handles a reduction in available memory much better than the HDD array. Even with a little over 1 GB of total cache per node in the 4-node RAC environment, the RamSan outperformed the HDD array with a 9 GB total cache per node in a 4-node RAC using identical servers and database parameters. Unfortunately, due to bugs in the production release of Oracle 11g (11.1.0.7), we were unable to test Oracle's Automatic Memory Management feature; the bug limits total SGA size to less than 3-4 gigabytes per server.

Summary
These tests prove that the RamSan array outperforms the HDD array in every aspect of the OLTP environment. In scalability, in performance, and under memory stress, the RamSan array achieved a minimum of 3 times the performance of the HDD array, often giving a 10-fold or better performance boost for the same user load and memory configuration. In a situation where the choice is between buying additional servers, memory, and an HDD SAN versus fewer servers, less memory, and a RamSan array, these results show that you will get better performance from the RamSan system in every configuration tested. The only way to get near the performance of the RamSan array would be to over-purchase the number of disks required by almost two orders of magnitude to obtain the required IOPS through increased spindle counts.

When the costs of purchasing the additional servers and memory and the costs of floor space, disk cabinets, controllers, and HBAs are combined with the ongoing energy and cooling costs for a large disk array system, it should be clear that RamSans are a better choice.
