
In-Memory Database Deep Dive

Last updated: 25 August 2011

Based on excerpts from "In-Memory Data Management" by Hasso Plattner and Alexander Zeier (ISBN 978-3-642-19362-0) and own observations. Edited by Michael Missbach.

Intro
    Business with the Speed of Thought
    In-Memory - a No-Brainer?
Not all memory is the same
    Non-Uniform Memory Access
    Cache hierarchies
    Other readers - the impact of prefetching
Between Scylla and Charybdis
    Row-oriented versus column-oriented
    Two engines under one hood: Hybrid
What about real-real time?

Intro
Business with the Speed of Thought
According to scientific studies, the time between a stimulus and a simple reaction of a human is 220 milliseconds on average (just put your finger on a hot stove and measure the time until the pain makes you pull it back). The recognition time needed to really understand a situation and react to it consciously, however, has been measured at 384 milliseconds. For more complex contexts, an interval of 550 to 750 milliseconds is assumed as the speed of thought. For repeatedly performed actions, the reaction time becomes shorter.

If the response time of a system is significantly longer than this interval, the user's mind starts to wander to other topics. This effect cannot be consciously controlled. When the system response finally pops up, the user's brain has to perform a context switch to find its way back to the topic and remember what the next step was. Such context switching is extremely tiring. A response time of less than a second is therefore the ultimate goal for increasing productivity, because it enables the user to focus on a topic without being distracted by other tasks during waiting periods.

Thanks to the dramatically improved performance of computer systems, this goal can usually be achieved for well-designed transactional systems, which use only a relatively small working set of data that can easily be kept in memory, so that only small amounts of data have to be read from and written to persistent storage devices.

For business analytics, where much larger datasets usually have to be processed, the situation is different. The same is true for certain processes in transactional systems, for example dunning and replenishment. Even worse, such processes cause significant performance degradation when they run in parallel with normal operation on classical transactional systems. The classical approach to this problem was to run such processes at night and to use specialized data warehouses like SAP BW. Such On-Line Analytical Processing (OLAP) systems store the data in special data structures known as InfoCubes, which also provide key figures derived from this data as aggregates. As a result, the key figures do not have to be calculated individually for each query. Instead, they are computed directly after the data is loaded and are therefore immediately available when a query is made. This pre-aggregated data model enables reports to be executed considerably faster in SAP BI than in an OLTP system.
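
To make the pre-aggregation idea concrete, here is a minimal, purely illustrative C++ sketch (not SAP BW code; the record layout and names are assumptions): the totals are computed once when the data is loaded, so a later query is a simple lookup instead of a scan over all records.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical sketch of the pre-aggregation idea behind an InfoCube-like store.
    struct Sale { std::string region; double amount; };

    class PreAggregatedStore {
    public:
        // Aggregates are computed once, directly after the data is loaded ...
        explicit PreAggregatedStore(const std::vector<Sale>& sales) {
            for (const auto& s : sales) totals_[s.region] += s.amount;
        }
        // ... so a query is a single lookup instead of a full table scan.
        double totalFor(const std::string& region) const {
            auto it = totals_.find(region);
            return it == totals_.end() ? 0.0 : it->second;
        }
    private:
        std::unordered_map<std::string, double> totals_;
    };

    int main() {
        std::vector<Sale> sales = {{"EMEA", 100.0}, {"APJ", 50.0}, {"EMEA", 25.0}};
        PreAggregatedStore cube(sales);              // load-time aggregation
        std::cout << cube.totalFor("EMEA") << "\n";  // immediate answer: 125
    }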

However, such fast responses apply only to the pre-calculated reports. Any change to a report demands a cumbersome recreation of the InfoCubes. Even worse, the operational data has to be transferred into these warehouses via batch jobs, mostly after business hours, to avoid longer response times on the transactional systems. As a consequence, business decisions are made on the truth of yesterday, and flexible ad-hoc reporting on up-to-date data is almost impossible.

A technology enabling fast ad-hoc analysis would make such predefined reports and the cubes, which are cumbersome to build and maintain, obsolete. Even better, such a technology would remove the need for separate analytical systems, allowing analysis of real-time data in the transactional business system again. And best of all, users would be able to interact with their business application just like they interact with a web search engine, refining search results on the fly when the initial results are not satisfying.

Experience with common search engines on the web seems to demonstrate that they are able to analyze massive amounts of data in real time. Whatever query you enter, the answer appears almost instantly. However, there is a major difference between web search and enterprise applications with regard to the completeness of results. A web search engine does not have to give complete answers; Google can be so astonishingly fast because the results of its searches just have to be good enough for the common user. For this purpose it only has to scan through an indexed set of data. The generation of this index, however, is a time-consuming process which needs massive computing power and is anything but real-time. In contrast, enterprise applications have to consider all data relevant for a business report - no taxation authority on the planet will accept a tax payment based on anything less than a complete scan through each and every account.

In-Memory - a No-Brainer?
Given the fact that access to data in a computer's main memory is orders of magnitude faster than access to data stored on disk, the concept of in-memory computing seems to be a no-brainer. SAP followed this approach more than a decade ago with the APO LiveCache, literally a MaxDB (aka SAP DB, aka ADABAS D) running completely in main memory.

Simply enlarging the main memory until it can hold the complete dataset of any application seems to be a straightforward strategy, since large amounts of main memory have become affordable. With the majority of databases used for SAP business applications being in the range of 1 to 3 TB, and considering the advanced compression features of state-of-the-art database systems, it should be easy to hold the complete working set of an On-Line Transaction Processing application in the SGA, as in the standard SD benchmark.

Such an approach, however, is still not sufficient to achieve the necessary performance for ad-hoc analysis, where the complete content of such datasets has to be scanned. To enable business users to distil useful information from raw data within the blink of an eye, a deep understanding is necessary of the way data is organized not only in main memory but also in the CPU and the different intermediate caches.

Not all memory is the same


According to the empirical Moore's Law, the number of transistors on a single chip doubles approximately every two years. The exponential increase of processor clock speed, however, came to an end after almost 30 years in 2002, when heat dissipation became the limiting factor for increasing CPU power. Software programmers can therefore no longer implicitly benefit from advances in processor technology - the free lunch is over. Multi-threading and multi-core technologies have been introduced to circumvent these limitations, but not all business processes and applications can be parallelized to take advantage of these developments. In addition, the speed of the interface between the CPU on one side and the main memory and all other input/output (I/O) components on the other has also stagnated, widening the gap between the capability of processors to digest data and the ability of the Front Side Bus (FSB) to deliver it.

Consequently, a large main memory alone is of little help if access to it is throttled by bottlenecks on the path down to the processor.

Non-Uniform Memory Access


To circumvent this bottleneck, the single shared memory bus topology was replaced by point-to-point connections where each processor has an exclusive, direct channel to a certain segment of the main memory. This straightforward approach removed the bottleneck of the shared Front Side Bus. As a consequence, however, access to main memory now has to be distinguished into access to local memory and access to remote memory. The first is the segment directly connected to the processor requesting the data; the second comprises all segments connected to other processors, whose contents have to be transmitted over a dedicated bus between the processors.

Obviously, maximum performance in such a Non-Uniform Memory Access (NUMA) architecture is only reached if the executed tasks access local memory exclusively. Performance degradation of up to 25% has to be accepted if many tasks need to access remote memory and the interprocessor connections become saturated by extensive data transfer.

To guarantee data consistency, the caches of the processors have to be synchronized. In cache-coherent Non-Uniform Memory Access (ccNUMA) systems, all CPU caches share the same view of the available memory and coherency is ensured by hardware. In non-cache-coherent NUMA systems, software layers have to take care of conflicting memory accesses.
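
As an illustration of the local/remote distinction, the following sketch uses the Linux libnuma API to place one buffer on the local node and one on a remote node. The node numbers and the 64 MB buffer size are assumptions for demonstration only; this is a sketch, not a tuning recipe.

    // Sketch of NUMA-aware allocation on Linux with libnuma (link with -lnuma).
    #include <numa.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() == -1) {
            std::printf("libnuma reports no NUMA support on this system\n");
            return 1;
        }
        const std::size_t size = 64UL * 1024 * 1024;   // 64 MB test buffer (arbitrary)

        numa_run_on_node(0);                           // pin this thread to node 0
        void* local  = numa_alloc_onnode(size, 0);     // memory on the local node
        void* remote = numa_alloc_onnode(size, numa_max_node());  // memory on the farthest node
        if (!local || !remote) return 1;

        // Writing "local" stays on the processor's own memory channel; writing
        // "remote" has to cross the interprocessor links and is therefore slower.
        std::memset(local, 1, size);
        std::memset(remote, 1, size);

        numa_free(local, size);
        numa_free(remote, size);
        return 0;
    }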

Cache hierarchies
Even if main memory is several times faster than disk, it is still not as fast as the processor itself. Typical DDR3 memory runs at clock speeds between 800 and 1333 MHz; current Intel Westmere CPUs are rated for up to 3.6 GHz. Therefore all current CPU designs deploy caching to decrease the latency of repetitive access to the same piece of data. In fact, most of the billions of transistors on a CPU chip are used to cache data before it reaches the registers of the computing cores. Physically, it takes many more transistors to build very fast caches with high bandwidth and low latency than to build slower caches with higher latency. Because the available number of transistors is still restricted, CPU makers have established a hierarchy of caches with different latencies and bandwidths. Data is transmitted through the following layers in sequence:

- From main memory via the memory channel to a level 3 cache shared among all cores of the CPU
- From the level 3 cache via the CPU-internal bus to the level 2 cache that is private to each CPU core
- From the level 2 cache via the core bus to the level 1 cache of the CPU core
- From the level 1 cache via the core bus to the registers of the CPU

Even if transistors do not need the seek time of a hard disk caused by the mechanical inertia of the disk arm, it still takes some time to decode a memory address and connect the transistors that contain the requested piece of data to the bus. For caching memories, latency is also caused by the process of determining whether and where the requested data is stored in a given block. Lower levels are faster but smaller. In current Intel Xeon CPUs, the 64 KB level 1 low-latency cache and the 256 KB level 2 cache both run at the full core clock speed. But even though they run at the same clock as the core, merging them into one cache would have negative effects on bandwidth and latency and would also increase the transistor footprint. The 30 MB level 3 cache shared by all cores runs at the uncore clock speed - somewhat more than half of the core clock - as do the memory controller and the QPI links.

To be efficient, the design of an in-memory system has to take the different sizes and latencies of these caches into account. Caches are organized on the basis of fixed-size cache lines rather than on a per-byte basis. In the commonly used inclusive cache hierarchy, the data cached on a lower layer is also included in the higher-level caches. To load a value from main memory, the data has to be transmitted through all intermediate caches in sequence. Accessing main memory can consume up to 80 times as many CPU cycles as an access to the level 1 cache.

In an ideal world, the data requested by the processor would always be available in the cache. In the real world, however, so-called cache misses happen: the necessary piece of data is not currently available in a certain cache level. The worst case is a full miss, when a requested piece of data is only present in main memory. Avoiding cache misses has a major impact on processing performance, because a complete cache line has to be invalidated to free up the necessary space for the newly fetched data, wasting all the effort previously spent on filling that line.

Given the relatively small size of the level 1 cache, it is obvious that data should be organized in such a way that values which are likely to be used together in a given calculation are grouped together. If data is organized the wrong way, the cache-line-based access pattern leads to reading a lot of data that is never used for the calculation at all. Even the fastest data transfer is futile if it transfers the wrong data. Therefore data structures have to be optimized to maximize the likelihood that all the data necessary for the next computing step is gathered in the same cache line. The target is to align access patterns in a way that enables data to be read sequentially and random lookups to be avoided.
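
A hedged C++ sketch of this effect, using a plain matrix stored row by row in one contiguous array (sizes and names are arbitrary assumptions): traversing it in storage order uses every byte of each fetched cache line, while traversing it column by column touches a new cache line for almost every single value.

    #include <cstddef>
    #include <vector>

    // Same data, same amount of arithmetic, very different cache-line utilization.
    double sumRowMajor(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
        double sum = 0.0;
        // Walks the matrix in storage order: each 64-byte cache line that is
        // fetched delivers eight doubles that are all used before it is evicted.
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                sum += m[r * cols + c];
        return sum;
    }

    double sumColumnMajor(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
        double sum = 0.0;
        // Jumps cols * 8 bytes between consecutive accesses: almost every access
        // pulls in a different cache line and uses only one value out of it.
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                sum += m[r * cols + c];
        return sum;
    }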

Other readers - the impact of prefetching


Obviously, even the fastest data transfer causes some additional latency and subsequently a stall of the program waiting for data. Modern CPUs use prefetching algorithms which make a best guess as to which block of data will be requested next by the CPU and load it into the cache lines in advance. To achieve a high hit/miss ratio, the principles of spatial and temporal locality have to be considered for memory access. Spatial locality refers to the observation that the CPU often accesses adjacent memory cells; temporal locality describes the experience that if the CPU addresses data in memory, it is likely that this data will be accessed again soon. Caching utilizes spatial locality by loading not only the memory block containing the requested data, but also some of its neighbors into the cache. The temporal locality of memory references can be exploited by the cache replacement policy, which frees up space for new data by replacing the cache line that has not been used for the longest time.
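
The following micro-benchmark sketch (hypothetical, sizes chosen arbitrarily) illustrates the point: walking a chain of indices sequentially lets the hardware prefetcher guess the next cache lines correctly, while walking a shuffled chain defeats it, so nearly every step pays the full main-memory latency.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    // Sequential access is prefetcher-friendly (spatial locality);
    // dependent random access ("pointer chasing") is not.
    int main() {
        const std::size_t n = 1 << 24;                 // ~16M entries, far larger than any cache
        std::vector<std::size_t> sequential(n);
        std::iota(sequential.begin(), sequential.end(), 1);   // chain i -> i+1
        sequential.back() = 0;                                 // close the cycle

        // Shuffling the chain makes every step land on a cache line the
        // prefetcher cannot predict.
        std::vector<std::size_t> shuffled = sequential;
        std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937{42});

        auto walk = [n](const std::vector<std::size_t>& chain, const char* label) {
            auto start = std::chrono::steady_clock::now();
            std::size_t pos = 0;
            for (std::size_t i = 0; i < n; ++i) pos = chain[pos];  // each load depends on the previous one
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::printf("%s: %lld ms (pos=%zu)\n", label, (long long)ms, pos);
        };

        walk(sequential, "sequential chain");  // prefetcher hides most of the memory latency
        walk(shuffled,   "shuffled chain");    // nearly every step is a full cache miss
    }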

Between Scylla and Charybdis


Row-oriented versus column-oriented
Whenever database structures are described, it is implicitly assumed that data is logically stored in two-dimensional tables like a spreadsheet. In the physical world, however, all the bits and bytes representing the data are stored and transmitted as one single string. Consequently there are two ways to transform a table into a single string: you can either write one row of the table after the other, or you can write each column after the other. The first option is called row-oriented, the second column-oriented.

For very good reasons, most databases used for business applications store the data values in a row-oriented fashion. This way, most of the data belonging to the same business transaction - such as the order number, the number of the customer who bought the item, the part number of the item ordered, the number of parts ordered, the price per piece and the total sum - is stored in adjacent blocks in memory. Such a row-oriented organization of data increases the likelihood that all data belonging to a single business transaction is found in the same cache line, reducing the number of cache misses. The fact that row-oriented databases have enabled sub-second response times for decades, even with disk-based storage, demonstrates that this concept fits On-Line Transaction Processing (OLTP) systems well.

Unfortunately, row-oriented storage is not well suited for reporting, where not the complete data set of a single business transaction is of interest, but for example only the part numbers, how many of them are bought on average, or the total sum per order. In contrast to a typical business process, only a small number of attributes of a table is of interest for a particular query in a typical analysis. Loading every row into the cache when only a fraction of the data is really used is clearly not optimal for On-Line Analytical Processing (OLAP) systems, even if they run completely in memory.

Organizing tables in such a way that columns are stored in adjacent blocks makes it possible to move only the required columns into cache lines while the rest of the table is ignored. This way the cache has to keep only the data needed to process the request, significantly reducing the data traffic from main memory to the CPUs, between CPUs, and down through the whole cache hierarchy. Maximizing the likelihood that the necessary data can be found in the level 1 cache obviously speeds up processing and minimizes the response time. Analysis of database accesses in enterprise warehouse applications as well as practical experience demonstrate that column-oriented solutions like SAP BWA and Sybase IQ are an excellent choice for on-line analytical systems.

The obvious disadvantage of these systems is that their performance with row-based transactions is poor. So what is good for business transactions is bad for reports, and vice versa. For many years the only answer to this dilemma was to deploy two sets of applications optimized either for OLTP or for OLAP, which not only doubles the amount of data to be stored and subsequently the hardware and operation costs, but also creates the need to synchronize the data between the different systems.
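
As a minimal sketch of the two physical layouts (hypothetical table and field names), the same logical orders table can be kept as a vector of complete rows or as one vector per column; an aggregate over a single attribute then only streams through one contiguous array in the column layout.

    #include <vector>

    // One logical "orders" table, two physical layouts (illustrative only).

    // Row-oriented: all attributes of one business transaction are adjacent,
    // which suits OLTP-style "read or write the whole order" access.
    struct OrderRow {
        long   orderNumber;
        long   customerNumber;
        long   partNumber;
        int    quantity;
        double total;
    };
    using RowStore = std::vector<OrderRow>;

    // Column-oriented: each attribute is stored contiguously, which suits
    // OLAP-style queries that touch only a few columns of many rows.
    struct ColumnStore {
        std::vector<long>   orderNumber;
        std::vector<long>   customerNumber;
        std::vector<long>   partNumber;
        std::vector<int>    quantity;
        std::vector<double> total;
    };

    // "Average ordered quantity" needs only one column; in the column store
    // the scan streams through a single contiguous array.
    double averageQuantity(const ColumnStore& t) {
        long long sum = 0;
        for (int q : t.quantity) sum += q;
        return t.quantity.empty() ? 0.0 : static_cast<double>(sum) / t.quantity.size();
    }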

Two engines under one hood: Hybrid


To combine the best of both worlds and support analytical as well as transactional workloads in one system, HANA combines the two different types of database systems under one umbrella and thereby enables row- as well as column-based data representation. Technically, HANA consists of the following components:

- TREX provides the column-based data store (the same as in BWA)
- Ptime provides the row-based data store (acquired by SAP in 2008)
- A MaxDB shadow server provides the persistence layer, with virtual data and log files for TREX, a virtual data file for Ptime, and a legacy DB approach with a transaction log

To guarantee optimal performance of a hybrid database, the access pattern has to be known in advance. Therefore typical queries are analyzed with regard to their cache miss behavior. Together with the weight of the query, this is used to determine the optimal layout - either row or column - as sketched below.
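
The text above only outlines the decision procedure; the following is a purely hypothetical C++ sketch of the idea, not the actual HANA/TREX/Ptime algorithm: each analyzed query is weighted, and its estimated cache misses under both layouts are compared to pick the cheaper representation.

    #include <vector>

    enum class Layout { Row, Column };

    // Hypothetical profile of one analyzed query.
    struct QueryProfile {
        double weight;             // relative frequency / importance of the query
        double rowCacheMisses;     // estimated misses with row-oriented storage
        double columnCacheMisses;  // estimated misses with column-oriented storage
    };

    // Weighted comparison of the two layouts over the whole analyzed workload.
    Layout chooseLayout(const std::vector<QueryProfile>& workload) {
        double rowCost = 0.0, columnCost = 0.0;
        for (const auto& q : workload) {
            rowCost    += q.weight * q.rowCacheMisses;
            columnCost += q.weight * q.columnCacheMisses;
        }
        return rowCost <= columnCost ? Layout::Row : Layout::Column;
    }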

What about real-real time?


One of the obstacles of traditional data marts and warehouses is that the extraction of data from the operational transactional business systems puts such a high load on the source systems that this activity has to be done during the night shift. As a result, any analysis done on these data marts and warehouses shows only the truth of yesterday.

With the acquisition of Sybase, SAP gained possession of the Sybase Logminer, which enables non-intrusive, near-online data extraction by reverse-engineering database redo logs, so that data can be extracted without additional load on the source system. Currently only DB2 is supported as a source system for real-real time data extraction into HANA using Logminer. For other databases, BusinessObjects tools have to be used for ETL.
