Based on excerpts from In-Memory Data Management by Hasso Plattner and Alexander Zeier (ISBN 978-3-642-19362-0) and own observations. Edited by Michael Missbach.
Intro
    Business with the Speed of Thought
    In-Memory - a No-Brainer?
Not all memory is the same
    Non-Uniform Memory Access
    Cache hierarchies
    The impact of prefetching
Between Scylla and Charybdis
    Row-oriented versus column-oriented
    Two engines under one hood: Hybrid
What about real-real time?
Intro
Business with the Speed of Thought
According to scientific studies, the average time between a stimulus and a simple human reaction is 220 milliseconds (just put your finger on a hot stove and measure the time until the pain makes you pull it back). However, the recognition time needed to really understand a situation and react with comprehension has been measured at 384 milliseconds. For more complex contexts, an interval of 550 to 750 milliseconds is assumed as the speed of thought. For repeatedly performed actions, the reaction time becomes shorter.

If the response time of a system is significantly longer than this interval, the user's mind starts to wander to other topics. This effect cannot be consciously controlled. When the system response finally pops up, the user's brain has to perform a context switch to find its way back to the topic and remember what the next step was. Such context switching is extremely tiring. A response time of less than a second is therefore the ultimate goal for increasing productivity, because it enables the user to focus on a topic without being distracted by other tasks during waiting periods.

Thanks to the dramatically improved performance of computer systems, this goal can usually be achieved for well-designed transactional systems, which use only a relatively small working set of data that can easily be kept in memory, so that only small amounts of data have to be read from and written to persistent storage devices. For business analytics, where much larger datasets typically have to be processed, the situation is different, however. The same is true for certain processes in transactional systems, for example dunning and replenishment. Even worse, such processes cause significant performance degradation when run in parallel on classical transactional systems. The classical approach to this problem was to run such processes at night and to use specialized data warehouses like SAP BW.
Such On-Line Analytical Processing (OLAP) systems store the data in special data structures known as InfoCubes, which also provide key figures derived from this data as aggregates. As a result, the key figures do not have to be calculated individually for each query. Instead, they are computed directly after the data is loaded and are therefore immediately available when a query is made. This pre-aggregated data model enables reports to execute considerably faster in SAP BI than in an OLTP system.
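The idea of computing key figures once at load time rather than per query can be sketched in a few lines of Python. The fact records and the "revenue per region" key figure are invented for illustration; real InfoCubes of course use far richer structures:

```python
from collections import defaultdict

# Hypothetical fact data: (region, product, revenue) line items,
# standing in for the detail records loaded into an InfoCube.
facts = [
    ("EMEA", "A", 100.0),
    ("EMEA", "B", 250.0),
    ("APJ",  "A", 300.0),
    ("APJ",  "B", 150.0),
]

def build_aggregates(rows):
    """Compute a key figure once, at load time, like an InfoCube aggregate."""
    totals = defaultdict(float)
    for region, _product, revenue in rows:
        totals[region] += revenue
    return dict(totals)

# Built once after loading; every later query is a dictionary lookup
# instead of a scan over all line items.
aggregates = build_aggregates(facts)
print(aggregates["EMEA"])  # 350.0
```

The price of this speed is exactly the inflexibility described below: the aggregate answers only the question it was built for, and any new question requires rebuilding.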
However, such fast responses apply only to the pre-calculated reports. Any change to a report demands a cumbersome recreation of the InfoCubes. Even worse, the operational data has to be transferred into these warehouses via batch jobs, mostly after business hours, to avoid longer response times on the transactional systems. As a consequence, business decisions are made on the truth of yesterday, and flexible ad-hoc reporting on up-to-date data is almost impossible.

A technology enabling fast ad-hoc analysis would make such predefined reports, and the cubes that are so cumbersome to build and maintain, obsolete. Even better, such a technology would remove the need for separate analytical systems, allowing analysis of real-time data in a transactional business system again. And best of all, users would be able to interact with their business application just as they interact with a web search engine, refining search results on the fly when the initial results are not satisfying.

Experience with common web search engines seems to demonstrate that they are able to analyze massive amounts of data in real time. Whatever query you enter, the answer appears almost instantly. However, there is a major difference between web search and enterprise applications with regard to the completeness of results. A web search engine does not have to give complete answers; Google can be so astonishingly fast because the results of its searches just have to be good enough for the common user. For this purpose it only has to scan through an indexed set of data. The generation of this index, however, is itself a time-consuming process that needs massive computing power and is anything but real time. In contrast, enterprise applications have to consider all data relevant for a business report; no taxation authority on the planet will accept tax payments based on anything less than a complete scan through each and every account number.
In-Memory - a No-Brainer?
Given that access to data in a computer's main memory is orders of magnitude faster than access to data stored on disk, the concept of in-memory computing seems to be a no-brainer. SAP followed this approach more than a decade ago with the APO liveCache, literally a MaxDB (aka SAP DB, aka ADABAS D) running completely in main memory. Simply enlarging the main memory until it can hold the complete dataset of any application seems to be a straightforward strategy, since large amounts of main memory have become affordable. With the majority of databases used for SAP business applications being in the range of 1 to 3 TB, and considering the advanced compression features of state-of-the-art database systems, it should be easy to hold the complete working set of an on-line transactional processing application in the SGA, as in the standard SD benchmark. Such an approach, however, is still not sufficient to achieve the necessary performance for ad-hoc analysis, where the complete content of such datasets has to be scanned. To enable business users to distil useful information from raw data within the blink of an eye, a deep understanding is necessary of the way data is organized not only in main memory but also in the CPU and the different intermediate caches.
Consequently, large main memory alone is of little help if access is throttled by bottlenecks on the path down to the processor.
Cache hierarchies
Even if main memory is several times faster than disk, it is still not as fast as the processor itself. Typical DDR3 memory runs with clock speeds between 800 and 1333 MHz, while current Intel Westmere CPUs are rated for up to 3.6 GHz. Therefore all current CPU designs deploy caching to decrease the latency of repeated access to the same piece of data. Actually, most of the billions of transistors on a CPU chip are used to cache data before it reaches the registers of the computing cores. Physically, it takes many more transistors to build very fast caches with high bandwidth and low latency than to build slower caches with higher latency. Because the available number of transistors is still limited, CPU makers have established a hierarchy of caches with different latencies and bandwidths. Data is transmitted through the following layers:

From main memory via the memory channel to a level 3 cache shared among all cores of the CPU
From the level 3 cache via the CPU-internal bus to the level 2 cache that is unique per CPU core
From the level 2 cache via the core bus to the level 1 cache of the CPU core
From the level 1 cache via the core bus to the registers of the CPU

Even if transistors do not need the seek time that the mechanical inertia of the disk arm imposes on a hard disk, it still takes some time to decode a memory address and connect the transistors that contain the requested piece of data to the bus. For cache memories, latency is also caused by the process of determining whether and where the requested data is stored in a given block. Lower levels are faster but smaller. In current Intel Xeon CPUs, the 64 KB low-latency level 1 cache and the 256 KB level 2 cache both run at the full core clock speed. But even though they run with the same clock as the core, merging them into one cache would have negative effects on bandwidth and latency and would also increase the transistor footprint.
The 30 MB level 3 cache shared by all cores runs at the uncore clock speed, somewhat more than half the core clock, as do the memory controller and the QPI links. To be efficient, the design of an in-memory system has to take the different sizes and latencies of these caches into account. Caches are organized on the basis of fixed-size logical cache lines rather than per byte. In the commonly used inclusive cache hierarchy, the data cached on a lower layer is also included in the higher-level caches. To load a value from main memory, the data has to be transmitted through all intermediate caches in turn. Accessing main memory can consume up to 80 times the number of CPU cycles of an access to the level 1 cache.

In an ideal world, the data requested by the processor would always be available in the cache. In the real world, however, so-called cache misses happen, that is, the necessary piece of data is not currently available at a certain cache level. The worst case is a full miss, when a requested piece of data is present only in main memory. Avoiding cache misses has a major impact on processing performance, because a complete cache line has to be invalidated to free up the space for the newly fetched data, wasting all the effort previously spent filling that line. Given the relatively small size of the level 1 cache, it is obvious that data should be organized in such a way that items with a high likelihood of being used together in a given calculation are grouped together. If data is organized the wrong way, the cache-line-based access pattern leads to reading a lot of data that is never used for the calculation at all. Even the fastest data transfer is futile if it transfers the wrong data. Therefore data structures have to be optimized to maximize the likelihood that all the data necessary for the next computing step is gathered in the same cache line. The target is to align access patterns so that data can be read sequentially and random lookups are avoided.
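The effect of access patterns can be made visible with a small, self-contained Python sketch. The array size is an arbitrary assumption, and absolute timings depend on the machine; in CPython the interpreter overhead dampens the effect, so only the relative difference between the two walks is of interest:

```python
import array
import random
import time

N = 1_000_000
data = array.array("q", range(N))  # one contiguous block of 64-bit integers

seq_order = list(range(N))   # walks the array cache line by cache line
rand_order = seq_order[:]    # the same indices, visited in shuffled order
random.shuffle(rand_order)

def scan(order):
    total = 0
    for i in order:
        total += data[i]
    return total

t0 = time.perf_counter(); s1 = scan(seq_order);  t_seq  = time.perf_counter() - t0
t0 = time.perf_counter(); s2 = scan(rand_order); t_rand = time.perf_counter() - t0

# Both scans read exactly the same values; but the sequential walk lets
# every fetched cache line contribute all of its elements, while the
# random walk wastes most of each line it pulls in.
assert s1 == s2
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")
```

On typical hardware the random walk comes out measurably slower, even though the amount of useful data read is identical, which is precisely the "transferring the wrong data" effect described above.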
how much of them are bought on average, or the total sum per order. In contrast to a typical business process, only a small number of attributes in a table is of interest for a particular query in a typical analysis. Loading every row into the cache when only a fraction of the data is really used is clearly not optimal for On-Line Analytical Processing (OLAP) systems, even if they run completely in-memory. Organizing tables so that columns are stored in adjacent blocks makes it possible to move only the required columns into cache lines while the rest of the table is ignored. This way, the cache has to keep only the data needed to process the request, significantly reducing the data traffic from main memory to the CPUs, between CPUs, and down through the whole cache hierarchy. Maximizing the likelihood that the necessary data can be found in the level 1 cache obviously speeds up processing and minimizes the response time. Analysis of database accesses in enterprise warehouse applications, as well as practical experience, demonstrates that column-oriented solutions like SAP BWA and Sybase IQ are an excellent choice for on-line analytical systems. The obvious disadvantage of these systems is that their performance with row-based transactions is poor. So what is good for business transactions is bad for reports, and vice versa. For many years the only answer to this dilemma was to deploy two sets of applications optimized either for OLTP or for OLAP, which not only doubles the amount of data to be stored, and subsequently the hardware and operation costs, but also creates the need to synchronize the data between the different systems.
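The difference in data touched by the two layouts can be illustrated with a minimal row-store versus column-store sketch. The order table and its attributes are invented for illustration:

```python
# The same order data in two physical layouts.

# Row-oriented: each record keeps all its attributes together,
# good for inserting or reading a complete business document.
rows = [
    {"order_id": 1, "customer": "C1", "product": "A", "amount": 100.0},
    {"order_id": 2, "customer": "C2", "product": "B", "amount": 250.0},
    {"order_id": 3, "customer": "C1", "product": "A", "amount": 300.0},
]

# Column-oriented: each attribute is stored contiguously, so an
# analytical query touches only the columns it actually needs.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["C1", "C2", "C1"],
    "product":  ["A", "B", "A"],
    "amount":   [100.0, 250.0, 300.0],
}

# "Total amount over all orders": the row store must walk every complete
# record, while the column store scans a single contiguous list.
total_row_store = sum(r["amount"] for r in rows)
total_col_store = sum(columns["amount"])
assert total_row_store == total_col_store == 650.0
```

Both layouts of course yield the same answer; the point is that in the column layout the unused attributes never enter the cache hierarchy, while inserting a new order into the column layout requires touching every column list, which is exactly the row-based transaction weakness noted above.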