What Every Programmer Should Know About Memory

Published by: TraxNet on Jul 04, 2011
Copyright:Attribution Non-commercial


Ulrich Drepper
Red Hat, Inc.
November 21, 2007
As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever more sophisticated memory handling and acceleration techniques, such as CPU caches, but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.
1 Introduction
In the early days computers were much simpler. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and, as a result, were quite balanced in their performance. For example, the memory and network interfaces were not (much) faster than the CPU at providing data.

This situation changed once the basic structure of computers stabilized and hardware developers concentrated on optimizing individual subsystems. Suddenly the performance of some components of the computer fell significantly behind and bottlenecks developed. This was especially true for mass storage and memory subsystems which, for cost reasons, improved more slowly relative to other components.

The slowness of mass storage has mostly been dealt with using software techniques: operating systems keep most often used (and most likely to be used) data in main memory, which can be accessed at a rate orders of magnitude faster than the hard disk. Cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance.¹ For the purposes of this paper, we will not go into more details of software optimizations for the mass storage access.

Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult and almost all solutions require changes to the hardware. Today these changes mainly come in the following forms:

• RAM hardware design (speed and parallelism).
• Memory controller designs.
• CPU caches.
• Direct memory access (DMA) for devices.

For the most part, this document will deal with CPU caches and some effects of memory controller design. In the process of exploring these topics, we will explore DMA and bring it into the larger picture. However, we will start with an overview of the design for today’s commodity hardware. This is a prerequisite to understanding the problems and the limitations of efficiently using memory subsystems. We will also learn about, in some detail, the different types of RAM and illustrate why these differences still exist.

This document is in no way all inclusive and final. It is limited to commodity hardware and further limited to a subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper. For such topics, readers are recommended to find more detailed documentation.

When it comes to operating-system-specific details and solutions, the text exclusively describes Linux. At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes. If the reader thinks s/he has to use a different OS they have to go to their vendors and demand they write documents similar to this one.

One last comment before the start. The text contains a number of occurrences of the term “usually” and other, similar qualifiers. The technology discussed here exists in many, many variations in the real world and this paper only addresses the most common, mainstream versions. It is rare that absolute statements can be made about this technology, thus the qualifiers.

¹ Changes are needed, however, to guarantee data integrity when using storage device caches.

Copyright © 2007 Ulrich Drepper. All rights reserved. No redistribution allowed.
Document Structure
This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers a lot of groundwork must be laid.

To that end, the second section describes random-access memory (RAM) in technical detail. This section’s content is nice to know but not absolutely critical to be able to understand the later sections. Appropriate back references to the section are added in places where the content is required so that the anxious reader could skip most of this section at first.

The third section goes into a lot of details of CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document.

Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest.

Section 5 goes into a lot of detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central section of this paper. It brings together all the previous sections’ information and gives programmers advice on how to write code which performs well in the various situations. The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology.

Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where in a non-trivial software project the problems are. Some tools are necessary.

In section 8 we finally give an outlook of technology which can be expected in the near future or which might just simply be good to have.
Reporting Problems
The author intends to update this document for some time. This includes updates made necessary by advances in technology but also to correct mistakes. Readers willing to report problems are encouraged to send email to the author. They are asked to include exact version information in the report. The version information can be found on the last page of the document.
I would like to thank Johnray Fuller and the crew at LWN (especially Jonathan Corbet) for taking on the daunting task of transforming the author’s form of English into something more traditional. Markus Armbruster provided a lot of valuable input on problems and omissions in the text.
About this Document
The title of this paper is an homage to David Goldberg’s classic paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic” [12]. This paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.

One word on the PDF: xpdf draws some of the diagrams rather poorly. It is recommended it be viewed with evince or, if really necessary, Adobe’s programs. If you use evince be advised that hyperlinks are used extensively throughout the document even though the viewer does not indicate them like others do.

Version 1.0
2 Commodity Hardware Today
It is important to understand commodity hardware because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the “standard building blocks” for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded.² This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.

Large differences exist in the structure of computers built of commodity parts. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and Southbridge. Figure 2.1 shows this structure.
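The virtual-processor count quoted above for the standard data-center system is a product of three factors: sockets, cores per socket, and hardware threads per core. The following is a minimal sketch of that arithmetic, assuming the operating system exposes one virtual processor per hardware thread. The function name and the threads-per-core values are illustrative assumptions, not figures taken from the text.

```python
def virtual_processors(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Return the number of virtual (logical) processors the OS would see."""
    return sockets * cores_per_socket * threads_per_core


# Four sockets, quad core CPUs, two hardware threads per core (assumed value).
print(virtual_processors(4, 4, 2))  # 32

# Same machine with four hardware threads per core (assumed value).
print(virtual_processors(4, 4, 4))  # 64
```

Under these assumptions, the upper bound of 64 corresponds to four hardware threads per core; with two threads per core the same four-socket, quad core machine exposes 32 virtual processors.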
Figure 2.1: Structure with Northbridge and Southbridge

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

• All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
• All communication with RAM must pass through the Northbridge.
• The RAM has only a single port.³
• Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels as they are called for DDR2, see page 8) which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.

With limited bandwidth available, it is important for performance to schedule memory access in ways that minimize delays. As we will see, processors are much faster

² Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.

³ We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.
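The interleaving of memory access across channels described above can be illustrated with a small model. This is only a sketch under stated assumptions: the actual mapping is specific to each memory controller, and the 64-byte granularity, the modulo mapping, and the names `channel_for_address` and `channel_load` are hypothetical choices made for illustration.

```python
INTERLEAVE_GRANULARITY = 64  # bytes per interleaved block (assumed value)


def channel_for_address(addr: int, num_channels: int = 2,
                        granularity: int = INTERLEAVE_GRANULARITY) -> int:
    """Map a physical address to a channel by round-robin over fixed-size blocks."""
    return (addr // granularity) % num_channels


def channel_load(addresses, num_channels: int = 2,
                 granularity: int = INTERLEAVE_GRANULARITY) -> list:
    """Count how many of the given accesses land on each channel."""
    counts = [0] * num_channels
    for addr in addresses:
        counts[channel_for_address(addr, num_channels, granularity)] += 1
    return counts


# Eight consecutive 64-byte blocks spread evenly over both channels.
print(channel_load(range(0, 64 * 8, 64)))    # [4, 4]

# A stride of 128 bytes hits only even-numbered blocks, so one channel idles.
print(channel_load(range(0, 128 * 8, 128)))  # [8, 0]
```

In this simplified model, sequential access uses both channels equally, while an unlucky stride sends every access to the same channel and forfeits the doubled bandwidth. Real controllers may use different granularities or address bits, so the sketch shows only the principle.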