Linux and symmetric multiprocessing
March 14, 2007

You can increase the performance of a Linux system in various ways, and one of the most popular methods is increasing the performance of the processor. An obvious solution is to use a processor with a faster clock rate, but for any given technology there exists a physical limit where the clock simply can't go any faster. When you reach that limit, you can use the more-is-better approach and apply multiple processors. Unfortunately, performance doesn't scale linearly with the aggregate performance of the individual processors.

Before discussing the application of multiprocessing in Linux, let's take a quick look back at the history of multiprocessing.

History of multiprocessing

Multiprocessing originated in the mid-1950s at a number of companies, some you know and some you might not remember (IBM, Digital Equipment Corporation, Control Data Corporation). In the early 1960s, Burroughs Corporation introduced a symmetrical MIMD multiprocessor with four CPUs and up to sixteen memory modules connected via a crossbar switch (the first SMP architecture). The popular and successful CDC 6600 was introduced in 1964 and provided a CPU with ten subprocessors (peripheral processing units). In the late 1960s, Honeywell delivered the first Multics system, another symmetrical multiprocessing system of eight CPUs.

While multiprocessing systems were being developed, technologies also advanced the ability to shrink the processors and operate at much higher clock rates. In the 1980s, companies like Cray Research introduced multiprocessor systems and UNIX®-like operating systems that could take advantage of them (CX-OS).

The late 1980s, with the popularity of uniprocessor personal computer systems such as the IBM PC, saw a decline in multiprocessing systems. But now, twenty years later, multiprocessing has returned to these same personal computer systems through symmetric multiprocessing.

Amdahl's law

Gene Amdahl, a computer architect and IBM fellow, developed computer architectures at IBM, his namesake venture, Amdahl Corporation, and others. But he is most famous for his law that predicts the maximum expected system improvement when a portion of the system is improved. This is used predominantly to calculate the maximum theoretical performance improvement when using multiple processors (see Figure 1).

Figure 1. Amdahl's law for processor parallelization

Using the equation shown in Figure 1, you can calculate the maximum performance improvement of a system using N processors and a factor F that specifies the portion of the system that cannot be parallelized (the portion of the system that is sequential in nature). The result is shown in Figure 2.
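Written out with the symbols N and F defined above, Amdahl's law is commonly stated as follows. The name S(N) for the speedup is a label chosen here for illustration.

```latex
S(N) = \frac{1}{F + \dfrac{1 - F}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{F}
```

As a worked case, a problem that is 10% sequential (F = 0.1) gives S(5) = 1 / (0.1 + 0.9/5) = 1/0.28 ≈ 3.57 and S(10) = 1 / (0.1 + 0.9/10) = 1/0.19 ≈ 5.26. No number of processors can push the speedup for that problem past 1/F = 10.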

Figure 2. Amdahl's law for up to ten CPUs

In Figure 2, the top line shows the number of processors. Ideally, this is what you'd like to see when you add additional processors to solve a problem. Unfortunately, because not all of the problem can be parallelized and there's overhead in managing the processors, the speedup is quite a bit less. At the bottom (purple line) is the case of a problem that is 90% sequential. In the best case for this graph, the brown line shows a problem that's 10% sequential and, therefore, 90% parallelizable. Even in this case, ten processors perform only slightly better than five.

Multiprocessing and the PC

An SMP architecture is simply one where two or more identical processors connect to one another through a shared memory. Each processor has equal access to the shared memory (the same access latency to the memory space). Contrast this with the Non-Uniform Memory Access (NUMA) architecture. For example, each processor has its own memory but also access to shared memory with a different access latency.

Loosely-coupled multiprocessing

The earliest Linux SMP systems were loosely-coupled multiprocessor systems. These are constructed from multiple standalone systems connected by a high-speed interconnect (such as 10G Ethernet, Fibre Channel, or Infiniband). This type of architecture is also called a cluster (see Figure 3), for which the Linux Beowulf project remains a popular solution. Linux Beowulf clusters can be built from commodity hardware and a typical networking interconnect such as Ethernet.

Figure 3. Loosely-coupled multiprocessing architecture

Building loosely-coupled multiprocessor architectures is easy (thanks to projects like Beowulf), but they have their limitations. Building a large multiprocessor network can take considerable space and power. As they're commonly built from commodity hardware, they include hardware that isn't relevant but consumes power and space. The bigger drawback is the communications fabric. Even with high-speed networks such as 10G Ethernet, there are limits to the scalability of the system.

Tightly-coupled multiprocessing

Tightly-coupled multiprocessing refers to chip-level multiprocessing (CMP). Think about the loosely-coupled architecture being scaled down to the chip level. That's the idea behind tightly-coupled multiprocessing (also called multi-core computing). On a single integrated circuit, multiple CPUs, shared memory, and an interconnect form a tightly integrated core for multiprocessing (see Figure 4).

Figure 4. Tightly-coupled multiprocessing architecture

In a CMP, multiple CPUs are connected via a shared bus to a shared memory (level 2 cache). Each processor also has its own fast memory (a level 1 cache). The tightly-coupled nature of the CMP allows very short physical distances between processors and memory and, therefore, minimal memory access latency and higher performance. This type of architecture works well in multithreaded applications where threads can be distributed across the processors to operate in parallel. This is known as thread-level parallelism (TLP).

Given the popularity of this multiprocessor architecture, many vendors produce CMP devices. Table 1 lists some of the popular variants with Linux support.

Table 1. Sampling of CMP devices

    Vendor   Device          Description
    IBM      POWER4          SMP, dual CPU
    IBM      POWER5          SMP, dual CPU, four simultaneous threads
    AMD      AMD X2          SMP, dual CPU
    Intel®   Xeon            SMP, dual or quad CPU
    Intel    Core2 Duo       SMP, dual CPU
    ARM      MPCore          SMP, up to four CPUs
    IBM      Xenon           SMP, three Power PC CPUs
    IBM      Cell Processor  Asymmetric multiprocessing (ASMP), nine CPUs

Kernel configuration

To make use of SMP with Linux on SMP-capable hardware, the kernel must be properly configured. The CONFIG_SMP option must be enabled during kernel configuration to make the kernel SMP aware. With an SMP-aware kernel running on a multi-CPU host, you can identify the number of processors and their type using the proc filesystem.

First, you retrieve the number of processors from the cpuinfo file in /proc using grep. As shown in Listing 1, you use the count option (-c) for lines that begin with the word processor. The content of the cpuinfo file is then presented. The example shown is from a two-chip Xeon motherboard.

Listing 1. Using the proc filesystem to retrieve CPU information

    mtj@camus:~$ grep -c ^processor /proc/cpuinfo
    8
    mtj@camus:~$ cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 6
    model name      : Intel(R) Xeon(TM) CPU 3.73GHz
    stepping        : 4
    cpu MHz         : 3724.219
    cache size      : 2048 KB
    physical id     : 0
    siblings        : 4
    core id         : 0
    cpu cores       : 2
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 6
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est cid xtpr
    bogomips        : 7389.18
    ...
    processor       : 7
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 6
    model name      : Intel(R) Xeon(TM) CPU 3.73GHz
    stepping        : 4
    cpu MHz         : 3724.219
    cache size      : 2048 KB
    physical id     : 1
    siblings        : 4
    core id         : 3
    cpu cores       : 2
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 6
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est cid xtpr
    bogomips        : 7438.33
    mtj@camus:~$

SMP and the Linux kernel

In the early days of Linux 2.0, SMP support consisted of a "big lock" that serialized access across the system. Advances for support of SMP slowly migrated in, but it wasn't until the 2.6 kernel that the power of SMP was finally revealed.

The 2.6 kernel introduced the new O(1) scheduler that included better support for SMP systems. The key was the ability to load balance work across the available CPUs while maintaining some affinity for cache efficiency. For cache efficiency, recall from Figure 4 that when a task is associated with a single CPU, moving it to another CPU requires the cache to be flushed for the task. This increases the latency of the task's memory access until its data is in the cache of the new CPU.

The 2.6 kernel maintains a pair of runqueues for each processor (the expired and active runqueues). Each runqueue supports 140 priorities, with the top 100 used for real-time tasks, and the bottom 40 for user tasks. Tasks are given time slices for execution, and when they use their allocation of time slice, they're moved from the active runqueue to the expired runqueue. This provides fair access for all tasks to the CPU (and locking only on a per CPU basis).

With a task queue per CPU, work can be balanced given the measured load of all CPUs in the system. Every 200 milliseconds, the scheduler performs load balancing to redistribute the task loading to maintain a balance across the processor complex. For more information on the Linux 2.6 scheduler, see the Resources section.

User space threads: Exploiting the power of SMP

A lot of great work has gone into the Linux kernel to exploit SMP, but the operating system by itself is not enough. Recall that the power of SMP lies in TLP. Single monolithic (non-threaded) programs can't exploit SMP, but SMP can be exploited in programs that are composed of many threads that can be distributed across the cores. While one thread is delayed awaiting completion of an I/O, another thread is able to do useful work. In this way, the threads work with one another to hide each other's latency.

Portable Operating System Interface (POSIX) threads are a great way to build threaded applications that are able to take advantage of SMP. POSIX threads provide the threading mechanism as well as shared memory. When a program is invoked that creates some number of threads, each thread is provided its own stack (local variables and state) but shares the data space of the parent. All threads created share this same data space, but this is where the problem lies.

To support multi-threaded access to shared memory, coordination mechanisms are necessary. POSIX provides the mutex to create critical sections that enforce exclusive access to an object (a piece of memory) by a single thread. Not doing so can lead to corrupted memory due to unsynchronized manipulation by multiple threads. Listing 2 illustrates creating a critical section with a POSIX mutex.

Listing 2. Using pthread_mutex_lock and unlock to create critical sections

    pthread_mutex_t crit_section_mutex = PTHREAD_MUTEX_INITIALIZER;
    ...
    pthread_mutex_lock( &crit_section_mutex );

    /* Inside the critical section. Memory access is safe here
     * for the memory protected by the crit_section_mutex.
     */

    pthread_mutex_unlock( &crit_section_mutex );

If multiple threads attempt to lock the mutex after the initial call above, they block and their requests are queued until the pthread_mutex_unlock call is performed.

Kernel variable protection for SMP

When multiple cores in a processor work concurrently for the kernel, it's desirable to avoid sharing data that's specific to a given core. For this reason, the 2.6 kernel introduced the concept of per-CPU variables that are associated with a single CPU. This permits the declaration of variables for a CPU that are most commonly accessed by that CPU, which minimizes the locking requirements and improves performance.

Defining a per-CPU variable is done with the DEFINE_PER_CPU macro, to which you provide a type and variable name. Since the macro behaves like an l-value, you can also initialize it there. The following example (from ./arch/i386/kernel/smpboot.c) defines a variable to represent the state for each CPU in the system.

    /* State of each CPU. */
    DEFINE_PER_CPU(int, cpu_state) = { 0 };

The DEFINE_PER_CPU macro creates an array of variables, one per CPU instance. To access the per-CPU variable, the per_cpu macro is used along with smp_processor_id, a function that returns the identifier of the CPU on which the code is currently executing.

    per_cpu( cpu_state, smp_processor_id() ) = CPU_ONLINE;

The kernel provides other functions for per-CPU locking and dynamic allocation of variables. You can find these functions in ./include/linux/percpu.h.

Summary

As processor frequencies reach their limits, a popular way to increase performance is simply to add more processors. In the early days, this meant adding more processors to the motherboard or clustering multiple independent computers together. Today, chip-level multiprocessing provides more CPUs on a single chip, permitting even greater performance due to reduced memory latency.

You'll find SMP systems not only in servers, but also desktops, particularly with the introduction of virtualization. As with most cutting-edge technologies, Linux provides support for SMP. The kernel does its part to optimize the load across the available CPUs (from threads to virtualized operating systems). All that's left is to ensure that the application can be sufficiently multi-threaded to exploit the power in SMP.

Resources

Learn

"Inside the Linux scheduler" (developerWorks, June 2006) details the new Linux scheduler introduced in the 2.6 kernel.
"Basic use of Pthreads" (developerWorks, January 2004) introduces Pthread programming with Linux.
"Access the Linux kernel using the /proc filesystem" (developerWorks, March 2006) introduces the /proc filesystem, including how to build your own kernel module to provide a /proc filesystem file.
In "The History of Parallel Processing" (1998), Mark Pacifico and Mike Merrill provide a short but interesting history of five decades of multiprocessing.
Flynn's original taxonomy defined what was possible for multiprocessing architectures. His paper was entitled "Some Computer Organizations and Their Effectiveness." This paper was published in the IEEE Transactions on Computers, Vol. C-21, 1972. Wikipedia provides a great summary of the four classifications.
The ARM11 MPCore is a synthesizable processor that implements up to four ARM11 CPUs for an aggregate 2600 Dhrystone million instructions per second (MIPS) performance.
The IBM POWER4 and POWER5 architectures provide symmetric multiprocessing. The POWER5 also provides simultaneous multithreading (SMT) for even greater performance.
The Cell processor is an interesting architecture for asymmetric multiprocessing. The Sony Playstation 3, which utilizes the Cell, clearly shows how powerful this processor can be.
IBM provides clustering technologies in High-Availability Cluster Multiprocessing (HACMP). In addition to multiprocessing through clustering, HACMP also provides higher reliability through complete online system monitoring.
The Power Architecture technology zone offers more technical resources devoted to IBM's semiconductor technology.
The Beowulf cluster is a great way to consolidate commodity Linux servers to build a high-performance system.

Standards such as HyperTransport, RapidIO, and the upcoming Common System Interconnect provide efficient chip-to-chip interconnects for next-generation systems.
In the developerWorks Linux zone, find more resources for Linux developers.
Stay current with developerWorks technical events and Webcasts.

Get products and technologies

With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.

About the author

M. Tim Jones is an embedded software architect and the author of GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.

Original URL: http://www.[...]/developerworks/library/l-linux-smp/