Article in Computer · August 2005 · DOI: 10.1109/MC.2005.239 · Source: IEEE Xplore


COVER FEATURE

Parallelism and the ARM Instruction Set Architecture
Leveraging parallelism on several levels, ARM’s new chip designs could
change how people access technology. With sales growing rapidly and
more than 1.5 billion ARM processors already sold each year, software
writers now have a huge range of markets in which their ARM code can
be used.

John Goodacre
Andrew N. Sloss
ARM

Over the past 15 years, the ARM reduced-instruction-set computing (RISC) processor has evolved to offer a family of chips that range up to a full-blown multiprocessor. Embedded applications' demand for increasing levels of performance and the added efficiency of key new technologies have driven the ARM architecture's evolution.

Throughout this evolutionary path, the ARM team has used a full range of techniques known to computer architecture for exploiting parallelism. The performance and efficiency methods that ARM uses include variable execution time, subword parallelism, digital signal processor-like operations, thread-level parallelism and exception handling, and multiprocessing.

The developmental history of the ARM architecture shows how processors have used different types of parallelism over time. This development has culminated in the new ARM11 MPCore multiprocessor.

RISC FOR EMBEDDED APPLICATIONS

Early RISC designs such as MIPS focused purely on high performance. Architects achieved this with a relatively large register set, a reduced number of instruction classes, a load-store architecture, and a simple pipeline. All these now fairly common concepts can be found in many of today's modern processors.

The ARM version of RISC differed in many ways, partly because the ARM processor became an embedded processor designed to be located within a system-on-chip device.1 Although this kept the main design goal focused on performance, developers still gave priority to high code density, low power, and small die size.

To achieve this design, the ARM team changed the RISC rules to include variable-cycle execution for certain instructions, an inline barrel shifter to preprocess one of the input registers, conditional execution, a compressed 16-bit Thumb instruction set, and some enhanced DSP instructions.

• Variable cycle execution. Because it is a load-store architecture, the ARM processor must first load data into one of the general-purpose registers before processing it. Given the single-cycle constraint the original RISC design imposed, loading and storing each register individually would be inefficient. Thus, the ARM ISA includes instructions that load and store multiple registers. These instructions take a variable number of cycles to execute, depending on the number of registers the processor is transferring. This is particularly useful for saving and restoring context in a procedure's prologue and epilogue, and it directly improves code density, reduces instruction fetches, and reduces overall power consumption.

• Inline barrel shifter. To make each data processing instruction more flexible, either a shift or a rotation can preprocess one of the source registers.
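The barrel shifter's effect can be sketched in C; a compiler targeting ARM folds the shift into the data-processing instruction itself. This is a hedged illustration (add_shifted and times5 are illustrative names, not ARM intrinsics):

```c
#include <stdint.h>

/* On ARM, a data-processing instruction can shift or rotate one source
   register for free, so a + (b << n) becomes a single
   ADD r0, r1, r2, LSL #n rather than a shift followed by an add. */
static inline uint32_t add_shifted(uint32_t a, uint32_t b, unsigned n)
{
    return a + (b << n);
}

/* The classic idiom compilers emit for ARM: x * 5 == x + (x << 2),
   one ADD with a shifted operand instead of a multiply. */
static inline uint32_t times5(uint32_t x)
{
    return add_shifted(x, x, 2);
}
```
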

42 Computer Published by the IEEE Computer Society 0018-9162/05/$20.00 © 2005 IEEE


• Conditional execution. An ARM instruction executes only when it satisfies a particular condition. The condition is placed at the end of the instruction mnemonic and, by default, is set to always execute. For the greatest common divisor algorithm, for example, the version implemented with conditional execution is 12 bytes (42 percent) smaller than the version without it.

• 16-bit Thumb instruction set. The condensed 16-bit version of the ARM instruction set allows higher code density at a slight performance cost. Because the Thumb 16-bit ISA is designed as a compiler target, it does not include the orthogonal register access of the ARM 32-bit ISA. Even so, using the Thumb ISA can achieve a significant reduction in program size. In 2003, ARM announced its Thumb-2 technology, which extends code density further by mixing both 32- and 16-bit instructions in the same instruction stream. To achieve this, the developers incorporated unaligned address accesses into the processor design.

• Enhanced DSP instructions. Adding these instructions to the standard ISA supports flexible and fast 16 × 16 multiply and arithmetic saturation, which lets DSP-specific routines migrate to ARM. A single ARM processor could execute applications such as voice-over-IP without requiring a separate DSP. One example of these instructions, SMLAxy, multiplies the top or bottom 16 bits of a 32-bit register: the processor could multiply the top 16 bits of register r1 by the bottom 16 bits of register r2 and add the result to register r3.

Figure 1 shows how saturation can affect the result of an ADD instruction.2

   Nonsaturated (ISA v4T)      Saturated (ISA v5TE)
   PRECONDITION                PRECONDITION
     r0=0x00000000               r0=0x00000000
     r1=0x70000000               r1=0x70000000
     r2=0x7fffffff               r2=0x7fffffff
   ADDS r0,r1,r2               QADD r0,r1,r2
   POSTCONDITION               POSTCONDITION
     result is negative          result is positive
     r0=0xefffffff               r0=0x7fffffff

Figure 1. Nonsaturated and saturated addition. Saturation is particularly useful for digital signal processing because nonsaturated addition would wrap around when the integer value overflowed, giving a negative result.
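The QADD behavior in Figure 1 can be modeled portably in C. This is a sketch of the semantics, not the hardware implementation (qadd32 is an illustrative name):

```c
#include <stdint.h>

/* Model of the ARMv5TE QADD instruction: signed 32-bit addition that
   clamps to INT32_MAX or INT32_MIN instead of wrapping around. */
static int32_t qadd32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* widen to detect overflow */
    if (sum > INT32_MAX) return INT32_MAX;  /* saturate high */
    if (sum < INT32_MIN) return INT32_MIN;  /* saturate low */
    return (int32_t)sum;
}
```

With the Figure 1 preconditions, qadd32(0x70000000, 0x7fffffff) saturates to 0x7fffffff, whereas a plain wrapping add would give the negative 0xefffffff.
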

A Short History of ARM

Formed in 1990 as a joint venture with Acorn Computers, Apple Computer, and VLSI Technology (which later became part of Philips Semiconductors), ARM started with only 12 employees and adopted a unique licensing business model for its processor designs. By licensing rather than manufacturing and selling its chip technology, ARM established a new business model that has redefined the way industry designs, produces, and sells microprocessors. Figure A shows how the ARM product family has evolved.

The first ARM-powered products were the Acorn Archimedes desktop computer and the Apple Newton PDA. The ARM processor was designed originally as a 32-bit replacement for the MOS Technology 6502 processor that Acorn Computers used in a range of desktops designed for the British Broadcasting Corporation. When Acorn set out to develop this new replacement processor, the academic community had already begun considering the RISC architecture, and Acorn decided to adopt RISC for the ARM processor. ARM's developers originally tailored the ARM instruction set architecture to efficiently execute Acorn's BASIC interpreter, which was at the time very popular in the European education market.

The ARM1, ARM2, and ARM3 processors were developed by Acorn Computers. In 1985, Acorn received production quantities of the first ARM1 processor, which it incorporated into a product called the ARM Second Processor; this attached to the BBC Microcomputer through a parallel communication port called "the tube." These devices were sold mainly as a research tool. ARM1 lacked the common multiply and divide instructions, which users had to synthesize using a combination of data processing instructions. The condition flags and the program counter were combined, limiting the effective addressing range to 26 bits. ARM ISA version 4 removed this limitation.

Figure A. Technology evolution of the ARM product family, 1990-2005: from the ARM6 core, through the ARM7 family with Thumb technology, the StrongARM processor, the ARM9 and ARM10 families, Jazelle and TrustZone technologies, the SecurCore and ARM11 families, and MPCore and Thumb-2 technology, to NEON technology, OptimoDE data engines, Cortex, and Physical IP.

July 2005 43
Figure 2. SIMD versus non-SIMD power consumption. The lightweight ARM implementation of SIMD reduces gate count, hence significantly reducing die size, power, and complexity. The chart plots improvement per MHz (1.0 to 2.0) of the ARM11 versus the ARM9E and of the ARM11 versus the ARM7 for codecs including MPEG4 SP encode/decode, H.263 and H.264 baseline encode/decode, JPEG and MJPEG encode/decode (1024x768, 10:1), MP3 decode, and MPEG4 AAC decode.
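The lane splitting behind these SIMD gains can be modeled in C. This sketch mimics the ARMv6 UADD8 instruction, which treats one 32-bit word as four independent unsigned 8-bit lanes (uadd8_model is an illustrative name, not an intrinsic):

```c
#include <stdint.h>

/* Model of ARMv6 UADD8-style lane arithmetic: one 32-bit word holds
   four unsigned 8-bit values, and addition proceeds lane by lane with
   each lane wrapping independently (no carry between lanes). */
static uint32_t uadd8_model(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 8u * lane;
        uint8_t sum = (uint8_t)((a >> shift) + (b >> shift)); /* per-lane wrap */
        result |= (uint32_t)sum << shift;
    }
    return result;
}
```
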

Saturation is particularly useful for digital signal processing because nonsaturated arithmetic would wrap around when the integer value overflowed, giving a negative result. A saturated QADD instruction returns the maximum value without wrapping around.

DATA-LEVEL PARALLELISM

Following the success of the enhanced DSP instructions introduced in the v5TE ISA, ARM introduced the ARMv6 ISA in 2001. In addition to improving both data- and thread-level parallelism, other goals for this design included enhanced mathematical operations, exception handling, and endianness handling.

An important factor influencing the ARMv6 ISA design involved increasing DSP-like functionality for overall video handling and 2D and 3D graphics. The design had to achieve this improved functionality while still maintaining very low power consumption. ARM identified the single-instruction, multiple-data (SIMD) architecture as the means for accomplishing this.

SIMD is a popular technique for providing data-level parallelism without compromising code density and power. A SIMD implementation requires relatively few instructions to perform complex calculations with minimum memory accesses.

Due to a careful balancing of computational efficiency and low power, ARM's SIMD implementation involved splitting the standard 32-bit data path into four 8-bit or two 16-bit slices. This differs from many other implementations, which require additional specialized data paths for SIMD operations.

Figure 2 shows the improvement per MHz that various codecs achieve when using the ARMv6 SIMD instructions introduced in the ARM11 processor. The lightweight ARM implementation of SIMD reduces gate count, hence significantly reducing die size, power, and complexity. Further, all the SIMD instructions execute conditionally.

To improve handling of video compression systems such as MPEG and H.263, ARM also introduced the sum of absolute differences (SAD) concept, another form of DLP instruction. Motion estimation compares two blocks of pixels, R(i,j) and C(i,j), by computing

   SAD = Σ(i,j) | R(i,j) − C(i,j) |

Smaller SAD values imply more similar blocks. Because motion estimation performs many SAD tests with different relative positions of the R and C blocks, video compression systems require very fast and energy-efficient implementations of the sum-of-absolute-differences operation.

The instructions USAD8 and USADA8 can compute the absolute difference between 8-bit values. This is particularly useful for motion-video-compression and motion-estimation algorithms.

THREAD-LEVEL PARALLELISM

We can view threads as processes, each with its own program counter and register set, that have the advantage of sharing a common memory space. For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity of handling multithreading on multiple processors. These requirements added inherent complexity to the interrupt handler, scheduler, and context switch.

One optimization extended the exception handling instructions to save precious cycles during the time-critical context switch. ARM achieved this by adding three new instructions to the instruction set, as Table 1 shows.
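The SAD operation described under "Data-level parallelism" reduces to a simple loop, whose 8-bit inner step is what USAD8 and USADA8 accelerate. A hedged C sketch (block_sad is an illustrative name):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a reference block R and a
   candidate block C of n 8-bit pixels: SAD = sum of |R(i) - C(i)|.
   Smaller results mean more similar blocks. */
static unsigned block_sad(const uint8_t *r, const uint8_t *c, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += (unsigned)abs((int)r[i] - (int)c[i]);
    return total;
}
```

Motion estimation calls a routine like this many times with different relative block positions, which is why a fast hardware primitive for the inner step matters.
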

Table 1. Exception handling instructions in the ARMv6 architecture.

Instruction   Description              Action
CPS           Change processor state   CPS<effect> <iflags>{,#<mode>}
                                       CPS #<mode>
                                       CPSID <iflags>
                                       CPSIE <iflags>
RFE           Return from exception    RFE<addressing_mode> Rn{!}
SRS           Save return state        SRS<addressing_mode> #<mode>{!}

Programmers can use the change processor state (CPS) instruction to alter the processor state, for example setting the current program status register to supervisor mode and disabling fast interrupt requests, as the code in Figure 3 shows. Whereas the ARMv4T ISA required four instructions to accomplish this task, the ARMv6 ISA requires only two.

Programmers can use the save return state (SRS) instruction to modify the saved program status register (SPSR) in a specific mode. Updating the SPSR in a particular mode involved many more instructions in the ARMv4T ISA than in the ARMv6 ISA. This new instruction is useful for handling context switches or preparing to return from an exception handler.

   ARMv4T ISA                        ARMv6 ISA
   ; Copy CPSR                       ; Change processor state and modify
   MRS  r3, CPSR                     ; select bits
   ; Mask mode and FIQ interrupt     CPSID f, #SVC
   BIC  r3, r3, #MASK|FIQ
   ; Set SVC mode and disable FIQ
   ORR  r3, r3, #SVC|nFIQ
   ; Update the CPSR
   MSR  CPSR_c, r3

Figure 3. ARMv6 ISA change processor state instruction compared with the ARMv4T architecture.

MULTIPROCESSOR ATOMIC INSTRUCTIONS

Earlier ARM architectures implemented semaphores with the swap instruction, which held the external bus until completion. Obviously, this was unacceptable for thread-level parallelism because one processor could hold the entire bus until completion, locking out all other processors. ARMv6 introduced two new instructions, load-exclusive (LDREX) and store-exclusive (STREX), which take advantage of an exclusive monitor in memory:

• LDREX loads a value from memory and sets the exclusive monitor to watch that location, and
• STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory and returns a value to indicate whether the data was written.

Thus, the architecture can implement semaphores that do not lock the system bus, leaving other processors and threads free to access the memory system.

The ARM11 microarchitecture was the first hardware implementation of the ARMv6 ISA. It has an eight-stage pipeline with separate parallel pipelines for the load/store and multiply/accumulate operations. With the parallel load/store unit, the ARM1136J-S processor can continue executing without waiting for slower memory, a main gating factor for processor performance.

In addition, the ARM1136J-S processor has physically tagged caches, as opposed to the virtually tagged caches of previous ARM processors, to help with thread-level parallelism. This considerably benefits context switches, especially when running large operating systems.

A virtually tagged cache must be flushed every time a context switch takes place because the cache contains old virtual-to-physical translations. In the ARM11, the memory management unit logic resides between the level 1 cache and the processor core. The reduction in cache flushing has the additional benefit of decreasing overall power consumption by reducing the external memory accesses that occur with a virtually tagged cache. The physically tagged cache increases overall performance by about 20 percent.

INSTRUCTION-LEVEL PARALLELISM

In ILP, the processor executes multiple instructions from a single instruction sequence concurrently. This form of parallelism has significant value in that it provides additional overall performance without affecting the software programming model.

Obviously, ILP puts more emphasis on the compiler, which extracts it from the source code and schedules the instructions across the superscalar core. Although potentially simplifying otherwise overly complex hardware, an excessive drive to extract ILP and achieve high performance through increased MHz has increased hardware complexity and cost.

ARM has remained the processor at the edge of the network for several years. This area has always seen the most rapid advancements in technology, with continuous migration away from larger computer systems and toward smaller ones. For example, technology developed for mainframes a decade or two ago is found in desktop computers today. Likewise, technologies developed for the desktop five years ago have begun appearing in consumer and network products. For example, symmetric multiprocessing (SMP) is appearing today in both desktop and embedded computers.
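The LDREX/STREX pattern from the "Multiprocessor atomic instructions" section maps naturally onto C11's compare-exchange loop, which ARM compilers typically lower to an LDREX/STREX pair. A hedged sketch (atomic_increment is an illustrative name):

```c
#include <stdatomic.h>

/* Load-exclusive/store-exclusive retry loop expressed with C11
   atomics: read the location, compute a new value, and store it only
   if no other write intervened; otherwise retry with the fresh value. */
static int atomic_increment(atomic_int *counter)
{
    int observed = atomic_load(counter);           /* LDREX: load and monitor */
    while (!atomic_compare_exchange_weak(counter,  /* STREX: store if untouched */
                                         &observed, observed + 1)) {
        /* the failed exchange refreshed observed; just retry */
    }
    return observed + 1;
}
```

Because the monitor watches the address rather than the value, no bus lock is needed while the loop runs.
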

PERFORMANCE VERSUS POWER REQUIREMENTS

For several years, the embedded processor market inherited technology matured in desktop computing as consumers demanded similar functionality in their embedded devices. The continued demand for performance at low power has, however, driven slightly different requirements and led to the overall power budget being minimized by adding multiple processors and accelerators within an embedded design. Today, the demand for high levels of general-purpose computing drives the use of SMP as the application processor in both embedded and desktop systems.

In 2004, both the embedded and desktop markets hit the cost-performance-through-MHz wall. In response, developers began embracing potential solutions that require SMP processing to avoid the following pitfalls:

• High MHz costs energy. Increasing a processor's clock rate has a quadratic effect on power consumption. Not only does doubling the MHz double the dynamic power required to switch the logic, it also requires a higher operating voltage, and dynamic power increases with the square of that voltage. Higher frequencies also add to design complexity, greatly increasing the amount of logic the processor requires.
• Extracting ILP is complex and costly. Using hardware to extract ILP significantly raises the cost in silicon area and design complexity, further increasing power consumption.
• Programming multiple independent processors is nonportable and inefficient. As developers use more processors, often with different architectures, the software complexity escalates, eliminating any portability between designs.

In mid-2004, PC manufacturers and chip makers made several announcements heralding the end of the MHz race in desktop processors and championing the use of multicore SMP processors in the server realm, primarily through the introduction of hyperthreading in the Intel Pentium processor. At that time, ARM announced its ARM11 MPCore multiprocessor core as a key solution to help address the demand for performance scalability.

Introduced alongside the ARM11 MPCore, a set of enhancements to the ARMv6 architecture provides further support for advanced SMP operating systems. In its move to support richer SMP-capable operating systems, ARM applied these enhancements, known as ARMv6K or AOS (for Advanced OS Support), across all ARMv6-architecture-based application processors to provide a firm foundation for embedded software.

The ARM11 multiprocessor also addressed the SMP system design's two main bottlenecks:

• interprocessor communication, with the integration of the new ARM Generic Interrupt Controller (GIC), and
• cache coherence, with the integration of the Snoop Control Unit (SCU), an intelligent memory-communication system.

These logic blocks deliver an efficient, hardware-coherent single-chip SMP processor that manufacturers can build cost-effectively.

PREPARATIONS FOR ARM MULTIPROCESSING

To fully realize the advantages of a multiprocessor hardware platform in general-purpose computing, ARM needed to provide a cache-coherent, symmetric software platform with a rich instruction set. ARM found that a few key enhancements to the current ARMv6 architecture could offer the significant performance boost it sought.

Enhanced atomic instructions

Developers can use the ARMv6 load-and-store exclusives to implement both swap-based and compare-and-exchange-based semaphores to control access to critical data. In the traditional server computing world of SMP there has, however, been significant software investment in optimizing SMP code using lock-free synchronization. This work has been dominated by the x86 architecture and its atomic instructions that developers can use to compare and exchange data.

Many favored using the Intel cmpxchg8b instruction in these lock-free routines because it can exchange and compare 8 bytes of data atomically. Typically, this involved 4 bytes for payload and 4 bytes to distinguish between payload versions that could otherwise have the same value, the so-called A-B-A problem.

The ARM exclusives provide atomicity using the data address rather than the data value, so the routines can atomically exchange data without experiencing the A-B-A problem. Exploiting this would, however, require rewriting much of the existing two-word exclusive code.
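The 4-byte-payload-plus-4-byte-version scheme described above can be illustrated in C. This is a non-atomic sketch of the comparison logic only (the struct and function names are hypothetical); cmpxchg8b performs the equivalent 8-byte compare and exchange atomically:

```c
#include <stdint.h>
#include <stdbool.h>

/* A payload paired with a version counter, compared together so that a
   value changed from A to B and back to A still fails the exchange:
   the version differs even though the payload matches (the A-B-A case). */
struct versioned {
    uint32_t payload;
    uint32_t version;
};

static bool cas_versioned(struct versioned *slot,
                          struct versioned expected, uint32_t new_payload)
{
    if (slot->payload != expected.payload ||
        slot->version != expected.version)
        return false;                          /* an intervening write happened */
    slot->payload = new_payload;
    slot->version = expected.version + 1;      /* every write bumps the version */
    return true;
}
```

The ARM exclusives make the version word unnecessary because the exclusive monitor tracks the address, not the value.
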

Consequently, ARM added instructions for performing load-and-store exclusives using various payload sizes, including 8 bytes, thus ensuring the direct portability of existing multithreaded code.

Improved access to localized data

When an OS encounters the increasing number of threaded applications typical of SMP platforms, it must consider the performance overheads of associating thread-specific state with the currently executing thread. This can involve, for example, knowing which CPU a thread is executing on, accessing kernel structures specific to a thread, and enabling thread access to local storage. The AOS enhancements add registers that help with these SMP performance aspects.

CPU number. Using the standard ARM system coprocessor interface, software on a processor can execute a simple, nonmemory-accessing instruction to identify the processor on which it executes. Developers use this as an index into kernel structures.

Context registers. SMP operating systems handle two key demands from the kernel when providing access to thread-specific data. The ARMv6K architecture extensions define three additional system coprocessor registers that the OS can manage for whatever purpose it sees fit. Each register has a different access level:

• user and privileged read/write accessible;
• read-only in user mode, read/write privileged accessible; and
• privileged-only read/write accessible.

The exact use of these registers is OS-specific. In the Linux kernel and GNU toolchain, the ARM application binary interface has assigned these registers to enable thread-local storage (TLS). A thread can use TLS to rapidly access thread-specific memory without losing any of the general-purpose registers.

To support TLS in C and C++, the new keyword __thread has been defined for use in defining and declaring a variable. Although not an official extension of the language, the keyword has gained support from many compiler writers. Variables defined and declared this way are automatically allocated locally to each thread:

   __thread int i;
   __thread struct state s;
   extern __thread char *p;

Supporting TLS is a key requirement for the new Native Posix Thread Library (NPTL) released as part of the Linux 2.6 kernel. This Posix thread library provides significant performance improvements over the old Linux pthread library.

Power-conscious spin-locks

Another SMP system cost involves the synchronization overhead required when processors must access shared data. At the lowest abstraction level in most SMP synchronization mechanisms, a spin-lock software technique uses a value in memory as a lock. If the memory location contains some predefined value, the OS considers the shared resource locked; otherwise, it considers the resource unlocked. Before any software can access the shared resource, it must acquire the lock, and it must acquire it with an atomic operation. When the software finishes accessing the resource, it must release the lock.

In an SMP OS, processors often must wait while another processor holds a lock. The spin-lock received its name because it accomplishes this waiting by causing the processor to spin around a tight loop while continually attempting to acquire the lock. A later refinement to reduce bus contention added a back-off loop during which the processor does not attempt to access the lock. In either case, in a power-conscious embedded system, these unproductive cycles obviously waste energy.

The AOS extensions include a new instruction pair that lets a processor sleep while waiting for a lock to be freed and that, as a result, consumes less energy. The ARM11 multiprocessor implements these instructions in a way that provides next-cycle notification to the waiting processor when the lock is freed, without requiring a back-off loop. This results in both energy savings and a more efficient spin-lock implementation.

Figure 4 shows a sample implementation of the spin-lock and unlock code used in the ARM Linux 2.6 kernel.

Weakly ordered memory consistency

The ARMv6 architecture defined various memory consistency models for the different definable memory regions. In the ARM11 multiprocessor, spin-lock code uses coherently cached memory to store the lock value. As a multiprocessor, the ARM11 MPCore is the first ARM processor to fully expose weakly ordered memory to the programmer. The multiprocessor uses three instructions to control weakly ordered memory's side effects:

static inline void _raw_spin_lock(spinlock_t *lock)
{
    unsigned long tmp;

    __asm__ __volatile__(
    1:  ldrex   %0, [%1]        ; exclusive read of lock
        teq     %0, #0          ; check if free
        wfene                   ; if not, wait for event (saves power)
        strexeq %0, %2, [%1]    ; attempt to store to the lock
        teqeq   %0, #0          ; were we successful?
        bne     1b              ; no, try again
    : "=&r" (tmp)
    : "r" (&lock->lock), "r" (1), "r" (0)
    : "cc", "memory");

    rmb();  // Read memory barrier stops speculative reading of payload.
            // This is a NOP on MPCore since dependent reads are sync'ed.
}

static inline void _raw_spin_unlock(spinlock_t *lock)
{
    wmb();  // Data write memory barrier ensures payload writes are visible.
            // Ensures data ordering, but does not necessarily wait.
    __asm__ __volatile__(
        str     %1, [%0]                ; release spin-lock
        mcr     p15, 0, %1, c7, c10, 4  ; DrainStoreBuffer (DSB)
        sev                             ; signal to any waiting CPU
    :
    : "r" (&lock->lock), "r" (0)
    : "cc", "memory");
}

Figure 4. Power-conscious spin-lock. This sample implementation shows the spin-lock and unlock code used in the ARM Linux 2.6 kernel.

• wmb(). This Linux macro creates a write-memory barrier that the multiprocessor can use to place a marker in the sequencing of any writes around this barrier instruction. The spin-lock, for example, executes this instruction prior to unlocking to ensure that any writes to the payload data complete before the write to release the spin-lock, and hence before any other processor can acquire the lock. To ensure higher performance, the barrier does not necessarily stall the processor by flushing data. Rather, it informs the load-store unit and lets execution continue in most situations.

• rmb(). Again from the Linux kernel, this macro places a read-memory barrier that prevents speculative reads of the payload from occurring before the read has acquired the lock. Although legal in the ARMv6 architecture, this level of weakly ordered memory can make it difficult to ensure software correctness. Thus, the ARM11 multiprocessor implements only nonspeculative read-ahead. When the possibility exists that a read will not be required, as in the spin-lock case, where a branch instruction sits between the teqeq instruction and any payload read, the read-ahead does not take place. So, for the ARM11 MPCore multiprocessor, this macro can be defined as empty.

• DSB (Drain Store Buffer). The ARM architecture includes a buffer visible only to the processor. When running uniprocessor software, the processor allows subsequent reads to scan the data from this buffer. In a multiprocessor, however, this buffer is invisible to reads from other processors. The DSB drains this buffer into the L1 cache. In the ARM11 multiprocessor, which has a coherent L1 cache between the processors, the flush only needs to proceed as far as the L1 memory system before another processor can read the data. In the spin-lock unlock code, the processor issues the DSB immediately prior to the SEV (Send Event) instruction so that any processor can read the correct value for the lock upon awakening.

ARM11 MPCORE MULTIPROCESSOR

The ISA's suitability is not the only factor affecting the multiprocessor's ability to actually deliver the scalability promises of SMP. If poorly implemented, two aspects of an SMP design can significantly limit peak performance and increase the energy costs associated with providing SMP services:

Figure 5. ARM11 MPCore. This multiprocessor integrates the new ARM GIC inside the core to make the interrupt system's access and effects much closer and more efficient. The block diagram shows a configurable number of hardware interrupt lines and private FIQ lines feeding an interrupt distributor; between one and four symmetric CPUs, each with a timer, watchdog, and CPU interface (IRQ), per-CPU aliased peripherals, a CPU/VFP unit, and L1 memory; a snoop control unit (SCU); a private peripheral bus; and a primary plus an optional second 64-bit AXI read/write bus.

• Cache coherence. Developers typically provide the single-image SMP OS with coherent caches so it can maintain performance by placing its data in cached memory. In the ARM11 multiprocessor, each CPU has its own instruction and L1 data cache. Existing coherency schemes often extend the system bus with additional signals to control and inspect other CPUs' caches. In an embedded system, the system bus often clocks slower than the CPU. Thus, besides placing a bottleneck between the processor and its cache, this scheme significantly increases the traffic and hence the energy the bus consumes. The ARM11 MPCore addresses these problems by implementing an intelligent SCU between the processors. Operating at CPU frequency, this configuration also provides a very rapid path for data to move directly between each CPU's cache.

• Interprocessor communication. An SMP OS requires communication between CPUs. The system must often regulate this communication using a spin-lock that synchronizes access to a protected resource; other SMP OS communication between CPUs is best accomplished without accessing memory at all. Systems frequently must also synchronize asynchronously. One such mechanism uses the device's interrupt system to cause activity on a remote processor. These software-initiated interprocessor interrupts (IPIs) typically use an interrupt system designed to interface interrupts from I/O peripherals rather than another CPU.

Figure 5 shows how the ARM11 MPCore integrates the new ARM GIC inside the core to make the interrupt system's access and effects closer and more efficient. ARM designed the GIC to optimize the cost of the key forms of IPI used in an SMP OS.

Interrupt subsystem

A key example of IPI use in SMP involves a multithreaded application that affects some state within the processor that is not hardware-coherent with the other processors on which the application process has threads running. This can occur when, for example, the application allocates some virtual memory.
also must apply these memory translations to all other processors. In this example, the OS would typically apply the translation to its processor and then use the low-contention private peripheral bus to write to an interrupt control register in the GIC that causes an interrupt to all other processors. The other processors could then use this interrupt's ID to determine that they need to update their memory translation tables.

The GIC also uses various software-defined patterns to route interrupts to specific processors through the interrupt distributor. In addition to their dynamic load balancing of applications, SMP OSs often also dynamically balance the interrupt handler load. The OS can use the per-processor aliased control registers in the local private peripheral bus to rapidly change the destination CPU for any particular interrupt.

Another popular approach to interrupt distribution sends an interrupt to a defined group of processors. The MPCore views the first processor to accept the interrupt, typically the least loaded, as being best positioned to handle the interrupt. This flexible approach makes the GIC technology suitable across the range of ARM processors. This standardization, in turn, further simplifies how software interacts with an interrupt controller.

Snoop control unit
The MPCore's SCU is an intelligent control block used primarily to control data cache coherence between each attached processor. To limit the power consumption and performance impact from snooping into and manipulating each processor's cache on each memory update, the SCU keeps a duplicate copy of the physical address tag (pTag) for each cache line. Having this data available locally lets the SCU limit cache manipulations to processors that have cache lines in common.

The processor maintains cache coherence with an optimized version of the MESI (modified, exclusive, shared, invalid) protocol. With MESI, some common operations, such as A = A + 1, cause many state transitions when performed on shared data. To help improve performance and further reduce the power overhead associated with maintaining coherence, the intelligence in the SCU monitors the system for a migratory line. If one processor has a modified line, and another processor reads then writes to it, the SCU assumes such a location will experience this same operation in the future. As this operation starts again, the SCU will automatically move the cache line directly to an invalid state rather than expending energy moving it first into the shared state. This optimization also causes the processor to transfer the cache line directly to the other processor without intervening external memory operations.

This ability to move shared data directly between processors provides a key feature that programmers can use to optimize their software. When defining data structures that processors will share, programmers should ensure appropriate alignment and packing of the structure so that line migration can occur. Also, if the programmers use a queue to distribute work items across processors, they should ensure that the queue is an appropriate length and width so that when the worker processor picks up the work item, it will transfer it again through this cache-to-cache transfer mechanism. To aid with this level of optimization, the MPCore includes hardware instrumentation for many operations within both the traditional L1 cache and the SCU.

The ARMv6K ISA can be considered a key multiprocessor-aware instruction set. With its foundation in low-power design, the architecture and its implementation in the ARM11 MPCore can bring low power to high-performance designs. These new designs show the potential to truly change how people access technology. With more than 1.5 billion ARM processors being sold each year, there is a huge range of markets in which ARM developers can use their software code. ■

References
1. D. Seal, ARM Architecture Reference Manual, 2nd ed., Addison-Wesley Professional, 2000.
2. A. Sloss et al., ARM System Developer's Guide, Morgan Kaufmann, 2004.

John Goodacre is a program manager at ARM with responsibility for multiprocessing. His interests include all aspects of both hardware and software in embedded multiprocessor designs. Goodacre received a BSc in computer science from the University of York. Contact him at john.goodacre@arm.com.

Andrew N. Sloss is a principal engineer at ARM. His research interests include exception handling methods, embedded systems, and operating system architecture. Sloss received a BSc in computer science from the University of Hertfordshire. He is also a Chartered Engineer and a Fellow of the British Computer Society. Contact him at andrew.sloss@arm.com.