
PowerPC assembly

Introduction to assembly on the PowerPC

Level: Advanced

Hollis Blanchard (hollis@austin.ibm.com), Software developer, IBM

01 Jul 2002

Assembly language is not widely known among the programming community these days, and
PowerPC assembly is even more exotic. Hollis Blanchard presents an overview of assembly
language from a PowerPC perspective and contrasts examples for three architectures: ia32, ppc, and
ppc64.

High-level languages offer great advantages in general by hiding many mundane and repetitive details from
programmers, allowing them to concentrate on their goals. However, sometimes programmers must use a lower-
level language, such as when writing code that deals directly with hardware or that is extremely performance
sensitive. Assembly language is the programming language closest to the hardware, which makes it a natural last
resort in such situations.

This article assumes a basic understanding of computer design (for example, you should know that a processor has
registers and can access memory) and of operating systems (system calls, exceptions, process stacks). This article
should be useful to PowerPC programmers unfamiliar with assembly as well as programmers who already know
ia32 assembly and want to broaden their horizons.

Introduction to PowerPC
The PowerPC Architecture Specification, released in 1993, is a 64-bit specification with a 32-bit subset. Almost
all PowerPCs generally available (with the exception of late-model IBM RS/6000 and all IBM pSeries high-end
servers) are 32-bit.

PowerPC processors have a wide range of implementations, from high-end server CPUs such as the Power4 to the
embedded CPU market (the Nintendo Gamecube uses a PowerPC). PowerPC processors have a strong embedded
presence because of good performance, low power consumption, and low heat dissipation. The embedded
processors, in addition to integrating I/O such as serial and Ethernet controllers, can be significantly different from the
"desktop" CPUs. For example, the 4xx series PowerPC processors lack floating point and use a software-controlled
TLB for memory management rather than the inverted page table found in desktop chips.

PowerPC processors have 32 (32- or 64-bit) GPRs (General Purpose Registers) and various others such as the PC
(Program Counter, also called the IAR/Instruction Address Register or NIP/Next Instruction Pointer), LR (link
register), CR (condition register), etc. Some PowerPC CPUs also have 32 64-bit FPRs (floating point registers).

RISC
PowerPC architecture is an example of a RISC (Reduced Instruction Set Computing) architecture. As a result:
All PowerPCs (including 64-bit implementations) use fixed-length 32-bit instructions.
The PowerPC processing model is to retrieve data from memory, manipulate it in registers, then store it
back to memory. There are very few instructions (other than loads and stores) that manipulate memory
directly.

Application binary interfaces (ABIs)
Technically, a developer can use any GPR for anything. For example, there is no "stack pointer register"; a
programmer could use any register for that purpose. In practice, it is useful to define a set of conventions so that
binary objects can interoperate with different compilers and pre-written assembly code.

Calling conventions are determined by the ABI (Application Binary Interface) used. ppc32 Linux and NetBSD
implementations use the SVR4 (System V R4) ABI, but ppc64 Linux follows AIX and uses the PowerOpen ABI.
The ABI specifies which registers are considered volatile (caller-save) and non-volatile (callee-save) when calling
subroutines, and a lot more.

Some concrete examples of behavior specified by the SVR4 ABI:


Since the PowerPC has so many GPRs (32 compared to ia32's 8), arguments are passed in registers starting
with gpr3.
Registers gpr3 through gpr12 are volatile (caller-save) registers that (if necessary) must be saved
before calling a subroutine and restored after returning.
Register gpr1 is used as the stack frame pointer.
Many of the SVR4 features are identical to the PowerOpen ABI, which greatly aids interoperability.

When to use assembly


All the pros and cons listed in the "Assembly HOWTO" (see Resources for a link) apply to PowerPC.

Machine-specific registers
Sometimes you must touch CPU registers that higher-level languages are completely unaware of. This is
especially true in the course of writing an operating system. One simple example is assigning your code its own
stack -- on a PowerPC, you must set r1. A C compiler will only increment or decrement r1, so if your
application is running directly on the hardware, you must set r1 before calling C code. Another example is an
operating system's exception handlers, which must carefully save and restore state one register at a time until it's
safe to call higher-level code.

Nonetheless, when faced with a situation in which you must use low-level hardware features, you should
implement as little as possible in assembly:
C code is portable and understood by a large number of developers; assembly code (especially PowerPC
assembly) is not.
Higher-level code is frequently much easier to debug than assembly.
Higher-level code is by definition more expressive than assembly; in other words you can do more with less
code (and in less time).
If you find yourself writing high-level constructs such as loops or C structures in assembly, take a step back and
consider if this could be done more easily in another language. A general rule is to use just enough assembly to
allow you to use a higher-level language.

Optimization
One of the most common reasons people want to use assembly language is to make a slow program run faster.
But in these cases, assembly should be the absolute last place you turn.

General advice on optimization is beyond the scope of this document, but here are some places to start:
Profile
You must profile your code before starting any optimization work. Not only will this tell you where the
hotspots are (they're frequently not where you expect!), it will also give you proof that you've sped
anything up once you're done. Once you find hotspots, you can begin optimizing the high-level code (rather
than attempting to rewrite it in assembly).

Algorithmic optimization
No matter how tight your assembly is, if you're using an n^4 algorithm, you're still going to be incredibly
slow. Some other techniques you should try first include using a more appropriate data structure. If you
iterate repeatedly over a linked list, think about using a hash table, binary tree, or whatever is appropriate
for your application.
Your compiler can almost always do a much better job than you can at writing assembly! Rather than attempting
to rewrite high-level code in assembly, make judicious use of optimization options such as -O3 and C directives
like __inline__. The compiler is aware of tricks like instruction scheduling, which considers the internals of
the processor and tries to keep all pipelines full at all times. That may involve moving loads earlier in the
instruction stream than required to keep the pipeline from stalling as the CPU waits for memory accesses to catch
up. Unless you've been coding assembly for many years, these are tasks that most people cannot correctly perform
by hand.

How to learn assembly


gcc is the best place to start learning assembly (for any architecture). gcc -O3 -S file.c will produce
file.s in gas-compilable format (gas is the GNU Assembler). Open file.s in your favorite editor and you
can see the assembly output from your C code.
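
As a starting point, here is a minimal sketch of a file.c worth feeding to gcc -O3 -S. The function name add_one is invented for illustration. Because the SVR4 ABI passes the first argument in gpr3, the ppc32 output is likely to be little more than an addi on gpr3 followed by blr, though the exact text depends on your compiler version and flags.

```c
/* file.c -- a deliberately tiny translation unit to feed to gcc -O3 -S.
 * The function name add_one is made up for this sketch. */

/* Takes one integer argument and returns it incremented by one. */
int add_one(int x)
{
    return x + 1;
}
```

Starting with functions this small keeps the generated file.s short enough to read line by line.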

You'll probably see instructions you don't understand. You can look them up in The PowerPC Architecture: A
Specification for a New Family of RISC Processors, 2nd. Ed and PowerPC Microprocessor Family: The
Programming Environments for 32-bit Microprocessors (see Resources for links to these documents). However,
like learning any (spoken) language, there are certain words that are important and that you should know, and
others that can be safely ignored until you've figured out more important features of the code. A good example of
an important instruction is the branch family of instructions, such as blr.

Assembly examples

Hello World -- ia32 assembly


Listing 1 is copied directly from the gas example in the Assembly HOWTO, which unfortunately is completely
ia32-specific. It makes two direct system calls: the first writes to stdout; the second exits the application (with a
return code of 0). It is very unusual to make system calls directly; normally applications link with a libc library,
which wraps all the system calls.

Listing 1. ia32 assembly

.data # section declaration

msg:
.string "Hello, world!\n"
len = . - msg # length of our dear string

.text # section declaration

# we must export the entry point to the ELF linker or
.global _start # loader. They conventionally recognize _start as their
# entry point. Use ld -e foo to override the default.

_start:

# write our string to stdout

movl $len,%edx # third argument: message length


movl $msg,%ecx # second argument: pointer to message to write
movl $1,%ebx # first argument: file handle (stdout)
movl $4,%eax # system call number (sys_write)
int $0x80 # call kernel

# and exit

movl $0,%ebx # first argument: exit code


movl $1,%eax # system call number (sys_exit)
int $0x80 # call kernel

Hello World -- PPC32 assembly


Listing 2 is a straightforward translation of the same code into PowerPC assembly.

Listing 2. PPC32 assembly

.data # section declaration - variables only

msg:
.string "Hello, world!\n"
len = . - msg # length of our dear string

.text # section declaration - begin code

.global _start
_start:

# write our string to stdout

li 0,4 # syscall number (sys_write)


li 3,1 # first argument: file descriptor (stdout)
# second argument: pointer to message to write
lis 4,msg@ha # load top 16 bits of &msg
addi 4,4,msg@l # load bottom 16 bits
li 5,len # third argument: message length
sc # call kernel

# and exit

li 0,1 # syscall number (sys_exit)


li 3,0 # first argument: exit code
sc # call kernel

General notes about Listing 2


PowerPC assembly requires a destination register for all register-to-register operations (because it is a RISC
architecture). This register is always the first in the argument list.
Under PPC Linux, system calls are made with the syscall number in gpr0 and arguments beginning with gpr3.
The syscall number, order of arguments, and number of arguments may differ under other PowerPC operating
systems (NetBSD, Mac OS, etc.), which is one reason programmers typically make system calls through a libc
library (which handles the OS-specific details).
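
For comparison, here is a rough C sketch of the same write made through the libc wrapper rather than a raw sc. The helper name say_hello is hypothetical; write() and STDOUT_FILENO are the standard POSIX names from unistd.h. The syscall number and register assignments are no longer visible, because libc supplies them for whatever OS it was built for.

```c
#include <unistd.h>   /* write(), STDOUT_FILENO, ssize_t */

/* Same message as the listings; sizeof counts the trailing NUL, so subtract 1. */
static const char msg[] = "Hello, world!\n";

/* Writes the message to stdout through the libc wrapper and returns the
 * number of bytes written, or -1 on error (hypothetical helper name). */
ssize_t say_hello(void)
{
    return write(STDOUT_FILENO, msg, sizeof msg - 1);
}
```

Calling say_hello() from main and then returning 0 mirrors the two system calls in the assembly listings.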

Register notation
PowerPC registers have numbers, not names. For the learner, this can sometimes be confusing since literals aren't
easily distinguishable from registers. "3" could mean the value 3 or the register gpr3, or floating point fpr3,
or special purpose register spr3. Get used to it. :)

Immediate instructions
li means "load immediate", which is a way of saying "take this constant value known at compile time and store
it in this register". Another example of an immediate instruction is addi: addi 3,3,1 increments the
contents of gpr3 by 1 and stores the result back into gpr3. Contrast this with add 3,3,1, which adds the
contents of gpr1 to the contents of gpr3 and stores the result back into gpr3.

Instructions ending in "i" are usually immediate instructions.

Mnemonics
li isn't really an instruction; it's actually a mnemonic. A mnemonic is a bit like a preprocessor macro: it's an
instruction that the assembler will accept but secretly translate into other instructions. In this case, li 3,1 is
really defined as addi 3,0,1.

The sharp-eyed will notice that those instructions aren't necessarily the same thing: addi is really adding 1 to
the contents of gpr0, storing the result into gpr3, right? That would be true, except the PowerPC spec says
gpr0 sometimes has a value, and sometimes is read as 0, depending on the context. In this case (and the addi
description states this explicitly), the 0 means value 0 rather than register gpr0.
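
To see that li and addi really are the same instruction, the following C sketch packs addi into its 32-bit instruction word. The function names encode_addi and encode_li are made up for this example; the opcode value 14 and the field positions are those of the PowerPC encoding for addi.

```c
#include <assert.h>
#include <stdint.h>

/* Primary opcode of addi in the PowerPC instruction encoding. */
enum { OPCODE_ADDI = 14 };

/* Packs "addi rt,ra,imm" into its 32-bit instruction word. */
uint32_t encode_addi(unsigned rt, unsigned ra, int16_t imm)
{
    assert(rt < 32 && ra < 32);           /* register fields are 5 bits each   */
    return ((uint32_t)OPCODE_ADDI << 26)  /* 6-bit opcode                      */
         | ((uint32_t)rt << 21)           /* destination register              */
         | ((uint32_t)ra << 16)           /* source register; 0 means value 0  */
         | (uint16_t)imm;                 /* 16-bit signed immediate           */
}

/* li rt,imm is the same word as addi rt,0,imm. */
uint32_t encode_li(unsigned rt, int16_t imm)
{
    return encode_addi(rt, 0, imm);
}
```

With this sketch, encode_li(3, 1) yields 0x38600001, which is the word a disassembler shows for li 3,1.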

Mnemonics shouldn't matter at all to anyone other than assembler developers, but mnemonics can be confusing
when you're looking at disassembly output. However, GNU objdump -d is quite good at displaying the
original mnemonic rather than the instruction actually present in the file. For example, objdump will display the
mnemonic nop rather than ori 0,0,0 (the actual instruction used).

Loading pointers
The most interesting part of our Hello World example is how we load the address of msg. As mentioned earlier,
PowerPC uses fixed-length 32-bit instructions (in contrast to ia32, which uses variable-length instructions). That
32-bit instruction is just a 32-bit integer. This integer is divided into fields of different sizes:

Listing 3. addi machine code format

--------------------------------------------------------------------------
| opcode | dest register | src register | immediate value |
| 6 bits | 5 bits | 5 bits | 16 bits |
--------------------------------------------------------------------------

The number of fields and their sizes will vary by instruction, but the important point here is that these fields take
up space in the instruction. In the case of addi, after just those three fields are placed into the instruction, there
are only 16 bits left for the immediate value you're adding!
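
To make the field layout concrete, the following C sketch pulls an instruction word in the addi format apart with shifts and masks. The struct and function names are made up for this example; the field positions follow the PowerPC encoding, in which the destination register field sits directly after the opcode.

```c
#include <stdint.h>

/* Fields of an instruction word in the addi format (names are illustrative). */
struct dform {
    unsigned opcode;  /* top 6 bits                     */
    unsigned rt;      /* next 5 bits: destination       */
    unsigned ra;      /* next 5 bits: source            */
    int16_t  imm;     /* low 16 bits, treated as signed */
};

/* Splits a 32-bit instruction word into its four fields. */
struct dform decode_dform(uint32_t word)
{
    struct dform f;
    f.opcode = word >> 26;
    f.rt     = (word >> 21) & 0x1F;
    f.ra     = (word >> 16) & 0x1F;
    f.imm    = (int16_t)(word & 0xFFFF);
    return f;
}
```

Feeding it 0x38600001 (the word for li 3,1) gives opcode 14, destination 3, source 0, and immediate 1. The 6 + 5 + 5 bits consumed by the first three fields are why only 16 remain for the immediate.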

That means that li can only load 16-bit immediates. You cannot load a 32-bit pointer into a GPR with just one
instruction. You must use two instructions, loading first the top 16 bits and then the bottom. That is exactly the
purpose of the @ha ("high") and @l ("low") suffixes. (The "a" part of @ha takes care of sign extension.)
Conveniently, lis (meaning "load immediate shifted") will load directly into the high 16 bits of the GPR. Then
all that's left to do is add in the lower bits.

This trick must be used whenever you load an absolute address (or any 32-bit immediate value). The most
common use is in referencing globals.
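
The arithmetic behind @ha and @l can be sketched in C. The function names below are invented for illustration, and the code models the lis/addi pair on a plain 32-bit integer rather than executing any PowerPC instruction. The point to notice is that addi sign-extends its 16-bit immediate, so whenever the low half has its top bit set, the high half must be bumped by one to compensate. That adjustment is what distinguishes @ha from a plain @h.

```c
#include <assert.h>
#include <stdint.h>

/* @l: low 16 bits of the value. */
uint16_t part_l(uint32_t v)  { return (uint16_t)(v & 0xFFFF); }

/* @h: plain high 16 bits, with no adjustment. */
uint16_t part_h(uint32_t v)  { return (uint16_t)(v >> 16); }

/* @ha: high 16 bits, adjusted up by one when the low half will be
 * sign-extended as a negative number by addi. */
uint16_t part_ha(uint32_t v) { return (uint16_t)((v + 0x8000u) >> 16); }

/* Models "lis r,v@ha" followed by "addi r,r,v@l" on a 32-bit register. */
uint32_t lis_addi(uint32_t v)
{
    uint32_t reg = (uint32_t)part_ha(v) << 16;      /* lis                */
    reg += (uint32_t)(int32_t)(int16_t)part_l(v);   /* addi sign-extends  */
    assert(reg == v);                               /* pair rebuilds v    */
    return reg;
}
```

For the address 0x1001FFFC, @h would be 0x1001 but @ha is 0x1002: the low half 0xFFFC is added as -4, and 0x10020000 - 4 lands back on 0x1001FFFC.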

Hello World -- PPC64 assembly


Listing 4 is almost identical to the 32-bit PowerPC example (Listing 2) above. PowerPC was designed as a 64-bit
specification with 32-bit implementations, and not only that, PowerPC user-level programs are more or less
binary-compatible across those implementations. Under Linux, ppc32 binaries run perfectly well on 64-bit
hardware (with a little munging here and there for variable types visible to both 32-bit userland and the 64-bit
kernel).

Listing 4. PPC64 assembly

.data # section declaration - variables only

msg:
.string "Hello, world!\n"
len = . - msg # length of our dear string

.text # section declaration - begin code

.global _start
.section ".opd","aw"
.align 3
_start:
.quad ._start,.TOC.@tocbase,0
.previous

.global ._start
._start:

# write our string to stdout

li 0,4 # syscall number (sys_write)


li 3,1 # first argument: file descriptor (stdout)
# second argument: pointer to message to write

# load the address of 'msg':

# load high word into the low word of r4:


lis 4,msg@highest # load msg bits 48-63 into r4 bits 16-31
ori 4,4,msg@higher # load msg bits 32-47 into r4 bits 0-15

rldicr 4,4,32,31 # rotate r4's low word into r4's high word

# load low word into the low word of r4:


oris 4,4,msg@h # load msg bits 16-31 into r4 bits 16-31
ori 4,4,msg@l # load msg bits 0-15 into r4 bits 0-15

# done loading the address of 'msg'

li 5,len # third argument: message length


sc # call kernel

# and exit

li 0,1 # syscall number (sys_exit)
li 3,0 # first argument: exit code
sc # call kernel

There are only two differences between the ppc32 code (Listing 2) and the ppc64 code (Listing 4). The first is the
way we load pointers, and the second is those assembler directives about an .opd section. It's worth pointing out
that the ppc32 code works perfectly under ppc64 Linux when compiled as a ppc32 binary.

Loading pointers
On ppc32 it took two instructions to load a 32-bit immediate value into a register. On ppc64 it takes 5! Why?

We still have 32-bit fixed-length instructions, which can only load 16 bits worth of immediate value at a time.
Right there you need a minimum of four instructions (64 bits / 16 bits per instruction = 4 instructions). But there
are no instructions that can load directly into the high word of a 64-bit GPR. So we have to load up the low word,
shift it to the high word, then load the low word again.

The rotate instructions (like the rldicr seen here) are notoriously complicated, and have jokingly been called
Turing-complete. If all you need to do is load 64-bit immediate values, don't worry about it -- just convert these
five instructions into a macro and never think about it again.

One last note: here we used @h, rather than the @ha of the ppc32 example, because the low 16 bits are supplied
with ori rather than addi. Unlike addi, ori does not sign-extend its immediate, so the high half needs no
adjustment. On RISC machines it's frequently possible to accomplish something in many different
ways (for example, there are many possibilities for nop).
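
The five-instruction sequence from Listing 4 can be modeled in C on a 64-bit integer. This is only a sketch of the data flow, with a made-up function name; each statement imitates what the corresponding instruction does to r4.

```c
#include <assert.h>
#include <stdint.h>

/* Models Listing 4's five-instruction sequence on a 64-bit register. */
uint64_t load_imm64(uint64_t v)
{
    uint16_t highest = (uint16_t)(v >> 48);  /* v@highest */
    uint16_t higher  = (uint16_t)(v >> 32);  /* v@higher  */
    uint16_t h       = (uint16_t)(v >> 16);  /* v@h       */
    uint16_t l       = (uint16_t)v;          /* v@l       */

    /* lis: immediate shifted left 16, then sign-extended to 64 bits */
    uint64_t r = (uint64_t)(int64_t)(int32_t)((uint32_t)highest << 16);
    r |= higher;                              /* ori                     */
    r = (r << 32) | (r >> 32);                /* rldicr: rotate left 32, */
    r &= 0xFFFFFFFF00000000ull;               /* then clear the low word */
    r |= (uint64_t)h << 16;                   /* oris                    */
    r |= l;                                   /* ori                     */
    assert(r == v);                           /* sequence rebuilds v     */
    return r;
}
```

The masking step matters when the top bit of the value is set: lis then fills the upper word with ones, and after the rotate those ones sit in the low word until rldicr's mask clears them.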

Function descriptors -- the .opd section


Under ppc64 Linux, when you define a C function foo, the symbol foo is not actually the address of the function's
code. If you try to bl foo in assembly, you will quickly find your program crashing. The label foo is actually
the address of foo's function descriptor. Function descriptors are described in detail in the ppc64 ELF ABI (see
Resources), but very briefly: you must have a function descriptor (which is simply a structure containing three
pointers) if your assembly will be called from C code, because the compiler expects it.

We don't have any C code here, but the ELF ABI also says that the ELF file's entry point (_start by default)
points to a function descriptor. So we must have one, and that is what goes into the .opd section.

Those assembler directives were copied almost directly from the output of gcc -S. This is another excellent
candidate for a preprocessor macro in your assembly code.
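
Viewed from C, the descriptor emitted by the .quad line in Listing 4 is just three 64-bit slots. The struct below is a sketch: the field and function names are illustrative rather than taken from the ABI document, and the comments map each slot to the value Listing 4 puts there.

```c
#include <stdint.h>

/* Sketch of a ppc64 ELF function descriptor: three doubleword slots,
 * matching ".quad ._start,.TOC.@tocbase,0" in Listing 4.
 * Field names are illustrative, not taken from the ABI text. */
struct func_desc {
    uint64_t entry;  /* address of the function's code (._start)  */
    uint64_t toc;    /* TOC base for the function (.TOC.@tocbase) */
    uint64_t env;    /* third pointer, 0 in Listing 4             */
};

/* Builds a descriptor the way the .opd directives lay one out. */
struct func_desc make_desc(uint64_t entry, uint64_t toc)
{
    struct func_desc d = { entry, toc, 0 };
    return d;
}
```

This is why branching to foo fails: foo names the 24-byte structure, and the code itself lives at the address stored in its first slot.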

Where to learn more


For those of you interested in learning more about PowerPC, you can start by compiling tiny programs with gcc
-S -- provided that you have a PowerPC box handy. If you do not, check out the PPC cross-compiling mini-
howto, as well as the other sites and documents listed in the Resources section. Also try experimenting with gdb's
psim (PowerPC simulator) target. It's easier than you may think!

Resources
Download the Hello World code samples listed in this article:
For ia32 assembly
For PPC32 assembly
For PPC64 assembly

Get details on assembly instructions in The PowerPC Architecture: A Specification for a New Family of
RISC Processors, 2nd. Ed (Morgan Kaufmann, May 1994, ISBN 1-55860-316-6), and also PowerPC
Microprocessor Family: The Programming Environments for 32-bit Microprocessors (IBM, February
2000).

Find links to UNIX assembly projects and programming information at linuxassembly.org's projects page.

For assembly concepts, the Linux Assembly HOWTO is a good place to start, but it contains little actual
assembly, unfortunately.

Learn embedded assembly in the Linux for PowerPC Embedded Systems HOWTO.

Also see the Cross Development mini-howto for PPC Linux. (Don't panic -- it's easier than you think!)

Learn more about function descriptors in the 64-bit PowerPC ELF ABI.

For IBM PowerPC applications, feature summaries, technical documentation, news, and more, visit the
IBM PowerPC Web site.

Browse the current list of IBM white papers and technical reports on PowerPC architecture.

"A programmer's view of performance monitoring in the PowerPC microprocessor" (IBM Systems Journal,
1997) shows how you can analyze processor, software, and system attributes for a variety of workloads
with the Power PC's on-chip Performance monitor (PM).

"A decompression core for PowerPC" (IBM Systems Journal, 1998) shows you how to improve size
efficiency for PowerPC code.

Learn the basics and usage of inline assembly code in Linux in "Inline assembly for x86 in Linux"
(developerWorks, March 2001).

For an overview of embedded development on Linux, see "Linux system development on an embedded
device" (developerWorks, March 2002).

Find more Linux articles in the developerWorks Linux zone.

About the author

Hollis Blanchard has been programming PowerPC assembly for about 6 months. He graduated from Carnegie-
Mellon University in 2001 and works on Linux and other PowerPC projects as part of IBM's Linux Technology
Center. You can contact him at hollis@austin.ibm.com.
