Assembler
Assembling a program into machine code involves several steps:
1. Read the assembly language program, handling directives appropriately.
2. Replace pseudo-instructions with a sequence of native MIPS instructions, using $at (and
nothing else!) if needed. For example, the instruction move $t0, $t1 is replaced with add $t0,
$zero, $t1. Every pseudo-instruction in the assembly program is replaced in this way with its
corresponding equivalent native MIPS instructions.
3. Generate machine code for each instruction in the “expanded” assembly program.
4. Generate relocation information for code that depends on absolute addresses. Ex: j Label
5. Generate a symbol table for all references (e.g., labels, variables). To produce the binary version
of each instruction in the assembly language program, the assembler must determine the
addresses corresponding to all labels. Assemblers keep track of labels used in branches and
data transfer instructions in a symbol table (hash tables, anyone?), a table that matches names
of labels to the addresses of the memory words that instructions occupy.
The object file for UNIX systems typically contains six distinct pieces:
● The object file header describes the size and position of the other pieces of the object
file.
● The text segment contains the machine language code.
● The static data segment contains data allocated for the life of the program. (UNIX
allows programs to use both static data, which is allocated throughout the program, and
dynamic data, which can grow or shrink as needed by the program.)
● The relocation information identifies instructions and data words that depend on
absolute addresses when the program is loaded into memory.
● The symbol table contains the remaining labels that are not defined, such as external
references.
● The debugging information contains a concise description of how the modules were
compiled, so that a debugger can associate machine instructions with source files and
make data structures readable.
6. Generate information for use by the debugger.
Linker
Complete retranslation is a terrible waste of computing resources. This repetition is particularly wasteful
for standard library routines, because programmers would be compiling and assembling routines that
by definition almost never change. Instead of compiling and assembling the whole program whenever a
line of code changes, an alternative is to compile and assemble each procedure independently, so that
a change to one line would require compiling and assembling only one procedure. This alternative
requires a new systems program, called a linker, which takes all the independently assembled
machine language programs and “stitches” them together. The linker thus avoids the recompilation
and/or reassembling of every module if only one module changes.
The linker produces an executable file that can be run on a computer. Typically, this file has the same
format as an object file, except that it contains no unresolved references. It is possible to have
partially linked files, such as library routines, that still have unresolved addresses and hence result in
object files.
Loader
The loader is a systems program that places an object program in main memory so that it is ready to
execute. In UNIX, it follows these steps:
1. Reads the executable file header to determine the sizes of the text and data segments.
2. Creates an address space large enough for the text and data.
3. Copies the instructions and data from the executable file into memory.
4. Copies the parameters (if any) to the main program onto the stack.
5. Initializes the machine registers and sets the stack pointer to the first free location.
6. Jumps to a start-up routine that copies the parameters into the argument registers and calls the
main routine of the program. When the main routine returns, the start-up routine terminates the
program with an exit system call.
Dynamic Linking
Link/load library procedures only when they are called (not before program execution):
1. Procedure code needs to be relocatable
2. Avoids code bloat of static linking of all [transitively] referenced libraries
3. Automatically gets new library versions
Compilation Steps
1. Lexical Analysis (Tokenization). Scan the input program and group characters into tokens. For
example, each space-delimited string is a token.
2. Syntax Analysis (Parsing). Group tokens into a graphical representation (AST) to relate them
appropriately.
3. Semantic Analysis. Make sure that the program’s syntax makes sense for further processing.
Ex: type checking, variable scoping.
4. Intermediate Code Generation. Translate source into a high-level assembly-like code. Also
generate control flow graphs (CFGs), and other graphs needed for further analysis.
5. Optimization. Manipulate intermediate code for faster execution (and less memory usage). For
example, avoid recomputation of common expressions.
6. Code Generation. Translate intermediate code to assembly.
7. Target-Specific Optimizations. Apply optimizations that depend on the target architecture to
the generated assembly code.
Types of Optimizations
A basic block is a sequence of instructions with no branches from or to internal instructions.
● Local to basic blocks
● Across basic blocks but intra-procedural
● Inter-procedural
● Target specific
● Constant folding
for (...) {
x = 2;
y = x + 3 * 4;
}
In the above code, the operation 3 * 4 would be recomputed on every iteration of the loop.
Instead, the compiler folds it into the constant 12 and does away with the multiplication:
for (...) {
x = 2;
y = x + 12;
}
● Constant propagation
for (...) {
x = 2;
y = x + 12;
}
Since x is known to be 2 when y is computed, the compiler propagates the constant into the
expression and computes y = 14 directly:
for (...) {
x = 2;
y = 14;
}
● Loop-invariant code motion
for (...) {
k = m + n;
y = 5 * x + y;
}
The expression k = m + n doesn’t need to be recomputed with each loop iteration, thus it can
be moved out of it:
k = m + n;
for (...) {
y = 5 * x + y;
}
● Induction variable elimination
for (...) {
x = x + 1;
w = 2 * x;
}
For each loop iteration, x is incremented by 1 and thus, w is incremented by 2. The compiler
might be able to deduce this and replace w = 2 * x with w = w + 2.
● Strength reduction
This compiler optimization weakens expensive operations into cheaper ones. For example,
reducing multiplications to additions.
Synchronization
Parallel execution is easier when tasks are independent, but often they need to cooperate. Cooperation
usually means some tasks are writing new values that others must read. To know when a task is
finished writing so that it is safe for another to read, the tasks need to synchronize. If they don’t
synchronize, there is danger of a data race, where the results of the program can change depending on
how events happen to occur.
In computing, synchronization mechanisms are typically built with user-level software routines that rely
on hardware-supplied synchronization instructions. In this section, we focus on the implementation of
lock and unlock synchronization operations. Lock and unlock can be used straightforwardly to create
regions where only a single processor can operate, called a mutual exclusion, as well as to implement
more complex synchronization mechanisms.
There are a number of alternative formulations of the basic hardware primitives, all of which provide the
ability to atomically read and modify a location, together with some way to tell if the read and write were
performed atomically. In general, architects do not expect users to employ the basic hardware
primitives, but instead expect that the primitives will be used by system programmers to build a
synchronization library, a process that is often complex and tricky. One typical operation for building
synchronization operations is the atomic exchange or atomic swap, which interchanges a value in a
register for a value in memory.
Implementing a single atomic memory operation introduces some challenges in the design of the
processor, since it requires both a memory read and a write in a single, uninterruptible instruction. An
alternative is to have a pair of instructions in which the second instruction returns a value showing
whether the pair of instructions was executed as if the pair were atomic. The pair of instructions is
effectively atomic if it appears as if all other operations executed by any processor occurred before or
after the pair, and not between the pair. Thus, when an instruction pair is effectively atomic, no other
processor can change the value between the instruction pair. In MIPS this pair of instructions includes a
special load called a load linked and a special store called a store conditional:
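In MIPS assembly, both take the usual load/store operand layout (register roles shown here are illustrative):

```asm
ll  $t1, 0($s1)   # load linked:       $t1 = MEM[$s1 + 0]
sc  $t0, 0($s1)   # store conditional: MEM[$s1 + 0] = $t0; $t0 = 1 if atomic, else 0
```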
These instructions are used in sequence: If the contents of the memory location specified by the load
linked are changed before the store conditional to the same address occurs, then the store conditional
fails. The store conditional is defined to both store the value of a (presumably different) register in
memory and to change the value of that register to a 1 if it succeeds and to a 0 if it fails.
Since the load linked returns the initial value, and the store conditional returns 1 only if it succeeds, the
following sequence implements an atomic exchange on the memory location specified by the contents
of $s1 and $s4:
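A sketch of that sequence in MIPS assembly ($t0 carries the value to be stored, $t1 receives the loaded value):

```asm
again: add $t0, $zero, $s4     # copy exchange value from $s4
       ll  $t1, 0($s1)         # load linked: $t1 = MEM[$s1]
       sc  $t0, 0($s1)         # store conditional: try MEM[$s1] = $t0
       beq $t0, $zero, again   # $t0 == 0 means the store failed; retry
       add $s4, $zero, $t1     # exchange done: put loaded value in $s4
```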
In this example, we try to swap the contents of MEM[$s1] and register $s4. We first copy the
exchange value from $s4 into register $t0, load the contents of MEM[$s1] into register $t1 with the
load linked, and immediately attempt to store $t0 into the same memory address with the store
conditional. Did the exchange happen, though? If there was no change to that address between the
load linked and the store conditional, the exchange succeeded and $t0 is set to 1. Otherwise, $t0 is
set to 0 and the branch instruction takes us back to try again. This loop repeats until the atomic
exchange succeeds, at which point we copy the loaded value from $t1 into $s4, completing the swap.
Overflow
When adding operands with different signs, overflow cannot occur, because the sum is no larger in
magnitude than one of the operands. Similar restrictions exist for subtraction: when the signs of the
operands are the same, overflow cannot occur. To see this, remember that c - a = c + (-a), because we
subtract by negating the second operand and then adding. Therefore, when we subtract operands of
the same sign, we end up adding operands of different signs.
● Overflow occurs when adding two positive numbers and the sum is negative, or vice versa.
Since adding two 32-bit numbers can yield a result that needs 33 bits to be fully expressed, the
lack of a 33rd bit means that when overflow occurs, the sign bit is set with the value of the result
instead of the proper sign of the result. Since we need just one extra bit, only the sign bit can be
wrong. This spurious sum means a carry out occurred into the sign bit.
● Overflow occurs in subtraction when we subtract a negative number from a positive number and
get a negative result, or when we subtract a positive number from a negative number and get a
positive result. Such a ridiculous result means a borrow occurred from the sign bit.
Because C ignores overflows, the MIPS C compilers will always generate the unsigned versions of the
arithmetic instructions addu, addiu, and subu, no matter what the type of the variables. The MIPS
Fortran compilers, however, pick the appropriate arithmetic instructions, depending on the type of the
operands.
Exceptions
MIPS includes a register called the exception program counter (EPC) to contain the address of the
instruction that caused the exception. The instruction move from system control (mfc0) is used to copy
EPC into a general-purpose register so that MIPS software has the option of returning to the offending
instruction via a jump register instruction.
Multiplication/Division
DISCLAIMER: Patterson and Hennessy go a lot deeper than what’s discussed here; however, the
professor glossed over all the minutiae the book discusses. Thus, read it there if you want a deeper
understanding. For example, the book walks the reader through the evolution of the multiply hardware
and algorithm through multiple generations.
Multiplication
If we ignore the sign bits, the length of the multiplication of an n-bit multiplicand and an m-bit multiplier
is a product that is n + m bits long. That is, n + m bits are required to represent all possible products.
Hence, like add, multiply must cope with overflow because we frequently want a 32-bit product as the
result of multiplying two 32-bit numbers.
MIPS provides a separate pair of 32-bit registers to contain the 64-bit product, called Hi (for the upper
32-bits) and Lo (for the lower 32-bits). To produce a properly signed or unsigned product, MIPS has two
instructions: multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit product, the
programmer uses move from lo (mflo). The MIPS assembler generates a pseudoinstruction for multiply
that specifies three general-purpose registers, generating mflo and mfhi instructions to place the
product into registers.