
● They’re determined by the assembler, not the chip’s ISA (in theory)

A few example assembler directives:

.word 50 # Reserve 1 word of memory, initialized to 50


arr: .space 64 # Reserve 64 [uninitialized] bytes of memory, called arr
x: .byte 7 # Reserve 1 byte of memory, called x, initialized to 7
.asciiz "ab" # Reserve 3 bytes for null-terminated string"ab"
.text # Start of text (i.e., code) segment
.data # Start of data segment
.data 0xABCD1234 # Start of data segment (force to addr. 0xABCD1234)

The assembling of a program into machine code by the assembler involves several steps:
1. Read the assembly language program, handling directives appropriately.
2. Replace pseudo-instructions with a sequence of native MIPS instructions, using $at (and
nothing else!) if needed. For example, the instruction move $t0, $t1 is replaced with add $t0,
$zero, $t1. Every pseudo-instruction in the assembly program is replaced in this way with its
corresponding sequence of native MIPS instructions (see the expansion sketch after this list).
3. Generate machine code for each instruction in the “expanded” assembly program.
4. Generate relocation information for code that depends on absolute addresses. Ex: j Label
5. Generate symbol table for all references (e.g., labels, variables). To produce the binary version
of each instruction in the assembly language program, the assembler must determine the
addresses corresponding to all labels. Assemblers keep track of labels used in branches and
data transfer instructions in a symbol table (hash tables, anyone?), a table that matches names
of labels to the addresses of the memory words that instructions occupy.
6. Generate information for use by the debugger.
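
As a concrete illustration of step 2, here is a sketch of a pseudo-instruction that genuinely needs $at:
blt has no native MIPS encoding, so the assembler expands it into a compare followed by a branch.

blt $t0, $t1, Label # pseudo-instruction: branch if $t0 < $t1

# ...is expanded by the assembler into the native pair:
slt $at, $t0, $t1 # $at = 1 if $t0 < $t1, else $at = 0
bne $at, $zero, Label # take the branch when the comparison held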

The object file for UNIX systems typically contains six distinct pieces:
● The object file header describes the size and position of the other pieces of the object
file.
● The text segment contains the machine language code.
● The static data segment contains data allocated for the life of the program. (UNIX
allows programs to use both static data, which is allocated throughout the program, and
dynamic data, which can grow or shrink as needed by the program.)
● The relocation information identifies instructions and data words that depend on
absolute addresses when the program is loaded into memory.
● The symbol table contains the remaining labels that are not defined, such as external
references.
● The debugging information contains a concise description of how the modules were
compiled, so that a debugger can associate machine instructions with the source code.

Linker
Complete retranslation is a terrible waste of computing resources. This repetition is particularly wasteful
for standard library routines, because programmers would be compiling and assembling routines that
by definition almost never change. Instead of compiling and assembling the whole program whenever a
line of code changes, an alternative is to compile and assemble each procedure independently, so that
a change to one line would require compiling and assembling only one procedure. This alternative
requires a new systems program, called a linker, which takes all the independently assembled
machine language programs and “stitches” them together. The linker thus avoids the recompilation
and/or reassembling of every module if only one module changes.

The major tasks of a linker are:


● Find any library routines needed by the code.
● Determine the memory locations that each module's code and data will occupy.
● Patch code for absolute references, using the relocation information from the assembler.
● Patch code for references to labels in other modules (i.e., external references), using the
symbol table from the assembler. It is an error if any unresolved references remain. (A small
sketch of this patching follows the list.)
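
As a hypothetical sketch of the patching tasks (the addresses and variable names are made up
purely for illustration):

# In a module before linking: external and absolute targets are placeholders,
# recorded in the relocation information and symbol table.
lw $a0, 0($gp) # load of static variable X: offset not yet known
jal printf # call to an external routine: address not yet known

# After linking, once the linker has placed every module:
lw $a0, 0x8000($gp) # offset of X within the static data segment patched in
jal 0x00400140 # absolute address of printf patched into the jal field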

The linker produces an executable file that can be run on a computer. Typically, this file has the same
format as an object file, except that it contains no unresolved references. It is also possible to have
partially linked files, such as library routines, that still contain unresolved addresses; these remain
object files rather than executables.

Loader
The loader is a systems program that places an object program in main memory so that it is ready to
execute. In UNIX, it follows these steps:
1. Reads the executable file header to determine size of the text and data segments.
2. Creates an address space large enough for the text and data.
3. Copies the instructions and data from the executable file into memory.
4. Copies the parameters (if any) to the main program onto the stack.
5. Initializes the machine registers and sets the stack pointer to the first free location.
6. Jumps to a start-up routine that copies the parameters into the argument registers and calls the
main routine of the program. When the main routine returns, the start-up routine terminates the
program with an exit system call.

Dynamic Linking
Link/load library procedures only when they are first called (not before program execution):
1. Procedure code needs to be relocatable.
2. Avoids the code bloat of statically linking all [transitively] referenced libraries.
3. Automatically gets new library versions.
A user-visible form of this on UNIX is sketched below.
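
A related, user-visible form of this on UNIX is the dlopen interface, which loads a library and resolves
its symbols at runtime. A minimal sketch in C (the library file name is a system-dependent assumption;
link with -ldl on many systems):

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* Load the math library only when it's actually needed. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up cos by name and call it through a function pointer. */
    double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
    if (cosine != NULL)
        printf("cos(0.0) = %f\n", cosine(0.0));

    dlclose(handle);
    return 0;
}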

Translating from High-Level Languages (HLLs)


● Compiler. This program translates the HLL program into an assembly language program
ahead of time (i.e., statically), which is then executed by a processor. Languages such as C
and C++ are compiled languages.
● Interpreter. This program executes HLL statements directly at runtime (i.e., dynamically,
while the HLL program is executing). Languages such as Python and Ruby are interpreted
languages.
● P-Code. In this case, a HLL is translated into an intermediate language that’s interpreted
dynamically. An example of this is Java: the Java compiler translates source code into Java
bytecodes, which are then executed by the Java Virtual Machine (JVM).

Compilation Steps
1. Lexical Analysis (Tokenization). Scan input program; group characters into tokens. For
example, each space delimited string is a token.
2. Syntax Analysis (Parsing). Group tokens into a graphical representation (AST) to relate them
appropriately.
3. Semantic Analysis. Make sure the program’s meaning is well formed for further processing,
e.g., type checking and variable scoping.
4. Intermediate Code Generation. Translate source into a high-level assembly-like code. Also
generate control flow graphs (CFGs), and other graphs needed for further analysis.
5. Optimization. Manipulate intermediate code for faster execution (and less memory usage). For
example, avoid recomputation of common expressions.
6. Code Generation. Translate intermediate code to assembly.
7. Target-Specific Optimizations. Apply optimizations specific to the target machine to the
generated assembly code.

Major Compiler Data Structures


1. Symbol Table: table of each name in a program, along with its types and attributes
2. Parse Tree (Concrete Syntax Tree): syntactic structure of a program
3. Abstract Syntax Tree (AST): syntactic structure of a program, stripping out parser details not
needed for further processing
4. Control Flow Graph (CFG): flow of control between basic blocks in intermediate code
5. Dependence Graph (and other graphs): better suited for program analysis and optimization

Types of Optimizations
A basic block is a sequence of instructions with no branches out except at the end and no branch
targets except at the beginning.
● Local to basic blocks
● Across basic blocks but intra-procedural
● Inter-procedural
● Target specific

Typical Compiler Optimizations


● Register allocation
Assign a large number of program variables onto a small number of CPU registers.

● Constant folding
for (...) {
x = 2;
y = x + 3 * 4;
}

In the above code, the operation 3 * 4 would be recomputed on every iteration of the loop.
Instead, the compiler folds it into the constant 12 at compile time, doing away with the multiplication.

for (...) {
x = 2;
y = x + 12;
}

● Constant propagation
for (...) {
x = 2;
y = x + 12;
}

This would be transformed by the compiler (propagating x’s known value of 2, then folding) into:


for (...) {
x = 2;
y = 14;
}

● Common subexpression elimination


for (...) {
y = 5 * x + y;
z = x + 5 * x;
}
The expression 5 * x can be computed once per iteration and held in a temporary
(temp = 5 * x) that both statements reuse:
for (...) {
temp = 5 * x;
y = temp + y;
z = x + temp;
}
● Code motion
for (...) {
y = 5 * x + y;
k = m + n;
}

The statement k = m + n doesn’t need to be recomputed on each loop iteration (assuming m and
n are not modified inside the loop), so it can be hoisted out of it.

k = m + n;
for (...) {
y = 5 * x + y;
}

● Induction variable analysis


for (x = 1; x < n; x++) {
y = 5 * x + y;
w = 2 * x;
}

On each loop iteration, x is incremented by 1 and thus w grows by 2. The compiler
might be able to deduce this and replace the multiplication w = 2 * x with the cheaper
addition w = w + 2, as sketched below.
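
A sketch of the transformed loop (w must be initialized so the invariant w = 2 * x still holds at the
end of each iteration):

w = 0;
for (x = 1; x < n; x++) {
y = 5 * x + y;
w = w + 2; /* replaces w = 2 * x */
}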

● Strength reduction
This compiler optimization weakens expensive operations into cheaper ones. For example,
reducing multiplications to additions.
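
For instance, a multiplication by a power of two can be rewritten as a shift (a hypothetical one-liner):

y = x * 8; /* before: one multiplication */
y = x << 3; /* after: a cheaper left shift by 3 */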

Support for Parallelism


DISCLAIMER: Mostly lifted from Patterson and Hennessy.

Parallel execution is easier when tasks are independent, but often they need to cooperate. Cooperation
usually means some tasks are writing new values that others must read. To know when a task is
finished writing so that it is safe for another to read, the tasks need to synchronize. If they don’t
synchronize, there is danger of a data race, where the results of the program can change depending on
how events happen to occur.

In computing, synchronization mechanisms are typically built with user-level software routines that rely
on hardware-supplied synchronization instructions. In this section, we focus on the implementation of
lock and unlock synchronization operations. Lock and unlock can be used straightforwardly to create
regions where only a single processor can operate (that is, to enforce mutual exclusion), as well as to
implement more complex synchronization mechanisms.

The critical ability we require to implement synchronization in a multiprocessor is a set of hardware
primitives with the ability to atomically read and modify a memory location. In this sense, atomic
means that nothing else can interpose itself between the read and the write of the memory location. In
other words, the read-modify-write of the memory location executes as a single, "indivisible" unit.
Without such a capability, the cost of building basic synchronization primitives will be high and will
increase unreasonably as the processor count increases.

There are a number of alternative formulations of the basic hardware primitives, all of which provide the
ability to atomically read and modify a location, together with some way to tell if the read and write were
performed atomically. In general, architects do not expect users to employ the basic hardware
primitives, but instead expect that the primitives will be used by system programmers to build a
synchronization library, a process that is often complex and tricky. One typical operation for building
synchronization operations is the atomic exchange or atomic swap, which interchanges a value in a
register for a value in memory.

Implementing a single atomic memory operation introduces some challenges in the design of the
processor, since it requires both a memory read and a write in a single, uninterruptible instruction. An
alternative is to have a pair of instructions in which the second instruction returns a value showing
whether the pair of instructions was executed as if the pair were atomic. The pair of instructions is
effectively atomic if it appears as if all other operations executed by any processor occurred before or
after the pair, and not between the pair. Thus, when an instruction pair is effectively atomic, no other
processor can change the value between the instruction pair. In MIPS this pair of instructions includes a
special load called a load linked and a special store called a store conditional:

ll rt, offset(rs) # similar to lw, but linked to the following sc instruction

sc rt, offset(rs) # similar to sw, but the store is done conditionally:
                  # if MEM[rs + offset] was modified after the last
                  # ll instruction, then rt is set to 0; otherwise,
                  # the store succeeds and rt is set to 1

These instructions are used in sequence: If the contents of the memory location specified by the load
linked are changed before the store conditional to the same address occurs, then the store conditional
fails. The store conditional is defined to both store the value of a (presumably different) register in
memory and to change the value of that register to a 1 if it succeeds and to a 0 if it fails.

Since the load linked returns the initial value, and the store conditional returns 1 only if it succeeds, the
following sequence implements an atomic exchange on the memory location specified by the contents
of $s1 and $s4:

try: add $t0, $s4, $zero # t0 = s4 (the value we want to store)
ll $t1, 0($s1) # t1 = MEM[s1]
sc $t0, 0($s1) # MEM[s1] = t0? t0 = 1 on success, 0 on failure
beq $t0, $zero, try # we failed to store into MEM[s1]; try again
add $s4, $zero, $t1 # s4 = old MEM[s1]

In this example, we try to swap the contents of MEM[$s1] and register $s4. On each attempt, we copy
$s4 into $t0, load the contents of MEM[$s1] into register $t1, and immediately try to store $t0 into the
same memory address. Did the swap happen, though? If the memory location was not modified
between the load linked and the store conditional, then the exchange was successful and $t0 is set to
1. Otherwise, $t0 is set to 0 and the branch instruction takes us back to try again. This looping is done
until the atomic exchange succeeds, at which point we copy the old contents of MEM[$s1] from $t1
into $s4, completing the swap. Note that the copy into $t0 must sit inside the loop: a failed sc
overwrites $t0 with 0, so the exchange value has to be restored before each retry.
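
With ll/sc in hand, a simple spin lock can be sketched as follows (assuming $s0 holds the address of
the lock word, where 0 means free and 1 means held):

lock: addi $t0, $zero, 1 # t0 = 1, the "held" value (reset on every attempt,
                         # since a failed sc overwrites $t0 with 0)
ll $t1, 0($s0) # read the current lock value, linked
bne $t1, $zero, lock # lock already held by someone else: spin
sc $t0, 0($s0) # try to acquire: t0 = 1 on success, 0 on failure
beq $t0, $zero, lock # another processor got in between: retry

# ... critical section: only one processor at a time executes here ...

sw $zero, 0($s0) # release the lock with an ordinary store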

Addition and Subtraction


Addition is just what you would expect in computers. Digits are added bit by bit from right to left, with
carries passed to the next digit to the left, just as you would do by hand. Subtraction uses addition: the
appropriate operand is simply negated before being added.

Example: Add 6 (decimal) to 7 (decimal) in binary.

6_ten = 0000 0000 0000 0000 0000 0000 0000 0110_two
7_ten = 0000 0000 0000 0000 0000 0000 0000 0111_two

  0000 0000 0000 0000 0000 0000 0000 0110
+ 0000 0000 0000 0000 0000 0000 0000 0111
--------------------------------------------
  0000 0000 0000 0000 0000 0000 0000 1101

0000 0000 0000 0000 0000 0000 0000 1101_two = 13_ten

Example: Subtract 6 (decimal) from 7 (decimal) in binary.

7_ten  = 0000 0000 0000 0000 0000 0000 0000 0111_two
6_ten  = 0000 0000 0000 0000 0000 0000 0000 0110_two
-6_ten = 1111 1111 1111 1111 1111 1111 1111 1010_two

  0000 0000 0000 0000 0000 0000 0000 0111
+ 1111 1111 1111 1111 1111 1111 1111 1010
--------------------------------------------
  0000 0000 0000 0000 0000 0000 0000 0001

0000 0000 0000 0000 0000 0000 0000 0001_two = 1_ten

(The carry out of the sign bit is discarded.)

Overflow with Signed Integers


Overflow occurs when the result from an operation cannot be represented with the available hardware,
in this case a 32-bit word.
When can overflow occur in addition? When adding operands with different signs, overflow cannot
occur. The reason is the sum must be no larger than one of the operands. For example, 10 + (-4) = 6.
Since the operands fit in 32 bits and the sum is no larger than an operand, the sum must fit in 32 bits as
well. Therefore, no overflow can occur when adding positive and negative operands.

Similar restrictions exist for subtraction: When the signs of the operands are the same, overflow cannot
occur. To see this, remember that c - a = c + (-a) because we subtract by negating the second operand
and then add. Therefore, when we subtract operands of the same sign we end up by adding operands
of different signs.

How to detect overflow when it occurs?

● Overflow occurs when adding two positive numbers and the sum is negative, or vice versa.
Since adding two 32-bit numbers can yield a result that needs 33 bits to be fully expressed, the
lack of a 33rd bit means that when overflow occurs, the sign bit is set with the value of the result
instead of the proper sign of the result. Since we need just one extra bit, only the sign bit can be
wrong. This spurious sum means a carry out occurred into the sign bit.
● Overflow occurs in subtraction when we subtract a negative number from a positive number and
get a negative result, or when we subtract a positive number from a negative number and get a
positive result. Such a ridiculous result means a borrow occurred from the sign bit.
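
These rules can be checked in software. A sketch for signed addition (register choices are illustrative,
the operands are in $t1 and $t2, and "overflow" is a hypothetical handler label):

addu $t0, $t1, $t2 # compute the sum without trapping on overflow
xor $t3, $t1, $t2 # do the operand signs differ?
slt $t3, $t3, $zero # t3 = 1 if the signs differ
bne $t3, $zero, no_ovf # different signs: overflow is impossible
xor $t3, $t0, $t1 # same signs: does the sum's sign differ from theirs?
slt $t3, $t3, $zero # t3 = 1 if the sign bit flipped
bne $t3, $zero, overflow # sign flipped: overflow occurred
no_ovf: # ...continue normally...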

Overflow with Unsigned Integers


Unsigned integers are commonly used for memory addresses where overflows are ignored. The
computer designer must therefore provide a way to ignore overflow in some cases and to recognize it in
others. The MIPS solution is to have two kinds of arithmetic instructions to recognize the two choices:

Exceptions on Overflow       No Exceptions on Overflow
add  (add)                   addu  (add unsigned)
addi (add immediate)         addiu (add immediate unsigned)
sub  (subtract)              subu  (subtract unsigned)

Because C ignores overflows, the MIPS C compilers will always generate the unsigned versions of the
arithmetic instructions addu, addiu, and subu, no matter what the type of the variables. The MIPS
Fortran compilers, however, pick the appropriate arithmetic instructions, depending on the type of the
operands.

How Does MIPS Detect Overflow?


MIPS detects overflow with an exception, an unscheduled procedure call that disrupts program
execution. The address of the instruction that overflowed is saved in a register, and the computer jumps
to a predefined address to invoke the appropriate routine for that exception. The interrupted address is
saved so that in some situations the program can continue after corrective code is executed.

MIPS includes a register called the exception program counter (EPC) to contain the address of the
instruction that caused the exception. The instruction move from system control (mfc0) is used to copy
EPC into a general-purpose register so that MIPS software has the option of returning to the offending
instruction via a jump register instruction.
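
A minimal sketch of that return path ($14 is the coprocessor 0 register number for the EPC, and $k0 is
a register reserved for operating system use):

mfc0 $k0, $14 # copy EPC (coprocessor 0 register 14) into $k0
jr $k0 # jump back to the offending instruction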

Multiplication/Division
DISCLAIMER: Patterson and Hennessy go a lot deeper than what’s discussed here; however, the
professor glossed over the minutiae the book discusses, so read it there if you want a deeper
understanding. For example, the book walks the reader through the evolution of the multiply hardware
and algorithm through multiple generations.

Multiplication
If we ignore the sign bits, multiplying an n-bit multiplicand by an m-bit multiplier yields a product
that is n + m bits long. That is, n + m bits are required to represent all possible products. Hence,
like add, multiply must cope with overflow because we frequently want a 32-bit product as the
result of multiplying two 32-bit numbers.

MIPS provides a separate pair of 32-bit registers to contain the 64-bit product, called Hi (for the upper
32-bits) and Lo (for the lower 32-bits). To produce a properly signed or unsigned product, MIPS has two
instructions: multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit product, the
programmer uses move from lo (mflo). The MIPS assembler generates a pseudoinstruction for multiply
that specifies three general-purpose registers, generating mflo and mfhi instructions to place the
product into registers.
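
For example, a sketch of fetching the full 64-bit product (register choices are illustrative):

mult $t1, $t2 # Hi:Lo = $t1 * $t2, the full 64-bit signed product
mflo $t0 # t0 = lower 32 bits of the product
mfhi $t3 # t3 = upper 32 bits of the product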
