
Characters and Strings

Most computers today offer 8-bit bytes to represent characters, with the American Standard Code for
Information Interchange (ASCII) being the representation that nearly everyone follows.

Storing and Loading a Single Byte

So far we’ve used instructions that transfer words (1 word = 32 bits = 4 bytes) between memory and
registers. MIPS also provides the following instructions to move a single byte between memory and a
register:
● Load byte (lb) loads a byte from memory, placing it in the 8 LSBs of a register. In other words, it
loads the byte from memory into the low-order eight bits of the register (bits 0-7)
and then copies bit 7 into bits 8-31 of the register (all bits to the left of bit 7). The address of the
byte is calculated at run time by adding an offset to a base register (just as with the load
word and store word instructions). Use this instruction when the byte is regarded as an 8-bit
signed integer in the range -128...+127 and you want a 32-bit version of the same integer.
lb t, off(b) # $t <-- Sign-extended byte
# from memory address b+off
# b is a base register.
# off is 16-bit two's complement.

lb $t0, 20($a0) # $t0 = Memory[$a0 + 20]


If the byte is regarded as an ASCII character or an 8-bit unsigned integer, then use lbu (load byte unsigned).
This instruction fills bits 8-31 of the register with 0s.

lbu t, off(b) # $t <-- Zero-extended byte
# from memory address b+off
# b is a base register.
# off is 16-bit two's complement.
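The difference between lb and lbu can be sketched in C. This is only an illustrative model, not how the hardware is implemented: memory is assumed to be a plain byte array, and the names lb_model and lbu_model are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of lb: fetch one byte and copy its bit 7 into bits 8-31. */
uint32_t lb_model(const uint8_t *mem, uint32_t addr) {
    assert(mem != NULL);
    uint32_t byte = mem[addr];
    return (byte & 0x80u) ? (byte | 0xFFFFFF00u) : byte;
}

/* Model of lbu: fetch one byte and fill bits 8-31 with 0s. */
uint32_t lbu_model(const uint8_t *mem, uint32_t addr) {
    assert(mem != NULL);
    return (uint32_t)mem[addr];
}
```

With a byte 0x88 in memory, lb_model yields 0xFFFFFF88 while lbu_model yields 0x00000088; for a byte below 0x80 the two agree.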

● Store byte (sb) takes a byte from the 8 LSBs (i.e., lowest byte of data) of a register and writes it
to memory. There is no need for two "store byte" instructions, thus there’s no sbu
instruction. Whatever is in the low-order byte of the register is copied to memory. The
rest of the register is ignored. Of course, the register does not change.

sb t, off(b) # The byte at off+b <-- low-order
# byte from register $t.
# b is a base register.
# off is 16-bit two's complement.

sb $t0, 20($a0) # Memory[$a0 + 20] = $t0
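As a rough C model of store byte (again assuming memory is a byte array; sb_model is a hypothetical name), only the low-order byte of the register value reaches memory and the neighbouring bytes are untouched:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of sb: write the low-order byte of reg to mem[addr].
   The upper 24 bits of reg are ignored, and reg itself is unchanged. */
void sb_model(uint8_t *mem, uint32_t addr, uint32_t reg) {
    assert(mem != NULL);
    mem[addr] = (uint8_t)(reg & 0xFFu);
}
```

Because the upper bits are simply dropped, the same operation serves signed and unsigned bytes, which is why no sbu is needed.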

NOTE: Loading and storing bytes is used for processing text and for low-level system programs (such
as assemblers and operating systems). Graphics programs also make frequent use of these
operations.
Example: Assume that $t0 contains the value 0x12121212 and $t1 contains the address 0x1000000.
Assume that the memory data, starting from address 0x1000000 is: 88 77 66 55. What will be the
value of $t0 after the following code is executed: lb $t0, 0($t1)?

A. 0x00000088
B. 0x88121212
C. 0xffffff88
D. 0x12121288

The instruction lb $t0, 0($t1) loads a byte from a location in memory into the register $t0. The
memory address is given by 0($t1), which means the address $t1 + 0, i.e., 0x1000000 + 0 =
0x1000000, so we want byte 0 of the word at that address. MIPS hardware can be configured as either
big-endian or little-endian; assuming the big-endian convention here, we
start the numbering from the “big end”:

byte: 0 1 2 3
88 77 66 55

Byte 0 is 0x88, so we load it into the low-order byte of $t0. 0x88 is 0b1000 1000, so bit 7 is 1 and lb
copies it into bits 8-31, since lb sign-extends (lbu would fill those bits with 0s instead).
0x88 sign-extended is 0b1111 1111 1111 1111 1111 1111 1000 1000, or 0xFFFFFF88. The previous
contents of $t0 (0x12121212) are completely overwritten. Therefore, the
answer is C.

Example: Memory at 0x10000005 contains the byte 0xA4 and register $t0 contains the address
0x10000000. What’s placed in register $t2 when the instruction lb $t2, 0x5($t0) is executed?

The offset 5 tells us how far to move from the address in $t0 to reach memory at
0x10000005 (0x10000000 + 0x5). The instruction lb loads the byte at that address (i.e., 0xA4)
into the lowest byte of $t2 and copies its bit 7 into the remaining three high-order bytes. 0xA4 is
0b1010 0100, so bit 7 is 1 and $t2 = 0xFFFFFFA4.
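The address arithmetic used in these examples can be written out as a small C sketch. The function name effective_address is an assumption for illustration; the point is that the 16-bit offset is sign-extended before it is added to the base register, so negative offsets also work.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of off(b): base register plus the sign-extended
   16-bit two's complement offset, with 32-bit wraparound. */
uint32_t effective_address(uint32_t base, int16_t off) {
    uint32_t addr = base + (uint32_t)(int32_t)off;
    assert((uint32_t)(addr - base) == (uint32_t)(int32_t)off);
    return addr;
}
```

For the example above, effective_address(0x10000000, 0x5) gives 0x10000005, the address of the byte 0xA4.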

Example:
The add $s3, $zero, $zero simply makes $s3 = 0.

The lb $t0, 1($s3) loads some byte into $t0. The memory address is given by 1($s3), which means
the address $s3 + 1. This is 0 + 1 = 1, i.e., byte 1, one byte past base memory location 0. Since we
assume a big-endian architecture, we start enumerating the bytes from the left:

Address      Byte addresses    Data
(decimal)

12           12 13 14 15       10 00 00 10
8             8  9 10 11       01 00 04 02
4             4  5  6  7       FF FF FF FF
0             0  1  2  3       00 90 12 A0

Byte: 0 1 2 3
00 90 12 A0

Byte 1 is 0x90, so that’s what we load into $t0. The load byte instruction sign-extends it (0x90 is
0b1001 0000, so bit 7 is 1), so $t0 = 0xFFFFFF90. That’s the value left in $t0.

The sb $t0, 6($s3) instruction stores the low-order byte of register $t0 (0x90) into the memory address
given by 6($s3), i.e., $s3 + 6 = 6 (remember that $s3 holds the base address, and $s3 + 6 means 6 bytes
past it). Thus

Byte: 4 5 6 7
FF FF FF FF

becomes

Byte: 4 5 6 7
FF FF 90 FF
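The lb/sb walk-through above can be replayed in C as a sanity check. This sketch assumes the 16 bytes of the table are held in a byte array indexed by address; run_example is a hypothetical helper, not part of any MIPS toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replays: add $s3, $zero, $zero ; lb $t0, 1($s3) ; sb $t0, 6($s3)
   on a 16-byte memory image. Returns the final value of $t0. */
uint32_t run_example(uint8_t mem[16]) {
    assert(mem != NULL);
    uint32_t s3 = 0;                                  /* add $s3, $zero, $zero */
    uint32_t byte = mem[s3 + 1];                      /* lb: fetch byte 1 ...   */
    uint32_t t0 = (byte & 0x80u) ? (byte | 0xFFFFFF00u)
                                 : byte;              /* ... and sign-extend    */
    mem[s3 + 6] = (uint8_t)(t0 & 0xFFu);              /* sb: low byte to addr 6 */
    return t0;
}
```

Starting from the memory contents in the table, $t0 ends up as 0xFFFFFF90 and only the byte at address 6 changes, from FF to 90.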

Example: Memory at 0x10000519 contains the byte 0x44. Register $t0 contains 0x10000400 and
register $t2 contains 0xFA034183. Write the instruction that replaces the 0x44 in memory with 0x83.

Here we must store the byte 0x83, the low-order byte of register $t2, to memory address 0x10000519.
We must use the base address in $t0 and add some offset to reach that address: offset = 0x10000519
- 0x10000400 = 0x00000119. Since we’re storing from a register to memory, we use the
store byte instruction:

sb $t2, 0x0119($t0)

The instruction above writes the single byte 0x83 into memory location 0x10000519. A store word
would instead write four bytes, 0x10000519...0x1000051C (and sw additionally requires a word-aligned
address, which 0x10000519 is not), whereas store byte writes only that one byte.

String Representations
Characters are normally combined into strings, which have a variable number of characters. There are
three choices for representing a string:

1. the first position of the string is reserved to give the length of a string (Java uses this choice)
2. an accompanying variable has the length of the string (as in a structure), or
3. the last position of a string is indicated by a character used to mark the end of a string (C uses
this choice).
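To make the three choices concrete, here is a hedged C sketch of how the length would be obtained under each one. The type and function names (counted_string, len_prefixed, and so on) are invented for illustration, and the length-prefixed form assumes a one-byte length, which limits it to 255 characters.

```c
#include <assert.h>
#include <stddef.h>

/* Choice 1: the first position holds the length (one byte here). */
size_t len_prefixed(const unsigned char *s) {
    assert(s != NULL);
    return s[0];
}

/* Choice 2: an accompanying variable holds the length, as in a structure. */
struct counted_string {
    size_t length;
    const char *data;
};

size_t len_counted(const struct counted_string *s) {
    assert(s != NULL);
    return s->length;
}

/* Choice 3: a terminator marks the end, so the length must be found by
   scanning for the null byte. */
size_t len_terminated(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

The first two choices return the length in constant time, while the third takes time proportional to the length of the string.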

Example: The procedure strcpy copies string y to string x using the null byte termination convention
of C:

void strcpy(char x[], char y[]) {
    int i = 0;
    while ((x[i] = y[i]) != '\0') /* copy & test byte */
        i += 1;
}

What’s the MIPS assembly code?

Notice that strcpy is a leaf procedure, thus we can allocate i to a temporary register instead of
allocating it to a saved register, which we must save into the stack and later on restore.

# Allocate:
# $a0 = base of x, $a1 = base of y
# $t0 = i

strcpy:
    add $t0, $zero, $zero # i = 0 + 0 = 0
loop:
    # y is an array of bytes, so the index is not multiplied by 4
    # as it would be for an array of words (the scale factor is 1).
    add $t1, $a1, $t0 # t1 = address of y[i]

    # load the ith character of array y.
    lbu $t2, 0($t1) # t2 = y[i]

    # similar process for x[i]
    add $t3, $a0, $t0 # t3 = address of x[i]

    # store the byte from t2 at the address in t3.
    sb $t2, 0($t3) # x[i] = y[i]

    # This is C, thus \0 marks the end of the string.
    beq $t2, $zero, loop_end # exit loop if y[i] == 0

    # we didn't exit the loop so...
    addi $t0, $t0, 1 # i = i + 1
    j loop # loop again

loop_end:
    jr $ra # return to caller

NOTE: String copies usually use pointers instead of arrays in C to avoid the operations on i in the code
above.
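A pointer-based version might look like the sketch below. It is named strcpy_ptr here only to avoid clashing with the standard library's strcpy; the const qualifier on the source is an addition not present in the array version.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the null-terminated string y into x by advancing two pointers.
   No index i is kept, so no address needs to be recomputed per iteration. */
void strcpy_ptr(char *x, const char *y) {
    assert(x != NULL && y != NULL);
    while ((*x++ = *y++) != '\0') {
        /* the copy and the test both happen in the loop condition */
    }
}
```

In assembly terms, each iteration would bump the two address registers by 1 instead of adding i to each base.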

Storing and Loading a Halfword


Unicode is a universal encoding of the alphabets of most human languages. Figure 2.16 gives a list of
Unicode alphabets; there are almost as many alphabets in Unicode as there are useful symbols in
ASCII. To be more inclusive of other alphabets, some programming languages (e.g., Java and Raku)
use Unicode for characters. For instance, by default Java uses 16 bits to represent a character.

The MIPS instruction set has explicit instructions to load and store such 16-bit quantities, called
halfwords (2 bytes):

● Load half (lh) loads a halfword from memory, placing it in the rightmost 16 bits of a register. Like
load byte, load half treats the halfword as a signed number and thus sign-extends to fill the 16
leftmost bits of the register.
lh t,off(b) # t <-- Sign-extended halfword
# starting at memory address b+off.
# b is a base register.
# off is 16-bit two's complement.

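Load halfword unsigned (lhu), by contrast, treats the halfword as an unsigned integer and zero-extends
it, filling the 16 leftmost bits of the register with 0s. Because 16-bit character codes are unsigned
values, lhu is the more popular of the two.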

lhu t,off(b) # t <-- zero-extended halfword
# starting at memory address b+off.
# b is a base register.
# off is 16-bit two's complement.

● Store half (sh) takes a halfword from the rightmost 16 bits of a register and writes it to memory.
Only one store halfword instruction is needed. The low-order two bytes of the designated
register are copied to memory, no matter what the upper two bytes are. Of course, the register is
not changed when its data is copied to memory.
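The three halfword instructions can be modelled in C in the same spirit as the byte instructions. This sketch assumes a byte-array memory and big-endian byte order (matching the earlier examples); the names lh_model, lhu_model and sh_model are hypothetical. The alignment check reflects that MIPS halfword accesses must use an even address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of lhu: read two bytes (big-endian assumed) and zero-extend. */
uint32_t lhu_model(const uint8_t *mem, uint32_t addr) {
    assert(mem != NULL && (addr & 1u) == 0);  /* halfword-aligned */
    return ((uint32_t)mem[addr] << 8) | (uint32_t)mem[addr + 1];
}

/* Model of lh: same read, then copy bit 15 into bits 16-31. */
uint32_t lh_model(const uint8_t *mem, uint32_t addr) {
    uint32_t half = lhu_model(mem, addr);
    return (half & 0x8000u) ? (half | 0xFFFF0000u) : half;
}

/* Model of sh: write the low-order two bytes of reg; upper two are ignored. */
void sh_model(uint8_t *mem, uint32_t addr, uint32_t reg) {
    assert(mem != NULL && (addr & 1u) == 0);  /* halfword-aligned */
    mem[addr] = (uint8_t)((reg >> 8) & 0xFFu);
    mem[addr + 1] = (uint8_t)(reg & 0xFFu);
}
```

For the halfword 0x8001, lh_model gives 0xFFFF8001 and lhu_model gives 0x00008001; for a halfword with bit 15 clear the two agree.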

Strings in languages such as Java and Raku are a class with special built-in support and predefined
methods for concatenation, comparison, and conversion. Unlike C, where the length of a string has to
be computed by scanning for the terminating null byte (as strlen does), these languages provide a
built-in function/method that returns the length of the string:

// Java
class Main {
    public static void main(String[] args) {
        System.out.println("こんにちは".length()); // prints 5
    }
}

# Raku
say chars "こんにちは"; # «5␤»
"こんにちは".chars.say; # «5␤»

From Lowercase to Uppercase


Example: Write the MIPS assembly code to convert a null-terminated string from lowercase to
uppercase. Assume the string only contains alphabetical characters. Similarly assume the base
address of the string is stored in $t0. Also assume the moon is made of cheese.
# Alloc: $t0 = base add. of string

Loop: lb $t2, 0($t0) # t2 = next character of string
beq $t2, $zero, Exit # go to Exit if null character reached
addi $t2, $t2, -32 # convert char from lc to uc. lc - uc = 32
sb $t2, 0($t0) # update character
addi $t0, $t0, 1 # t0 = address of next character
j Loop # repeat for the next character

Exit:
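For comparison, the same conversion can be sketched in C. The function name to_upper_ascii is made up for this example, and the range check on 'a'..'z' is an addition: the assembly version relies on the stated assumption that every character is a lowercase letter, whereas this sketch leaves any other character unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a null-terminated ASCII string to uppercase in place.
   In ASCII, each lowercase letter is exactly 32 above its uppercase form. */
void to_upper_ascii(char *s) {
    assert(s != NULL);
    while (*s != '\0') {
        if (*s >= 'a' && *s <= 'z')   /* guard added for this sketch */
            *s = (char)(*s - 32);
        s++;
    }
}
```

Calling it on a writable buffer holding "mips" leaves "MIPS" in that buffer.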

Note that the strings made available by #include <string> come from the C++ standard library (which
absorbed the STL) rather than from the core language. Hosted implementations ship that library, but a
freestanding implementation may not provide <string>, so strictly speaking these are library strings
rather than a built-in C++ string type.

Systems Software

Translating and Starting a Program


Transforming any C program in a file on disk into a program running on a computer takes four main
steps:
1. Compilation — A high-level language program is compiled into an assembly language
program.
2. Assembly — The assembler assembles the assembly language program into an object module
in machine code. This is the step where the different assembly instructions are converted into
their machine code equivalents (i.e., strings of 0s and 1s).
3. Linking — The linker combines multiple modules with library routines to resolve all references.
This produces the executable file.
4. Loading — The loader places the resulting machine code into the proper memory locations for
execution by the processor.
Compiler
The compiler transforms the C program into an assembly language program, a symbolic form of what
the machine understands.

Assembler
Assembly language is an interface to higher-level software, and thus the assembler can also treat
common variations of machine language instructions as if they were instructions in their own right. The
hardware need not implement these instructions; however, their appearance in assembly language
simplifies translation and programming. Such instructions are called pseudoinstructions, i.e.,
assembly language instructions that the assembler translates into one or more native machine
instructions.

An assembler directive (also, pseudo-op) is an instruction to the assembler that tells it how to
translate the program to machine code. Some remarks about assembler directives:
● No code is generated for assembler directives
● In MIPS, they’re preceded with a dot (.)
