
Chapter 1: Introduction

1.1 Why study assembly language?


1.2 What Is A Computer?
1.2.1 Bytes
1.2.2 Program Execution
1.3 Machine Language
1.4 Assembly Language
1.5 Assembling And Linking

In detail

An introduction to assembly language programming for the Intel/AMD 64 bit CPUs.

Assembly language is not widely used in general purpose programming. It is still
used for core functions in scientific computing and other domains where maximum
efficiency is needed. It is also used to perform functions which cannot be handled
in a high level language.

This book targets people with some experience in a high level language - ideally C
or C++.

Assembly language is inherently non portable, and this book targets users of Linux.
The primary goal of this text is to have readers learn to write assembly functions
callable from C/C++, which gives an understanding of how a high level language
compiler is implemented. A secondary goal is to have the reader use SSE and AVX
instructions.

1.1 Why study assembly language?

Case against:
The latest approaches to programming involve object oriented HLLs using byte code
interpreters ==> portable, highly reliable programs written in a short time.
Worrying about memory usage and CPU cycles is a relic from a bygone age.
Assembly language has some of the worst features of computing.
First, assembly language is non portable. Every CPU has its own assembly language
and some have more than one. e.g: Intel/AMD CPUs can work in 16/32/64 bit modes,
with differences in assembly language for each mode. In addition operating systems
bring more differences. Portability is difficult, if not impossible.

Second, assembly language is not reliable. In many HLLs, programmers are protected
from possible problems like pointer errors. In assembly language every variable is
a pointer access. HLL syntax resembles mathematical syntax. Assembly syntax is
essentially a sequence of machine instructions with no reference to the problem
being solved.

Third, assembly language is slow to write. This makes assembly language code hard
to create and maintain.

Case for:
Assembly can be more efficient than HLLs. A skilled author can create code which
uses less CPU and memory than code generated by HLL compilers. In general C/C++
compilers do excellent optimizations and can do better than beginning assembly
programmers.
However, advanced assembly language programmers can do better. E.g: in the ATLAS
linear algebra suite, the assembly function for matrix multiplication is 4 times
more efficient than well written C code.

Assembly can also do things not possible in HLLs, e.g. handling hardware interrupts
and controlling the memory mapping features of the CPU. These are essential in
building operating systems, though not for user programs.

So assembly language has some benefits for highly skilled programmers. What about
'normal programmers'?

First, studying assembly language helps you understand how a CPU works, and also
how an HLL compiler actually implements language features. Understanding the
translation from HLLs to assembly helps explain why bugs behave the way they do.
Without understanding assembly, an HLL is essentially a mathematical concept
following mathematical laws, but machine instructions have limits and quirks.

1.2 What Is A Computer?

A computer is a machine for processing bits. A bit is a unit of computer storage
which can take one of two values, 0 or 1. Computers process information, but all
information is represented as bits.

1.2.1 Bytes

Computers access memory in 8 bit chunks called bytes. The main memory of computers
is essentially an array of bytes, with each byte having a separate memory address.
The first address is 0.

A byte can be interpreted as a binary number. The binary number 01010101 represents
the decimal number 85.

(KEY) If this number is interpreted as a machine instruction, then the computer
will push the value of the rbp register to the runtime stack.
This number can also be interpreted as the letter "U".
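
The three readings of the same byte can be checked with a short Python sketch. The one-entry opcode table below is a hypothetical stand-in for illustration only; it is not how a real decoder is organised.

```python
# One byte, three readings: a number, a character, a machine instruction.
byte_value = 0b01010101               # the bit pattern from the text

as_decimal = byte_value               # read as an unsigned binary number
as_hex = format(byte_value, "02x")    # the same value written in hexadecimal
as_char = chr(byte_value)             # read as an ASCII character code

# Hypothetical one-entry lookup table, for illustration only; a real
# x86-64 decoder has to handle far more than this single opcode.
OPCODES = {0x55: "push rbp"}
as_instruction = OPCODES[byte_value]

print(as_decimal, as_hex, as_char, as_instruction)
```

Nothing in the byte itself says which reading is intended; that depends entirely on how the program (or the CPU) uses it.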

1.2.2 Program Execution

A program in execution (_ "an executing program") occupies a range of addresses for
the instructions (_ as distinct from data) of the program. The following 12 bytes
constitute a very simple program which simply exits.

Memory Address Content

4000b0 184
4000b1 1
4000b2 0
4000b3 0
4000b4 0
4000b5 187
4000b6 5
4000b7 0
4000b8 0
4000b9 0
4000ba 205
4000bb 128

The addresses are in hexadecimal, but could be in decimal (_ or any other
convenient number system). The hexadecimal addresses contain several 0s, which
gives a clue as to how the operating system maps a program into memory: pages of
memory begin with addresses that have the three rightmost hexadecimal digits set
to 0. So this program is loaded close to the start of a page of memory.
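
As a rough check of the page arithmetic, the sketch below splits the program's first address into a page start and an offset. It assumes 4096-byte (0x1000) pages, which is what "three rightmost hexadecimal digits set to 0" implies; split_address is a name made up for this example.

```python
PAGE_SIZE = 0x1000  # assumption: 4096-byte pages, i.e. three low hex digits

def split_address(address):
    """Return (page_start, offset_within_page) for an address."""
    offset = address & (PAGE_SIZE - 1)   # keep only the low three hex digits
    page_start = address - offset        # clear them to get the page start
    return page_start, offset

page_start, offset = split_address(0x4000b0)
print(hex(page_start), hex(offset), offset)
```

Under that assumption the program starts 176 bytes (0xb0) into the page that begins at 0x400000.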

1.3 Machine Language

Each type of computer has a collection of instructions it can execute.
These instructions are
a. stored in memory
b. fetched _(from memory), interpreted, and executed
during the execution of a program.

Sequences of bytes (like the 12 byte program above) constitute a "machine language
program". Writing machine language programs directly would be quite painful:
1. you would have to enter the correct bytes for each instruction of your
program.
2. you would have to know the addresses of all the data used in your program.
3. most programs would have branching instructions ==> the exact address to
branch to depends on where your program is loaded into memory.
4. the address to branch to can change when you add, delete, or modify
instructions in your program.

The first computers were programmed in machine language, and soon people figured
out ways to make these tasks easier:
1. words (_ instead of raw bytes) for specific instructions, e.g. mov
2. symbolic names to indicate addresses of instructions and data in a program
(_ e.g. a label such as _start). Using symbolic names avoids the need to calculate
absolute addresses, insulating the programmer from address changes (_ coming from
edits to the source code, loading at a different address, etc.)
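
The benefit of symbolic names can be sketched with a toy label resolver in Python. This is only an illustration of the idea, not how yasm works internally: resolve_labels, the tuple format, and the "finish" label are all made up for the example, and instruction sizes are supplied by hand.

```python
def resolve_labels(program, load_address):
    """Map each label to the address of the item that follows it.

    program is a list of ("label", name) or ("instr", size_in_bytes).
    """
    addresses = {}
    current = load_address
    for kind, value in program:
        if kind == "label":
            addresses[value] = current   # a label names the current point
        else:
            current += value             # an instruction occupies bytes
    return addresses

# Hypothetical program layout: two 5-byte and one 2-byte instruction.
program = [("label", "_start"), ("instr", 5), ("instr", 5),
           ("label", "finish"), ("instr", 2)]
before = resolve_labels(program, 0x4000b0)

# Insert one more 5-byte instruction ahead of the "finish" label.
edited = program[:3] + [("instr", 5)] + program[3:]
after = resolve_labels(edited, 0x4000b0)

print(hex(before["finish"]), hex(after["finish"]))
```

The address behind "finish" moves when an instruction is added, but any code that refers to the name needs no change; the tool recomputes the address.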

1.4 Assembly Language

Programmers developed symbolic assembly languages to eliminate direct coding in
machine languages, thus eliminating tedious work.
Machine languages are "first generation" programming languages. Assembly languages
are "second generation".

Fortran and Cobol were "third generation" languages. Many programs continued to be
written in assembler, specifically operating systems, until the creation of C for
the Unix operating system.

Assembly for the 12 byte "exit" program is

; Program: Exit
;
; Input: none
;
; Output: only the exit status ($? in the shell, i.e. the exit status
;         of the most recently run command)
;
    segment .text
    global _start

_start:
    mov eax, 1      ; 1 is the exit syscall number
    mov ebx, 5      ; the status value to return
    int 0x80        ; execute a system call (_ basically set up the syscall
                    ; parameters in specific registers, then issue the
                    ; 'exec syscall' instruction)

A semicolon (;) indicates that a comment follows.

Lines of assembly code consist of labels and instructions.
A label usually starts in column 1 but is not required to.
A label establishes a symbolic name for the current point in the assembler. A label
by itself on a line must have a colon after it; if there is something else on the
same line, the colon is optional.

Instructions can be
1. machine instructions
2. macros
3. instructions to the assembler (_ directives)
Instructions (as opposed to labels) are usually placed further right than column 1.
Most programmers start all instructions in the same column.

'segment .text' is an instruction to the assembler rather than a machine
instruction. This particular statement tells the assembler that the data and
instructions following it are to be placed in the .text segment/section. In Linux
this is where the *instructions* of a program are located.

The statement 'global _start' is another instruction to the assembler, called an
assembler directive or a pseudo-opcode ('pseudo-op'). This informs the assembler
that the label _start is to be made known to the linker. The _start function is the
'entry point' of a program on a Linux system: when the system runs the program it
transfers control to the _start function. A C program has the main function, which
is called indirectly via a _start function (assembly? C?) in the C library.

The remaining 3 lines are symbolic opcodes representing the three executable
instructions in the program.
- the first instruction moves the constant 1 to the eax register.
- the second instruction moves the constant 5 to the ebx register.
- the final instruction generates software interrupt number 0x80, which is
the way Linux handles 32 bit system calls. (_but) this code works on both 32 and 64
bit systems.
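
To connect these three instructions back to the 12 bytes listed in section 1.2.2, here is a toy Python decoder. It recognises only the three opcodes this program uses (0xb8, 0xbb, 0xcd) and is in no way a general x86 disassembler; the function and table names are invented for the example.

```python
# The 12 bytes of the exit program, as listed in section 1.2.2.
PROGRAM = bytes([184, 1, 0, 0, 0, 187, 5, 0, 0, 0, 205, 128])

# Illustrative tables covering only the opcodes this program uses.
MOV_IMM32 = {0xB8: "eax", 0xBB: "ebx"}   # mov register, 32-bit constant
INT_IMM8 = 0xCD                          # int, 8-bit interrupt number

def disassemble(code):
    """Turn the byte sequence into a list of assembly source lines."""
    lines = []
    i = 0
    while i < len(code):
        opcode = code[i]
        if opcode in MOV_IMM32:
            # The 4-byte constant is stored least significant byte first.
            constant = int.from_bytes(code[i + 1:i + 5], "little")
            lines.append(f"mov {MOV_IMM32[opcode]}, {constant}")
            i += 5
        elif opcode == INT_IMM8:
            lines.append(f"int {code[i + 1]:#x}")
            i += 2
        else:
            raise ValueError(f"unknown opcode {opcode:#x}")
    return lines

for line in disassemble(PROGRAM):
    print(line)
```

Each mov takes 5 bytes (one opcode byte plus a 4-byte constant) and the int takes 2, which accounts for all 12 bytes.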

1.5 Assembling And Linking

We use the yasm assembler to produce an object file from an assembly source code
file.
The yasm assembler is modeled after the nasm assembler and produces object code
that works properly with gdb and ddd debuggers (while nasm did not, during the
author's testing).

> yasm -f elf64 -g dwarf2 -l exit.lst exit.asm


-f elf64 selects a 64 bit output format which is compatible with Linux and gcc
-g dwarf2 selects the dwarf2 debugging format, which is essential for use with
a debugger
-l exit.lst asks for a listing file which shows the generated code in
hexadecimal

The yasm command produces an object file named exit.o which contains the generated
instructions and data in a form ready to link with other code from other object
files or libraries.

(KEY) In the case of an assembly program with the _start function, the linking
needs to be done with ld.

> ld -o exit exit.o

For an assembly program that defines main instead, link with gcc (gcc -o exit
exit.o). In this case gcc will incorporate its own version of _start and will call
main from _start, directly or indirectly.

Then we run the program with

> ./exit

Exercises

1. Enter the assembly language program from this chapter and assemble and link
it. Then execute the program and enter echo $? .

A non-zero status indicates an error. Change the program to yield a zero status.

ravi@yantra:~/.../code$ ./exit
ravi@yantra:~/.../code$ echo $?
5

To change the program to yield a zero status, just change

mov ebx, 5 to mov ebx, 0

then reassemble and relink.

ravi@yantra:~/.../code$ ld -o exit exit.o


ravi@yantra:~/.../code$ ./exit
ravi@yantra:~/.../code$ echo $?
0
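
The same check can be made from a program rather than from the shell. The Python sketch below does not run the assembly program; as a stand-in it launches a child Python process that exits with a chosen status, then reads that status back the way the shell's $? does. exit_status is a name made up for this example.

```python
import subprocess
import sys

def exit_status(code):
    """Run a child process that exits with `code`; return the status seen."""
    # Stand-in child: a Python one-liner instead of the ./exit binary.
    child = [sys.executable, "-c", f"import sys; sys.exit({code})"]
    return subprocess.run(child).returncode

print(exit_status(5), exit_status(0))
```

Replacing the child command with the path to the assembled exit binary should report 5 or 0 in the same way, assuming the binary was built as described above.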

2. (a) Modify the assembly program to define main rather than _start. Assemble it
and link it with gcc. What is the difference in the size of the executables?

When assembled and linked as above:


ravi@yantra:~/.../code$ ls -l
total 20
-rwxrwxr-x 1 ravi ravi 5328 Aug 5 12:44 exit
-rw-rw-r-- 1 ravi ravi 278 Aug 5 12:44 exit.asm
-rw-rw-r-- 1 ravi ravi 726 Aug 5 12:44 exit.lst
-rw-rw-r-- 1 ravi ravi 1632 Aug 5 12:44 exit.o

Conceptually we should be able to assemble and link with gcc. gcc cannot assemble
a yasm/nasm syntax .asm file itself (it expects gas syntax in .s/.S files), but it
can link the exit.o object file that yasm produces.

see https://stackoverflow.com/questions/43960208/how-to-use-gcc-to-compile-a-nasm-syntax-asm-file

3. Why is 0 true (success indicator) and non-zero false in the shell?

https://stackoverflow.com/questions/2933843/why-0-is-true-but-false-is-1-in-the-shell
