
Computer Arithmetics

SC414: Introduction to Computer Architecture
Dr. Dhammika Elkaduwe
Department of Computer Engineering
Faculty of Engineering
University of Peradeniya
Recap

ISA

Instruction encoding

Simple programs in assembly

Function calling methods

Compilation, linking and loading process

Memory hierarchy

How cache memories work

Virtual memory
Today's plan

Look at computer arithmetic

Design trade-offs

Floating point numbers
Chapter 3: Computer Organization and Design: The Hardware/Software Interface
How to represent numbers?

Computers use binary

Convert base 10 into binary (or from any other base)

Conversion can be done by repeated division by 2

Example: 10₁₀ = 1010₂
What about negative
numbers?
Simple solution: add a sign bit

Use one bit (MSB) to indicate the sign

MSB = 0 → positive, MSB = 1 → negative

Called the “sign and magnitude” method

Issues with the approach

Waste a bit

Arithmetic operations are difficult: need to reset
the sign bit

Have negative and positive zero (may confuse
some applications)
What about negative
numbers?
Two's complement

Positive numbers are represented as before (0 up to 2^(n−1) − 1, where n is the number of bits used)

Negative numbers are represented as
a complement

Get the corresponding positive value

Flip all the bits

Add one
Two's complement

With n bits we can represent −2^(n−1) up to 2^(n−1) − 1

Note that it is not symmetric around zero

Example: suppose we use 4 bits, then

Most negative number is: 1000 (−8)

Next most negative number is: 1001 (−7)

Most positive number: 0111 (7)
Exercises

Find the decimal values of the following 8-bit two's complement numbers, written in hexadecimal format

AB

6A

FF

EF
Exercises

Find the 8-bit two's complement of the negative numbers below, do the arithmetic operation and find the corresponding decimal value

1 - 9

8 - 16

-1 + 4
Advantages of two's
complement

Main advantage: simplifies the
hardware

Asymmetric around 0

Better range than sign magnitude
method

Used in almost all computers
Subtraction

SUB $r1, $r1, $r2 → $r1 = $r1 − $r2

Find the two's complement of the
value in $r2

Use the addition circuit

The end result would be subtraction

The result is in two's complement representation

Works because the carry out of the sign bit is discarded (arithmetic is modulo 2^n)
Watch-out for overflows

Will you ever come out of the following loops?

The value changes sign when it reaches the maximum (note: overflowing a signed int is undefined behaviour in C, although it typically wraps)

#include <stdio.h>

int main()
{
    int i = 0;
    while (i-- < 1)
        ;
    printf("Should not be here!\n");
}

#include <stdio.h>

int main()
{
    int i = 1;
    while (i++ > 0)
        ;
    printf("Should not be here\n");
}
Loading signed values

Some ISAs provide both signed and unsigned loads

In MIPS the basic loads are signed

The sign bit is loaded as well

Load byte (LB) and load half (LH) sign-extend: the sign bit is copied into the remaining 24 and 16 MSBs respectively
Summary of signed
numbers

Sign and magnitude

MSB gives the sign

Two's complement

Negative numbers are represented as 2^n − ABS(x), where n is the number of bits used for the representation

One's complement

Negative numbers are represented as (2^n − 1) − ABS(x)
Note on the overflow
Overflow can occur when

Adding large positive (or large negative) numbers

Subtracting a larger number from a positive one

This is how we get a negative result: a borrow from the sign bit
Overflow condition

At times we want to ignore the overflow

Example: dealing with memory pointers

MIPS provides two types of instructions for this

add, addi and sub cause an exception on overflow

addu, addiu and subu do not cause overflow exceptions

Compiler selects the correct instruction
based on the data type (example, unsigned
int or int)

Note that C ignores the overflow exception
Floating point

Integers are limited

We need a way to support fractions

So called “real” numbers

Things to ask ourselves:

How to convert a fraction into binary?

How numbers can be represented

How can we encode them?
Recall: scientific notation

Example: 1.2 × 10^3

Idea:

One digit leading the decimal point

Convert the following into scientific
notation

10.1223

0.00012

-12.34
Binary numbers

11011011 → can we represent it using scientific notation?
Binary numbers

11011011 → can we represent it using scientific notation?

1.1011011 × 2^7

The point is called the binary point
(not the decimal point we know)

How to encode such numbers?
We need to track:

Fraction

Exponent
Observation

1.1001011 × 2^4

We cannot have anything other than one as the first digit (of a normalized number)

So, we can drop it from our representation and add it back when we want to work with the number

We call what remains the significand

1.significand = fraction

Also called the 'hidden one' method
Designing time

We need some number of bits for the significand and some number of bits for the exponent

It is a trade-off between accuracy and range

More bits for the exponent → higher range

More bits for the significand → more accuracy
MIPS floating point

MSB → sign of the number

Next 11 bits → exponent, using a sign and magnitude representation

Remaining 20 bits → fraction
Before going further...

How do we represent 0.45 in binary?

How do we represent 0.125 in binary?

0.125 × 2 = 0.25 → 0
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1

0.125₁₀ = 0.001₂
IEEE 754 floating point numbers

Similar to MIPS floating point in the use of bits

However, the representation is different

32 bits

Value = (−1)^s × (1 + significand) × 2^(exponent − 127)
Example

Represent 0.5 using the IEEE 754
standard
Example

Represent 0.5 using the IEEE 754
standard

0.5₁₀ = 0.1₂ = 1.0 × 2^(−1)

S = 0, exponent = 126, significand = 0
Example:

What is the number given below in IEEE 754
format?
Example:

What is the number given below in IEEE 754
format?

X = 1.01₂ × 2^(129 − 127) = 1.25 × 4 = 5₁₀
IEEE 754 format

Has representations to denote exceptions (such as 0/0)

They are typically shown as NaN (Not
a Number)

The representation is selected so that
comparisons can be fast

First the sign

Then the exponent
Thinking time....

What is the maximum number that can be
represented using IEEE-754 format?

What is the minimum number that can be
represented using IEEE-754 format?

Can you represent 0?

Represent 0.6 using IEEE-754
Issue:

We cannot represent some numbers accurately

Loss of precision

We get some strange results when doing some computations!

On the other hand, we might not need this much precision
Double precision

Tries to address the underflow condition

64-bit floating point numbers

The additional 32 bits mostly go to the significand (which grows to 52 fraction bits; the exponent also grows, to 11 bits)
Exceptions

Overflows:

The number is too big to be represented using the available exponent bits

(unlike integers, we also have a second issue)

Underflows:

The fraction and exponent bits are not enough to represent the number

You end up with zero

A common mistake when doing arithmetic

Both overflow and underflow raise exceptions
Note

Some programming languages
support arbitrary precision numbers

Either via libraries (example C)

Or via the language (example Haskell)

So that,

You will not lose accuracy

Or get wrong results
Floating point addition

We cannot add them as they are

First we need to align the binary points

Then do the addition of the fractions

Then adjust (normalize) the exponent accordingly
Flow chart:
floating point
addition
MIPS instructions for floating point numbers

Provides instructions for IEEE 754 single
and double precision numbers

Instructions include: addition, subtraction, multiplication, division, comparison

Example:

add.s → single precision addition

mul.d → double precision multiplication
IA32 floating point unit

Co-processor to handle floating point arithmetic

Separate registers

About 100 instructions for dealing with floating point

Data movement, comparisons, addition, subtraction, square-root, sine, ...

Stack architecture to deal with floating point numbers

Loads push arguments onto the stack

Operations pop them from there

Operands are converted into a different format before being pushed onto the stack
Issues with floating point

Associative law: (x + y) + z = x + (y + z)

Test it out: x = −1.5 × 10^38, y = 1.5 × 10^38, z = 1

Assume single precision

And note that precision is limited
And note that precision is limited