
Computer Arithmetic

Computer Arithmetic
• Computer arithmetic is a field of computer science that investigates how
computers should represent numbers and perform operations on them.
• It deals with methods of representing integers and real values (e.g., fixed- and
floating-point numbers) in digital systems.
• Computer memory is organized to give only a certain amount of space to
represent each number, in multiples of bytes, each containing 8 bits. Most
commonly used are 32-bit and 64-bit representations.
• Calculations in a computer are sometimes described as finite-precision arithmetic: only a fixed number of digits is available to express a value. Since many results are not exactly representable, any computation that produces such a number must be handled either by issuing an error or by approximating the result.
Integer Representation
• Sign magnitude: Sign magnitude is a very simple representation of integer numbers. In sign magnitude, the first bit is dedicated to representing the sign and is hence called the sign bit.
• Sign bit ‘1’ represents negative sign.
• Sign bit ‘0’ represents positive sign.
Integer Representation
• In the sign-magnitude representation of an n-bit number, the first bit represents the sign and the remaining n-1 bits represent the magnitude of the number.
• +25 = 00011001
Where 11001 = 25
And 0 for ‘+’
• -25 = 10011001
Where 11001 = 25
And 1 for ‘-‘
• For an n-bit word, the range would be from -(2^(n-1) - 1) to +(2^(n-1) - 1).
• The numbers above or below the range can’t be represented.
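
A minimal Python sketch of this encoding (illustrative only, not from the slides; sign_magnitude is a made-up helper name, assuming an 8-bit word by default):

def sign_magnitude(value, bits=8):
    """Return the sign-magnitude bit string of value using the given width."""
    magnitude = abs(value)
    if magnitude >= 2 ** (bits - 1):
        raise ValueError("magnitude does not fit in the available bits")
    sign_bit = '1' if value < 0 else '0'                   # first bit: 0 for '+', 1 for '-'
    return sign_bit + format(magnitude, f'0{bits - 1}b')   # remaining n-1 bits: magnitude

print(sign_magnitude(+25))   # 00011001
print(sign_magnitude(-25))   # 10011001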
Integer Representation
• 2's complement method: To represent a negative number in this form, we first take the 1's complement of the number written in simple positive binary form and then add 1 to it.

• (8)10 = (1000)2
• 1's complement of 1000 = 0111
• Adding 1 to it: 0111 + 1 = 1000

• So, in 4-bit 2's complement, (-8)10 = (1000)2
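
A small Python sketch of this recipe (illustrative, not from the slides; twos_complement is a made-up name), which also shows -25 in 8 bits:

def twos_complement(value, bits=4):
    """Return the bits-wide 2's complement pattern of a (possibly negative) integer."""
    if not -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1:
        raise ValueError("value does not fit in the given width")
    if value >= 0:
        return format(value, f'0{bits}b')
    ones = (2 ** bits - 1) ^ abs(value)    # 1's complement of the magnitude
    return format(ones + 1, f'0{bits}b')   # add 1 to get the 2's complement

print(twos_complement(-8))        # 1000
print(twos_complement(-25, 8))    # 11100111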


Floating Point Representation
• Floating point representation is based on exponential (or scientific) notation. In exponential notation, a nonzero real number x is expressed as
x = m × b^e
m = Mantissa/Significand
b = Base
e = Exponent

• Decimal numbers use a radix of 10 (m × 10^e), while binary numbers use a radix of 2 (m × 2^e).
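
As a side note (not from the slides), Python's standard math.frexp performs this kind of decomposition for radix 2, returning a mantissa m and exponent e with x = m × 2^e:

import math

# frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1
# (a slightly different normalization than the 1.xxxx form used later).
m, e = math.frexp(55.66)
print(m, e)           # approximately 0.8696875 and 6
print(m * 2 ** e)     # 55.66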
Floating Point Representation
• The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1. The fractional part can be normalized.
• In the normalized form, there is only a single non-zero digit before the radix point.
For example, decimal number 123.4567 can be normalized as 1.234567×10^2;
binary number 1010.1011B can be normalized as 1.0101011B×2^3.

• Consider the value 1.23 x 10^4:
The number has a sign (+ in this case).
The significand (1.23) is written with one non-zero digit to the left of the decimal point.
The base (radix) is 10.
The exponent (an integer value) is 4. It too must have a sign.
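
As a quick illustration (not part of the slides), the binary normalization example above can be checked in Python by evaluating both forms:

# 1010.1011 in binary: integer part 1010b, fractional part .1011b (4 bits)
unnormalized = int('1010', 2) + int('1011', 2) / 2 ** 4   # = 10.6875
# Normalized form 1.0101011b * 2^3: 10101011b scaled down by 2^7, then up by 2^3
normalized = int('10101011', 2) / 2 ** 7 * 2 ** 3          # = 10.6875
print(unnormalized == normalized)                          # True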
Floating Point Representation
• In computers, floating-point numbers are represented in scientific notation of
fraction (m) and exponent (e) with a radix of 2, in the form of m*2^e. Both e and
m can be positive as well as negative.
• Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.
• Both representations have three fields:
IEEE-754 32-bit Single-Precision Floating-Point Numbers
• In 32-bit single-precision floating-point representation:
• The most significant bit is the sign bit (S), with 0 for positive numbers and 1
for negative numbers.
• The following 8 bits represent the exponent (e).
• The remaining 23 bits represent the fraction (m).
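
As an aside not found in the slides, the split into these three fields can be inspected with Python's standard struct module; float32_fields is just an illustrative helper name:

import struct

def float32_fields(x):
    """Return the sign, exponent, and fraction bit strings of x as an IEEE-754 float32."""
    bits, = struct.unpack('>I', struct.pack('>f', x))   # raw 32-bit pattern as an integer
    sign     = (bits >> 31) & 0x1                       # 1 bit
    exponent = (bits >> 23) & 0xFF                      # next 8 bits
    fraction = bits & 0x7FFFFF                          # remaining 23 bits
    return format(sign, '01b'), format(exponent, '08b'), format(fraction, '023b')

print(float32_fields(3.625))   # ('0', '10000000', '11010000000000000000000')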
IEEE-754 32-bit Single-Precision Floating-Point Numbers
Representing 3.625 in 32-bit format:

• Converting 3 to binary: 3 = 11
• Converting .625 to binary (by repeatedly multiplying the fractional part by 2; a short code sketch follows this slide): .625 = .101
  .625 × 2 = 1.25 → 1 (keep .25)
  .25 × 2 = 0.5 → 0 (keep .5)
  .5 × 2 = 1.0 → 1

• Writing in binary exponent form:
  3.625 = 11.101 × 2^0
• On normalizing:
  11.101 × 2^0 = 1.1101 × 2^1
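
A minimal Python sketch of this repeated-multiplication method (illustrative only; fraction_to_binary is a made-up name, and it assumes the fraction terminates within max_bits):

def fraction_to_binary(frac, max_bits=23):
    """Convert a decimal fraction in [0, 1) to its binary digits by repeated x2."""
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # integer part becomes the next binary digit
        bits.append(str(bit))
        frac -= bit            # keep only the fractional part for the next step
    return ''.join(bits)

print(fraction_to_binary(0.625))   # 101
print(fraction_to_binary(0.125))   # 001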
IEEE-754 32-bit Single-Precision Floating-Point Numbers

• We have 3 elements in a 32-bit floating point representation.
• Sign (MSB)
• Exponent (8 bits after MSB)
• Mantissa (Remaining 23 bits)

• The sign bit is the first bit of the binary representation: '1' implies a negative number and '0' implies a positive number.
To convert 3.625 into 32-bit floating-point representation, the sign bit = 0.
IEEE-754 32-bit Single-Precision Floating-Point Numbers
• The exponent is decided by the largest power of 2 that is smaller than or equal to the number. For 3.625, the normalized form is 1.1101 × 2^1, so the exponent of 2 is 1.
• 127 is the bias for the 32-bit floating-point representation. It is determined by 2^(k-1) - 1, where 'k' is the number of bits in the exponent field.
Thus bias = 127 for 32 bits (2^(8-1) - 1 = 128 - 1 = 127).
Now, the biased exponent is 1 + 127 = 128, i.e. 10000000 in binary representation.
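
For reference, a tiny Python sketch of the bias calculation (illustrative only; biased_exponent is a made-up name):

def biased_exponent(e, k=8):
    """Return the stored (biased) exponent field for a true exponent e with k exponent bits."""
    bias = 2 ** (k - 1) - 1          # 127 for k = 8, 1023 for k = 11
    return format(e + bias, f'0{k}b')

print(biased_exponent(1))            # 10000000    (32-bit case, 3.625)
print(biased_exponent(-3, k=11))     # 01111111100 (64-bit case, -0.125)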
IEEE-754 32-bit Single-Precision Floating-Point Numbers
• Mantissa: 3.625 in binary = 11.101 × 2^0. Move the binary point so that there is only one bit to its left, and adjust the exponent of 2 so that the value does not change. This gives the normalized number 1.1101 × 2^1. Since the leading bit of the mantissa is always 1, there is no need to store it.

• Now, consider the digits after the binary point: 1101
• Expanding to 23 bits: 11010000000000000000000
• Thus the floating-point representation of 3.625 is
0 10000000 11010000000000000000000
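
As a sanity check (not from the slides), the derived bit pattern can be decoded back to 3.625 with Python's struct module:

import struct

# Reassemble the fields derived above: sign 0, exponent 10000000, fraction 1101 followed by 19 zeros.
bits = int('0' + '10000000' + '1101' + '0' * 19, 2)
value, = struct.unpack('>f', struct.pack('>I', bits))
print(value)   # 3.625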
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:
• The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
• The following 11 bits represent the exponent (e).
• The remaining 52 bits represent the fraction (m).
IEEE-754 64-bit Double-Precision Floating-Point Numbers

Representing -1/8 = -0.125 in 64-bit format:

• Converting 0 to binary: 0 = 0
• Converting .125 to binary (by the same repeated multiplication by 2): .125 = .001
  .125 × 2 = 0.25 → 0 (keep .25)
  .25 × 2 = 0.5 → 0 (keep .5)
  .5 × 2 = 1.0 → 1

• Writing in binary exponent form:
  0.125 = 0.001 × 2^0
• On normalizing:
  0.001 × 2^0 = 1.00000 × 2^-3
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• We have 3 elements in a 64-bit floating point representation.
• Sign (MSB)
• Exponent (11 bits after MSB)
• Mantissa (Remaining 52 bits)

• The sign bit is the first bit of the binary representation: '1' implies a negative number and '0' implies a positive number.
To convert -0.125 into 64-bit floating-point representation, the sign bit = 1.
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• The exponent is decided by the largest power of 2 that is smaller than or equal to the number. For 0.125, the normalized form is 1.00000 × 2^-3, so the exponent of 2 is -3.
• 1023 is the bias for the 64-bit floating-point representation. It is determined by 2^(k-1) - 1, where 'k' is the number of bits in the exponent field.
Thus bias = 1023 for 64 bits (2^(11-1) - 1 = 1024 - 1 = 1023).
Now, the biased exponent is -3 + 1023 = 1020, i.e. 01111111100 in binary representation.
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• Mantissa: 0.125 in binary = 0.001 × 2^0. Move the binary point so that there is only one bit to its left, and adjust the exponent of 2 so that the value does not change. This gives the normalized number 1.00000 × 2^-3. Since the leading bit of the mantissa is always 1, there is no need to store it.

• Now, consider the digits after the binary point: 00000 (all zeros)
• Expanding to 52 bits gives a fraction field of all zeros.
• Thus the floating-point representation of -0.125 is
1 01111111100 0000000000000000000000000000000000000000000000000000
(1 sign bit, 11 exponent bits, and 52 fraction bits, all zero)
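
Similarly, a small sanity check (not part of the slides) that -0.125 really packs to this 64-bit pattern, using Python's struct module:

import struct

bits, = struct.unpack('>Q', struct.pack('>d', -0.125))   # raw 64-bit pattern as an integer
pattern = format(bits, '064b')
print(pattern[0], pattern[1:12], pattern[12:])
# 1 01111111100 0000000000000000000000000000000000000000000000000000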
