
Computer Arithmetic

Computer Arithmetic
• Computer arithmetic is a field of computer science that investigates how
computers should represent numbers and perform operations on them.
• It deals with methods of representing integers and real values (e.g., fixed- and
floating-point numbers) in digital systems.
• Computer memory is organized to give only a certain amount of space to
represent each number, in multiples of bytes, each containing 8 bits. Most
commonly used are 32-bit and 64-bit representations.
• Calculations in a computer are sometimes described as finite-precision arithmetic: only a fixed number of digits is available to express a value. Since many results are not exactly representable, any computation that produces such a number must be handled either by issuing an error or by approximating the result.
Integer Representation
• Sign magnitude: Sign magnitude is a very simple representation of integer numbers. In sign magnitude, the first bit is dedicated to representing the sign and is hence called the sign bit.
• Sign bit ‘1’ represents negative sign.
• Sign bit ‘0’ represents positive sign.
Integer Representation
• In the sign-magnitude representation of an n-bit number, the first bit represents the sign and the remaining n-1 bits represent the magnitude of the number.
• +25 = 00011001
Where 11001 = 25
And 0 for ‘+’
• -25 = 10011001
Where 11001 = 25
And 1 for ‘-‘
• For an n-bit word, the range would be from -(2^(n-1) - 1) to +(2^(n-1) - 1).
• The numbers above or below the range can’t be represented.
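
A minimal Python sketch of this encoding (illustrative only, not from the slides; sign_magnitude is a made-up helper name, assuming an 8-bit word by default):

def sign_magnitude(value, bits=8):
    """Return the sign-magnitude bit string of value using the given width."""
    magnitude = abs(value)
    if magnitude >= 2 ** (bits - 1):
        raise ValueError("magnitude does not fit in the available bits")
    sign_bit = '1' if value < 0 else '0'                   # first bit: 0 for '+', 1 for '-'
    return sign_bit + format(magnitude, f'0{bits - 1}b')   # remaining n-1 bits: magnitude

print(sign_magnitude(+25))   # 00011001
print(sign_magnitude(-25))   # 10011001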
Integer Representation
• 2's complement method: To represent a negative number in this form, we first take the 1's complement of the number written in simple positive binary form and then add 1 to it.

• (8)10 = (1000)2
• 1's complement of 1000 = 0111
• Adding 1 to it: 0111 + 1 = 1000

• So, in 4-bit 2's complement, (-8)10 = (1000)2
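
A small Python sketch of this recipe (illustrative, not from the slides; twos_complement is a made-up name), which also shows -25 in 8 bits:

def twos_complement(value, bits=4):
    """Return the bits-wide 2's complement pattern of a (possibly negative) integer."""
    if not -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1:
        raise ValueError("value does not fit in the given width")
    if value >= 0:
        return format(value, f'0{bits}b')
    ones = (2 ** bits - 1) ^ abs(value)    # 1's complement of the magnitude
    return format(ones + 1, f'0{bits}b')   # add 1 to get the 2's complement

print(twos_complement(-8))        # 1000
print(twos_complement(-25, 8))    # 11100111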


Floating Point Representation
• Floating point representation is based on exponential (or scientific) notation. In exponential notation, a nonzero real number x is expressed as
x = m × b^e
m = Mantissa/Significand
b = Base
e = Exponent

• Decimal numbers use a radix of 10 (m × 10^e), while binary numbers use a radix of 2 (m × 2^e).
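
As a side note (not from the slides), Python's standard math.frexp performs this kind of decomposition for radix 2, returning a mantissa m and exponent e with x = m × 2^e:

import math

# frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1
# (a slightly different normalization than the 1.xxxx form used later).
m, e = math.frexp(55.66)
print(m, e)           # approximately 0.8696875 and 6
print(m * 2 ** e)     # 55.66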
Floating Point Representation
• The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1. The fractional part can be normalized.
• In the normalized form, there is only a single non-zero digit before the radix point.
For example, decimal number 123.4567 can be normalized as 1.234567×10^2;
binary number 1010.1011B can be normalized as 1.0101011B×2^3.

• Consider the value 1.23 x 10^4:
The number has a sign (+ in this case).
The significand (1.23) is written with one non-zero digit to the left of the decimal point.
The base (radix) is 10.
The exponent (an integer value) is 4. It too must have a sign.
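
As a quick illustration (not part of the slides), the binary normalization example above can be checked in Python by evaluating both forms:

# 1010.1011 in binary: integer part 1010b, fractional part .1011b (4 bits)
unnormalized = int('1010', 2) + int('1011', 2) / 2 ** 4   # = 10.6875
# Normalized form 1.0101011b * 2^3: 10101011b scaled down by 2^7, then up by 2^3
normalized = int('10101011', 2) / 2 ** 7 * 2 ** 3          # = 10.6875
print(unnormalized == normalized)                          # True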
Floating Point Representation
• In computers, floating-point numbers are represented in scientific notation of
fraction (m) and exponent (e) with a radix of 2, in the form of m*2^e. Both e and
m can be positive as well as negative.
• Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.
• Both representations have three fields:
IEEE-754 32-bit Single-Precision Floating-Point Numbers
• In 32-bit single-precision floating-point representation:
• The most significant bit is the sign bit (S), with 0 for positive numbers and 1
for negative numbers.
• The following 8 bits represent the exponent (e).
• The remaining 23 bits represent the fraction (m).
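
As an aside not found in the slides, the split into these three fields can be inspected with Python's standard struct module; float32_fields is just an illustrative helper name:

import struct

def float32_fields(x):
    """Return the sign, exponent, and fraction bit strings of x as an IEEE-754 float32."""
    bits, = struct.unpack('>I', struct.pack('>f', x))   # raw 32-bit pattern as an integer
    sign     = (bits >> 31) & 0x1                       # 1 bit
    exponent = (bits >> 23) & 0xFF                      # next 8 bits
    fraction = bits & 0x7FFFFF                          # remaining 23 bits
    return format(sign, '01b'), format(exponent, '08b'), format(fraction, '023b')

print(float32_fields(3.625))   # ('0', '10000000', '11010000000000000000000')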
IEEE-754 32-bit Single-Precision Floating-Point Numbers
Representing 3.625 in 32-bit format:

• Converting 3 to binary: 3 = 11
• Converting .625 to binary (by repeatedly multiplying the fractional part by 2; a short code sketch follows this slide): .625 = .101
  .625 × 2 = 1.25 → 1 (keep .25)
  .25 × 2 = 0.5 → 0 (keep .5)
  .5 × 2 = 1.0 → 1

• Writing in binary exponent form:
  3.625 = 11.101 × 2^0
• On normalizing:
  11.101 × 2^0 = 1.1101 × 2^1
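
A minimal Python sketch of this repeated-multiplication method (illustrative only; fraction_to_binary is a made-up name, and it assumes the fraction terminates within max_bits):

def fraction_to_binary(frac, max_bits=23):
    """Convert a decimal fraction in [0, 1) to its binary digits by repeated x2."""
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # integer part becomes the next binary digit
        bits.append(str(bit))
        frac -= bit            # keep only the fractional part for the next step
    return ''.join(bits)

print(fraction_to_binary(0.625))   # 101
print(fraction_to_binary(0.125))   # 001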
IEEE-754 32-bit Single-Precision Floating-Point Numbers

• We have 3 elements in a 32-bit floating point representation.
• Sign (MSB)
• Exponent (8 bits after MSB)
• Mantissa (Remaining 23 bits)

• The sign bit is the first bit of the binary representation: '1' implies a negative number and '0' implies a positive number.
To convert 3.625 into 32-bit floating-point representation, the sign bit = 0.
IEEE-754 32-bit Single-Precision Floating-Point Numbers
• The exponent is decided by the largest power of 2 that is smaller than or equal to the number. For 3.625, the normalized form is 1.1101 × 2^1, so the exponent of 2 is 1.
• 127 is the bias for the 32-bit floating-point representation. It is determined by 2^(k-1) - 1, where 'k' is the number of bits in the exponent field.
Thus bias = 127 for 32 bits (2^(8-1) - 1 = 128 - 1 = 127).
Now, the biased exponent is 1 + 127 = 128, i.e. 10000000 in binary representation.
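
For reference, a tiny Python sketch of the bias calculation (illustrative only; biased_exponent is a made-up name):

def biased_exponent(e, k=8):
    """Return the stored (biased) exponent field for a true exponent e with k exponent bits."""
    bias = 2 ** (k - 1) - 1          # 127 for k = 8, 1023 for k = 11
    return format(e + bias, f'0{k}b')

print(biased_exponent(1))            # 10000000    (32-bit case, 3.625)
print(biased_exponent(-3, k=11))     # 01111111100 (64-bit case, -0.125)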
IEEE-754 32-bit Single-Precision Floating-Point Numbers
• Mantissa: 3.625 in binary = 11.101 × 2^0. Move the binary point so that there is only one bit to its left, and adjust the exponent of 2 so that the value does not change. This gives the normalized number 1.1101 × 2^1. Since the leading bit of the mantissa is always 1, there is no need to store it.

• Now, consider the digits after the binary point: 1101
• Expanding to 23 bits: 11010000000000000000000
• Thus the floating-point representation of 3.625 is
0 10000000 11010000000000000000000
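
As a sanity check (not from the slides), the derived bit pattern can be decoded back to 3.625 with Python's struct module:

import struct

# Reassemble the fields derived above: sign 0, exponent 10000000, fraction 1101 followed by 19 zeros.
bits = int('0' + '10000000' + '1101' + '0' * 19, 2)
value, = struct.unpack('>f', struct.pack('>I', bits))
print(value)   # 3.625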
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:
• The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
• The following 11 bits represent the exponent (e).
• The remaining 52 bits represent the fraction (m).
IEEE-754 64-bit Double-Precision Floating-Point Numbers

Representing -1/8 = -0.125 in 64-bit format:

• Converting 0 to binary: 0 = 0
• Converting .125 to binary (by the same repeated multiplication by 2): .125 = .001
  .125 × 2 = 0.25 → 0 (keep .25)
  .25 × 2 = 0.5 → 0 (keep .5)
  .5 × 2 = 1.0 → 1

• Writing in binary exponent form:
  0.125 = 0.001 × 2^0
• On normalizing:
  0.001 × 2^0 = 1.00000 × 2^-3
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• We have 3 elements in a 64-bit floating point representation.
• Sign (MSB)
• Exponent (11 bits after MSB)
• Mantissa (Remaining 52 bits)

• The sign bit is the first bit of the binary representation: '1' implies a negative number and '0' implies a positive number.
To convert -0.125 into 64-bit floating-point representation, the sign bit = 1.
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• The exponent is decided by the largest power of 2 that is smaller than or equal to the number. For 0.125, the normalized form is 1.00000 × 2^-3, so the exponent of 2 is -3.
• 1023 is the bias for the 64-bit floating-point representation. It is determined by 2^(k-1) - 1, where 'k' is the number of bits in the exponent field.
Thus bias = 1023 for 64 bits (2^(11-1) - 1 = 1024 - 1 = 1023).
Now, the biased exponent is -3 + 1023 = 1020, i.e. 01111111100 in binary representation.
IEEE-754 64-bit Double-Precision Floating-Point Numbers

• Mantissa: 0.125 in binary = 0.001 × 2^0. Move the binary point so that there is only one bit to its left, and adjust the exponent of 2 so that the value does not change. This gives the normalized number 1.00000 × 2^-3. Since the leading bit of the mantissa is always 1, there is no need to store it.

• Now, consider the digits after the binary point: 00000 (all zeros)
• Expanding to 52 bits gives a fraction field of all zeros.
• Thus the floating-point representation of -0.125 is
1 01111111100 0000000000000000000000000000000000000000000000000000
(1 sign bit, 11 exponent bits, and 52 fraction bits, all zero)
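
Similarly, a small sanity check (not part of the slides) that -0.125 really packs to this 64-bit pattern, using Python's struct module:

import struct

bits, = struct.unpack('>Q', struct.pack('>d', -0.125))   # raw 64-bit pattern as an integer
pattern = format(bits, '064b')
print(pattern[0], pattern[1:12], pattern[12:])
# 1 01111111100 0000000000000000000000000000000000000000000000000000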
