
Computer Representation of Numbers and Computer Arithmetic

In a computer, numbers are represented by the binary digits 0 and 1.

Computers employ binary arithmetic for performing operations on numbers. Since it is cumbersome to display large numbers in binary form, computers usually display them in the hexadecimal, octal or decimal system. All of these number systems are positional systems. In a positional system a number is represented by a sequence of symbols, each of which denotes a particular value depending on its position. The number of symbols used in a positional system is equal to its 'base'. Let us now discuss the various positional number systems.

Decimal System: The decimal system uses 10 as its base and employs the ten symbols 0 to 9 in representing numbers. Let us consider the decimal number 7402, consisting of the four symbols 7, 4, 0, 2. In terms of base 10 it can be expressed as follows:

$$7402 = 7 \times 10^3 + 4 \times 10^2 + 0 \times 10^1 + 2 \times 10^0$$

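So each symbol of a number is multiplied by a power of the base (10) determined by its position counted from the right. The count always begins with 0. In general, a decimal number consisting of $n$ symbols $a_{n-1} a_{n-2} \ldots a_1 a_0$ can be expressed as:

$$a_{n-1} \times 10^{n-1} + a_{n-2} \times 10^{n-2} + \cdots + a_1 \times 10^1 + a_0 \times 10^0$$

Similarly, the fractional part $0.a_{-1} a_{-2} \ldots a_{-m}$ of a decimal number can be expressed as

$$a_{-1} \times 10^{-1} + a_{-2} \times 10^{-2} + \cdots + a_{-m} \times 10^{-m}$$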

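Binary System: The binary system is the positional system consisting of the two symbols 0 and 1, with 2 as its base. Any binary number $a_{n-1} a_{n-2} \ldots a_1 a_0$ represents the decimal value given by

$$a_{n-1} \times 2^{n-1} + a_{n-2} \times 2^{n-2} + \cdots + a_1 \times 2^1 + a_0 \times 2^0$$

where each $a_i$ is either 0 or 1. Consider the binary number 10101. The decimal equivalent of 10101 is given by

$$1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 16 + 4 + 1 = 21$$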
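The positional rule can be checked with a few lines of Python. This is only a minimal sketch: the helper name `positional_value` is hypothetical, and the symbol table assumes bases up to 16.

```python
def positional_value(digits, base):
    """Return the value of a digit string read as a number in the given base."""
    symbols = "0123456789ABCDEF"
    value = 0
    for ch in digits.upper():
        d = symbols.index(ch)      # the symbol's position in the table is its digit value
        if d >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        value = value * base + d   # same result as summing d * base**position
    return value

print(positional_value("10101", 2))   # 21
print(positional_value("7402", 10))   # 7402
```

The loop works from the leftmost digit, so each earlier digit ends up multiplied by the base once for every digit to its right, which is the power-of-the-base weighting described above.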

Hexadecimal System: The hexadecimal system is the positional system consisting of the sixteen symbols 0, 1, 2, ..., 9, A, B, C, D, E, F, with 16 as its base. Here the symbol A denotes 10, B denotes 11 and so on up to F, which denotes 15. The decimal equivalent of a hexadecimal number $a_{n-1} a_{n-2} \ldots a_1 a_0$ is given by

$$a_{n-1} \times 16^{n-1} + a_{n-2} \times 16^{n-2} + \cdots + a_1 \times 16^1 + a_0 \times 16^0$$

We can convert a binary number directly to a hexadecimal number by grouping the binary digits, starting from the right, into sets of four and converting each group to its equivalent hexadecimal digit. If the leftmost group falls short of four binary digits, prefix it with the required number of 0 digits.

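The reverse conversion works the same way: each hexadecimal digit is replaced by its four-bit binary equivalent.

Octal System: The octal system is the positional system that uses 8 as its base and $\{0, 1, 2, 3, 4, 5, 6, 7\}$ as its symbol set of size 8. The decimal equivalent of an octal number $a_{n-1} a_{n-2} \ldots a_1 a_0$ is given by

$$a_{n-1} \times 8^{n-1} + a_{n-2} \times 8^{n-2} + \cdots + a_1 \times 8^1 + a_0 \times 8^0$$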

We can get the octal equivalent of a binary number by grouping the binary digits, starting from the right, into sets of three and converting each set to its octal equivalent. If the leftmost set has fewer than three digits, it may be prefixed with the required number of 0 digits.
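The grouping rule for both hexadecimal (sets of four) and octal (sets of three) can be sketched as follows. The function name `regroup_binary` and the sample bit string are illustrative assumptions, not taken from the text.

```python
def regroup_binary(bits, group_size):
    """Convert a binary string to base 2**group_size by grouping from the right."""
    symbols = "0123456789ABCDEF"
    # Prefix zeros so that the length is a multiple of the group size.
    padded_len = -(-len(bits) // group_size) * group_size
    bits = bits.zfill(padded_len)
    groups = [bits[i:i + group_size] for i in range(0, len(bits), group_size)]
    return "".join(symbols[int(g, 2)] for g in groups)

print(regroup_binary("1101110", 4))   # groups 0110 1110 -> 6E
print(regroup_binary("1101110", 3))   # groups 001 101 110 -> 156
```

Padding on the left is what "grouping from the right" amounts to: once the length is a multiple of the group size, the groups can be read off left to right.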

Conversion of decimal system to non-decimal system: To convert a decimal number to a number in any other system, we consider the integer and fractional parts separately and follow the procedure below.

Conversion of integer part:
(a) Divide the integer part of the given decimal number by the base b of the new number system. The remainder is the rightmost digit of the integer part of the new number.
(b) Divide the quotient again by the base b. The remainder is the second digit from the right in the new system. Continue this process until the quotient is zero. The last remainder is the leftmost digit of the new number.

Conversion of fractional part:
(a) Multiply the fractional part of the given decimal number by the base b of the new system. The integral part of the product is the leftmost digit of the fractional part in the new system.
(b) Multiply the fractional part of the product from step (a) by the base b again. The integral part of the resulting product is the second digit from the left in the new system. Repeat this step until the fractional part becomes zero or repeats an earlier fractional part. The integral part of the last product is the rightmost digit of the fractional part of the new number.

Eg: Convert 54.45 into its binary equivalent.
(a) Consider the integer part, i.e. 54, and apply the steps listed under conversion of integer part:

Division    Quotient    Remainder
54 / 2      27          0
27 / 2      13          1
13 / 2      6           1
6 / 2       3           0
3 / 2       1           1
1 / 2       0           1

Reading the remainders from last to first gives $(54)_{10} = (110110)_2$.

(b) Conversion of fractional part:

Fraction    Product             Integral part (binary digit)
0.45        0.45 × 2 = 0.90     0
0.9         0.9 × 2 = 1.80      1
0.8         0.8 × 2 = 1.6       1
0.6         0.6 × 2 = 1.2       1
0.2         0.2 × 2 = 0.4       0
0.4         0.4 × 2 = 0.8       0
0.8         0.8 × 2 = 1.6       1
0.6         0.6 × 2 = 1.2       1
0.2         0.2 × 2 = 0.4       0
0.4         0.4 × 2 = 0.8       0
0.8         0.8 × 2 = 1.6       1

The fractional part 0.8 recurs, so the digits 1100 repeat and $(0.45)_{10} = (0.01\overline{1100})_2$. Hence $(54.45)_{10} = (110110.01\overline{1100})_2$. Here the overbar denotes the repetition of the binary digits.

Note: Using the binary system as an intermediate stage, we can easily convert octal numbers to hexadecimal numbers and vice-versa. To convert octal to hexadecimal, write each octal digit as three binary digits and regroup the binary digits into quadruplets; to convert hexadecimal to octal, write each hexadecimal digit as four binary digits and regroup them into triplets.
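The division and multiplication procedure for converting a decimal number to another base can be sketched in Python. The function names are hypothetical, and `Fraction` is used here (an assumption of this sketch) so that a repeated fractional part can be detected exactly.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"

def convert_integer(n, base):
    """Repeated division: the remainders give the digits from right to left."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(DIGITS[r])
    return "".join(reversed(digits))

def convert_fraction(frac, base, max_digits=64):
    """Repeated multiplication; returns (non-repeating digits, repeating digits)."""
    seen = {}      # fractional part -> index of the digit it produced
    digits = []
    while frac != 0 and frac not in seen and len(digits) < max_digits:
        seen[frac] = len(digits)
        frac *= base
        whole = int(frac)            # integral part of the product is the next digit
        digits.append(DIGITS[whole])
        frac -= whole
    if frac in seen:                 # a fractional part recurred: digits repeat from there
        start = seen[frac]
        return "".join(digits[:start]), "".join(digits[start:])
    return "".join(digits), ""

print(convert_integer(54, 2))                   # 110110
print(convert_fraction(Fraction("0.45"), 2))    # ('01', '1100')
```

The second result reads as 0.01 followed by the block 1100 repeated indefinitely, which matches the worked conversion of 54.45.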

Computer Representation of Numbers

Computers are designed to use binary digits to represent numbers and other information. The computer memory is organized into strings of bits, called words, all of the same length. Decimal numbers are first converted into their binary equivalents and are then represented in either integer or floating point form.

Integer Representation

The largest decimal number that can be represented in binary form in a computer depends on its word length. An n-bit word computer can handle an unsigned number as large as $2^n - 1$. For instance, a 16-bit word machine can represent numbers as large as $2^{16} - 1 = 65535$.

How do we represent negative numbers? Negative numbers are stored using 2's complement. This is obtained by taking the 1's complement of the binary representation of the positive number and then adding 1 to it.

In the binary representation of a positive number, an extra zero is appended to the left of the binary number to indicate that it is positive. If this extra leftmost binary digit is set to 1, it indicates that the binary number is negative. So the general convention for storing signed numbers is to append a binary digit 0 or 1 to the left of the binary number, depending on the positive or negative sign of the number. In an n-bit word computer, as one bit is reserved for the sign, at most $n - 1$ bits are available to store the magnitude of a signed number. So the largest signed number a 16-bit word can represent is $2^{15} - 1 = 32767$. On this machine, since zero is defined as

0000000000000000

it is redundant to use the number

1000000000000000

to define a "minus zero". It is usually employed to represent an additional negative number, i.e. $-2^{15} = -32768$, and hence the range of signed numbers that can be represented on a 16-bit word machine is from $-32768$ to $32767$.

Floating Point Representation

Fractional numbers, and large numbers that fall outside the range of a d-bit word machine, say a 16-bit word machine, are stored and processed in exponential form. In exponential form these numbers have an embedded decimal point and are called floating point numbers or real numbers. The floating point representation of a real number $x$ is

$$x = M \times 10^E$$

where $M$ is called the mantissa and $E$ is the exponent. Typically computers use a 32-bit representation for a floating point number. In the layout described here, the leftmost bit is reserved for the sign, the next seven bits are reserved for the exponent and the last twenty-four bits are used for the mantissa. (IEEE 754 single precision, by comparison, uses 1 sign bit, 8 exponent bits and 23 stored fraction bits.) The shifting of the decimal point to the left of the most significant digit is called normalization, and the numbers represented in the normalized form are known as normalized floating point numbers. For example, the normalized floating point forms of the numbers 0.00695, 56.2547 and -684.6 are:

0.00695 = .695E-2
56.2547 = .562547E2
-684.6 = -.6846E3

Inherent Errors

Inherent errors arise due to data errors or due to conversion errors.

Data Errors

If the data supplied for a problem is obtained from some experiment or measurement, then it is prone to errors due to the limitations of instrumentation or reading. Such errors are also referred to as empirical errors. So when the supplied data is correct to, say, two decimals, there is no use performing arithmetic accurate to four decimals!

Conversion Errors

Conversion errors arise due to the limitation on the number of bits used for representing numbers, under both integer and floating point representation. Such an error is therefore also called a representation error. The digits that are not retained constitute the round-off error. For example, consider the case of representing the decimal number 0.1 in a computer. The binary equivalent of 0.1 has a non-terminating form like 0.00011001100110011..., but the computer has a limited number of bits. If we add ten such numbers in a computer, the result will not be exactly 1.0, due to the round-off error during the conversion of 0.1 to binary form.
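The effect of conversion error can be observed directly. The sketch below assumes Python floats, which are IEEE 754 double precision numbers rather than the 32-bit layout described earlier; the principle is the same.

```python
from fractions import Fraction

total = 0.0
for _ in range(10):
    total += 0.1            # each 0.1 is already a rounded binary approximation

print(total)                # slightly less than 1.0 on IEEE 754 doubles
print(total == 1.0)         # False

# Fraction(0.1) recovers the exact value that is actually stored for 0.1.
print(Fraction(0.1) == Fraction(1, 10))   # False
```

The last comparison shows that the stored value is not one tenth but the nearest binary fraction that fits in the available bits.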

Computer Arithmetic

The most common kinds of computer arithmetic are integer arithmetic and floating point arithmetic. These arithmetic systems are briefly discussed below.

Integer Arithmetic: The result of any integer arithmetic operation is always an integer. The range of integers that can be represented on a given computer is finite. The result of an integer division is given as a quotient only. The remainder is truncated, since fractional quantities cannot be represented under the integer representation.

Remark: (1) Simple rules like $\frac{a + b}{c} = \frac{a}{c} + \frac{b}{c}$, where $a$, $b$, $c$ are integers, may not hold under computer integer arithmetic due to the truncation of the remainder.

(2) An integer operation may result in a number that is beyond the range the computer can handle. When the result is larger than the maximum limit, it is referred to as an overflow, and when it is less than the lower limit, it is referred to as an underflow.
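The signed integer range, the 2's complement bit patterns and the truncating division discussed above can be illustrated as follows. This is a sketch under assumptions: a 16-bit word is simulated (Python integers themselves are unbounded), and the name `twos_complement` is hypothetical.

```python
WORD = 16                                   # assumed word length in bits
INT_MIN, INT_MAX = -2 ** (WORD - 1), 2 ** (WORD - 1) - 1

def twos_complement(value, bits=WORD):
    """Bit pattern that stores `value` in a signed word of the given length."""
    if not -2 ** (bits - 1) <= value <= 2 ** (bits - 1) - 1:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    return format(value % (1 << bits), f"0{bits}b")

print(INT_MIN, INT_MAX)          # -32768 32767
print(twos_complement(5))        # 0000000000000101
print(twos_complement(-5))       # 1111111111111011
print(twos_complement(-32768))   # 1000000000000000

# Integer division discards the remainder, so (a + b)/c and a/c + b/c can differ.
# Positive operands are used because Python's // rounds toward minus infinity.
a, b, c = 5, 7, 4
print((a + b) // c, a // c + b // c)   # 3 2
```

The pattern for -5 is the 1's complement of 0000000000000101 with 1 added, and a result outside the range raises an error here, standing in for the overflow and underflow conditions of remark (2).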

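Floating Point Arithmetic: In floating point arithmetic all numbers are stored and processed in normalized exponential form. First the process of addition under floating point arithmetic is discussed.

Addition under Floating Point Arithmetic: Let $x$ and $y$ be the two numbers to be added and $z$ be the result. Let the normalized floating point representations of $x$, $y$ and $z$ be $M_x \times 10^{E_x}$, $M_y \times 10^{E_y}$ and $M_z \times 10^{E_z}$ respectively. The rules for carrying out the addition are as follows:
(a) Set $E_z = \max(E_x, E_y)$. Say $E_z = E_x$.
(b) Right shift $M_y$ by $E_x - E_y$ places and call the result $M_y'$, so that the exponents of $x$ and $y$ are the same.
(c) Set $M_z = M_x + M_y'$.
(d) Normalize $M_z$, adjusting $E_z$ accordingly.
(e) Set $z = M_z \times 10^{E_z}$.

E.g.: Add two numbers $x$ and $y$ with $E_x - E_y = 3$, and let $z = M_z \times 10^{E_z}$ be the normalized representation of the sum.
(a) $E_z = E_x$.
(b) On right shifting $M_y$ by 3 we get $M_y' = M_y \times 10^{-3}$.
(c) $M_z = M_x + M_y'$.
(d) If $0.1 \le |M_z| < 1$, then $M_z$ is already in normalized form.
(e) $z = M_z \times 10^{E_x}$.

Remark: Subtraction is nothing but the addition of numbers with different signs.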
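The addition rules (a) to (e) can be sketched on a simulated decimal machine. Everything specific here is an assumption made for illustration: a number is a pair (mantissa, exponent), the mantissa holds four digits, extra digits are chopped, and the operand values are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

def normalize(m, e):
    """Shift the decimal point so that 0.1 <= |m| < 1 (or m == 0)."""
    if m == 0:
        return Decimal(0), 0
    while abs(m) >= 1:
        m, e = m / 10, e + 1
    while abs(m) < Decimal("0.1"):
        m, e = m * 10, e - 1
    return m, e

def chop(m, digits):
    """Keep only `digits` mantissa digits (truncation toward zero)."""
    return m.quantize(Decimal(1).scaleb(-digits), rounding=ROUND_DOWN)

def fp_add(x, y, digits=4):
    (mx, ex), (my, ey) = x, y
    if ex < ey:                       # make x the operand with the larger exponent
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    ez = ex                           # (a) E_z = max(E_x, E_y)
    my = my.scaleb(-(ex - ey))        # (b) right shift M_y by E_x - E_y places
    mz = mx + chop(my, digits)        # (c) add the aligned mantissas
    mz, ez = normalize(mz, ez)        # (d) normalize
    return chop(mz, digits), ez       # (e) z = M_z * 10**E_z

x = (Decimal("0.4546"), 5)            # .4546E5
y = (Decimal("0.5433"), 2)            # .5433E2
print(fp_add(x, y))                   # mantissa 0.4551, exponent 5
```

In this run the exponents differ by 3, the shifted mantissa 0.0005433 loses its last digits when chopped to four places, and the sum 0.4551 needs no further normalization.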

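Multiplication under Floating Point Arithmetic: If $x = M_x \times 10^{E_x}$ and $y = M_y \times 10^{E_y}$ are two real numbers in normalized form, then their product $z = M_z \times 10^{E_z}$ is obtained by setting

$$M_z = M_x \times M_y, \qquad E_z = E_x + E_y$$

and then normalizing $M_z$.

E.g.: If the product of the mantissas satisfies $0.1 \le |M_x \times M_y| < 1$, then $M_z$ is already in normalized form; otherwise it is shifted left by one place and $E_z$ is reduced by 1.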
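The multiplication rule can be sketched in the same spirit. The pair representation, the four-digit chopped mantissa and the operand values are assumptions of this sketch, not values from the text.

```python
from decimal import Decimal, ROUND_DOWN

def fp_multiply(x, y, digits=4):
    """Multiply two normalized (mantissa, exponent) pairs on a decimal machine."""
    (mx, ex), (my, ey) = x, y
    mz, ez = mx * my, ex + ey            # multiply mantissas, add exponents
    if mz != 0 and abs(mz) < Decimal("0.1"):
        mz, ez = mz * 10, ez - 1         # normalized inputs need at most one left shift
    step = Decimal(1).scaleb(-digits)
    return mz.quantize(step, rounding=ROUND_DOWN), ez   # chop to `digits` places

print(fp_multiply((Decimal("0.5543"), 12), (Decimal("0.4111"), -15)))
# mantissa 0.2278, exponent -3
print(fp_multiply((Decimal("0.2"), 1), (Decimal("0.3"), 1)))
# mantissa 0.6000, exponent 1
```

Because both input mantissas lie between 0.1 and 1, their product lies between 0.01 and 1, which is why a single left shift is enough to normalize it.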


Remark: (1) During floating point arithmetic the mantissa $M$ may be truncated due to the limitation on the number of bits available for its representation on a computer.

(2) Floating point arithmetic is prone to the following errors:
a) Errors due to the inexact representation of a decimal number in binary form. For example, the binary equivalent of a decimal fraction such as 0.1 is a repeating fraction, so it has to be terminated at some point.
b) Errors due to the round-off effect.
c) Subtractive cancellation: It is possible that some mantissa positions are unspecified. These unspecified positions may be arbitrarily filled by the computer. This may lead to a serious loss of significance when two nearly equal numbers are subtracted. For example, if $x$ and $y$ agree in all but their last significant digit, then $x - y$ has only one significant digit. However, the mantissa has provision to store more significant digits, which may get arbitrarily filled as they are not specified. Further, if the operands themselves are approximate representations due to this non-specification problem, the overall loss of significance becomes serious.
d) Basic laws of arithmetic, such as the associative and distributive laws, may not be satisfied, i.e. in general

$$(x + y) + z \neq x + (y + z), \qquad (x \times y) \times z \neq x \times (y \times z), \qquad x \times (y + z) \neq x \times y + x \times z$$

(3) Numerical computation involves a series of computations consisting of basic arithmetic operations. There may be round-off or truncation error at every step of the computation. These errors accumulate with the increasing number of computations in a process. There can be situations where even a single operation magnifies the round-off errors to a level that completely ruins the result. A computation process in which the cumulative effect of all input errors is grossly magnified is said to be numerically unstable. It is important to understand the conditions under which the process is likely to be 'sensitive' to input errors and become unstable. Investigations into how small changes in input parameters influence the output are termed sensitivity analysis.

(4) The effect of round-off and truncation errors on the final numerical result may be reduced by:
a) Increasing the number of significant figures of the computer, either through hardware or through software manipulations. For instance, one may use double precision for floating point arithmetic operations.
b) Minimizing the number of arithmetic operations. Here one may try to rearrange a formula to reduce the number of arithmetic operations. For example, in the evaluation of a polynomial

$$p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$

it may be rearranged as

$$p(x) = (\cdots((a_n x + a_{n-1}) x + a_{n-2}) x + \cdots + a_1) x + a_0$$

which requires fewer arithmetic operations.
c) Replacing a formula that subtracts two nearly equal quantities by an algebraically equivalent formula without that subtraction, to avoid subtractive cancellation.
d) While finding the sum of a set of numbers, arranging the set in ascending order of absolute value, i.e. when $|a| < |b| < |c|$, then $(a + b) + c$ is better than $a + (b + c)$.

(5) It may not be possible to simultaneously reduce both the truncation and round-off error effects on the final result of a numerical computation. For instance, in an iterative procedure, when one tries to reduce the round-off error by increasing the step size, it may lead to a higher truncation error and vice-versa. Hence proper care has to be taken to balance the two errors.
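Three of the remarks above, namely the failure of associativity, the nested evaluation of a polynomial and the ordering of a sum, can be tried out in a short sketch. The function names and sample values are hypothetical, and Python's IEEE 754 double precision floats are assumed.

```python
def horner(coeffs, x):
    """Evaluate a_n*x**n + ... + a_0 by nesting; coeffs = [a_n, ..., a_0]."""
    result = 0.0
    for a in coeffs:
        result = result * x + a       # one multiplication and one addition per coefficient
    return result

def running_sum(values):
    """Plain left-to-right summation (the built-in sum may compensate for round-off)."""
    total = 0.0
    for v in values:
        total += v
    return total

# Associativity can fail in floating point arithmetic.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))    # False

# 2x^3 - 6x^2 + 2x - 1 evaluated at x = 3
print(horner([2, -6, 2, -1], 3))                  # 5.0

# One large term and ten tiny ones: the order of summation changes the result.
values = [1.0] + [1e-16] * 10
print(running_sum(sorted(values, key=abs, reverse=True)))   # 1.0, the tiny terms are lost
print(running_sum(sorted(values, key=abs)))                 # slightly larger than 1.0
```

Adding the small terms first lets them accumulate to a size that survives rounding when the large term is finally added, which is the point of remark (4)(d).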


Numerical Errors: Numerical errors arise during computations due to round-off errors and truncation errors.

Round-off Errors: Round-off error occurs because computers use a fixed number of bits, and hence a fixed number of binary digits, to represent numbers. In a numerical computation round-off errors are introduced at every stage of the computation. Hence, though an individual round-off error at a given numerical step may be small, the cumulative effect can be significant. When the number of bits required for representing a number exceeds the number of bits available, the number is rounded to fit the available number of bits. This is done either by chopping or by symmetric rounding.

Chopping: Rounding a number by chopping amounts to dropping the extra digits. Here the given number is truncated. Suppose that we are using a computer with a fixed word length of four digits. Then only the first four significant digits of a number are retained and the remaining digits are dropped. To evaluate the error due to chopping, consider the normalized representation of the number. In general, if $x$ is the true value of a given number, $f_x \times 10^E$ is the normalized form of the rounded (chopped) number and $g_x \times 10^{E-d}$ is the normalized form of the chopping error, where $d$ is the number of mantissa digits retained, then

$$x = f_x \times 10^E + g_x \times 10^{E-d}$$

Since $0 \le g_x < 1$, the chopping error satisfies

$$g_x \times 10^{E-d} < 10^{E-d}$$
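The split of a number into its chopped part and its chopping error can be sketched as below. The function name, the sample number and the four-digit word length are assumptions chosen for illustration; only positive inputs are handled.

```python
from decimal import Decimal, ROUND_DOWN

def chop_decompose(x, d):
    """Split positive x into (f, g, E) with x = f*10**E + g*10**(E - d)."""
    x = Decimal(x)
    E = x.adjusted() + 1                    # exponent that makes 0.1 <= mantissa < 1
    mantissa = x.scaleb(-E)
    f = mantissa.quantize(Decimal(1).scaleb(-d), rounding=ROUND_DOWN)
    g = (mantissa - f).scaleb(d)            # discarded digits, rescaled so that 0 <= g < 1
    return f, g, E

f, g, E = chop_decompose("72.32451", 4)
print(f, g, E)                              # 0.7232 0.451 2
print(f.scaleb(E) + g.scaleb(E - 4))        # 72.32451, the original number
```

Here the stored value is 0.7232 × 10^2 and the chopping error is 0.451 × 10^-2, which is below the bound 10^(E-d) = 10^-2.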

Symmetric Round-off Error: In the symmetric round-off method the last retained significant digit is rounded up by 1 if the first discarded digit is greater than or equal to 5. In other words, if $g_x$ in $x = f_x \times 10^E + g_x \times 10^{E-d}$ is such that $g_x \ge 0.5$, then the last digit in $f_x$ is raised by 1 before chopping $g_x \times 10^{E-d}$. For example, when a number is rounded to five digits ($d = 5$): if $g_x < 0.5$, the rounded number is $f_x \times 10^E$ and the error is $g_x \times 10^{E-5}$; if $g_x \ge 0.5$, the rounded number is $(f_x + 10^{-5}) \times 10^E$ and the error is $(g_x - 1) \times 10^{E-5}$.

In either case $|\text{error}| \le 0.5 \times 10^{E-d}$.

Truncation Errors:

Often an approximation is used in place of an exact mathematical procedure. For instance, consider the Taylor series expansion of $\sin x$, i.e.

$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$$

Practically we cannot use all of the infinite number of terms in the series for computing the sine of the angle $x$. We usually terminate the process after a certain number of terms. The error that results from such a termination or truncation is called 'truncation error'. Usually, in evaluating logarithms, exponentials, trigonometric functions, hyperbolic functions etc., an infinite series of the form $S = \sum_{i=0}^{\infty} a_i x^i$ is replaced by a finite series $\sum_{i=0}^{n} a_i x^i$. Thus a truncation error of $\sum_{i=n+1}^{\infty} a_i x^i$ is introduced in the computation.

For example, let us consider the evaluation of the exponential function $e^x$ using the first three terms of its series:

$$e^x \approx 1 + x + \frac{x^2}{2!}$$

$$\text{Truncation Error} = \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots$$
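The size of the truncation error for the exponential series can be examined numerically. The evaluation point x = 0.5 is an assumption of this sketch, and `math.exp` stands in for the exact value.

```python
import math

def exp_partial_sum(x, terms):
    """Sum of the first `terms` terms of the series 1 + x + x**2/2! + ..."""
    total, term = 0.0, 1.0
    for i in range(terms):
        total += term
        term *= x / (i + 1)       # next term: x**(i+1) / (i+1)!
    return total

x = 0.5                           # illustrative point, not from the text
for n in (3, 5, 10):
    approx = exp_partial_sum(x, n)
    print(n, approx, math.exp(x) - approx)   # the truncation error shrinks as n grows
```

With three terms the partial sum is 1.625 and the truncation error is a little over 0.02, dominated by the first neglected term x^3/3!.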

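Some Fundamental Definitions of Error Analysis:

Absolute and Relative Errors:

Absolute Error: Suppose that $x_t$ and $x_a$ denote the true and approximate values of a datum. Then the error incurred on approximating $x_t$ by $x_a$ is given by

$$e = x_t - x_a$$

and the absolute error $e_a$, i.e. the magnitude of the error, is given by

$$e_a = |x_t - x_a|$$

Relative Error: The relative error or normalized error $e_r$ in representing a true datum $x_t$ by an approximate value $x_a$ is defined by

$$e_r = \frac{|x_t - x_a|}{|x_t|}$$

and the percentage relative error is $e_r \times 100$. Sometimes $e_r$ is defined with the approximate value in the denominator,

$$e_r = \frac{|x_t - x_a|}{|x_a|}$$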
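As a small numerical illustration of absolute and relative error, the sketch below approximates π by 22/7. The choice of datum and the function names are assumptions made here, and the relative error is taken with respect to the true value.

```python
import math

def absolute_error(true_value, approx):
    """Magnitude of the difference between true and approximate values."""
    return abs(true_value - approx)

def relative_error(true_value, approx):
    """Absolute error normalized by the magnitude of the true value."""
    return abs(true_value - approx) / abs(true_value)

x_t, x_a = math.pi, 22 / 7
print(absolute_error(x_t, x_a))          # about 1.26e-3
print(relative_error(x_t, x_a))          # about 4.0e-4
print(100 * relative_error(x_t, x_a))    # the same error as a percentage
```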




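Machine Epsilon: Let us assume that we have a decimal computer system. We know that we encounter round-off error when a number is represented in floating-point form. The relative round-off error due to chopping is defined by

$$e_r = \frac{g_x \times 10^{E-d}}{f_x \times 10^E}$$

Here we know that $g_x < 1$ and $f_x \ge 0.1$, so that

$$e_r < \frac{1 \times 10^{E-d}}{0.1 \times 10^E} = 10^{-d+1}$$

i.e. the maximum relative round-off error due to chopping is given by $10^{-d+1}$. We know that the value of $d$, i.e. the length of the mantissa, is machine dependent. Hence the maximum relative round-off error due to chopping is also known as machine epsilon, $\epsilon = 10^{-d+1}$. Similarly, the maximum relative round-off error due to symmetric rounding is given by

$$e_r \le \frac{0.5 \times 10^{E-d}}{0.1 \times 10^E} = \frac{1}{2} \times 10^{-d+1}$$

so the machine epsilon for symmetric rounding is given by

$$\epsilon = \frac{1}{2} \times 10^{-d+1}$$

It is important to note that the machine epsilon represents an upper bound for the round-off error due to floating point representation. For a computer system with binary representation, the machine epsilon due to chopping and symmetric rounding are given by

$$\epsilon = 2^{-d+1} \quad \text{and} \quad \epsilon = 2^{-d}$$

respectively.

Eg: Assume that our binary machine has a 24-bit mantissa. Then, with symmetric rounding, $\epsilon = 2^{-24}$. Say that our system can represent a $q$ decimal digit mantissa. Then

$$\frac{1}{2} \times 10^{-q+1} = 2^{-24}$$

i.e. $q = 1 + 23 \log_{10} 2 \approx 7.9$, so that our machine can store numbers with seven significant decimal digits.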
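The machine epsilon formulas and the 24-bit example can be reproduced numerically. The helper `machine_epsilon` is hypothetical; the comparison with `sys.float_info.epsilon` assumes Python floats are IEEE 754 doubles with a 53-bit significand.

```python
import math
import sys

def machine_epsilon(base, d, symmetric=False):
    """Bound from the text: base**(1 - d), halved for symmetric rounding."""
    eps = float(base) ** (1 - d)
    return eps / 2 if symmetric else eps

eps = machine_epsilon(2, 24, symmetric=True)
print(eps == 2.0 ** -24)                  # True
q = 1 - math.log10(2 * eps)               # solve 0.5 * 10**(1 - q) = eps for q
print(q)                                  # about 7.9, i.e. seven full decimal digits

# For Python floats, sys.float_info.epsilon is the gap between 1.0 and the next
# float, which matches the chopping formula with d = 53.
print(sys.float_info.epsilon == machine_epsilon(2, 53))   # True
```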


Approximations and Round-off Errors

Approximations and errors are an integral part of numerical methods. Prior to using numerical methods it is essential to know how errors arise, how they grow during numerical computations and how they affect the accuracy of a solution. Errors can come in a variety of forms and sizes. To get a quick feel let us look at the following taxonomy of errors:

[Figure: taxonomy of errors]

Further discussion will focus on errors due to the computing machine and those due to the numerical method. First the notion of significant digits is introduced.

Significant Digits

Usually, the numerical solution to a given problem is sought to a desired level of accuracy and precision, wherein the error is below a set tolerance level. The idea of significant digits is essential to understand the concept of accuracy and precision in the solution and also to designate the reliability of a numerical value. The significant digits of a number are those that can be used with confidence. Suppose we seek a numerical solution to a given accuracy and obtain a solution that is reliable only up to the first three decimal places; if it has two digits before the decimal point, the solution has five significant digits. Some numbers, such as $\pi$, $e$ and $\sqrt{2}$, have an infinite number of significant digits. For example, consider $\pi = 3.14159265\ldots$

Such numbers can never be represented exactly on a computer, which operates with a fixed number of significant digits due to hardware limitations. The omission of certain digits from such numbers results in what is called round-off error. Some thumb rules on significant digits, within the desired level of accuracy, are:
(a) All non-zero digits are significant.
(b) All zeros occurring between non-zero digits are significant.
(c) Trailing zeros following a decimal point are significant.
(d) Zeros between the decimal point and the first non-zero digit are not significant.
(e) Trailing zeros in large numbers without a decimal point are not significant; writing such a number in scientific notation shows only its significant digits.

The concepts of accuracy and precision are closely related to significant digits as follows: Accuracy refers to the number of significant digits in a value. Precision refers to the number of decimal positions, i.e. the order of magnitude of the last digit in a value. For example, a value given to three decimal places has a precision of 0.001 or $10^{-3}$.
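The thumb rules for significant digits can be turned into a small counting routine. This is a sketch under assumptions: the function name is hypothetical, the input is a plain decimal numeral given as a string (no exponent notation), and the sample numerals are chosen here for illustration.

```python
def count_significant(text):
    """Count the significant digits of a decimal numeral given as a string."""
    s = text.lstrip("+-")
    if "." in s:
        digits = s.replace(".", "").lstrip("0")    # leading zeros never count
        return len(digits)                          # trailing zeros after the point do count
    return len(s.lstrip("0").rstrip("0"))           # trailing zeros of an integer do not

print(count_significant("4005"))      # 4  (zeros between non-zero digits)
print(count_significant("45.00"))     # 4  (trailing zeros after the decimal point)
print(count_significant("0.00450"))   # 3  (zeros right after the point are dropped)
print(count_significant("4500"))      # 2  (trailing zeros without a decimal point)
```

The input has to be a string because a float such as 45.00 does not remember how many trailing zeros were written.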