
Adama University Department Of

Electrical Engineering

Number System and Numerical Error Analysis

1.1 Introduction

This section reviews general concepts related to the quantification of errors, beginning with the concept of significant digits.

Significant digits:
Significant digits are important in showing the confidence one has in a reported number. For example, if someone asked me what the population of my county is, I would simply respond, “The population of the Adama area is 1 million!” Is it exactly one million? “I don’t know! But I am quite sure that it will not be far off.”
The problem comes if someone else were going to give me $100 for every citizen of the county; I would have to get an exact count in that case. That count would have been 1,079,587 this year.
So you can see that in my statement that the population is 1 million, there is only one significant digit, i.e., 1, and in the statement that the population is 1,079,587, there are seven significant digits. That means I am quite sure about the accuracy of the number up to the seventh digit. So, how do we differentiate the number of correct digits in 1,000,000 and 1,079,587? For that, one may use scientific notation. For our data we can write
$$1{,}000{,}000 = 1 \times 10^{6}$$
$$1{,}079{,}587 = 1.079587 \times 10^{6}$$
to signify the correct number of significant digits.
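The use of scientific notation to signal significant digits can be tried out with a short Python sketch. The helper name `to_scientific` is hypothetical and not part of the original notes; it relies only on Python's built-in `e` format specifier.

```python
def to_scientific(value: float, sig_digits: int) -> str:
    """Format value in scientific notation with sig_digits significant digits."""
    # The 'e' format keeps one digit before the point, so we request
    # sig_digits - 1 digits after it.
    return f"{value:.{sig_digits - 1}e}"


print(to_scientific(1_000_000, 1))  # 1e+06
print(to_scientific(1_079_587, 7))  # 1.079587e+06
```

Python writes the power of ten as `e+06`, which corresponds to the factor $10^{6}$ in the two expressions above.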

Precision and accuracy:

Accuracy: refers to how closely a computed or measured value agrees with the true value.

Precision: refers to how closely individual computed or measured values agree with each other.

Numerical methods should be sufficiently accurate to meet the requirements of a particular
engineering problem. They also should be precise enough for adequate engineering design. The
collective term error represents both the inaccuracy and imprecision of our predictions in the
following lessons.
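The distinction between accuracy and precision can be illustrated with a small sketch. It assumes, as one common convention rather than something stated in these notes, that inaccuracy is measured by the distance of the mean from the true value and imprecision by the sample standard deviation. The data sets and the function name are hypothetical.

```python
from statistics import mean, stdev


def accuracy_and_precision(measurements, true_value):
    """Return (inaccuracy, imprecision) for a list of repeated measurements.

    inaccuracy  : distance of the mean from the true value (assumed measure)
    imprecision : sample standard deviation of the measurements (assumed measure)
    """
    inaccuracy = abs(mean(measurements) - true_value)
    imprecision = stdev(measurements)
    return inaccuracy, imprecision


true_value = 10.0  # hypothetical true value
precise_but_inaccurate = [12.01, 11.99, 12.00, 12.02, 11.98]
accurate_but_imprecise = [9.0, 11.0, 10.5, 9.5, 10.0]

print(accuracy_and_precision(precise_but_inaccurate, true_value))
print(accuracy_and_precision(accurate_but_imprecise, true_value))
```

The first data set agrees closely with itself but is far from the true value; the second is centred on the true value but widely scattered.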

Computational Methods 1

1.2 Sources of Error

Error in solving an engineering or science problem can arise due to several factors. First, the error may be in the modeling technique. A mathematical model may be based on assumptions that are not acceptable. For example, one may assume that the drag force on a car is proportional to the velocity of the car, but actually it is proportional to the square of the velocity of the car. This itself can create huge errors in determining the performance of the car, no matter how accurate the numerical methods you may use. Second, errors may arise from mistakes in the programs themselves or in the measurement of physical quantities. But in the application of numerical methods itself, the two errors we need to focus on are:
1. Round off error, and
2. Truncation error

1.2.1 Round off Error

A computer can only represent a number approximately because there is no mechanism to deal with infiniteness in computers. For example, a number like 1/3 may be represented as 0.333333 on a PC. Then the round off error generated in this case is
$$1/3 - 0.333333 \approx 0.00000033$$
There are other numbers that cannot be represented exactly. For example, $\pi$ and $\sqrt{2}$ are numbers that need to be approximated in computer calculations.

1.2.2 Truncation Error

Truncation error is defined as the error caused by truncating a mathematical procedure. For example, the Maclaurin series for $e^x$ is given as
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$
This series has an infinite number of terms, but when using it to calculate $e^x$ only a finite number of terms can be used. For example, if one uses three terms in the calculation, then the resulting truncation error becomes
$$\text{Truncation Error} = e^x - \left[1 + x + \frac{x^2}{2!}\right] = \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots$$
Clearly we cannot avoid truncation errors. The question should rather be how we can control them. We can use the concept of relative approximate error in order to do this. For the time being, let us assume that this can be achieved by setting a particular acceptable error magnitude.
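The two error types described above can be reproduced numerically. The following is a minimal sketch with hypothetical function names; it treats `math.exp` as the "true" value for illustration, although that value is itself rounded to double precision.

```python
import math


def round_off_error_one_third(digits: int = 6) -> float:
    """Error made when 1/3 is stored with a fixed number of decimal digits."""
    stored = round(1 / 3, digits)  # 0.333333 for six digits
    return 1 / 3 - stored


def maclaurin_exp(x: float, n_terms: int) -> float:
    """Sum of the first n_terms terms of the Maclaurin series for e^x."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))


print(round_off_error_one_third())  # close to 3.3e-07

x = 0.5
three_terms = maclaurin_exp(x, 3)           # 1 + x + x^2/2!
truncation_error = math.exp(x) - three_terms
print(three_terms, truncation_error)
```

Calling `maclaurin_exp` with more terms shrinks the truncation error, which is the behaviour the section relies on when it speaks of controlling truncation error.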

Once an acceptable error magnitude is set, we decide on the minimum number of terms needed in our calculation. With this assumption in mind, one can conclude that increasing the number of terms decreases truncation error.

We can minimize round off errors by increasing the number of significant digits. Subtractive cancellation¹ and an increase in the number of computations (i.e., the number of terms involved) will intensify round off errors. Truncation error, on the other hand, can be reduced by decreasing the step size, which in turn can increase subtractive cancellation or the number of computations. Taken together, these statements mean that truncation error decreases as round off error increases, so we are forced to trade off between the two errors in finding a suitable step size for a particular computation. The total numerical error is the sum of the truncation error and the round off error.

¹ Subtractive cancellation is the term for the round off error induced when subtracting two nearly equal floating point numbers. A common instance where this occurs involves finding the roots of a quadratic equation for the case $b^2 \gg 4ac$.

1.3 Number Representation and Conversion Algorithm

The basic unit of memory is the binary digit, called a bit. A bit may contain a 0 or a 1. It is the simplest possible unit. A device capable of storing only zeros could hardly form the basis of a memory system. People often say that computers use binary arithmetic because it is "efficient." What they mean (although they rarely realize it) is that digital information can be stored by distinguishing between different values of some continuous physical quantity, such as voltage or current. The more values that must be distinguished, the smaller the separation between adjacent values, and the less reliable the memory. The binary number system requires only two values to be distinguished. Consequently, it is the most reliable method for encoding digital information.

In everyday life, we use a number system with a base of 10, simply because we have ten fingers! For example, look at the number 257.56. Each digit in 257.56 has a value of 0 through 9 and has a place value. It can be written as
$$257.56 = 2\times10^{2} + 5\times10^{1} + 7\times10^{0} + 5\times10^{-1} + 6\times10^{-2}$$
In contrast, in a binary system we have a similar arrangement in which the only digits are 0 and 1.
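The place-value expansion above works the same way in any base. The sketch below uses a hypothetical helper, `place_value`, to evaluate a digit string by summing digit times base raised to the position; it assumes a non-negative number written with at most one point.

```python
def place_value(digits: str, base: int) -> float:
    """Evaluate a digit string such as '257.56' in the given base."""
    integer_part, _, fraction_part = digits.partition(".")
    value = 0.0
    # Integer digits carry the powers base^0, base^1, ... from right to left.
    for power, digit in enumerate(reversed(integer_part)):
        value += int(digit, base) * base**power
    # Fraction digits carry the powers base^-1, base^-2, ... from left to right.
    for power, digit in enumerate(fraction_part, start=1):
        value += int(digit, base) * base ** (-power)
    return value


print(place_value("257.56", 10))    # approximately 257.56
print(place_value("1011.0011", 2))  # 11.1875
```

The decimal result is only approximately 257.56 because 0.56 has no exact binary floating point representation, which is itself an instance of round off error.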

The binary system is therefore a base-2 system. A number like $(1011.0011)_2$ in base 2 represents the decimal number
$$1011.0011 = 1\times2^{3} + 0\times2^{2} + 1\times2^{1} + 1\times2^{0} + 0\times2^{-1} + 0\times2^{-2} + 1\times2^{-3} + 1\times2^{-4} = 11.1875$$
in the decimal system. In general, in both systems we have an integer part to the left of the point and a fraction part to the right of the point.

The primary logic units of a digital computer are on/off electronic components. Hence numbers on a computer are represented with a binary, or base-2, system.

1.3.1 Floating Point Representation

The fundamental unit whereby ‘meaningful’ information is represented is called a word. This is an entity that consists of a string of binary digits, or bits. Numbers are typically stored in one or more words.

Integer representation: The most straightforward approach is the signed magnitude method. It employs the first bit of a word to indicate the sign of the number (0 is positive and 1 is negative), and the remaining bits are used to store the number. The following figure shows a signed magnitude representation of the integer -173 on a 16-bit computer.

   1   | 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1
  Sign | Number
Fig: The representation of the decimal integer -173 on a 16-bit computer

Floating point representation: Fractional numbers are represented in computers using floating point form. The general format is
$$m \cdot b^{e}$$
where:
m = the mantissa (or significand)
b = the base of the number system being used
e = the exponent
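Returning to the integer representation described above, the signed magnitude method can be sketched in a few lines. The function name and the overflow check are assumptions added for illustration; the 16-bit word size follows the figure.

```python
def signed_magnitude(value: int, word_bits: int = 16) -> str:
    """Return the signed magnitude bit string of an integer.

    The first bit is the sign (0 positive, 1 negative); the remaining
    word_bits - 1 bits hold the magnitude in binary.
    """
    magnitude_bits = word_bits - 1
    if abs(value) >= 2**magnitude_bits:
        raise ValueError("magnitude does not fit in the word")
    sign_bit = "1" if value < 0 else "0"
    return sign_bit + format(abs(value), f"0{magnitude_bits}b")


print(signed_magnitude(-173))  # 1000000010101101
print(signed_magnitude(173))   # 0000000010101101
```

The first printed string matches the bit pattern shown in the figure for -173.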

The exponent e is also called the characteristic.

  Sign | Signed exponent | Mantissa
Fig: The manner in which a floating point number is represented in a word.

The above figure shows one way of storing a floating point number in a word. The mantissa is usually normalized if it has a leading zero digit. For example, the number 1/34 = 0.029411765… may be stored in a floating point base-10 system that allows four decimal places to be stored, i.e.,
$$1/34 = 0.0294 \times 10^{0}$$
which could be normalized as
$$1/34 = 0.2941 \times 10^{-1}$$
That means normalization makes room for one more significant figure within the four decimal places available. One can easily deduce that the absolute value of the normalized mantissa in base 10 is between 1/10 = 0.1 and 1. Generally, for any base b,
$$\frac{1}{b} \le m < 1$$
So we can say that 156.78 could be represented as $0.15678 \times 10^{3}$ as a floating point number in a base-10 system. In addition to taking more time and memory space, the floating point representation has its own disadvantage: round off errors are introduced in approximating the mantissa.

1.3.2 Conversion algorithm between base two and base ten systems

A point of interest regarding the two number systems is the conversion mechanism between them. We have already converted a binary number to decimal at the beginning of this section, and now we will proceed with the reverse. Let us take the number 11.1875. First look at the integer part, 11.
(I) Divide 11 by 2. This gives a quotient of 5 and a remainder of 1. Let a0 = 1.
(II) Divide the quotient 5 by 2 again. We have a quotient of 2 and a remainder of 1. Let a1 = 1.
(III) Divide 2 by 2. We have a quotient of 1 and a remainder of 0. Let a2 = 0.
(IV) Divide 1 by 2 again. We have a quotient of 0 and a remainder of 1. Let a3 = 1.
Since we have finally reached a quotient of 0, we stop and write the binary equivalent as follows.
$$11 = (a_3 a_2 a_1 a_0)_{two} = (1011)_{two}$$
Now let us look at the fraction part, i.e., 0.1875.
(I) Multiply 0.1875 by 2. This gives 0.375. The number before the decimal point is 0, so a-1 = 0.
(II) Multiply 0.375 by 2. This gives 0.75. Again, the number before the decimal point is 0, so a-2 = 0.
(III) Multiply 0.75 by 2. This results in 1.5. Now the number before the decimal point is 1, so a-3 = 1.
(IV) Next we multiply only the number after the decimal point, that is, 0.5, by 2. This gives 1.0 and a-4 = 1. Since we got a 0 after the decimal point, that marks the end of the conversion process.
Hence 0.1875 = (0.0011)two, and 11.1875 = (1011.0011)two.
Generally, conversion from base 10 to base two involves two independent procedures, one for the integer part and one for the fraction part. The following figures show the conversion flowcharts.
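The two procedures worked through above (repeated division by 2 for the integer part, repeated multiplication by 2 for the fraction part) can be sketched as follows. The function name and the `max_fraction_bits` cut-off are assumptions added here, since many decimal fractions never terminate in binary; the sketch handles non-negative numbers only.

```python
def decimal_to_binary(value: float, max_fraction_bits: int = 16) -> str:
    """Convert a non-negative base-10 number to a base-2 digit string."""
    integer_part = int(value)
    fraction_part = value - integer_part

    # Integer part: divide by 2 repeatedly; the remainders are a0, a1, a2, ...
    integer_bits = ""
    quotient = integer_part
    while quotient > 0:
        quotient, remainder = divmod(quotient, 2)
        integer_bits = str(remainder) + integer_bits
    integer_bits = integer_bits or "0"

    # Fraction part: multiply by 2 repeatedly; the digits that appear before
    # the point are a-1, a-2, ...  Stop when nothing is left after the point.
    fraction_bits = ""
    while fraction_part > 0 and len(fraction_bits) < max_fraction_bits:
        fraction_part *= 2
        bit = int(fraction_part)
        fraction_bits += str(bit)
        fraction_part -= bit

    return integer_bits + ("." + fraction_bits if fraction_bits else "")


print(decimal_to_binary(11.1875))  # 1011.0011
print(decimal_to_binary(0.1, 8))   # 0.00011001 (cut off after 8 bits)
```

The second call shows why a cut-off is needed: 0.1 has no finite binary expansion, so the loop would otherwise run until the floating point remainder happened to reach zero.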

Fig. Flow charts for the integer part and fraction part conversion from base ten to base two.

1.4 Error Estimation

In any numerical analysis, errors will arise during the calculations. To be able to deal with the issue of errors, we need to
(A) Identify where the error is coming from (“is it round off or truncation?”), followed by
(B) Quantifying the error (“what is this?”), and lastly
(C) Minimize the error as per our needs (“how about increasing the step size?”).
This section is devoted to addressing what we have stated under (B).

1.4.1 True Error

True error, denoted by $E_t$, is the difference between the true value (also called the exact value) and the approximate value.
$$E_t = \text{True value} - \text{Approximate value}$$
The magnitude of the true error does not show how bad the error is. That means it actually depends on the mathematical equation that describes a given quantity. Relative true error, denoted by $\epsilon_t$, is defined as the ratio of the true error and the true value.
$$\epsilon_t = \frac{\text{True error}}{\text{True value}}$$
This value can be expressed as a percentage, $\epsilon_t \times 100$. Sometimes it makes more sense to use the absolute relative true error $|\epsilon_t|$.

1.4.2 Approximate Error

So far, we discussed how to calculate true errors. Such errors can be calculated only if true values are known. In most cases we will not have the luxury of knowing true values. So when we are solving a problem numerically, we will only have access to approximate values. We need to know how to quantify error for such cases. Approximate error is denoted by $E_a$ and is defined as the difference between the present approximation and the previous approximation.
$$E_a = \text{Present approximation} - \text{Previous approximation}$$
Here again, the magnitude of the approximate error does not show how bad the error is.
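The error measures defined so far can be computed directly. The sketch below uses hypothetical function names and, as an illustrative choice not taken from the notes, approximates $e^{0.5}$ with two and then three terms of its Maclaurin series; `math.exp` stands in for the true value.

```python
import math


def true_error(true_value: float, approximation: float) -> float:
    """E_t = true value - approximate value."""
    return true_value - approximation


def relative_true_error(true_value: float, approximation: float) -> float:
    """epsilon_t = true error / true value."""
    return true_error(true_value, approximation) / true_value


def approximate_error(present: float, previous: float) -> float:
    """E_a = present approximation - previous approximation."""
    return present - previous


true_value = math.exp(0.5)
previous = 1 + 0.5               # two terms of the series
present = 1 + 0.5 + 0.5**2 / 2   # three terms of the series

print(true_error(true_value, present))
print(abs(relative_true_error(true_value, present)) * 100, "%")
print(approximate_error(present, previous))  # 0.125
```

Only the last quantity can be evaluated when the true value is unknown, which is why approximate error is the practical tool during an actual computation.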

Relative approximate error is denoted by $\epsilon_a$ and is defined as the ratio between the approximate error and the present approximation.
$$\epsilon_a = \frac{\text{Approximate error}}{\text{Present approximation}}$$
Similar to relative true error, we can express the relative approximate error as a percentage as well as an absolute value.

1.5 Error Propagation

The scope of this section is to study how errors in numbers can influence the results of the mathematical calculations those numbers are involved in.

(I) Functions of a single variable

Suppose we have a function $f(x)$ whose independent variable $x$ is approximated as $\tilde{x}$. What will be the effect of having an approximate value of the variable $x$ on $f(x)$?
$$\Delta f(\tilde{x}) = |f(x) - f(\tilde{x})|$$
We again have a problem. We cannot have the exact value of $f(x)$ since we cannot have the exact value of $x$. But if we are guaranteed that $f$ is continuous and differentiable, and taking $\tilde{x}$ as close as possible to $x$, we may apply the Taylor series.
$$f(x) = f(\tilde{x}) + f'(\tilde{x})(x - \tilde{x}) + \frac{f''(\tilde{x})}{2!}(x - \tilde{x})^2 + \cdots$$
Let us drop the second-order and higher terms and rearrange to get, approximately,
$$f(x) - f(\tilde{x}) \cong f'(\tilde{x})(x - \tilde{x})$$
$$\Delta f(\tilde{x}) = |f'(\tilde{x})|\,|x - \tilde{x}|$$
Now the quantity $\Delta\tilde{x} = |x - \tilde{x}|$ represents the estimate of the error in $x$.

(II) Functions of more than one variable

Here we are going to generalize the foregoing approach, so we will apply the multivariable version of the Taylor series. Suppose we have two variables $u$ and $v$.
$$f(u_{i+1}, v_{i+1}) = f(u_i, v_i) + \frac{\partial f}{\partial u}(u_{i+1}-u_i) + \frac{\partial f}{\partial v}(v_{i+1}-v_i) + \frac{1}{2!}\left[\frac{\partial^2 f}{\partial u^2}(u_{i+1}-u_i)^2 + 2\frac{\partial^2 f}{\partial u\,\partial v}(u_{i+1}-u_i)(v_{i+1}-v_i) + \frac{\partial^2 f}{\partial v^2}(v_{i+1}-v_i)^2\right] + \cdots$$
where all partial derivatives are evaluated at the base point $i$. We can drop the second-order and higher terms to obtain
$$\Delta f(\tilde{u}, \tilde{v}) = \left|\frac{\partial f}{\partial u}\right|\Delta\tilde{u} + \left|\frac{\partial f}{\partial v}\right|\Delta\tilde{v}$$
For $n$ independent values $\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \ldots, \tilde{x}_n$ we have
$$\Delta f(\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n) = \left|\frac{\partial f}{\partial x_1}\right|\Delta\tilde{x}_1 + \left|\frac{\partial f}{\partial x_2}\right|\Delta\tilde{x}_2 + \cdots + \left|\frac{\partial f}{\partial x_n}\right|\Delta\tilde{x}_n$$

(III) Stability and condition

Here we measure the sensitivity of a given mathematical equation to variations in its input variables. We say that a computation is numerically unstable if the uncertainty of the input values is grossly magnified by the numerical method, i.e., the function we are using. Let us take a first-order Taylor series
$$f(x) = f(\tilde{x}) + f'(\tilde{x})(x - \tilde{x})$$
The relative error of $f$ is given by
$$\frac{f(x) - f(\tilde{x})}{f(x)} \cong \frac{f'(\tilde{x})(x - \tilde{x})}{f(\tilde{x})}$$
The relative error of $x$ is given as
$$\frac{x - \tilde{x}}{\tilde{x}}$$
Taking the ratio of the two, we define a condition number as follows
$$\text{condition number} = \frac{\tilde{x}\,f'(\tilde{x})}{f(\tilde{x})}$$
This number measures the extent to which an uncertainty in $x$ is magnified by $f(x)$, with three distinct cases:
Condition number > 1: relative error in x is amplified in f(x)
Condition number = 1: relative error is identical in both x and f(x)
Condition number < 1: relative error is attenuated in f(x)
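The first-order error estimate and the condition number above can be evaluated numerically. This is a sketch under stated assumptions: the derivative is estimated with a central difference (a technique not covered in this section), the step size `h` is an arbitrary choice, and all function names and test functions are hypothetical.

```python
import math


def derivative(f, x: float, h: float = 1e-6) -> float:
    """Central-difference estimate of f'(x); h is an assumed step size."""
    return (f(x + h) - f(x - h)) / (2 * h)


def propagated_error(f, x_approx: float, delta_x: float) -> float:
    """First-order estimate: delta_f = |f'(x_approx)| * delta_x."""
    return abs(derivative(f, x_approx)) * delta_x


def condition_number(f, x_approx: float) -> float:
    """Condition number: x_approx * f'(x_approx) / f(x_approx)."""
    return x_approx * derivative(f, x_approx) / f(x_approx)


# Error in f(x) = x^3 when x is known as 2.5 with an error of 0.01.
print(propagated_error(lambda x: x**3, 2.5, 0.01))  # close to 0.1875

# sqrt attenuates relative error (about 0.5); x^2 amplifies it (about 2).
print(condition_number(math.sqrt, 4.0))
print(condition_number(lambda x: x**2, 3.0))

# tan(x) close to pi/2 is ill-conditioned: the magnitude is well above 1.
print(condition_number(math.tan, math.pi / 2 + 0.1 * (math.pi / 2)))
```

The last value is negative, so the comparison with 1 in the three cases above is made on its magnitude.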