
# By

## Oumar Gueye, Ph.D.

Department of Mathematics, University of Manitoba

## 1.3 Errors in numerical solutions

Math-2120-Winter-2015

Definition 1:
Numerical analysis is the part of applied mathematics that studies methods that use numerical approximation to solve mathematical problems (differential calculus, integral calculus, differential equations, partial differential equations, optimization problems, and so on).

Definition 2:
Numerical methods are mathematical techniques used:

in numerical analysis, to evaluate a definite integral, for example,

in numerical algebra, to solve an equation or a system of equations,

and in optimization, to find a local minimum of a given function, for example.


Remark:
1) Because of their power, computers and calculators play a capital role in numerical analysis.
2) MATLAB, a high-level computer language, will be used in this course.
3) Solving a problem analytically (if possible) provides an exact solution, while numerical methods (very often) provide an approximate solution.
4) Solving a problem analytically is not always possible.
5) Approximate solutions involve errors due to the numerical method used and to how the data are stored on the computer.


Decimal system:
The decimal system uses the ten digits 0, 1, ..., 9 to represent any number. A number in the decimal system (base 10) is given by a sequence of these 10 digits.

Remark:
A number in base 10 can be written as a sum of multiples of powers of ten.


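Example 1:

1) $3456_{10} = 3000 + 400 + 50 + 6 = 3 \times 10^{3} + 4 \times 10^{2} + 5 \times 10^{1} + 6 \times 10^{0}$

2) $546.748_{10} = 5 \times 10^{2} + 4 \times 10^{1} + 6 \times 10^{0} + 7 \times 10^{-1} + 4 \times 10^{-2} + 8 \times 10^{-3}$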


Binary system:
The binary system uses the two digits 0 and 1 to represent any number. A number in the binary system (base 2) is given by a sequence of these 2 digits.

Remark:
The popularity of the binary system is due to the rapid development of computer science and telecommunications.

Example 2:

1) $1110101_2$
2) $1001.011_2$


## From base 10 to base 2: (For an integer)

1. Take the decimal number (integer) and divide it by 2, keeping track of the remainder.
2. Take the result and divide it by 2 in the same way, always keeping track of the remainder.
3. Repeat Step 2 until you reach a result of 0.
4. Read the remainders (all 0 or 1) off in reverse order, starting at the bottom with the one you just finished. This is the answer.

Remark:
Your last step should always look like 1 / 2 = 0 r 1.
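The repeated-division procedure can be sketched in a few lines of Python (used here only for illustration; the function name `int_to_binary` is a hypothetical helper, not part of the course material):

```python
def int_to_binary(n):
    """Base-2 digits of a non-negative integer, by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)        # quotient and remainder of n / 2
        remainders.append(str(r))  # keep track of the remainder
    # The remainders are read in reverse order (last one first).
    return "".join(reversed(remainders))


print(int_to_binary(53))  # prints 110101
```

The loop ends exactly when the quotient reaches 0, which corresponds to the final step 1 / 2 = 0 r 1.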


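Example 3:
Convert $53_{10}$ to base 2.

| Division | Quotient | Remainder |
|---|---|---|
| 53 / 2 | 26 | 1 |
| 26 / 2 | 13 | 0 |
| 13 / 2 | 6 | 1 |
| 6 / 2 | 3 | 0 |
| 3 / 2 | 1 | 1 |
| 1 / 2 | 0 | 1 |

Reading the remainders from bottom to top, the binary representation of 53 is $110101_2$.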


## From base 10 to base 2: (For a fractional part)

1. Multiply the fractional part of the decimal number by 2.
2. Record the integer part of the result obtained at step 1 (that is, 0 or 1).
3. Stop if the fractional part of the result obtained at step 1 is 0; otherwise go to step 4.
4. Repeat the previous steps with the fractional part of the result obtained at step 1.

Remark:
The binary representation is given by the sequence of recorded integer parts, from the first to the last.
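A minimal Python sketch of this repeated multiplication follows. The name `frac_to_binary` and the `max_digits` cut-off are assumptions added for illustration; the cut-off is needed because some fractions never reach a fractional part of 0.

```python
def frac_to_binary(x, max_digits=32):
    """Binary digits of a fraction 0 <= x < 1, by repeated multiplication by 2."""
    digits = []
    while x != 0 and len(digits) < max_digits:
        x *= 2
        bit = int(x)             # integer part of the result: 0 or 1
        digits.append(str(bit))  # record it
        x -= bit                 # continue with the fractional part
    return "0." + "".join(digits) if digits else "0.0"


print(frac_to_binary(0.1875))  # prints 0.0011
```

For a fraction such as 0.2 the fractional part never becomes 0, so the sketch stops after `max_digits` digits and the result is only an approximation.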


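Example 4:
Convert $87.1875_{10}$ to base 2.

Integer part: $87_{10} = 1010111_2$

Fractional part:

| Multiplication | Result | Integer part | Fractional part |
|---|---|---|---|
| 0.1875 x 2 | 0.375 | 0 | 0.375 |
| 0.375 x 2 | 0.75 | 0 | 0.75 |
| 0.75 x 2 | 1.5 | 1 | 0.5 |
| 0.5 x 2 | 1.00 | 1 | 0.00 |

Reading the integer parts from top to bottom: $0.1875_{10} = 0.0011_2$

Therefore $87.1875_{10} = 1010111.0011_2$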


## From base 2 to base 10:

1. Multiply by $2^{0}$ the first digit to the left, just before the binary point.
2. Multiply the next digit to the left by $2^{1}$, and repeat the process, increasing the exponent of 2 until the last digit.
3. Multiply by $2^{-1}$ the first digit to the right, just after the binary point.
4. Multiply the next digit to the right by $2^{-2}$, and repeat the process, decreasing the exponent of 2 until the last digit.
5. Add all the results to get the number in base 10.
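As an illustration, the five steps translate directly into the following Python sketch (the function name `binary_to_decimal` is hypothetical, and the input is assumed to be a string of 0s and 1s with at most one point):

```python
def binary_to_decimal(s):
    """Value in base 10 of a binary string such as '1001.011'."""
    int_part, _, frac_part = s.partition(".")
    total = 0.0
    # Digits left of the point: exponents 0, 1, 2, ... moving leftwards.
    for k, digit in enumerate(reversed(int_part)):
        total += int(digit) * 2 ** k
    # Digits right of the point: exponents -1, -2, ... moving rightwards.
    for k, digit in enumerate(frac_part, start=1):
        total += int(digit) * 2 ** (-k)
    return total


print(binary_to_decimal("1001.011"))  # prints 9.375
```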


Example 5:
Convert the following numbers to decimal:

1) $1011.0101_2$

2) $110101_2$

Exercise:

1) Find the binary representation of $\left(\frac{1}{5}\right)_{10}$.

2) Find the decimal representation of $0.01_2$.


Definition 3:
A number in decimal floating representation (or scientific representation) is a number written in the form $d_0.d_1 d_2 \ldots d_s \times 10^{D}$, where $0 \le d_i \le 9$ for all $i = 0, \ldots, s$; $d_0 \ne 0$; $D$ is an integer called the order of magnitude; and $0.d_1 d_2 \ldots d_s$ is called the mantissa.

Example 6:

0.3572 is the mantissa.


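Definition 4:
In binary floating representation a number is written as $1.b_1 b_2 b_3 \ldots b_k \times 2^{e}$, where the $b_i$ and the digits of $e$ belong to $\{0, 1\}$; $0.b_1 b_2 b_3 \ldots b_k$ is the mantissa and $e$ is the exponent.

Example 7:

1) $1.1001 \times 2^{101}$
2) $1.01101 \times 2^{10}$

Exercise:
Find the binary floating representation of $1001_2$.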


Remark:
1) A byte is equal to 8 bits.
2) Since the memory of computers is limited, not all numbers can be represented on computers.


## Converting a number to binary floating representation:

Let $D$ be a positive number in base 10. To represent it in binary floating point, one carries out the following steps:

1. Find the largest power of two, $2^{d}$, such that $2^{d} \le D$.
2. Normalize $D$ by multiplying and dividing $D$ by $2^{d}$:

   $D = \frac{D}{2^{d}} \times 2^{d} = 1.d_1 d_2 \ldots d_s \times 2^{d}$

3. Find the binary representations of the mantissa and of the exponent.
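The normalization step can be checked numerically. The sketch below is an illustration rather than the procedure used in the course: it relies on Python's standard `math.frexp`, which returns a factor in [0.5, 1), and rescales it to the range [1, 2) used above. The name `normalize` is a hypothetical helper.

```python
import math


def normalize(D):
    """Return (m, d) with D = m * 2**d and 1 <= m < 2, for D > 0."""
    m, e = math.frexp(D)  # D = m * 2**e with 0.5 <= m < 1
    return 2 * m, e - 1   # rescale so that the leading digit is 1


print(normalize(6.5))  # prints (1.625, 2)
```

Here 6.5 = 1.625 x 2^2; since 1.625 is 1.101 in base 2 and 2 is 10 in base 2, the binary floating representation is 1.101 x 2^10.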


Example 8:
Find the binary floating point representation of the following numbers:

1) 40

2) 0.15625

Remark:
When a number is in binary floating point representation, computers store the sign of the mantissa, the mantissa and the exponent according to the IEEE-754 standard.


## IEEE-754 standard (Institute of Electrical and Electronics Engineers)

A) Single precision format: 32 bits

Sign (1 bit): 1 for − and 0 for +
Exponent (8 bits)
Mantissa (23 bits)

Bit layout, from left to right:

| Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits) |
|---|---|---|


Remark:
1. Since in binary floating point all numbers start with 1, this leading 1 is not stored on the computer.
2. The largest number that can be stored with 8 bits (for the exponent) is $2^{8} - 1 = 255$.
3. Numbers from 0 to 255 are used to represent exponents between −127 and 128. (In IEEE-754 itself, the stored values 0 and 255 are reserved for special cases such as zero, infinity and NaN, so normalized numbers have exponents from −126 to 127.)
4. In single precision format, a bias of 127 is added to any exponent and the result is stored as the exponent (for example, −127 is stored as 0).
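To make the hidden leading 1 and the bias concrete, here is a simplified Python sketch of the encoding. It is not a full IEEE-754 implementation: the helper name `encode_single` is hypothetical, the input is assumed to be nonzero and in the normalized range, and extra mantissa digits are chopped rather than rounded.

```python
import math

BIAS = 127          # single precision bias from the remark above
MANTISSA_BITS = 23  # number of stored mantissa bits


def encode_single(x):
    """Return (sign, exponent, mantissa) bit strings for a nonzero x."""
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))           # abs(x) = m * 2**e, 0.5 <= m < 1
    m, e = 2 * m, e - 1                 # rescale so that 1 <= m < 2
    exponent = format(e + BIAS, "08b")  # biased exponent on 8 bits
    frac = m - 1                        # the leading 1 is not stored
    bits = []
    for _ in range(MANTISSA_BITS):      # repeated multiplication by 2
        frac *= 2
        bit = int(frac)
        bits.append(str(bit))
        frac -= bit
    return sign, exponent, "".join(bits)


print(encode_single(5.75))
# prints ('0', '10000001', '01110000000000000000000')
```

In this run 5.75 = 1.0111 x 2^2 in base 2, so the stored exponent is 2 + 127 = 129, which is 10000001 in base 2, and only the digits 0111 after the point are stored, padded with zeros.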


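Example 8:
Find how a computer stores the number 16.5 using single precision format.

$16.5_{10} = 10000.1_2 = 1.00001_2 \times 2^{4}$, and the stored exponent is $4 + 127 = 131 = 10000011_2$.

Final result:

| Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits) |
|---|---|---|
| 0 | 10000011 | 00001000000000000000000 |

Stored bit string: 01000001100001000000000000000000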

## IEEE-754 standard (Institute of Electrical and Electronics Engineers)

B) Double precision format: 64 bits

Sign (1 bit)
Exponent (11 bits)
Mantissa (52 bits)

Bit layout, from left to right:

| Sign (1 bit) | Exponent (11 bits) | Mantissa (52 bits) |
|---|---|---|


Remark:
1. The largest number that can be stored with 11 bits (for the exponent) is $2^{11} - 1 = 2047$.
2. Numbers from 0 to 2047 are used to represent exponents between −1023 and 1024. (In IEEE-754 itself, the stored values 0 and 2047 are reserved for special cases, so normalized numbers have exponents from −1022 to 1023.)
3. In double precision format, a bias of 1023 is added to any exponent and the result is stored as the exponent (for example, −1023 is stored as 0).


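Example 9:
Find how a computer stores the number −50 using double precision format.

$-50_{10} = -110010_2 = -1.1001_2 \times 2^{5}$, and the stored exponent is $5 + 1023 = 1028 = 10000000100_2$.

Final result:

| Sign (1 bit) | Exponent (11 bits) | Mantissa (52 bits) |
|---|---|---|
| 1 | 10000000100 | 1001 followed by 48 zeros |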
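A result like this can be checked against what the machine actually stores. The sketch below uses Python's standard `struct` module to read the 64 bits of a double; the helper name `float64_bits` is hypothetical.

```python
import struct


def float64_bits(x):
    """Split the stored double x into (sign, exponent, mantissa) bit strings."""
    # Pack as a big-endian double, then reread the 8 bytes as an integer.
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(n, "064b")
    return bits[0], bits[1:12], bits[12:]


sign, exponent, mantissa = float64_bits(-50.0)
print(sign, exponent)         # prints 1 10000000100
print(int(exponent, 2) - 1023)  # prints 5, the exponent without the bias
print(mantissa[:8])           # prints 10010000
```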


Two types of error occur frequently when one deals with numerical methods: truncation error and round-off error.

A. Truncation Error:

It refers to the error caused by the method itself. This kind of error occurs when the numerical method uses an approximate mathematical procedure.

Example 10:
Let's use the Taylor series of $\cos x$ to approximate $\cos\frac{\pi}{3}$:

$\cos x = 1 - \frac{x^{2}}{2!} + \frac{x^{4}}{4!} - \frac{x^{6}}{6!} + \frac{x^{8}}{8!} - \cdots$


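The exact value is $\cos\frac{\pi}{3} = 0.5$.

With two terms: $\cos\frac{\pi}{3} \approx 1 - \frac{\pi^{2}}{18} = 0.451688$

With three terms: $\cos\frac{\pi}{3} \approx 1 - \frac{\pi^{2}}{18} + \frac{\pi^{4}}{1944} = 0.501796$

The truncation error of the three-term approximation is $E_{Tr} = |0.5 - 0.501796| = 0.001796$.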
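The effect of keeping more terms can be explored with a short Python sketch. The function name `cos_taylor` is a hypothetical helper, and `math.cos` is used as the reference value.

```python
import math


def cos_taylor(x, n_terms):
    """Sum of the first n_terms terms of the Taylor series of cos x."""
    return sum((-1) ** k * x ** (2 * k) / math.factorial(2 * k)
               for k in range(n_terms))


x = math.pi / 3
for n in (2, 3, 4, 5):
    approx = cos_taylor(x, n)
    error = abs(math.cos(x) - approx)  # truncation error of the partial sum
    print(n, round(approx, 6), round(error, 6))
```

Each extra term reduces the truncation error, since for this value of x the terms of the alternating series decrease in size.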


B. Round-off Error

Since a finite number of bits is used to store a number, a mantissa longer than the number of bits available has to be chopped or rounded. In such a case, the true value of the number is not stored, and the error made is called round-off error.
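Round-off error can be observed directly. The following Python sketch stores a number in single precision with the standard `struct` module and compares it exactly, using `decimal.Decimal`, with the intended value. The helper names are hypothetical, and the input passes through a Python double before being rounded to single precision, which does not change the result for the values shown.

```python
import struct
from decimal import Decimal


def stored_single(x):
    """Exact value kept when x is rounded to single precision."""
    (y,) = struct.unpack(">f", struct.pack(">f", x))
    return Decimal(y)


def roundoff_error_single(text):
    """Absolute round-off error when the decimal number `text` is stored."""
    return abs(stored_single(float(text)) - Decimal(text))


print(stored_single(0.1))            # prints 0.100000001490116119384765625
print(roundoff_error_single("0.1"))  # about 1.49e-9
print(roundoff_error_single("16.5") == 0)  # prints True
```

The number 0.1 has an infinite binary expansion, so its mantissa must be cut to 23 bits, whereas 16.5 fits exactly and is stored without error.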

Remark:

Total error = truncation error + round-off error

Since the true value is usually unknown, the value of the truncation error is also unknown.
For some numerical methods the truncation error can be approximated (see next chapters).


END OF CHAP 1
