Uploaded by Laura Gorostidi

Statistics

Bachelor of Software Engineering

2013-2014

Contents

1 Lab Practice 1.- Introduction to Octave

5 Lab Practice 5.- Tests on the parameters of two independent normal distributions

6 Lab Practice 6.- Tests on population proportions

7 Lab Practice 7.- Goodness of fit tests, a test for randomness, and the Kolmogorov-Smirnov test for two samples

Bibliography


Octave is free software that runs on Linux, Unix and Microsoft Windows.

It can be downloaded from http://www.octave.org, where additional information is also available.


http://www.gnu.org/software/octave

Chapter 1


(Official page)

http://www.gnu.org/software/octave/about.html

http://www.gnu.org/software/octave/download.html

http://www.gnu.org/software/octave/NEWS-3.2.html

http://octave.sourceforge.net/packages.html

http://octave.sourceforge.net/FAQ.html#install

http://octave.sourceforge.net/symbolic/index.html

http://www.network-theory.co.uk/octave/manual

1.3

Throughout the lab classes we will use QtOctave, that is, Octave with a graphical interface that lets users interact with Octave in a simple way.

You can start QtOctave by double-clicking the QtOctave icon, which should be on the desktop of your computer.

1.1

In this practice we present the basic instructions of Octave. These commands will be essential for the lab classes.

1.2 Introduction

Octave was originally conceived as companion software for an undergraduate textbook on chemical reactor design, being written by James B. Rawlings of the University of Wisconsin-Madison and John G. Ekerdt of the University of Texas in 1988. John W. Eaton started developing the program in 1992. The first alpha version was presented in 1993, and version 1.0 appeared in 1994.


Clearly, Octave is now much more than just another courseware package with limited utility beyond the classroom.

Today Octave is interactive software for numerical computation, with powerful tools for areas such as linear algebra, analysis, calculus, optimization, interpolation, linear programming, statistics, probability, geometry, differential equations, signal processing, etc.

Starting QtOctave brings up the window called QtOctave[Empty], which contains the Octave Terminal, the Editor, the Commands List, the Variables List and the Navigator windows. Those windows allow users to enter simple commands (see next figure).

The prompt >> (in the Command line) indicates that Octave is awaiting a command.

To perform a simple computation, type a command and then press the Enter or Return key. All commands in Octave should be typed in lower case.

To exit Octave we can use the command Quit in the File window.

Some relevant commands of Octave are the following:

1) Definition of variables:

Variables are defined by assigning their values directly. Names of variables are chosen by the user.

Once a variable is defined, we can use it in later calculations. Thus >> a = 2 defines the variable a with value 2. The command >> a + 23 will then return the value 25.

One can obtain the value of a variable just by typing its name.

Blank spaces are not allowed in variable names.

Capital letters are distinguished, thus A and a are not the same variable.

When the result of a computation is not assigned to a variable, Octave stores it in the variable ans (short for answer), which can be used like any other variable in subsequent computations. Thus >> a + 23 will store the value 25 in the variable ans, whereas >> b = a + 23 will store the value 25 in the variable b.

There are built-in variables in Octave. A built-in variable that is often useful is pi, which is the number π. Moreover, inf is ∞. The complex unit is represented by either of the built-in variables i or j.

To see the variables which have been defined:

- the command whos displays the names of the variables and their characteristics,

- the command clear deletes all the variables.

2) Usual operations: the usual operations are performed with the same symbols as on a calculator, thus

a + b addition,

a - b subtraction,

a * b multiplication,

a / b division,

a ^ b power.

You can suppress an output by adding a semicolon (;) after the statement.

To write comments, use % or #; Octave ignores the rest of the line after those symbols.

To enter different statements in one line, use commas (,).

The arrow keys recall previous instructions: the up-arrow key (↑) recalls the previous line.

If you begin to write a line and then press the up-arrow key (↑), the last command with the same beginning appears.

You can recover a line from the Command History by double-clicking it.

3) Definition of a matrix: for instance, the command >> A = [1, 1, 2; 3, 2, 4; 13, 12, 1] will define a matrix named A of size 3 × 3, in which the rows are separated by semicolons (;) and the elements within each row by commas (,).

Octave will respond by printing the matrix in neatly aligned columns without parentheses.

The above command will define the matrix

\[ A = \begin{pmatrix} 1 & 1 & 2 \\ 3 & 2 & 4 \\ 13 & 12 & 1 \end{pmatrix} \]

It is also possible to define the above matrix in the following way

>> A = [1 1 2; 3 2 4; 13 12 1]

that is, the elements of each row are separated by blanks instead of commas, or also

>> A = [1 1 2
3 2 4
13 12 1]

that is, the elements of each row are separated by blanks, and the different rows are written on different lines. Note that with the last procedure the whole matrix cannot be recovered from the Command History.
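The points above about variables, the variable ans, the semicolon and the three ways of typing a matrix can be tried together. The following is a minimal sketch; the names b, A1, A2, A3 and same are chosen only for this illustration.

```octave
a = 2;                               % define a variable; ';' suppresses the output
a + 23;                              % not assigned, so the result is stored in ans
b = ans + 1;                         % ans behaves like any other variable

A1 = [1, 1, 2; 3, 2, 4; 13, 12, 1];  % commas between the elements of a row
A2 = [1 1 2; 3 2 4; 13 12 1];        % blanks between the elements of a row
A3 = [1 1 2
      3 2 4
      13 12 1];                      % one row per line

same = isequal(A1, A2, A3)           % no ';' here, so the result is displayed
```

Since the last statement has no semicolon, Octave prints the value of same, which is 1 when the three definitions produce the same matrix.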

4) Generation of a random matrix: the command >> rand(n, m) returns an n × m matrix with random numbers from the interval (0, 1).

For instance, >> C = rand(2, 3); generates a 2 × 3 matrix named C with random numbers from the interval (0, 1), but without showing it on the screen, since the command ends with the symbol ; (semicolon).

6) Adding matrices: the sum of the matrices A and B is obtained with >> A + B

7) Subtracting matrices: the command >> A - B obtains A minus B

8) Multiplying matrices: to multiply A by B we use the command >> A * B

9) Power of matrices: the command >> A ^ p obtains A raised to the power p (the product A · A ⋯ A with p factors).

10) Transpose of a matrix: the transpose of the matrix A ($A^t$) can be obtained with the command >> A.' (note the point after A), where the symbol ' is on the same key as the question mark (note that .' is not the exclamation mark, but two symbols, . and '). It is also possible to obtain the transpose with >> transpose(A)

11) Conjugate transpose of a matrix: the conjugate transpose of A is obtained with >> A'

12) Determinant of a matrix: the determinant of A is computed with the command >> det(A)

23) Defining matrices by combination of matrices: in some cases it is necessary to join two matrices A and B in a matrix C. When the sizes of the matrices A and B allow this, it can be done in two different ways: on the one hand, placing the matrix B to the right of the matrix A (A and B should have the same number of rows); on the other hand, placing B below the matrix A (A and B should have the same number of columns).

In the first case we use the command >> C = [A, B], which defines the matrix $C = (A \;\; B)$; for the second case the command is >> C = [A; B], generating the matrix

\[ C = \begin{pmatrix} A \\ B \end{pmatrix} . \]

It is necessary that the sizes of matrices A and B permit such operations in each case.
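The two ways of joining matrices can be sketched as follows. The matrices and the names C_right and C_below are made up for this illustration only.

```octave
A = [1, 2; 3, 4];      % 2 x 2
B = [5, 6; 7, 8];      % 2 x 2, so both ways of joining are allowed

C_right = [A, B];      % B to the right of A: same number of rows needed
C_below = [A; B];      % B below A: same number of columns needed

size(C_right)          % size of the side-by-side matrix
size(C_below)          % size of the stacked matrix
```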

24) Obtaining an element of a matrix: >> A(i, j) displays the element of A in row i and

column j.

25) Obtaining columns of a matrix: with >> A(:, j) we obtain the submatrix of the matrix

A with all rows of A and the column j of A, that is, we obtain column j of A.

13) Rank of a matrix: the rank of matrix A is calculated by means of >> rank(A)

14) Trace of a matrix: the trace of matrix A can be calculated with >> trace(A)

26) Obtaining rows of a matrix: the command >> A(i, :) obtains the submatrix of A with row i and all the columns, that is, row i of A.

15) Eigenvalues of a matrix: the eigenvalues of a matrix A are computed with the command >> eig(A). In some versions of Octave this command also obtains the eigenvectors.

16) Eigenvalues and eigenvectors of a matrix: the eigenvalues and eigenvectors of a matrix A can be calculated with >> [P, D] = eig(A), which provides a square matrix named P of the same size as A, whose columns contain the eigenvectors, and a diagonal matrix named D with the eigenvalues of A on the diagonal (the variable names P and D can be changed).
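A quick numerical check helps to see how P and D are related to A. This is a minimal sketch on a small symmetric matrix chosen for the illustration; the name residual is hypothetical.

```octave
A = [2, 1; 1, 2];            % small symmetric example with eigenvalues 1 and 3
[P, D] = eig(A);

% Column k of P is an eigenvector for the eigenvalue D(k, k),
% so A * P and P * D must agree up to rounding error.
residual = norm(A * P - P * D);

eigenvalues = diag(D);       % the eigenvalues as a column vector
```

If residual is of the order of the machine precision, the columns of P are indeed eigenvectors of A.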

17) Inverse of a matrix: the command >> inv(A) computes the inverse of the square matrix A.

18) Size of a matrix: >> size(A) obtains the size of the matrix A.

27) Obtaining submatrices:

- >> A(u, v) defines a submatrix of A with the rows indicated in the vector u and the columns which appear in vector v. For instance, >> A([i, r], [j, k, l]) displays the submatrix of A given by rows i and r, and columns j, k and l,

- with >> A(i : r, j) we obtain the submatrix of matrix A given by the part of column j between rows i and r, both included,

- >> A(:, v) provides the submatrix of matrix A given by all the rows and those columns which appear in vector v. For instance, A(:, [1, 3, 5]) gives the submatrix of A with all its rows and columns 1, 3 and 5,

19) Length of a matrix: >> length(A) calculates the length of a matrix. The length is the number of rows or columns, whichever is greater. Thus, if the size of A is m × n, >> length(A) obtains the maximum of m and n. This command will be very useful to obtain the number of elements in a vector.

20) Number of rows of a matrix: >> rows(A) returns the number of rows of matrix A.

21) Number of columns of a matrix: >> columns(A) provides the number of columns of matrix A.

22) Solving systems of linear equations: suppose we want to solve the system of linear equations Ax = b. The command >> A\b obtains the solution of the system, $A^{-1} b$ (when such a value exists; otherwise Octave will warn that the matrix is singular).
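As a small worked case of the command A\b, consider the system 2x + y = 3, x + 3y = 5. The system and the variable names below are made up for this sketch.

```octave
A = [2, 1; 1, 3];        % coefficient matrix
b = [3; 5];              % right-hand side as a column vector

x = A \ b;               % solution of A * x = b
x_inv = inv(A) * b;      % same solution through the inverse

check = norm(A * x - b); % close to 0 when x solves the system
```

Both x and x_inv should agree up to rounding error; A\b does not need the inverse to be formed explicitly.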

- with >> A(i, j : k) we obtain the submatrix of matrix A given by the part of row i between columns j and k, both included,

- by means of >> A(u, :) we obtain the submatrix of A given by all the columns and those rows which appear in vector u. Thus A([1, 4], :) provides the submatrix of A with all its columns and rows 1 and 4,

- the command >> A(i : j, :) defines the submatrix of A given by all its columns and the rows between the i-th and the j-th, both included,

- in a similar way, >> A(:, i : j) creates the submatrix of A given by all its rows and the columns between the i-th and the j-th, both included,

- with >> A(i : j, k : l) we obtain the submatrix of A given by the rows between the i-th and the j-th, both included, and the columns between the k-th and the l-th, both included,

- the commands >> vec(A) or >> A(:) return all elements of A in a column vector, after stacking the columns of A,

- we should note that obtaining a column or a row of a matrix is also generating a submatrix.
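The indexing rules of this list can be tried on a small matrix whose entries are easy to recognize. This is only a sketch: the matrix built with reshape and the names on the left-hand sides are assumptions of the example.

```octave
A = reshape(1:20, 4, 5);   % 4 x 5 matrix filled column by column with 1..20

e   = A(2, 3);             % single element: row 2, column 3
c2  = A(:, 2);             % column 2
r14 = A([1, 4], :);        % rows 1 and 4, all columns
blk = A(2:3, 4:5);         % rows 2 to 3 and columns 4 to 5
col = A(:);                % all elements stacked in one column vector
```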

28) Replacing elements of a matrix: >> A(i, j) = α replaces the element in row i and column j of A by the value α.

For instance, if A is a matrix of size 5 × 6 and V is a vector of size 5 × 1, the command A(:, 2) = V replaces column 2 of A with vector V.

29) Other transformations of matrices: suppose that we have a matrix A; sometimes it is necessary to define a new matrix B of the same size as A, such that the element $B_{i,j}$ is equal to 1 if the element $A_{i,j}$ satisfies a certain inequality, and equal to 0 otherwise.

For instance, if A is a matrix, the command >> M = (A <= 3) defines a matrix named M, of the same size as A, such that element (i, j) of M is 1 if element (i, j) of A is less than or equal to 3, and 0 otherwise.
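A minimal sketch of such a 0/1 matrix, with a made-up matrix A; the name count is introduced here only to show that the result can be used to count how many elements satisfy the inequality.

```octave
A = [1, 5, 3; 4, 2, 6];

M = (A <= 3);            % 1 where the element of A is <= 3, 0 elsewhere
count = sum(M(:));       % number of elements of A that satisfy the inequality
```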

30) Element-by-element calculations with matrices: if A and B are matrices of the same size, it is possible to operate element-by-element with such matrices.

Let us clarify these operations with an example. Suppose that we have two row matrices of size 1 × n, $A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$, and let c be a real number:

- >> A + c obtains $(a_1 + c, a_2 + c, \dots, a_n + c)$,

- >> A * c or >> c * A displays $(a_1 c, a_2 c, \dots, a_n c)$,

- >> A + B calculates $(a_1 + b_1, a_2 + b_2, \dots, a_n + b_n)$,

- the command >> A.*B (note the point after matrix A) computes $(a_1 b_1, a_2 b_2, \dots, a_n b_n)$,

- the command >> A./B (note the point after matrix A) computes $\left( \frac{a_1}{b_1}, \frac{a_2}{b_2}, \dots, \frac{a_n}{b_n} \right)$,

- the command >> A.^B (note the point after matrix A) computes $(a_1^{b_1}, a_2^{b_2}, \dots, a_n^{b_n})$,

- there are built-in functions in Octave which can be used with matrices in an element-by-element way. Some examples are: sine (sin), cosine (cos), logarithm (log), etc. Thus >> log(A) obtains $(\log a_1, \log a_2, \dots, \log a_n)$.

31) Some special matrices:

- >> ones(m, n) defines an m × n matrix whose elements are ones,

- >> ones(n) creates an n × n matrix of ones,

- >> zeros(m, n) generates an m × n matrix of zeros,

- >> zeros(n) returns an n × n matrix of zeros,

- >> eye(m, n) returns an m × n matrix with ones on the principal diagonal and zeros elsewhere,

- >> diag(v), where v is a vector, returns a diagonal matrix with v on the diagonal,

- if A is a matrix, >> diag(A) returns a vector with the elements of the diagonal of A.

32) Requesting help: one of the nice features of QtOctave is its help system. Thus >> help command shows information on the command called command. For instance, >> help inv provides information on the command inv, which computes the inverse of a matrix.

33) General help: QtOctave provides a manual on Octave by just clicking on the question mark under the command View. The resulting window presents a table of contents from which the information can be reached.

34) Handling of files: we will use two different kinds of files, namely, files of variables (data) and files of commands (programs). For such a purpose we will use the windows Navigator and Editor respectively (see first figure).

In order to select the folder in which to save (or from which to load) a file of variables, we specify the path to the file with Navigator (similar to a File Open window). Once we are in the corresponding folder (you can make sure that you are in that folder with Go), we use the commands:

>> load name to load the variables stored in the file name (we can check that such variables are loaded with the command >> who, which shows the current variables)


1.4


Solve the following problems using the Octave commands previously described (the solutions

are shown after the exercises).

\[ A = \begin{pmatrix} 1 & 3 & 0 & 2 & 3 & 5 \\ 5 & 7 & 2 & 4 & 1 & 3 \\ 2 & 4 & 6 & 2 & 4 & 4 \\ 2 & 3 & 3 & 1 & 7 & 3 \\ 4 & 2 & 7 & 6 & 7 & 2 \end{pmatrix} \]

>> A = [1, 3, 0, 2, 3, 5; 5, 7, 2, 4, 1, 3; 2, 4, 6, 2, 4, 4; 2, 3, 3, 1, 7, 3; 4, 2, 7, 6, 7, 2];

>> save name to save the current variables in a file called name

Programs are written in the Editor, and they are saved and loaded in the usual way with the corresponding icons at the top of the Editor.

>> A

Exercise 1.3. Obtain a matrix of size 3 × 4 with random numbers from the interval (0, 1). Name this matrix B.

>> B = rand(3, 4)

Exercise 1.4. By means of B, obtain a matrix of the same size as B with random numbers from the interval (0, 4).

>> C = 4 * B

>> C.'

or

>> transpose(C)

>> C * B.'

35) List of files: to obtain the list of files of the folder in use (make sure with Navigator and Go that you are in such a folder) we use the command >> dir, which shows the list of files.

36) Re-establishing the terminal: to re-establish the terminal use the command >> quit or >> exit.

37) Clearing the terminal: in the View window we can find the option Clear Terminal, which clears the terminal. Note that no information is lost.

\[ b = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} \]

>> b = [1; 2; 1]

Exercise 1.8. Calculate the determinant, rank and trace of the matrix $C B^t$.

>> det(C * B.')

>> rank(C * B.')

>> trace(C * B.')

>> [P, D] = eig(C * B.')

>> inv(C * B.')

Exercise 1.11. Obtain a matrix of size 4 × 4 named D, taking from A its first four rows and columns.

>> D = A(1 : 4, 1 : 4) or >> D = A([1, 2, 3, 4], [1, 2, 3, 4])

Exercise 1.12. Obtain the size of A.

>> size(A)

Exercise 1.13. Define a matrix E by means of A which should contain the columns and rows of A in even positions.

>> E = A([2, 4], [2, 4, 6])

Exercise 1.14. Replace the element in position (2, 2) of E by the value 11.

>> E(2, 2) = 11

Exercise 1.15. By means of the matrix E, define a matrix equal to the second column of E, and a matrix equal to the first row.

>> E(:, 2)

>> E(1, :)

Exercise 1.16. Obtain the submatrix of A with the part of the second row between columns 3 and 6, both included.

>> A(2, 3 : 6)

Exercise 1.17. Obtain the submatrix of A with the part of column 2 between rows 1 and 3, both included.

>> A(1 : 3, 2)

Exercise 1.18. Obtain the submatrix of A with all the rows, and columns 1, 3 and 5.

>> A(:, [1, 3, 5])

Exercise 1.19. Obtain the submatrix of A with all its columns, and rows 2 and 4.

>> A([2, 4], :)

Exercise 1.20. Obtain all the elements of the matrix A in a column vector.

>> vec(A) or >> A(:)

\[ F = \begin{pmatrix} 1 & 3 & 2 & 2 & 4 & 6 \\ 5 & 3 & 6 & 2 & 4 & 4 \\ 1 & 3 & 5 & 8 & 4 & 0 \end{pmatrix} \]

>> F = [1, 3, 2, 2, 4, 6; 5, 3, 6, 2, 4, 4; 1, 3, 5, 8, 4, 0]

>> [A; F]

\[ G = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 4 & 6 \\ 5 & 3 & 6 \\ 2 & 4 & 4 \\ 1 & 3 & 5 \end{pmatrix} \]

>> G = [1, 3, 2; 2, 4, 6; 5, 3, 6; 2, 4, 4; 1, 3, 5]

>> [A, G]

Exercise 1.23. Define a matrix of the same dimension as G which takes the value 1 in each position where G has a value lower than 3, and 0 otherwise. Afterwards, name it M.

>> (G < 3)

>> M = ans

>> E.*E or >> E.^2

Exercise 1.25. Construct a matrix H of size 5 × 3 with random numbers in the interval (0, 1). Divide G by H element-by-element.

>> H = rand(5, 3)

>> G./H
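The element-by-element operators used in these solutions (.*, ./ and .^) can be compared side by side on two short row vectors. The vectors and the result names are made up for this sketch.

```octave
A = [1, 2, 3];
B = [4, 5, 6];
c = 2;

s  = A + c;      % adds c to every element
m  = A * c;      % multiplies every element by c
p  = A .* B;     % products a_i * b_i
q  = A ./ B;     % quotients a_i / b_i
pw = A .^ B;     % powers a_i ^ b_i
lg = log(A);     % built-in functions also act element by element

% Without the point, A * B is a matrix product and fails here,
% because two 1 x 3 matrices cannot be multiplied.
```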


For instance, matrix >> A = [1, 20; 2, 23; 1, 19; 2, 21] would contain four observations (four experimental units) of two characteristics. Such observations are (1, 20), (2, 23), (1, 19) and (2, 21).

Thus, if we study only one characteristic, our sample must be stored in a column vector.


Chapter 2


with Octave

2.1

In this practice we show basic commands of Octave in order to apply the descriptive statistics explained in the lectures to real-life problems.

i) frequency tables,

iv) dispersion measures,

v) graphical representations, etc.

2.2

We must clarify that Octave commands assume that our data are in a matrix where each column is a characteristic, and each row contains all the information of exactly one experimental unit.

>> values(x)

obtains the different values which are in vector x. They are displayed from the lowest to the greatest.

>> table(x)

obtains the absolute frequencies of the different values of vector x, with these values taken in increasing order.

3) Calculating the relative frequencies: if the absolute frequencies are obtained with the command >> table(x) and n is the sample size,

>> table(x)/n

will compute the relative frequencies of the different values of vector x, taken in increasing order.

In general, if x is the vector with the sample, the command

>> table(x)/length(x)

obtains the relative frequencies of the different values of vector x, taken in increasing order.

4) Calculating the accumulated absolute frequencies: these can be obtained with the command

>> cumsum(table(x))

which displays the accumulated absolute frequencies of the different values of vector x, taken in increasing order.

5) Calculating the accumulated relative frequencies: by means of the previous command, accumulated relative frequencies can be computed with

>> cumsum(table(x))/n

where n is the sample size.

In general, if vector x contains the data, the command

>> cumsum(table(x))/length(x)

computes the accumulated relative frequencies of the different values of vector x, taken in increasing order.
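The commands values and table belong to the Octave version used in these lab classes and may not be available in every installation. As a hedged alternative, the same four frequency columns can be built with unique, accumarray and cumsum; the sample x and all variable names below are assumptions of this sketch, not part of the course material.

```octave
x = [3; 1; 2; 3; 3; 1];              % small made-up sample (column vector)

[v, ~, idx] = unique(x);             % v: different values in increasing order
abs_freq = accumarray(idx(:), 1);    % absolute frequency of each value in v
rel_freq = abs_freq / length(x);     % relative frequencies
cum_abs  = cumsum(abs_freq);         % accumulated absolute frequencies
cum_rel  = cum_abs / length(x);      % accumulated relative frequencies

freq_table = [v, abs_freq, rel_freq, cum_abs, cum_rel]
```

Each row of freq_table corresponds to one different value of x, so the last accumulated relative frequency must be 1.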

>> mean(x)

obtains the mean of the data in vector x.

If x is a matrix, the order >> mean(x) returns a row vector with the mean of each column

(mean of each characteristic).

7) Calculating the median: The order

>> median(x)

calculates the median of the data in vector x.


If x is a matrix, the order >> median(x) returns a row vector with the median of each column

(median of each characteristic).

8) Calculating the mode: the command

>> mode(x)

obtains the mode of the data in vector x.

If x is a matrix, the mode is calculated in each column (in each characteristic) with >> mode(x).

If there are two modes or more, the smallest one is returned.

The command

>> meansq(x)

obtains the mean of the squares of the values which are in vector x.

If x is a matrix, the command >> meansq(x) returns a row vector containing the mean square of each column (of each characteristic).

10) Calculating the quasivariances and variances: the commands we will use are

>> var(x) or >> var(x, opt)

If x is a vector,

>> var(x)

computes the quasivariance of the elements of such a vector, that is,

\[ \hat{S}_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2 n_i . \]

If x is the matrix with the sample, the above command returns a row vector containing the quasivariance of each column (of each characteristic).

In relation to the command

>> var(x, opt)

the optional argument opt determines the type of normalization.

If opt is 0,

>> var(x, 0)

computes the quasivariance of the elements of vector x, that is, >> var(x, 0) and >> var(x) lead to the same result.

If x is a matrix, the command >> var(x, 0) returns a row vector containing the quasivariance of each column (of each characteristic).

If opt is 1,

>> var(x, 1)

computes the variance of the elements of vector x, that is,

\[ S_X^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})^2 n_i . \]

If x is a matrix, >> var(x, 1) returns a row vector containing the variance of each column (of each characteristic).

11) Calculating the quasistandard deviations and standard deviations: the commands we will use are the following

>> std(x) or >> std(x, opt)

If x is a vector,

>> std(x)

computes the quasistandard deviation of the elements of such a vector, that is,

\[ \hat{S}_X = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2 n_i } . \]

If x is a matrix, the command returns a row vector containing the quasistandard deviation of each column (of each characteristic).

In relation to the command

>> std(x, opt)

the optional argument opt determines the type of normalization to use.

If opt is 0,

>> std(x, 0)

computes the quasistandard deviation of the elements of vector x, that is, >> std(x, 0) and >> std(x) lead to the same result.

If x is a matrix, the command >> std(x, 0) returns a row vector containing the quasistandard deviation of each column (of each characteristic).

If opt is 1,

>> std(x, 1)

computes the standard deviation of the elements of vector x, that is,

\[ S_X = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})^2 n_i } . \]

If x is a matrix, the command >> std(x, 1) returns a row vector containing the standard deviation of each column (of each characteristic).

The command

>> range(x)

returns the range of the values in x, that is, the difference between the maximum and the minimum of the input data in vector x.

If x is a matrix, >> range(x) computes the range of each column of x.

The command

>> iqr(x)

returns the interquartile range, i.e., the difference between the third and the first quartile of the input data ($Q_3 - Q_1$).
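The difference between the two normalizations of var and std can be checked on a small sample. This is a minimal sketch; the sample x and the names on the left-hand sides are made up for the illustration.

```octave
x = [2; 4; 4; 6];                 % made-up sample with mean 4
n = length(x);

quasi_var = var(x);               % same as var(x, 0): divides by n - 1
variance  = var(x, 1);            % divides by n
quasi_std = std(x);               % square root of the quasivariance
std_dev   = std(x, 1);            % square root of the variance

% The same quantities computed directly from the definitions.
manual_quasi_var = sum((x - mean(x)).^2) / (n - 1);
manual_variance  = sum((x - mean(x)).^2) / n;
```

For this sample the sum of squared deviations is 8, so the quasivariance is 8/3 and the variance is 2.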

>> statistics(x)

returns a matrix with the minimum, first quartile, median, third quartile, maximum, mean, quasistandard deviation, skewness and kurtosis (nine different values) of the values in x.

If x is a matrix, >> statistics(x) computes the above values for each column (for each characteristic).

The result will be a matrix with 9 rows (one row for each of the above computed values) and as many columns as the data matrix has (if x is a vector we obtain a 9 × 1 matrix).

15) Sum of the data of a sample: by means of the command

>> sum(x)

we obtain the sum of the values in vector x.

If x is a matrix, >> sum(x) obtains the sum of the values of each column of x.

16) Empirical distribution function: the empirical distribution function of a sample can be evaluated with the command

>> empirical_cdf(y, X)

where X is the data vector and y is the point (or set of points in a vector) at which we want to evaluate the empirical distribution function.

17) Definition of categories: in many real-life problems it is necessary to divide quantitative data into different intervals, defining in this way different categories.

If x is a vector containing a sample, the command

>> cut(x, n)

where n ∈ {1, 2, …}, returns a vector of the same size as x saying which group each point in x belongs to. Groups are labelled from 1 to the number of groups (n).

Group 1 corresponds to the lowest values of the sample, group 2 contains the next lowest values, and so on. The last group will contain the greatest values. For instance, if >> cut(x, n) returns the value 2 in position 8, that means that the value of x which is in position 8 belongs to the second group.

In Lab Practice 1 we have seen different matrix manipulation instructions. Other interesting commands are the following:

18) Ordering elements of a matrix: it is possible to order the elements of a matrix in accordance with different criteria. Thus

- >> sort(A)

returns a matrix of the same size as A in which the elements of each column have been sorted in increasing order.

For instance, if

\[ A = \begin{pmatrix} 1 & 3 & 2 \\ 3 & 2 & 1 \\ 2 & 2 & 0 \end{pmatrix} \]

then >> sort(A) returns the matrix

\[ \begin{pmatrix} 1 & 2 & 0 \\ 2 & 2 & 1 \\ 3 & 3 & 2 \end{pmatrix} \]

Note that the command >> sort breaks the key structure rows = experimental units. From a statistical point of view this instruction is, let us say, dangerous.

- >> sortrows(A, c)

returns a matrix of the same size as A, where the values of column c have been sorted in increasing order and the remaining columns have been reordered in the same way, so that each row is kept together.

For instance, if

\[ A = \begin{pmatrix} 1 & 3 & 2 \\ 3 & 2 & 1 \\ 2 & 2 & 0 \end{pmatrix} \]

then >> sortrows(A, 3) returns the matrix

\[ \begin{pmatrix} 2 & 2 & 0 \\ 3 & 2 & 1 \\ 1 & 3 & 2 \end{pmatrix} \]

Note that the command >> sortrows does not break the rule columns = characteristics.

If we want the elements of column c ordered in decreasing order, we should write -c instead of c in the above command.

For instance, if

\[ A = \begin{pmatrix} 1 & 3 & 2 \\ 3 & 2 & 1 \\ 2 & 2 & 0 \end{pmatrix} \]

then >> sortrows(A, -1) returns the matrix

\[ \begin{pmatrix} 3 & 2 & 1 \\ 2 & 2 & 0 \\ 1 & 3 & 2 \end{pmatrix} \]

19) Graphical representations: different graphical representations for quantitative and qualitative data can be depicted with Octave:

- the empirical distribution function can be drawn in the following way. The command >> empirical_cdf(y, X) gives the value of the empirical distribution function of the sample in vector X at the point y. For instance, if the data are included in the interval [-5, 9], we can draw the picture from the point -6 to the point 10 with a step of 0.1 between points. Such a graphic is generated with the commands

>> y = -6:0.1:10

>> plot(y, empirical_cdf(y, X)),

- the command

>> hist(x)

plots a histogram for the quantitative data of vector x. Octave will draw 10 different boxes. The number of boxes can be modified with the command

>> hist(x, k)

where k is the number of boxes to include in the picture,

- if vector x contains a sample, a pie chart can be drawn with the command

>> pie(table(x)),

- the command

>> stem(values(x), table(x))

draws a line chart for the quantitative data in vector x, that is, for each different value in x, a height equal to its absolute frequency is drawn,

- the command

>> plot

erases all previous pictures. If we want to overlay pictures, we have to use the command >> hold on. Thus all the pictures will be drawn in the same figure. To deactivate this behaviour it is sufficient to use >> hold off. The command >> ishold returns the value 1 if hold on is activated, and 0 otherwise,

- the command

>> clf()

clears all the figures.

2.3

gender of drivers   age   km covered in the last year   make      cubic capacity of the car
Male                34    15231                         seat      1900
Female              26    12231                         seat      1600
Male                39    14150                         renault   1800
Female              61    9020                          peugeot   1600
Female              52    11574                         seat      1600
Female              38    12390                         renault   1400
Male                57    19265                         peugeot   1600
Female              43    12132                         opel      1600
Male                49    20780                         renault   2100
Male                36    18052                         seat      1800
Female              61    10020                         peugeot   1400
Female              52    12574                         opel      1800
Female              38    12490                         renault   1400
Male                57    15265                         seat      1600
Female              43    12432                         opel      1900
Male                49    20780                         renault   2100
Male                19    18052                         seat      1800

Exercise 2.1. Introduce the above data in a matrix for a later descriptive analysis.

For the qualitative data (gender and make of car) we consider codings, for instance male=0, female=1, and in the case of car makes, for instance opel=1, peugeot=2, renault=3 and seat=4.

The data matrix, let us call it A, can be entered as follows

>> A = [0, 34, 15231, 4, 1900; 1, 26, 12231, 4, 1600; 0, 39, 14150, 3, 1800; 1, 61, 9020, 2, 1600; 1, 52, 11574, 4, 1600; 1, 38, 12390, 3, 1400; 0, 57, 19265, 2, 1600; 1, 43, 12132, 1, 1600; 0, 49, 20780, 3, 2100; 0, 36, 18052, 4, 1800; 1, 61, 10020, 2, 1400; 1, 52, 12574, 1, 1800; 1, 38, 12490, 3, 1400; 0, 57, 15265, 4, 1600; 1, 43, 12432, 1, 1900; 0, 49, 20780, 3, 2100; 0, 19, 18052, 4, 1800]

Such a matrix can be defined in any of the ways explained in Lab Practice 1.

The matrix can be found in the file DatPrac2Ejer1, which can be downloaded from Campus Virtual. Matrix A contains the above samples.

Exercise 2.2. Using the above matrix, construct another matrix containing the data of women in the first rows, and the data of men in the last rows.

We have considered the coding male=0, female=1. Thus we can order the matrix in accordance with the values of the gender column, with such values sorted in decreasing order, that is,

>> B = sortrows(A, -1)

For such a purpose we cut the part of matrix B with the women data. We will use the commands explained in Lab Practice 1. Thus >> C = B(1 : 9, :) takes the first 9 rows of B, exactly where the women data are.

Exercise 2.4. Analyze the cubic capacities of the cars driven by women. Describe the cubic capacities of the cars driven by men.

For such a purpose we can consider only the column of the cubic capacities of the matrix defined in the previous question. Thus

>> D = C(:, 5)

generates matrix D, which contains the cubic capacities of the cars driven by women.

Now we can obtain the different values of the cubic capacities of cars driven by women with

>> values(D)

and the absolute frequencies with

>> table(D)

We should note that

>> table(D)/length(D)

provides the relative frequencies. The accumulated absolute frequencies are computed with the command

>> cumsum(table(D))

By means of the instruction

>> cumsum(table(D))/length(D)

the accumulated relative frequencies are displayed.

The mean is obtained with

>> mean(D)

the median with

>> median(D)

and in general with

>> statistics(D)

basic descriptive statistics are calculated.

The second part of the problem can be solved in the same way by considering the men data.

Statistics

21

Exercise 2.5. By means of the matrix defined in the first problem, construct a matrix in which the age appears in months.

The matrix with the original data was named A. We can cut the second column, multiply it by 12, and paste it again. That can be performed with

>> E = 12 * A(:, 2)

>> H = A

>> H(:, 2) = E

Exercise 2.6. Obtain the sum of the ages of the drivers. Obtain the mean age and the mean square of the ages.

The sum is obtained with

>> sum(A)

taking the second value of the output (the ages are in the second column).

For the second and the third question it is sufficient to use

>> mean(A)

and

>> meansq(A)

taking the second value of the output.

Exercise 2.7. Obtain the standard deviation and the variance of the women's ages. Repeat the same question with the men.

The matrix with the women's data was called C. We can consider the orders

>> std(C, 1)

and

>> var(C, 1)

taking the second value of the output. The case of male drivers can be solved in a similar way.

The median of the ages is obtained with

>> median(A)

taking the second value of the generated row vector.

For the interquartile range of the ages it is sufficient to consider

>> iqr(A)

taking the second value of the generated row vector.

Exercise 2.10. Classify the drivers in accordance with the number of kilometers covered in the last year. Distinguish among a low number of km., a medium number of km. and a large number of km.

We should generate three different categories with the number of km. in the last year. First we consider the vector with the number of kilometers in the last year with the order

>> I = A(:, 3)

and then we classify its values in three categories with

>> cut(I, 3)

Values which are assigned number 1 correspond to the category low number of km., number 2 corresponds to the category medium number of km., and number 3 to the category large number of km.

We should take the submatrix of A given by the columns 2, 3 and 5 (and all the rows). The order

>> J = A(:, [2, 3, 5])

generates the result.

Exercise 2.12. Draw the empirical distribution function of the cubic capacity of cars driven by women, and do the same with cars driven by men, both representations in the same figure. In accordance with such a graphic, is it possible to obtain a conclusion?

The empirical distribution function of a sample at a point $a \in \mathbb{R}$ gives the proportion of data which are lower than or equal to (not greater than) the point a.

The matrix with the women's data is C. We take the vector of cubic capacities with

>> K = C(:, 5)

We define the matrix with the men's data, for instance with

>> H = sortrows(A, 1)

>> I = H(1 : 8, :)

>> J = I(:, 5)

Vector J contains the cubic capacities of the cars driven by men.

Cubic capacities are between 1400 and 2100. We can consider jumps of 1 unit for the representation. We draw both graphics in the same figure with

>> y = 1300 : 1 : 2200

>> plot(y, empirical_cdf(y, K))

>> hold on

>> plot(y, empirical_cdf(y, J), 'r')

Note that in the last order we have added 'r' to draw such a graphic in red. Other initials can be used, such as 'y' (yellow), 'g' (green), etc.

In accordance with the figure, whatever the value of a is, the proportion of cars driven by women with cubic capacity lower than or equal to a is greater than the proportion of cars driven by men with cubic capacity lower than or equal to a.

As a consequence we can conclude that, in the sample data, the cubic capacities of cars driven by men are greater than those driven by women.

It is interesting to point out that this statement is only valid for that sample data. In order to obtain a general conclusion by means of the sample information, we should apply statistical inference techniques.
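The definition of the empirical distribution function used in Exercise 2.12 can be checked with core functions only. The following lines are a minimal sketch under some assumptions: the vectors K and J are typed by hand from the data table (cubic capacities of the cars driven by women and by men), and Fn is an illustrative name, not an order of Octave.

```octave
% Cubic capacities taken from the data table
K = [1600 1600 1600 1400 1600 1400 1800 1400 1900];   % cars driven by women
J = [1900 1800 1600 2100 1800 1600 2100 1800];        % cars driven by men

% Empirical distribution function of the sample s at the point a:
% proportion of observations lower than or equal to a
Fn = @(s, a) mean(s <= a);

Fn(K, 1600)   % 7/9 of the cars driven by women
Fn(J, 1600)   % 2/8 of the cars driven by men
```

Evaluating Fn on a grid such as 1300 : 2200 reproduces the values that are drawn in the figure of the exercise.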


Exercise 2.13. How would you obtain the accumulated relative frequencies of the different cubic capacities of the sample? And the accumulated absolute frequencies?


Accumulated relative frequencies of the different values can be obtained, for instance, by evaluating the empirical distribution function at such values. First we take the vector of cubic capacities with

>> K = A(:, 5)

and the different values of such a vector with

>> L = values(K)

Now we can evaluate the empirical distribution function of the sample stored in K at the values of the vector L by means of

>> M = empirical_cdf(L, K)

The values generated by the last order are the accumulated relative frequencies. To obtain the accumulated absolute frequencies, we can multiply the accumulated relative frequencies by the number of data, that is,

>> N = length(K) * M

Another possibility is the following. First we obtain the accumulated absolute frequencies with

>> K = A(:, 5)

which is the vector of cubic capacities, and

>> M = cumsum(table(K))

which obtains the accumulated absolute frequencies. The accumulated relative frequencies are obtained

by dividing the accumulated absolute frequencies by the number of data,

>> N = M/length(K)


Exercise 2.14. Draw a pie chart for the makes of the cars.

The makes are in the fourth column of matrix A. Thus with the order

>> pie(table(A(:, 4)))

we obtain the requested pie chart.

In this case the representation is given in the following figure

Exercise 2.15. Draw a histogram with five bins for the ages of the drivers.

The ages appear in the second column of the matrix A. Thus with the order

>> hist(A(:, 2), 5)

we generate the histogram, obtaining the following representation

Exercise 2.16. If x is a column vector with quantitative data, how would you construct a frequency table (first column with the different values, second column with absolute frequencies, third column with relative frequencies, fourth column with accumulated absolute frequencies and last column with accumulated relative frequencies)?

The above matrix could be obtained with

>> [values(x), transpose(table(x)), transpose(table(x)/length(x)), ...

transpose(cumsum(table(x))), transpose(cumsum(table(x))/length(x))]

which gives the table of frequencies of the sample in vector x.
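The frequency table of Exercise 2.16 can also be sketched with core functions (unique, accumarray and cumsum) instead of values and table. In the lines below the sample x is typed by hand from the cubic capacities of the data table, and the names v, idx, n_abs, n_rel, N_abs, N_rel and T are illustrative assumptions.

```octave
% Sample: cubic capacities of the 17 cars of the data table (column vector)
x = [1900 1600 1800 1600 1600 1400 1600 1600 2100 ...
     1800 1400 1800 1400 1600 1900 2100 1800]';

[v, ~, idx] = unique(x);          % different values (sorted) and class of each datum
n_abs = accumarray(idx(:), 1);    % absolute frequencies
n_rel = n_abs / numel(x);         % relative frequencies
N_abs = cumsum(n_abs);            % accumulated absolute frequencies
N_rel = cumsum(n_rel);            % accumulated relative frequencies

% One row per different value, columns in the order required by the exercise
T = [v, n_abs, n_rel, N_abs, N_rel]
```

The last accumulated absolute frequency must coincide with the sample size, and the last accumulated relative frequency must be 1, which is a quick way of checking the table.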


Chapter 3

3.1

- evaluate density functions (continuous distributions) and probability mass functions (discrete distributions),

3.2

The main orders for the above purposes will be described in this section.

1) Simulating data from probability distributions: let X be a random variable which follows any of the probability distributions studied in the lectures (binomial, geometric, Poisson, etc.). We want to simulate values drawn from such a random variable.

The following table shows the orders to simulate data from the distributions which appear on the left hand side. Note that all orders finish with the ending rnd:

Distribution     Octave
Binomial         binornd
Chi2             chi2rnd
Discrete         discrete_rnd
Exponential      exprnd
F (Snedecor)     frnd
Geometric        geornd
Normal           normrnd
Poisson          poissrnd
t (Student)      trnd
Uniform          unifrnd
Weibull          wblrnd

>> binornd(n, p, r, c), or >> binornd(n, p, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers drawn from the binomial distribution with parameters n and p.

>> chi2rnd(n, r, c), or >> chi2rnd(n, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers drawn from the chi2 distribution with n degrees of freedom ($\chi^2_n$).

>> discrete_rnd(n, v, p) (or >> discrete_rnd(v, p, r, c), or >> discrete_rnd(v, p, s)) returns a row vector of size n with random numbers from the discrete distribution which can take the values which appear in vector v, whose probabilities are proportional to the (non-negative) values which appear in vector p (if the sum of the components of p is not equal to 1, such values are divided by the total sum to obtain the probabilities of the elements of vector v). The role of r, c and s in the other orders is the same as in the previous cases.

>> exprnd(lambda, r, c), or >> exprnd(lambda, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from the exponential distribution whose mean is lambda, that is, from the exponential distribution with parameter 1/lambda (be careful with this).

>> frnd(m, n, r, c), or >> frnd(m, n, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from an F distribution with m and n degrees of freedom.

>> geornd(p, r, c), or >> geornd(p, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from a geometric distribution with parameter p.

>> normrnd(m, d, r, c), or >> normrnd(m, d, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from a normal distribution with mean m and standard deviation d.

>> poissrnd(lambda, r, c), or >> poissrnd(lambda, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from a Poisson distribution with mean lambda, that is, from a Poisson distribution with parameter lambda.

>> trnd(n, r, c), or >> trnd(n, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from a t (Student) distribution with n degrees of freedom.

>> unifrnd(a, b, r, c), or >> unifrnd(a, b, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from a uniform distribution on the interval (a, b).

>> wblrnd(a, b, r, c), or >> wblrnd(a, b, s), returns an $r \times c$ matrix, or an $s \times s$ matrix, of random numbers from a Weibull distribution with parameters a and b.
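To see what the simulation orders do, the same kind of samples can be sketched with rand and the inverse transform method. This is only an illustration under stated assumptions, not the way the orders above are implemented; the names lambda, u, e, v, p, c, k and d are chosen for the example.

```octave
% Exponential distribution with mean lambda (parameter 1/lambda)
lambda = 2;
u = rand(10000, 1);               % uniform random numbers on (0, 1)
e = -lambda * log(u);             % inverse transform method
mean(e)                           % should be close to lambda

% Discrete distribution on the values v with weights p
v = [1, 2, 3, 4, 5, 6];
p = [20, 15, 15, 30, 10, 10];     % proportional values are enough
c = cumsum(p) / sum(p);           % cumulative probabilities
u = rand(10000, 1);
% Number of cumulative probabilities exceeded by each u, plus 1
k = sum(bsxfun(@gt, u, c(1:end-1)), 2) + 1;
d = reshape(v(k), [], 1);         % simulated values as a column vector
```

The weights in p are normalized by their sum, which mirrors the remark about discrete_rnd accepting proportional values.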

2) Evaluating

i) probability mass functions (discrete distributions) and density functions (continuous case),

ii) cumulative distribution functions,

iii) and the inverse of cumulative distribution functions (in case of existence):

Octave has different orders to evaluate:

- usual probability mass functions (discrete distributions) and usual density functions (continuous case),

- common cumulative distribution functions, and

- the inverses of such cumulative distribution functions.

We should remark that the inverse of a distribution function F exists if and only if F is continuous and strictly increasing (where it takes values different from 0 and 1).

Several continuous distributions, for instance the normal distribution or the F distribution, have cumulative distribution functions with inverse. This allows us to obtain key values of such distributions without using their associated tables (included at the end of Chapter 5 of Statistical Notes).

We clarify this with the following example.

Let $Z \sim N(0, 1)$. Suppose we want to obtain the value x for the above distribution such that the tail on its right hand side is equal to 0.05. Therefore the tail on its left hand side should be equal to 0.95. Thus we look for the value x such that $F_Z(x) = 0.95$, equivalently, $F_Z^{-1}(0.95) = x$, that is, we need to know the value of the inverse function of the cumulative distribution function at the point 0.95.

The following table summarizes the orders for different probability distributions. In the first place we have the name of the distribution, then the order which computes values of the probability mass function (discrete case) or values of the density function (continuous case) (ending with pdf), in the third place we find the order which returns values of cumulative distribution functions (ending with cdf), and finally the order which calculates values of the inverse of the cumulative distribution function (ending with inv):

Distribution     pdf             cdf             inv
Binomial         binopdf         binocdf
Chi2             chi2pdf         chi2cdf         chi2inv
Discrete         discrete_pdf    discrete_cdf
Exponential      exppdf          expcdf          expinv
F (Snedecor)     fpdf            fcdf            finv
Geometric        geopdf          geocdf          geoinv
Normal           normpdf         normcdf         norminv
Poisson          poisspdf        poisscdf
t (Student)      tpdf            tcdf            tinv
Uniform          unifpdf         unifcdf         unifinv
Weibull          wblpdf          wblcdf          wblinv

We clarify the use of the above orders by means of an example of the binomial distribution (discrete case) and the normal distribution (continuous case), the rest of the orders being totally analogous.

It is important to remark that for all the distributions of the above table, and for all the orders, the point at which we want to evaluate the different functions appears in the first place, before the parameters of the distributions.

In the following instructions, x will be a real number, or a vector of real values.

>> binopdf(x, n, p) computes the probability mass function at x of the binomial distribution with parameters n and p. If x is a vector it computes the probability mass function at each component of x.

>> binocdf(x, n, p) computes the cumulative distribution function of the binomial distribution with parameters n and p at the point x. If x is a vector it computes the above value at each component of x.

>> normpdf(x, a, b) computes the probability density function of the normal distribution with mean a and standard deviation b (parameters a and b) at the point x. If x is a vector it computes the above value at each component of x.

>> normcdf(x, a, b) computes the cumulative distribution function of the normal distribution with mean a and standard deviation b (parameters a and b) at the point x. If x is a vector it computes the above value at each component of x.

>> norminv(x, a, b) computes the inverse of the cumulative distribution function of the normal distribution with mean a and standard deviation b (parameters a and b) at the point x. If x is a vector it computes the above value at each component of x.

The rest of orders are completely similar. The parameters of each of the above orders can be viewed in Point 1) of this Lab practice.
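The example with $Z \sim N(0, 1)$ can be reproduced without the distribution orders, because the standard normal distribution function can be written with the error function. The following sketch uses only the core functions erf and erfinv; Phi and PhiInv are illustrative names introduced here, not orders of Octave.

```octave
% Standard normal cumulative distribution function and its inverse,
% written with the core functions erf and erfinv
Phi    = @(x) 0.5 * (1 + erf(x / sqrt(2)));
PhiInv = @(p) sqrt(2) * erfinv(2 * p - 1);

x = PhiInv(0.95)      % value leaving a right tail of 0.05, about 1.6449
Phi(x)                % recovers 0.95
```

The first value is the one that norminv(0.95, 0, 1) is expected to return.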

3.3

Exercise 3.1. Simulate by means of two different procedures 200 throws of a fair die. Calculate the proportion of values lower than or equal to 4 in such a simulation. Obtain the proportion of each possible value in the above simulation.

A first possibility is the simulation of a discrete random variable which takes the values 1, 2, ..., 6 with the same probability,

>> A = discrete_rnd(200, [1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1])

The proportion of values lower than or equal to 4 can be displayed with

>> empirical_cdf(4, A)

The proportion of each possible value in the above simulation is computed with

>> table(A)/200

We should indicate that the vector of probabilities [1, 1, 1, 1, 1, 1] could be defined with the order

>> ones(1, 6)

Another possibility is the following. We generate 200 random numbers on the interval (0, 1) with

>> A = rand(200, 1)

those numbers being divided in 6 different categories with

>> B = cut(A, 6)

which gives the matrix with the simulation of the throws. Note that this is possible because we have a fair die.

The proportion of values lower than or equal to 4 can be obtained with

>> empirical_cdf(4, B)

The proportion of each possible value in the above simulation is computed with

>> table(B)/200

Exercise 3.2. We have a coin with probability of tail equal to 0.4. Simulate 1000 tosses of such a coin and obtain the proportion of tails. Solve the above simulation in at least two different ways.

We can simulate the tosses of the coin with the Bernoulli distribution with parameter 0.4 (or binomial with parameters 1 and 0.4). Note that it takes the value 1 (tail) with probability 0.4, and the value 0 (head) with probability 0.6. Thus the order is

>> A = binornd(1, 0.4, 1000, 1)

The proportion of ones (tails) in the matrix A can be computed with

>> mean(A)

or with

>> table(A)/length(A)

taking the second value of the output.

Another possibility is the following. We simulate 1000 random numbers in the interval (0, 1) with

>> A = rand(1000, 1)

and transform those lower than or equal to 0.4 into the value 1 (tail), and those greater than 0.4 into the value 0, by means of the order

>> B = (A <= 0.4)

The proportion of tails (ones in the matrix B) can be obtained with

>> mean(B)

or with

>> table(B)/length(B)

taking the second value of the output.

A third possibility is based on the simulation of a discrete random variable which can take values 1 and 0 with probabilities 0.4 and 0.6 respectively. This can be carried out with

>> C = discrete_rnd(1000, [0, 1], [0.6, 0.4])

The proportion of tails (ones in the matrix C) can be obtained with

>> mean(C)

or with

>> table(C)/length(C)

taking the second value of the output.
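The simulations of Exercises 3.1 and 3.2 can also be sketched with rand and core functions only. The lines below are an illustrative variant, not one of the procedures described in the solutions; the names dice, coin, prop_le4, props and prop_tail are assumptions of the example.

```octave
% Exercise 3.1: 200 throws of a fair die
n = 200;
dice = floor(6 * rand(n, 1)) + 1;           % values 1, ..., 6 with equal probability
prop_le4 = mean(dice <= 4)                  % proportion of values lower than or equal to 4
props = accumarray(dice, 1, [6, 1]) / n     % proportion of each possible value

% Exercise 3.2: 1000 tosses of a coin with probability of tail 0.4
coin = (rand(1000, 1) <= 0.4);              % 1 (tail) with probability 0.4, 0 (head) otherwise
prop_tail = mean(coin)                      % proportion of tails
```

Since the values are random, each execution gives slightly different proportions, close to 4/6, 1/6 and 0.4 respectively.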

Exercise 3.3. Suppose that we have a die where the probabilities of the numbers 1, 2, 3, 4, 5 and 6 are 0.2, 0.15, 0.15, 0.3, 0.1 and 0.1 respectively. Simulate 500 throws of such a die.

The simulation can be generated with

>> A = discrete_rnd(500, [1, 2, 3, 4, 5, 6], [0.2, 0.15, 0.15, 0.3, 0.1, 0.1])

or

>> A = discrete_rnd([1, 2, 3, 4, 5, 6], [0.2, 0.15, 0.15, 0.3, 0.1, 0.1], 500, 1)

Note that it is not necessary to write the values of the probabilities; we can write proportional values, so we could consider the vector [20, 15, 15, 30, 10, 10] in the above orders.

Exercise 3.4. Consider a discrete random variable X which takes the values -1, 0 and 1 with probabilities 0.25, 0.5 and 0.25 respectively. Generate 2000 values drawn from such a variable and obtain the sample mean; compare it with the mean of the variable.

The values are generated with

>> A = discrete_rnd(2000, [-1, 0, 1], [0.25, 0.5, 0.25])

The sample mean is obtained with the order

>> mean(A)

In our simulation the sample mean was 0.007, a value which is very close to the mean of the variable ($EX = 0$).
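The comparison of Exercise 3.4 can be sketched with core functions: the mean of the variable is the sum of the values weighted by their probabilities, and the sample can be simulated from uniform numbers. The names v, p, EX, u and S are assumptions of this example.

```octave
v = [-1, 0, 1];              % values of X
p = [0.25, 0.5, 0.25];       % probabilities
EX = sum(v .* p)             % mean of the variable, equal to 0

% Simulation with rand: u < 0.25 gives -1, u > 0.75 gives 1, otherwise 0
u = rand(2000, 1);
S = (u > 0.75) - (u < 0.25);
mean(S)                      % sample mean, close to EX
```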

Exercise 3.5. Generate 5000 values of a B(4, 0.5) distribution. Construct the empirical distribution function of such a sample. Draw in the same figure that function and the distribution function of the above binomial distribution.

The sample is generated with

>> A = binornd(4, 0.5, 5000, 1)

Since such values are in the set {0, 1, . . . , 4}, we draw the empirical cumulative distribution function (for instance) in the interval [-1, 5]. Thus we consider

>> x = -1 : 0.01 : 5

>> plot(x, empirical_cdf(x, A))

Since we want to overlap pictures, we use the order

>> hold on

The cumulative distribution function of the distribution B(4, 0.5) is drawn with the order

>> plot(x, binocdf(x, 4, 0.5))

In our simulation we obtained the graphic (see next figure)
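The distribution function of the B(4, 0.5) distribution drawn above can be computed by hand from the probability mass function, which gives the heights of the steps of the graphic. The following sketch uses core functions only; the names n, p, k, pmf and cdf are assumptions of the example.

```octave
n = 4; p = 0.5;
k = 0:n;                                             % possible values
% Probability mass function: C(n, k) p^k (1 - p)^(n - k)
pmf = arrayfun(@(j) nchoosek(n, j), k) .* p.^k .* (1 - p).^(n - k);
cdf = cumsum(pmf)                                    % 1/16, 5/16, 11/16, 15/16, 1
```

The empirical distribution function of a large sample should take values close to these at the points 0, 1, 2, 3 and 4.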

Exercise 3.6. Generate 1000 values of a normal distribution with parameters 0 and 2. Draw the empirical distribution function of the sample, and in the same figure, draw the distribution function of such a normal.

A possible solution is

>> A = normrnd(0, 2, 1000, 1)

>> x = -3 : 0.01 : 3

>> plot(x, empirical_cdf(x, A))

>> hold on

>> plot(x, normcdf(x, 0, 2), 'r')

In our simulation we have obtained the figure (see next figure)

Exercise 3.7. By means of the above problem, how would you simulate 1000 values of a normal distribution with parameters 1 and 0.25?

We know that if $W \sim N(0, 2)$, then $\frac{1}{8} W + 1 \sim N(1, 0.25)$, since the standard deviation of $aW + b$ is $|a|$ times the standard deviation of W (see properties of the normal distribution in Statistical Notes, Chapter 5).

In the above problem we have stored 1000 values drawn from a normal distribution with parameters 0 and 2 ($N(0, 2)$) in matrix A. Therefore it is sufficient to consider the order

>> B = 0.125 * A + 1

to obtain a sample drawn from the $N(1, 0.25)$ distribution.

Exercise 3.8. Represent the density function of a $N(0, 1)$ random variable between the values -3 and 3.

A possible solution is

>> x = -3 : 0.01 : 3

>> plot(x, normpdf(x, 0, 1))

We obtain the figure

Exercise 3.9. Draw in the interval [-3, 3], and in the same figure, the density functions of the normal distributions with common standard deviation 0.5, and means 0, 0.5 and 1 respectively. By means of such a graphic, which result about the probabilities of such distributions can be derived by intuition?

The following instructions provide the solution of the problem

>> x = -3 : 0.01 : 3

>> plot(x, normpdf(x, 0, 0.5))

>> hold on

>> plot(x, normpdf(x, 0.5, 0.5), 'r')

>> plot(x, normpdf(x, 1, 0.5), 'g')

From a probabilistic point of view, and in accordance with the meaning of the area below a density function, we sense that the probabilities of those distributions are equal except for a translation.
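For Exercise 3.7, the effect of a linear transformation on a normal sample can be checked by simulation. The sketch below assumes, as in normrnd, that the second parameter of the normal distribution is the standard deviation; it uses randn from the core, and the names W, a, b and B are illustrative.

```octave
% Sample from a normal distribution with mean 0 and standard deviation 2
W = 2 * randn(100000, 1);

% Linear transformation a*W + b: the mean becomes b and the
% standard deviation becomes |a| times the original one
a = 1/8; b = 1;
B = a * W + b;

[mean(B), std(B)]     % close to 1 and 0.25
```

With a larger coefficient, for instance a = 1/4, the same check gives a standard deviation close to 0.5.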

Exercise 3.10. Let X be a random variable following a binomial distribution with parameters 10 and 0.2 ($X \sim B(10, 0.2)$). Obtain $P(X = 6)$ and $F_X(5)$.

The solution is given by the orders

>> binopdf(6, 10, 0.2)

>> binocdf(5, 10, 0.2)

Exercise 3.11. Let X be a random variable following a geometric distribution with parameter 0.3 ($X \sim G(0.3)$). Obtain $P(X = 4)$ and $P(X \le 7)$.

The solution is given by the orders

>> geopdf(4, 0.3)

>> geocdf(7, 0.3)

Exercise 3.12. Let Z be a random variable following a normal distribution with parameters 0 and 1 ($Z \sim N(0, 1)$). Obtain $P(Z \le 2)$ and $f_Z(1.3)$.

We can obtain the above values by means of

>> normcdf(2, 0, 1)

>> normpdf(1.3, 0, 1)

Exercise 3.13. Let X be a random variable following an exponential distribution with parameter 4 ($X \sim \exp(4)$). Obtain $P(X \le 3)$ and $f_X(2)$.

Note that for the exponential distribution we should enter the mean of the distribution (1/parameter) instead of the parameter.

The solution is given by

>> expcdf(3, 0.25)

>> exppdf(2, 0.25)

Exercise 3.14. Let X be a random variable following a Weibull distribution with parameters 2 and 5. Without using integration, how would you obtain $P(X \in (1, 4))$, $P(X \le 8)$ and $P(X > 3)$? Moreover, determine the point $c \in \mathbb{R}$ such that the distribution function of that variable takes the value 0.95.

Note that $P(X \in (1, 4)) = F_X(4) - F_X(1)$, $P(X \le 8) = F_X(8)$ and $P(X > 3) = 1 - P(X \le 3) = 1 - F_X(3)$.

We can obtain the above values with

>> wblcdf(4, 2, 5) - wblcdf(1, 2, 5)

which is equal to 0.96923,

>> wblcdf(8, 2, 5)

obtaining 1 and

>> 1 - wblcdf(3, 2, 5)

which is equal to 5.0359e-004.

In relation to the second point, we need a value c such that $F_X(c) = 0.95$, equivalently, $c = F_X^{-1}(0.95)$. Thus we obtain such a value with

>> wblinv(0.95, 2, 5)

being equal to 2.4908.

Exercise 3.15. Consider a normal distribution with mean equal to 0 and variance equal to 1 ($N(0, 1)$). Obtain the value which determines on its right-hand side a tail of size 0.05. Obtain the value which leaves on its left-hand side a tail of size 0.025.

Let F denote the distribution function of a $N(0, 1)$ distribution. We need to obtain the value a such that $F(a) = 0.95$, equivalently $a = F^{-1}(0.95)$.

Therefore

>> norminv(0.95, 0, 1)

gives the solution, which is 1.6449.

In relation to the second question, we need the value b such that $F(b) = 0.025$, that is, $b = F^{-1}(0.025)$. Thus we obtain the solution with

>> norminv(0.025, 0, 1)

which is equal to -1.9600.

Exercise 3.16. Solve the above problem when the distribution is a $\chi^2$ with 11 degrees of freedom ($\chi^2_{11}$), and when we consider a distribution F with 10 and 15 degrees of freedom ($F_{10,15}$).

The solution is similar to the above problem, in this case with the distributions $\chi^2_{11}$ and $F_{10,15}$.

In relation to the first distribution, the solutions are given by the orders

>> chi2inv(0.95, 11)

which returns 19.675, and

>> chi2inv(0.025, 11)

giving the value 3.8157.

In the second case the solution can be obtained with

>> finv(0.95, 10, 15)

returning 2.5437, and

>> finv(0.025, 10, 15)

which generates the value 0.28396.

Exercise 3.17. An engineer connects in parallel two resistors of 100 Ω and 25 Ω (ohm). However the real resistance could differ from those values. Suppose that the real resistances are two independent random variables with distributions $X \sim N(100, 10)$ and $Y \sim N(25, 2.5)$ respectively.

It is known that the real resistance of the assembly, denoted by R, is given by the formula

$$R = \frac{XY}{X + Y}.$$

We estimate by means of the strong law of large numbers and simulation the required probability.

We simulate for instance 10000 values of the random variable R to estimate the probability of

the event 19 < R < 21 by means of the proportions of values of the simulation which satisfy that

condition, that is, by means of the proportions of values of the simulation which belong to the interval

(19, 21).

For the simulation of the values of R, we simulate 10000 values of the variable X and 10000

values of Y , to generate by means of them the simulated values of R.

Such a procedure can be performed with

>> X = normrnd(100, 10, 10000, 1);

>> Y = normrnd(25, 2.5, 10000, 1);

>> R = (X .* Y)./(X + Y);

generating a column vector with 10000 simulated values of the variable R.

Now we need to count how many of such values belong to the interval (19, 21).

That is calculated with

>> P = (19 < R);

>> Q = (R < 21);

>> L = P .* Q;

>> mean(L)

Exercise 3.18. A machine has three components which work in an independent way. The life (in hours) of each component has exponential distribution with parameter 1/2. The machine works if at least two of the components work. Program a procedure which generates the value 1 if the machine works after half an hour.

In the first place we generate the values of the three exponential distributions (life of the three components) with

>> A = exprnd(2, 1, 3)

Note that the first number of the above order (2) must be the inverse of the parameter of the exponential distribution, that is, its mean.

Now we obtain which components are working after 0.5 hours with

>> B = (A > 0.5)

The number of components working after that period can be obtained with

>> c = sum(B)

If c is greater than or equal to 2, the value 1 (the machine works) should be displayed, in any other

case the value 0 (the machine does not work). This can be performed with

>> if c >= 2

>> f = 1;

>> else

>> f = 0;

>> endif

>> f


Exercise 3.19. By means of the above problem, estimate the probability that the machine works at least half an hour.

We apply again the strong law of large numbers and simulation to estimate such a probability. We repeat (simulate) the experiment a large number of times and we estimate the probability by means of the proportion of times such an event occurs in the simulations.


Let us consider 10000 repetitions. The following program solves the problem.

>> A = exprnd(2, 10000, 3);

>> B = (A > 0.5);

>> for i = 1 : 10000

>> C(i) = sum(B(i, :));

>> endfor

>> f = 0;

>> for j = 1 : 10000

>> if C(j) >= 2

>> f = f + 1;

>> endif

>> endfor

>> p = f / 10000


Matrix P of size $10000 \times 1$ has a value 1 in those positions in which the simulation of R takes a value greater than 19, and has the value 0 in any other case.

In a similar way, matrix Q of size $10000 \times 1$ has a value 1 in those positions in which the simulation of R takes a value lower than 21, and has the value 0 in any other case.

The element-by-element product of P and Q generates a matrix of size $10000 \times 1$ with 1 in those positions in which the simulation of R belongs to the interval (19, 21), and 0 in any other case.

The last order computes the proportion of values equal to 1 in the matrix L, which is the estimation of the requested probability (we could use >> table(L)/length(L) taking the second value of the output).

Our simulations estimate such a probability around the value 0.45.
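The estimation of the probability of the event 19 < R < 21 can be written in a compact way with core functions. The sketch below replaces normrnd by randn (a normal sample with mean m and standard deviation d is m + d * randn), which is an assumption of this example; the names n, X, Y, R and p are illustrative.

```octave
n = 10000;
X = 100 + 10  * randn(n, 1);     % N(100, 10), second parameter = standard deviation
Y = 25  + 2.5 * randn(n, 1);     % N(25, 2.5)
R = (X .* Y) ./ (X + Y);         % resistance of the parallel assembly

% Proportion of simulated values which belong to the interval (19, 21)
p = mean(R > 19 & R < 21)
```

The logical operator & plays the role of the element-by-element product of the two matrices of zeros and ones.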


In the first line we simulate 10000 times (10000 rows) the lives of three components (each row is a possible machine).

In the second line we detect which components have a duration greater than half an hour.

In the first loop we count the number of components working after half an hour in each machine (each row).

With the variable f we count the number of rows with at least two components working after half an hour (number of machines working after half an hour).

Finally p gives the proportion of machines working after half an hour, that is, the estimation of the required probability (applying the strong law of large numbers).

With this program the proportion of times such an event occurs is very close to 0.875, in agreement with the exact probability $3q^2(1 - q) + q^3 \approx 0.875$, where $q = e^{-0.25}$ is the probability that a component lasts more than half an hour.
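The estimation of Exercise 3.19 can be compared with the exact probability. The following sketch is a vectorized variant without loops; it simulates the exponential lives with the inverse transform method instead of exprnd, which is an assumption of this example, and the names q, p_exact, n, A and p_sim are illustrative.

```octave
% Probability that one component lasts more than 0.5 hours (mean life 2 hours)
q = exp(-0.5 / 2);

% At least two of the three independent components work
p_exact = 3 * q^2 * (1 - q) + q^3

% Simulation: 10000 machines, three exponential lives with mean 2 in each row
n = 10000;
A = -2 * log(rand(n, 3));
p_sim = mean(sum(A > 0.5, 2) >= 2)     % proportion of machines working after half an hour
```

Both values should be close to 0.875.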

Exercise 3.20. Let X be a discrete random variable. It is said that a value $a \in \mathbb{R}$ is a mode of the random variable X if $P(X = a) \ge P(X = x)$ for any $x \in \mathbb{R}$, that is, if the probability mass function reaches its maximum at the point a.

It is well-known that if X follows a Poisson distribution with parameter $\lambda$, its mode(s) is given by the integer part of $\lambda$ if such a value is not an integer number, and by $\lambda - 1$ and $\lambda$ when $\lambda$ is an integer number.

Corroborate such a result by drawing in the interval [0, 15] the probability mass function of random variables following Poisson distributions with parameters 2, 5, 6.3 and 7.4.

A possible solution for the first parameter, the rest of cases being analogous, is the following

>> p = 2

>> for i = 1 : 16

>> y(i) = poisspdf(i - 1, p)

>> endfor

>> x = 0 : 15

>> plot(x, y)

Here p plays the role of the parameter of the Poisson distribution.

We should observe that to define the coordinates of vector y it is not possible to use y(0); that is the reason we define y(i) with i from 1 to 16, the value y(i) being the value of the probability mass function at the point i - 1. Note that the order plot joins the drawn points with segments.


The graphical representation which is obtained appears below. Note that it has two maximum

values at the points 1 and 2, which corroborates the statement of the problem in the case p = 2.


The graphical representations of the other parameters can be obtained just by modifying the value

of the parameter in p.
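The modes of Exercise 3.20 can also be located numerically, without reading them from the graphic. The sketch below evaluates the Poisson probability mass function directly from its formula with core functions; the names lambdas, x, modes and y are assumptions of the example, and a small tolerance is used to detect ties between two maxima.

```octave
lambdas = [2, 5, 6.3, 7.4];     % parameters of the exercise
x = 0:15;                       % points where the mass function is evaluated
modes = cell(size(lambdas));

for i = 1:numel(lambdas)
  lam = lambdas(i);
  y = exp(-lam) * lam.^x ./ factorial(x);      % Poisson probability mass function
  modes{i} = x(abs(y - max(y)) < 1e-12);       % point(s) where it reaches its maximum
endfor

modes      % {[1 2], [4 5], 6, 7}
```

For the integer parameters 2 and 5 two modes appear, while for 6.3 and 7.4 the mode is the integer part of the parameter.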

Chapter 4

4.1

In this practice we are learning to infer on the parameters of a normal distribution by means of some hypothesis tests studied in the lectures (see Chapter 7 of Statistical Notes).

4.2

The orders to test on the parameters of a normal distribution are shown in this section.

Throughout this practice we will assume that we have a random sample $(X_1, X_2, \ldots, X_n)$ drawn from a normal random variable X with parameters $\mu$ and $\sigma$ ($X \sim N(\mu, \sigma)$). By means of that sample we want to infer on both parameters.

1) Test on the mean of a normal distribution with unknown variance ($N(\mu, \sigma)$ with unknown $\sigma$): to perform a test on the mean of a normal random variable whose variance is unknown, we will use the order

>> t_test(x, m, alt)

In the above order, vector x must contain the sample from which we want to infer.

The variable m should contain the value we want to test if it is the mean of the normal distribution.

The argument alt determines the kind of test we are performing (the alternative hypothesis). Thus,

if the argument is "<>" (quotation marks must be included), the test we perform is

$H_0: \mu = m$ against $H_1: \mu \neq m$,

if the argument is "!=" (quotation marks must be included), the test we perform is

$H_0: \mu = m$ against $H_1: \mu \neq m$

(the same test as in the above case),

if the argument is ">" (quotation marks must be included), the hypotheses of the test are

$H_0: \mu = m$ against $H_1: \mu > m$,

if the argument is "<" (quotation marks must be included), the test is

$H_0: \mu = m$ against $H_1: \mu < m$.

The above orders compute the p-value of the sample of x in the corresponding test.

The order

>> [pval, stat, gl] = t_test(x, m, alt)

calculates three different values: pval, which contains the p-value of the sample in the corresponding test, stat, with the value of the statistic of the test, and gl, which gives the degrees of freedom of the statistic of the test (sample size minus 1).

We recall that the statistic of the test is

$$\frac{\overline{X}_n - m}{S_X / \sqrt{n}},$$

where

$$\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2 \quad \text{(sample quasivariance)}.$$

Such a statistic follows a distribution $t_{n-1}$ (t with $n-1$ degrees of freedom) when the null hypothesis $H_0$ is true, where n is the sample size.

We should remark that the above procedure is also valid when the variance $\sigma^2$ is known; however, in such a case it is more appropriate to consider the method explained in the following point.

2) Test on the mean of a normal distribution with known variance, (N (, ) with known ): to

perform a test on the mean of a normal variable with known variance, we will use the order

>> z test(x, m, v, alt)

In the above order, vector x must contain the sample from which we want to infer.

The variable m should contain the value we want to test if it is the mean of the normal distribution.

The variable v should contain the variance ( 2 ), which is known (be careful to introduce the

variance and not the standard deviation).

The argument alt determines the kind of test we are performing (the alternative hypothesis).

Thus,

if such an argument does not appear, the test we perform is

$H_0 : \mu = m$ against $H_1 : \mu \neq m$,

if the argument is "!=" (quotation marks must be included), the test we perform is

$H_0 : \mu = m$ against $H_1 : \mu \neq m$

(the same test as in the above case),

if the argument is ">" (quotation marks must be included), the hypotheses of the test are

$H_0 : \mu = m$ against $H_1 : \mu > m$,

if the argument is "<" (quotation marks must be included), the test we perform is

$H_0 : \mu = m$ against $H_1 : \mu < m$.

The above order computes the p-value of the sample in the corresponding test.

The order

>> [pval, stat] = z_test(x, m, v, alt)

computes two different values: pval, which contains the p-value of the sample in the corresponding test, and stat, which displays the value of the statistic of the test.

We recall that the statistic of the test is

$$\frac{\bar{X}_n - m}{\sigma / \sqrt{n}},$$

where

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{(sample mean)}.$$

3) Test on the standard deviation of a normal distribution ($N(\mu, \sigma)$): to our knowledge, the test on the standard deviation of a normal distribution has not been implemented in Octave, probably because of its simplicity. Therefore it is necessary to program such a test. For such a purpose we use the orders on the inverse of cumulative distribution functions which were analyzed in Lab Practice 2.

We will consider the tests

a) $H_0 : \sigma = \sigma_0$ against $H_1 : \sigma \neq \sigma_0$,

b) $H_0 : \sigma = \sigma_0$ against $H_1 : \sigma < \sigma_0$,

c) $H_0 : \sigma = \sigma_0$ against $H_1 : \sigma > \sigma_0$.

The statistic of the test is

$$(n-1) \frac{S_X^2}{\sigma_0^2},$$

where

$$S_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$

Such a statistic follows a distribution $\chi^2_{n-1}$ ($\chi^2$ with $n-1$ degrees of freedom) when $H_0$ is true, where $n$ is the sample size.

The critical regions at a level of significance $\alpha$ are

a) $CR = \left\{ (x_1, \dots, x_n) : (n-1) \frac{s_X^2}{\sigma_0^2} \le \chi^2_{n-1;1-\alpha/2} \ \text{or} \ (n-1) \frac{s_X^2}{\sigma_0^2} \ge \chi^2_{n-1;\alpha/2} \right\}$,

b) $CR = \left\{ (x_1, \dots, x_n) : (n-1) \frac{s_X^2}{\sigma_0^2} \le \chi^2_{n-1;1-\alpha} \right\}$,

c) $CR = \left\{ (x_1, \dots, x_n) : (n-1) \frac{s_X^2}{\sigma_0^2} \ge \chi^2_{n-1;\alpha} \right\}$,

respectively, that is, samples in which the value of the statistic, which we will denote by stat, satisfies

a) $stat \le \chi^2_{n-1;1-\alpha/2}$ or $stat \ge \chi^2_{n-1;\alpha/2}$,

b) $stat \le \chi^2_{n-1;1-\alpha}$,

c) $stat \ge \chi^2_{n-1;\alpha}$,

respectively, where $\chi^2_{n-1;\alpha}$ stands for the value which has on its right-hand side a tail of size $\alpha$ when we consider the distribution $\chi^2_{n-1}$. Therefore, such a value can be computed with the order

>> chi2inv(1 - alpha, n - 1)

If x denotes the vector with the sample data, the value of the statistic can be obtained with

>> (length(x) - 1) * var(x)/sigma0^2

If we have test a), and

$stat \le \chi^2_{n-1;1-\alpha/2}$ or $stat \ge \chi^2_{n-1;\alpha/2}$,

we reject the null hypothesis to conclude the alternative hypothesis, that is, the sample belongs to the critical region. If such a condition is not satisfied, we should not reject the null hypothesis.

If we have test b), and

$stat \le \chi^2_{n-1;1-\alpha}$,

we reject the null hypothesis to conclude the alternative hypothesis, that is, the sample belongs to the critical region. If such a condition is not satisfied, we fail to reject the null hypothesis.

If we have considered test c), and

$stat \ge \chi^2_{n-1;\alpha}$,

we reject the null hypothesis to conclude the alternative hypothesis, that is, the sample belongs to the critical region. If such a condition is not satisfied, we should not reject the null hypothesis.

4.3

Exercise 4.1. The opening time of a web page is a very important characteristic in the design of such pages. It is assumed that the opening time follows a normal distribution. In order to analyze such times, web pages designed by the same programmer are taken at random, recording their opening times (135 data) in hundredths of a second. These are the data:

The sample can be found in the file DatPrac4Ejer1, which can be loaded from Campus Virtual. Matrix A contains the opening times.

i) Could we consider that the mean opening time of the web pages of the above programmer is 0.65 hundredths of a second, or on the contrary is there enough evidence to reject such a hypothesis?

ii) Solve the same question when we know that the variance of the opening times is 0.02.

Since we do not know the true variance of the opening times, in the first question we consider the order

>> p1 = t_test(A, 0.65)

which gives a p-value equal to 0.

Therefore there is enough evidence to reject that the mean of the opening times of the web pages designed by such a programmer is equal to 0.65. The above p-value is stored in the variable p1.

In relation to the second question, now we know that the variance of the random variable opening time is 0.02. Therefore we will use in this case the order

>> p2 = z_test(A, 0.65, 0.02)

The computed p-value is 0.36887. Thus we do not have enough evidence to reject the null hypothesis. That is, we should not reject that the mean of the opening times of the web pages designed by the programmer is equal to 0.65. Note that in this case the p-value is stored in the variable p2.

Exercise 4.2. Some compression rates have been observed using level 9 of Gzip compression to compress the Lisp source code. Such rates are (80 data):

The sample can be found in the file DatPrac4Ejer2, which can be loaded from Campus Virtual. Matrix A contains the compression rates.

i) Is it possible to consider that the mean compression rate is 40, or on the contrary, is there enough evidence to reject such a value and conclude that it is greater?

ii) Is it possible to consider that the variance of the compression rate is 100, or on the contrary, is there enough evidence to reject such a value and conclude that it is lower?

For the first question we consider the order

>> p1 = t_test(A, 40, ">")

which gives the p-value of the sample in the variable p1.

In this case we obtain a p-value equal to 0.086073. If the level of significance is 0.05, we do not have enough evidence to reject that the mean compression rate is 40 and conclude that it is greater.

The second question is in relation to the variance of the compression rate. The value of the statistic can be computed with

>> stat = (length(A) - 1) * var(A)/100

We store it in the variable stat.

To know if the sample belongs to the critical region or not, we need to obtain in this case the value $fr = \chi^2_{n-1;0.95}$, which determines the frontier of the critical region.

Such a value can be computed with

>> fr = chi2inv(0.05, length(A) - 1)

We obtain that stat = 105.93319 and fr = 59.522. Thus the relation $stat \le fr$ is not satisfied, and therefore there is not sufficient evidence to reject that the variance of the compression rate is 100.

Exercise 4.3. In a market analysis on computing and communications, 28 personal computer users were taken at random, studying, among other characteristics, the space (in gigabytes) taken up by films in the hard disks of their computers. Data are the following:

The sample can be found in the file DatPrac4Ejer3, which can be loaded from Campus Virtual. Matrix A contains the space in gigabytes taken up by films.

Let us suppose that the space taken up by films follows a normal distribution.

i) Could we consider that the mean of the space taken up by films in hard disks of personal computers is 20 gigabytes, or on the contrary is such a mean lower?

ii) Is there enough evidence to reject that the standard deviation of the space taken up by films is 8 gigabytes?

Use level of significance 0.05.

For the first question we consider the order

>> p1 = t_test(A, 20, "<")

which gives in the variable p1 the p-value of the sample in the test.

Since such a value is 0.28619, we do not have enough evidence to reject that the mean of the space taken up by films in the hard disks of personal computers is 20 gigabytes.

The second question is on the standard deviation of the space taken up by films. The value of the statistic is obtained with

>> stat = (length(A) - 1) * var(A)/8^2

To know if the sample is in the critical region we need to obtain the values $fr1 = \chi^2_{n-1;0.975}$ and $fr2 = \chi^2_{n-1;0.025}$, which give the frontiers of the critical region. Such values can be obtained with

>> fr1 = chi2inv(0.025, length(A) - 1)

>> fr2 = chi2inv(0.975, length(A) - 1)

respectively. Those values are fr1 = 14.573 and fr2 = 43.195.

Note that $stat \le \chi^2_{n-1;0.975}$. Therefore the sample belongs to the critical region and so we should reject that the standard deviation of the space taken up by films is 8 gigabytes, to conclude that it is different from that value.

Exercise 4.4. A web page hangs frequently. A programmer studies the length of the periods during which the web page is hung. The data are the following:

7.97452, 3.87601, 0.61877, 8.52384, 2.50740, 6.63117, 8.83795, 10.01265, 3.17150, 6.39356

i) Is there enough evidence to reject that the mean time the web page is hung is 4 seconds and conclude that such a mean time is greater than 4?

ii) What is the value of the statistic used in the above question?

Use level of significance 0.05.


In the first place, the data are stored in a matrix named A by means of the order

>> A = [7.97452; 3.87601; 0.61877; 8.52384; 2.50740; 6.63117; 8.83795; 10.01265; 3.17150; 6.39356]

The p-value of the above sample for the test proposed in question i) can be calculated with

>> p1 = t_test(A, 4, ">")

Such a p-value is stored in the variable p1, being equal to p1 = 0.047068. Since such a p-value is less than 0.05, we should reject the null hypothesis of mean time equal to 4 and conclude that the mean time the web page is hung is greater than 4 seconds.


To obtain the value of the statistic used in the above question, it is sufficient to consider the order

>> [p1, stat, gl] = t_test(A, 4, ">")

It computes three different values.

The variable p1 contains the p-value of the sample, that is, 0.047068. The variable stat displays the value of the statistic of the test, in this case 1.8711 (this solves the second question of the problem). Finally gl gives the degrees of freedom of the statistic of the test, in this case 9 (sample size minus 1).
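As a cross-check, the statistic and the degrees of freedom of Exercise 4.4 can be reproduced directly from the formula $(\bar{X}_n - m)/(S_X/\sqrt{n})$. The following is only a minimal sketch with basic Octave functions; the names m, n, gl_manual and stat_manual are ours and are not part of t_test.

```octave
% Data of Exercise 4.4 (times during which the web page is hung).
A = [7.97452; 3.87601; 0.61877; 8.52384; 2.50740; ...
     6.63117; 8.83795; 10.01265; 3.17150; 6.39356];

m = 4;                  % value of the mean under the null hypothesis
n = length(A);          % sample size
gl_manual = n - 1;      % degrees of freedom of the t statistic

% std(A) divides by n - 1, so it is the square root of the quasivariance.
stat_manual = (mean(A) - m) / (std(A) / sqrt(n));

printf("stat = %.4f, gl = %d\n", stat_manual, gl_manual);
```

The printed statistic should agree with the value of stat returned by t_test in the solution above.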


Since the sample was generated with a mean equal to 10, it is obvious that if we move away from

10, the p-values should decrease rapidly.

Exercise 4.5. For the data of the above problem, study if it is possible to consider that the variance of the time the web page is hung is 4 or, on the contrary, such a variance is greater.

The value of the statistic is obtained with

>> stat = (length(A) - 1) * var(A)/4

We obtain that stat = 22.10843.

To know if the sample belongs to the critical region, we need to obtain in this case the value $fr1 = \chi^2_{n-1;0.05}$, which is the frontier of the critical region.

Such a value is obtained with

>> fr1 = chi2inv(0.95, length(A) - 1)

In this case we obtain that fr1 = 16.919. Note that the value stat is greater than fr1. Therefore we should reject that the variance is 4, to conclude that it is greater.
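The decision of Exercise 4.5 can also be checked through a p-value instead of the frontier of the critical region, since $stat \ge \chi^2_{n-1;\alpha}$ holds exactly when the right tail $P(\chi^2_{n-1} \ge stat)$ is at most $\alpha$. The sketch below is an illustration under that equivalence, not the method of the practice: it writes the $\chi^2$ distribution function with the base function gammainc, and the names sigma0_sq, pval and reject are ours.

```octave
% Data of Exercise 4.4, reused in Exercise 4.5.
A = [7.97452; 3.87601; 0.61877; 8.52384; 2.50740; ...
     6.63117; 8.83795; 10.01265; 3.17150; 6.39356];

sigma0_sq = 4;          % variance under the null hypothesis
alpha = 0.05;           % level of significance
n = length(A);

% Statistic (n - 1) * quasivariance / sigma0^2.
stat = (n - 1) * var(A) / sigma0_sq;

% gammainc(x, a) is the regularized lower incomplete gamma function;
% the chi-square distribution function with k degrees of freedom
% evaluated at s is gammainc(s / 2, k / 2).
pval = 1 - gammainc(stat / 2, (n - 1) / 2);

reject = (pval <= alpha);   % true means: reject H0 in favour of "greater"
printf("stat = %.5f, p-value = %.4f, reject = %d\n", stat, pval, reject);
```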


Exercise 4.6. Generate a random sample of size 1000 from a distribution $N(10, 2)$. For each value between 8 and 12 with a step of 0.01, obtain the p-value of the above sample for the test which studies if such a sample comes from a normal distribution with mean that value and standard deviation equal to 2. Represent in a graphic the above values against their p-values. Give an interpretation of such a graphic.

>> A = normrnd(10, 2, 1000, 1);

The values between 8 and 12 with a step of 0.01 are created with

>> x = 8 : 0.01 : 12;

The p-values of the tests proposed in the statement of the problem are computed with the orders (recall that the third argument of z_test is the variance, here $2^2 = 4$, and not the standard deviation)

>> for k = 1 : 401

>> w(k) = z_test(A, x(k), 2^2);

>> endfor

such p-values being stored in the vector w.

The graphical representation is drawn with the order

>> plot(x, w)

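The curve of Exercise 4.6 can also be obtained without a loop. The sketch below assumes the two-sided z test described in this practice, whose p-value is $2(1 - \Phi(|z|))$ with $z = (\bar{X}_n - m)/(\sigma/\sqrt{n})$; this quantity equals erfc(|z|/sqrt(2)), so only base Octave functions are needed. Here randn replaces normrnd, and the names sigma, z, wmax and kmax are ours.

```octave
n = 1000;
sigma = 2;                        % known standard deviation
A = 10 + sigma * randn(n, 1);     % sample from N(10, 2) built with randn
x = 8 : 0.01 : 12;                % values of the mean to be tested

% z statistic for every value of x at once.
z = (mean(A) - x) / (sigma / sqrt(n));

% Two-sided p-values: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
w = erfc(abs(z) / sqrt(2));

% The largest p-value is reached at the tested value closest to mean(A).
[wmax, kmax] = max(w);
printf("largest p-value %.4f at m = %.2f\n", wmax, x(kmax));
```

As in the exercise, plot(x, w) draws the p-values against the tested means.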

CHAPTER 5. LAB PRACTICE 5.- TESTS ON THE PARAMETERS OF TWO INDEPENDENT NORMAL

DISTRIBUTIONS

5.1

In this practice we learn to use Octave to test on the parameters of two independent normal distributions by means of the hypothesis tests explained in the lectures.

5.2

The orders to test on the parameters of two independent normal distributions are shown in this section.

Throughout this practice we will consider that we have two random samples $(X_1, X_2, \dots, X_{n_1})$ and $(Y_1, Y_2, \dots, Y_{n_2})$ drawn from $X$ and $Y$ respectively, where $X$ and $Y$ are independent random variables with distributions $X \sim N(\mu_1, \sigma_1)$ and $Y \sim N(\mu_2, \sigma_2)$.

The main orders in this context are the following:

1) Test of equality of variances of two independent normal distributions: to perform a test of equality of variances of two independent normal distributions, we will use the order

>> var_test(x, y, alt)

In the above order, vectors x and y should contain the samples from which we want to infer.

The argument alt determines the kind of test we are performing (the alternative hypothesis). Thus,

if such an argument does not appear, the test we perform is

$H_0 : \sigma_1 = \sigma_2$ against $H_1 : \sigma_1 \neq \sigma_2$,

if the argument is "!=" (quotation marks must be included), the test we perform is

$H_0 : \sigma_1 = \sigma_2$ against $H_1 : \sigma_1 \neq \sigma_2$

(the same test as in the above case),

if the argument is ">" (quotation marks must be included), the test has the hypotheses

$H_0 : \sigma_1 = \sigma_2$ against $H_1 : \sigma_1 > \sigma_2$,

if the argument is "<" (quotation marks must be included), the test has the hypotheses

$H_0 : \sigma_1 = \sigma_2$ against $H_1 : \sigma_1 < \sigma_2$.

The above orders compute the p-value of the samples in the corresponding test.

The order

>> [pval, stat, gl1, gl2] = var_test(x, y, alt)

calculates four different values: pval, which contains the p-value of the samples in the corresponding test, stat, with the value of the statistic of the test, gl1, which gives the degrees of freedom of the sample in vector x (size of x minus 1), and gl2, which gives the degrees of freedom of the sample in vector y (size of y minus 1).

2) Test of equality of means of two independent normal distributions when variances are equal: to perform a test of equality of means of two independent normal distributions when variances are equal we use the order

>> t_test_2(x, y, alt)

In the above order, vectors x and y must contain the samples from which we want to infer.

The argument alt determines the kind of test we are performing (the alternative hypothesis). Thus,

if such an argument does not appear, the test we perform is

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \neq \mu_2$,

if the argument is "!=" (quotation marks must be included), the test we perform is

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \neq \mu_2$

(the same test as in the above case),

if the argument is ">" (quotation marks must be included), the test we perform is

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 > \mu_2$,

if the argument is "<" (quotation marks must be included), the test has the hypotheses

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 < \mu_2$.

The above orders compute the p-value of the samples in the corresponding test.

The order

>> [pval, stat, gl] = t_test_2(x, y, alt)

computes three different values: pval, which contains the p-value of the samples in the corresponding test, stat, which displays the value of the statistic of the test, and gl, which gives the degrees of freedom of the statistic (sum of the sizes of x and y minus 2).

3) Test of equality of means of two independent normal distributions when variances are different: to perform a test of equality of means of two independent normal distributions when variances are different, we use the order

>> welch_test(x, y, alt)

In the above order, vectors x and y must contain the samples from which we want to infer.

The argument alt determines the kind of test we are performing (the alternative hypothesis). Thus,

if such an argument does not appear, the test we perform is

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \neq \mu_2$,

if the argument is "!=" (quotation marks must be included), the test we perform is

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \neq \mu_2$

(the same test as in the above case),

if the argument is ">" (quotation marks must be included), the test we perform is

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 > \mu_2$,

if the argument is "<" (quotation marks must be included), the test we perform is

$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 < \mu_2$.

The above orders compute the p-value of the samples in the corresponding test.

The order

>> [pval, stat, gl] = welch_test(x, y, alt)

obtains three different values: pval, which has the p-value of the samples in the corresponding test, stat, which displays the value of the statistic of the test, and gl, which gives the degrees of freedom of the statistic (see point 8.2.3 of Statistics Notes).

Remark: When we want to test on the means of two independent normal random variables, $N(\mu_1, \sigma_1)$ and $N(\mu_2, \sigma_2)$, by means of random samples drawn from such variables, it is usual that we do not know if the variances ($\sigma_1^2$ and $\sigma_2^2$) are equal or not (normally $\mu_1$, $\mu_2$, $\sigma_1$ and $\sigma_2$ are unknown).

In order to know which test should be applied (Point 2 or Point 3), it is necessary to test in the first place $H_0 : \sigma_1 = \sigma_2$ against $H_1 : \sigma_1 \neq \sigma_2$.

If in such a test we reject the null hypothesis, concluding that $\sigma_1 \neq \sigma_2$, we will apply Point 3 to test on the means. On the contrary, if we fail to reject the null hypothesis, we will apply Point 2.

5.3

Exercise 5.1. A mouse factory produces two different kinds of mice, say A and B. It is known that the lives of both models follow normal distributions. In order to compare both models, the lives of 140 mice of model A and 105 of model B were studied. The data are the following:

The samples can be found in the file DatPrac5Ejer1, which can be loaded from Campus Virtual. Matrix A contains the lives of model A, matrix B of model B.

i) What could we conclude on the mean life of the above mouse models? Are they equal?

We test the equality of the means of the life times of the mice.

If $X$ is the random variable life time of a mouse of model A and $Y$ stands for the random variable life time of a mouse of model B, we know that $X \sim N(\mu_1, \sigma_1)$ and $Y \sim N(\mu_2, \sigma_2)$.

For such a purpose, firstly we test the equality of variances, that is, we consider the test

$H_0 : \sigma_1 = \sigma_2$ against $H_1 : \sigma_1 \neq \sigma_2$.

In accordance with the result of this test we will analyze the equality $\mu_1 = \mu_2$ with the appropriate test.

The inference on the equality of variances is performed by

>> p1 = var_test(A, B)

which obtains the p-value of the samples in such a test. In this case the p-value is p1 = 0. Therefore we should reject that the variances of the life times of both models are equal, concluding that $\sigma_1 \neq \sigma_2$.

Since we have considered that the above variances are different, we will use the order welch_test to infer on the means of the life times.

Such an inference can be performed with the order

>> p2 = welch_test(A, B)

which computes the p-value of the samples in the corresponding test.

We obtain that such a p-value is p2 = 0.00019569. Since it is less than the usual level of significance, there is enough evidence to reject the null hypothesis of equality of the mean life times and conclude that such values are different.

In order to solve the last question, we consider the order

>> p3 = welch_test(A, B, "<")

which gives the p-value of the samples in the new test.

The p-value is p3 = 9.7847e-004. Note that it is lower than the usual level of significance, thus we should reject the hypothesis of equal mean life time and conclude that the mean life time of B is greater than the mean life time of A.

Exercise 5.2. A company of processors analyzes the quality of two processors, the so-called ProMT1

and ProMT2. It is known that the speeds of both follow normal distributions. Taken at random 15

ProMT1 processors and 23 ProMT2 processors, the following speeds were found:

ProMT1: 2.7, 2.65, 2.83, 2.95, 2.64, 2.45, 3.01, 2.56, 2.76, 2.99, 2.76, 2.87, 3.05, 2.65, 3.09

ProMT2: 2.85, 2.89, 2.77, 3.00, 2.87, 2.76, 2.78, 2.67, 2.97, 2.99, 2.87, 2.93, 2.78, 2.98, 3.01, 2.84, 2.88, 2.79, 2.89, 2.91, 2.87, 2.88, 2.92

The samples can be found in the file DatPrac5Ejer2, which can be loaded from Campus Virtual. Matrix A contains the lives of model A, matrix B of model B.

By means of the above data, could we conclude that the mean speed of both models is the same?

We should test if the mean speeds of ProMT1 and ProMT2 processors are equal. For such a purpose, in the first place we should test if the variances of the speeds of such processors are the same, or on the contrary they are different. In accordance with the result of that test, we will apply a specific test for the question of the problem.

>> A1 = [2.7; 2.65; 2.83; 2.95; 2.64; 2.45; 3.01; 2.56; 2.76; 2.99; 2.76; 2.87; 3.05; 2.65; 3.09]

>> A2 = [2.85; 2.89; 2.77; 3.00; 2.87; 2.76; 2.78; 2.67; 2.97; 2.99; 2.87; 2.93; 2.78; 2.98; 3.01; 2.84; 2.88; 2.79; 2.89; 2.91; 2.87; 2.88; 2.92]

The inference on the equality of variances is performed by

>> p1 = var_test(A1, A2)

which obtains the p-value of the samples in such a test. Such a p-value is $p1 = 9.9280 \cdot 10^{-4}$, which is lower than the usual level of significance; therefore we reject that the variances of the speeds of both processors are the same, concluding that they are different.

As a consequence we will use the order welch_test to infer on the equality of the mean speeds of both processors.

We perform the order

>> p2 = welch_test(A1, A2)

which obtains the p-value of the samples in such a test.

The p-value is p2 = 0.16534. Since it is greater than the usual level of significance, there is not enough evidence to reject the equality of mean speeds.

Exercise 5.3. The percentages of hard disk taken up by films in young and adult people follow normal distributions. In an analysis of such percentages the following data were obtained:

Young: 20, 18, 22, 32, 17, 25, 34, 19, 17, 27, 24, 31, 17, 23, 26, 28, 30, 26, 33, 12

Adult: 6, 12, 17, 7, 21, 34, 6, 12, 18, 12, 11, 8, 28, 17, 15, 16, 14

Could we consider that young and adult people give the same use to the computers in relation to the space taken up by films?

Since the percentages of hard disk taken up by films in young and adult people follow normal distributions, we can consider that young and adult people give the same use to the computers (with respect to films) if the means and variances of both distributions are the same. Note that in that case the probability distributions of the space taken up by films are the same. Therefore we are testing the equality of means and variances.

>> A = [20; 18; 22; 32; 17; 25; 34; 19; 17; 27; 24; 31; 17; 23; 26; 28; 30; 26; 33; 12]

>> B = [6; 12; 17; 7; 21; 34; 6; 12; 18; 12; 11; 8; 28; 17; 15; 16; 14]

The inference on the equality of variances is performed by

>> p1 = var_test(A, B)

The p-value is p1 = 0.43184, thus we should not reject the equality of the variances.

It is interesting to note that if in the above test we had rejected the hypothesis $\sigma_1 = \sigma_2$, the problem would be finished, since the probability distributions of the space taken up by films would not be the same.

Now we test the equality of the means of the space taken up by films in young and adult people. As a consequence of the previous test we will use the order t_test_2. Thus we perform

>> p2 = t_test_2(A, B)

which obtains the p-value of the samples in such a test.

The p-value is p2 = 0.0002. Since it is smaller than the usual level of significance, we reject that the mean space taken up by films is the same for young and adult people.

As a consequence we cannot consider that young and adult people give the same use to the computers in relation to the space taken up by films.
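To see what lies behind the two orders used in Exercise 5.3, the statistics can be computed by hand with basic Octave functions. This is only an illustrative sketch: it assumes the usual definitions (the ratio of quasivariances for the test of equal variances, and the pooled-variance t statistic with $n_1 + n_2 - 2$ degrees of freedom for the test of equal means), and the names F, sp2, t and gl are ours. The p-values themselves are the ones returned by var_test and t_test_2.

```octave
% Data of Exercise 5.3.
A = [20; 18; 22; 32; 17; 25; 34; 19; 17; 27; ...
     24; 31; 17; 23; 26; 28; 30; 26; 33; 12];       % young people
B = [6; 12; 17; 7; 21; 34; 6; 12; 18; 12; ...
     11; 8; 28; 17; 15; 16; 14];                    % adult people

n1 = length(A);
n2 = length(B);

% Ratio of quasivariances, with n1 - 1 and n2 - 1 degrees of freedom.
F = var(A) / var(B);

% Pooled quasivariance, used when the variances are considered equal.
sp2 = ((n1 - 1) * var(A) + (n2 - 1) * var(B)) / (n1 + n2 - 2);

% t statistic for the equality of means and its degrees of freedom.
t = (mean(A) - mean(B)) / sqrt(sp2 * (1 / n1 + 1 / n2));
gl = n1 + n2 - 2;

printf("F = %.4f, t = %.4f, gl = %d\n", F, t, gl);
```

A ratio F close to 1 and a large value of |t| are in line with the conclusions reached above: the equality of variances is not rejected, whereas the equality of means is.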

Exercise 5.4. A manufacturer asserts that the mean information rate sent by its ports is equal to the mean information rate of a port developed by another manufacturer. However, the latter declares that the mean information rate of its port is greater than the one of the former. It is known that both information rates follow normal distributions.

Taken at random ports of both manufacturers, information rates in bytes per second were obtained. What could we conclude with the following data?

First manufacturer: 1199.3, 1200.2, 1200.9, 1198.3, 1200.5, 1200.3, 1200.5, 1199.5, 1200.8, 1200.0, 1200.4, 1199.7, 1199.7, 1199.7, 1198.7, 1199.4, 1200.5, 1199.8, 1199.1, 1199.4

Second manufacturer: 1318.4, 1299.7, 1301.3, 1294.7, 1310.8, 1306.3, 1296.2, 1309.6, 1287.5, 1310.7, 1303.2, 1308.9, 1307.7, 1297.2, 1314.6

The samples can be found in the file DatPrac5Ejer4, which can be loaded from Campus Virtual. Matrix A contains the data of the first manufacturer, matrix B the data of the second one.

In the first place we test if the variances of the information rates of both ports are the same by means of

>> p1 = var_test(A, B)

which obtains the p-value of the samples in such a test. Such a p-value is p1 = 6.6613e-016. Therefore we reject the equality of variances.

As a consequence, the inference on the means will be carried out with the order welch_test. So we consider

>> p2 = welch_test(A, B, "<")

which computes the p-value of the samples in the above test. Such a value is p2 = 1.9990e-017. Therefore we should reject the equality of mean information rates and conclude that the mean information rate of the second manufacturer is greater than the mean information rate of the first manufacturer.

Exercise 5.5. Generate 500 data from a normal distribution $N(0, 1)$. Define a vector with values from -3 to 3 with a step of 0.01 (601 values). For each value of the above vector, generate 1000 values from a normal distribution $N(value, 1)$.


Draw a figure which associates with each value of the vector the p-value of the test of whether the data generated from $N(0, 1)$ and the data generated from $N(value, 1)$ come from random variables with the same mean. Give an interpretation of such a graphic.

A possible solution of the problem is

>> A = normrnd(0, 1, 500, 1);

>> x = -3 : 0.01 : 3;

>> for j = 1 : 601

>> Bj = normrnd(x(j), 1, 1000, 1);

>> p(j) = t_test_2(A, Bj);

>> endfor

>> plot(x, p)

We obtained the following figure in our simulation


Since the first sample was generated with a mean equal to 0, it is obvious that if our mean moves away from 0, the corresponding p-values should decrease rapidly.

Exercise 5.6. Generate 500 data from a normal distribution $N(0, 1)$. Define a vector with values from -3 to 3 with a step of 0.01 (601 values). For each value of the above vector, generate 1000 values from a normal distribution $N(0, value)$.

Draw a figure which associates with each value of the vector the p-value of the test of whether the data generated from $N(0, 1)$ and the data generated from $N(0, value)$ come from random variables with the same variance. Give an interpretation of such a graphic.

>> A = normrnd(0, 1, 500, 1);

>> x = -3 : 0.01 : 3;

>> for j = 1 : 601

>> Bj = normrnd(0, x(j), 1000, 1);

>> p(j) = var_test(A, Bj);

>> endfor

>> plot(x, p)


Since the first sample was generated with a variance equal to 1, it is obvious that if our variance moves away from 1, the corresponding p-values should decrease rapidly.


Chapter 6

Lab Practice 6.- Tests on population proportions

6.1

In this practice we learn to test on population proportions. We will study both the case of one proportion and the case of two proportions.

6.2

The main orders in this context are the following:

1) Test on a population proportion ($p$): to our knowledge, the test on a population proportion $p$ has not been implemented in Octave, probably because of its simplicity. So it is necessary to program such a test. For such a purpose we use the orders on the inverse of cumulative distribution functions which were analyzed in Lab Practice 3.

We will consider the tests

a) $H_0 : p = p_0$ against $H_1 : p \neq p_0$,

b) $H_0 : p = p_0$ against $H_1 : p > p_0$,

c) $H_0 : p = p_0$ against $H_1 : p < p_0$.

The statistic of the test is

$$Stat = \frac{\frac{X}{n} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}},$$

which follows a distribution (approximately) $N(0, 1)$ when the null hypothesis is true and the sample size $n$ is large enough.

The critical regions at a level of significance $\alpha$ are

a) $CR = \left\{ (x_1, \dots, x_n) : \left| \frac{\frac{x}{n} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} \right| \ge z_{\alpha/2} \right\}$,

b) $CR = \left\{ (x_1, \dots, x_n) : \frac{\frac{x}{n} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} \ge z_{\alpha} \right\}$,

c) $CR = \left\{ (x_1, \dots, x_n) : \frac{\frac{x}{n} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} \le z_{1-\alpha} = -z_{\alpha} \right\}$,

respectively. That is, samples in which the value of the statistic, which we will denote by stat, satisfies

a) $|stat| \ge z_{\alpha/2}$,

b) $stat \ge z_{\alpha}$,

c) $stat \le z_{1-\alpha}$,

respectively.

Note that $z_{\alpha}$ stands for the value which has on its right-hand side a tail of size $\alpha$ when we consider a distribution $N(0, 1)$. That is, if $W \sim N(0, 1)$, then $P(W > z_{\alpha}) = \alpha$. Such a value can be computed with the order >> norminv(1 - alpha, 0, 1) as we saw in Lab Practice 3.

The sample should be stored in a vector, let us denote it by x, whose values should be 0 or 1, where a value equal to 1 in a position means that in such an observation the property whose proportion is analyzed was satisfied, and a value equal to 0 in that position means that in such an observation the property under study did not hold.

If x denotes the vector with the sample data, the value of the statistic can be obtained with

>> (mean(x) - p0)/sqrt(p0 * (1 - p0)/length(x))

In accordance with the hypotheses we have considered, we should obtain the critical region to know whether the sample belongs to it, in which case we should reject the null hypothesis, or on the contrary it does not belong to the critical region but to the acceptance region, in which case we do not have enough evidence to reject the null hypothesis.

2) Test of equality of two population proportions: to test the equality of two population proportions, the order we will use is

>> prop_test_2(x1, n1, x2, n2, alt)

In the above order, the argument x1 is the number of times that the event whose proportion ($p_1$) we are studying occurred in the first sample. The argument n1 is the size of the first sample.

The meanings of x2 and n2 are similar but with the second sample.

The argument alt determines the kind of test we are performing. Thus,

if such an argument does not appear, the test we perform is

$H_0 : p_1 = p_2$ against $H_1 : p_1 \neq p_2$,

if the argument is ">" (quotation marks should be included), the test we perform is

$H_0 : p_1 = p_2$ against $H_1 : p_1 > p_2$,

if the argument is "<" (quotation marks should be included), the test we perform is

$H_0 : p_1 = p_2$ against $H_1 : p_1 < p_2$.

The order >> prop_test_2(x1, n1, x2, n2, alt) computes the p-value of the samples in the corresponding test.

The order

>> [pval, stat] = prop_test_2(x1, n1, x2, n2, alt)

computes two different values: pval, which contains the p-value of the samples in the corresponding test, and stat, which displays the value of the statistic of the test.

Could we conclude that the proportion of visits from outside the University remains the same during the summer period, or on the contrary does it decrease?

Let $p$ be the proportion of visits from outside the University during the summer months. We want to test

$H_0 : p = 0.2$ against $H_1 : p < 0.2$

The value of the statistic can be computed with

>> stat = (32/167 - 0.2)/(sqrt(0.2 * (1 - 0.2)/167))

whose value is -0.27084.

With the usual level of significance 0.05, we should compare this value with $z_{0.95}$. The quantity $z_{0.95}$ can be obtained with the order

>> norminv(0.05, 0, 1)

giving $z_{0.95} = -1.6449$.

Since $stat = -0.27084 > -1.6449$, the sample does not belong to the critical region, and so there is not enough evidence to reject the null hypothesis.

Exercise 6.2. The proportion of computers of a company affected by a virus is being analyzed by one of the employees. A workmate affirms that the proportion of affected computers is 5%, but a second workmate says that it is greater than 5%.

Taken at random 150 computers, a number 1 was recorded if the computer had such a virus, and 0 otherwise. The sample which was obtained is the following:

Could we conclude that the affirmation of the first workmate is true, or should we reject it and accept the opinion of the second workmate?

We want to test

$H_0 : p = 0.05$ against $H_1 : p > 0.05$

The value of the statistic can be obtained with

>> stat = (mean(A) - 0.05)/sqrt(0.05 * (1 - 0.05)/length(A))

such a value being 1.6859. That value should be compared with $z_{0.05}$. This quantity is obtained with the order

>> norminv(0.95, 0, 1)

which gives $z_{0.05} = 1.6449$.

Since $stat = 1.6859 \ge z_{0.05} = 1.6449$, the sample belongs to the critical region of the test. Therefore we should reject the null hypothesis, and so we conclude the opinion of the second workmate: the proportion of affected computers is greater than 5%.

Exercise 6.3. A keyboard factory asserts that the proportion of keyboards which last over 6000

hours is 90%. A number of 200 keyboards were analyzed, obtaining the following information:

the sample can be found in the le DatPrac6Ejer3 which can be loaded from Campus Virtual.

Matrix A contains the above sample. A value of 1 in the position i means the the keyboard

i worked after 6000 hours, while a value equal to 0 means that its life did not exceed 6000

hours.

Exercise 6.1. It is known that during the academic course, the server of a department registers a

20% of visits from outside of the University. During the summer period, it is observed that there were

167 requests, 32 from outside of the university.

l

e

i

g

the sample can be found in the le DatPrac6Ejer2 which can be loaded from Campus Virtual.

Matrix A contains the above sample.

H0 : p1 = p2 against H1 : p1 = p2 ,

6.3


Let p be the proportion of keyboards whose life is more than 6000 hours. We want to test

H0 : p = 0.9 against H1 : p ≠ 0.9

The value of the statistic is computed with

>> stat = (mean(A) - 0.9)/sqrt(0.9*(1 - 0.9)/length(A))

being equal to -3.2998.

To know if the sample belongs to the critical region, we need to obtain z0.025, a value which can be computed with the order

>> norminv(0.975, 0, 1)

thus z0.025 = 1.9600.

Since |stat| = 3.2998 > z0.025 = 1.9600, the sample belongs to the critical region, and so there is enough evidence to reject the null hypothesis.
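The same decision can be reached through a p-value. The sketch below, written with core functions only, computes the two-sided p-value P(|Z| >= |stat|) for the value of the statistic reported above; the variable names are illustrative.

```octave
% Two-sided test on a proportion: p-value from the value of the statistic.
stat  = -3.2998;   % value of the statistic reported in the text
alpha = 0.05;      % level of significance

% For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
pval = erfc(abs(stat) / sqrt(2));

% The sample is in the critical region exactly when pval < alpha,
% which is equivalent to |stat| > z_{alpha/2}.
reject = pval < alpha;

printf("p-value = %.6f, reject H0 = %d\n", pval, reject);
```

Reporting the p-value has the advantage that the reader can apply any level of significance without recomputing a quantile.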

Exercise 6.4. Each of two computer repair companies claims that the proportion of computers it repairs in fewer than 5 days is greater than that of the other company. To analyze this question, the following information was obtained:

the sample can be found in the file DatPrac6Ejer4 which can be loaded from Campus Virtual. Matrix A contains the information of the first company: a value of 1 in position i means that computer i was repaired in fewer than 5 days, while a value of 0 means that the repair took at least 5 days. Matrix B contains the same information for the second company.

What conclusions can be derived?

Let p1 and p2 be the proportions of computers which are repaired in fewer than 5 days by the first and the second company respectively.

In a first test we consider the hypotheses

H0 : p1 = p2 against H1 : p1 ≠ p2

For such a purpose we consider the order

>> prop_test_2(x1, n1, x2, n2)

where x1 is the number of computers repaired by the first company in fewer than 5 days, n1 is the number of computers repaired by the first company, and x2 and n2 have the same meanings for the second company.

Such values are calculated with the orders >> sum(A), >> length(A), >> sum(B) and >> length(B) respectively.

The order

>> pval = prop_test_2(sum(A), length(A), sum(B), length(B))

gives the p-value of the samples in the above test.

Since the p-value is pval = 0.69395, we do not reject the null hypothesis, that is, there is not enough evidence to say that the proportion of computers repaired in fewer than 5 days differs between the two companies.

Exercise 6.5. A processor company analyzes the quality of two subsidiary companies in the north of Spain, studying the speed of the processors manufactured in both places. In the first subsidiary company 125 processors are taken at random, and 150 in the second one. The speeds are the following:

the samples can be found in the file DatPrac6Ejer5 which can be loaded from Campus Virtual. Matrix A contains the information of the first subsidiary company, matrix B provides the information of the second one.

The company is interested in studying if the proportion of processors with a speed greater than 2.8 is the same in both subsidiary companies.

Let p1 and p2 be the proportions of processors with a speed greater than 2.8 in each of the above places respectively.

We want to test

H0 : p1 = p2 against H1 : p1 ≠ p2.

We transform the matrices A and B into the matrices A1 and B1, which have the value 1 in those positions where A and B respectively have a value greater than 2.8, and the value 0 otherwise. By means of such matrices we know which processors had a speed greater than 2.8 and which did not.

For such a purpose we consider the orders

>> A1 = (A > 2.8)

>> B1 = (B > 2.8)

and then the order

>> pval = prop_test_2(sum(A1), length(A1), sum(B1), length(B1))

which gives the p-value of the samples in the above test.

In this case pval = 3.9979e-013, thus we should reject the null hypothesis and conclude that both proportions are different.
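For reference, the statistic behind a two-sample test on proportions can also be computed by hand. The sketch below uses the usual pooled statistic and assumes (this is not confirmed in this text) that prop_test_2 is based on the same formula. The counts x1, n1, x2 and n2 are hypothetical and are not the data of the exercises.

```octave
% Two-sample test on proportions, H0 : p1 = p2 against H1 : p1 ~= p2 (sketch).
% The counts below are hypothetical, chosen only to illustrate the computation.
x1 = 60;  n1 = 125;   % successes and sample size in the first sample
x2 = 90;  n2 = 150;   % successes and sample size in the second sample

p1_hat = x1 / n1;                 % sample proportion of the first sample
p2_hat = x2 / n2;                 % sample proportion of the second sample
pooled = (x1 + x2) / (n1 + n2);   % common proportion estimated under H0

% Pooled statistic, approximately N(0, 1) under H0 for large samples.
stat = (p1_hat - p2_hat) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));

% Two-sided p-value P(|Z| >= |stat|), written with the core function erfc.
pval = erfc(abs(stat) / sqrt(2));

printf("stat = %.4f, p-value = %.4f\n", stat, pval);
```

For a one-sided alternative only the last step changes: the p-value becomes the probability of the corresponding single tail of the standard normal distribution.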


Chapter 7

Lab Practice 7.- Goodness of fit tests, a test for randomness, and the Kolmogorov-Smirnov test for two samples.

7.1

In this practice we will see Octave instructions for some goodness of fit tests, the Kolmogorov-Smirnov test for two samples and a test for randomness.

7.2 Basic orders for goodness of fit tests, a test for randomness, and the Kolmogorov-Smirnov test for two samples

The orders for the above aims are shown in this section.

The main orders in this context are the following:

1) A goodness of fit test for a normal distribution, the Lilliefors test: to test if a random sample is drawn from a normal distribution (without conditions on its parameters), we will use the order

>> kolmogorov_smirnov_test(x, "normal", mean(x), var(x, 0))

Vector x should contain the sample. The above order computes the p-value of the sample in the test whose null hypothesis is that the sample is drawn from a normal distribution.

2) Goodness of fit tests for a continuous distribution with given parameters: to test if a random sample is drawn from a completely specified continuous distribution, we will use the order

>> kolmogorov_smirnov_test(x, dist, params)

In the above order, vector x should contain the sample.

The argument dist determines the distribution from which we want to test whether the data are drawn. Such an argument can be any string dist such that the order dist_cdf or distcdf computes the distribution function of dist (see Lab Practice 3).

For instance, for the uniform distribution we will write "uniform" (quotation marks must be included), and for the normal distribution we will write "normal" (note that we do not write "norm"; however, some versions of Octave announce that in future versions it will be "norm" instead of "normal").

The argument params contains the parameters of the distribution.

The above order computes the p-value of the sample in the test whose null hypothesis is

H0 : the sample in x is drawn from the distribution dist with parameters params.

For instance,

>> kolmogorov_smirnov_test(x, "uniform", 2, 4)

calculates the p-value of the sample in x in the test whose null hypothesis is that the sample is drawn from a uniform distribution on the interval (2, 4).

The order

>> kolmogorov_smirnov_test(x, "normal", 0, 2)

obtains the p-value of the sample in x in the test whose null hypothesis is that the data are drawn from a normal distribution N(0, 2).

It is very important to remark that, in the case of the normal distribution, we should write in the order kolmogorov_smirnov_test the variance instead of the standard deviation.
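To make clear what the Kolmogorov-Smirnov test measures, the following sketch computes by hand the Kolmogorov-Smirnov distance, that is, the largest distance between the empirical distribution function of the sample and the hypothesized distribution function. It is an illustration with a small hypothetical sample and a uniform distribution on (0, 1); it does not reproduce the p-value computation of kolmogorov_smirnov_test.

```octave
% Kolmogorov-Smirnov distance between a sample and a given distribution (sketch).
x = [0.25, 0.50, 0.75];       % hypothetical sample
F = @(t) min(max(t, 0), 1);   % distribution function of the uniform on (0, 1)

n  = numel(x);
Fx = F(sort(x));   % hypothesized distribution function at the ordered sample

% The empirical distribution function jumps from (i-1)/n to i/n at the i-th
% ordered value, so the distance is checked on both sides of each jump.
d_plus  = max((1:n) / n - Fx);
d_minus = max(Fx - (0:n-1) / n);
D = max(d_plus, d_minus);

printf("D = %.4f\n", D);
```

A large distance D relative to the sample size corresponds to a small p-value, and therefore to evidence against the hypothesized distribution.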

3) A goodness of fit test for a discrete population: to our knowledge the chi-square goodness of fit test has not been implemented in Octave, so it is necessary to program such a test. For such a purpose we briefly recall this test.

Suppose that we have a random sample of size n from a random experiment. Each observation is classified in one and only one of k possible outcomes. Let A_1, A_2, ..., A_k be such outcomes. The null hypothesis is

$$H_0 : p_i = P(A_i), \quad 1 \le i \le k, \qquad \text{where } \sum_{i=1}^{k} p_i = 1.$$

Let n_i be the number of observations of A_i (observed frequency) in the sample. If the null hypothesis is true, the outcome A_i is expected to be observed n p_i times (expected frequency).

The statistic of the test is

$$\mathrm{Stat} = \sum_{i=1}^{k} \frac{(n_i - n p_i)^2}{n p_i},$$

and the critical region is

$$\mathrm{CR} = \left\{ \text{samples such that } \sum_{i=1}^{k} \frac{(n_i - n p_i)^2}{n p_i} \ge \chi^2_{k-1,\alpha} \right\}.$$

A way to program the test is the following. Let us suppose that x is the frequency vector, that is, x = (n_1, n_2, ..., n_k), and p is the vector of probabilities.

The value of the statistic and the p-value of the sample can be computed by means of the following program:

>> expected = sum(x)*p/sum(p);
>> stat = sum((x - expected).^2./expected);
>> pvalue = 1 - chi2cdf(stat, length(p) - 1)

In the variable stat we obtain the value of the statistic, and the variable pvalue stores the p-value of the sample.

4) A test for randomness: the tests we have considered until now assume that the sample(s) used for inference is (are) random. A numeric sequence is said to be statistically random when it contains no recognizable patterns or regularities.

We can check if a sample is random with the order

>> run_test(x)

That order computes the p-value of the sample in the test whose null hypothesis is

H0 : the sequence of numbers in vector x is random.

5) The Kolmogorov-Smirnov test for two samples: the Kolmogorov-Smirnov test for two samples is a test of whether two independent random samples have been drawn from the same continuous random variable or from random variables with the same continuous distribution.

Let us suppose that we have two random samples drawn from continuous independent random variables X and Y.

We want to study if X and Y have the same distribution, which will be denoted by X ∼ Y. Note that we do not specify the distribution.

For such a purpose we will use the order

>> kolmogorov_smirnov_test_2(x, y)

In such an order, the data of the two samples are collected in vectors x and y. The above order computes the p-value of the samples in the test.

7.3

Exercise 7.1. The information rates sent by two kinds of ports are analyzed. The observed rates, in bytes per second, are the following:

Type 1 port: 1199.3, 1200.2, 1200.9, 1198.3, 1200.5, 1200.3, 1200.5, 1199.5, 1200.8, 1200.0, 1200.4, 1199.7, 1199.7, 1199.7, 1198.7, 1199.4, 1200.5, 1199.8, 1199.1, 1199.4

Type 2 port: 1318.4, 1299.7, 1301.3, 1294.7, 1310.8, 1306.3, 1296.2, 1309.6, 1287.5, 1310.7, 1303.2, 1308.9, 1307.7, 1297.2, 1314.6

the samples can be found in the file DatPrac7Ejer1 which can be loaded from Campus Virtual. Matrix A contains data of the first port, matrix B data of the second port.

i) Could we consider that the data of the first port are drawn from a normal distribution with mean 1310 and standard deviation 12?

ii) Could we consider that the data of the second port are drawn from a normal distribution with mean 1310 and standard deviation 12?

iii) Could we consider that the data of the first port are drawn from a normal distribution?

iv) Could we consider that the data of the first port and the data of the second one are drawn from the same distribution?

We store the data in two matrices A and B

>> A = [1199.3; 1200.2; 1200.9; 1198.3; 1200.5; 1200.3; 1200.5; 1199.5; 1200.8; 1200.0; 1200.4; 1199.7;
1199.7; 1199.7; 1198.7; 1199.4; 1200.5; 1199.8; 1199.1; 1199.4]

>> B = [1318.4; 1299.7; 1301.3; 1294.7; 1310.8; 1306.3; 1296.2; 1309.6; 1287.5; 1310.7; 1303.2; 1308.9;
1307.7; 1297.2; 1314.6]
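Before running the tests it can help to look at the data. The sketch below, using the Type 1 port rates listed above, computes the sample mean and standard deviation and counts how many observations lie within three standard deviations of the hypothesized mean 1310 (with standard deviation 12). It is only a descriptive check under these stated values, not one of the tests of this practice; the names m, s, mu0, sigma0 and inside are illustrative.

```octave
% Descriptive look at the Type 1 port rates (data taken from the exercise).
A = [1199.3; 1200.2; 1200.9; 1198.3; 1200.5; 1200.3; 1200.5; 1199.5; ...
     1200.8; 1200.0; 1200.4; 1199.7; 1199.7; 1199.7; 1198.7; 1199.4; ...
     1200.5; 1199.8; 1199.1; 1199.4];

m = mean(A);   % sample mean
s = std(A);    % sample standard deviation (divides by n - 1)

% Number of observations within 3 standard deviations of the hypothesized mean.
mu0 = 1310;  sigma0 = 12;
inside = sum(abs(A - mu0) <= 3 * sigma0);

printf("mean = %.3f, sd = %.3f, observations within mu0 +/- 3 sigma0: %d\n", ...
       m, s, inside);
```

A sample whose values all lie far from the hypothesized mean, measured in hypothesized standard deviations, can be expected to give a very small p-value in the corresponding goodness of fit test.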


To answer i) we consider the order

>> p1 = kolmogorov_smirnov_test(A, "normal", 1310, 144)

Note that in the order kolmogorov_smirnov_test, when we consider the case of a completely specified normal distribution, we should introduce the variance instead of the standard deviation (144 instead of 12).

The p-value we obtain is almost equal to 0, therefore we have enough evidence to reject the hypothesis in question i).

In relation to ii), we consider the order

>> p2 = kolmogorov_smirnov_test(B, "normal", 1310, 144)

obtaining the p-value p2 = 0.062242. Therefore there is not enough evidence to reject that the data of the type 2 port are drawn from a normal distribution with mean 1310 and standard deviation 12.

In relation to iii), we want to test if the data of the type 1 port follow a normal distribution (without specifying parameters). For such a purpose we consider the order

>> p3 = kolmogorov_smirnov_test(A, "normal", mean(A), var(A, 0))

which gives a p-value equal to p3 = 0.990163. Therefore we should not reject the above hypothesis, and so we conclude that the information rates sent by the first port follow a normal distribution.

In iv) we want to test if both sets of data come from the same (continuous) distribution. Note that both information rates are independent. We will use the order

>> p4 = kolmogorov_smirnov_test_2(A, B)

which gives a p-value p4 = 7.1776e-008. Therefore we have enough evidence to reject such a hypothesis. Note that Octave warns that the p-value is approximate; however, it is so small that there are no doubts about the final conclusion.

Exercise 7.2. A company wants to evaluate a new procedure to send information through the network. For such a purpose it takes 218 files of the same size and measures the quality of the transmission. Such a quality is classified in six exclusive categories, namely, Excellent, Good, Normal, Regular, Deficient and Very Deficient. The result appears in the following table:

Quality        Excellent  Good  Normal  Regular  Deficient  Very Deficient
No. of files   24         36    52      30       44         32

i) Could we consider that for the new procedure to send information, the qualities Excellent, Good, Normal, Regular, Deficient and Very Deficient have probabilities of occurring 0.1, 0.15, 0.25, 0.15, 0.2 and 0.15 respectively?

ii) And probabilities 0.25, 0.1, 0.1, 0.3, 0.05 and 0.2 respectively?

In both cases we should consider a goodness of fit test for a discrete distribution. Therefore we will use the chi-square goodness of fit test.

The program we have designed to solve such a test is

>> expected = sum(x)*p/sum(p);
>> stat = sum((x - expected).^2./expected);
>> pvalue = 1 - chi2cdf(stat, length(p) - 1)

where x is the vector of observed frequencies and p is the vector of probabilities or values proportional to such probabilities.

In our case

>> x = [24, 36, 52, 30, 44, 32]

which are the observed frequencies of each of the above qualities, and

>> p = [0.1, 0.15, 0.25, 0.15, 0.2, 0.15]

The p-value we obtain is pvalue = 0.96940. Therefore we do not reject the tested hypothesis.

The second question is solved in the same way. Now the vector of probabilities is

>> p = [0.25, 0.1, 0.1, 0.3, 0.05, 0.2]

In this case the p-value returned is 0. Thus we have enough evidence to reject the proposed model of probabilities or proportions.

Exercise 7.3. Some processors of two makes A and B were taken at random, and the following processor speeds were found:

Make A: 2.7, 2.65, 2.83, 2.95, 2.64, 2.45, 3.01, 2.56, 2.76, 2.99, 2.76, 2.87, 3.05, 2.65, 3.09, 3.05, 2.99, 2.67, 3.02

Make B: 2.87, 2.76, 2.78, 2.67, 2.97, 2.99, 2.87, 2.93, 2.78, 2.98, 3.01, 2.84, 2.88, 2.79, 2.89, 2.91, 2.87, 2.88, 2.92

the samples can be found in the file DatPrac7Ejer3 which can be loaded from Campus Virtual. Matrix A contains data of the first manufacturer, matrix B data of the second one.

Could we conclude that the speeds of both processors are the same?

We test if both samples follow the same distribution. Note that both speeds are continuous and independent.

We store data in two matrices A and B

>> A = [2.7; 2.65; 2.83; 2.95; 2.64; 2.45; 3.01; 2.56; 2.76; 2.99; 2.76; 2.87; 3.05; 2.65; 3.09; 3.05;
2.99; 2.67; 3.02]

>> B = [2.87; 2.76; 2.78; 2.67; 2.97; 2.99; 2.87; 2.93; 2.78; 2.98; 3.01; 2.84; 2.88; 2.79; 2.89; 2.91;
2.87; 2.88; 2.92]

and we consider the order

>> pval = kolmogorov_smirnov_test_2(A, B)

which assigns to the variable pval the p-value of the samples in the above test.

In this case we obtain that pval = 0.15164. Therefore there is not enough evidence to reject the above hypothesis, that is, we do not reject that both speeds follow the same distribution.

Exercise 7.4. Could we consider that the speeds follow a normal distribution in the case of make A?

We consider the order

>> p1 = kolmogorov_smirnov_test(A, "normal", mean(A), var(A, 0))

which gives the p-value of the sample in the test whose null hypothesis is that processor speeds follow a normal distribution in the case of make A.

In this case the p-value is p1 = 2.6284e-005. Since this p-value is far below the usual level of significance 0.05, there is enough evidence to reject such a hypothesis.

Exercise 7.5. Given the samples of Exercise 7.3, could we consider that they are random?

We should apply a randomness test to the samples which are in matrices A and B. We consider the orders

>> p1 = run_test(A)

>> p2 = run_test(B)

The p-values of the samples in A and B are p1 = 0.17826 and p2 = 0.53829 respectively. Therefore we should not reject the randomness of either of the above samples.
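In some Octave installations chi2cdf is not available without an additional package. Under that assumption, the chi-square program of Exercise 7.2 can be sketched with the core function gammainc, using the fact that the distribution function of a chi-square variable with d degrees of freedom at the point s equals the regularized incomplete gamma function gammainc(s/2, d/2). The vectors x and p are those of question i) of Exercise 7.2; the name df is introduced here for illustration.

```octave
% Chi-square goodness of fit test with core functions only (sketch).
x = [24, 36, 52, 30, 44, 32];            % observed frequencies (Exercise 7.2)
p = [0.1, 0.15, 0.25, 0.15, 0.2, 0.15];  % probabilities under H0, question i)

expected = sum(x) * p / sum(p);              % expected frequencies n * p_i
stat = sum((x - expected).^2 ./ expected);   % value of the statistic
df   = length(p) - 1;                        % degrees of freedom k - 1

% P(chi2_df > stat) = 1 - gammainc(stat / 2, df / 2).
pvalue = 1 - gammainc(stat / 2, df / 2);

printf("stat = %.4f, p-value = %.4f\n", stat, pvalue);
```

Dividing by sum(p) lets p hold either probabilities or values proportional to them, as in the program of the exercise.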

Exercise 7.6. In Lab Practice 1 we saw that the order >> rand(n, m) generates a matrix of size n × m with random numbers of the interval (0, 1). How could we check that such an order works properly?

We try to find out if the order rand generates random numbers of the interval (0, 1). We could affirm that the order works properly if the generated numbers are random and follow a uniform distribution on the interval (0, 1).

We can generate numbers with the order rand and check if they are random and follow a uniform distribution on the interval (0, 1).

We repeat this procedure a large number of times, instead of only once, to prevent an unusual sample from leading us to a wrong conclusion.

In each repetition we compute a p-value for the randomness of the sample and a p-value for the uniform distribution on the interval (0, 1).

Once we have both sets of p-values, we obtain a summary p-value of each of them to reach the final conclusion. Normally the summary p-value is the mean or median of all the p-values.

For instance, we could consider

>> A = rand(500, 1000);
>> for i = 1 : 500
>> pval(i) = run_test(transpose(A(i, :)));
>> pvalor(i) = kolmogorov_smirnov_test(A(i, :), "uniform", 0, 1);
>> endfor
>> w1 = mean(pval)
>> w2 = median(pval)
>> v1 = mean(pvalor)
>> v2 = median(pvalor)

In the first line we generate 500 samples (500 rows) of size 1000 by means of the order rand.

For each of the above 500 samples (each row) we obtain the p-value of the test whose null hypothesis is

H0 : the sample of the row i is random

(third line of the program), and also the p-value of the test whose null hypothesis is

H0 : row i is drawn from a uniform distribution on the interval (0, 1)

Such p-values are stored in the vectors pval and pvalor respectively.

In the last lines of the program we obtain the mean (w1, v1) and median (w2, v2) of both sets of p-values.

In our simulations we obtain values of w1, v1, w2 and v2 close to 0.5. Therefore we conclude that the order rand generates numbers on the interval (0, 1) which are random and follow the uniform distribution on the interval (0, 1).

Hence we should conclude that the order rand works properly.
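A simple complementary check, which is not part of the procedure above, compares the sample mean and variance of a large number of values generated by rand with the theoretical values 1/2 and 1/12 of the uniform distribution on (0, 1). The sample size N and the tolerance tol are arbitrary choices made for this sketch.

```octave
% Moment check for rand: a uniform variable on (0, 1) has mean 1/2 and variance 1/12.
N = 100000;        % number of generated values (arbitrary choice)
u = rand(N, 1);

m = mean(u);       % should be close to 0.5
v = var(u);        % should be close to 1/12

% Tolerance chosen generously; random fluctuations are of order 1/sqrt(N).
tol = 0.01;
ok = (abs(m - 0.5) < tol) && (abs(v - 1/12) < tol) && all(u > 0 & u < 1);

printf("mean = %.4f, variance = %.4f, check passed = %d\n", m, v, ok);
```

Agreement of the first two moments does not prove uniformity or randomness; it only rules out gross failures, which is why the tests described above remain the main tool.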

