
# STATISTICS SUMMARY #1

Descriptive Statistics
DATA
- Maximum and Minimum (Max and Min values)
- Range (difference between maximum and minimum)
- Mode (value that occurs most frequently, only exists when there is 1 such value)
- Median (middle value when the data are ordered by size; take the average of the two middle values if the number of values is even)
- Mean (average value of the data)
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

- Standard deviation (deviation from mean)
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

When the DATA considered are not a sample but the entire population (with N elements), then:
- Mean
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

- Standard deviation (deviation from mean)
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$
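The quantities listed above can be sketched in a few lines of Python. This is only an illustration: the function name `describe` and the example data are assumptions, not part of the lecture material.

```python
import math
import statistics


def describe(data):
    """Return the descriptive statistics listed above for a list of numbers."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    # multimode returns every most-frequent value; the mode only exists
    # when there is exactly one such value.
    modes = statistics.multimode(data)
    return {
        "min": min(data),
        "max": max(data),
        "range": max(data) - min(data),
        "mode": modes[0] if len(modes) == 1 else None,
        "median": statistics.median(data),
        "mean": mean,
        # Sample standard deviation S: divide by n - 1.
        "S": math.sqrt(squared_deviations / (n - 1)),
        # Population standard deviation sigma: divide by N.
        "sigma": math.sqrt(squared_deviations / n),
    }


stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
```

Note that `S` and `sigma` differ only in the divisor, so `S` is always slightly larger for the same data.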

Criterion for Rejecting Questionable Data Points
A. Modified Thompson technique
1 Calculate the mean $\bar{x}$ and the standard deviation $S$
2 Determine the deviation $\delta_i$ of each data point, $\delta_i = |x_i - \bar{x}|$
3 Select the maximum deviation $\delta_i$ and compare it to a threshold: reject data point $x_i$ if $\delta_i > \tau S$
4 Calculate the mean and standard deviation of the remaining data after removing the rejected data point
5 Go to step 2 and repeat the rejection process until no more data points can be eliminated
Determining the threshold $\tau S$:
- Use Thompson's table for sample size $n$ to look up $\tau$, then calculate $\tau S$.
- Reject the data point if $\delta_i > \tau S$.
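The rejection loop can be sketched as follows. The function names are hypothetical, and the small table of $\tau$ values (5% significance level, n = 3 to 10) is an assumption reproduced from commonly tabulated values; check it against the table used in the course before relying on it.

```python
import math

# Thompson's tau for a 5% significance level (assumed values, n = 3..10).
TAU_TABLE = {3: 1.1511, 4: 1.4250, 5: 1.5712, 6: 1.6563,
             7: 1.7110, 8: 1.7491, 9: 1.7770, 10: 1.7984}


def mean_and_std(data):
    """Sample mean and sample standard deviation (n - 1 divisor)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return mean, s


def modified_thompson(data, tau_table=TAU_TABLE):
    """Repeatedly reject the most deviating point while delta_i > tau * S."""
    kept = list(data)
    rejected = []
    # Stop as soon as the sample size is outside the supplied table.
    while len(kept) in tau_table:
        mean, s = mean_and_std(kept)
        # Steps 2-3: the suspect is the point with the largest deviation.
        suspect = max(kept, key=lambda x: abs(x - mean))
        if abs(suspect - mean) <= tau_table[len(kept)] * s:
            break  # no more points can be eliminated
        kept.remove(suspect)
        rejected.append(suspect)
    return kept, rejected


kept, rejected = modified_thompson([10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 15.0])
```

Only one point is tested per pass, because removing it changes both the mean and $S$ used for the next comparison.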

B. Two-sigma or three-sigma rejection criteria
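1. Does the data obey a normal distribution?
2. Calculate the mean $\bar{x}$ and the standard deviation $S$, and use them as estimates of $\mu$ and $\sigma$
3. Reject $x_i$ if $|x_i - \mu| > 2\sigma$ (or $3\sigma$ if so required)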
Dealing with large data sets:
- Organise the data by groups (classification)
- A tabular arrangement of data by classes together with the corresponding class frequencies is called the frequency distribution
- The frequency of a particular class (or category) is the number of times the class appears in the data set
- Relative frequency of a class = frequency / total number of observations in the data set
Calculating the mean and standard deviation of grouped data
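Mean:
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{K} f_j m_j$$
and standard deviation:
$$S = \sqrt{\frac{1}{n-1}\sum_{j=1}^{K} f_j (m_j - \bar{x})^2},$$
where $f_j$ is the frequency and $m_j$ is the mean value of the $j$-th class, $K$ is the total number of classes and $n = \sum_{j=1}^{K} f_j$ is the total number of observations.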
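A minimal sketch of the grouped-data formulas, assuming the class frequencies and class mean values are supplied as two equally long lists (the function name and the example numbers are illustrative only):

```python
import math


def grouped_mean_std(frequencies, class_means):
    """Mean and standard deviation from a frequency distribution."""
    n = sum(frequencies)  # total number of observations
    mean = sum(f * m for f, m in zip(frequencies, class_means)) / n
    variance = sum(f * (m - mean) ** 2
                   for f, m in zip(frequencies, class_means)) / (n - 1)
    return mean, math.sqrt(variance)


freqs = [2, 5, 3]            # f_j
mids = [10.0, 20.0, 30.0]    # m_j
relative = [f / sum(freqs) for f in freqs]  # relative frequency per class
x_bar, s = grouped_mean_std(freqs, mids)
```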
Mean Interval Estimation
Aim: Find an approximation of the population mean $\mu$ by determining the sample mean $\bar{x}$. The sample mean $\bar{x}$ is not necessarily identical to $\mu$, so a confidence interval $\bar{x} - \delta \le \mu \le \bar{x} + \delta$ is given, where $\delta$ is the uncertainty.
The confidence level is the probability that the population mean will fall within the specified interval; it is given by $P(\bar{x} - \delta \le \mu \le \bar{x} + \delta)$.
The confidence level is usually expressed in terms of the level of significance $\alpha$, so that
$$P(\bar{x} - \delta \le \mu \le \bar{x} + \delta) = 1 - \alpha$$
Note that the distribution of the sample mean approaches the normal distribution if the sample size is sufficiently large:
$$\bar{x} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
If the distribution of the total population is not normal then for n < 30 the sample mean only approximately follows a normal distribution. In such a case we have to use the Student t distribution to find $\delta$.
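A. If the distribution of the total population is normal or $n \ge 30$ then assume $\bar{x} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ and
$$P(\bar{x} - \delta \le \mu \le \bar{x} + \delta) = P(\mu - \delta \le \bar{x} \le \mu + \delta) = P(-z_c \le Z \le z_c) = 1 - \alpha$$
or
$$P(Z \le z_c) = 1 - \frac{\alpha}{2},$$
where
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0,1),$$
which is the standard normal distribution. Note that
$$\delta = z_c \frac{\sigma}{\sqrt{n}} \approx z_c \frac{S}{\sqrt{n}}$$
In Excel: use NORMINV to obtain the value of $z_c$ = NORMINV(1-α/2,0,1)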
B. If $n < 30$: find a value $t_c$ so that the probability $P(-t_c \le t \le t_c) = 1 - \alpha$, where
$$\delta = t_c \frac{S}{\sqrt{n}}$$
and $t_c$ can be found from Student's t distribution, $P(|t| > t_c) = \alpha$, where the number of degrees of freedom, $\nu$, is given by $\nu = n - 1$.
In Excel: use $t_c$ = TINV(α,ν).
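Outside Excel, the small-sample interval can be sketched with the Python standard library alone. The standard library has no Student t quantile, so the sketch below approximates the two-tailed critical value (the role TINV plays above) by numerically integrating the t density and bisecting; all function names are hypothetical, and a statistics package would normally be used instead.

```python
import math
from statistics import mean, stdev


def t_pdf(t, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))


def t_cdf(t, nu, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 + total * h / 3


def t_critical(alpha, nu):
    """Two-tailed critical value t_c with P(|t| > t_c) = alpha."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < target:   # expand until the root is bracketed
        lo, hi = hi, 2 * hi
    for _ in range(60):             # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_confidence_interval(sample, alpha=0.05):
    """Interval x_bar - delta <= mu <= x_bar + delta with delta = t_c S / sqrt(n)."""
    n = len(sample)
    delta = t_critical(alpha, n - 1) * stdev(sample) / math.sqrt(n)
    return mean(sample) - delta, mean(sample) + delta


low, high = t_confidence_interval(
    [9.8, 10.2, 10.4, 9.9, 10.1, 10.0, 9.7, 10.3, 10.1, 9.5])
```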
====== END of first lecture

# STATISTICS SUMMARY #2
Hypothesis Testing
Example 1
Null hypothesis $H_0$: $\mu = \mu_0$
Alternative hypothesis $H_1$: $\mu \ne \mu_0$
- Given a sample with n elements, determine the mean $\bar{x}$ and the standard deviation $S$.
- Given a level of significance of $\alpha$,
o if $n > 30$ determine $z_c$ = NORMINV(1-α/2,0,1) and then calculate $\delta = z_c \frac{S}{\sqrt{n}}$.
o if $n \le 30$ determine $t_c$ = TINV(α,n-1) and then calculate $\delta = t_c \frac{S}{\sqrt{n}}$.
- Then conclude $P(\bar{x} - \delta \le \mu \le \bar{x} + \delta) = 1 - \alpha$ and accept $H_0$ if $\bar{x} - \delta \le \mu_0 \le \bar{x} + \delta$, otherwise reject $H_0$.
Note: if the standard deviation of the entire population, $\sigma$, is known, use it to replace $S$ in the calculation of $\delta$.
Example 2
Null hypothesis $H_0$: $\mu \ge \mu_0$
Alternative hypothesis $H_1$: $\mu < \mu_0$
- Given a sample with n elements, determine the mean $\bar{x}$ and the standard deviation $S$.
- Given a level of significance of $\alpha$,
o if $n > 30$ determine $z_c$ = NORMINV(1-α,0,1) and then calculate $\delta = z_c \frac{S}{\sqrt{n}}$.
o if $n \le 30$ determine $t_c$ = TINV(2α,n-1) and then calculate $\delta = t_c \frac{S}{\sqrt{n}}$.
- Then conclude $P(\mu \le \bar{x} + \delta) = 1 - \alpha$ and accept $H_0$ if $\mu_0 \le \bar{x} + \delta$, otherwise reject $H_0$.
Note: if the standard deviation of the entire population, $\sigma$, is known, use it to replace $S$ in the calculation of $\delta$.
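For the large-sample case ($n > 30$), the two-sided test of Example 1 and the one-sided test of Example 2 can be sketched as below. The function name, the `alternative` argument and the sample are assumptions made for illustration; the one-sided branch assumes the acceptance rule $\mu_0 \le \bar{x} + \delta$.

```python
import math
from statistics import NormalDist, mean, stdev


def z_test_mean(sample, mu_0, alpha=0.05, alternative="two-sided"):
    """Return True if H0 is accepted, False if it is rejected."""
    n = len(sample)
    x_bar, s = mean(sample), stdev(sample)
    if alternative == "two-sided":
        # H0: mu = mu_0 against H1: mu != mu_0
        z_c = NormalDist().inv_cdf(1 - alpha / 2)  # NORMINV(1-alpha/2,0,1)
        delta = z_c * s / math.sqrt(n)
        return x_bar - delta <= mu_0 <= x_bar + delta
    if alternative == "less":
        # H0: mu >= mu_0 against H1: mu < mu_0
        z_c = NormalDist().inv_cdf(1 - alpha)      # NORMINV(1-alpha,0,1)
        delta = z_c * s / math.sqrt(n)
        return mu_0 <= x_bar + delta
    raise ValueError("alternative must be 'two-sided' or 'less'")


# 35 observations: the values 47..53, each occurring five times (mean 50).
sample = [47 + i % 7 for i in range(35)]
accept_two_sided = z_test_mean(sample, 50.6)
accept_one_sided = z_test_mean(sample, 50.6, alternative="less")
```

With the same data and the same $\alpha$, the one-sided test uses a smaller $z_c$ and can therefore reject a value of $\mu_0$ that the two-sided test still accepts.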
Test for differences in mean
Given two samples, sample 1 and sample 2, consisting of $n_1$ and $n_2$ elements, respectively. The mean and standard deviation of sample 1 are $\bar{x}_1$ and $S_1$, and those for the entire population 1 are $\mu_1$ and $\sigma_1$. The mean and standard deviation of sample 2 are $\bar{x}_2$ and $S_2$, and those for the entire population 2 are $\mu_2$ and $\sigma_2$.
We want to test whether the means of the two populations are the same at a level of significance of $\alpha$.
- Null hypothesis $H_0$: $\mu_1 = \mu_2$
- Alternative hypothesis $H_1$: $\mu_1 \ne \mu_2$
Calculate
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1),$$
accept $H_0$ if $|z| \le z_c$, where $z_c$ = NORMINV(1-α/2,0,1)
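If both populations have the same UNKNOWN standard deviation, then approximate the common standard deviation $S$ by calculating
$$S^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$
(each sample variance is weighted by its degrees of freedom, since $S_1$ and $S_2$ are computed with the $n-1$ divisor) and then
- if the samples are large calculate
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S^2}{n_1} + \dfrac{S^2}{n_2}}} \sim N(0,1),$$
accept $H_0$ if $|z| \le z_c$, where $z_c$ = NORMINV(1-α/2,0,1)
- if the samples are small calculate
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S^2}{n_1} + \dfrac{S^2}{n_2}}},$$
which follows a t distribution; accept $H_0$ if $|t| \le t_c$, where $t_c$ = TINV(α,$n_1+n_2-2$). NOTE: the number of degrees of freedom here is $\nu = n_1 + n_2 - 2$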
If both populations have UNKNOWN standard deviations (not necessarily the same) and the sample sizes are small, then calculate
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}},$$
which follows a t-distribution, where the number of degrees of freedom is approximated by
$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{S_2^2}{n_2}\right)^2}$$
rounded down to the nearest integer; accept $H_0$ if $|t| \le t_c$.
In Excel: $t_c$ = TINV(α,ν)
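The unequal-variance case can be sketched as follows: the function computes the test statistic $t$ and the approximated degrees of freedom $\nu$ (rounded down), after which $t_c$ would be looked up, for example with TINV as above. The function name and the two small samples are made up for illustration.

```python
import math
from statistics import mean, variance


def unequal_variance_t(sample_1, sample_2):
    """Return (t, nu) for two samples with unknown, possibly different sigmas."""
    n1, n2 = len(sample_1), len(sample_2)
    v1 = variance(sample_1) / n1   # S_1^2 / n_1 (sample variance, n - 1 divisor)
    v2 = variance(sample_2) / n2   # S_2^2 / n_2
    t = (mean(sample_1) - mean(sample_2)) / math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, math.floor(nu)       # degrees of freedom rounded down


t_stat, nu = unequal_variance_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10, 12])
```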
Curve fitting and the method of least squares
$\hat{y}_i = a x_i + b$: how to determine $a$ and $b$ from $n$ pairs of data $(x_i, y_i)$.
If $\hat{y}_i = a x_i + b$ then the error in the approximation at point $x_i$ is given by $e_i = y_i - a x_i - b$.
Minimize the sum of the squared errors
$$J(a,b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a x_i - b)^2$$
by differentiating it first with respect to $a$ and second with respect to $b$. The minimum is determined by equating the two derivatives to zero, giving
$$a = \frac{\left(\sum x_i\right)\left(\sum y_i\right) - n \sum x_i y_i}{\left(\sum x_i\right)^2 - n \sum x_i^2} = \frac{S_{xy}}{S_{xx}}$$
and
$$b = \frac{\left(\sum x_i\right)\left(\sum x_i y_i\right) - \left(\sum x_i^2\right)\left(\sum y_i\right)}{\left(\sum x_i\right)^2 - n \sum x_i^2} = \bar{y} - a\bar{x},$$
where
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)$$
and
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$$
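Quality of fit:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a x_i - b)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}},$$
where
$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$$
The standard error of estimate is given by
$$S = \sqrt{\frac{SSE}{n - 2}}$$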
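The straight-line fit and its quality measures can be sketched directly from the sums $S_{xy}$, $S_{xx}$ and $S_{yy}$. The function name and the four data pairs are illustrative assumptions.

```python
import math


def linear_least_squares(xs, ys):
    """Fit y = a x + b; return (a, b, SSE, standard error of estimate)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    a = s_xy / s_xx
    b = sum_y / n - a * sum_x / n          # b = y_bar - a * x_bar
    # Clamp at zero: rounding can make SSE slightly negative for a perfect fit.
    sse = max(s_yy - s_xy ** 2 / s_xx, 0.0)
    std_error = math.sqrt(sse / (n - 2))
    return a, b, sse, std_error


a, b, sse, std_error = linear_least_squares([0, 1, 2, 3], [1, 3, 4, 7])
```

For these four points the sums are $S_{xy} = 9.5$, $S_{xx} = 5$ and $S_{yy} = 18.75$, so the slope is 1.9 and the intercept 0.9.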

====== END of second lecture
# STATISTICS SUMMARY #3
Least Squares Fitting
Example 1
Nonlinear functional relationship
$$y(x) = c\,e^{dx} \;\Rightarrow\; \ln y = dx + \ln c$$
then define $y_1 = \ln y$, $a = d$, $b = \ln c$, so that
$$\ln y = dx + \ln c \;\Leftrightarrow\; y_1 = ax + b$$
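A short sketch of this linearisation: the data are transformed with the logarithm, a straight line is fitted to $(x, \ln y)$, and $c$ and $d$ are recovered from $b$ and $a$. The function name and the noise-free test data (generated with $c = 2$, $d = 0.5$) are assumptions for illustration.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = c * exp(d * x) via a straight-line fit to (x, ln y)."""
    y1 = [math.log(y) for y in ys]          # y_1 = ln y, requires y > 0
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(y1) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, y1))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx                          # a = d
    b = y_bar - a * x_bar                    # b = ln c
    return math.exp(b), a                    # (c, d)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
c, d = fit_exponential(xs, ys)
```

Keep in mind that this minimises the squared errors in $\ln y$ rather than in $y$ itself, so with noisy data the result can differ from a direct nonlinear fit.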
Example 2
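$$p = f(u) = \frac{k}{u}\left(\frac{u}{C}\right)^{k} \exp\left\{-\left(\frac{u}{C}\right)^{k}\right\},$$
integrating this gives:
$$P = F(u) = 1 - \exp\left\{-\left(\frac{u}{C}\right)^{k}\right\} \;\Rightarrow\; 1 - P = \exp\left\{-\left(\frac{u}{C}\right)^{k}\right\}$$
$$\ln(1 - P) = -\left(\frac{u}{C}\right)^{k} \;\Rightarrow\; \underbrace{\ln\left(-\ln(1 - P)\right)}_{y} = \underbrace{k}_{a}\,\underbrace{\ln u}_{x} \; \underbrace{-\,k \ln C}_{b}$$
where $P = \int_0^u p\,du$, which for tabulated data becomes
$$P_j = \sum_{i=1}^{j} p_i$$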
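The same straight-line machinery can recover $k$ and $C$ from pairs $(u, P)$. The sketch below assumes cumulative probabilities strictly between 0 and 1; the function name and the synthetic data (generated with $k = 2$, $C = 5$) are illustrative only.

```python
import math


def fit_cumulative(us, Ps):
    """Fit P = 1 - exp(-(u/C)^k) using y = ln(-ln(1-P)) and x = ln u."""
    xs = [math.log(u) for u in us]                    # x = ln u
    ys = [math.log(-math.log(1.0 - P)) for P in Ps]   # y = ln(-ln(1 - P))
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    k = a                       # a = k
    C = math.exp(-b / k)        # b = -k ln C
    return k, C


us = [2.0, 4.0, 6.0, 8.0]
Ps = [1.0 - math.exp(-(u / 5.0) ** 2) for u in us]
k, C = fit_cumulative(us, Ps)
```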

Example 3
Hyperbola:
$$y = \frac{1}{ax + b} \;\Rightarrow\; \frac{1}{y} = ax + b$$
Rejecting Questionable Data Points: Cook's distance
Cook's distance measures the effect of deleting a given observation; the larger the distance, the more suspect the data point is:
$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot MSE},$$
where
- $\hat{y}_j$: prediction from the full regression model for the j-th observation
- $\hat{y}_{j(i)}$: prediction of the j-th observation from the refitted regression model without observation i
- $p$: number of fitted parameters (e.g. p=2 for y=ax+b)
- $MSE = \dfrac{1}{n - p}\sum_{j=1}^{n} e_j^2 = \dfrac{1}{n - p}\sum_{j=1}^{n} (y_j - \hat{y}_j)^2$ is the mean square error of the regression model
Equivalent expression for Cook's distance:
$$D_i = \frac{e_i^2}{p \cdot MSE}\,\frac{h_{ii}}{(1 - h_{ii})^2},$$
where $h_{ii}$ is the i-th diagonal element of $H$ defined by
$$H = X (X^T X)^{-1} X^T$$
Reject data point $y_i$ when $D_i > 1$ or $D_i > \dfrac{4}{n}$
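Both expressions for Cook's distance can be sketched for the straight-line model $y = ax + b$ ($p = 2$). The first function refits the line with each observation left out; the second uses the leverage $h_{ii}$, which for a straight line equals $1/n + (x_i - \bar{x})^2 / S_{xx}$. Function names and data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


def cooks_by_refitting(xs, ys, p=2):
    """D_i from the definition: refit without observation i."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    y_hat = [a * x + b for x in xs]
    mse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat)) / (n - p)
    distances = []
    for i in range(n):
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = sum((yh - (a_i * x + b_i)) ** 2 for x, yh in zip(xs, y_hat))
        distances.append(shift / (p * mse))
    return distances


def cooks_by_leverage(xs, ys, p=2):
    """D_i from the equivalent expression using h_ii."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for x, e in zip(xs, residuals):
        h = 1 / n + (x - x_bar) ** 2 / s_xx   # diagonal element of H
        distances.append(e * e / (p * mse) * h / (1 - h) ** 2)
    return distances


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]   # the last point is the suspect one
d_refit = cooks_by_refitting(xs, ys)
d_lev = cooks_by_leverage(xs, ys)
suspects = [i for i, d in enumerate(d_lev) if d > 1]
```

The leverage form avoids the n refits, which matters for larger data sets.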
Multi-variable curve-fitting example
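1) $\hat{y} = ax + b = \begin{bmatrix} x & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} x & 1 \end{bmatrix} \theta$, with $\theta = \begin{bmatrix} a & b \end{bmatrix}^T$
Given the data pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the prediction is $\hat{y}_i = a x_i + b$. Write this in matrix format:
$$\hat{y}_i = a x_i + b = \begin{bmatrix} x_i & 1 \end{bmatrix} \theta \quad \text{with error} \quad e_i = y_i - \hat{y}_i = y_i - \begin{bmatrix} x_i & 1 \end{bmatrix} \theta$$
So that
$$E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \theta = Y - X\theta$$
and the sum of the squared errors becomes
$$J = \sum_{i=1}^{n} e_i^2 = E^T E = (Y - X\theta)^T (Y - X\theta)$$
To minimize the sum of the squared errors, J, we need to solve $\dfrac{dJ}{d\theta} = 0$:
$$\frac{dJ}{d\theta} = -2 X^T (Y - X\theta) = 0 \;\Rightarrow\; X^T Y - X^T X \theta = 0 \;\Rightarrow\; X^T X \theta = X^T Y$$
So that
$$\theta = (X^T X)^{-1} X^T Y$$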
Advantages: 1) Easy computation in Matlab
2) Easy generalisation to more complicated curve-fitting problems
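2) $\hat{y}_i = a x_i^2 + b x_i + c = \begin{bmatrix} x_i^2 & x_i & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix}$
So that
$$X = \begin{bmatrix} x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \\ \vdots & \vdots & \vdots \\ x_n^2 & x_n & 1 \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \theta = \begin{bmatrix} a \\ b \\ c \end{bmatrix}$$
and again the solution is determined from
$$\theta = (X^T X)^{-1} X^T Y$$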
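The notes mention Matlab for this computation; as a language-neutral illustration, here is a sketch in plain Python that builds $X^T X$ and $X^T Y$ and solves $X^T X \theta = X^T Y$ by Gaussian elimination instead of forming the inverse explicitly. The function name and the quadratic test data are assumptions.

```python
def solve_normal_equations(X, Y):
    """Return theta solving (X^T X) theta = X^T Y.

    X is a list of rows (one row per observation), Y a list of observations.
    """
    m = len(X[0])
    # A = X^T X (m x m) and rhs = X^T Y (length m).
    A = [[sum(row[r] * row[c] for row in X) for c in range(m)]
         for r in range(m)]
    rhs = [sum(row[r] * y for row, y in zip(X, Y)) for r in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    theta = [0.0] * m
    for r in range(m - 1, -1, -1):
        known = sum(A[r][c] * theta[c] for c in range(r + 1, m))
        theta[r] = (rhs[r] - known) / A[r][r]
    return theta


# Quadratic example: each row of X is [x_i^2, x_i, 1].
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2 * x * x - 3 * x + 1 for x in xs]
X = [[x * x, x, 1.0] for x in xs]
a, b, c = solve_normal_equations(X, ys)
```

Changing the model only changes how the rows of `X` are built; for the straight line each row would be `[x, 1.0]`.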
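3) $\hat{y}_i = a_1 x_i^2 z_i + a_2 \sin(x_i) z_i + a_3 \cos(z_i) + a_4 e^{x_i} z_i = \begin{bmatrix} x_i^2 z_i & \sin(x_i) z_i & \cos(z_i) & e^{x_i} z_i \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}$
so that
$$X = \begin{bmatrix} x_1^2 z_1 & \sin(x_1) z_1 & \cos(z_1) & e^{x_1} z_1 \\ x_2^2 z_2 & \sin(x_2) z_2 & \cos(z_2) & e^{x_2} z_2 \\ \vdots & \vdots & \vdots & \vdots \\ x_n^2 z_n & \sin(x_n) z_n & \cos(z_n) & e^{x_n} z_n \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \theta = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}$$
and also here the solution is determined from
$$\theta = (X^T X)^{-1} X^T Y$$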

THE END