Statistics Summary

STATISTICS SUMMARY #1
Descriptive Statistics
DATA
- Maximum and Minimum (Max and Min values)
- Range (difference between maximum and minimum)
- Mode (value that occurs most frequently, only exists when there is 1 such value)
- Median (middle value when the data are ordered by size; take the average of the 2 middle ones if needed)
- Mean (average value of the data)
  $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

- Standard deviation (deviation from mean)
  $$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

When DATA considered is not a sample but the entire population (with N elements) then:
- Mean
  $$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

- Standard deviation (deviation from mean)
  $$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$$
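A minimal sketch of the descriptive statistics above, using only Python's standard `statistics` module. The data values are made up for illustration; `stdev` divides by n - 1 (sample) and `pstdev` divides by N (population).

```python
import statistics

# Illustrative data set (hypothetical values)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

summary = {
    "min": min(data),
    "max": max(data),
    "range": max(data) - min(data),
    "mode": statistics.mode(data),      # most frequent value
    "median": statistics.median(data),  # average of the 2 middle values here
    "mean": statistics.mean(data),
    "S": statistics.stdev(data),        # sample formula, divides by n - 1
    "sigma": statistics.pstdev(data),   # population formula, divides by N
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.4f}")
```

Note that S is always slightly larger than sigma for the same data, because of the n - 1 divisor.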

Criterion for Rejecting Questionable Data Points
A. Modified Thompson technique
1 Calculate mean $\bar{x}$ and standard deviation S
2 Determine deviation $\delta_i$ of data points, $\delta_i = |x_i - \bar{x}|$
3 Select maximum deviation $\delta_i$ and compare to a threshold; reject data point $x_i$ if $\delta_i > \tau S$
4 Calculate mean and standard deviation for the remaining data after removing the rejected data point
5 Go to step 2 and repeat the rejection process until no more data points can be eliminated
Determining the threshold $\tau S$:
- Use Thompson's table for sample size n to look up $\tau$, then calculate $\tau S$.
- Reject data point if $\delta_i > \tau S$.
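The steps above can be sketched as a loop. This is an illustration only: the function name `modified_thompson` and the data are hypothetical, and `TAU_TABLE` holds commonly tabulated $\tau$ values for a 5% significance level and n = 3 to 10, which should be checked against the table used in the course.

```python
import statistics

# Assumption: tau values for a 5% significance level, n = 3..10.
TAU_TABLE = {3: 1.1511, 4: 1.4250, 5: 1.5712, 6: 1.6563,
             7: 1.7110, 8: 1.7491, 9: 1.7770, 10: 1.7984}

def modified_thompson(data, tau_table=TAU_TABLE):
    """Return (kept, rejected), removing one worst point per pass."""
    kept = list(data)
    rejected = []
    # Stop when the sample size is no longer covered by the table.
    while len(kept) in tau_table:
        mean = statistics.mean(kept)   # steps 1 and 4
        s = statistics.stdev(kept)
        # Steps 2 and 3: point with the largest absolute deviation
        worst = max(kept, key=lambda x: abs(x - mean))
        if abs(worst - mean) <= tau_table[len(kept)] * s:
            break                      # nothing left to reject
        kept.remove(worst)
        rejected.append(worst)
    return kept, rejected

kept, rejected = modified_thompson([10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 15.0])
print("kept:", kept)
print("rejected:", rejected)
```

Only one point is removed per pass, so the mean and S are always recomputed before the next comparison.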

B. Two-sigma or three-sigma rejection criteria
1. Does the data obey a normal distribution?
2. Calculate mean $\bar{x}$ and standard deviation $S = \sigma$
3. Reject $x_i$ if $|x_i - \bar{x}| > 2\sigma$ (or $3\sigma$ if so required)
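A short sketch of the two-sigma / three-sigma rule, assuming the data are approximately normal. The function name `sigma_reject` and the data are hypothetical.

```python
import statistics

def sigma_reject(data, k=2.0):
    """Split data into (kept, rejected) using the k-sigma rule."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # S used as the estimate of sigma
    kept = [x for x in data if abs(x - mean) <= k * s]
    rejected = [x for x in data if abs(x - mean) > k * s]
    return kept, rejected

data = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 15.0]
print(sigma_reject(data, k=2.0))
print(sigma_reject(data, k=3.0))
```

With so few points the suspect value inflates S itself, so the three-sigma rule is the less sensitive of the two.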
Dealing with large data sets:
- Organise data by groups (classification)
- Tabular arrangement of data by classes together with the corresponding class frequencies is
called the frequency distribution
- The frequency for a particular class (or category) is the number of times the class appears in
the data set
- Relative frequency of class = frequency / total number of observations in data set
Calculating the mean and standard deviation of grouped data
Mean:
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{K} f_j m_j$$
and standard deviation:
$$S = \sqrt{\frac{1}{n-1}\sum_{j=1}^{K} f_j (m_j-\bar{x})^2},$$
where $f_j$ is the frequency and $m_j$ is the mean value of the j-th class and $K$ is the total number of
classes.
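A small worked example of the grouped-data formulas. The three classes are invented, and each $m_j$ is taken as the class midpoint, which is an assumption standing in for the class mean.

```python
import math

# (m_j, f_j): class value and class frequency (hypothetical numbers)
classes = [(5.0, 2), (15.0, 5), (25.0, 3)]

n = sum(f for _, f in classes)                    # total observations
mean = sum(f * m for m, f in classes) / n
s = math.sqrt(sum(f * (m - mean) ** 2 for m, f in classes) / (n - 1))
rel_freq = [f / n for _, f in classes]            # relative frequencies

print("n =", n)
print("mean =", mean)
print("S =", round(s, 4))
print("relative frequencies =", rel_freq)
```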
Mean Interval Estimation
Aim: Find an approximation of the population mean $\mu$ by determining the sample mean $\bar{x}$. The sample
mean is not necessarily identical to $\mu$, i.e. the confidence interval is given by $\bar{x} - \Delta \le \mu \le \bar{x} + \Delta$,
where $\Delta$ is the uncertainty.
The confidence level is the probability that the population mean will fall within the specified interval;
it is given by $P(\bar{x} - \Delta \le \mu \le \bar{x} + \Delta)$.
The confidence level is usually expressed in terms of the level of significance $\alpha$ so that
$$P(\bar{x} - \Delta \le \mu \le \bar{x} + \Delta) = 1 - \alpha$$
Note that the sample mean approaches the normal distribution if the sample size is sufficiently
large:
$$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
If the distribution of the total population is not normal then for n < 30 the sample mean only
approximately follows a normal distribution. In such a case we have to use the Student t distribution
to find $\Delta$.
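A. If the distribution of the total population is normal or n >= 30 then assume
$$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
and
$$P(\bar{x} - \Delta \le \mu \le \bar{x} + \Delta) = P(\mu - \Delta \le \bar{x} \le \mu + \Delta) = P(-z_c \le Z \le z_c) = 1 - \alpha$$
or
$$P(Z \le z_c) = 1 - \frac{\alpha}{2},$$
where
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1),$$
which is the standard normal distribution. Note that
$$\Delta = z_c \frac{\sigma}{\sqrt{n}} \approx z_c \frac{S}{\sqrt{n}}$$
In Excel: use NORMINV to obtain the value of $z_c$ = NORMINV(1-α/2,0,1)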
B. If n < 30: find a value $t_c$ so that the probability $P(-t_c \le t \le t_c) = 1 - \alpha$, where
$$\Delta = t_c \frac{S}{\sqrt{n}}$$
and $t_c$ can be found from Student's t distribution, $P(|t| > t_c) = \alpha$, where the number of
degrees of freedom, $\nu$, is given by $\nu = n - 1$.
In Excel: use $t_c$ = TINV(α,ν).
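A sketch of the interval calculation in Python. `statistics.NormalDist().inv_cdf(p)` plays the role of NORMINV(p,0,1). The standard library has no Student t inverse, so in case B the value $t_c$ is passed in from a table; the function name, the sample and the value 2.262 (two-tailed, α = 0.05, ν = 9) are illustrative assumptions.

```python
import math
import statistics

def mean_confidence_interval(sample, alpha=0.05, t_c=None):
    """Return (lower, upper) bounds x_bar - delta, x_bar + delta."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)
    if t_c is None:
        # Case A: z_c = NORMINV(1 - alpha/2, 0, 1)
        crit = statistics.NormalDist(0.0, 1.0).inv_cdf(1.0 - alpha / 2.0)
    else:
        # Case B: t_c looked up for nu = n - 1 degrees of freedom
        crit = t_c
    delta = crit * s / math.sqrt(n)
    return x_bar - delta, x_bar + delta

sample = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.1, 9.9]
lo_t, hi_t = mean_confidence_interval(sample, alpha=0.05, t_c=2.262)
lo_z, hi_z = mean_confidence_interval(sample, alpha=0.05)
print("t interval:", (round(lo_t, 3), round(hi_t, 3)))
print("z interval:", (round(lo_z, 3), round(hi_z, 3)))
```

For the same sample the t interval is wider than the z interval, reflecting the extra uncertainty of a small sample.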
====== END of first lecture

Hypothesis Testing
Example 1
Null-hypothesis $H_0$: $\mu = \mu_0$
Alternative hypothesis $H_1$: $\mu \ne \mu_0$
- Given a sample with n elements, determine the mean $\bar{x}$ and the standard deviation S.
- Given a level of significance of $\alpha$,
  o if n > 30 determine $z_c$ = NORMINV(1-α/2,0,1) and then calculate $\Delta = z_c \frac{S}{\sqrt{n}}$.
  o if n ≤ 30 determine $t_c$ = TINV(α,n-1) and then calculate $\Delta = t_c \frac{S}{\sqrt{n}}$.
- Then conclude $P(\bar{x} - \Delta \le \mu \le \bar{x} + \Delta) = 1 - \alpha$ and accept $H_0$ if $\bar{x} - \Delta \le \mu_0 \le \bar{x} + \Delta$,
  otherwise reject $H_0$.
Note: if the standard deviation of the entire population, $\sigma$, is known use this to replace S in the
calculation of $\Delta$.
Example 2
Null-hypothesis $H_0$: $\mu \ge \mu_0$
Alternative hypothesis $H_1$: $\mu < \mu_0$
- Given a sample with n elements, determine the mean $\bar{x}$ and the standard deviation S.
- Given a level of significance of $\alpha$,
  o if n > 30 determine $z_c$ = NORMINV(1-α,0,1) and then calculate $\Delta = z_c \frac{S}{\sqrt{n}}$.
  o if n ≤ 30 determine $t_c$ = TINV(2α,n-1) and then calculate $\Delta = t_c \frac{S}{\sqrt{n}}$.
- Then conclude $P(\mu - \Delta \le \bar{x}) = 1 - \alpha$ and accept $H_0$ if $\mu_0 - \Delta \le \bar{x}$,
  otherwise reject $H_0$.
Note: if the standard deviation of the entire population, $\sigma$, is known use this to replace S in the
calculation of $\Delta$.
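The two examples can be sketched for the large-sample case (n > 30), where the critical value comes from the standard normal distribution. The function name `z_test_mean` and the generated sample are hypothetical; for n ≤ 30 the same comparison would use a tabulated $t_c$ instead.

```python
import math
import statistics

def z_test_mean(sample, mu0, alpha=0.05, two_sided=True):
    """Return True if H0 is accepted (large-sample version)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)
    std_normal = statistics.NormalDist()
    if two_sided:
        # Example 1: H0: mu = mu0, z_c = NORMINV(1 - alpha/2, 0, 1)
        z_c = std_normal.inv_cdf(1 - alpha / 2)
        delta = z_c * s / math.sqrt(n)
        return x_bar - delta <= mu0 <= x_bar + delta
    # Example 2: H0: mu >= mu0, z_c = NORMINV(1 - alpha, 0, 1)
    z_c = std_normal.inv_cdf(1 - alpha)
    delta = z_c * s / math.sqrt(n)
    return mu0 - delta <= x_bar

# Deterministic illustrative sample of 40 values scattered around 10
sample = [10 + 0.1 * ((i * 7) % 11 - 5) for i in range(40)]
print(z_test_mean(sample, 10.0))                   # two-sided, mu0 = 10
print(z_test_mean(sample, 11.0))                   # two-sided, mu0 = 11
print(z_test_mean(sample, 11.0, two_sided=False))  # one-sided, mu0 = 11
```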
Test for differences in mean
Given two samples, sample 1 and sample 2, consisting of $n_1$ and $n_2$ elements, respectively. The mean
and standard deviation of sample 1 are $\bar{x}_1$ and $S_1$, and those for the entire population 1 are $\mu_1$
and $\sigma_1$. The mean and standard deviation of sample 2 are $\bar{x}_2$ and $S_2$, and those for the entire population
2 are $\mu_2$ and $\sigma_2$.
We want to test whether the means of the two populations are the same at a level of significance $\alpha$.
- Null hypothesis $H_0$: $\mu_1 = \mu_2$
- Alternative hypothesis $H_1$: $\mu_1 \ne \mu_2$
Calculate
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0,1),$$
accept $H_0$ if $|z| \le z_c$, where $z_c$ = NORMINV(1-α/2,0,1)
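If both populations have the same UNKNOWN standard deviation, then approximate the common
standard deviation S by calculating
$$S = \sqrt{\frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}}$$
(this form applies when $S_1$ and $S_2$ are computed with divisor $n_1$ and $n_2$; with the $n - 1$ divisor the weights become $n_1 - 1$ and $n_2 - 1$)
and then
- if the samples are large calculate
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{S^2}{n_1} + \frac{S^2}{n_2}}} \sim N(0,1),$$
  accept $H_0$ if $|z| \le z_c$, where $z_c$ = NORMINV(1-α/2,0,1)
- if the samples are small calculate
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{S^2}{n_1} + \frac{S^2}{n_2}}},$$
  which follows a t distribution; accept $H_0$ if $|t| \le t_c$, where $t_c$ = TINV(α, n_1+n_2-2).
  NOTE: the number of degrees of freedom here is $\nu = n_1 + n_2 - 2$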
If both populations have UNKNOWN standard deviations (not necessarily the same) and have
small sample size then calculate
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}},$$
which follows a t-distribution, where the degree of freedom is approximated by
$$\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{S_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{S_2^2}{n_2}\right)^2}$$
rounded down to the nearest integer; accept $H_0$ if $|t| \le t_c$
In Excel: $t_c$ = TINV(α,ν)
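A sketch of the last case (unknown, possibly different standard deviations). The function name `welch_t` and the two samples are hypothetical. The critical value is again taken from a table, here 2.571 for α = 0.05 (two-tailed) and ν = 5.

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Return (t, nu) for two samples with unequal standard deviations."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # S1^2 / n1
    v2 = statistics.variance(sample2) / n2   # S2^2 / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, math.floor(nu)                 # nu rounded down

sample1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sample2 = [2.0, 4.0, 6.0, 8.0, 10.0]
t, nu = welch_t(sample1, sample2)
t_c = 2.571  # assumption: table value for alpha = 0.05, nu = 5
print("t =", round(t, 4), "nu =", nu)
print("accept H0:", abs(t) <= t_c)
```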
Curve fitting and the method of least squares
$y_i = a x_i + b$: how to determine a and b from n pairs of data $(x_i, y_i)$.
If $\hat{y}_i = a x_i + b$ then the error in the approximation at point $x_i$ is given by
$$e_i = y_i - a x_i - b$$
Minimize the sum of the squared errors:
$$J(a,b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a x_i - b)^2$$
by two subsequent differentiations, first with respect to a and second with respect to b. The minimum is determined by
equating the two derivatives to zero, giving
$$a = \frac{\left(\sum x_i\right)\left(\sum y_i\right) - n \sum x_i y_i}{\left(\sum x_i\right)^2 - n \sum x_i^2} = \frac{S_{xy}}{S_{xx}}$$
and
$$b = \frac{\left(\sum x_i\right)\left(\sum x_i y_i\right) - \left(\sum x_i^2\right)\left(\sum y_i\right)}{\left(\sum x_i\right)^2 - n \sum x_i^2} = \bar{y} - a\bar{x},$$
where
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)$$
and
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$$
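Quality of fit:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a x_i - b)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}},$$
where
$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$$
The standard error of estimate is given by
$$S = \sqrt{\frac{SSE}{n-2}}$$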
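A minimal sketch of the straight-line fit using the sums $S_{xy}$, $S_{xx}$ and $S_{yy}$. The function name `linear_fit` and the four data points are illustrative.

```python
import math

def linear_fit(xs, ys):
    """Return (a, b, sse, std_err) for the least-squares line y = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    sse = s_yy - s_xy ** 2 / s_xx
    # max(..., 0.0) guards against a tiny negative SSE from rounding
    std_err = math.sqrt(max(sse, 0.0) / (n - 2))
    return a, b, sse, std_err

a, b, sse, std_err = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse:.3f}, S = {std_err:.4f}")
```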

====== END of second lecture
Least Squares Fitting
Example 1
Nonlinear functional relationship $y(x) = c\,e^{dx} \Rightarrow \ln y = dx + \ln c$,
then define $y_1 = \ln y$, $a = d$, $b = \ln c$, so that
$$\ln y = dx + \ln c \;\Rightarrow\; y_1 = ax + b$$
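The substitution can be sketched as a straight-line fit to the pairs $(x, \ln y)$, followed by $c = e^b$ and $d = a$. The function name `fit_exponential` is hypothetical, and the data are generated from known constants so the result can be checked. All y values must be positive.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = c * exp(d * x) via a linear fit of ln(y) against x."""
    logs = [math.log(y) for y in ys]   # y_1 = ln y
    n = len(xs)
    x_bar = sum(xs) / n
    l_bar = sum(logs) / n
    s_xy = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx                    # a = d
    b = l_bar - a * x_bar              # b = ln c
    return math.exp(b), a              # (c, d)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]   # exact data with c = 2, d = 0.5
c, d = fit_exponential(xs, ys)
print(f"c = {c:.4f}, d = {d:.4f}")
```

The fit minimizes the squared error in $\ln y$, not in $y$ itself, so with noisy data the result can differ from a direct nonlinear fit.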
Example 2
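$$p = f(u) = \frac{k}{u}\left(\frac{u}{C}\right)^{k} \exp\left\{-\left(\frac{u}{C}\right)^{k}\right\},$$
integrating this gives:
$$P = F(u) = 1 - \exp\left\{-\left(\frac{u}{C}\right)^{k}\right\} \;\Rightarrow\; 1 - P = \exp\left\{-\left(\frac{u}{C}\right)^{k}\right\}$$
$$\ln(1-P) = -\left(\frac{u}{C}\right)^{k} \;\Rightarrow\; \underbrace{\ln(-\ln(1-P))}_{y} = \underbrace{k}_{a}\,\underbrace{\ln u}_{x} + \underbrace{(-k \ln C)}_{b}$$
$$P = \int_0^u p\,du \;\Rightarrow\; P_j = \sum_{i=1}^{j} p_i$$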
Example 3
Hyperbola:
$$y = \frac{1}{ax+b} \;\Rightarrow\; \frac{1}{y} = ax + b$$

Rejecting Questionable Data Points: Cook's distance
Cook's distance measures the effect of deleting a given observation; the larger the distance the
more suspect the data point is:
$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot MSE},$$
where
- $\hat{y}_j$: prediction from full regression model for j-th observation
- $\hat{y}_{j(i)}$: prediction of j-th observation from refitted regression model without observation i
- $p$: number of fitted parameters (e.g. p=2 for y=ax+b)
- $MSE = \frac{1}{n-p}\sum_{j=1}^{n} e_j^2 = \frac{1}{n-p}\sum_{j=1}^{n} (y_j - \hat{y}_j)^2$ is the mean square error of the regression model
Equivalent expression for Cook's distance:
$$D_i = \frac{e_i^2}{p \cdot MSE}\,\frac{h_{ii}}{(1-h_{ii})^2},$$
where $h_{ii}$ is the i-th diagonal element of H defined by
$$H = X (X^T X)^{-1} X^T$$
Reject data point $y_i$ when $D_i > 1$ or $D_i > \frac{4}{n}$
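A sketch of the equivalent expression for the straight-line case y = ax + b (p = 2). It assumes the standard closed form for the leverage in that case, $h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$, so no matrix inversion is needed. The function name `cooks_distances` and the data are illustrative.

```python
def cooks_distances(xs, ys):
    """Cook's distance D_i of every point for the fit y = a*x + b."""
    n, p = len(xs), 2
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b = y_bar - a * x_bar
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for x, e in zip(xs, residuals):
        h = 1.0 / n + (x - x_bar) ** 2 / s_xx   # leverage h_ii
        distances.append(e * e / (p * mse) * h / (1.0 - h) ** 2)
    return distances

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]   # last point is the suspect one
d = cooks_distances(xs, ys)
for x, di in zip(xs, d):
    flag = "reject" if di > 1 or di > 4 / len(xs) else "keep"
    print(f"x = {x}: D = {di:.3f} ({flag})")
```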
Multi-variable curve-fitting example
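1) $y = ax + b = \begin{bmatrix} x & 1 \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} x & 1 \end{bmatrix}\theta$, with $\theta = \begin{bmatrix} a & b \end{bmatrix}^T$
Given the data pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, prediction: $\hat{y}_i = a x_i + b$; write this in
matrix format: $\hat{y}_i = a x_i + b = \begin{bmatrix} x_i & 1 \end{bmatrix}\theta$ with error $e_i = y_i - \hat{y}_i = y_i - \begin{bmatrix} x_i & 1 \end{bmatrix}\theta$
So that
$$E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}\theta = Y - X\theta$$
and the sum of the squared errors becomes
$$J = \sum_{i=1}^{n} e_i^2 = E^T E = (Y - X\theta)^T (Y - X\theta)$$
To minimize the sum of the squared errors, J, we need to solve $\frac{dJ}{d\theta} = 0$:
$$\frac{dJ}{d\theta} = -2X^T(Y - X\theta) = 0 \;\Rightarrow\; X^T Y - X^T X\theta = 0 \;\Rightarrow\; X^T X\theta = X^T Y$$
So that
$$\theta = (X^T X)^{-1} X^T Y$$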
Advantages: 1) Easy computation in Matlab
2) Easy generalisation to more complicated curve-fitting problems
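2) $y_i = a x_i^2 + b x_i + c = \begin{bmatrix} x_i^2 & x_i & 1 \end{bmatrix}\begin{bmatrix} a \\ b \\ c \end{bmatrix}$
So that
$$X = \begin{bmatrix} x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \\ \vdots & \vdots & \vdots \\ x_n^2 & x_n & 1 \end{bmatrix},\quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\quad \theta = \begin{bmatrix} a \\ b \\ c \end{bmatrix}$$
and again the solution is determined from
$$\theta = (X^T X)^{-1} X^T Y$$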
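A sketch of the matrix solution for the quadratic case in plain Python. Instead of forming the inverse, it solves the normal equations $X^T X\theta = X^T Y$ by Gaussian elimination, which gives the same $\theta$. The function name `solve_normal_equations` is hypothetical, and the data are generated from known coefficients so the result can be checked.

```python
def solve_normal_equations(X, Y):
    """Solve (X^T X) theta = X^T Y by elimination with partial pivoting."""
    m = len(X[0])
    # Augmented matrix [X^T X | X^T Y]
    A = [[sum(row[r] * row[c] for row in X) for c in range(m)]
         + [sum(row[r] * y for row, y in zip(X, Y))] for r in range(m)]
    for col in range(m):
        # Bring the largest remaining entry of this column to the diagonal
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    theta = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(A[r][c] * theta[c] for c in range(r + 1, m))
        theta[r] = (A[r][m] - tail) / A[r][r]
    return theta

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x * x - 3.0 * x + 1.0 for x in xs]   # exact data: a=2, b=-3, c=1
X = [[x * x, x, 1.0] for x in xs]                # rows [x_i^2, x_i, 1]
a, b, c = solve_normal_equations(X, ys)
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
```

Only the rows of X change from one model to the next, so the same routine covers any model that is linear in its parameters. It fails with a division by zero if $X^T X$ is singular.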
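3) $y_i = a_1 x_i^2 z_i + a_2 \sin(x_i) z_i + a_3 \cos(z_i) + a_4 e^{x_i} z_i = \begin{bmatrix} x_i^2 z_i & \sin(x_i) z_i & \cos(z_i) & e^{x_i} z_i \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}$
so that
$$X = \begin{bmatrix} x_1^2 z_1 & \sin(x_1) z_1 & \cos(z_1) & e^{x_1} z_1 \\ x_2^2 z_2 & \sin(x_2) z_2 & \cos(z_2) & e^{x_2} z_2 \\ \vdots & \vdots & \vdots & \vdots \\ x_n^2 z_n & \sin(x_n) z_n & \cos(z_n) & e^{x_n} z_n \end{bmatrix},\quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\quad \theta = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}$$
and also here the solution is determined from
$$\theta = (X^T X)^{-1} X^T Y$$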

THE END
