You are on page 1of 22

Summary Statistics

Jake Blanchard
Spring 2008
Uncertainty Analysis for Engineers 1
Summarizing and Interpreting Data
It is useful to have some metrics for
summarizing statistical data (both input and
output)
3 key characteristics are
central tendency (mean, median, mode)
Dispersion (variance)
Shape (skewness, kurtosis)
Uncertainty Analysis for Engineers 2
Central Tendency
Mean

Median=point such that exactly half of the
probability is associated with lower values
and half with greater values

Mode=most likely value (maximum of pdf)
Uncertainty Analysis for Engineers 3
}

=
= = dx x f x x E p x x E
i
n
i
i
) ( ) ( ) (
1
}

=
z
dx x f 5 . 0 ) (
For 1 Dice
Uncertainty Analysis for Engineers 4
5 . 3 mod
5 . 3
5 . 3 ) (
6
1
6
6
1
5
6
1
4
6
1
3
6
1
2
6
1
1 ) ( ) (
6
1
=
=
=
|
.
|

\
|
+
|
.
|

\
|
+
|
.
|

\
|
+
|
.
|

\
|
+
|
.
|

\
|
+
|
.
|

\
|
= =

=
e
x
median
x E
x p x x E
mean
i
x
i
i
For our example, the mean, median, and
mode are given by

The mode is x=0
Uncertainty Analysis for Engineers 5

) 2 ln(
5 . 0
1
) ( ) (
0
0
=
=
= = =
}
} }

z
dt e
median
dt e t dt t tf t E
mean
z
t
t
Other Characteristics
We can calculate the expected value of
any function of our random variable as
Uncertainty Analysis for Engineers 6
( ) | |
( ) ( )

i
i i
x p x h
dx x f x h
x h E
) ( ) (
Some Results
Uncertainty Analysis for Engineers 7
( )
( ) | |

= =
= =
=
(

=
(

=
=
n
j
j j
n
j
j j
n
j
j
n
j
j
x E b x b E
x E x E
x cE cx E
c c E
1 1
1 1
) ( ) (
) (
| |
( )
( )

= =
=

}
}

i
i
k
i
k
k
k
x p x
dx x f x
x E
dx x f x
) (
) (
) (
1
1
1
1

Moments of Distributions
We can define many of these parameters in
terms of moments of the distribution

Mean is first moment.
Variance is second moment
Third and fourth moments are related to
skewness and kurtosis
Uncertainty Analysis for Engineers 8
Variance is a measure of spread or dispersion

For discrete data sets, the biased variance is:

and the unbiased variance is

The standard deviation is the square root of
the variance
Uncertainty Analysis for Engineers 9
| | ( )
}

= = = dx x f x x E ) (
2
1
2
1 2
2
o
( )

=
=
n
i
x x
n
s
1
2
2
1
( )

=
n
i
x x
n
s
1
2
2
1
1
Skewness
skewness is a measure of asymmetry

For discrete data sets, the biased skewness
is related to:

The skewness is often defined as
Uncertainty Analysis for Engineers 10
| | ( )
}

= = dx x f x x E ) (
3
1
3
1 3

( )

=
=
n
i
x x
n
m
1
3
3
1
3
3
1
o

=
Skewness
Uncertainty Analysis for Engineers 11
Kurtosis
kurtosis is a measure of peakedness

For discrete data sets, the biased kurtosis is
related to:

The kurtosis is often defined as
Uncertainty Analysis for Engineers 12
| | ( )
}

= = dx x f x x E ) (
4
1
4
1 4

( )

=
=
n
i
x x
n
m
1
4
4
1
3
4
4
2
=
o

Kurtosis
Pdf of Pearson type VII distribution with
kurtosis of infinity (red), 2 (blue), and 0 (black)
Uncertainty Analysis for Engineers 13
Using Matlab
Sample data is length of time a person was
able to hold their breath (40 attempts)
Try a scatter plot
y = ones(size(breathholds));
h1 = figure('Position',[100 100 400 100],'Color','w');
scatter(breathholds,y);

Uncertainty Analysis for Engineers 14
disp(['The mean is ',num2str(mean(breathholds)),' seconds (green line).']);
disp(['The median is ',num2str(median(breathholds)),' seconds (red line).']);
hold all;
line([mean(breathholds) mean(breathholds)],[0.5 1.5],'color','g');
line([median(breathholds) median(breathholds)],[0.5 1.5],'color','r');
Uncertainty Analysis for Engineers 15
Box Plot
title('Scatter with Min, 25%iqr, Median, Mean, 75%iqr, & Max lines');
xlabel('');
h3 = figure('Position',[100 100 400 100],'Color','w');
boxplot(breathholds,'orientation','horizontal','widths',.5);
set(gca,'XLim',[40 140]);
title('A Boxplot of the same data'); xlabel(''); set(gca,'Yticklabel',[]);
ylabel('');
Uncertainty Analysis for Engineers 16
Box Plot
Uncertainty Analysis for Engineers 17
Min
Max
Median
Outlier
Box
represents
inter-quartile
range (half of
data)
Empirical cdf
h3 = figure('Position',[100 100 600 400],'Color','w');
cdfplot(breathholds);

Uncertainty Analysis for Engineers 18
Multivariate Data Sets
When there are multiple input variables,
we need some additional ways to
characterize the data

If x and y are independent, then
Cov(x,y)=0
Uncertainty Analysis for Engineers 19
| |
( )
) ( ) ( ) ( ) , (
, ) , (
) , ( ) , (
) , (
y E x E xy E y x Cov
discrete y x p y x h
continuous dxdy y x f y x h
y x h E
i j
j i j i
=

} }

Correlation Coefficients
Two random variables may be related
Define correlation coefficient of input (x) and
output (y) as

=1 implies linear dependence, positive slope
=0 no dependence
=-1 implies linear dependence, negative
slope
Uncertainty Analysis for Engineers 20
( )( )
( ) ( )
) ( ) (
) , (
1 1
2 2
1
,
y x
y x Cov
y y x x
y y x x
m
k
m
k
k k
m
k
k k
y x
o o
=

=

= =
=
Example
Uncertainty Analysis for Engineers 21
=0.98
=-0.38
=1
=-0.98
Example
x=rand(25,1)-0.5;
y=x;
corrcoef(x,y)
subplot(2,2,1), plot(x,y,'o')
y2=x+0.2*rand(25,1);
corrcoef(x,y2)
subplot(2,2,2), plot(x,y2,'o')
y3=-x+0.2*rand(25,1);
corrcoef(x,y3)
subplot(2,2,3), plot(x,y3,'o')
y4=rand(25,1)-0.5;
corrcoef(x,y4)
subplot(2,2,4), plot(x,y4,'o')
Uncertainty Analysis for Engineers 22