
Descriptive statistics

Peter Goos
peter.goos@biw.kuleuven.be
What ?
• Summarizing data by means of “summary
statistics”
• Location
o Mean or average
o Median
o Quartiles
o Quantiles or percentiles

• Spread or variation
o Variance
o Standard deviation
What ?
• Relationship between variables
o Covariance
o Ordinary correlation (Pearson)
o Rank correlation (Spearman)

• Many other summary statistics exist


Mode
• The most common value in the data set
• Data:
16, 13, 14, 17, 14, 16, 17, 16, 15, 13
• The mode is 16 since it appears three times
• All other values appear less frequently
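As a minimal sketch, the mode of the example data can be found by counting how often each value occurs. The variable names are illustrative only.

```python
from collections import Counter

data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]

# Count the occurrences of every value and take the most frequent one.
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```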
Arithmetic mean
• Data:
16, 13, 14, 17, 14, 16, 17, 16, 15, 13
• Arithmetic mean (average):
(16+13+14+17+14+16+17+16+15+13)/10 = 15.1
• Sensitive to outliers: replacing one 16 by 160
results in a mean of 29.5
• Before trusting the mean, we should check
that there are no outlying observations due to
measurement errors, typos, etc.
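The sensitivity of the mean to a single outlier can be checked with a short sketch. It replaces the first 16 by 160, as in the example above; which 16 is replaced does not matter for the result.

```python
data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]

# Arithmetic mean: sum of the values divided by their number.
mean = sum(data) / len(data)

# Simulate a typo: the first observation (16) is recorded as 160.
with_typo = data.copy()
with_typo[0] = 160
mean_with_typo = sum(with_typo) / len(with_typo)

print(mean, mean_with_typo)
```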
Geometric mean
• In some cases, it is better to use the geometric
mean
• Examples:
o Interest rates
o Growth rates
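The sketch below illustrates why the geometric mean suits growth rates. The growth factors (+50% followed by −50%) are hypothetical and not taken from the slides. The geometric mean is computed as the exponential of the average logarithm.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values: exp of the mean of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Geometric mean of the example data used throughout these slides.
data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]
gm_data = geometric_mean(data)

# Hypothetical growth factors: +50% in year 1, -50% in year 2.
factors = [1.5, 0.5]
avg_factor = geometric_mean(factors)

# Starting from 100, two years of growth at the average factor
# reproduce the true end value 100 * 1.5 * 0.5 = 75.
end_value = 100 * avg_factor ** 2

print(gm_data, avg_factor, end_value)
```

The arithmetic mean of the two factors is 1.0, which would wrongly suggest that nothing changed over the two years.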
Mean and other statistics

Summary Statistics
Mean 15.1
Std Dev 1.5238839
Std Err Mean 0.4818944
Upper 95% Mean 16.190121
Lower 95% Mean 14.009879
N 10

Ask for additional statistics via the red triangle menu
Mean and other statistics

Summary Statistics
Mean 15.1
Std Dev 1.5238839
Std Err Mean 0.4818944
Upper 95% Mean 16.190121
Lower 95% Mean 14.009879
N 10
Mode 16
Geometric Mean 15.029661
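The numbers in this output can be reproduced outside JMP. The sketch below assumes that the standard deviation uses the n − 1 denominator and that the 95% limits are based on a t distribution with 9 degrees of freedom. The critical value 2.262157 is taken from a t table and hard-coded, and these assumptions are inferred from the reported values rather than stated in the slides.

```python
import math
import statistics

data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]
n = len(data)

mean = statistics.mean(data)
std_dev = statistics.stdev(data)      # sample standard deviation (n - 1)
std_err = std_dev / math.sqrt(n)      # standard error of the mean

# Assumption: 97.5% quantile of the t distribution with n - 1 = 9
# degrees of freedom, copied from a table.
t_crit = 2.262157
lower_95 = mean - t_crit * std_err
upper_95 = mean + t_crit * std_err

print(mean, std_dev, std_err, lower_95, upper_95)
```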
Median
• The value that separates the 50% smallest data
points from the 50% largest data points
• For odd numbers of observations, the median is
the middle value
• For even numbers of observations, the median is
the average of the two middle values
• Determining the median requires you to rank the
data from small to large first
Median
• Data:
16, 13, 14, 17, 14, 16, 17, 16, 15, 13
• Ranked data:
13, 13, 14, 14, 15, 16, 16, 16, 17, 17
• Two middle values are 15 and 16
• The median is (15+16)/2 = 15.5
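The steps above (rank the data, then take the middle value or the average of the two middle values) can be sketched as follows.

```python
data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]

# Step 1: rank the data from small to large.
ranked = sorted(data)
n = len(ranked)
mid = n // 2

# Step 2: middle value for odd n, average of the two middle values for even n.
if n % 2 == 1:
    median = ranked[mid]
else:
    median = (ranked[mid - 1] + ranked[mid]) / 2

print(ranked, median)
```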
Median is a robust statistic
• Median is not sensitive to outliers
• The following data sets have the same median,
namely 15.5
• Data set 1:
16, 13, 14, 17, 14, 16, 17, 16, 15, 13
• Data set 2:
16, 13, 14, 17, 14, 160, 17, 16, 15, 13
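A short check of this robustness, comparing the mean and the median for both data sets with Python's standard `statistics` module:

```python
import statistics

data_set_1 = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]
data_set_2 = [16, 13, 14, 17, 14, 160, 17, 16, 15, 13]

# The median is unaffected by the outlier 160, the mean is not.
median_1 = statistics.median(data_set_1)
median_2 = statistics.median(data_set_2)
mean_1 = statistics.mean(data_set_1)
mean_2 = statistics.mean(data_set_2)

print(median_1, median_2, mean_1, mean_2)
```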
Quartiles
• There are three quartiles which split the data set
in four parts
• The first quartile separates the 25% smallest data
points from the 75% largest data
• The second quartile separates the 50% smallest
data points from the 50% largest data
• The second quartile is equal to the median
• The third quartile separates the 75% smallest data
points from the 25% largest data
Quartiles
• Ranked data:
13, 13, 14, 14, 15, 16, 16, 16, 17, 17
• The first quartile lies between the second and
third ranked values (13 and 14) and equals
13.25
• The median is (15+16)/2 = 15.5
• The third quartile lies between the eighth and
ninth ranked values (16 and 17) and equals
16.75
• The exact quartile values depend on the
interpolation rule: JMP reports 13.75 and 16.25
for these data
Medians, quartiles and quantiles

Quantiles
100.0% maximum 17
99.5% 17
97.5% 17
90.0% 17
75.0% quartile 16.25
50.0% median 15.5
25.0% quartile 13.75
10.0% 13
2.5% 13
0.5% 13
0.0% minimum 13
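The quartiles in this output are consistent with interpolating linearly at position p(n + 1) in the ranked data. That rule is inferred from the reported numbers and is not stated in the slides, so the sketch below should be read as an assumption about how the values arise. The function name `quantile` is illustrative.

```python
def quantile(values, p):
    """Quantile by linear interpolation at 1-based position p * (n + 1)."""
    ranked = sorted(values)
    n = len(ranked)
    position = p * (n + 1)
    lower_index = int(position)          # whole part of the position
    fraction = position - lower_index    # interpolation weight
    if lower_index < 1:                  # below the smallest observation
        return ranked[0]
    if lower_index >= n:                 # above the largest observation
        return ranked[-1]
    below = ranked[lower_index - 1]
    above = ranked[lower_index]
    return below + fraction * (above - below)

data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]
q1 = quantile(data, 0.25)
q2 = quantile(data, 0.50)
q3 = quantile(data, 0.75)

print(q1, q2, q3)
```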
Quantiles or percentiles
• The 80th percentile separates the 80% smallest
values and the 20% largest values

Custom Quantiles
Quantiles
Quantile  Estimate  Lower 95%  Upper 95%  Actual Coverage
20%       13.2      13         17         89.26
80%       16.8      13         17         89.26
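The two estimates can be reproduced with `statistics.quantiles` from the Python standard library. Its "exclusive" method is assumed here because it gives the same numbers as the JMP output for these data. The confidence limits and coverage are not reproduced.

```python
import statistics

data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]

# n=5 returns the four cut points at 20%, 40%, 60% and 80%.
cut_points = statistics.quantiles(data, n=5, method="exclusive")
p20 = cut_points[0]
p80 = cut_points[3]

print(p20, p80)
```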
Spread or variation
• The following data sets have the same median
o Data set 1: 16, 13, 14, 17, 14, 16, 17, 16, 15, 13
o Data set 2: 19, 10, 11, 20, 11, 19, 20, 19, 12, 10

• The main difference between the two data sets is
that the values in the second data set lie further
apart
• Simple measures for spread
o Range = maximum − minimum
o Interquartile range Q = Q3 – Q1
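A minimal sketch of both measures for the two data sets. The quartiles are computed with the "exclusive" method of `statistics.quantiles`, which is an assumption, since other quartile rules give slightly different interquartile ranges.

```python
import statistics

def spread(values):
    """Return (range, interquartile range) of a list of numbers."""
    value_range = max(values) - min(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return value_range, q3 - q1

data_set_1 = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]
data_set_2 = [19, 10, 11, 20, 11, 19, 20, 19, 12, 10]

range_1, iqr_1 = spread(data_set_1)
range_2, iqr_2 = spread(data_set_2)

print(range_1, iqr_1)
print(range_2, iqr_2)
```

Both measures are larger for the second data set, although the two medians are identical.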
Box plot (Outlier box plot)
[Figure: outlier box plot of Distance (km), horizontal axis from 0 to 250]
Box plot (Outlier box plot)
• The box shows the three quartiles (the middle one
is the median)
• The box has two whiskers
o One whisker extends to the smallest value that
is not extreme
o The other whisker extends to the largest value
that is not extreme
• The individual dots are the extreme values
(extremely large and extremely small)
• These values are called outliers
Box plot (Outlier box plot)
• A value is extremely small if it is smaller than
Q1 – 1.5 × Q
• A value is extremely large if it is larger than
Q3 + 1.5 × Q
• The middle of the diamond indicates the mean
• The ends of the diamond represent a 95%
confidence interval (see later)
• The red line indicates the smallest interval that
contains 50% of the data
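The 1.5 × Q rule can be sketched as follows, using the earlier data set that contains the value 160. The quartile rule ("exclusive" method of `statistics.quantiles`) is an assumption, and the variable names are illustrative.

```python
import statistics

data = [16, 13, 14, 17, 14, 160, 17, 16, 15, 13]

q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1                       # interquartile range Q

# Values beyond these fences are flagged as extreme.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

# The whiskers end at the most extreme values that are not outliers.
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```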
Delays Brussels Airlines
Delay Time Arrival
Departure Airport    N   Min  Quantiles 0.25  Median   Mean  Quantiles 0.75  Max
BHX                210   -22           -22.0    11.0   13.4           -17.3  122
BOD                215    -9            -9.0     9.0   14.7            -8.4  202
BRS                184   -19           -19.0    12.0   16.5           -17.8  236
BUD                185   -32           -32.0   -11.0   -8.1           -31.2   74
CPH                183   -19           -19.0     0.0    6.9           -17.5  109
DUS                184   -18           -18.0    -2.0    0.3           -16.1   32
EDI                215   -18           -18.0     5.0   14.1           -18.0  366
FLR                215   -18           -18.0    11.0   21.9           -16.8  238
GLA                215   -18           -18.0    13.0   17.8           -15.5  124
HAJ                216   -21           -21.0    -5.5    1.1           -19.7  161
HAM                217   -22           -22.0    -3.0    2.0           -21.4  111
LBA                183   -18           -18.0    10.0   18.8           -16.9  240
LCY                152   -10           -10.0    16.0   21.2            -9.3  109
MRS                217   -10           -10.0     9.0   14.6            -9.4  182
NAP                215   -21           -21.0     5.0    8.8           -18.5   86
NCL                215   -12           -12.0     9.0   16.2           -10.8  132
OST                  1   206           206.0   206.0  206.0           206.0  206
SXB                185   -10           -10.0     5.0    7.6            -9.2   63
THF                 49   -19           -19.0    -3.0   -0.0           -19.0   74
TLS                215   -15           -15.0     6.0   10.3           -12.5  207
TRN                183   -17           -17.0    10.0   12.3           -16.2  191
Side-by-side box plots
[Figure: side-by-side box plots of Delay Time Arrival (horizontal axis from 0 to 300), one box per departure airport from BHX to TRN]
Other measures of spread or variation
• Variance

• Standard deviation

• Mean absolute deviation
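The formulas for these three measures are missing from the extracted slide text. The sketch below uses the usual sample definitions as an assumption: variance with denominator n − 1 (which agrees with the Std Dev 1.5238839 reported earlier), standard deviation as its square root, and mean absolute deviation as the average absolute distance from the mean.

```python
import math

data = [16, 13, 14, 17, 14, 16, 17, 16, 15, 13]
n = len(data)
mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance (same unit as the data).
std_dev = math.sqrt(variance)

# Mean absolute deviation: average absolute distance from the mean
# (assumed here to be taken around the mean and divided by n).
mean_abs_dev = sum(abs(x - mean) for x in data) / n

print(variance, std_dev, mean_abs_dev)
```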


Data set with individual data points
Distribution platform
Summary data set
Distribution platform
Coefficient of variation (CV)
• Consider the following data sets:
o Data set 1: 15, 20, 20, 30, 35, 35, 40, 45
o Data set 2: 1015, 1020, 1020, 1030, 1035,
1035, 1040, 1045
• Variance is equal for both data sets (and so is the
standard deviation)
• Means are 30 and 1030
• Relative to the mean, data set 1 exhibits more
variability
• This is quantified by the CV: standard deviation /
mean
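A minimal sketch of the comparison, using the sample standard deviation from Python's `statistics` module:

```python
import statistics

data_set_1 = [15, 20, 20, 30, 35, 35, 40, 45]
data_set_2 = [1015, 1020, 1020, 1030, 1035, 1035, 1040, 1045]

def coefficient_of_variation(values):
    """CV = standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

sd_1 = statistics.stdev(data_set_1)
sd_2 = statistics.stdev(data_set_2)
cv_1 = coefficient_of_variation(data_set_1)
cv_2 = coefficient_of_variation(data_set_2)

print(sd_1, sd_2)   # identical standard deviations
print(cv_1, cv_2)   # very different relative variability
```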
JMP
Correlation and covariance
• The best known correlation is the Pearson
correlation
o Measures the linear association between two
variables
o Is based on the covariance

• The Spearman correlation is meant to quantify
monotonic association (which can be nonlinear)
• Correlations and covariances are positive if large
(small) values of one variable correspond to large
(small) values of the other variable
Correlation and covariance
• Covariance

• Ordinary correlation (Pearson)
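The formulas on this slide are missing from the extracted text. As an assumption, the standard sample definitions are given below for n pairs (x_i, y_i), with the n − 1 denominator that matches the standard deviation reported earlier; s_x and s_y denote the sample standard deviations of x and y.

```latex
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r_{xy} = \frac{s_{xy}}{s_x \, s_y}
```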


Correlation and covariance
• The absolute value of the covariance is hard to
interpret
o Positive values indicate a positive (linear)
relationship
o Negative values indicate a negative (linear)
relationship
• The correlation takes values between −1 and +1
o −1 if perfect negative linear relationship
o +1 if perfect positive linear relationship
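These properties can be checked numerically. The data below are hypothetical, and the functions implement the usual sample covariance (denominator n − 1) and Pearson correlation, which is assumed to be the definition used in the slides.

```python
import math

def covariance(x, y):
    """Sample covariance with denominator n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]       # perfect positive linear relationship
y_down = [10 - 3 * v for v in x]    # perfect negative linear relationship
y_noisy = [2, 1, 4, 3, 5]           # positive but imperfect relationship

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)
r_noisy = pearson(x, y_noisy)
cov_noisy = covariance(x, y_noisy)

print(r_up, r_down, r_noisy, cov_noisy)
```

Rescaling a variable (for example from km to m) changes the covariance but not the correlation, which is why the correlation is easier to interpret.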
Covariance
[Figure: illustration of the covariance; label: Observation]
Correlation 0.7
• The next few pictures all correspond to a
correlation of 0.7
• Only the first picture shows what we
expect: a positive, but not perfect, relation
between two variables
• In all other scenarios, the story is more
complicated
o A zero correlation, except for one outlying data point
o A nearly perfect correlation, with one outlying
data point
o …
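The effect of a single outlying point can be sketched with hypothetical data. These are not the data behind the pictures and the resulting correlation is not 0.7: four points with zero correlation, to which one distant point is added.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from sums of centred products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the four corners of a square are uncorrelated.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
r_without = pearson(x, y)

# Adding one outlying point far away creates a strong correlation.
r_with = pearson(x + [10], y + [10])

print(r_without, r_with)
```

A scatter plot would reveal immediately that the high correlation is driven by one point only.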
Take-away lesson
• Do not calculate correlations and interpret them
without looking at the data
• The same goes for averages/means, standard
deviations, variances, …
• So, create graphs whenever possible
• Think critically
• Do not study your data too superficially
• In JMP, various types of correlation can be
calculated via the menu “Analyze” and
“Multivariate”
Spanish red wines
Correlations in JMP
Rank correlation (Spearman)
• Measures more general (monotonic) positive and
negative relations
• Relations need not be linear
• Calculation
o Data points first have to be ranked according to
the values of the two variables under study
o Next, the (ordinary) correlation has to be
calculated for the ranks
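The two calculation steps can be sketched as follows. The data are hypothetical (y doubles with every unit of x), and tied values receive the average of their ranks, which is the usual convention and an assumption here.

```python
import math

def pearson(x, y):
    """Ordinary (Pearson) correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: a monotonic but strongly nonlinear relation.
x = list(range(1, 11))
y = [2 ** v for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

print(r_pearson, r_spearman)
```

Because y increases whenever x increases, the rank correlation equals 1 even though the ordinary correlation is clearly below 1.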
Rank correlation
[Figure: scatter plot of y (vertical axis from 0 to 100) against x (horizontal axis from 0 to 12)]
Rank correlation

Multivariate
Correlations
     x       y
x  1.0000  0.7169
y  0.7169  1.0000

Nonparametric: Spearman's ρ
Variable  by Variable  Spearman ρ  Prob>|ρ|
y         x            1.0000      <.0001*
Warning: sample size of 10 is too small, P value suspect.
