I. Descriptive Statistics
University of Castilla-La Mancha
Department of Mathematics
Institute of Applied Mathematics to Science and Engineering
ETSII

Outline

1. Frequency distribution.
2. Graphics.
3. Numerical measures: Position, centrality, dispersion and shape.
4. Bidimensional distributions: Regression and correlation.

Introduction

Modern Statistics: union of two disciplines which were developed independently:

- Descriptive Statistics.
- Probability.

Inference, decision making: infer conclusions for the population (probability) from a sample (descriptive statistics).

Comparative

            Sample                          Population
Size        n                               N
Variables   Statistical                     Random
Measures    Statistic                       Parameter
Aspect      Latin letters                   Greek letters
            ($\bar{x}$, $S^2$, ...)         ($\mu$, $\sigma^2$, ...)
Graphics    Histogram                       Probability density function (pdf),
                                            Cumulative distribution function (cdf)

Descriptive statistics (Summary)

Example (failure time)

876  537  811  685  336  868  352  885  562  559
578  642  504  448  526  804  374  751  739  505
718  856  807  571  624  210  267  561  562  703
388  376  719  189  605  421  684 1020  817  809
562  508  464  661  496  435  685  592  690  706
971  529  410  877  296  291  460  814  720  626
698  393  491  563  628  393  570  843  758  631
298  354  557  647  481  605  928  466  731  585
673  725  771  447  224  341  516  498  480  639

Concepts

- Experimental unit.
- Measurement.
- Types of variables:
  - Qualitative (categorical):
    - Nominal: type of engine.
    - Ordinal: education degree.
  - Quantitative:
    - Discrete: number of failures.
    - Continuous: temperature, time.

Frequencies (raw data, categorical or discrete)

Example: Last digit of lottery prizes.

Digit   Absolute    Cumulative   Percentage   Cumulative percentage
        frequency   frequency    frequency    frequency
0       19          19            9.5           9.5
1        8          27            4.0          13.5
2       13          40            6.5          20.0
3       20          60           10.0          30.0
4       26          86           13.0          43.0
5       31          117          15.5          58.5
6       26          143          13.0          71.5
7       20          163          10.0          81.5
8       20          183          10.0          91.5
9       17          200           8.5         100
Total   200                     100
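
The columns of the lottery-digit table can be rebuilt from the absolute frequencies alone. The following is a minimal sketch in Python; the variable names are illustrative and not part of the original slides.

```python
from itertools import accumulate

# Absolute frequencies of the last digits 0..9, copied from the lottery example.
absolute = [19, 8, 13, 20, 26, 31, 26, 20, 20, 17]

n = sum(absolute)                                   # total number of observations
cumulative = list(accumulate(absolute))             # running totals
percentage = [100 * f / n for f in absolute]        # relative frequency in %
cumulative_pct = [100 * c / n for c in cumulative]  # cumulative relative frequency in %

# Print one row per digit: absolute, cumulative, percentage, cumulative percentage.
for digit, row in enumerate(zip(absolute, cumulative, percentage, cumulative_pct)):
    print(digit, *row)
print("Total", n)
```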

Bar chart

Pie chart

[Figure: pie charts labelled "Engine type JRSS", "Engine type JASA", "Engine type JSPI", "Engine type Test"]

Continuous variables: grouped data

Example: scores between 0 and 100

Class     Mark   Absolute    Percentage   Cumulative   Cumulative percentage
limits           frequency   frequency    frequency    frequency
0-10       5      8           4             8            4
10-20     15     10           5            18            9
20-30     25     12           6            30           15
30-40     35     22          11            52           26
40-50     45     32          16            84           42
50-60     55     50          25           134           67
60-70     65     28          14           162           81
70-80     75     18           9           180           90
80-90     85     12           6           192           96
90-100    95      8           4           200          100

Class range: length of the interval.
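
The grouped table for the scores example can be derived from the class limits and the absolute frequencies. This is a minimal Python sketch; it assumes that the class mark is the midpoint of each interval, which matches the marks 5, 15, ..., 95 in the example, and the variable names are illustrative.

```python
from itertools import accumulate

# Class limits 0-10, 10-20, ..., 90-100 and absolute frequencies from the example.
limits = [(lo, lo + 10) for lo in range(0, 100, 10)]
absolute = [8, 10, 12, 22, 32, 50, 28, 18, 12, 8]

marks = [(lo + hi) / 2 for lo, hi in limits]        # class mark = interval midpoint
class_range = [hi - lo for lo, hi in limits]        # class range = interval length
n = sum(absolute)
percentage = [100 * f / n for f in absolute]
cumulative = list(accumulate(absolute))
cumulative_pct = [100 * c / n for c in cumulative]

for (lo, hi), m, f, p, c, cp in zip(limits, marks, absolute, percentage,
                                    cumulative, cumulative_pct):
    print(f"{lo}-{hi}", m, f, p, c, cp)
```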

Histogram

Histogram + Frequency Polygon

Frequency Polygon

How many classes?

Rule of thumb:

- $n \le 50$: 5 to 8 classes
- $n > 50$: 8 to 12 classes

Histogram (Cumulative Frequency)

Histogram + Frequency Polygon

Cumulative Frequency Polygon

Stem-and-leaf diagram

3, 7, 11, 12, 13, 14, 15, 16, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 21, 21,
21, 22, 22, 23, 23 ...
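
A stem-and-leaf diagram for these values can be sketched in a few lines of Python. The sketch uses only the values listed above (the list on the slide is truncated), takes the tens digit as the stem and the units digit as the leaf, and the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

# Only the values shown on the slide; the original list continues ("...").
data = [3, 7, 11, 12, 13, 14, 15, 16, 17, 17, 18, 18, 18, 19, 19, 19,
        20, 20, 21, 21, 21, 22, 22, 23, 23]


def stem_and_leaf(values):
    """Group sorted values by stem (tens digit), keeping leaves (units digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)


diagram = stem_and_leaf(data)
for stem in sorted(diagram):
    leaves = "".join(str(leaf) for leaf in diagram[stem])
    print(f"{stem} | {leaves}")
```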


Scatterplot

Importance of the scale!


Numerical measures: position, centrality, dispersion, shape

Influence of outliers in the mean

Median

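Centrality statistics

Example: 56, 62, 63, 65, 65, 65, 65, 68, 70, 72

- Mean:
  $$\bar{x} = \frac{\sum_i x_i}{n} = \frac{56+62+63+65+65+65+65+68+70+72}{10} = 65.1$$
  or, grouping repeated values with their absolute frequencies $f_i$,
  $$\bar{x} = \frac{\sum_i f_i x_i}{n} = \frac{56 + 62 + 63 + 4 \cdot 65 + 68 + 70 + 72}{10} = 65.1$$
- Median: middle observation (n odd or even): Me = 65.
- Mode: most frequent value: Mo = 65.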
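
The three centrality statistics of the example can be checked with Python's standard `statistics` module. This is a small sketch; the frequency-weighted mean is added to mirror the second form of the formula, and the variable names are illustrative.

```python
import statistics
from collections import Counter

data = [56, 62, 63, 65, 65, 65, 65, 68, 70, 72]

mean = statistics.mean(data)      # arithmetic mean
median = statistics.median(data)  # for even n: average of the two middle values
mode = statistics.mode(data)      # most frequent value

# Same mean computed from absolute frequencies f_i of each distinct value x_i.
freq = Counter(data)
mean_from_freq = sum(f * x for x, f in freq.items()) / len(data)

print(mean, median, mode, mean_from_freq)
```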

Dispersion

- Deviation from the mean: $x_i - \bar{x}$.
- Median absolute deviation: $\mathrm{MAD} = \dfrac{\sum_i f_i |x_i - Me|}{n}$.
- Variance: $S^2 = \dfrac{\sum_i f_i (x_i - \bar{x})^2}{n}$.
- Quasi-variance (sample variance): $S_c^2 = \dfrac{\sum_i f_i (x_i - \bar{x})^2}{n-1}$.
  - Standard deviation = positive square root of the variance: $S$.
  - Quasi-standard deviation (sample standard deviation) = positive square root of $S_c^2$: $S_c$.
- Coefficient of variation: $CV = \dfrac{S}{|\bar{x}|}$.
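
As an illustration, the dispersion measures can be computed for the ten values of the centrality example (56, 62, ..., 72). This is a sketch under the assumption that the variance divides by $n$ and the quasi-variance by $n-1$, and that MAD denotes the average absolute deviation from the median; variable names are illustrative.

```python
import math
import statistics

data = [56, 62, 63, 65, 65, 65, 65, 68, 70, 72]
n = len(data)
mean = statistics.mean(data)
me = statistics.median(data)

mad = sum(abs(x - me) for x in data) / n    # average absolute deviation from the median
variance = statistics.pvariance(data)       # S^2, divides by n
quasi_variance = statistics.variance(data)  # S_c^2, divides by n - 1
s = math.sqrt(variance)                     # standard deviation S
s_c = math.sqrt(quasi_variance)             # quasi-standard deviation S_c
cv = s / abs(mean)                          # coefficient of variation

print(mad, variance, quasi_variance, s, s_c, cv)
```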

Practical meaning of the Standard Deviation

Chebyshev's theorem (any data): at least $100(1 - 1/k^2)\%$ of the data lie within $k$ standard deviations of the mean.

That is,

- at least 75% of the data lie within 2 SD;
- at least 88.9% (about 89%) of the data lie within 3 SD.
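
The Chebyshev bound is easy to evaluate and to compare with an actual data set. This sketch uses the ten values of the centrality example as illustrative data; the function names `chebyshev_percent` and `percent_within` are hypothetical helpers, not part of the slides.

```python
import statistics


def chebyshev_percent(k):
    """Lower bound (in %) on the share of data within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


def percent_within(data, k):
    """Observed share (in %) of data within k standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.pstdev(data)  # standard deviation with divisor n
    inside = sum(1 for x in data if abs(x - mean) <= k * s)
    return 100 * inside / len(data)


data = [56, 62, 63, 65, 65, 65, 65, 68, 70, 72]
for k in (2, 3):
    print(k, chebyshev_percent(k), percent_within(data, k))
```

The observed share can be much higher than the bound, since the theorem holds for any data and is therefore conservative.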

Practical significance of the Standard Deviation

Assuming a bell-shaped distribution:

Position statistics

- $P$th percentile: the value $x$ that exceeds $P\%$ of the measurements.
- Percentile of $x$ = (number of values before $x$) / $n$.
- Lower quartile: $Q_1$ = 25th percentile.
- Upper quartile: $Q_3$ = 75th percentile.
- Median = $Q_2$ = 50th percentile.

Example

1. Sort the data:
   280 283 287 288 288 289 289 290 290 290 292 293 293 293
2. Find position (n + 1)/4 = (14 + 1)/4 = 3.75:
   280 283 287 288 288 289 ...
3. Q1 = 287 + 0.75 (288 - 287) = 287.75
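
The three steps of the example can be sketched in Python as follows. The helper `quantile_by_position` is an illustrative name; it interpolates linearly at the 1-based position $(n + 1)p$, which is the rule used in the example. Python's `statistics.quantiles` with `method="exclusive"` follows the same position rule and is used as a cross-check.

```python
import statistics


def quantile_by_position(values, p):
    """Quantile at 1-based position (n + 1) * p with linear interpolation."""
    data = sorted(values)
    n = len(data)
    pos = (n + 1) * p
    lower = int(pos)     # integer part of the position
    frac = pos - lower   # fractional part used for interpolation
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


data = [280, 283, 287, 288, 288, 289, 289, 290, 290, 290, 292, 293, 293, 293]
q1 = quantile_by_position(data, 0.25)
q3 = quantile_by_position(data, 0.75)
library_quartiles = statistics.quantiles(data, n=4, method="exclusive")
print(q1, q3, library_quartiles)
```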

Inter-quartile range

$IQR = Q_3 - Q_1$.

$SIQR = \dfrac{Q_3 - Q_1}{2}$ (semi-inter-quartile range).

Sample z-score

$$z = \frac{x - \bar{x}}{S}$$

- Not unusual: $-2 \le z \le 2$
- Suspect outlier: $-3 \le z < -2$ or $2 < z \le 3$
- Extreme outlier: $z < -3$ or $z > 3$
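
The z-score rule can be written as a small classifier. This is a sketch; the function names are illustrative, and the treatment of the boundary values $|z| = 2$ and $|z| = 3$ is an assumption, since the inequality signs did not survive in the slide text.

```python
def z_score(x, mean, s):
    """Sample z-score of x given the sample mean and standard deviation."""
    return (x - mean) / s


def classify_z(z):
    """Label an observation by the size of its z-score."""
    if abs(z) <= 2:
        return "not unusual"
    if abs(z) <= 3:
        return "suspect outlier"
    return "extreme outlier"


# Hypothetical values: mean 65, standard deviation 5.
for x in (70, 77, 85):
    z = z_score(x, 65, 5)
    print(x, z, classify_z(z))
```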

Outlier effect

Outlier

Two possible causes:

- Error recording the observation.
- Correct value. Example: Age first motorcycle, ftp.

How to build a Box-plot

1. Sort the data.
2. Minimum, maximum and quartiles:
   25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
3. Inter-quartile range: IQR = Q3 - Q1.
4. Box limited by Q1 and Q3.
5. Straight line at the median (Q2).
6. Lower limit = Q1 - 1.5 IQR, upper limit = Q3 + 1.5 IQR. Outliers lie outside these limits.
7. Line from the box to the smallest and largest data values within the lower and upper limits.

Example (repairing time after failure)

Minimum        189
Maximum        1020
Quartiles
  Q1           463.00
  Q2           574.50
  Q3           719.25
IQR            256.25
Lower limit    78.625
Upper limit    1103.625
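
The IQR and the two limits in this example follow directly from Q1 and Q3. This is a minimal Python sketch of that arithmetic; the function name `box_plot_limits` is illustrative.

```python
def box_plot_limits(q1, q3, k=1.5):
    """Return (IQR, lower limit, upper limit) for a box-plot."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


# Quartiles, minimum and maximum taken from the example.
q1, q3 = 463.00, 719.25
minimum, maximum = 189, 1020

iqr, lower, upper = box_plot_limits(q1, q3)
has_outliers = minimum < lower or maximum > upper
print(iqr, lower, upper, has_outliers)
```

Since the minimum and maximum both fall inside the limits, this data set has no outliers and the whiskers extend to 189 and 1020.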

Box-Plot

Box-Plot (another example with outlier)

Shape measures: 1. Skewness

- Symmetric: $\bar{x} = Me = Mo$
- Skewed (biased to the left): $\bar{x} < Me < Mo$
- Skewed (biased to the right): $Mo < Me < \bar{x}$

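Skewness coefficients

- Pearson 1: $\dfrac{\bar{x} - Mo}{S}$.
- Pearson 2: $\dfrac{3(\bar{x} - Me)}{S}$.
- Fisher: $g_1 = \dfrac{m_3}{S^3}$, where $m_3 = \dfrac{\sum_i f_i (x_i - \bar{x})^3}{n}$.
  - $g_1 < 0$: skewed to the left.
  - $g_1 \approx 0$: symmetric.
  - $g_1 > 0$: skewed to the right.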
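
The skewness coefficients can be sketched in Python as below. The function names are illustrative, the data sets are made up for the demonstration, and $S$ is taken as the standard deviation with divisor $n$.

```python
import math
import statistics


def central_moment(data, r):
    """r-th central moment: sum((x - mean)^r) / n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n


def fisher_skewness(data):
    """g1 = m3 / S^3."""
    s = math.sqrt(central_moment(data, 2))
    return central_moment(data, 3) / s**3


def pearson_2(data):
    """Pearson's second coefficient: 3 (mean - median) / S."""
    s = math.sqrt(central_moment(data, 2))
    return 3 * (statistics.mean(data) - statistics.median(data)) / s


symmetric = [1, 2, 3]          # hypothetical symmetric data
right_tail = [1, 1, 1, 10]     # hypothetical data with a long right tail
left_tail = [1, 10, 10, 10]    # hypothetical data with a long left tail
print(fisher_skewness(symmetric), fisher_skewness(right_tail),
      fisher_skewness(left_tail))
```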

Shape measures: 2. Kurtosis (Mesokurtic)

Leptokurtic

Platykurtic

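Kurtosis coefficients

- Percentile kurtosis: $\dfrac{Q_3 - Q_1}{2(P_{90} - P_{10})}$.
- Fisher: $g_2 = \dfrac{m_4}{S^4} - 3$, where $m_4 = \dfrac{\sum_i f_i (x_i - \bar{x})^4}{n}$.
  - $g_2 < 0$: platykurtic.
  - $g_2 \approx 0$: mesokurtic.
  - $g_2 > 0$: leptokurtic.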
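
Fisher's kurtosis coefficient can be computed from the second and fourth central moments, using $S^4 = m_2^2$. This is a minimal Python sketch with made-up data sets; the function names are illustrative.

```python
def central_moment(data, r):
    """r-th central moment: sum((x - mean)^r) / n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n


def fisher_kurtosis(data):
    """g2 = m4 / S^4 - 3, with S^2 = m2."""
    m2 = central_moment(data, 2)
    return central_moment(data, 4) / m2**2 - 3


flat = [-1, 1]                  # hypothetical two-point data, no tails
peaked = [0] * 8 + [-3, 3]      # hypothetical data: sharp peak with heavy tails
print(fisher_kurtosis(flat), fisher_kurtosis(peaked))
```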

Covariance

$$S_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n}.$$

Three cases are distinguished by the sign of the covariance: $S_{xy} > 0$, $S_{xy} < 0$ and $S_{xy} \approx 0$.

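Covariance matrix

- For computation: $S_{xy} = \overline{xy} - \bar{x}\,\bar{y}$.
- Consider $k$ variables and $S_{ij}$ the covariance of variables $i$ and $j$.
- $S_{ii} = S_i^2$ is the variance of variable $i$.
- Covariance matrix:
  $$\begin{pmatrix} S_1^2 & S_{12} & \cdots & S_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ S_{k1} & S_{k2} & \cdots & S_{kk} \end{pmatrix}$$
- Symmetric and positive semidefinite.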
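
The definition of the covariance, the computational shortcut and the covariance matrix can be compared on a small data set. This is a sketch in plain Python; the data and the function names are hypothetical and chosen only for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Definition: average product of deviations from the means (divisor n)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def covariance_shortcut(x, y):
    """Computational form: mean of products minus product of means."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)


def covariance_matrix(variables):
    """Matrix of covariances S_ij; the diagonal holds the variances."""
    return [[covariance(vi, vj) for vj in variables] for vi in variables]


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

matrix = covariance_matrix([x, y])
print(covariance(x, y), covariance_shortcut(x, y))
print(matrix)
```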

Linear correlation coefficient

$$r_{xy} = \frac{S_{xy}}{S_x S_y} \in [-1, 1].$$

- No units.
- $r_{xy} > 0$: positively correlated.
- $r_{xy} < 0$: negatively correlated.
- $r_{xy} \approx 0$: uncorrelated.
- $r_{xy} = 1$ or $-1$: perfect correlation.
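
The correlation coefficient can be computed directly from its definition. This is a minimal Python sketch; the data are hypothetical and the function name `correlation` is illustrative.

```python
import math


def correlation(x, y):
    """Linear correlation coefficient r = S_xy / (S_x * S_y), divisor n throughout."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sxy / (sx * sy)


# Hypothetical data: y lies exactly on the line y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

r_increasing = correlation(x, y)                   # exact increasing line
r_decreasing = correlation(x, [-v for v in y])     # exact decreasing line
# Changing the units of x (for example, multiplying by 100) leaves r unchanged.
r_rescaled = correlation([100 * v for v in x], y)
print(r_increasing, r_decreasing, r_rescaled)
```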

Regression: Least squares

Least squares estimators (LSE)

- Model: $y = a + bx + e$.
- Least squares:
  $$\min_{a,b} Q(a, b) = \sum_i [y_i - (a + b x_i)]^2.$$
- Normal equations:
  $$\bar{y} = a + b\,\bar{x}, \qquad \overline{xy} = a\,\bar{x} + b\,\overline{x^2}.$$
- Estimators:
  $$\hat{b} = \frac{S_{xy}}{S_x^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}.$$

Least squares estimators (LSE)

- Regression line:
  $$y - \bar{y} = \frac{S_{xy}}{S_x^2}(x - \bar{x}).$$
- Residuals:
  $$e_i = y_i - \hat{y}_i = y_i - (\hat{a} + \hat{b} x_i).$$
- Residual variance (variance estimator):
  $$S_R^2 = \frac{\sum_i e_i^2}{n - 2}.$$
- Determination coefficient: $R^2$ ($= r^2$ here).
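
The estimators, residuals, residual variance and determination coefficient can be put together in one short Python function. This is a sketch on hypothetical data; the function name `least_squares` and its return layout are illustrative choices.

```python
def least_squares(x, y):
    """Fit y = a + b x by least squares; return (a, b, residual variance, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx2 = sum((xi - mx) ** 2 for xi in x) / n
    sy2 = sum((yi - my) ** 2 for yi in y) / n

    b = sxy / sx2       # slope estimator
    a = my - b * mx     # intercept estimator
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_r2 = sum(e**2 for e in residuals) / (n - 2)  # residual variance
    r2 = sxy**2 / (sx2 * sy2)                      # R^2 = r^2 in simple regression
    return a, b, s_r2, r2


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, s_r2, r2 = least_squares(x, y)
print(a, b, s_r2, r2)
```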

Remarks

x and y are not exchangeable.

The model forces a type of line.

Prediction beyond the range of observation is possible (carefully).

r does not measure nonlinear correlation:

Nonlinear correlation
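
A small numerical illustration of this point: when $y$ depends exactly on $x$ through a parabola that is symmetric around the mean of $x$, the linear correlation coefficient is zero even though the relationship is perfect. The data below are hypothetical and the function name is illustrative.

```python
import math


def correlation(x, y):
    """Linear correlation coefficient r = S_xy / (S_x * S_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sxy / (sx * sy)


# y is a deterministic, nonlinear function of x.
x = [-2, -1, 0, 1, 2]
y = [v**2 for v in x]

r = correlation(x, y)
print(r)
```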
