DESCRIPTIVE STATISTICS
Introduction
• Population
  • The totality of all possible data on certain objects (e.g. the number of articles in a warehouse, the body height of all RWTH students), or the totality of all measured values from repeated measurements of a specific object (e.g. the length of a precast concrete part)
  • Finite population: the total number of results / values is finite (e.g. number of items in the warehouse)
  • Infinite population: the total number of results / values is infinite (e.g. measured values for the length of the precast concrete part)
• Sample
  • A representative portion of the observations from the population, used to describe the population
To describe the data (e.g. a sample), they are broken down according to their features.
• Features
  • Quantitative features: the characteristic values are real numbers
    • Discrete quantitative features: the set of values is finite or countable (e.g. number of students at RWTH)
    • Continuous quantitative features: the set of values is uncountably infinite, since every number in an interval can occur (e.g. body height)
  • Multiplication does not make sense; differences and ratios of differences are allowed
• Absolute frequency $h(x_j^*)$: the number of observed values that take the respective characteristic value $x_j^*$ (see Example 3.2), with
$$\sum_{j=1}^{m} h(x_j^*) = n$$
• Dividing the respective absolute frequency by the total number $n$ of observed values gives the relative frequency $f(x_j^*)$:
$$f(x_j^*) = \frac{h(x_j^*)}{n}, \quad \text{with } 0 \le f(x_j^*) \le 1 \text{ and } \sum_{j=1}^{m} f(x_j^*) = 1$$
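The two definitions can be tried out with a few lines of Python. This is a minimal sketch, not part of the lecture: the function name `frequency_table` and the ten die rolls are made up for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return sorted rows (x*, absolute frequency h, relative frequency f)."""
    n = len(values)
    counts = Counter(values)  # h(x*) for every distinct characteristic value
    return [(x, h, h / n) for x, h in sorted(counts.items())]

# Hypothetical sample: ten rolls of a die (illustrative data only).
rolls = [1, 3, 3, 6, 2, 3, 5, 6, 1, 3]
table = frequency_table(rolls)

for x, h, f in table:
    print(f"x* = {x}: h = {h}, f = {f:.2f}")

# The absolute frequencies add up to n, the relative ones to 1.
print(sum(h for _, h, _ in table), sum(f for _, _, f in table))
```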
• Cumulative frequencies
  • The frequency with which a feature lies above or below certain values on the scale can be determined with the cumulative frequency (only for quantitative and rank features). For this purpose, the characteristic values of a feature X must be sorted by size:
$$x_1^* < x_2^* < \dots < x_m^*$$
  • With the absolute cumulative frequency $H(x_k^*) = \sum_{j=1}^{k} h(x_j^*)$, the relative cumulative frequency is
$$F(x_k^*) = \frac{1}{n} H(x_k^*), \quad \text{with } F(x_k^*) \le F(x_{k+1}^*) \; (k = 1, 2, \dots, m-1); \quad F(x_m^*) = 1$$
• Example: feature X = number of pips when rolling a real die. Characteristic values $x_j^*$ (recorded in a tally list):
$$x_1^* = 1, \; x_2^* = 2, \; \dots, \; x_6^* = 6$$
• Distribution functions
  • From the relative cumulative frequency, the empirical distribution function of the feature X is derived, which assigns a value to any real number x:
$$F(x) = \sum_{x_j^* \le x} f(x_j^*)$$
  • The value F(x) indicates the proportion of observation units whose characteristic values are not greater than x.
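A short Python sketch shows how such a step function can be evaluated at any real x. The helper name `empirical_cdf` and the sample are assumptions made for this illustration.

```python
from bisect import bisect_right

def empirical_cdf(values):
    """Return a function F with F(x) = share of observations that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right gives the number of sorted observations that are <= x
        return bisect_right(ordered, x) / n

    return F

# Hypothetical sample: ten rolls of a die (illustrative data only).
rolls = [1, 3, 3, 6, 2, 3, 5, 6, 1, 3]
F = empirical_cdf(rolls)

for x in (0.5, 3, 4.2, 6):
    print(f"F({x}) = {F(x)}")
```

F stays constant between two neighbouring characteristic values and jumps by $f(x_j^*)$ at each $x_j^*$, which is why F(3) and F(4.2) coincide for this sample.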
• Important properties
[Figures: empirical distribution function F(x) drawn as a step plot with a vertical axis from 0 to 1; one plot over the values 1 to 6, one over Route [miles]. (Source: http://www.mhsg.de)]
• Class formation
  • With extensive data (especially for quantitative features), it is advisable to divide the range between the smallest and the largest characteristic value into classes (intervals) that adjoin at their limits
  • All further calculations then refer to the class (no longer to the individual value), represented by the class midpoint
  • Upper and lower class limit (class edge): maximum and minimum characteristic value within the class
  • Class width: difference between the upper and lower class limit
  • Choose the class limits so that, if possible, no characteristic value lies exactly on a limit
• Example of the European Football Championship
   1  17
   1  51
   2  15
   2  24
   2  52
   2  79
   2  82
   3   3
   3   6
   3  53
   4  37
   4  57
   5  72
   6  45
   7  24
   8  72
   9  24
   9  36
   9  41
   9  87
   9  80
  ... ...
• Class frequencies
  • Analogous to unclassified data, absolute and relative class frequencies can be calculated
    $m$: number of classes
    $x_j^*$: class midpoint
    $n = \sum_{j=1}^{m} h(x_j^*)$: total number of characteristic values $x_i$
  • Original list: $x_j$ = 39.4, 42.2, 39.1, 30.6, 39.5, 33.5, 45.2, 31.8, 41.8, 34.0, 30.3, 41.7, …, 36.3 with $j = 1, 2, \dots, 90$, i.e. $n = 90$
  • Number of classes: $m \approx 5 \cdot \log_{10} n = 5 \cdot \log_{10} 90 = 9.77 \approx 10$
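The rule of thumb and the counting per class can be sketched in Python as follows. The function names are made up for this illustration. Rounding 9.77 to the nearest integer is an assumption, as are the class limits (lower limit 30, width 2). Only the twelve values quoted above from the original list are used.

```python
import math

def number_of_classes(n):
    """Rule of thumb m = 5 * log10(n), here rounded to the nearest integer."""
    return round(5 * math.log10(n))

def class_frequencies(values, lower, width, m):
    """Count values per class [lower + j*width, lower + (j+1)*width)."""
    counts = [0] * m
    for x in values:
        j = int((x - lower) // width)
        j = min(max(j, 0), m - 1)  # values outside the range go to the edge classes
        counts[j] += 1
    midpoints = [lower + (j + 0.5) * width for j in range(m)]
    return midpoints, counts

print(number_of_classes(90))

# The twelve values quoted from the original list (not the full n = 90).
excerpt = [39.4, 42.2, 39.1, 30.6, 39.5, 33.5, 45.2, 31.8, 41.8, 34.0, 30.3, 41.7]
# Assumed classes for illustration: 30 to 46 in steps of 2.
midpoints, counts = class_frequencies(excerpt, lower=30, width=2, m=8)
for mid, h in zip(midpoints, counts):
    print(f"class midpoint {mid}: h = {h}, f = {h / len(excerpt):.3f}")
```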
• Diagrams
  • Scatter plot (Source: RWTH Aachen, Table of figures 2011)
  • Bar chart (Source: RWTH Aachen, Table of figures 2011)
  • Bar graph (Source: Statistics in Geodesy, Geographic information and Construction)
  • Polygon diagram (Source: RWTH Aachen, Table of figures 2011)
  • Histogram (absolute frequency and cumulative frequency): goals of the European Football Championship 2012
    [Figure: two histograms of the frequency per class, each with a cumulative frequency curve from 0% to 100%. Left: classes 0-15, 16-30, 31-45, 46-60, 61-75, 76-90 and greater. Right: classes 0-45, 46-90 and greater.]
  • Stem-and-leaf diagram
    Sorted data:
    16 17 22 23 23 24 25 26 26 27 28
    29 29 30 31 31 32 32 33 34 34 35
    36 37 37 38 39 40 41 42 42 43 43
    44 45 45 49 50 50 50 51 55 58 60

    Stems | Leaves
       60 | 0
       50 | 0 0 0 1 5 8
       40 | 0 1 2 2 3 3 4 5 5 9
       30 | 0 1 1 2 2 3 4 4 5 6 7 7 8 9
       20 | 2 3 3 4 5 6 6 7 8 9 9
       10 | 6 7
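A stem-and-leaf diagram like the one above can be generated with a few lines of Python. This is a sketch under the assumption that the stems are the tens and the leaves the units. The function name `stem_and_leaf` is illustrative, and the data are the 44 values listed above.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group sorted values by stem (tens) and collect the leaves (units)."""
    stems = defaultdict(list)
    for x in sorted(values):
        stems[x // stem_unit * stem_unit].append(x % stem_unit)
    return dict(stems)

data = [16, 17, 22, 23, 23, 24, 25, 26, 26, 27, 28, 29, 29, 30, 31,
        31, 32, 32, 33, 34, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42,
        42, 43, 43, 44, 45, 45, 49, 50, 50, 50, 51, 55, 58, 60]

diagram = stem_and_leaf(data)
for stem in sorted(diagram, reverse=True):  # largest stem first, as in the diagram
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in diagram[stem])}")
```

The length of each leaf row is the absolute class frequency, so the diagram doubles as a histogram turned on its side while keeping the individual values.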
• Parameters of a sample
  • In addition to the graphical representation, further characteristic values (parameters) are needed
• Location parameters: mean values
  • The deviations of the observed values from the arithmetic mean $\bar{x}$ sum to zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - n \cdot \frac{1}{n}\sum_{i=1}^{n} x_i = 0$$
  • The sum of the squared deviations of the observed values from any value M becomes a minimum if M is the arithmetic mean:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 \to \min!$$
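The minimum property can be checked with a short derivation. The following is a sketch using elementary calculus; the symbol $S(M)$ for the sum of squares is introduced here for illustration and does not come from the slides.

```latex
\begin{align*}
S(M) &= \sum_{i=1}^{n} (x_i - M)^2 \\
\frac{dS}{dM} &= -2 \sum_{i=1}^{n} (x_i - M)
  = -2\Big(\sum_{i=1}^{n} x_i - nM\Big) = 0
  \;\Longrightarrow\; M = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} \\
\frac{d^2S}{dM^2} &= 2n > 0 \quad \text{(the stationary point is a minimum)}
\end{align*}
```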
  • If characteristic values can occur several times, a kind of weighted arithmetic mean can be calculated as a location parameter with the help of the frequencies
  • It results from the absolute or relative frequencies of occurrence of the characteristic values:
$$\bar{x} = \frac{h(x_1^*)\, x_1^* + h(x_2^*)\, x_2^* + \dots + h(x_m^*)\, x_m^*}{n} = \frac{1}{n}\sum_{j=1}^{m} h(x_j^*)\, x_j^*$$
$$\bar{x} = f(x_1^*)\, x_1^* + f(x_2^*)\, x_2^* + \dots + f(x_m^*)\, x_m^* = \sum_{j=1}^{m} f(x_j^*)\, x_j^*$$
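Both forms of the weighted mean give the same result, which a small Python sketch can confirm. The function names and the frequency table (ten hypothetical die rolls) are assumptions made for this illustration.

```python
import math

def mean_from_absolute(values, abs_freqs):
    """Weighted mean (1/n) * sum of h(x*) * x*."""
    n = sum(abs_freqs)
    return sum(h * x for x, h in zip(values, abs_freqs)) / n

def mean_from_relative(values, rel_freqs):
    """Weighted mean as the sum of f(x*) * x*."""
    return sum(f * x for x, f in zip(values, rel_freqs))

# Hypothetical frequency table: characteristic values and their counts.
x_star = [1, 2, 3, 5, 6]
h = [2, 1, 4, 1, 2]
f = [count / sum(h) for count in h]

mean_h = mean_from_absolute(x_star, h)
mean_f = mean_from_relative(x_star, f)
print(mean_h, mean_f, math.isclose(mean_h, mean_f))
```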
• Median
  • Central value of a series of observations, i.e. the median $x_M$ divides an ordered series of observations $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ into two equal parts
  • If n is odd:
$$x_M = x_{((n+1)/2)}$$
  • If n is even (see Example 3.8, continuation):
$$x_M = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$$
  • If there are no outliers in the data, the median and the arithmetic mean are roughly the same (→ median for error detection)
  • In contrast to the arithmetic mean, the median is insensitive to outliers (= robust)
  • The breakdown point is 50%, i.e. theoretically at most half of the data can be outliers
  • For classed data with class midpoint $x_k^*$ and class width $\Delta x$, the median is interpolated within the class k in which F reaches 0.5:
$$x_M = x_k^* - \frac{\Delta x}{2} + \frac{0.5 - F(x_k^* - \Delta x/2)}{f(x_k^*)} \cdot \Delta x$$
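The median rules can be sketched in Python as follows. The function names are illustrative. The numbers used for the classed case (lower class limit 2000, class width 500, F = 0.40 at the lower limit, relative class frequency 0.25) are hypothetical and not taken from the lecture example.

```python
import math

def median(values):
    """Median of raw data: middle value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # x_((n+1)/2) in 1-based notation
    return 0.5 * (ordered[mid - 1] + ordered[mid])

def grouped_median(lower_limit, width, F_lower, f_class):
    """Linear interpolation inside the class in which F passes 0.5."""
    return lower_limit + (0.5 - F_lower) / f_class * width

print(median([1, 2, 3, 4]))       # n even
print(median([1, 2, 3, 4, 100]))  # n odd, the outlier 100 does not move the median

# Hypothetical classed data: F = 0.40 at the lower limit 2000, f = 0.25, width 500.
x_m = grouped_median(lower_limit=2000, width=500, F_lower=0.40, f_class=0.25)
print(round(x_m, 2))
```

Here `lower_limit` plays the role of $x_k^* - \Delta x/2$ in the interpolation formula.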
[Figure: relative cumulative frequency $F(x_j^*)$ of income over the class limits 0, 900, 1250, 1500, 2000, 2500, 3500, 5000 and 17500 (classes k = 1 to 8). A detail view of the class from 2000 to 2500 shows where F(x) = 0.5 is reached, with $F(x_k^* - \Delta x/2) < 0.5 < F(x_k^* + \Delta x/2)$, and the median $x_M$ obtained by linear interpolation within this class.]
Univ.-Prof. Dr.-Ing. J. Blankenbach, Applied Statistics - WS 20/21
Dispersion parameters
• Measures of dispersion
  • Describe the spread of the characteristic values (e.g. the distance between the characteristic values and the center)
• Range
  • Scatter range within which all characteristic values of an observation series lie:
$$R = x_{\max} - x_{\min}$$
  • The scattering behavior of the values in between is not taken into account
• Quartiles = quantiles that divide the (sorted) data into four equal parts
  • First or lower quartile ($Q_1 = Q_{0.25}$), middle quartile (= median, $Q_2 = Q_{0.5}$), third or upper quartile ($Q_3 = Q_{0.75}$)
  • The range between $Q_1$ and $Q_3$ is called the interquartile range; it is less sensitive to outliers than the range
[Figure: box plot marking the lower limit, lower quartile (Q1), median (Q2), upper quartile (Q3) and upper limit, together with the range and the interquartile range]
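Range and interquartile range can be compared on a small sample. The sketch below uses `statistics.quantiles` from the Python standard library with `method="inclusive"`. Several quartile conventions exist, and which one the lecture uses is not stated, so this choice is an assumption. The data are made up.

```python
import statistics

def spread_summary(values):
    """Range, quartiles and interquartile range of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "range": max(values) - min(values),
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "IQR": q3 - q1,
    }

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # same data, largest value replaced

summary_clean = spread_summary(clean)
summary_outlier = spread_summary(with_outlier)
print(summary_clean)
print(summary_outlier)
```

The single outlier changes the range from 8 to 89, while the quartiles and the interquartile range stay the same.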
• Symmetrical distribution
  • The characteristic values are distributed symmetrically about the arithmetic mean
  • Deviations from the mean of equal magnitude occur just as often with a positive as with a negative sign
  • The relative and absolute frequencies of values lying equally far above and below the mean are equal
• Skewed distribution
  • Distribution rises steeply on the left and falls off gently on the right → right-skewed
  • Distribution rises gently on the left and drops steeply on the right → left-skewed
[Table: eleven numbered values; series labels: Player 1, Player 2]
   1   3
   2  30
   3   6
   4  10
   5   7
   6  11
   7  10
   8   7
   9   3
  10  10
  11  13
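• Example: measurement series with n = 10 (ordered values)
  x(1) = 12.189, 12.818, …, x(4) = 12.820, x(5) = 12.821, x(6) = 12.823, …, x(9) = 12.827, x(10) = 12.829
  • Range:
$$R = x_{(10)} - x_{(1)} = 12.829 - 12.189 = 0.640$$
  • Median:
$$Q_2 = x_M = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) = \frac{1}{2}\left(x_{(5)} + x_{(6)}\right) = 12.822$$
  • Arithmetic mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{127.594}{10} = 12.759$$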
• Corrected empirical variance (see Example 3.8, continuation):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
• The variance or standard deviation is THE quality measure for assessing the quality of observations in the sense of precision (→ see slide 23)
• The larger s, the less precise the observations!
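The corrected empirical variance and the standard deviation can be computed directly from the definition. The sketch below uses made-up data and cross-checks the result against `statistics.variance`, which also divides by n - 1. The function name `corrected_variance` is illustrative.

```python
import math
import statistics

def corrected_variance(values):
    """Corrected empirical variance: squared deviations from the mean divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Hypothetical observations (illustrative data only).
obs = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = corrected_variance(obs)
s = math.sqrt(s2)  # standard deviation, in the same unit as the observations
print(f"s^2 = {s2:.4f}, s = {s:.4f}")
print(math.isclose(s2, statistics.variance(obs)))
```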
• The relative frequencies of the combinations $(x_i^*; y_j^*)$ follow from the absolute frequencies $h(x_i^*; y_j^*)$:
$$f(x_i^*; y_j^*) = \frac{h(x_i^*; y_j^*)}{n}$$
• The totality of all combinations of characteristic values (with their absolute and relative frequencies) gives the two-dimensional frequency distribution, e.g. in the form of a two-dimensional frequency table
• The distribution of only one feature of the two-dimensional frequency table is called the marginal distribution (= row or column totals)
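A small Python sketch shows how relative frequencies and marginal distributions follow from a two-dimensional table of absolute frequencies. The 3 x 3 table and the function name `marginals` are made up for illustration.

```python
def marginals(table):
    """Row and column totals of a two-dimensional frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals

# Hypothetical absolute frequencies h(x_i*; y_j*): rows = values of X, columns = values of Y.
h = [
    [3, 1, 0],
    [2, 4, 2],
    [0, 1, 7],
]

row_totals, col_totals = marginals(h)
n = sum(row_totals)

# Relative frequencies f(x_i*; y_j*) = h(x_i*; y_j*) / n
f = [[cell / n for cell in row] for row in h]

print("marginal distribution of X:", row_totals)
print("marginal distribution of Y:", col_totals)
print("n =", n)
```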
• Covariance
  • is a measure of the joint variation of two features X and Y
  • describes the (linear) relationship or dependency between two features X and Y with a joint distribution
  • is positive if both features tend to vary in the same direction
$$s_{XY} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y) \quad \text{(empirical covariance)}$$
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{(corrected empirical covariance)}$$
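Both variants of the covariance differ only in the divisor, which the following Python sketch makes explicit. The function name, the `corrected` switch and the five value pairs are assumptions made for this illustration. The sample means are used as the centers in both cases.

```python
import math

def covariance(xs, ys, corrected=True):
    """Empirical covariance; divides by n - 1 if corrected, otherwise by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if corrected else n)

# Hypothetical paired observations (illustrative data only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_xy_corrected = covariance(xs, ys)
s_xy_plain = covariance(xs, ys, corrected=False)
print(s_xy_corrected, s_xy_plain)  # positive: large x tend to go with large y
```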
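• For data given as a frequency table (m characteristic values $x_j^*$, r characteristic values $y_k^*$):
$$s^2 = \frac{1}{n}\sum_{j=1}^{m} h(x_j^*)\,(x_j^* - \bar{x})^2 = \frac{1}{n}\left[\sum_{j=1}^{m} h(x_j^*)\,(x_j^*)^2 - n\bar{x}^2\right] \quad \text{(empirical variance)}$$
$$s_{xy} = \frac{1}{n}\sum_{j=1}^{m}\sum_{k=1}^{r} h(x_j^*; y_k^*)\,(x_j^* - \bar{x})(y_k^* - \bar{y}) = \frac{1}{n}\left[\sum_{j=1}^{m}\sum_{k=1}^{r} h(x_j^*; y_k^*)\, x_j^*\, y_k^* - n\bar{x}\bar{y}\right] \quad \text{(empirical covariance)}$$
• Analogously, the corrected empirical variance or covariance is obtained by dividing by (n - 1) instead of n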
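The two forms of the empirical variance for frequency data, the definition and the shortcut with the sum of squares, can be compared numerically. This sketch uses a made-up frequency table; the function name `variance_from_frequencies` is illustrative.

```python
import math

def variance_from_frequencies(values, abs_freqs):
    """Empirical variance (divisor n) from a frequency table, computed in two ways."""
    n = sum(abs_freqs)
    mean = sum(h * x for x, h in zip(values, abs_freqs)) / n
    # Definition: weighted squared deviations from the mean
    direct = sum(h * (x - mean) ** 2 for x, h in zip(values, abs_freqs)) / n
    # Shortcut: weighted sum of squares minus n times the squared mean
    shortcut = (sum(h * x * x for x, h in zip(values, abs_freqs)) - n * mean ** 2) / n
    return direct, shortcut

# Hypothetical frequency table (illustrative data only).
x_star = [1, 2, 3, 5, 6]
h = [2, 1, 4, 1, 2]

direct, shortcut = variance_from_frequencies(x_star, h)
print(direct, shortcut, math.isclose(direct, shortcut))
```

The shortcut needs only one pass over the table, but with large values and a small spread the subtraction can lose precision in floating-point arithmetic.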
• Correlation
  • The magnitude of the covariance is scale-dependent, i.e. the degree of linear dependence of two features X and Y cannot be inferred from its size
  • With the empirical correlation coefficient $r_{XY}$, the covariance is therefore normalized in order to describe the degree of linear relationship between two features
  • r = 0: no correlation
  [Figure: scatter plots with r = +1, r = +0.830, r = +0.453, r = +0.017, r = -1 and r = -0.876; a further plot of blood sugar level [mmol/l] against time after the meal [h] has r = -0.03 (!)]
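The normalization can be illustrated in Python. The sketch assumes the usual Pearson form $r_{XY} = s_{xy} / (s_x \cdot s_y)$, which is not spelled out in the text above. The function name and all data are made up for this illustration.

```python
import math

def correlation(xs, ys):
    """Empirical correlation coefficient: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The divisors (n or n - 1) cancel, so the raw sums are sufficient.
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_rescaled = correlation([1000 * x for x in xs], ys)  # change of unit for X
r_linear = correlation(xs, [2 * x + 1 for x in xs])   # exact linear relationship

# Exact quadratic dependence y = x^2 on a symmetric range: no linear relationship.
sym = [-2, -1, 0, 1, 2]
r_parabola = correlation(sym, [x * x for x in sym])

print(r, r_rescaled, r_linear, r_parabola)
```

Rescaling X changes the covariance by the factor 1000 but leaves r unchanged. The last case shows that r close to 0 only rules out a linear relationship: the two features can still depend on each other in a nonlinear way.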
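• Example: two-dimensional frequency table with absolute frequencies $h(x_j^*; y_k^*)$ and marginal totals

  x_j* \ y_k*   |  y_1*=1  y_2*=2  y_3*=3  y_4*=4  y_5*=5  y_6*=6 | h(x_j*)
  x_1* = 1      |     2       3       4       1       0       0   |   10
  x_2* = 2      |     4      10       4       3       3       1   |   25
  x_3* = 3      |     4       6      15       6       3       1   |   35
  x_4* = 4      |     …       …       …       …       …       …   |    …
  x_5* = 5      |     0       1       2       4       0       1   |    8
  x_6* = 6      |     0       0       1       1       0       0   |    2
  h(y_k*)       |    10      22      33      25       7       3   | Σ = 100

  Excerpt of the original list (No.: pair of values):
  1: (5, 4); 2: (3, 2); 3: (2, 2); 4: (1, 3); 5: (3, 4); …; 7: (2, 4); 8: (6, 5); 9: (5, 4); 10: (1, 3)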
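[Figure: three-dimensional bar chart of the two-dimensional frequency distribution; vertical axis: absolute frequency (0 to 16), horizontal axes: characteristic values 1 to 6]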