chapter 3
describing data
1
• descriptive statistics
• numeric data:
– mean & median = location
– Standard deviation, variance & IQR = spread
– (coefficient of variation = spread)
• categorical data:
– proportion
2
mean
• measure of location (center)
• sample mean: arithmetic average of sample
values
3
mean
$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
$Y_i$: ith observation
$\bar{Y}$: sample mean
$n$: sample size
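A minimal sketch of the sample mean in Python. The function name `sample_mean` and the sample values are made up for illustration.

```python
def sample_mean(values):
    """Arithmetic average: sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("sample_mean needs at least one observation")
    return sum(values) / n


# Hypothetical sample, made up for illustration only
sample = [2.0, 4.0, 6.0, 8.0]
print(sample_mean(sample))  # 5.0
```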
6
standard deviation
• measure of spread
• s.d. = s = typical deviation from the mean (square root of the variance)
• variance = $s^2$
7
standard deviation
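sample variance: sum of squares divided by n − 1
sum of squares: $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
why square? the raw deviations $Y_i - \bar{Y}$ always sum to zero, so they are squared before averaging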
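The "why square?" question can be checked numerically. The sketch below (hypothetical function names, made-up data) shows that the raw deviations cancel out, then builds the sum of squares, the sample variance and the sample s.d. from them.

```python
import math


def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def sample_variance(values):
    """Sum of squares divided by n - 1."""
    return sum_of_squares(values) / (len(values) - 1)


def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical sample, made up for illustration only
sample = [2.0, 4.0, 6.0, 8.0]
mean = sum(sample) / len(sample)
print(sum(y - mean for y in sample))  # 0.0: raw deviations cancel out
print(sum_of_squares(sample))         # 20.0
print(sample_variance(sample))        # 20 / 3
print(sample_sd(sample))
```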
8
standard deviation
sample variance: $s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$
sample s.d.: $s = \sqrt{s^2}$
13
the unbiased sample estimate of variance
https://en.wikipedia.org/wiki/Bias_of_an_estimator
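Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$
Sample variance (estimates the population variance): $s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$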
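To see the two divisors side by side, the sketch below uses Python's standard `statistics` module: `pvariance` divides by the number of values and `variance` divides by n − 1. The helper name `variance_pair` and the data are made up for illustration.

```python
import statistics


def variance_pair(values):
    """Return (population-style variance, sample variance) for the same values."""
    divide_by_n = statistics.pvariance(values)         # divisor: N
    divide_by_n_minus_1 = statistics.variance(values)  # divisor: n - 1
    return divide_by_n, divide_by_n_minus_1


# Hypothetical values, made up for illustration only
values = [2.0, 4.0, 6.0, 8.0]
pop_var, samp_var = variance_pair(values)
print(pop_var)   # 5.0
print(samp_var)  # 20 / 3, slightly larger because the divisor is smaller
```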
15
the normal distribution's
standard deviation
• for normal distributions:
– $\bar{Y} \pm 2$ s.d. contains ~95% of the data
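The 95% rule can be checked with a small simulation. This is only a sketch: the mean of 50, the s.d. of 10, the seed and the sample size are arbitrary choices, not values from the slides.

```python
import random
import statistics


def fraction_within_2sd(values):
    """Fraction of values lying within 2 sample s.d. of the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for y in values if abs(y - mean) <= 2 * sd)
    return inside / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
simulated = [rng.gauss(50.0, 10.0) for _ in range(10_000)]
print(fraction_within_2sd(simulated))  # expected to be close to 0.95
```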
16
www.mathsisfun.com
the issue of rounding
• how to round the mean and s.d.?
– 0.346789452 vs 0.3 vs 0.35
• too many decimal places: difficult to interpret
• too much rounding: error in downstream
calculations
• one approach: one more decimal place than
the original data
– if original: 7.2, 4.3, ... then report the mean as 5.62
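The "one more decimal place" approach can be sketched as a small helper. The name `round_summary` and its `data_decimals` parameter are hypothetical. Only the reported value is rounded, so the unrounded value should be kept for any downstream calculation.

```python
def round_summary(value, data_decimals):
    """Round a summary statistic to one more decimal place than the data."""
    return round(value, data_decimals + 1)


# Data recorded to 1 decimal place, so the summary is reported to 2
print(round_summary(0.346789452, 1))  # 0.35
```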
17
coefficient of variation
• comparing variation among very different
populations with very different means
• e.g. weight of elephants vs weight of humans
– which one is more variable? which one has higher
variance?
18
coefficient of variation
• if means are different s.d. tends to be
different
– sd elephant weight > sd human weight
– sd elephant weight > sd mouse weight
just because units / scales are larger
19
coefficient of variation
• coefficient of variation (CV): a standardized
measure of variability
• CV = 100% × s / $\bar{Y}$ (the s.d. expressed as a percentage of the mean)
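A minimal sketch of the coefficient of variation, taken here as 100 × s.d. / mean. The two data sets are made up and differ only in scale (a factor of 1000), which is why their CVs agree although their s.d. values do not.

```python
import statistics


def coefficient_of_variation(values):
    """CV as a percentage: 100 * sample s.d. / sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * statistics.stdev(values) / mean


# Hypothetical weights in kg, made up for illustration only
large_animals = [4000.0, 5000.0, 6000.0]
small_animals = [4.0, 5.0, 6.0]

print(statistics.stdev(large_animals), statistics.stdev(small_animals))
print(coefficient_of_variation(large_animals))
print(coefficient_of_variation(small_animals))
```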
20
mean and sd from freq tables
• for the mean: multiply each value by its frequency, sum, and divide by n
• for the sd: multiply each squared difference from the mean
by its frequency, sum, divide by n − 1, and take the square root
22
mean and sd from freq tables
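mean: $\bar{Y} = \frac{\sum_j f_j Y_j}{n}$, where $f_j$ is the frequency of value $Y_j$ and $n = \sum_j f_j$
standard deviation: $s = \sqrt{\frac{\sum_j f_j (Y_j - \bar{Y})^2}{n - 1}}$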
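The frequency-table recipe can be sketched as follows. The table is made up, and the function name `mean_sd_from_freq` is hypothetical. The last lines expand the table back into raw observations as a cross-check.

```python
import math
import statistics


def mean_sd_from_freq(table):
    """table maps each distinct value to its frequency (count)."""
    n = sum(table.values())
    mean = sum(value * freq for value, freq in table.items()) / n
    ss = sum(freq * (value - mean) ** 2 for value, freq in table.items())
    sd = math.sqrt(ss / (n - 1))
    return mean, sd


# Hypothetical frequency table: value -> how many times it was observed
table = {0: 2, 1: 3, 2: 1}
mean, sd = mean_sd_from_freq(table)

# Cross-check against the expanded raw data
raw = [value for value, freq in table.items() for _ in range(freq)]
print(mean, statistics.mean(raw))
print(sd, statistics.stdev(raw))
```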
23
median and interquartile range
• median = middle point = 50th percentile = 0.5 quantile
• sort the observations first: $Y_{(1)} \le Y_{(2)} \le \dots \le Y_{(n)}$
• if n is odd:
• median = $Y_{((n+1)/2)}$
• if n is even:
• median = $(Y_{(n/2)} + Y_{(n/2+1)}) / 2$
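A minimal sketch of the odd/even rule. Python lists are 0-based, so the 1-based positions in the formulas shift down by one. The function name `sample_median` is made up for illustration.

```python
def sample_median(values):
    """Median via the odd/even rule on the sorted observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample_median needs at least one observation")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value, position (n + 1) / 2 in 1-based terms
        return ordered[mid]
    # n even: average of positions n/2 and n/2 + 1 in 1-based terms
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([3, 1, 2]))     # 2
print(sample_median([4, 1, 3, 2]))  # 2.5
```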
24
median and interquartile range
• first quartile = 25th percentile = 0.25 quantile
• third quartile = 75th percentile = 0.75 quantile
• interquartile range (IQR) = third quartile − first quartile
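Quartiles and the interquartile range can be sketched with `statistics.quantiles` from the standard library. Software packages interpolate quartiles in slightly different ways, so the exact values may differ from a hand calculation or from the textbook's method. The function name `quartiles_and_iqr` is hypothetical.

```python
import statistics


def quartiles_and_iqr(values):
    """Return (Q1, median, Q3, IQR) using the library's default method."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q1, q2, q3, q3 - q1


# Hypothetical data, made up for illustration only
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3, iqr = quartiles_and_iqr(data)
print(q1, q2, q3, iqr)
```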
25
boxplot
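Normally the whiskers extend to the minimum and maximum values. If any values fall
outside the lower and upper limits (Q1 − 1.5*IQR and Q3 + 1.5*IQR), they are plotted
individually as outliers and the whiskers extend only as far as those limits (in most
software, to the most extreme data value inside them).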
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
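The 1.5*IQR rule can be sketched as below. Two assumptions are made here: quartiles come from `statistics.quantiles` (other software may interpolate differently), and whiskers stop at the most extreme data values inside the limits, which is a common software convention. The function name and data are made up.

```python
import statistics


def boxplot_limits(values, k=1.5):
    """Limits, whisker ends and outliers under the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    inside = [y for y in values if lower <= y <= upper]
    outliers = [y for y in values if y < lower or y > upper]
    return {
        "lower_limit": lower,
        "upper_limit": upper,
        "whisker_low": min(inside),    # most extreme value within the limits
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
result = boxplot_limits(data)
print(result["outliers"])      # [100]
print(result["whisker_high"])  # 9
```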
27
mean vs median
• we can guess the shape of the distribution just by
studying some statistics
• when the distribution is symmetric, mean ≈
median; the mean is preferred
• when it is asymmetric, mean ≠ median; we use the
median, and the two represent different properties
of the data
• their difference indicates the direction of the tail
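The last point can be sketched as a rule of thumb: the mean is pulled toward the tail, so the sign of mean − median suggests the tail direction. The function name `tail_direction` and both data sets are made up, and this heuristic can fail for unusual distributions.

```python
import statistics


def tail_direction(values):
    """Rule of thumb: the mean sits on the tail side of the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none"


# Hypothetical samples, made up for illustration only
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
symmetric = [1, 2, 3, 4, 5]
print(tail_direction(right_skewed))  # right
print(tail_direction(symmetric))     # none
```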
30
mean vs median:
the mean will be closer to the tail
33
proportions
• proportion: location for categorical variables
35
proportions
same as the mean, assuming one category = 1 and all others = 0
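A minimal sketch of a proportion computed as the mean of 0/1 indicators. The observations and the function name `proportion` are made up for illustration.

```python
def proportion(observations, category):
    """Share of observations in `category`, as a mean of 0/1 indicators."""
    indicators = [1 if obs == category else 0 for obs in observations]
    return sum(indicators) / len(indicators)


# Hypothetical categorical data, made up for illustration only
colors = ["red", "blue", "red", "green"]
print(proportion(colors, "red"))  # 0.5
```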
37
summary
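If a constant C is added to every value:
Mean: increases by C
SD: unchanged (every value shifts right by C units, so the box length stays the same)
Variance: unchanged
Median: increases by C
IQR: unchanged (the box shifts right, its length stays the same)
If every value is multiplied by C (here C = 2):
Mean: multiplied by C (doubled)
SD: multiplied by |C| (doubled)
Variance: multiplied by C² (4 times)
Median: multiplied by C (doubled)
IQR: multiplied by |C| (the box moves right and its length doubles)
40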
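The shift and scale rules above can be checked numerically. This is a sketch with made-up data and C = 2. The IQR comes from `statistics.quantiles`, whose interpolation method may differ from other software, although the shift and scale behaviour is the same.

```python
import statistics


def summarize(values):
    """Location and spread statistics for one sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "variance": statistics.variance(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Hypothetical data and constant, made up for illustration only
data = [1.0, 2.0, 4.0, 7.0, 11.0]
C = 2.0

base = summarize(data)
plus = summarize([y + C for y in data])   # add C to every value
times = summarize([y * C for y in data])  # multiply every value by C

for name in base:
    print(name, base[name], plus[name], times[name])
```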
exercise
• please solve all problems at the end of
chapter 3
• ask your questions at office hours
41