
biometry – bio220

chapter 3
describing data

1
• descriptive statistics
• numeric data:
– mean & median = location
– standard deviation, variance & IQR = spread
– (coefficient of variation = spread)
• categorical data:
– proportion

2
mean
• measure of location (center)
• sample mean: arithmetic average of sample values

• example: undulation rates of 8 gliding snakes, measured in Hertz (cycles per second)

3
mean

0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6

4
mean

$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$

$Y_i$ : ith observation

$\bar{Y}$ : sample mean

n : number of observations (sample size)

5


mean
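
As a minimal sketch, the sample mean of the eight snake undulation rates can be computed directly from its definition (sum of the observations divided by n). The names `rates` and `sample_mean` are chosen here for illustration only.

```python
# Undulation rates (Hz) of the 8 gliding snakes from the example
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]


def sample_mean(values):
    """Arithmetic average: sum of the observations divided by n."""
    n = len(values)
    return sum(values) / n


mean_rate = sample_mean(rates)
print(f"n = {len(rates)}, mean = {mean_rate:.3f} Hz")
```

For these data the mean comes out at about 1.375 Hz.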

6
standard deviation
• measure of spread
• s.d. = s = roughly the typical deviation of observations from the mean
• variance = $s^2$

7
standard deviation

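sample variance: $s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$

sum of squares: the numerator, $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$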

why square?
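
One way to see why the deviations are squared: the raw deviations from the mean always cancel out, so their sum says nothing about spread. The sketch below, using the snake undulation rates, shows the cancellation and then builds the sum of squares, the sample variance (dividing by n - 1) and the sample s.d. Variable names are illustrative.

```python
import math

# Undulation rates (Hz) of the 8 gliding snakes
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
n = len(rates)
mean_rate = sum(rates) / n

# Deviation of each observation from the sample mean
deviations = [y - mean_rate for y in rates]

# Positive and negative deviations cancel: the sum is zero
# (up to floating-point rounding), so it cannot measure spread.
sum_of_deviations = sum(deviations)

# Squaring removes the sign, so the deviations accumulate instead.
sum_of_squares = sum(d ** 2 for d in deviations)

variance = sum_of_squares / (n - 1)  # sample variance
sd = math.sqrt(variance)             # sample standard deviation

print(f"sum of deviations = {sum_of_deviations:.3f}")
print(f"sum of squares    = {sum_of_squares:.3f}")
print(f"variance = {variance:.3f} Hz^2, s.d. = {sd:.3f} Hz")
```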
8
standard deviation

9
standard deviation

10
standard deviation

11
standard deviation
sample variance

sample s.d.

both are always non-negative values (zero only when all observations are equal)

12


the unbiased sample estimate of standard
deviation
• why divide by n-1?
• if we divide by n, then on average:
sample var < population var
• the smaller the sample → the larger the difference
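
The bias can be checked with a small simulation. This is only a sketch under stated assumptions: the population is a hypothetical normal distribution with variance 1, the sample size is 5, and the seed is arbitrary. Many samples are drawn, and the sum of squares is divided once by n and once by n - 1.

```python
import random


def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


rng = random.Random(1)   # fixed (arbitrary) seed so the run is repeatable
POP_VARIANCE = 1.0       # hypothetical population: normal, mean 0, variance 1
SAMPLE_SIZE = 5
N_SAMPLES = 20000

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(N_SAMPLES):
    sample = [rng.gauss(0.0, POP_VARIANCE ** 0.5) for _ in range(SAMPLE_SIZE)]
    ss = sum_of_squares(sample)
    total_div_n += ss / SAMPLE_SIZE
    total_div_n_minus_1 += ss / (SAMPLE_SIZE - 1)

avg_div_n = total_div_n / N_SAMPLES
avg_div_n_minus_1 = total_div_n_minus_1 / N_SAMPLES

print(f"population variance:         {POP_VARIANCE:.3f}")
print(f"average when dividing by n:   {avg_div_n:.3f}")
print(f"average when dividing by n-1: {avg_div_n_minus_1:.3f}")
```

With n = 5, dividing by n should average close to (n - 1)/n = 0.8 of the population variance, while dividing by n - 1 should average close to 1.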

13
the unbiased sample estimate of standard
deviation
https://en.wikipedia.org/wiki/Bias_of_an_estimator

Don't try to fully understand this slide. It is just here to show you that there is a mathematical proof.

(labels on the proof: sample variance, pop variance)

14


the unbiased sample estimate of standard
deviation
• if we divide by n → sample variance becomes a biased estimate of pop var (toward smaller var)
• if we divide by n-1 → sample variance is an unbiased estimate of pop var
[Figure: sample variance (estimates of pop var) compared with the population var; panels: dividing by n, dividing by n - 1]

15
the normal distribution's
standard deviation
• for normal distributions:

• $\bar{Y}$ ± 2 s.d. ≈ 95% of the data
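
The 95% figure can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` from the Python standard library with a standard normal (mean 0, s.d. 1); the fraction within 2 s.d. is the same for any normal distribution.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability of falling within 2 s.d. of the mean
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"fraction within mean ± 2 s.d.: {within_2_sd:.4f}")
```

The result is slightly above 95% (about 0.9545), which is why "± 2 s.d." is a rule of thumb rather than an exact statement.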

16
www.mathsisfun.com
the issue of rounding
• how to round the mean and s.d.?
– 0.346789452 vs 0.3 vs 0.35
• too many decimal places → difficult to interpret
• too much rounding → error in downstream calculations
• one approach: one more decimal place than the original data
– if original: 7.2, 4.3, ... → mean 5.62
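
The cost of too much rounding can be illustrated with the snake undulation rates (recorded to one decimal place). In this sketch the s.d. is rounded to two decimal places (one more than the data) and to one decimal place, and the variance is then recomputed from the heavily rounded value. Variable names are illustrative.

```python
from statistics import stdev

# Undulation rates (Hz) of the 8 gliding snakes, recorded to 1 decimal place
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

sd_full = stdev(rates)        # unrounded sample s.d.
sd_2dp = round(sd_full, 2)    # one more decimal place than the data
sd_1dp = round(sd_full, 1)    # heavier rounding

# Downstream calculation: variance = s.d. squared
variance_from_full = sd_full ** 2
variance_from_1dp = sd_1dp ** 2

print(f"s.d.: full = {sd_full:.6f}, 2 d.p. = {sd_2dp}, 1 d.p. = {sd_1dp}")
print(f"variance from full s.d.:   {variance_from_full:.3f}")
print(f"variance from 1 d.p. s.d.: {variance_from_1dp:.3f}")
```

Squaring the s.d. rounded to one decimal place gives 0.09 instead of 0.105, an error of about 14%, so it is safer to round only for reporting and keep full precision in intermediate steps.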

17
coefficient of variation
• comparing variation among very different
populations with very different means
• e.g. weight of elephants vs weight of humans
– which one is more variable? which one has higher
variance?

18
coefficient of variation
• if means are different → s.d. tends to be different
– sd elephant weight > sd human weight
– sd elephant weight > sd mouse weight
just because units / scales are larger

• how to normalize the effect of the mean, to reflect variability?

19
coefficient of variation
• coefficient of variation (cv) → a standardized measure for variability
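
A minimal sketch of the calculation, assuming the usual definition CV = 100% × s / mean. To show that the CV removes the effect of scale without inventing elephant or human data, the snake undulation rates are expressed in Hz and again in millihertz: the s.d. changes by a factor of 1000, the CV does not.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV as a percentage: 100 * sample s.d. / sample mean."""
    return 100 * stdev(values) / mean(values)


# Undulation rates of the 8 gliding snakes, in Hz and in millihertz
rates_hz = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
rates_mhz = [1000 * y for y in rates_hz]

cv_hz = coefficient_of_variation(rates_hz)
cv_mhz = coefficient_of_variation(rates_mhz)

print(f"s.d.: {stdev(rates_hz):.3f} Hz vs {stdev(rates_mhz):.1f} mHz")
print(f"CV:   {cv_hz:.1f}% vs {cv_mhz:.1f}%")
```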

20
mean and sd from freq tables

21
mean and sd from freq tables
• for the mean: multiply each value by its freq, sum the products and divide by n
• for the sd: multiply each squared difference from the mean by its freq, sum, divide by n-1 and take the square root
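
The two steps can be sketched as follows. The frequency table here is not the one from the slide: it is built by tallying the eight snake undulation rates, so the results can be compared with the values computed from the raw data.

```python
import math

# Frequency table tallied from the 8 snake undulation rates:
# value (Hz) -> number of snakes with that value
freq_table = {0.9: 1, 1.2: 2, 1.3: 1, 1.4: 2, 1.6: 1, 2.0: 1}

# n is the total of the frequencies, not the number of rows
n = sum(freq_table.values())

# Mean: each value weighted by its frequency
mean = sum(value * freq for value, freq in freq_table.items()) / n

# Sum of squares: each squared difference weighted by its frequency
sum_of_squares = sum(
    freq * (value - mean) ** 2 for value, freq in freq_table.items()
)
sd = math.sqrt(sum_of_squares / (n - 1))

print(f"n = {n}, mean = {mean:.3f}, s.d. = {sd:.3f}")
```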

22
mean and sd from freq tables

mean

standard
deviation
23
median and interquartile range
• median = middle point = 50th percentile = 0.5 quantile
• $Y_{(i)}$ : ith observation after sorting from smallest to largest

• if n is odd:
• median = $Y_{((n+1)/2)}$

• if n is even:
• median = ( $Y_{(n/2)}$ + $Y_{(n/2+1)}$ ) / 2
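
The odd and even cases translate directly into code once the observations are sorted. This is a minimal sketch; the function name `median_from_definition` is chosen here, and the standard library already offers `statistics.median` for the same purpose.

```python
def median_from_definition(values):
    """Median of a sample, following the odd/even rule."""
    ordered = sorted(values)  # the rule applies to sorted observations
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the single middle value, position (n+1)/2 (1-based)
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of positions n/2 and n/2 + 1 (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# Undulation rates (Hz) of the 8 gliding snakes (n is even)
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
print(f"median = {median_from_definition(rates):.2f} Hz")
```

For the snake data the two middle values are 1.3 and 1.4, so the median is 1.35 Hz.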
24
median and interquartile range
• first quartile = 25th percentile = 0.25 quantile
• third quartile = 75th percentile = 0.75 quantile

25
median and interquartile range

• quartiles: divide data into 4 pieces

• interquartile range (IQR) = third – first quartile
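
A sketch of the quartiles and IQR for the snake data, using `statistics.quantiles` from the Python standard library. Note the assumption: several conventions exist for interpolating quartiles in small samples, and the default method used here ("exclusive") may give slightly different values from the rule in the textbook or from other software.

```python
from statistics import quantiles

# Undulation rates (Hz) of the 8 gliding snakes
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

# n=4 cuts the sorted data into 4 pieces and returns the 3 cut points
first_quartile, median, third_quartile = quantiles(rates, n=4)

iqr = third_quartile - first_quartile

print(f"Q1 = {first_quartile:.2f}, median = {median:.2f}, Q3 = {third_quartile:.2f}")
print(f"IQR = {iqr:.2f} Hz")
```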

26
boxplot

Normally the whiskers extend to the min and max values. If any values fall outside the lower and upper limits (first quartile - 1.5*IQR and third quartile + 1.5*IQR), they are marked as outliers and the whiskers extend only as far as the limits.
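
The limits and outliers can be computed without drawing the plot. The sketch below makes two assumptions that vary between plotting tools: quartiles use the "inclusive" interpolation method of `statistics.quantiles`, and each whisker ends at the most extreme observation that is still inside the limits. The function name `boxplot_summary` is hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles, 1.5*IQR limits, whisker ends and outliers of a sample."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [y for y in values if lower_limit <= y <= upper_limit]
    outliers = [y for y in values if y < lower_limit or y > upper_limit]
    return {
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "whisker_low": min(inside),    # most extreme value within the limits
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Undulation rates (Hz) of the 8 gliding snakes
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
box = boxplot_summary(rates)
print(box)
```

Under these assumptions the fastest snake (2.0 Hz) lies above the upper limit and is flagged as an outlier, so the upper whisker stops at 1.6 Hz.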

https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
27
boxplot

28
boxplot

29
mean vs median
• we can guess the shape of a distribution just by studying some statistics
• when the distribution is symmetric → mean ~ median, preferably use the mean
• when asymmetric → mean ≠ median; we usually use the median, though each represents a different property of the data
• their difference indicates the direction of the tail
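
As a rough heuristic only, the sign of (mean - median) can be used to guess the direction of the tail. The sketch below applies it to the snake undulation rates and to their mirror image; the function name `tail_direction` and the tolerance are assumptions made for illustration.

```python
from statistics import mean, median


def tail_direction(values, tolerance=1e-9):
    """Guess the tail direction from the sign of (mean - median)."""
    difference = mean(values) - median(values)
    if difference > tolerance:
        return "right"   # mean pulled above the median
    if difference < -tolerance:
        return "left"    # mean pulled below the median
    return "none (roughly symmetric)"


# Undulation rates (Hz) of the 8 gliding snakes
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
mirrored = [-y for y in rates]  # same shape, flipped left to right

print(tail_direction(rates))
print(tail_direction(mirrored))
```

For the snake data the mean (1.375) is above the median (1.35), which points to a tail on the right, consistent with the single fast snake at 2.0 Hz.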

30
comparing 3 distributions

31
comparing 3 distributions

32
mean vs median:
the mean will be closer to the tail

33
mean vs median:
the mean will be closer to the tail

34
proportions
• proportion: location for categorical variables

35
proportions

36
proportions
same as mean, assuming one category = 1, others = 0
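
The link between a proportion and a mean can be sketched by coding the category of interest as 1 and every other category as 0. The categorical sample below is hypothetical, invented purely for illustration, and the function name `proportion` is chosen here.

```python
# Hypothetical categorical sample, invented purely for illustration
observations = ["red", "blue", "red", "green", "red", "blue", "red", "green"]


def proportion(values, category):
    """Proportion of `category`: the mean of a 0/1 indicator variable."""
    indicators = [1 if v == category else 0 for v in values]
    return sum(indicators) / len(indicators)


p_red = proportion(observations, "red")
p_blue = proportion(observations, "blue")

print(f"proportion red = {p_red}, proportion blue = {p_blue}")
```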

37
summary

38
summary

39
summary

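If add C to every value:
Mean: increases by C
SD: stays the same (every value shifts by C, so the spread does not change)
Variance: stays the same
Median: increases by C
IQR: stays the same (the box shifts to the right by C, its length stays the same)
If multiply every value by C:
Mean: multiplied by C (doubled if C = 2)
SD: multiplied by |C| (doubled if C = 2)
Variance: multiplied by $C^2$ (4 times if C = 2)
Median: multiplied by C (doubled if C = 2)
IQR: multiplied by |C| (for C > 1 the box moves to the right and its length increases)

40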
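
The effect of adding or multiplying by a constant can be checked numerically. In this sketch the snake undulation rates are shifted and scaled by an arbitrary constant C = 2, and the summary statistics are recomputed each time. Quartiles use the default method of `statistics.quantiles`, which is an assumption; the shift and scale behaviour does not depend on that choice.

```python
from statistics import mean, median, quantiles, stdev, variance


def summary(values):
    """Location and spread statistics for one sample."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "variance": variance(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


# Undulation rates (Hz) of the 8 gliding snakes
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
C = 2  # arbitrary constant chosen for illustration

original = summary(rates)
shifted = summary([y + C for y in rates])  # add C to every value
scaled = summary([y * C for y in rates])   # multiply every value by C

for name in original:
    print(f"{name:>8}: original {original[name]:.3f}, "
          f"+C {shifted[name]:.3f}, *C {scaled[name]:.3f}")
```

Adding C moves the mean and median by C and leaves the s.d., variance and IQR unchanged; multiplying by C = 2 doubles the mean, median, s.d. and IQR and multiplies the variance by 4.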
exercise
• please solve all problems at the end of chapter 3
• ask your questions at office hours

41
