Descriptive statistics
$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$
$\bar{Y}$ = the sample mean
$Y_i$ = the observations (the data); the subscript $i$ numbers them sequentially from 1 to $n$
$n$ = the number of observations
$\sum$ = capital sigma, meaning summation
Descriptive stats for numerical variables
• In Excel, the mean is calculated using
=average()
• Several commands in R calculate the mean
• How these commands differ
mean() – calculates the mean of a vector
colMeans() and rowMeans() – calculate the means of the columns or rows of a matrix or data frame
Other, more complex calculations are possible using other functions
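As a minimal sketch of how these commands differ, the snippet below applies them to the eight observations used in the worked example in these notes. The matrix `m` and its column names are made up for illustration.

```r
# Eight observations (the same values as in the worked example in these notes)
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# mean() works on a single vector
y_bar <- mean(y)

# A hypothetical 4 x 2 matrix built from the same values, for illustration only
# (matrix() fills column by column)
m <- matrix(y, nrow = 4, ncol = 2, dimnames = list(NULL, c("a", "b")))

col_means <- colMeans(m)  # one mean per column
row_means <- rowMeans(m)  # one mean per row

print(y_bar)
print(col_means)
print(row_means)
```

Calling mean() directly on a matrix returns a single mean of all its values, which is why the column and row versions exist.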
• Measures of spread around the mean
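[Figure: observations $Y_1$ and $Y_2$ plotted around the mean $\bar{Y}$, with the distances $\bar{Y} - Y_1$ and $\bar{Y} - Y_2$ marked]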
• The two main measures are the sample variance and the
standard deviation
• The foundation of both is the sum of squares (SS)
$$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
• Step 1: calculate the sum of squares
• Step 2: calculate the variance (using the sum of squares)
• Step 3: calculate the standard deviation (using the variance)
Observation ($Y_i$)    Deviation ($Y_i - \bar{Y}$)    Squared deviation ($(Y_i - \bar{Y})^2$)
0.9                    -0.475                         0.225625
1.2                    -0.175                         0.030625
1.2                    -0.175                         0.030625
1.3                    -0.075                         0.005625
1.4                     0.025                         0.000625
1.4                     0.025                         0.000625
1.6                     0.225                         0.050625
2.0                     0.625                         0.390625
mean = 1.375           sum = 0                        SS = 0.735
• Add up the squared deviations to get the sum of squares: SS = 0.735
• In R, for a vector of observations x
sum((x - mean(x))^2)
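The sum of squares calculation can be sketched step by step in R as follows, using the eight observations from the worked example. The variable names are illustrative choices, not part of the original slides.

```r
# Observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Deviations of each observation from the sample mean
deviations <- y - mean(y)

# Squared deviations
squared_deviations <- deviations^2

# Sum of squares: add the squared deviations up
ss <- sum(squared_deviations)

# The same calculation written on one line
ss_one_line <- sum((y - mean(y))^2)

print(data.frame(y, deviations, squared_deviations))
print(ss)
```

The deviations themselves always sum to zero (up to rounding error), which is why they are squared before being added.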
• The variance ($s^2$): the sum of squares divided by the number of observations ($n$) minus one
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} = \frac{SS}{n-1}$$
Calculating the Variance
• There are n = 8 observations, so n - 1 = 8 - 1 = 7
• From the table of squared deviations, SS = 0.735
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} = \frac{0.735}{7} = 0.105$$
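As a quick check of this arithmetic, the sketch below computes the variance by hand from the sum of squares and compares it with R's built-in var(). The variable names are illustrative.

```r
# Observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

n  <- length(y)              # number of observations
ss <- sum((y - mean(y))^2)   # sum of squares

# Variance by hand: sum of squares divided by n - 1
s2_by_hand <- ss / (n - 1)

# Built-in sample variance, which also divides by n - 1
s2_builtin <- var(y)

print(c(by_hand = s2_by_hand, builtin = s2_builtin))
```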
• Why divide by n-1?
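• This is called Bessel's correction
• The sample variance is an estimate of the unknown population variance, calculated from a finite sample
• The deviations are measured from the sample mean, not the unknown population mean, so dividing the sum of squares by n underestimates the population variance on average (the estimate is biased)
• Dividing by n-1 corrects for that bias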
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$$
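One way to see the bias is to enumerate every possible sample from a tiny population and average the two candidate estimates. The sketch below assumes a hypothetical population of four values and samples of size 2 drawn with replacement; these choices are for illustration only and do not come from the slides.

```r
# Hypothetical population of four values, chosen only for illustration
population <- c(1, 2, 3, 4)

# True population variance: average squared deviation from the population mean
sigma2 <- mean((population - mean(population))^2)

# Every possible ordered sample of size 2, drawn with replacement (16 samples)
samples <- as.matrix(expand.grid(first = population, second = population))
n <- ncol(samples)

# Sum of squares of each sample around its own sample mean
ss <- apply(samples, 1, function(y) sum((y - mean(y))^2))

# Average estimate across all possible samples, using each divisor
mean_dividing_by_n  <- mean(ss / n)        # too small on average
mean_dividing_by_n1 <- mean(ss / (n - 1))  # equals sigma2 in this setup

print(c(sigma2 = sigma2,
        divide_by_n = mean_dividing_by_n,
        divide_by_n_minus_1 = mean_dividing_by_n1))
```

In this setup, dividing by n gives an average estimate of half the true variance, while dividing by n-1 recovers it exactly.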
• Variance in R
var()
• Variance in Excel
=var()
• The problem with the sum of squares and the variance is that their units are squared (for example, cm^2 if the data are in cm); these are not the same units as the raw data or the mean
• For this reason, we never graph the sum of squares or the variance
• ______________________________
• The standard deviation (s) is the square root of the variance, that is, of the sum of squared deviations from the mean divided by the number of observations minus one
$$s = \sqrt{\frac{SS}{n-1}}$$
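Continuing with the same eight observations, the standard deviation can be sketched in R either from the sum of squares or with the built-in sd(). The variable names are illustrative.

```r
# Observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Step 1: sum of squares
ss <- sum((y - mean(y))^2)

# Step 2: variance
s2 <- ss / (length(y) - 1)

# Step 3: standard deviation is the square root of the variance
s_by_hand <- sqrt(s2)

# Built-in equivalent
s_builtin <- sd(y)

print(c(by_hand = s_by_hand, builtin = s_builtin))
```

Unlike the variance, the result is in the same units as the observations.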
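Coefficient of variation
• The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean
$$CV = \frac{s}{\bar{Y}} \times 100$$
• A larger CV means more variability relative to the mean, whereas a smaller CV means less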
• CV also works well if one wishes to compare the
variability of traits which are in different units
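A minimal sketch of the CV calculation in R is shown below. The helper `cv()` is a hypothetical function written for this example (base R has no built-in CV function), and the `lengths_mm` data are invented for illustration.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv <- function(x) sd(x) / mean(x) * 100

# Observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# The same observations expressed in units 1000 times smaller
y_rescaled <- y * 1000

# Invented measurements of a second trait in different units (illustration only)
lengths_mm <- c(100, 102, 98, 101, 99)

cv_y        <- cv(y)
cv_rescaled <- cv(y_rescaled)   # unchanged by the change of units
cv_lengths  <- cv(lengths_mm)

print(c(y = cv_y, y_rescaled = cv_rescaled, lengths_mm = cv_lengths))
```

Because the units cancel when s is divided by the mean, rescaling the data leaves the CV unchanged, which is what makes traits in different units comparable.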
Median and interquartile range
• In addition to the mean, the median also provides a measure of location in a frequency distribution
• The median is simply the middle value of the ordered data
• 1, 2, 3, 4, 5; 3 is the median in this example
• If there is an even number of observations, the median falls between the two middle numbers and is the average of those two numbers
• 1.2, 1.3, 2.5, 5.5; the median is (1.3+2.5)/2 = 1.9
• In R
median()
• In Excel
=median()
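The two median examples from these notes can be reproduced in R as a quick check. The third vector is an unsorted version of the first, added here to show that the data do not need to be sorted beforehand.

```r
# Odd number of observations: the middle value
median_odd <- median(c(1, 2, 3, 4, 5))

# Even number of observations: the average of the two middle values
median_even <- median(c(1.2, 1.3, 2.5, 5.5))

# median() orders the values internally, so unsorted input is fine
median_unsorted <- median(c(5, 1, 4, 2, 3))

print(c(odd = median_odd, even = median_even, unsorted = median_unsorted))
```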
• Quartiles are values that partition the data into
quarters
• Interquartile range (IQR) = 3rd quartile – 1st quartile
• The 2nd quartile is the median
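As a sketch, quartiles and the IQR can be obtained in R with quantile() and IQR(), shown here on the eight observations from the worked example. Note that several conventions exist for interpolating quartiles; R's default (type 7) is assumed here, and other software may give slightly different values.

```r
# Observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# 1st, 2nd and 3rd quartiles (R's default interpolation method)
quartiles <- quantile(y, probs = c(0.25, 0.50, 0.75))

# Interquartile range by hand: 3rd quartile minus 1st quartile
iqr_by_hand <- unname(quartiles[3] - quartiles[1])

# Built-in equivalent
iqr_builtin <- IQR(y)

print(quartiles)
print(c(by_hand = iqr_by_hand, builtin = iqr_builtin))
```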
When to use which measure of location and spread
• Measures of location
should fall nearest to
the central tendency
Sensitivity of the mean to extreme observations
$$s = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$
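To illustrate this sensitivity, the sketch below replaces the largest of the eight worked-example observations with a made-up extreme value of 20 and compares the measures of location and spread before and after. The extreme value is an assumption for illustration only.

```r
# Observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Same data with the largest value replaced by a made-up extreme observation
y_extreme <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 20)

# Mean and standard deviation are pulled strongly by the extreme value
mean_before <- mean(y);   mean_after <- mean(y_extreme)
sd_before   <- sd(y);     sd_after   <- sd(y_extreme)

# Median and interquartile range are unchanged here
median_before <- median(y);   median_after <- median(y_extreme)
iqr_before    <- IQR(y);      iqr_after    <- IQR(y_extreme)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
print(c(sd_before = sd_before, sd_after = sd_after))
print(c(iqr_before = iqr_before, iqr_after = iqr_after))
```

The squared deviations in the standard deviation give extreme observations even more weight than they have in the mean, whereas the median and IQR depend only on the ordering of the middle values.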
Proportions
$$\hat{p} = \frac{\text{Number in category}}{n}$$
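A proportion can be sketched in R by counting the observations in a category and dividing by the total. The categorical vector `x` below is invented for illustration and does not come from the slides.

```r
# Invented categorical data: 10 observations in categories A, B and C
x <- c("A", "B", "A", "C", "A", "B", "A", "A", "C", "B")

n <- length(x)

# Proportion in category A: number in category divided by n
p_hat_A <- sum(x == "A") / n

# Proportions for every category at once
counts <- table(x)
props  <- prop.table(counts)

print(p_hat_A)
print(props)
```

The proportions across all categories sum to 1.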