
Chp 3: Describing data

Descriptive statistics

• Regardless of the type of data you have, you will have to summarize it
• Often, looking at the raw values of the variables isn't helpful
• Descriptive statistics are quantities that capture important features of the frequency distribution
Descriptive statistics

• Descriptive stats for numerical variables:
• Location – describes where most observations are centered (i.e. the central tendency)
• e.g. mean, mode and median
Descriptive statistics

• Descriptive stats for numerical variables:
• Spread – a measure of how variable the measurements are, i.e. how scattered the observations are around the central tendency
• e.g. range, variance, standard deviation
Descriptive statistics

• Descriptive stats for categorical variables:
• Proportion – measures the fraction of observations in a given category
Descriptive stats for numerical
variables
• Generally speaking for numerical data, descriptive
stats for spread correspond to a specific measure of
location
• e.g.
• The variance and standard deviation are measures of
the spread around the mean (a measure of the
location)
• Interquartile range is a measure of the spread around
the median (a measure of the location)
Descriptive stats for numerical
variables
• Recall that all the measures we calculate are estimates (statistics); the true population parameters are unknown
Descriptive stats for numerical
variables
• The sample mean – the sum of all observations divided by the number of observations

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

$\bar{Y}$ = the sample mean
$Y_i$ = the observations (the data); the subscript i numbers them sequentially
$n$ = the number of observations
$\sum$ = sigma, meaning summation
Descriptive stats for numerical
variables
• In Excel, the mean is calculated using
=average()
Descriptive stats for numerical
variables
• Several commands in R will calculate the mean
• How these commands differ:
mean() – calculates the mean of a vector
colMeans() and rowMeans() – calculate the column or row means of a matrix or data frame
Other, more complex calculations are possible using other functions
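As a rough illustration of how these functions differ, the sketch below applies them to a vector and to a small matrix. The vector `y` holds the eight observations used in the sum of squares example in these slides; the matrix `m` and all object names are made up for illustration.

```r
# Eight observations stored in a numeric vector
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# mean() returns the mean of a vector
y_bar <- mean(y)
print(y_bar)

# A made-up 3 x 2 matrix; matrix() fills column by column,
# so column "a" is 1, 2, 3 and column "b" is 10, 20, 30
m <- matrix(c(1, 2, 3,
              10, 20, 30),
            ncol = 2,
            dimnames = list(NULL, c("a", "b")))

# colMeans() returns one mean per column, rowMeans() one mean per row
col_means <- colMeans(m)
row_means <- rowMeans(m)
print(col_means)
print(row_means)
```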
Descriptive stats for numerical
variables
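• Measures of spread around the mean

[Figure: two observations, $Y_1$ and $Y_2$, on either side of the mean $\bar{Y}$, with their deviations from the mean, $\bar{Y} - Y_1$ and $\bar{Y} - Y_2$, marked]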
Descriptive stats for numerical
variables
• Measures of spread around the mean
• The two main measures are the sample variance and the
standard deviation
• The foundation of both is the sum of squares (SS)

$$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
Descriptive stats for numerical
variables
• Step 1: calculate the sum of squares
• Step 2: calculate the variance (using the sum of squares)
• Step 3: calculate the standard deviation (using the variance)

Sum of squares → Variance → Standard deviation


Descriptive stats for numerical
variables
Calculating the Sum of Squares

Observation ($Y_i$)   Deviation ($Y_i - \bar{Y}$)   Squared deviation ($(Y_i - \bar{Y})^2$)
0.9                   -0.475                        0.225625
1.2                   -0.175                        0.030625
1.2                   -0.175                        0.030625
1.3                   -0.075                        0.005625
1.4                    0.025                        0.000625
1.4                    0.025                        0.000625
1.6                    0.225                        0.050625
2                      0.625                        0.390625
mean = 1.375           sum = 0                      SS = 0.735

$$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

• Calculate the mean of your data


Descriptive stats for numerical
variables
Calculating the Sum of Squares


• The mean is then subtracted from each data point; for example, subtracting the mean (1.375) from the first data point (0.9) gives 0.9 − 1.375 = −0.475
Descriptive stats for numerical
variables
Calculating the Sum of Squares


• We then square the deviations
• Squaring makes every deviation non-negative, so negative and positive deviations cannot cancel when they are added up
Descriptive stats for numerical
variables
Calculating the Sum of Squares


• Add the squared deviations up


Descriptive stats for numerical
variables
• Calculating the sum of squares in R
• R is a vectorized language
• This means that all you have to do to calculate the sum of squares of a numeric vector x in R is:

sum((x - mean(x))^2)

The sum of squares (SS):

$$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
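The worked table from the previous slides can be reproduced step by step with vectorized operations. This is only a sketch; the object names (`y`, `deviations`, `squared_deviations`, `ss`) are chosen here for illustration.

```r
# The eight observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Subtract the mean from every observation at once
deviations <- y - mean(y)

# Square every deviation at once
squared_deviations <- deviations^2

# Add the squared deviations up to get the sum of squares
ss <- sum(squared_deviations)

# Show the three columns of the table side by side
print(data.frame(observation = y,
                 deviation = deviations,
                 squared_deviation = squared_deviations))
print(ss)
```

The deviations themselves sum to zero (up to floating-point error), which is why they are squared before being added.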
Descriptive stats for numerical
variables
• The variance ($s^2$) – the sum of squares divided by the number of observations (n) minus one

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} = \frac{SS}{n-1}$$
Descriptive stats for numerical
variables
Calculating the Variance

Using the eight observations from the sum of squares example (n = 8, SS = 0.735):

$$s^2 = \frac{SS}{n-1} = \frac{0.735}{8-1} = \frac{0.735}{7} = 0.105$$
Descriptive stats for numerical
variables
• Why divide by n-1?
• This is called Bessel's correction
• The sample variance is an estimate of the unknown population variance, calculated from a finite sample
• Deviations are measured from the sample mean rather than the unknown population mean, so dividing the sum of squares by n would tend to underestimate the population variance
• Dividing by n-1 corrects for that bias
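One way to see the bias is a small simulation. The sketch below is an illustration under assumed settings (a normal population with variance 1, samples of size 5, an arbitrary seed), none of which come from the slides.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 5              # assumed small sample size
n_samples <- 10000  # assumed number of simulated samples

# Draw many samples from a population whose true variance is 1
# (one sample per column)
samples <- matrix(rnorm(n * n_samples, mean = 0, sd = 1), nrow = n)

# Sum of squares around each sample's own mean
ss <- apply(samples, 2, function(y) sum((y - mean(y))^2))

# Average the two candidate estimates over all samples
mean_div_n   <- mean(ss / n)        # dividing by n
mean_div_nm1 <- mean(ss / (n - 1))  # dividing by n - 1

print(c(divide_by_n = mean_div_n, divide_by_n_minus_1 = mean_div_nm1))
```

On average, dividing by n should give a value noticeably below the true variance of 1, while dividing by n-1 should land close to it.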

Descriptive stats for numerical
variables
• Variance in R
var()
• Variance in Excel
=var()
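As a quick check, the variance computed by hand from the sum of squares should agree with R's built-in var(). The object names in this sketch are illustrative.

```r
# The eight observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

n  <- length(y)              # number of observations
ss <- sum((y - mean(y))^2)   # sum of squares

# Variance by hand: sum of squares divided by n - 1
s2_manual <- ss / (n - 1)

# Variance using the built-in function, which also divides by n - 1
s2_builtin <- var(y)

print(c(manual = s2_manual, builtin = s2_builtin))
```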
Descriptive stats for numerical
variables
• The problem with the sum of squares and the variance is that their units are squared (x^2), which are not the units of the raw data or the mean
• For this reason, we never graph the sum of squares or the variance
• ______________________________

Descriptive stats for numerical
variables
• The standard deviation (s) – the square root of the sum of squared deviations from the mean divided by the number of observations minus one

$$s = \sqrt{\frac{SS}{n-1}}$$

• The square root puts this measure of spread back into the units of the data
• You can graph the standard deviation
Descriptive stats for numerical
variables
• The standard deviation (s) has a direct connection
to the frequency distribution
• If the distribution is bell-shaped (approximately normal):
• about 68.3% of the data fall within ± 1 s of the mean
• about 95.4% of the data fall within ± 2 s of the mean
• about 99.7% of the data fall within ± 3 s of the mean
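For an exactly normal distribution, these percentages can be checked with pnorm(), the normal cumulative distribution function. The helper name `within_sd` is made up for this sketch.

```r
# Fraction of a normal distribution lying within k standard deviations
# of the mean: area up to +k minus area up to -k
within_sd <- function(k) pnorm(k) - pnorm(-k)

# Coverage for 1, 2 and 3 standard deviations
coverage <- sapply(1:3, within_sd)

# Express as percentages rounded to one decimal place
print(round(100 * coverage, 1))
```

Real data are never exactly normal, so these figures are a guide rather than a guarantee.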
Descriptive stats for numerical
variables
• The standard deviation (s) has a direct connection
to the frequency distribution
• Thus, if the data does not form a bell-shaped
distribution then s is less informative
Descriptive stats for numerical
variables
• The standard deviation in R
sd()
• The standard deviation in Excel
=STDEV()
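A short sketch showing that sd() matches the square root of the hand-calculated variance for the worked example (object names are illustrative):

```r
# The eight observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Standard deviation by hand: square root of SS / (n - 1)
s_manual <- sqrt(sum((y - mean(y))^2) / (length(y) - 1))

# Standard deviation using the built-in function
s_builtin <- sd(y)

print(c(manual = s_manual, builtin = s_builtin))
```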
Descriptive stats for numerical
variables
• A comment about rounding
• Carry as many digits as you can through intermediate calculations
• Round only at the end of your calculations
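To illustrate why early rounding matters, the sketch below recomputes the sum of squares for the worked example after rounding the mean to one decimal place first. The object names are illustrative.

```r
# The eight observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Sum of squares using the full-precision mean (1.375)
ss_full <- sum((y - mean(y))^2)

# Sum of squares after rounding the mean to one decimal place (1.4)
ss_rounded_early <- sum((y - round(mean(y), 1))^2)

print(c(full_precision = ss_full, rounded_early = ss_rounded_early))
```

The two results differ, and the discrepancy would carry through to the variance and the standard deviation.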
Coefficient of variation
• For many variables, standard deviation and mean
change together when different groups are
compared
• For example elephants have greater mass than
mice and also more variability in mass
Coefficient of variation
• In this example, we would be concerned with the relative variation among individuals, not variation that arises simply because individuals in one population are much larger than those in the other
Coefficient of variation
• If an elephant gained 10g, it would mean little
• If a mouse gained 10g, it would double its mass
Coefficient of variation
• However, 10% mass gain for an elephant would be
more comparable to a 10% mass gain in a mouse
Coefficient of variation
• In circumstances like this we calculate the coefficient of variation (CV), which is the standard deviation as a percentage of the mean:

$$CV = \frac{s}{\bar{Y}} \times 100$$
Coefficient of variation
• A larger CV means more variability, whereas a smaller CV means less

Coefficient of variation
• CV also works well if one wishes to compare the
variability of traits which are in different units
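A minimal sketch of the CV in R follows. The helper `cv()` is not a base R function; it is defined here for illustration, and the body masses are made-up numbers, not real measurements.

```r
# Hypothetical helper: standard deviation as a percentage of the mean
cv <- function(x) sd(x) / mean(x) * 100

# Made-up body masses, for illustration only
mouse_g     <- c(18, 20, 22, 19, 21)            # grams
elephant_kg <- c(4200, 5000, 5600, 4700, 5300)  # kilograms

# The CVs can be compared even though the units differ
print(cv(mouse_g))
print(cv(elephant_kg))

# Changing the units rescales s and the mean by the same factor,
# so the CV stays the same
cv_mouse_kg <- cv(mouse_g / 1000)
print(cv_mouse_kg)
```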

Median and interquartile range
• In addition to the mean, the median also provides a measure of location in a frequency distribution
• The median is simply the middle observation when the data are sorted
• 1, 2, 3, 4, 5; 3 is the median in this example
• If there is an even number of observations, the median falls between the two middle numbers and is the average of those two numbers
• 1.2, 1.3, 2.5, 5.5; the median is (1.3+2.5)/2 = 1.9
Median and interquartile range
• In R
median()
• In Excel
=median()
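The two examples from the previous slide can be checked directly with median(); the object names below are illustrative.

```r
# Odd number of observations: the median is the middle value
odd_data <- c(1, 2, 3, 4, 5)
median_odd <- median(odd_data)

# Even number of observations: the median is the average of the
# two middle values, here (1.3 + 2.5) / 2
even_data <- c(1.2, 1.3, 2.5, 5.5)
median_even <- median(even_data)

print(c(odd = median_odd, even = median_even))
```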
Median and interquartile range
• Quartiles are values that partition the data into
quarters
• Interquartile range (IQR) = 3rd quartile – 1st quartile
• The 2nd quartile is the median
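In R, quartiles can be obtained with quantile() and the interquartile range with IQR(). The sketch below uses the eight observations from the worked example. Note that quantile() supports several interpolation rules through its `type` argument, so other software (or a hand calculation) may give slightly different quartiles for small samples.

```r
# The eight observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# 1st, 2nd and 3rd quartiles using R's default interpolation rule
q <- quantile(y, probs = c(0.25, 0.5, 0.75))
print(q)

# IQR by hand: 3rd quartile minus 1st quartile
iqr_manual <- unname(q[3] - q[1])

# IQR using the built-in function
iqr_builtin <- IQR(y)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```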
When to use which measure of
location and spread
• Measures of location
should fall nearest to
the central tendency

[Figure: Comparison between the median and the mean using the frequency distribution for the Mm genotype. The two different colors represent the two halves of the distribution.]
When to use which measure of
location and spread
• The mean will always
be more sensitive to
outliers than the
median
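A small sketch of this sensitivity: one extreme value is appended to the eight observations from the worked example, and the shift in the mean is compared with the shift in the median. The outlier value of 20 is hypothetical, chosen only for illustration.

```r
# The eight observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# The same data with one hypothetical extreme observation added
y_outlier <- c(y, 20)

# How far each measure of location moves when the outlier is added
shift_mean   <- mean(y_outlier) - mean(y)
shift_median <- median(y_outlier) - median(y)

print(c(shift_in_mean = shift_mean, shift_in_median = shift_median))
```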

[Figure: Sensitivity of the mean to extreme observations (outliers) using the frequency distribution of the MM genotype. The two different colors represent the two halves of the distribution.]
When to use which measure of
location and spread
• Because the standard
deviation is calculated
with deviations from
the mean, it will
likewise be susceptible
to outliers
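The same kind of check can be made for the measures of spread. In this sketch a hypothetical outlier (20, made up for illustration) is added to the worked-example data, and the relative change in the standard deviation is compared with the relative change in the interquartile range.

```r
# The eight observations from the worked example
y <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# The same data with one hypothetical extreme observation added
y_outlier <- c(y, 20)

# Factor by which each measure of spread grows
sd_ratio  <- sd(y_outlier) / sd(y)
iqr_ratio <- IQR(y_outlier) / IQR(y)

print(c(sd_ratio = sd_ratio, iqr_ratio = iqr_ratio))
```

The standard deviation should grow by a much larger factor than the interquartile range, since the latter depends only on the quartiles.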

$$s = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$
When to use which measure of
location and spread
• Why do we care?
• Our measures of location should
represent where most of the
data is

When to use which measure of
location and spread
• Why do we care?
• Thus, without a good measure of
location we can make few if any
generalizations about the data or
the unknown population
parameters

Proportions

• Proportion is the most important descriptive statistic for categorical variables

$$\hat{p} = \frac{\text{Number in category}}{n}$$
Proportions

• The proportion ($\hat{p}$) has properties in common with the arithmetic mean

$$\hat{p} = \frac{\text{Number in category}}{n}$$

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$
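One way to see the connection is to code each observation as 1 if it falls in the category and 0 otherwise; the proportion is then the mean of those values. The genotype labels below are made-up data for illustration, not values from the slides.

```r
# Made-up categorical data: ten individuals and their genotypes
genotype <- c("MM", "Mm", "mm", "Mm", "MM", "Mm", "mm", "Mm", "Mm", "MM")
n <- length(genotype)

# Proportion by definition: number in category divided by n
p_hat <- sum(genotype == "Mm") / n

# Same proportion as a mean: TRUE/FALSE are treated as 1/0
p_hat_as_mean <- mean(genotype == "Mm")

print(c(by_definition = p_hat, as_mean = p_hat_as_mean))

# Proportions for every category at once
print(table(genotype) / n)
```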
