Biometry Lecture 3 Posted

Chp 3: Describing data
Descriptive statistics
• Regardless of the type of data you have, you will have to summarize it
• Often, looking at the raw values of a variable isn't helpful
• Descriptive statistics are quantities that capture important features of the frequency distribution
• Descriptive stats for numerical variables:
• Location – describes where most observations are centered (i.e. the central tendency)
• e.g. mean, mode and median
• Spread – a measure of how variable the measurements are, i.e. how scattered the observations are from the central tendency
• e.g. range, variance, standard deviation
• Descriptive stats for categorical variables:
• Proportion – measures the fraction of observations in a given category
Descriptive stats for numerical
variables
• Generally speaking for numerical data, descriptive
stats for spread correspond to a specific measure of
location
• e.g.
• The variance and standard deviation are measures of
the spread around the mean (a measure of the
location)
• Interquartile range is a measure of the spread around
the median (a measure of the location)
• Recall that all the measures we calculate are estimates (statistics); the true population parameters are unknown
• The sample mean – the sum of all observations divided by the number of observations
$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$
$\bar{Y}$ = the sample mean
$Y_i$ = the observations (the data); the subscript $i$ numbers them sequentially from 1 to $n$
$n$ = the number of observations
$\sum$ = sigma, meaning summation
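As a minimal sketch of the formula above, the mean can be computed by hand in Python (an assumption of this example; the slides themselves use R and Excel). The data are the eight observations from the sum-of-squares worked example in this lecture, and the variable names are illustrative only.

```python
# The eight observations from the sum-of-squares worked example in this lecture
observations = [0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0]

n = len(observations)                  # n: the number of observations
sample_mean = sum(observations) / n    # sum of all Y_i divided by n

print(f"n = {n}, mean = {sample_mean:.3f}")   # prints: n = 8, mean = 1.375
```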
• In Excel, the mean is calculated using
=average()
• Several commands in R will calculate the mean
• How these commands differ:
mean() – calculates the mean of a vector
colMeans() and rowMeans() – calculate the column or row means of a matrix or data frame (the related colSums() and rowSums() give sums)
Other, more complex calculations are possible using other functions
[Figure: the mean $\bar{Y}$ with two observations, $Y_1$ and $Y_2$, and their deviations $\bar{Y} - Y_1$ and $\bar{Y} - Y_2$ marked]
• Measures of spread around the mean
• The two main measures are the sample variance and the
standard deviation
• The foundation of both is the sum of squares (SS)
$$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
• Step 1: calculate the sum of squares
• Step 2: calculate the variance (using the sum of squares)
• Step 3: calculate the standard deviation (using the variance)
Sum of squares → Variance → Standard deviation
Calculating the Sum of Squares
Observation ($Y_i$)   Deviation ($Y_i - \bar{Y}$)   Squared deviation ($(Y_i - \bar{Y})^2$)
0.9                   -0.475                        0.225625
1.2                   -0.175                        0.030625
1.2                   -0.175                        0.030625
1.3                   -0.075                        0.005625
1.4                    0.025                        0.000625
1.4                    0.025                        0.000625
1.6                    0.225                        0.050625
2                      0.625                        0.390625
mean = 1.375          sum = 0                       SS = 0.735
$$SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
• Calculate the mean of your data

• The mean is then subtracted from each data point; for example, subtracting the mean (1.375) from the first data point (0.9) gives -0.475
• We then square the deviations
• Squaring removes the negative signs, so negative and positive deviations do not cancel when they are added up
• Add the squared deviations up
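The four steps above can be sketched in Python as follows. This is an illustration only (the slides use R and Excel), using the eight observations from the table; the variable names are assumptions of this example.

```python
# The eight observations from the sum-of-squares table
observations = [0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0]

# Step 1: calculate the mean of the data
mean = sum(observations) / len(observations)

# Step 2: subtract the mean from each data point (Y_i - mean)
deviations = [y - mean for y in observations]

# Step 3: square each deviation, which removes the negative signs
squared_deviations = [d ** 2 for d in deviations]

# Step 4: add the squared deviations up
sum_of_squares = sum(squared_deviations)

print(round(sum_of_squares, 3))   # prints: 0.735
```

The deviations themselves sum to zero (up to floating-point rounding), which is why they are squared before being added.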

• Calculating the sum of squares in R
• R is a vectorized language
• This means that all you have to do to calculate the sum of squares in R, for a vector of observations x, is:
sum((x - mean(x))^2)
• The variance ($s^2$) – the sum of squares divided by the number of observations ($n$) minus one
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} = \frac{SS}{n-1}$$
Calculating the Variance
For the eight observations in the sum-of-squares table, $n = 8$ and $SS = 0.735$, so $n - 1 = 7$:
$$s^2 = \frac{SS}{n-1} = \frac{0.735}{8-1} = \frac{0.735}{7} = 0.105$$
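The same arithmetic can be checked with a short Python sketch (Python is an assumption here; the slides use R and Excel). It divides the sum of squares by n - 1 and compares the result with `statistics.variance` from the standard library, which also uses the n - 1 denominator.

```python
import statistics

# The eight observations from the sum-of-squares table
observations = [0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0]

n = len(observations)
mean = sum(observations) / n
sum_of_squares = sum((y - mean) ** 2 for y in observations)

# Sample variance: the sum of squares divided by n - 1
variance = sum_of_squares / (n - 1)

# Library check: statistics.variance() is the sample (n - 1) variance
library_variance = statistics.variance(observations)

print(round(variance, 3), round(library_variance, 3))   # prints: 0.105 0.105
```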
• Why divide by n-1?
• This is called Bessel's correction
• The sample variance is an estimate of the unknown population variance, calculated from a finite sample
• Deviations are measured from the sample mean rather than from the unknown population mean, so the sum of squares tends to be too small and dividing by n would give a biased (too low) estimate
• Dividing by n-1 corrects for that bias
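One way to see the bias is to enumerate every possible sample from a very small population. The sketch below is an illustration under stated assumptions, not part of the lecture: the population `[1, 2, 3]` is hypothetical, samples of size 2 are drawn with replacement, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]   # hypothetical, deliberately tiny population
N = len(population)
mu = Fraction(sum(population), N)

# True population variance (divide by N, because this is the whole population)
population_variance = sum((y - mu) ** 2 for y in population) / N

n = 2                                           # sample size
samples = list(product(population, repeat=n))   # all 9 ordered samples, with replacement


def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean."""
    ybar = Fraction(sum(sample), len(sample))
    return sum((y - ybar) ** 2 for y in sample)


# Average of the two candidate estimators over every possible sample
mean_divide_by_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
mean_divide_by_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(population_variance, mean_divide_by_n, mean_divide_by_n_minus_1)
# prints: 2/3 1/3 2/3
```

Averaged over all samples, dividing by n underestimates the population variance, while dividing by n - 1 recovers it exactly in this setting.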
• Variance in R
var()
• Variance in Excel
=var()
• The problem with the sum of squares and the variance is that their units are squared (units^2); these are not the same units as the raw data or the mean
• For this reason, we never graph the sum of squares or the variance
• ______________________________
• The standard deviation ($s$) – the square root of the variance, i.e. of the summed squared deviations from the mean divided by the number of observations minus one
$$s = \sqrt{\frac{SS}{n-1}}$$
• The square root puts this measure of spread back into the units of the data
• You can graph the standard deviation
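A short Python sketch of this last step, again using the eight observations from the sum-of-squares table (Python and the variable names are assumptions of this example; the slides use R and Excel). `statistics.stdev` is the standard-library sample standard deviation and is included only as a cross-check.

```python
import math
import statistics

# The eight observations from the sum-of-squares table
observations = [0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0]

# Variance (n - 1 denominator), then its square root
variance = statistics.variance(observations)
standard_deviation = math.sqrt(variance)

# Library check: statistics.stdev() is the sample standard deviation
library_sd = statistics.stdev(observations)

print(round(standard_deviation, 3))   # prints: 0.324
```

Unlike the variance (0.105, in squared units), the standard deviation is in the same units as the observations and the mean.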
• The standard deviation (s) has a direct connection to the frequency distribution
• If the distribution is bell-shaped, about 68% of the data fall within ± 1 s of the mean
• Thus, if the data do not form a bell-shaped distribution, then s is less informative
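The roughly 68% figure can be checked for an idealized bell-shaped (normal) distribution. This sketch assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; it describes the theoretical curve, not any particular data set.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Fraction of the distribution lying within one standard deviation of the mean
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(f"{within_one_sd:.1%}")   # prints: 68.3%
```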
• The standard deviation in R
sd()
• The standard deviation in Excel
=STDEV()
• A comment about rounding
• Carry as many digits as you can through intermediate calculations
• Round only at the end of your calculations
Coefficient of variation
• For many variables, standard deviation and mean
change together when different groups are
compared
• For example, elephants have greater mass than mice and also more variability in mass
• In this example, we would be concerned with the relative variation among individuals, not the variation that arises because individuals in one population are much larger than in the other
• If an elephant gained 10 g, it would mean little
• If a mouse gained 10 g, it would double its mass
• However, a 10% mass gain for an elephant would be more comparable to a 10% mass gain in a mouse
• In circumstances like this we calculate the
coefficient of variation (CV), which is the standard
deviation as a percentage of the mean:
$$CV = \frac{s}{\bar{Y}} \times 100$$
• A larger CV means more variability, whereas a smaller CV means less
• CV also works well if one wishes to compare the
variability of traits which are in different units
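A minimal sketch of the CV calculation in Python. The masses below are hypothetical values invented for illustration (they are not data from the lecture), and `coefficient_of_variation` is a helper name assumed for this example.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical masses, in different units on purpose
mouse_mass_g = [18.0, 20.0, 22.0, 19.0, 21.0]
elephant_mass_kg = [4800.0, 5200.0, 5000.0, 5500.0, 4500.0]

sd_mouse = statistics.stdev(mouse_mass_g)          # about 1.6 g
sd_elephant = statistics.stdev(elephant_mass_kg)   # about 381 kg

cv_mouse = coefficient_of_variation(mouse_mass_g)
cv_elephant = coefficient_of_variation(elephant_mass_kg)

print(f"mouse CV = {cv_mouse:.1f}%, elephant CV = {cv_elephant:.1f}%")
# prints: mouse CV = 7.9%, elephant CV = 7.6%
```

In these made-up numbers the standard deviations differ enormously and are not even in the same units, yet the CVs show that relative variability is similar in the two groups.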
Median and interquartile range
• In addition to the mean, the median also provides a measure of location in a frequency distribution
• The median is simply the middle observation when the data are sorted
• 1, 2, 3, 4, 5; 3 is the median in this example
• If the number of observations is even, the median falls between two numbers and is the average of those two numbers
• 1.2, 1.3, 2.5, 5.5; the median is (1.3+2.5)/2 = 1.9
• In R
median()
• In Excel
=median()
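For illustration, the median rule can also be written out by hand. The sketch below is in Python (an assumption; the slides use R and Excel), `median_by_hand` is a hypothetical helper, and the inputs are the two examples from the slide, deliberately shuffled to show that sorting comes first.

```python
import statistics


def median_by_hand(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = [4, 1, 5, 3, 2]          # 1, 2, 3, 4, 5 once sorted
even_example = [5.5, 1.2, 2.5, 1.3]    # 1.2, 1.3, 2.5, 5.5 once sorted

median_odd = median_by_hand(odd_example)     # 3
median_even = median_by_hand(even_example)   # (1.3 + 2.5) / 2, i.e. about 1.9

# The standard-library function gives the same answers
library_odd = statistics.median(odd_example)
library_even = statistics.median(even_example)
```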
• Quartiles are values that partition the data into
quarters
• Interquartile range (IQR) = 3rd quartile – 1st quartile
• The 2nd quartile is the median
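A sketch of quartiles and the IQR in Python for the eight observations from the sum-of-squares table. Quartiles can be computed under several slightly different conventions, so other software may return slightly different values; the `method="inclusive"` choice below is an assumption of this example, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

# The eight observations from the sum-of-squares table
observations = [0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0]

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = statistics.quantiles(observations, n=4, method="inclusive")

iqr = q3 - q1   # interquartile range: 3rd quartile minus 1st quartile

# The 2nd quartile is the median
median = statistics.median(observations)

print(f"Q1 = {q1:.2f}, median = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
# prints: Q1 = 1.20, median = 1.35, Q3 = 1.45, IQR = 0.25
```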
When to use which measure of
location and spread
• Measures of location should fall nearest to the central tendency
[Figure: Comparison between the median and the mean using the frequency distribution for the Mm genotype. The two different colors represent the two halves of the distribution.]
• The mean is more sensitive to outliers than the median
[Figure: Sensitivity of the mean to extreme observations (outliers) using the frequency distribution of the MM genotype. The two different colors represent the two halves of the distribution.]
• Because the standard deviation is calculated with deviations from the mean, it will likewise be susceptible to outliers
$$s = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$
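The effect of a single outlier can be illustrated with a short Python sketch. The eight observations are those from the sum-of-squares table; the extra value of 20.0 is a hypothetical outlier added only for this example and is not part of the lecture data.

```python
import statistics

# The eight observations from the sum-of-squares table
observations = [0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0]

# Same data plus one hypothetical extreme observation
with_outlier = observations + [20.0]

mean_without = statistics.mean(observations)       # about 1.375
median_without = statistics.median(observations)   # about 1.35
sd_without = statistics.stdev(observations)

mean_with = statistics.mean(with_outlier)          # 31 / 9, about 3.44
median_with = statistics.median(with_outlier)      # 1.4
sd_with = statistics.stdev(with_outlier)

print(f"mean:   {mean_without:.3f} -> {mean_with:.3f}")
print(f"median: {median_without:.3f} -> {median_with:.3f}")
print(f"sd:     {sd_without:.3f} -> {sd_with:.3f}")
```

The mean and the standard deviation both shift substantially, while the median barely moves.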

• Why do we care?
• Our measures of location should represent where most of the data are
• Thus, without a good measure of location we can make few if any generalizations about the data or the unknown population parameters

Proportions
• Proportion is the most important descriptive statistic for categorical variables
$$\hat{p} = \frac{\text{Number in category}}{n}$$
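• The proportion ($\hat{p}$) has properties in common with the arithmetic mean: if each observation is scored 1 when it falls in the category and 0 otherwise, the mean of those scores equals $\hat{p}$
$$\hat{p} = \frac{\text{Number in category}}{n} \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$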
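One property the proportion shares with the mean can be sketched in Python: scoring each observation 1 if it is in the category and 0 otherwise, the mean of the scores equals the proportion. The genotype list below is hypothetical data made up for this example.

```python
# Hypothetical genotype observations (not data from the lecture)
genotypes = ["MM", "Mm", "Mm", "mm", "Mm", "MM", "Mm", "mm"]

n = len(genotypes)

# Proportion: number in the category divided by n
number_in_category = genotypes.count("Mm")
p_hat = number_in_category / n

# The same value obtained as a mean of 0/1 scores
scores = [1 if g == "Mm" else 0 for g in genotypes]
mean_of_scores = sum(scores) / n

print(p_hat, mean_of_scores)   # prints: 0.5 0.5
```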
