Measures of Variation
OBJECTIVES:
1. Define measures of variation
2. Know and understand the different measures of variation
3. Know the formula for each measure
4. Provide examples and show how each measure is calculated
A. What are Measures of Variation?
There are four frequently used measures of variability: the range, interquartile range,
variance, and standard deviation. In the next few paragraphs, we will look at each of these
four measures of variability in more detail.
The Range
The range is the simplest measure of variability to calculate, and one you have probably
encountered many times in your life. The range is simply the highest score minus the
lowest score. Let’s take a few examples. What is the range of the following group of
numbers: 10, 2, 5, 6, 7, 3, 4? Well, the highest number is 10, and the lowest number is 2, so
10 - 2 = 8. The range is 8. Let’s take another example. Here’s a dataset with 10 numbers: 99,
45, 23, 67, 45, 91, 82, 78, 62, 51. What is the range? The highest number is 99 and the
lowest number is 23, so 99 - 23 equals 76; the range is 76. Now consider the two quizzes
shown in Figure 1. On Quiz 1, the lowest score is 5 and the highest score is 9. Therefore, the
range is 4. The range on Quiz 2 was larger: the lowest score was 4 and the highest score
was 10. Therefore the range is 6.
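The two worked examples above can be checked with a short Python sketch. The function name `value_range` is chosen here for illustration and is not part of the original lesson.

```python
def value_range(values):
    """Return the range: the highest value minus the lowest value."""
    return max(values) - min(values)


first = [10, 2, 5, 6, 7, 3, 4]
second = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]

print(value_range(first))   # 10 - 2 = 8
print(value_range(second))  # 99 - 23 = 76
```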
The Interquartile Range
Imagine all the data in a set as points on a number line. For example, if you have 3, 7
and 28 in your set of data, imagine them as points on a number line that is centered on 0
but stretches infinitely in both directions. Once plotted on that number line, the smallest
data point and the biggest data point in the set create the boundaries of an interval that
contains all data points in the set. The interquartile range (IQR) is the width of the part of
that interval that holds the middle 50% of the data points: the third quartile (Q3) minus
the first quartile (Q1). The IQR measures how spread out the data points are around the
median of the data set. The higher the IQR, the more spread out the data points; in
contrast, the smaller the IQR, the more bunched up the data points are around the median.
The IQR is one of many measurements of how spread out the data points in a data set
are. It is best used with other measurements such as the median and total range to build a
complete picture of a data set’s tendency to cluster around its center.
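As a rough illustration, the IQR can be computed as the third quartile minus the first quartile. Textbooks and software use several slightly different quartile conventions, so the `method="inclusive"` setting below is an assumption made for this sketch, and other conventions can give a slightly different answer for small data sets. The data are the seven numbers from the first range example.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return Q3 - Q1 using Python's 'inclusive' quartile convention."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [10, 2, 5, 6, 7, 3, 4]
iqr = interquartile_range(data)
print(iqr)  # Q1 = 3.5, Q3 = 6.5, so the IQR is 3.0
```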
Variance
Variance measures how far a data set is spread out. The technical definition is “the average of the
squared differences from the mean,” but all it really does is give you a very general idea of the spread of
your data. A value of zero means that there is no variability: all the numbers in the data set are the same.
The data set 12, 12, 12, 12, 12 has a variance of zero (the numbers are identical).
The data set 12, 12, 12, 12, 13 has a sample variance of 0.2; a small change in the numbers equals a very
small variance.
The data set 12, 12, 12, 12, 13013 has a sample variance of about 33,805,200; a large change in the
numbers equals a very large variance.
The variance is calculated in three steps:
1. Find the mean of the data set.
2. Subtract the mean from each number in the data set and then square the result.
The results are squared to make the negatives positive. Otherwise negative numbers
would cancel out the positives in the next step. It’s the distance from the mean that’s
important, not positive or negative numbers.
3. Average the squared differences.
Using the mean as the measure of the middle of the distribution, the variance is
defined as the average squared difference of the scores from the mean. The data from Quiz
1 are shown in Table 1. The mean score is 7.0. Therefore, the column "Deviation from
Mean" contains the score minus 7. The column "Squared Deviation" is simply the previous
column squared.
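Table 1. Quiz 1 scores, deviations from the mean, and squared deviations.
Score    Deviation from Mean    Squared Deviation
  9              2                      4
  9              2                      4
  9              2                      4
  8              1                      1
  8              1                      1
  8              1                      1
  8              1                      1
  7              0                      0
  7              0                      0
  7              0                      0
  7              0                      0
  7              0                      0
  6             -1                      1
  6             -1                      1
  6             -1                      1
  6             -1                      1
  6             -1                      1
  6             -1                      1
  5             -2                      4
  5             -2                      4
Means
  7              0                      1.5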
One thing that is important to notice is that the mean deviation from the mean is 0.
This will always be the case. The mean of the squared deviations is 1.5. Therefore, the
variance is 1.5. Analogous calculations with Quiz 2 show that its variance is 6.7. The
formula for the variance is:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
where σ² is the variance, μ is the mean, and N is the number of numbers. For Quiz 1, μ = 7
and N = 20.
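The Quiz 1 calculation can be reproduced with a short sketch. The list `quiz1` is built from the 20 scores in Table 1, and the function name `population_variance` is chosen here for illustration.

```python
from statistics import mean

# The 20 Quiz 1 scores from Table 1.
quiz1 = [9] * 3 + [8] * 4 + [7] * 5 + [6] * 6 + [5] * 2


def population_variance(scores):
    """Average squared deviation from the mean (N in the denominator)."""
    mu = mean(scores)
    return sum((x - mu) ** 2 for x in scores) / len(scores)


mu = mean(quiz1)
deviations = [x - mu for x in quiz1]

print(mu)                          # mean score: 7
print(sum(deviations))             # deviations from the mean sum to 0
print(population_variance(quiz1))  # 30 / 20 = 1.5
```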
If the variance in a sample is used to estimate the variance in a population, then the
previous formula underestimates the variance and the following formula should be used:
$$s^2 = \frac{\sum (X - M)^2}{N - 1}$$
where s² is the estimate of the variance and M is the sample mean.
Note that M is the mean of a sample taken from a population with a mean of μ. Since, in
practice, the variance is usually computed in a sample, this formula is most often used. The
simulation "estimating variance" illustrates the bias in the formula with N in the
denominator.
Let's take a concrete example. Assume the scores 1, 2, 4, and 5 were sampled from a
larger population. To estimate the variance in the population you would compute s² as
follows:
M = (1 + 2 + 4 + 5)/4 = 12/4 = 3.
s² = [(1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)²]/(4 - 1) = (4 + 1 + 1 + 4)/3 = 10/3 ≈ 3.333.
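A minimal sketch of this example, comparing the two denominators on the scores 1, 2, 4, and 5. The variable names are illustrative; `statistics.variance` from the Python standard library is used only as a cross-check, since it divides by N - 1.

```python
import statistics

scores = [1, 2, 4, 5]

m = sum(scores) / len(scores)                    # sample mean M = 3.0
sum_sq_dev = sum((x - m) ** 2 for x in scores)   # 4 + 1 + 1 + 4 = 10.0

divide_by_n = sum_sq_dev / len(scores)           # 10 / 4 = 2.5 (too small on average)
divide_by_n_minus_1 = sum_sq_dev / (len(scores) - 1)  # 10 / 3, about 3.333

print(m, divide_by_n, round(divide_by_n_minus_1, 3))
print(statistics.variance(scores))  # same N - 1 estimate from the standard library
```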
Why do we divide by n - 1 instead of by n? Since μ is unknown and is estimated by the
sample mean ȳ, the observations yᵢ tend to be closer to ȳ than to μ. To compensate, we
divide by a smaller number, n - 1. The sample variance (and therefore the sample standard
deviation) is the common default calculation used by software. When asked to calculate the
variance or standard deviation of a set of data, assume, unless otherwise instructed, that it
is sample data and calculate the sample variance and sample standard deviation.
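For example, let's find s² for the data set from vending machine A: 1, 2, 3, 3, 4, 5.
The sample mean is 18/6 = 3, so
s² = [(1 - 3)² + (2 - 3)² + (3 - 3)² + (3 - 3)² + (4 - 3)² + (5 - 3)²]/(6 - 1) = 10/5 = 2,
and the sample standard deviation is s = √2 ≈ 1.414.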
There are alternate formulas that can be easier to use if you are doing your calculations
with a hand calculator. You should note that these formulas are subject to rounding error if
your values are very large and/or you have an extremely large number of observations.
$$s^2 = \frac{\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}}{n - 1}$$
and
$$s^2 = \frac{\sum y_i^2 - n\bar{y}^2}{n - 1}$$
Standard Deviation
The standard deviation is a measure that summarizes the amount by which every
value within a dataset varies from the mean. It is the square root of the variance.
Effectively it indicates how tightly the values in the dataset are bunched around the mean
value. It is the most widely used measure of dispersion since, unlike the range and
interquartile range, it takes into account every value in the dataset. When the values in a
dataset are pretty tightly bunched together the standard deviation is small. When the
values are spread apart the standard deviation will be relatively large. The standard
deviation is usually presented in conjunction with the mean and is measured in the same
units.
In many datasets the values deviate from the mean value due to chance and such
datasets are said to display a normal distribution. In a dataset with a normal distribution
most of the values are clustered around the mean while relatively few values tend to be
extremely high or extremely low. Many natural phenomena display a normal distribution.
For datasets that have a normal distribution the standard deviation can be used to
determine the proportion of values that lie within a particular range of the mean value. For
such distributions, approximately 68% of values are less than one standard deviation
(1SD) away from the mean value, approximately 95% of values are less than two standard
deviations (2SD) away from the mean, and approximately 99.7% of values are less than
three standard deviations (3SD) away from the mean. Figure 3 shows this concept in
diagrammatical form.
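If the mean of a normally distributed dataset is 25 and its standard deviation is 1.6, then:
about 68% of the values lie between 23.4 (25 - 1.6) and 26.6 (25 + 1.6);
about 95% of the values lie between 21.8 (25 - 3.2) and 28.2 (25 + 3.2);
about 99.7% of the values lie between 20.2 (25 - 4.8) and 29.8 (25 + 4.8).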
Population and sample standard deviations
There are two different calculations for the Standard Deviation. Which formula you
use depends upon whether the values in your dataset represent an entire population or
whether they form a sample of a larger population. For example, if all student users of the
library were asked how many books they had borrowed in the past month then the entire
population has been studied since all the students have been asked. In such cases the
population standard deviation should be used. Sometimes it is not possible to find
information about an entire population and it might be more realistic to ask a sample of
150 students about their library borrowing and use these results to estimate library
borrowing habits for the entire population of students. In such cases the sample standard
deviation should be used.
Whilst it is not necessary to learn the formula for calculating the standard deviation,
there may be times when you wish to include it in a report or dissertation. The population
standard deviation is:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
where x represents each value in the population, μ is the mean value of the population, Σ is
the summation (or total), and N is the number of values in the population. The sample
standard deviation is:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
where x represents each value in the sample, x̄ is the mean value of the sample, Σ is the
summation (or total), and n - 1 is the number of values in the sample minus 1.
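To see how much the choice of formula matters, the sketch below applies both calculations to the vending machine A data (1, 2, 3, 3, 4, 5) using `pstdev` and `stdev` from Python's standard `statistics` module. Treating the same six values first as a whole population and then as a sample is done purely for comparison.

```python
from statistics import pstdev, stdev

machine_a = [1, 2, 3, 3, 4, 5]

population_sd = pstdev(machine_a)  # divides by N = 6:     sqrt(10 / 6)
sample_sd = stdev(machine_a)       # divides by n - 1 = 5: sqrt(10 / 5)

print(round(population_sd, 3))  # 1.291
print(round(sample_sd, 3))      # 1.414
```

The sample figure is the larger of the two because its denominator is smaller.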
Shortcut Method for Calculating the Standard Deviation
Instead of using the formula for calculating the variance and standard deviation that
involves comparing each observation to the mean, there is a shortcut method to calculating
the variance and standard deviation. This shortcut method is as follows:
For example, recall the data results for Vending Machine A at the beginning of this
lesson: 1, 2, 3, 3, 4, and 5. We calculated the variance to be 2 and the standard deviation to
be 1.414. Using the shortcut method:
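1. Add up the values: 1 + 2 + 3 + 3 + 4 + 5 = 18
2. Square the sum: 18*18 = 324
3. Divide by the number of values: 324/6 = 54
4. Square each value: 1, 4, 9, 9, 16, and 25
5. Add up the squares: 1 + 4 + 9 + 9 + 16 + 25 = 64
6. Subtract the result of step 3 from the result of step 5: 64 - 54 = 10
7. Divide by the number of values minus 1 to get the variance: 10/5 = 2
8. Take the square root to get the standard deviation: the square root of 2 equals 1.414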
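The eight steps above can be sketched as a small function. The name `shortcut_variance` is chosen here for illustration. As with any one-pass formula of this kind, subtracting two large, nearly equal numbers can lose precision in floating point when the values are very large.

```python
import math


def shortcut_variance(values):
    """Sample variance via the shortcut: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)                          # step 1
    sum_of_squares = sum(x * x for x in values)  # steps 4 and 5
    return (sum_of_squares - total ** 2 / n) / (n - 1)  # steps 2, 3, 6, 7


machine_a = [1, 2, 3, 3, 4, 5]
variance = shortcut_variance(machine_a)
standard_deviation = math.sqrt(variance)         # step 8

print(variance)                      # (64 - 54) / 5 = 2.0
print(round(standard_deviation, 3))  # 1.414
```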
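Coefficient of Variation
The coefficient of variation (CV) expresses the standard deviation as a fraction of the
mean, CV = s / x̄ (often reported as a percentage), so that the spread of data sets
measured on different scales can be compared.
Sum of Squares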
The residual sum of squares is used to help you decide if a statistical model is a good
fit for your data. It measures the overall difference between your data and the values
predicted by your estimation model (a “residual” is a measure of the distance from a data
point to a regression line).
The total sum of squares (Total SS) is related to the explained sum and the residual sum by
the following formula:
$$\text{Total SS} = \text{Explained SS} + \text{Residual SS}$$
The Total SS tells you how much variation there is in the dependent variable.
Other times you might see actual “squares”, like in this regression line:
So the square shapes you see on regression lines are just representations of square
numbers, like 5² or 9². When you’re looking for a sum of squares, use the formula
$$\sum (y_i - \bar{y})^2$$
to find the actual number that represents a sum of squares. A diagram (like the regression
line above) is optional, and can supply a visual representation of what you’re calculating.
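Sample Question
Find the sum of squares for the numbers 3, 5, and 7.
Step 1: Find the mean by adding the numbers together and dividing by the number of items
in the set:
(3 + 5 + 7) / 3 = 15 / 3 = 5
Step 2: Subtract the mean from each number:
3 - 5 = -2, 5 - 5 = 0, 7 - 5 = 2
Step 3: Square each difference:
4, 0, 4
Step 4: Add the squares together:
4 + 0 + 4 = 8
The sum of squares is 8.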
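A sum of squares for the set 3, 5, 7 can be sketched as follows: find the mean, subtract it from each value, square the differences, and add them up. The function name `sum_of_squares` is illustrative only.

```python
def sum_of_squares(values):
    """Sum of squared differences between each value and the mean."""
    mean_value = sum(values) / len(values)
    return sum((x - mean_value) ** 2 for x in values)


numbers = [3, 5, 7]
print(sum(numbers) / len(numbers))  # mean: 5.0
print(sum_of_squares(numbers))      # (-2)^2 + 0^2 + 2^2 = 8.0
```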
The Explained SS tells you how much of the variation in the dependent variable your model
explained.
The residual sum of squares tells you how much of the dependent variable’s variation your
model did not explain. It is the sum of the squared differences between the actual Y and
the predicted Y:
$$\text{Residual SS} = \sum (y_i - \hat{y}_i)^2$$
Uses
The smaller the residual sum of squares, the better your model fits your data; the
greater the residual sum of squares, the poorer your model fits your data. A value of zero
means your model is a perfect fit. One major use is in finding the coefficient of
determination (R²). The coefficient of determination is the ratio of the explained sum of
squares to the total sum of squares:
$$R^2 = \frac{\text{Explained SS}}{\text{Total SS}}$$
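The three sums of squares and the coefficient of determination can be illustrated with a small least-squares line fit. The data points below are hypothetical, invented only for this sketch, and the helper name `fit_line` is likewise an assumption. The split of the total into explained plus residual holds here because the line is fitted by ordinary least squares with an intercept.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
mean_y = sum(ys) / len(ys)

total_ss = sum((y - mean_y) ** 2 for y in ys)                     # variation in Y
explained_ss = sum((p - mean_y) ** 2 for p in predicted)          # captured by the line
residual_ss = sum((y - p) ** 2 for y, p in zip(ys, predicted))    # left unexplained
r_squared = explained_ss / total_ss

print(round(total_ss, 3), round(explained_ss, 3), round(residual_ss, 3))  # 6.0 3.6 2.4
print(round(r_squared, 3))  # 0.6
```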
The Empirical Rule
The empirical rule states that for a normal distribution, nearly all of the data will fall within
three standard deviations of the mean. The empirical rule can be broken down into three
parts:
68% of data falls within one standard deviation of the mean.
95% falls within two standard deviations.
99.7% falls within three standard deviations.
Approximately 68% of the data falls within one standard deviation of the mean (that is,
between the mean - 1 times the standard deviation and the mean + 1 times the
standard deviation). In mathematical notation, this is represented as: μ ± 1σ
Approximately 95% of the data falls within two standard deviations of the mean (that is,
between the mean - 2 times the standard deviation and the mean + 2 times the
standard deviation). The mathematical notation for this is: μ ± 2σ
Approximately 99.7% of the data falls within three standard deviations of the mean
(that is, between the mean - 3 times the standard deviation and the mean + 3 times
the standard deviation). The following notation is used to represent this fact: μ ± 3σ
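The three percentages can be checked numerically with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). The helper name `share_within` is chosen here for illustration, and the default mean of 0 and standard deviation of 1 are arbitrary, since the shares are the same for any normal distribution.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Share of a normal distribution lying between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```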