
Name: Jonathan L.

Factor Course/Year: BSME-2


Date Due: February 5, 2018 Date Submitted: February 5, 2018

TOPIC 2.1: MEASURES OF VARIATIONS

OBJECTIVES:
1. Define measures of variation.
2. Know and understand the different measures of variation.
3. Know the formula for each measure.
4. Provide examples and show how each measure is calculated.

REFERENCES:
Websites:
1. http://www.encyclopedia.com/computing/dictionaries-thesauruses-pictures-and-press-releases/measures-variation

2. http://www.statisticshowto.com/measures-variation/

3. https://people.richland.edu/james/lecture/m170/ch03-var.html

4. https://onlinecourses.science.psu.edu/stat500/node/13

5. https://www2.le.ac.uk/offices/ld/resources/numerical-data/variability

6. https://Google.com

A. What are Measures of Variation?

Measures of variation are quantities that express the amount of variation in a random
variable (compare measures of location). Variation is sometimes described
as spread or dispersion to distinguish it from systematic trends or differences. Measures of
variation are either properties of a probability distribution or sample estimates of them.

B. Different Measures of Variations

 There are four frequently used measures of variability: the range, interquartile range,
variance, and standard deviation. In the next few paragraphs, we will look at each of these
four measures of variability in more detail.

The Range

The range is the simplest measure of variability to calculate, and one you have probably
encountered many times in your life. The range is simply the highest score minus the
lowest score. Let’s take a few examples. What is the range of the following group of
numbers: 10, 2, 5, 6, 7, 3, 4? Well, the highest number is 10, and the lowest number is 2, so
10 - 2 = 8. The range is 8. Let’s take another example. Here’s a dataset with 10 numbers: 99,
45, 23, 67, 45, 91, 82, 78, 62, 51. What is the range? The highest number is 99 and the
lowest number is 23, so 99 - 23 equals 76; the range is 76. Now consider the two quizzes
shown in Figure 1. On Quiz 1, the lowest score is 5 and the highest score is 9. Therefore, the
range is 4. The range on Quiz 2 was larger: the lowest score was 4 and the highest score
was 10. Therefore the range is 6.
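
The two worked examples above can be checked with a short Python sketch. The function name `value_range` is only an illustrative choice, not something from the references.

```python
def value_range(scores):
    """Return the range: the highest score minus the lowest score."""
    return max(scores) - min(scores)


first = [10, 2, 5, 6, 7, 3, 4]
second = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]

print(value_range(first))   # 8
print(value_range(second))  # 76
```

Because only the two extreme values are used, a single unusual score can change the range a great deal.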

The Interquartile Range

The interquartile range is a measure of where the “middle fifty” percent of the values lie in a data set.
Where a range is a measure of where the beginning and end are in a set, an interquartile
range is a measure of where the bulk of the values lie. That is why it is often preferred over
the full range when reporting things like school performance or SAT scores.

The interquartile range formula is the first quartile subtracted from the third quartile:


IQR = Q3 – Q1.
Where Q3 is the upper quartile and Q1 is the lower quartile.

Imagine all the data in a set as points on a number line. For example, if you have 3, 7
and 28 in your set of data, imagine them as points on a number line that is centered on 0
but stretches both infinitely below zero and infinitely above zero. Once plotted on that
number line, the smallest data point and the biggest data point in the set of data create the
boundaries of an interval of space on the number line that contains all data points in the
set. The interquartile range (IQR) is the length of the part of that interval, from Q1 to Q3, that
holds the middle 50% of the data points.
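
As a rough sketch, IQR = Q3 – Q1 can be computed with Python's standard `statistics` module. Textbooks and software use slightly different rules for locating quartiles, so the `method="inclusive"` setting below is an assumption of this sketch, and other conventions can give a somewhat different value. The data are the ten numbers from the range example.

```python
import statistics


def interquartile_range(values):
    """Return Q3 - Q1 using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


scores = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]
iqr = interquartile_range(scores)
print(iqr)  # 34.5 (Q1 = 46.5, Q3 = 81.0 under this convention)
```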

What is an Interquartile Range Used For?

The IQR is used to measure how spread out the data points in a set are around the median of
the data set. The higher the IQR, the more spread out the data points; in contrast, the
smaller the IQR, the more bunched up the data points are around the median. The IQR
is one of many measurements used to measure how spread out the data points in a data set
are. It is best used with other measurements such as the median and total range to build a
complete picture of a data set’s tendency to cluster around its center.

Variance

Variance measures how far a data set is spread out. The technical definition is “the average of the
squared differences from the mean,” and it gives you a general idea of the spread of
your data. A value of zero means that there is no variability; all the numbers in the data set are the same.
• The data set 12, 12, 12, 12, 12 has a variance of zero (the numbers are identical).
• The data set 12, 12, 12, 12, 13 has a sample variance of 0.2; a small change in the numbers equals a very small variance.
• The data set 12, 12, 12, 12, 13013 has a sample variance of about 33,805,200; a large change in the numbers equals a very
large variance.

How to Calculate the Variance?

The variance for a population is calculated by:

1. Finding the mean (the average).

2. Subtracting the mean from each number in the data set and then squaring the result.
The results are squared to make the negatives positive. Otherwise negative numbers
would cancel out the positives in the next step. It’s the distance from the mean that’s
important, not positive or negative numbers.

3. Averaging the squared differences.

Using the mean as the measure of the middle of the distribution, the variance is
defined as the average squared difference of the scores from the mean. The data from Quiz
1 are shown in Table 1. The mean score is 7.0. Therefore, the column "Deviation from
Mean" contains the score minus 7. The column "Squared Deviation" is simply the previous
column squared.

Table 1. Calculation of Variance for Quiz 1 scores.

Scores    Deviation from Mean    Squared Deviation
  9               2                      4
  9               2                      4
  9               2                      4
  8               1                      1
  8               1                      1
  8               1                      1
  8               1                      1
  7               0                      0
  7               0                      0
  7               0                      0
  7               0                      0
  7               0                      0
  6              -1                      1
  6              -1                      1
  6              -1                      1
  6              -1                      1
  6              -1                      1
  6              -1                      1
  5              -2                      4
  5              -2                      4

Means
  7               0                      1.5

One thing that is important to notice is that the mean deviation from the mean is 0.
This will always be the case. The mean of the squared deviations is 1.5. Therefore, the
variance is 1.5. Analogous calculations with Quiz 2 show that its variance is 6.7. The
formula for the variance is:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Where σ² is the variance, μ is the mean, and N is the number of numbers. For Quiz 1, μ = 7
and N = 20.
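
The three steps for the population variance can be sketched in Python and checked against the Quiz 1 scores in Table 1. The helper name `population_variance` is illustrative, and `statistics.pvariance` from the standard library is used only as a cross-check.

```python
import statistics

# Quiz 1 scores from Table 1: three 9s, four 8s, five 7s, six 6s, two 5s.
quiz1 = [9] * 3 + [8] * 4 + [7] * 5 + [6] * 6 + [5] * 2


def population_variance(scores):
    """Average of the squared deviations from the mean (divides by N)."""
    mean = sum(scores) / len(scores)                          # step 1
    squared_deviations = [(x - mean) ** 2 for x in scores]    # step 2
    return sum(squared_deviations) / len(scores)              # step 3


var_q1 = population_variance(quiz1)
print(var_q1)                       # 1.5
print(statistics.pvariance(quiz1))  # 1.5
```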

If the variance in a sample is used to estimate the variance in a population, then the
previous formula underestimates the variance and the following formula should be used:

$$s^2 = \frac{\sum (X - M)^2}{N - 1}$$

Where s² is the estimate of the variance and M is the sample mean.

Note that M is the mean of a sample taken from a population with a mean of μ. Since, in
practice, the variance is usually computed in a sample, this formula is most often used. The
simulation "estimating variance" illustrates the bias in the formula with N in the
denominator.

Let's take a concrete example. Assume the scores 1, 2, 4, and 5 were sampled from a
larger population. To estimate the variance in the population you would compute s² as
follows:

M = (1 + 2 + 4 + 5)/4 = 12/4 = 3

s² = [(1-3)² + (2-3)² + (4-3)² + (5-3)²]/(4-1)

   = (4 + 1 + 1 + 4)/3 = 10/3 = 3.333
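
The same estimate can be reproduced with a minimal Python sketch. The function name `sample_variance` is an illustrative choice, and `statistics.variance`, which also divides by one less than the number of values, is used as a cross-check.

```python
import statistics

sample = [1, 2, 4, 5]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by N - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


s2 = sample_variance(sample)
print(round(s2, 3))                           # 3.333
print(round(statistics.variance(sample), 3))  # 3.333
```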

Why do we divide by n - 1 instead of by n? Since μ is unknown and estimated by ȳ, the values y_i
tend to be closer to ȳ than to μ. To compensate, we divide by a smaller number, n - 1. The
sample variance (and therefore the sample standard deviation) is the common default
calculation used by software. When asked to calculate the variance or standard deviation
of a set of data, assume, unless otherwise instructed, that this is sample data and therefore
calculate the sample variance and sample standard deviation.

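For example, let's find s² for the data set from vending machine A: 1, 2, 3, 3, 4, 5

ȳ = (1 + 2 + 3 + 3 + 4 + 5)/6 = 18/6 = 3

s² = [(1-3)² + (2-3)² + (3-3)² + (3-3)² + (4-3)² + (5-3)²]/(6-1) = 10/5 = 2

There are alternate formulas that can be easier to use if you are doing your calculations
with a hand calculator. You should note that these formulas are subject to rounding error if
your values are very large and/or you have an extremely large number of observations.

$$s^2 = \frac{\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}}{n - 1}$$

and

$$s = \sqrt{\frac{\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}}{n - 1}}$$

For this example,

s² = (64 - 18²/6)/(6-1) = (64 - 54)/5 = 2 and s = √2 ≈ 1.414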

The Standard Deviation

The standard deviation is a measure that summarizes the amount by which every
value within a dataset varies from the mean. Effectively it indicates how tightly the values
in the dataset are bunched around the mean value. It is the most widely used
measure of dispersion since, unlike the range and inter-quartile range, it takes into account
every value in the dataset. When the values in a dataset are pretty tightly bunched
together the standard deviation is small. When the values are spread apart the standard
deviation will be relatively large. The standard deviation is usually presented in
conjunction with the mean and is measured in the same units.

In many datasets the values deviate from the mean value due to chance and such
datasets are said to display a normal distribution. In a dataset with a normal distribution
most of the values are clustered around the mean while relatively few values tend to be
extremely high or extremely low. Many natural phenomena display a normal distribution.

For datasets that have a normal distribution the standard deviation can be used to
determine the proportion of values that lie within a particular range of the mean value. For
such distributions, approximately 68% of values are less than one standard
deviation (1SD) away from the mean value, approximately 95% of values are less than two standard
deviations (2SD) away from the mean, and approximately 99.7% of values are less than three standard
deviations (3SD) away from the mean. Figure 3 shows this concept in diagrammatical form.

 
If the mean of a dataset is 25 and its standard deviation is 1.6, then

• about 68% of the values in the dataset will lie between MEAN-1SD (25-1.6=23.4)
  and MEAN+1SD (25+1.6=26.6)
• about 95% of the values will lie between MEAN-2SD (25-3.2=21.8)
  and MEAN+2SD (25+3.2=28.2)
• about 99.7% of the values will lie between MEAN-3SD (25-4.8=20.2)
  and MEAN+3SD (25+4.8=29.8).

If the dataset had the same mean of 25 but a larger standard deviation (for example, 2.3) it
would indicate that the values were more dispersed. The frequency distribution for a
dispersed dataset would still show a normal distribution but when plotted on a graph the
shape of the curve will be flatter as in figure 4.
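
The intervals in the example with a mean of 25 and a standard deviation of 1.6 can be reproduced with a few lines of Python. The helper `sd_interval` is a hypothetical name used only for this sketch.

```python
def sd_interval(mean, sd, k):
    """Return the interval from mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)


for k in (1, 2, 3):
    low, high = sd_interval(25, 1.6, k)
    # Rounded to one decimal place to hide floating-point noise.
    print(k, round(low, 1), round(high, 1))
# 1 23.4 26.6
# 2 21.8 28.2
# 3 20.2 29.8
```

Repeating the call with a standard deviation of 2.3 gives wider intervals around the same mean, which matches the flatter curve described above.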

 
Population and sample standard deviations

There are two different calculations for the Standard Deviation. Which formula you
use depends upon whether the values in your dataset represent an entire population or
whether they form a sample of a larger population. For example, if all student users of the
library were asked how many books they had borrowed in the past month then the entire
population has been studied since all the students have been asked. In such cases the
population standard deviation should be used. Sometimes it is not possible to find
information about an entire population and it might be more realistic to ask a sample of
150 students about their library borrowing and use these results to estimate library
borrowing habits for the entire population of students. In such cases the sample standard
deviation should be used.

Formulae for the standard deviation

Whilst it is not necessary to learn the formula for calculating the standard deviation,
there may be times when you wish to include it in a report or dissertation.

The standard deviation of an entire population is known as σ (sigma) and is calculated
using:

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Where x represents each value in the population, μ is the mean value of the population, Σ is
the summation (or total), and N is the number of values in the population.

The standard deviation of a sample is known as S and is calculated using:

$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Where x represents each value in the sample, x̄ is the mean value of the sample, Σ is the
summation (or total), and n-1 is the number of values in the sample minus 1.

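
To see how the two calculations differ in practice, here is a small Python sketch using the standard `statistics` module. The borrowing counts are made-up numbers for illustration only and do not come from any real survey.

```python
import statistics

# Hypothetical number of books borrowed by six students (illustrative data).
borrowed = [2, 4, 4, 5, 7, 8]

# Treat the six values as the entire population (divide by N).
population_sd = statistics.pstdev(borrowed)

# Treat the six values as a sample of a larger population (divide by n - 1).
sample_sd = statistics.stdev(borrowed)

print(round(population_sd, 3))  # 2.0
print(round(sample_sd, 3))      # 2.191
```

The sample figure is always a little larger for the same data, because the sum of squared differences is divided by a smaller number.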
Shortcut Method for Calculating the Standard Deviation

Instead of using the formula for calculating the variance and standard deviation that
involves comparing each observation to the mean, there is a shortcut method to calculating
the variance and standard deviation.  This shortcut method is as follows:

1. Sum all the values in the data set.
2. Square this sum.
3. Divide this squared sum by the total number of observations, n (call this the
average sum squared).
4. Square each value in the data set.
5. Sum these squared values (called the sum of squares).
6. Subtract the average sum squared from the sum of squares.
7. Divide this difference by n - 1; this is the variance.
8. Take the square root to get the standard deviation.

For example, recall the data results for Vending Machine A at the beginning of this
lesson: 1, 2, 3, 3, 4, and 5. We calculated the variance to be 2 and the standard deviation to
be 1.414.  Using the shortcut method:

1. 1 + 2 + 3 + 3 + 4 + 5 = 18
2. 18*18 = 324
3. 324/6 = 54
4. 1, 4, 9, 9, 16, and 25
5. 1 + 4 + 9 + 9 + 16 + 25 = 64
6. 64 - 54 = 10
7. 10/5 = 2
8. Square root of 2 equals 1.414
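
The eight steps of the shortcut method translate almost line for line into Python. This is only a sketch, with `shortcut_variance` as an illustrative name, checked against the Vending Machine A data.

```python
import math


def shortcut_variance(values):
    """Sample variance via the shortcut (sum of squares) method."""
    n = len(values)
    total = sum(values)                            # step 1
    average_sum_squared = total ** 2 / n           # steps 2 and 3
    sum_of_squares = sum(x ** 2 for x in values)   # steps 4 and 5
    difference = sum_of_squares - average_sum_squared  # step 6
    return difference / (n - 1)                    # step 7


machine_a = [1, 2, 3, 3, 4, 5]
variance = shortcut_variance(machine_a)
std_dev = math.sqrt(variance)                      # step 8

print(variance)           # 2.0
print(round(std_dev, 3))  # 1.414
```

As noted earlier for the alternate formulas, this approach subtracts two large numbers when the values are large, so it is more exposed to rounding error than working from the deviations directly.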

Coefficient of Variation

Above we considered three measures of variation: Range, Interquartile Range
(IQR), and Variance (and its square root counterpart, Standard Deviation). These are all
measures we can calculate from one quantitative variable, e.g. height or weight. But how can
we compare dispersion (i.e. variability) of data from two or more distinct populations that
have vastly different means? A popular statistic to use in such situations is the Coefficient
of Variation or CV. This is a unit-free statistic and one where the higher the value the
greater the dispersion. The calculation of CV is:

CV = Standard Deviation / Mean
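
A minimal Python sketch of the CV follows. The height and weight values are invented for illustration, and the sample standard deviation is assumed because the text does not say which version to use.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical measurements, invented for this example.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(round(cv_heights, 3))  # 0.093
print(round(cv_weights, 3))  # 0.226
```

Both made-up data sets have the same standard deviation, but the weights vary more relative to their mean, which is what the larger CV shows.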

Sum of Squares: Residual Sum, Total Sum, Explained Sum

The residual sum of squares is used to help you decide if a statistical model is a good
fit for your data. It measures the overall difference between your data and the values
predicted by your estimation model (a “residual” is a measure of the distance from a data
point to a regression line).
The total SS is related to the explained SS and the residual SS by the following formula:

Total SS = Explained SS + Residual Sum of Squares.

What is the Total Sum of Squares?

The Total SS tells you how much variation there is in the dependent variable.

Total SS = Σ(Y_i – mean of Y)²


Note: Sigma (Σ) is a mathematical term for summation or “adding up.” It’s telling you to
add up all the possible results from the rest of the equation.

Sum of squares is a measure of how a data set varies around a central number (like
the mean). You might realize by the phrase that you’re summing (adding up) squares, but
squares of what? You’ll sometimes see this formula:

$$SS = \sum (x_i - \bar{x})^2$$

Other times you might see actual “squares” drawn on a regression line. Squares of numbers,
as in 4² and 10², can be represented with actual geometric squares (figures not reproduced
here; image courtesy of UMBC.edu).

So the square shapes you see on regression lines are just representations of square
numbers, like 5² or 9². When you’re looking for a sum of squares, use the formula Σ(x_i – x̄)²
to find the actual number that represents a sum of squares. A diagram (like a regression
line with squares drawn on it) is optional, and can supply a visual representation of what you’re calculating.

Sample Question

Find the Sum of Sq. for the following numbers: 3, 5, 7.

Step 1: Find the mean by adding the numbers together and dividing by the number of items
in the set:

(3 + 5 + 7) / 3 = 15 / 3 = 5

Step 2: Subtract the mean from each of your data items:

3 – 5 = -2
5 – 5 = 0
7 – 5 = 2

Step 3: Square your results from Step 2:

-2 x -2 = 4
0 x 0 = 0
2 x 2 = 4

Step 4: Sum (add up) all of your numbers:

4 + 0 + 4 = 8.
That’s it!
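
The four steps of the sample question can be written as a short Python function. The name `sum_of_squares` is illustrative.

```python
def sum_of_squares(values):
    """Sum of the squared differences between each value and the mean."""
    mean = sum(values) / len(values)            # step 1
    deviations = [x - mean for x in values]     # step 2
    squares = [d ** 2 for d in deviations]      # step 3
    return sum(squares)                         # step 4


print(sum_of_squares([3, 5, 7]))  # 8.0
```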

What is the Explained Sum of Squares?

The Explained SS tells you how much of the variation in the dependent variable your model
explained.

Explained SS = Σ(Y-Hat – mean of Y)²


What is the Residual Sum of Squares?

The residual sum of squares tells you how much of the dependent variable’s variation your
model did not explain. It is the sum of the squared differences between the actual Y and
the predicted Y:

Residual Sum of Squares = Σ e²


If all those formulas look confusing, don’t worry! It’s very, very unusual for you to
want to use them. Finding the sum by hand is tedious and time-consuming. It involves
a lot of subtracting, squaring and summing. Your calculations will be prone to errors, so
you’re much better off using software like Excel to do the calculations. You won’t even need
to know the actual formulas, as Excel works them behind the scenes.

Uses

The smaller the residual sum of squares, the better your model fits your data; the
greater the residual sum of squares, the poorer your model fits your data. A value of zero
means your model is a perfect fit. One major use is in finding the coefficient of
determination (R²). The coefficient of determination is a ratio of the explained sum of
squares to the total sum of squares.
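
To make the three sums of squares concrete, the sketch below fits a straight line by ordinary least squares to a small made-up data set and then computes the total, explained, and residual sums of squares along with R². The data and the helper name `fit_line` are assumptions for illustration. The identity Total SS = Explained SS + Residual SS is expected to hold here because the line is a least-squares fit that includes an intercept.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple linear regression."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
mean_y = sum(ys) / len(ys)

total_ss = sum((y - mean_y) ** 2 for y in ys)
explained_ss = sum((p - mean_y) ** 2 for p in predicted)
residual_ss = sum((y - p) ** 2 for y, p in zip(ys, predicted))
r_squared = explained_ss / total_ss

print(round(total_ss, 3), round(explained_ss, 3), round(residual_ss, 3))  # 6.0 3.6 2.4
print(round(r_squared, 3))  # 0.6
```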

What is Empirical Rule?

The empirical rule states that for a normal distribution, nearly all of the data will fall within
three standard deviations of the mean. The empirical rule can be broken down into three
parts:

• 68% of data falls within one standard deviation of the mean.
• 95% falls within two standard deviations.
• 99.7% falls within three standard deviations.

The rule is also called the 68-95-99.7 Rule or the Three Sigma Rule.


When do we use the Empirical Rule?

The Empirical Rule is often used in statistics for forecasting, especially when
the right data is difficult or impossible to obtain. The rule can give you a rough
estimate of what your data collection might look like if you were able to survey the
entire population.

This rule applies generally to a random variable, X, following the shape of a normal
distribution, or bell-curve, with a mean “mu” (the Greek letter μ) and a standard
deviation “sigma” (the Greek letter σ). The rule doesn’t apply to distributions that are not
normal; for those, Chebyshev’s Theorem gives a more general (and looser) bound on how much
of the data lies within a given number of standard deviations of the mean.

Empirical Rule: Notation


When applying the Empirical Rule to a data set the following conditions are true:

• Approximately 68% of the data falls within one standard deviation of the mean (or
between the mean – 1 times the standard deviation, and the mean + 1 times the
standard deviation). In mathematical notation, this is represented as: μ ± 1σ

• Approximately 95% of the data falls within two standard deviations of the mean (or
between the mean – 2 times the standard deviation, and the mean + 2 times the
standard deviation). The mathematical notation for this is: μ ± 2σ

• Approximately 99.7% of the data falls within three standard deviations of the mean
(or between the mean – 3 times the standard deviation, and the mean + 3 times
the standard deviation). The following notation is used to represent this fact: μ ± 3σ
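
The three percentages can be checked against the normal distribution itself using `statistics.NormalDist` from the Python standard library. This sketch uses the standard normal (μ = 0, σ = 1); the same shares hold for any normal distribution. The function name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

The rounded figures 68, 95, and 99.7 in the rule are therefore approximations, which is why each statement above begins with "approximately".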
