Standard Deviation

Introduction
• Standard deviation is a descriptive statistic that is used
to understand the distribution of a dataset.
• It is often reported in combination with the mean (or
average), giving context to that statistic. Specifically, the
standard deviation describes how far scores in a
dataset tend to spread out from the mean.
• A small standard deviation (relative to the mean score)
indicates that the majority of individuals (or data points)
tend to have scores that are very close to the mean (see
figure below). Here the scores look clustered
around the mean, with only a few scores farther
away from it (probably outliers).
• By contrast, a sample with a large standard deviation
(relative to the mean score) tends to have cases that are
more widely spread out from the mean (see figure on right),
perhaps with only a few cases actually having scores that
fall close to the mean.
• You may be wondering to yourself: “Why
should I care about the standard
deviation?” The answer to that question is
context. To really understand the basic
characteristics of a dataset, you must put
your statistics in context.
Allow me to demonstrate:
• For the sake of demonstration, imagine we have two samples of
chocolate cake eaters, each sample with 10 people, self-reporting
how many pieces of chocolate cake they've eaten in the last seven
days.
• In dataset #1, we have five people that report eating 4 pieces of
cake and five people that report eating 6 pieces of cake, for a mean
of 5 pieces of cake
– (4+4+4+4+4+6+6+6+6+6)/10 = 5

• Mean (Average) = 5
• In dataset #2, we have five people that report eating 0 pieces of cake
and five people that report eating 10 pieces of cake, for a mean of 5
pieces of cake
– (0+0+0+0+0+10+10+10+10+10)/10 = 5
• Mean (Average) = 5
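The two means above can be reproduced with a few lines of Python. This is only a sketch: the variable names `dataset_1` and `dataset_2` are made up for illustration, and the values are the hypothetical cake counts from the example.

```python
from statistics import mean

# The two hypothetical cake-eating samples from the example
# (pieces of cake eaten per person in the last seven days).
dataset_1 = [4, 4, 4, 4, 4, 6, 6, 6, 6, 6]
dataset_2 = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]

mean_1 = mean(dataset_1)
mean_2 = mean(dataset_2)

# Both samples have the same mean, even though the raw scores differ a lot.
print(mean_1, mean_2)
```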
• Looking at the mean score alone would lead us to
believe that these two groups of people have the same
chocolate cake eating habits (eating about 5 pieces
per person), but would we ever come to that conclusion,
given access to the full information that we have here?
Of course not.
• Instead we would probably say that the mean of 5
pieces per person seems to describe sample #1
reasonably well, but not so much sample #2, as it seems
to be composed of people with more extreme chocolate
cake eating habits (either eating a whole lot of chocolate
cake in a week or having none at all).
• In this case the two datasets share the same
mean, but that shared mean is
somewhat deceptive. In fact, the mean statistic
can be a deceptive little bugger in general, when
it is not presented in context. That is where a
standard deviation comes in!
• Now, you might be thinking: “Why not just look
at the raw data and come to that conclusion?
After all, you just came to that conclusion
without ever talking about the standard
deviation!”
• Well, that is fine as long as you only have ten
people in each sample AND as long as your
sample is so neatly, cleanly, and clearly
organized into moderate values and extreme
values, as it is here. If that is the case, then you
likely can get a perfectly firm grasp on your data
without ever knowing the standard deviation!
Unfortunately, data is rarely that clear and
sample sizes can be in the hundreds,
thousands, or even millions, making it
impossible to "eye-ball" the data and draw
reliable conclusions.
• When these instances arise (which will be
almost every time you work with data),
your friendly standard deviation can give you the
context you need.
• Let's consider the standard deviations of our
chocolate cake datasets. Knowing that larger
values of standard deviation are indicative of
more points "spread" away from the mean,
compared to smaller standard deviation values
(as discussed in our first paragraph), which
sample (#1 or #2) would you expect to have a
larger standard deviation?...
If you want to calculate the standard
deviation by hand, you simply:
• Subtract your mean score from every person's
actual (observed) score
• Square those difference scores for each person
• Add those values together for the whole sample
• Divide that sum by the number of cases in your
data (10 in our case)
• Finally, calculate the square root of the number
calculated in the previous step
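The five steps above can be sketched in Python as follows. The function name `population_std_dev` is made up for illustration. Note that dividing by the number of cases, as the steps describe, gives what is usually called the population standard deviation; many textbooks and software packages divide by one less than the number of cases when working with a sample, which gives a slightly larger value.

```python
import math

def population_std_dev(scores):
    """Standard deviation computed with the five by-hand steps (divides by n)."""
    n = len(scores)
    mean = sum(scores) / n

    # Step 1: subtract the mean from every observed score.
    differences = [score - mean for score in scores]
    # Step 2: square each difference.
    squared = [d ** 2 for d in differences]
    # Step 3: add the squared differences together.
    total = sum(squared)
    # Step 4: divide the sum by the number of cases.
    average_squared = total / n
    # Step 5: take the square root of that number.
    return math.sqrt(average_squared)
```

For the example data, `population_std_dev([4, 4, 4, 4, 4, 6, 6, 6, 6, 6])` returns `1.0` and `population_std_dev([0, 0, 0, 0, 0, 10, 10, 10, 10, 10])` returns `5.0`.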
• Now, back to our example. You will recall that I asked
you to guess which sample you would expect to have the
larger standard deviation (#1 or #2). Well, if you said
sample #2, you would be correct!
• In dataset #1, we have five people that report eating 4
pieces of cake and five people that report eating 6
pieces of cake, for a mean of 5 pieces of cake
([4+4+4+4+4+6+6+6+6+6]/10=5).
– Mean = 5; Standard Deviation = 1
• In dataset #2, we have five people that report eating 0
pieces of cake and five people that report eating 10
pieces of cake, for a mean of 5 pieces of cake
([0+0+0+0+0+10+10+10+10+10]/10=5).
– Mean = 5; Standard Deviation = 5
• Note: You will almost never see the mean and standard
deviation with the same value. This example was made
intentionally extreme for demonstration purposes, but
clearly you wouldn't typically have your entire sample fall
into either all 0's or all 10's (unless it is categorical data,
in which case there is no need for means and standard
deviation scores).
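The standard deviations reported for the two cake datasets can be checked with Python's built-in `statistics` module. This is a sketch with illustrative variable names. `pstdev` divides by the number of cases, matching the by-hand calculation in this example, while `stdev` divides by one less than the number of cases and so comes out slightly larger.

```python
from statistics import pstdev, stdev

dataset_1 = [4, 4, 4, 4, 4, 6, 6, 6, 6, 6]
dataset_2 = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]

# Divide-by-n version, as in the by-hand calculation.
sd_1 = pstdev(dataset_1)   # 1.0
sd_2 = pstdev(dataset_2)   # 5.0

# Divide-by-(n - 1) version, often used for samples.
sample_sd_1 = stdev(dataset_1)   # roughly 1.05
sample_sd_2 = stdev(dataset_2)   # roughly 5.27

# Either way, dataset #2 is far more spread out than dataset #1.
print(sd_1, sd_2)
print(round(sample_sd_1, 2), round(sample_sd_2, 2))
```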
• From this example, we can see that the standard
deviation is critical to understanding your data, by putting
your mean statistic in context. In this case it indicates that
the mean for dataset #2 is not a very meaningful or
useful statistic for understanding the eating tendencies of
individuals in that dataset.
A few closing notes about standard
deviations:
• A dataset's variance can be calculated by
simply squaring the standard deviation.
– Variance = (standard deviation)²
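As a quick check of that relationship, here is a small sketch using dataset #2 from the cake example. The variable names are illustrative, and `pvariance` and `pstdev` are the divide-by-n versions in Python's `statistics` module.

```python
from statistics import pstdev, pvariance

dataset_2 = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]

sd = pstdev(dataset_2)                  # standard deviation
variance_from_sd = sd ** 2              # squaring it gives the variance
variance_direct = pvariance(dataset_2)  # variance computed directly

# The two routes to the variance agree.
print(variance_from_sd, variance_direct)
```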
• One standard deviation above and below the
mean is expected to include about 68% of the
participants' scores in your dataset (assuming
your distribution is normal).
– Two standard deviations above and below the mean
would be expected to include about 95% of the values in
your dataset (assuming your distribution is normal).
– Three standard deviations above and below the mean
would be expected to include about 99.7% of the values in
your dataset (assuming your distribution is normal).
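These percentages come straight from the normal distribution. As a sketch, the share of a normal distribution lying within k standard deviations of the mean can be computed with the error function from Python's `math` module. The helper name `share_within` is made up for illustration.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Print the share, in percent, for one, two, and three standard deviations.
for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

This prints roughly 68.3, 95.4, and 99.7, matching the rounded figures above. The rule only applies when the data is approximately normal.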
• As alluded to earlier, standard deviations
should only be calculated for interval data
(also true for a mean score).
– Interval data is data that is numeric and holds
an intrinsic and consistent value between
values (for example, going from 1 to 2 represents the same
increase as going from 2 to 3 or from 3 to 4, etc.).
Source
• Jeremy Taylor. Sunday, August 1, 2010 at
11:43AM. Stats Make Me Cry: Analyze.
Interpret. Defend. Top Ten Confusing
Stats Terms Explained in “Plain English”
(#10: Standard Deviation).
http://www.statsmakemecry.com/smmctheblog/2010/8/1/top-ten-confusing-stats-terms-explained-in-plain-english-10.html
Retrieved: February 22, 2013.
