
 

The standard deviation is the most common measure of dispersion, or how spread out the
data are about the mean. The symbol σ (sigma) is often used to represent the standard
deviation of a population, while s is used to represent the standard deviation of a sample.

Step 1: Describe the size of your sample


Use N to know how many observations are in your sample. Minitab does not include
missing values in this count.

You should collect a medium to large sample of data. Samples that have at least 20
observations are often adequate to represent the distribution of your data. However, to
better represent the distribution with a histogram, some practitioners recommend that you
have at least 50 observations. Larger samples also provide more precise estimates of the
process parameters, such as the mean and standard deviation.
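
To make the link between sample size and precision concrete, here is a minimal sketch of the standard error of the mean (the "SE Mean" column in the results below), computed as the standard deviation divided by the square root of N. The function name `standard_error` is made up for this illustration; 6.422 is the StDev reported below, and 20, 50, and 68 are the sample sizes mentioned in the text.

```python
import math


def standard_error(stdev, n):
    """Standard error of the mean: standard deviation over sqrt(N)."""
    return stdev / math.sqrt(n)


# Same spread, different sample sizes: the estimate of the mean
# becomes more precise (smaller standard error) as N grows.
for n in (20, 50, 68):
    print(f"N = {n:2d}  SE of mean = {standard_error(6.422, n):.4f}")
```

With the standard deviation held fixed, quadrupling the sample size halves the standard error.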

Statistics

Variable   N  N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    68   0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000

Key Result: N
In these results, you have 68 observations.
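
Outside Minitab, the same two counts can be sketched in a few lines. This is an illustration only: the readings are invented, and treating `None` and NaN as "missing" is an assumption of the sketch, not a description of how Minitab stores missing values.

```python
import math


def count_observations(values):
    """Return (N, N*): the non-missing count and the missing count."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    n_missing = sum(1 for v in values if is_missing(v))
    return len(values) - n_missing, n_missing


# Hypothetical torque readings; None and NaN stand in for missing cells.
torque = [24.0, 14.0, None, 18.0, float("nan"), 27.0, 17.0]
n, n_star = count_observations(torque)
print(f"N = {n}, N* = {n_star}")
```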

Step 2: Describe the center of your data


Use the mean to describe the sample with a single value that represents the center of the
data. Many statistical analyses use the mean as a standard measure of the center of the
distribution of the data.

The median and the mean both measure central tendency. But unusual values, called
outliers, affect the median less than they affect the mean. When you have unusual values,
you can compare the mean and the median to decide which is the better measure to use. If
your data are symmetric, the mean and median are similar.

Statistics

Variable   N  N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    68   0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000

Key Results: Mean and Median
In these results, the mean torque that is required to remove a toothpaste cap is 21.265, and the median
torque is 20. The data appear to be skewed to the right, which explains why the mean is greater than the
median.
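
The different sensitivity of the two measures is easy to see with a small made-up sample. The sketch below uses invented values (not the torque data) and adds one unusually large observation.

```python
import statistics

# Hypothetical symmetric sample: mean and median agree.
values = [16, 18, 19, 20, 21, 22, 24]
mean_before = statistics.mean(values)
median_before = statistics.median(values)

# Add one unusually large value and recompute both measures.
with_outlier = values + [60]
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean = {mean_before}, median = {median_before}")
print(f"with outlier:    mean = {mean_after}, median = {median_after}")
```

The single large value pulls the mean up by 5 units but moves the median by only half a unit.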

Step 3: Describe the spread of your data


Use the standard deviation to determine how spread out the data are from the mean.

A higher standard deviation value indicates greater spread in the data.

Statistics

Variable   N  N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    68   0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000

Key Result: StDev
In these results, the standard deviation is 6.422. With normal data, nearly all of the observations (about
99.7%) fall within 3 standard deviations on each side of the mean.
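
As a quick check of that statement, the sketch below computes the interval that lies 3 standard deviations on each side of the mean, using the mean, StDev, minimum, and maximum reported above. The variable names are illustrative.

```python
# Summary values taken from the results above.
mean, stdev = 21.265, 6.422
minimum, maximum = 10.0, 37.0

# Interval covering 3 standard deviations on each side of the mean.
low = mean - 3 * stdev
high = mean + 3 * stdev
print(f"mean +/- 3 StDev: {low:.3f} to {high:.3f}")
print("all observations inside:", low <= minimum and maximum <= high)
```

Here the smallest and largest torque values both fall inside the interval.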

Step 4: Assess the shape and spread of your data distribution

Use the histogram, the individual value plot, and the boxplot to assess the shape and spread
of the data, and to identify any potential outliers.

Examine the shape of your data to determine whether your data appear to be skewed

When data are skewed, the majority of the data are located on the high or low side of the
graph. Often, skewness is easiest to detect with a histogram or boxplot.

[Figure: two histograms, labeled "Right-skewed" and "Left-skewed"]

The histogram with right-skewed data shows wait times. Most of the wait times are relatively short, and
only a few wait times are long. The histogram with left-skewed data shows failure time data. A few items
fail immediately, and many more items fail later.
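
Skewness can also be summarized as a number. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one common definition and not necessarily the one Minitab reports. The wait times are invented for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5.

    Positive for a long right tail, negative for a long left tail.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical wait times: mostly short, a few long (right-skewed).
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]
# Mirror image of the same data (left-skewed).
left_skewed = [13 - v for v in right_skewed]

print(f"right-skewed: {skewness(right_skewed):+.2f}")
print(f"left-skewed:  {skewness(left_skewed):+.2f}")
```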

Determine how much your data varies


Assess the spread of the points to determine how much your sample varies. The greater the
variation in the sample, the more the points will be spread out from the center of the data.

This individual value plot shows that the data on the right have more variation than the data on the left.

Look for multi-modal data


Multi-modal data have multiple peaks, also called modes. Multi-modal data often indicate
that important variables are not yet accounted for.

If you have additional information that allows you to classify the observations into groups,
you can create a group variable with this information. Then, you can create the graph with
groups to determine whether the group variable accounts for the peaks in the data.

[Figure: two histograms, labeled "Simple" and "With Groups"]

For example, a manager at a bank collects wait time data and creates a simple histogram. The histogram
appears to have two peaks. After further investigation, the manager determines that the wait times for
customers who are cashing checks are shorter than the wait times for customers who are applying for home
equity loans. The manager adds a group variable for customer task, and then creates a histogram with
groups.
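
The bank example can be sketched with text histograms. Everything in this block is hypothetical: the wait times, the task labels, the 2-minute bin width, and the helper names are invented to show how a group variable can separate two peaks.

```python
from collections import Counter, defaultdict


def binned_counts(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


def print_histogram(title, counts, bin_width):
    """Print one row of '#' marks per non-empty bin."""
    print(title)
    for edge in sorted(counts):
        print(f"  {edge:2d}-{edge + bin_width - 1:2d} min | {'#' * counts[edge]}")


# Hypothetical (task, wait in minutes) records.
records = (
    [("check", w) for w in (2, 3, 3, 4, 4, 4, 5, 6)]
    + [("loan", w) for w in (12, 13, 14, 14, 14, 15, 16)]
)

# Simple histogram: all wait times together, showing two peaks.
overall = binned_counts([w for _, w in records], 2)
print_histogram("Simple", overall, 2)

# Histogram with groups: one histogram per customer task.
by_task = defaultdict(list)
for task, wait in records:
    by_task[task].append(wait)
group_counts = {task: binned_counts(waits, 2) for task, waits in by_task.items()}
for task, counts in group_counts.items():
    print_histogram(f"Group: {task}", counts, 2)
```

In this made-up data, each group has a single peak, so the group variable accounts for the two peaks in the simple histogram.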

Look for outliers


Outliers, which are data values that are far away from other data values, can strongly affect
the results of your analysis. Often, outliers are easiest to identify on a boxplot.

On a boxplot, asterisks (*) denote outliers.
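
A common convention for boxplots, assumed in this sketch, is to flag any value more than 1.5 times the interquartile range beyond the first or third quartile. The data are invented, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile method your software uses.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return values beyond k * IQR from the quartiles (boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical measurements with one value far from the rest.
data = [10, 12, 13, 14, 15, 15, 16, 17, 18, 40]
outliers = find_outliers(data)
print("outliers:", outliers)
```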

Step 5: Compare data from different groups


If you have a Group variable, you can use it to analyze your data by group or by group level.

Statistics

Variable  Machine   N  N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    1        36   0  18.6667   0.7325  4.3948  10.0000  15.2500  17.0000  21.7500  30.0000
          2        32   0   24.188    1.258   7.119   14.000   17.500   24.000   31.000   37.000

In these results, the summary statistics are calculated separately by machine. You can easily see the
differences in the center and spread of the data for each machine. For example, Machine 1 has a lower
mean torque and less variation than Machine 2. To determine whether the difference in means is
significant, you can perform a 2-sample t-test.
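
As a rough sketch of what such a test computes, the block below derives a 2-sample t statistic from the summary values in the table above (means 18.6667 and 24.188, SE Mean 0.7325 and 1.258, N read as 36 and 32). It uses the Welch form, which does not assume equal variances; that choice is an assumption of this sketch. It stops short of a p-value, which requires the t distribution from a statistics package.

```python
import math


def welch_t(mean1, se1, n1, mean2, se2, n2):
    """Welch t statistic and approximate degrees of freedom,
    computed from each group's mean, standard error, and size."""
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    t = (mean1 - mean2) / se_diff
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_diff ** 4 / (se1 ** 4 / (n1 - 1) + se2 ** 4 / (n2 - 1))
    return t, df


# Machine 1 versus Machine 2, values read from the table above.
t, df = welch_t(18.6667, 0.7325, 36, 24.188, 1.258, 32)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A t statistic this far from zero suggests the difference in means is unlikely to be chance alone, but the formal conclusion should come from the test output itself.
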

Descriptive statistics
Dr. C. George Boeree

Descriptive statistics are ways of summarizing large sets of quantitative (numerical)
information.  If you have a large number of measurements, the best thing you can do
is to make a graph with all the possible scores along the bottom (x axis), and the
number of times you came across that score recorded vertically (y axis) in the form of
a bar.  But such a graph is just plain hard to do statistical analyses with, so we have
other, more numerical ways of summarizing the data.

Here is a small set of data:  the grades for 15 students.  For our purposes, they range
from 0 (failing) to 4 (an A), and go up in steps of 0.2.

John -- 3.0
Mary -- 2.8
George -- 2.8
Beth -- 2.4
Sam -- 3.2
Judy -- 2.8
Fritz -- 1.8
Kate -- 3.8
Dave -- 2.6
Jenny -- 3.4
Mike -- 2.4
Sue -- 4.0
Don -- 3.4
Ellen -- 3.2
Orville -- 2.2

Here is the information in bar graph form:
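
A text version of such a bar graph can be produced by tallying how often each grade occurs. This is a minimal sketch using the 15 grades listed above; the `#` bars stand in for the drawn bars.

```python
from collections import Counter

# The 15 grades from the list above, in the order given.
grades = [3.0, 2.8, 2.8, 2.4, 3.2, 2.8, 1.8, 3.8,
          2.6, 3.4, 2.4, 4.0, 3.4, 3.2, 2.2]

# Tally each grade, then print one bar per grade in ascending order.
counts = Counter(grades)
for grade in sorted(counts):
    print(f"{grade:.1f} | {'#' * counts[grade]}")
```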


Central tendency

Central tendency refers to the idea that there is one number that best summarizes the
entire set of measurements, a number that is in some way "central" to the set.

The mode.  The mode is the measurement that has the greatest frequency, the one you
found the most of.  Although it isn't used that much, it is useful when differences are
rare or when the differences are non-numerical.  The prototypical example of
something is usually the mode.

The mode for our example is 2.8.  It is the grade with the most people (3).

The median.  The median is the number at which half your measurements are more
than that number and half are less than that number.  The median is actually a better
measure of centrality than the mean if your data are skewed, meaning lopsided.  If, for
example, you have a dozen ordinary folks and one millionaire, the distribution of their
wealth would be lopsided towards the ordinary people, and the millionaire would be
an outlier, or highly deviant member of the group.  The millionaire would influence
the mean a great deal, making it seem like all the members of the group are doing
quite well.  The median would actually be closer to the mean of all the people other
than the millionaire.

The median for our example is 2.8.  When the 15 grades are sorted, the eighth
(middle) grade is 2.8, with seven grades on either side of it.

The mean.  The mean is just the average. It is the sum of all your measurements,
divided by the number of measurements.  This is the most used measure of central
tendency, because of its mathematical qualities.  It works best if the data is distributed
very evenly across the range, or is distributed in the form of a normal or bell-shaped
curve (see below).  One interesting thing about the mean is that it represents
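the expected value if the distribution of measurements were random!  Here is what
the formula looks like, where x_i are the individual measurements and N is their number:

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i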

So 3.0 + 2.8 + 2.8 + 2.4 + 3.2 + 2.8 + 1.8 + 3.8 + 2.6 + 3.4 + 2.4 + 4.0 + 3.4 + 3.2 +
2.2 is 43.8.  Divide that by 15 and that is the mean or average for our example: 2.92.
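
The three measures of central tendency can be computed directly from the list of 15 grades with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not part of the original essay.

```python
import statistics

# The 15 grades from the list above, in the order given.
grades = [3.0, 2.8, 2.8, 2.4, 3.2, 2.8, 1.8, 3.8,
          2.6, 3.4, 2.4, 4.0, 3.4, 3.2, 2.2]

modes = statistics.multimode(grades)   # every most-frequent grade
median = statistics.median(grades)     # middle of the sorted grades
mean = statistics.fmean(grades)        # sum divided by count

print("mode(s):", modes)
print("median: ", median)
print(f"mean:    {mean:.2f}")
```

`multimode` returns a list, so a tie between several equally frequent grades would show up as more than one value.
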

Statistical dispersion

Dispersion refers to the idea that there is a second number which tells us how "spread
out" all the measurements are from that central number.

The range.  The range is the measure from the smallest measurement to the largest
one.  This is the simplest measure of statistical dispersion or "spread."

The range for our example is 2.2, the distance from the lowest score, 1.8, to the
highest, 4.0.

Interquartile range.  A slightly more sophisticated measure is the interquartile
range.  If you divide the data into quartiles, meaning that one fourth of the
measurements are in quartile 1, one fourth in 2, one fourth in 3, and one fourth in 4,
you will get a number that divides 1 and 2 and a number that divides 3 and 4.  You
then measure the distance between those two numbers, which therefore contains half
of the data.  Notice that the number between quartile 2 and 3 is the median!

The interquartile range for our example is .9, because the quartiles divide roughly at 2.45
and 3.35.  The reason for the odd dividing lines is that there are 15 pieces of data,
which, of course, cannot be neatly divided into quartiles!
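
Because 15 values do not split evenly into quarters, the quartile cut points depend on the interpolation convention. The sketch below computes the range and then the interquartile range under the two conventions offered by Python's `statistics.quantiles`; software using another convention may report slightly different cut points, such as the rough ones quoted above. The helper name `iqr` is made up for this illustration.

```python
import statistics

# The 15 grades from the list above, in the order given.
grades = [3.0, 2.8, 2.8, 2.4, 3.2, 2.8, 1.8, 3.8,
          2.6, 3.4, 2.4, 4.0, 3.4, 3.2, 2.2]


def iqr(values, method):
    """Interquartile range: distance between the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method=method)
    return q3 - q1


data_range = max(grades) - min(grades)
print(f"range: {data_range:.1f}")

for method in ("exclusive", "inclusive"):
    q1, _, q3 = statistics.quantiles(grades, n=4, method=method)
    print(f"{method:9s}  Q1 = {q1:.2f}  Q3 = {q3:.2f}  IQR = {iqr(grades, method):.2f}")
```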

The standard deviation.  The standard deviation is the "average" degree to which
scores deviate from the mean.  More precisely, you measure how far each of your
measurements is from the mean, square each distance, add them all up, and divide by
the number of measurements.  The result is called the variance.  Take the square root
of the variance, and you have the standard deviation.  Like the mean, it is the
"expected value" of how far the scores deviate from the mean.  Here is what the formula
looks like, where x_i are the measurements, \bar{x} is their mean, and N is their number:

    \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}

So, subtract the mean from each score, square the differences, and sum them:  5.024.
Then divide by 15 and take the square root and you have the standard deviation for our
example:  .5787....  One standard deviation above the mean is at about 3.5; one
standard deviation below is at about 2.3.
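
The same steps can be followed in code. This sketch divides by N (here 15), as the text does, which gives the population standard deviation; it also shows the sample version, which divides by N - 1 and is what many packages report by default.

```python
import math
import statistics

# The 15 grades from the list above, in the order given.
grades = [3.0, 2.8, 2.8, 2.4, 3.2, 2.8, 1.8, 3.8,
          2.6, 3.4, 2.4, 4.0, 3.4, 3.2, 2.2]

n = len(grades)
mean = sum(grades) / n

# Step 1: squared distance of each grade from the mean, summed.
sum_of_squares = sum((g - mean) ** 2 for g in grades)
# Step 2: divide by N to get the variance.
variance = sum_of_squares / n
# Step 3: the square root of the variance is the standard deviation.
population_sd = math.sqrt(variance)

# Sample standard deviation (divides by N - 1 instead of N).
sample_sd = statistics.stdev(grades)

print(f"sum of squares = {sum_of_squares:.4f}")
print(f"population SD  = {population_sd:.4f}")
print(f"sample SD      = {sample_sd:.4f}")
```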

The normal curve


At its simplest, the central tendency and the measure of dispersion describe a
rectangle that is a summary of the set of data.  On a more sophisticated level, these
measures describe a curve, such as the normal curve, that contains the data most
efficiently.

This curve, also called the bell-shaped curve, represents a distribution that reflects
certain probabilistic events when extended to an infinite number of measurements.  It
is an idealized version of what happens in many large sets of measurements:  Most
measurements fall in the middle, and fewer fall at points farther away from the
middle.  A simple example is height:  Very few people are below 3 feet tall; very few
are over 8 feet tall; most of us are somewhere between 5 and 6.  The same applies to
weight, IQs, and SATs!  In the normal curve,  the mean, median, and mode are all the
same.

The interval from the mean to one standard deviation below it contains 34.1% of the
measures, as does the interval from the mean to one standard deviation above it.  From
one to two below contains 13.6%, as does from one to two above.  From two to three
standard deviations contains 2.1% on each end.  Another way to look at it:  Between one
standard deviation below and above, we have 68% of the data; from two below to two
above, we have 95%; from three below to three above, we have 99.7%.
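
These percentages follow from the normal curve itself and can be reproduced with the error function in Python's `math` module: the share of a normal distribution within k standard deviations of the mean is erf(k / sqrt(2)). The function name `share_within` is made up for this sketch.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


# Cumulative shares: within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")

# Shares of the individual bands on one side of the mean.
band_0_1 = share_within(1) / 2
band_1_2 = (share_within(2) - share_within(1)) / 2
band_2_3 = (share_within(3) - share_within(2)) / 2
print(f"0 to 1 SD: {band_0_1:.1%}, 1 to 2 SD: {band_1_2:.1%}, 2 to 3 SD: {band_2_3:.1%}")
```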

Because of its mathematical properties, especially its close ties to probability theory,
the normal curve is often used in statistics, with the assumption that the mean and
standard deviation of a set of measurements define the distribution.  Hopefully, it is
obvious that this is not at all true for nearly all cases.  The best representation of your
measurements is a diagram which includes all the measurements, not just their mean
and standard deviation!  Our example above is a clear example - a normal curve with
a mean of 2.92 and a standard deviation of .58 is quite different from the pattern of the
original data.  A good real life example is IQ and intelligence:  IQ tests are
intentionally scored in such a way that they generate a normal curve, and because IQ
tests are what we use to measure intelligence, we often assume that intelligence is
normally distributed, which is not at all necessarily true!


© Copyright 2005, C. George Boeree

The central limit theorem (CLT) states that, for sufficiently large samples drawn from a
population with finite variance, the distribution of the sample mean is approximately normal,
whatever the shape of the population. The sample means center on the population mean, and their
standard deviation is the population standard deviation divided by the square root of the sample size.
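
A small simulation can illustrate how sample means behave. The sketch below makes several assumptions of its own: the population is exponential with mean 1 and standard deviation 1 (strongly right-skewed), each sample has 50 observations, 2,000 samples are drawn, and the random seed is fixed so the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50
NUM_SAMPLES = 2000


def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, SD 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])


means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

center = statistics.fmean(means)
spread = statistics.stdev(means)
expected_spread = 1.0 / SAMPLE_SIZE ** 0.5  # population SD / sqrt(n)

# Share of sample means within one SD of their center; about 68%
# is what a normal curve would give.
within_one_sd = sum(abs(m - center) <= spread for m in means) / NUM_SAMPLES

print(f"mean of sample means: {center:.3f} (population mean 1.000)")
print(f"SD of sample means:   {spread:.3f} (expected {expected_spread:.3f})")
print(f"within 1 SD:          {within_one_sd:.1%}")
```

Even though the individual draws are far from bell-shaped, the sample means cluster around the population mean in a roughly normal pattern.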
