
THERE’S SOMETHING ABOUT VARIANCE

Imagine practicing hitting a target using darts, bow and arrow, pistol, cannon, missile launcher,
or whatever. You aim for the center of the target. If your shots land where you aimed, you are
considered to be accurate. If all your shots land near each other, you are considered to be
precise. The two properties are not linked. You can be accurate but not precise, precise but not
accurate, neither accurate nor precise, or both accurate and precise.
Accuracy and precision also apply to statistics calculated from data. If you’re trying to determine
some characteristic of a population (i.e., a population parameter), you want your statistical
estimates of the characteristic to be both accurate and precise.
The same also applies to the data themselves. When you start measuring data for an analysis,
you’ll notice that even under similar conditions, you can get dissimilar results. That lack of
precision is called variability. Variability is everywhere; it’s a normal part of life. In fact, it is the
spice in the soup. Without variability, all wines would taste the same. Every race would end in a
tie. Even statistics might lose its charm. Your doctor wouldn’t tell you that you have about a year
to live; he’d tell you not to make any plans for January 11 after 6:13 PM EST. So a bit of variability
isn’t such a bad thing. The important question, though, is what kind of variability?

The Inevitability of Variability


Before going further, let me clarify something. Statisticians discuss variability using a variety of
terms, including errors, uncertainty, deviations, distortions, residuals, noise, inexactness,
dispersion, scatter, spread, perturbations, fuzziness, and differences. To nonprofessionals, many
of these terms hold pejorative connotations. But variability isn’t bad … it’s just misunderstood.
Suppose you’re sitting in your living room one cold winter
night contemplating the high cost of heating oil. The
thermostat reads 68 degrees F, but you’re still shivering.
Maybe the thermostat is broken. Maybe the heater is
malfunctioning or you need more insulation. You need a
warmer place to sit while you read An Inconvenient Truth,
so you grab a thermometer from the medicine cabinet and
start measuring temperatures around the room. It’s 115
degrees at the radiator, 68 degrees at your chair, 59 degrees
at the window, and 69 degrees at the stairs. You keep
measuring. It’s 73 degrees at the fish tank, 67 degrees at the
couch and bookcase, 82 degrees at the TV, and 60 degrees at the door. That’s a lot of variation!
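To put a number on "a lot of variation," here is a minimal sketch that summarizes the readings above using Python's standard library. The location labels are just dictionary keys chosen for illustration, and the sketch assumes the couch and bookcase count as a single 67-degree reading.

```python
import statistics

# Living-room readings in degrees F, taken from the example above.
# Assumption: the couch and bookcase are treated as one reading.
readings = {
    "radiator": 115,
    "chair": 68,
    "window": 59,
    "stairs": 69,
    "fish tank": 73,
    "couch/bookcase": 67,
    "TV": 82,
    "door": 60,
}

values = list(readings.values())

mean_temp = statistics.mean(values)      # arithmetic average
median_temp = statistics.median(values)  # middle value, less sensitive to the radiator
temp_range = max(values) - min(values)   # crude measure of spread
sd_temp = statistics.stdev(values)       # sample standard deviation

print(f"mean   = {mean_temp:.1f} F")
print(f"median = {median_temp:.1f} F")
print(f"range  = {temp_range} F")
print(f"stdev  = {sd_temp:.1f} F")
```

The mean sits well above the median because the radiator reading pulls it upward, which is one hint that not all of the variation comes from the same source.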
Think of those temperature readings as the summation of five components:
Characteristic of Population—the portion of a data value that is the same between a
sample and the population. This part of a data value forms the patterns in the population
that you want to uncover. If you think of the living room space as the population you’re
measuring, the characteristic temperature would be the 68 degrees at your chair where
you want to read.
Natural Variability—the inherent differences between a sample and the population.
This part of a data value is the uncertainty or variability in population patterns. In a
completely deterministic world, there would be no natural variability. You would read the
same value at every point where you took a measurement. But in the real world, if you
made the same measurement again and again, you probably would get different values. If
all other types of variation were controlled, these differences would be the natural or
inherent variability.
Sampling Variability—differences between a sample and the population attributable to
how uncharacteristic (nonrepresentative) the sample is of the population. Minimizing
sampling error requires that you understand the population you are trying to evaluate. The
sampling variability in the living room would be attributable to where you took the
temperature readings. For example, the radiator and TV are heat sources. The door and
window are heat sinks. Furthermore, if all the readings were taken at eye level, the areas
near the ceiling and floor would not have been adequately represented. The floor may be
a few degrees cooler because denser cold air sinks, displacing the warmer air
upward, which is why the air at the ceiling is warmer.
Measurement Variability—differences between a sample and the population
attributable to how data were measured or otherwise generated. Minimizing measurement
error requires that you understand measurement scales and the actual process and
instrument you use to generate data. Using an oral thermometer for the living room
measurements may have been expedient but not entirely appropriate. The temperatures
you wanted to measure are at the low end of the thermometer’s range and may be less
accurate than around 98 degrees. Also, the thermometer is slow to reach equilibrium and
can’t be read with more than one decimal place of precision. Use a digital infrared laser-
point thermometer next time. More accurate. More precise. More fun.
Environmental Variability—differences between a sample and the population
attributable to extraneous factors. Minimizing environmental variance is difficult because
there are so many causes and because the causes are often impossible to anticipate or
control. For example, the heating system may go on and off unexpectedly. Your own
body heat adds to the room temperature and walking around the living room taking
measurements mixes the ceiling and floor air which adds variability to the temperatures.
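The idea of a reading as a summation of five components can be sketched as a small simulation. Everything numeric here except the 68-degree characteristic temperature is hypothetical: the standard deviations are made up, and the sketch assumes each source of variability is independent, centered on zero, and normally distributed. Real sampling variability (a reading at the radiator, say) would not be centered on zero.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

CHARACTERISTIC = 68.0  # degrees F at the chair, from the example

# Hypothetical standard deviations (degrees F) for each source of variability.
COMPONENT_SD = {
    "natural": 0.5,
    "sampling": 3.0,
    "measurement": 1.0,
    "environmental": 1.5,
}

def simulate_reading(rng):
    """One reading = characteristic value + one random draw per variability source."""
    return CHARACTERISTIC + sum(rng.gauss(0.0, sd) for sd in COMPONENT_SD.values())

simulated = [simulate_reading(rng) for _ in range(20000)]

simulated_mean = statistics.fmean(simulated)
total_variance = statistics.pvariance(simulated)
sum_of_variances = sum(sd ** 2 for sd in COMPONENT_SD.values())

print(f"mean of simulated readings : {simulated_mean:.2f}")
print(f"variance of readings       : {total_variance:.2f}")
print(f"sum of component variances : {sum_of_variances:.2f}")
```

Under these assumptions the variance of the readings is close to the sum of the component variances, so the largest component (here, the made-up sampling term) dominates what you see in the data.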
When you analyze data, you usually want to evaluate characteristics of some population and the
natural variability associated with the population. Ideally, you don’t want to be misled by any
extraneous variability that might be introduced by the way you select your samples (or patients,
items, or other entities), measure (generate or collect) the data, or experience uncontrolled
transient events or conditions. That’s why it’s so important to understand the ways of variability.

Variability versus Bias


Remember target practice? If there is little variation in your aim, the deviations from the center
of the target would be small and random in distance and direction. Your aim would be accurate and precise.
But what if the sight on your weapon were misaligned? Your shots would not be centered on the
center of the target. Instead there would be a systematic deviation caused by the misaligned
sight. Your shots would all be inaccurate, by roughly the same distance and direction from the
center. That systematic deviation is called bias. You may not even have known there was a
problem with the sight before shooting, although you would probably suspect something after all
the misses.
Bias usually carries the connotation of being a bad thing. It usually is. It may be why 19th-century
British Prime Minister Benjamin Disraeli is said to have mistakenly associated statistics with lies and
damned lies. But if the systematic deviation is a good thing because it fixes another bias, it’s called
a correction. For example, you could add a correction, an intentional bias in the direction
opposite the bias introduced by the weapon sight, to compensate for the inaccuracy. So bias can
be good (in a way) or bad, intentional or not, but it’s always systematic. On the other hand, a bias
applied to only selected data is a form of exploitation, and is nearly always intentional and a very
bad thing.
So the relationships to remember are:
Variance ↔ Imprecision
Bias ↔ Inaccuracy
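Here is a minimal sketch of those two relationships using simulated target practice. The sight error, the shot-to-shot scatter, and the helper names `shoot` and `summarize` are all hypothetical. Bias is measured as the distance from the target center to the average impact point, and imprecision as the root-mean-square distance of the shots from their own average. The correction is estimated from a first batch of shots and applied to a second batch.

```python
import math
import random
import statistics

def shoot(rng, n, aim_error, sigma):
    """Simulate n shots: a systematic aim error plus random scatter on each axis."""
    ex, ey = aim_error
    return [(rng.gauss(ex, sigma), rng.gauss(ey, sigma)) for _ in range(n)]

def summarize(shots):
    """Return (mean point, bias, spread) for shots at a target centered on (0, 0)."""
    xs = [x for x, _ in shots]
    ys = [y for _, y in shots]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    bias = math.hypot(mean_x, mean_y)  # inaccuracy: how far the average shot is from center
    spread = math.sqrt(                # imprecision: scatter around the average shot
        statistics.fmean([(x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in shots])
    )
    return (mean_x, mean_y), bias, spread

rng = random.Random(42)
SIGHT_ERROR = (3.0, -2.0)  # hypothetical misaligned sight
SIGMA = 0.5                # hypothetical shot-to-shot scatter

# First batch: shoot with the misaligned sight and estimate the bias.
first_batch = shoot(rng, 500, SIGHT_ERROR, SIGMA)
(est_x, est_y), bias_before, spread_before = summarize(first_batch)

# Correction: aim by the estimated bias in the opposite direction.
corrected_aim = (SIGHT_ERROR[0] - est_x, SIGHT_ERROR[1] - est_y)
second_batch = shoot(rng, 500, corrected_aim, SIGMA)
_, bias_after, spread_after = summarize(second_batch)

print(f"before correction: bias = {bias_before:.2f}, spread = {spread_before:.2f}")
print(f"after correction : bias = {bias_after:.2f}, spread = {spread_after:.2f}")
```

In this sketch the correction removes most of the bias but leaves the spread about where it was, which is the point: fixing inaccuracy does nothing for imprecision.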
Most statistical techniques are unbiased themselves, as long as you meet their assumptions. If
something goes wrong, you can’t blame the statistics. You may have to look in the mirror,
though. During the course of any statistical analysis, there are many decisions that have to be
made, primarily involving data. Whatever the decisions are, such as deleting or keeping an
outlier, there will be some impact on precision and perhaps even accuracy. In an ideal world, the
sum of the decisions wouldn’t add appreciably to the variability. Often, though, data analysts
want to be conservative, so they make decisions they believe are counter to their expectations.
But when they don’t get the results they expected, they go back and try to tweak the analysis. At
that point they have lost all chance of doing an objective analysis and are little better than
analysts with vested interests who apply their biases from the start. Avoiding such analysis bias
requires no more than making decisions based solely on statistical principles. This sounds
simple but it isn’t always so.
Sometimes bias isn’t the fault of the data analyst, as in the case of reporting bias. In professional
circles the most common form of reporting bias is probably the failure to report non-significant results.
Some investigators will repeat a study again and again, continually fine-tuning the study design
until they reach their nirvana of statistical significance. Seriously, is there any real difference
between probabilities of significance of 0.051 versus 0.049? But you can’t fault the investigators
alone. Some professional journals won’t publish negative results, and professionals who don’t
publish perish. Can you imagine the pressure on an investigator looking for a significant result
for some new business venture, like a pharmaceutical? He might take subtle actions to help his
cause and then not report everything he did. That’s a form of reporting bias.
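To see why repeating a study until it reaches significance is a problem, consider a rough calculation. It assumes a 0.05 significance level, that there is truly no effect to find, and that each repeat is an independent test. The function name is hypothetical, and real repeated studies with fine-tuned designs would not be this tidy.

```python
def chance_of_false_positive(attempts, alpha=0.05):
    """Probability that at least one of `attempts` independent tests is
    'significant' at level alpha when there is no real effect."""
    return 1.0 - (1.0 - alpha) ** attempts

for attempts in (1, 5, 10, 20):
    p = chance_of_false_positive(attempts)
    print(f"{attempts:2d} attempts -> {p:.0%} chance of at least one significant result")
```

Under these assumptions, the chance of at least one spurious "significant" result passes 50 percent by the fourteenth attempt. If only that attempt gets reported, the reader has no way of knowing about the rest.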
Perhaps the most common form of reporting bias in nonprofessional circles is cherry picking, the
practice of reporting just those findings that are favorable to the reporter’s position. Cherry
picking is very common in studies of controversial topics such as climate change, marijuana, and
alternative medicine. Virtually all political discussions use information that was cherry picked.
Given that someone else’s reporting bias is after-the-analysis, why is it important to your
analysis? The answer is that it’s how you can be misled in planning your statistical study. Never
trust a secondary source if you can avoid it. Never trust a source of statistics or a statistical
analysis that doesn’t report variance and sample size along with the results. And always
remember: statistics don’t lie; people do.
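A short sketch shows why a result reported without variance and sample size is hard to trust. The numbers are made up, the function names are hypothetical, and the interval uses the common normal approximation (1.96 standard errors), which is only a rough guide for small samples.

```python
import math

def standard_error(sd, n):
    """Standard error of a mean: standard deviation divided by the square root of n."""
    return sd / math.sqrt(n)

def approx_95_interval(mean, sd, n):
    """Approximate 95% interval for a mean, using the normal approximation."""
    half_width = 1.96 * standard_error(sd, n)
    return mean - half_width, mean + half_width

# Two hypothetical studies that both report a mean of 70.
for label, sd, n in (("small study", 10.0, 25), ("large study", 10.0, 400)):
    low, high = approx_95_interval(70.0, sd, n)
    print(f"{label}: mean = 70.0, n = {n}, interval = {low:.1f} to {high:.1f}")
```

Both hypothetical studies report the same mean, but the smaller one is far less certain about it. Without the standard deviation and sample size, you can't tell the two apart.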

Join the Stats With Cats group on Facebook.


http://statswithcats.wordpress.com/2010/08/01/there%E2%80%99s-something-about-variance/
