Ten Tactics in the War on Error

Published by: terrabyte on Apr 02, 2011
Scientists and other theory-driven data analysts focus on eliminating bias and maximizing accuracy so they can find trends and patterns in their data. That's necessary for any type of data analysis. For statisticians, though, the real enemy in the battle to discover knowledge isn't so much accuracy as it is precision. Call it lack of precision, variability, uncertainty, dispersion, scatter, spread, noise, or error; it's all the same adversary.

One caution before advancing further. There's a subtle difference between data mining and data dredging. Data mining is the process of finding patterns in large data sets using computer algorithms, statistics, and just about anything else you can think of short of voodoo. Data dredging involves substantial voodoo, usually involving overfitting models to small data sets that aren't really as representative of a population as you might think. Overfitting doesn't occur in a blitzkrieg; it's more of a siege process. It takes a lot of work. Even so, it's easy to lose track of where you are and what you're doing when you're focused exclusively on a goal. The upshot of all this is that you may be creating false intelligence. Take the high ground. Don't interpret statistical tests and probabilities too rigidly. Don't trust diagnostic statistics with your professional life. Any statistician with a blood alcohol level under 0.2 will make you look silly.

Here are ten tactics you can use to try to control and reduce variability in a statistical analysis.
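To see what a data-dredging defeat looks like in miniature, here's a quick Python sketch (my own illustration, not from the original post): a ninth-degree polynomial laid siege to ten noisy points wins the in-sample battle and loses the out-of-sample war.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

# A small, noisy sample from a simple linear process: y = 2x + noise
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.3, size=10)

# The siege: a 9th-degree polynomial has enough terms to chase the noise
overfit = Polynomial.fit(x_train, y_train, deg=9)

# In-sample error looks like a victory...
in_rmse = np.sqrt(np.mean((overfit(x_train) - y_train) ** 2))

# ...but fresh data from the same process exposes the false intelligence
x_new = rng.uniform(0, 1, size=100)
y_new = 2 * x_new + rng.normal(0, 0.3, size=100)
out_rmse = np.sqrt(np.mean((overfit(x_new) - y_new) ** 2))

print(f"In-sample RMSE:     {in_rmse:.3f}")
print(f"Out-of-sample RMSE: {out_rmse:.3f}")
```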
Know Your Enemy
There's an old saying: six months collecting data will save you a week in the library. Be smart about your data. Figure out where the variability might be hiding before you launch an attack. Focus at first on three types of variability: sampling, measurement, and environmental (http://statswithcats.wordpress.com/2010/08/01/there%E2%80%99s-something-about-variance/). Sampling variability consists of the differences between a sample and the population that are attributable to how uncharacteristic (non-representative) the sample is of the population. Measurement variability consists of the differences between a sample and the population that are attributable to how data were measured or otherwise generated. Environmental variability consists of the differences between a sample and the population that are attributable to extraneous factors. So there are three places to hunt for errors: how you select the samples, how you measure their attributes, and everything else you do. OK, I didn't say it was going to be easy.
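If it helps to see the three hiding places in action, here's a small simulation (my own, with made-up numbers, not from the post) in which sampling, measurement, and environmental variability each nudge the observed values away from the population.

```python
import numpy as np

rng = np.random.default_rng(7)

# True population: 100,000 individuals with mean 50, sd 5
population = rng.normal(50, 5, size=100_000)

# Sampling variability: a small grab of 30 individuals
sample = rng.choice(population, size=30, replace=False)

# Measurement variability: instrument noise added to each reading
measured = sample + rng.normal(0, 2, size=sample.size)

# Environmental variability: an extraneous factor (say, a temperature drift)
# shifts the first half of the readings
observed = measured + np.where(np.arange(measured.size) < 15, 1.5, 0.0)

print(f"Population mean:        {population.mean():.2f}")
print(f"Sample mean (sampling): {sample.mean():.2f}")
print(f"+ measurement error:    {measured.mean():.2f}")
print(f"+ environmental effect: {observed.mean():.2f}")
```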
Start with Diplomacy
Start by figuring out what you can do to tame the error before things get messy. Consider how you can use the concepts of reference, replication, and randomization. The concept behind using a reference in data generation is that there is some ideal, background, baseline, norm, benchmark, or at least generally accepted standard that can be compared to all similar data operations or results. If you can't take advantage of a reference point to help control variability, try repeating some aspects of the study as a form of internal reference. When all else fails, randomize.

[Image: A lone sentry sits on watch in the night.]

Five maneuvers you can try in order to control, minimize, or at least be able to assess the effects of extraneous variability are (http://statswithcats.wordpress.com/2010/09/19/it%E2%80%99s-all-in-the-technique/):

Procedural Controls: like standard instructions, training, and checklists.
Quality Samples and Measurements: like replicate measurements, placebos, and blanks.
Sampling Controls: like random, stratified, and systematic sampling patterns.
Experimental Controls: like randomly assigning individuals or objects to groups for testing, control groups, and blinding.
Statistical Controls: special statistics and procedures like partial correlations and covariates.

Even if none of these things work, at least everybody will know you tried.
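For a concrete taste of two of these maneuvers, here's a short Python sketch (my own, with a hypothetical sampling frame): a sampling control, stratified random sampling, followed by an experimental control, random assignment to groups.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Hypothetical frame of 1,000 subjects in three strata (e.g., habitat types)
frame = pd.DataFrame({
    "subject_id": range(1000),
    "stratum": rng.choice(["forest", "meadow", "wetland"], size=1000, p=[0.5, 0.3, 0.2]),
})

# Sampling control: stratified random sampling, 10% from each stratum,
# so no stratum is under-represented by accident
stratified = frame.groupby("stratum").sample(frac=0.10, random_state=11)

# Experimental control: randomly assign the sampled subjects to treatment or control
stratified["group"] = rng.choice(["treatment", "control"], size=len(stratified))

print(stratified["stratum"].value_counts())
print(stratified["group"].value_counts())
```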
Prepare, Provision, and Deploy
Before entering the fray, you'll want to know that your troops, the data, are ready to go. You have to ask yourself two questions: do you have the right data and do you have the data right? Getting the right data involves deciding what to do about replicates, missing data, censored data, and outliers. Getting the data right involves making sure all the values were generated appropriately and the values in the dataset are identical to the values that were originally generated. Sorting, reformatting, writing test formulas, calculating descriptive statistics, and graphing are some of the data scrubbing maneuvers that will help to eliminate extraneous errors (http://statswithcats.wordpress.com/2010/10/17/the-data-scrub-3/). Once you've done all that, the only thing left to do is lock and load.
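A minimal data-scrub pass along those lines might look something like this in Python; the file and column names are hypothetical placeholders, not anything from the original post.

```python
import pandas as pd

# Hypothetical field data with the usual battle damage (assumed file name)
df = pd.read_csv("field_data.csv")

# Do you have the right data? Flag replicates and missing values.
print("Duplicate rows:", df.duplicated().sum())
print("Missing values per column:")
print(df.isna().sum())

# Do you have the data right? Descriptive statistics expose impossible values
# (negative weights, dates in the future, typos in categorical codes).
print(df.describe(include="all"))

# A simple range check as a 'test formula'; 'weight_kg' is a hypothetical column
suspect = df[(df["weight_kg"] <= 0) | (df["weight_kg"] > 500)]
print(f"{len(suspect)} suspect weight values to review before analysis")
```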
Perform Reconnaissance
While analyzing your data, be sure to look at the error in every way you can. Is it relatively small? Is it constant for all values of the dependent variable? Infiltrate the front line of diagnostic statistics. Look beyond r-squares and test probabilities to the standard error of estimate, DFBETAs, deleted residuals, leverage, and other measures of data influence (http://statswithcats.wordpress.com/2010/12/19/you%e2%80%99re-off-to-be-a-wizard/). What you learn from these diagnostics will lead you through the next actions.
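For an ordinary least squares fit, those diagnostics can be pulled from, for example, Python's statsmodels. Here's a sketch with made-up data and one planted high-leverage point (my own example, not the post's).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Made-up data: one predictor, with one deliberately extreme x value
x = np.append(rng.uniform(0, 10, 49), 30.0)
y = 1.5 * x + rng.normal(0, 2, size=50)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
influence = results.get_influence()

print("R-squared:                 ", round(results.rsquared, 3))
print("Standard error of estimate:", round(np.sqrt(results.mse_resid), 3))
print("Max leverage (hat value):  ", round(influence.hat_matrix_diag.max(), 3))
print("Max |DFBETA| for slope:    ", round(np.abs(influence.dfbetas[:, 1]).max(), 3))
print("Max |deleted residual|:    ",
      round(np.abs(influence.resid_studentized_external).max(), 3))
```

If the planted point dominates the leverage and DFBETA values while the r-square still looks respectable, that's exactly the kind of intelligence the remaining tactics act on.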
Divide and Conquer
Perhaps the best, or at least the most common, way to isolate errors is to divide the data into more homogeneous groups. There are at least three ways you can do this. First, and easiest, is to use any natural grouping data you might have in your dataset, like species or sex. There may also be information you can use to group the data in the metadata. Second is the more problematical visual classification. You may be able to classify your data manually by sorting, filtering, and most of all, plotting (http://statswithcats.wordpress.com/2010/09/26/it-was-professor-plot-in-the-diagram-with-a-graph/). For example, by plotting histograms you may be able to identify thresholds for categorizing continuous-scale data into groups, like converting weight into weight classes. Then you can analyze each more homogeneous class separately. Sometimes it helps and sometimes it's just a lot of work for little result. The other potential problems with visual classification are that it takes a bit of practice to know what to do and what to look for, and more importantly, you have to be careful that your grouping isn't just coincidental.

[Image: Reconnaissance requires stealth and camouflage.]

The third method of classifying data is the best or the worst, depending on your perspective. Cluster analysis is unarguably the best way to find the optimal groupings in data (http://statswithcats.wordpress.com/2011/03/13/becoming-part-of-the-group/). The downside is that the technique requires even more skill and experience than visual classification, and furthermore, the right software.
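Here's a small illustration of the second and third methods (my own sketch, with made-up weights): a histogram to eyeball a threshold, then k-means clustering to find the grouping without the eyeballing.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

# Made-up weights from two unlabeled groups (e.g., juveniles and adults)
weights = np.concatenate([rng.normal(12, 2, 60), rng.normal(30, 4, 40)])

# Visual classification: a histogram often shows the gap between groups
counts, edges = np.histogram(weights, bins=20)
print("Histogram counts:", counts)

# Cluster analysis: let k-means find the grouping instead of eyeballing it
labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(weights.reshape(-1, 1))
for k in (0, 1):
    members = weights[labels == k]
    print(f"Cluster {k}: n={members.size}, mean weight={members.mean():.1f}")
```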
Call in Reinforcements
If you find that you need more than just groupings to minimize extraneous error, bring in some transformations (http://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/). You can use transformations to rescale, smooth, shift, standardize, combine, and linearize data, and in the process, minimize unaccounted-for errors. There's no shame in asking for help: not physical, not mental, and not mathematical.
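A few of those transformations in Python, using made-up skewed measurements (my own sketch, not the post's data):

```python
import numpy as np

rng = np.random.default_rng(13)

# Made-up right-skewed measurements (e.g., concentrations)
x = rng.lognormal(mean=2.0, sigma=0.8, size=200)

# Shift and rescale: z-scores put variables on a common footing
z = (x - x.mean()) / x.std()

# Linearize: a log transform often tames multiplicative, skewed error
log_x = np.log(x)

# Smooth: a 5-point moving average knocks down high-frequency noise
smoothed = np.convolve(x, np.ones(5) / 5, mode="valid")

print(f"Raw scale: mean {x.mean():.1f} vs median {np.median(x):.1f} (the gap shows the skew)")
print(f"Log scale: mean {log_x.mean():.2f} vs median {np.median(log_x):.2f} (much closer)")
print(f"Standardized range: {z.min():.2f} to {z.max():.2f}")
print(f"Smoothed series length: {smoothed.size}")
```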
Shock and Awe
If all else fails, you can call in the big guns. In a sense, this tactic involves rewriting the rules of engagement. Rather than attacking the subjects, you aim at reducing the chaos in the variables. The main technique to try is factor analysis. Factor analysis involves rearranging the information in the variables so that you have a smaller number of new variables (called factors, components, or dimensions, depending on the type of analysis) that represent about the same amount of information. These new factors may be able to account for errors more efficiently than the original variables. The downside is that the factors often represent latent, unmeasurable characteristics of the samples, making them hard to interpret. You also have to be sure you have appropriate weapons of math production (i.e., software) if you're going to try this tactic.
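Principal component analysis is one common flavor of this dimension-reduction idea; the post speaks of factor analysis generally, so treat this Python sketch with made-up data as a stand-in rather than the author's method.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(17)

# Made-up data: six measured variables driven by two latent characteristics
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + rng.normal(0, 0.3, size=(200, 6))

# Principal components: fewer new variables, about the same information
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

print("Variance explained by two components:",
      round(pca.explained_variance_ratio_.sum(), 3))
print("Component scores shape:", scores.shape)  # (200, 2) stands in for (200, 6)
```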
Set Terms of Surrender
If you've been pretty aggressive in torturing your data, make sure the error enemy is subdued before declaring victory. Errors are like zombies. Just when you think you have everything under control they come back to bite you. Rule 2: always double tap. In statistics, this means that you have to verify your results using a different data set. It's called cross validation and there are many approaches. You can split the data set before you do any analysis, analyze one part (the training data set), and then verify the results with the other part (the test data set). You can randomly extract observations from the original data set to create new data sets for analysis and testing. Finally, you can collect new samples. You just want to be sure no errors are hiding where you don't suspect them.
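Two of those cross-validation approaches, sketched in Python with made-up data (my own example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(23)

# Made-up data set with three predictors
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, size=150)

# Approach 1: split before any analysis, train on one part, test on the other
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
model = LinearRegression().fit(X_train, y_train)
print("Hold-out R-squared:", round(model.score(X_test, y_test), 3))

# Approach 2: repeated random splits (k-fold cross validation); always double tap
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("5-fold R-squared scores:", np.round(scores, 3))
```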
 
Have an Exit Strategy
In the heat of data analysis, sometimes it's difficult to recognize when to disengage. Even analysts new to the data can fall into the same traps as their predecessors. There are

[Image: Questioning a suspected errorist.]
