also be able to collect samples in phases or change the implementation schedule
to accommodate your client’s budget cycle.
Use Supporting Data
There may be historical data available that you can use to
reassess the number of samples you’ll need and even augment the samples you
plan to collect, provided the quality of the historical data is appropriate. You can also consider surrogate sampling, in which you correlate the results of many inexpensive observations or measurements to the few expensive samples your client can afford.
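Here is a minimal sketch of the surrogate idea, assuming a straight-line relationship between the cheap and expensive measurements. The data values and the names `fit_line`, `cheap_paired`, `expensive_paired`, and `cheap_only` are made up for illustration; they aren't from any real project.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical paired data: the few locations where BOTH the cheap
# measurement and the expensive analysis were made.
cheap_paired = [1.0, 2.0, 3.0, 4.0, 5.0]
expensive_paired = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(cheap_paired, expensive_paired)

# Hypothetical locations where only the cheap measurement exists.
cheap_only = [1.5, 2.5, 3.5, 4.5]

# Estimate what the expensive analysis would have reported there.
estimated_expensive = [intercept + slope * c for c in cheap_only]
print(estimated_expensive)
```

Before you lean on estimates like these, check how strong the correlation really is in the paired data. A weak relationship just trades one source of uncertainty for another.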
If you think about it, the reason you need more samples in the first place is to improve precision (not accuracy). So think harder about how you can reduce any extraneous variability in the data generation process. Standardized procedures and training of the data collectors might mitigate the need for quite a few samples.
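To see how much reducing variability can help, here is a rough sketch using the textbook normal-approximation formula for estimating a mean, n = (z × s / E)², where s is the standard deviation and E is the margin of error you can tolerate. The function name `samples_needed` and the numbers are illustrative assumptions, and your own analysis may call for a different formula.

```python
import math

def samples_needed(std_dev, margin, z=1.96):
    """Approximate sample size to estimate a mean within +/- margin.

    Assumes a normal approximation; z = 1.96 corresponds to roughly
    95% confidence.
    """
    return math.ceil((z * std_dev / margin) ** 2)

# Hypothetical case: sloppy data collection gives a standard deviation of 10.
print(samples_needed(std_dev=10.0, margin=2.0))

# Same target precision after tightening procedures cuts it to 5.
print(samples_needed(std_dev=5.0, margin=2.0))
```

Because the standard deviation is squared in the formula, cutting the variability in half cuts the required number of samples to about a quarter.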
Can you eat too many potato chips? Of course you can.
It’s happened to many of us.
Likewise, you can have too many samples, which presents its own set of challenges. Here are five:
Statistical software tends to be very efficient, but when you
have tens of thousands of samples, you start to see performance slow a bit. What’s more
In any data set, you may have 5% influential observations, not to
mention the outliers and errors that you’ll have to check to determine if they should be
corrected, removed from the dataset, or left alone. This is a very time-consuming process. With a small dataset, you may have to investigate just a few samples. With a 1,000-record dataset, you may have to investigate 50 samples. This is part of why data scrubbing can represent most of the work in a data analysis project.
When you’re working with only a few dozen samples, you get to
know each data point. You can look at plots and tables and see how individual details fit
into a bigger picture. You can’t do that with a thousand data points. Sometimes you can get around this problem by dividing the data into groups and working with the groups, or analyzing a higher level of hierarchical data.
It’s tough to see patterns with only a few samples, but plotting
thousands of samples can be just as perplexing. You won’t be able to use any small plots
like matrix plots. Even with full-scale plots, it will be difficult to see subtle differences in data point markers, like size, shape, and even color. Points will overwrite each other so
you won’t be able to tell if there is one point at a graph location or a hundred points
stacked on top of each other. And even the best statistical software will choke when trying to print graphs with thousands of data points. Solving this problem usually involves plotting group means or only randomly selected records from the data matrix.
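Both workarounds are easy to sketch. The example below assumes each record is a (group, x, y) tuple; the simulated data and the names `random_subset` and `group_means` are hypothetical and just stand in for whatever your data matrix looks like.

```python
import random
from collections import defaultdict
from statistics import mean

def random_subset(records, k, seed=0):
    """Return up to k randomly selected records (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the same plot can be redrawn
    return rng.sample(records, min(k, len(records)))

def group_means(records):
    """Collapse (group, x, y) records into one (mean_x, mean_y) per group."""
    groups = defaultdict(list)
    for group, x, y in records:
        groups[group].append((x, y))
    return {
        g: (mean(p[0] for p in pts), mean(p[1] for p in pts))
        for g, pts in groups.items()
    }

# Simulated stand-in for a large data matrix: 10,000 records in 3 groups.
data_rng = random.Random(42)
records = [
    (data_rng.choice("ABC"), data_rng.gauss(0, 1), data_rng.gauss(0, 1))
    for _ in range(10_000)
]

subset = random_subset(records, 500)   # plot these 500 points instead of 10,000
centers = group_means(records)         # or plot one point per group
print(len(subset), sorted(centers))
```

Either result can go to your plotting software in place of the full dataset. Keep in mind that a random subset can miss rare outliers and group means hide the spread within each group, so it's worth looking at both.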