Hand 2013
records, in which abnormally high recordings occasionally occurred. Closer examination revealed that these all occurred at midnight, coincidentally, at the same time at which the recording machine automatically reset itself. A (small) data set giving times and durations of geyser eruptions which show anomalies is described in Ref. 4. Again, closer examination revealed that these were not caused by departures from the underlying common physical mechanism, but instead by recording errors. Digit preference is another cause of such curiosities, and is only detectable in large data sets.

In statistics, one is often able to cope with data inadequacies by extending the model to cope with them. Thus, for example, one may handle distorted sampling by including a model for the case selection process, and incomplete vectors of measurements may be handled via the EM algorithm. However, such strategies can only be adopted if one has some awareness (and understanding) of the data contamination mechanism. In data mining with secondary data, this is often not the case. Often, such problems are ignored, with obvious potential for misleading conclusions.

Statistical hypothesis and significance tests explore the possibility of spurious patterns arising in the data using tools that estimate the probability of such structures arising by chance, merely as a consequence of random variation. Unfortunately, with large data sets, and when large sets of possible patterns are sought, the opportunity for the discovery of apparent structures is clearly great; in particular, issues arising from multiple testing are a real potential problem. This means that a formal statistical approach may not always be readily applied. Instead, data miners often simply define score functions (for the interestingness, unusualness, or some other characteristic of a pattern), without any probability interpretation, and pass those patterns which show the largest such scores over to an expert for evaluation.

This description reveals the process nature of data mining. Data mining is not a “one-off” exercise, to be done and finished with. Rather, it is an ongoing process; one examines a data set, identifies features of possible interest, discusses them with an expert, goes back to the data in the light of these discussions, and so on.

Data Mining and Statistics

The score functions mentioned in the previous section may be the same as the criteria used in statistical model fitting, without the probabilistic interpretation, or they may be other criteria. An illustration of the differences in perspective is given by regression analysis. A statistician may find the maximum likelihood estimates of the parameters, assuming a normal error distribution. In contrast, a data miner may adopt the sum of squared residuals as a score function to use in choosing the parameters. Since maximizing the likelihood based on a normal error distribution leads to the sum of squares criterion, these two approaches yield the same result, but they start from different positions. The statistician has a formal model in mind, while the data miner is simply aiming to find a good description of the data. Of course, the distinction is not a rigid one; there is an overlap between the two perspectives.

This example does show how central the concept of modeling is to the statistician. In contrast, data miners tend to place much more emphasis on algorithms. Given the essential role of computers in data mining, this algorithmic emphasis is perhaps not surprising. Moreover, when data sets are very large, the popular statistical algorithms may become impracticable (tools which make repeated passes through the data, for example, may be out of the question with a billion data points). Computers are, of course, also important for statistics, but many statistical techniques can be applied on small data sets without computers; in fact, many were originally developed that way. One consequence of this is that it may be difficult to describe exactly what model is being fit to the data in a data mining exercise. For example, association rules are a type of local structure which has been explored in the data mining rather than the statistical literature. They are small local models or patterns, describing co-occurrences of particular values of discrete variables. Originally developed for studying purchase behavior patterns in supermarkets, they have been applied much more widely, for example, to studying local collocation patterns in ecological data sets. The key issue in developing association rules is not how to define the small local models, which is straightforward, but rather how to cope with the potentially massive search problems, and some sophisticated algorithms have been developed by the data mining community.
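
To make the notion of an association rule concrete, the following is a minimal sketch, not taken from the article, that scores rules of the form "A implies B" between single items by their support (the fraction of transactions containing both items) and confidence (the fraction of transactions containing A that also contain B). The function name `pair_rules`, the thresholds, and the toy baskets are all illustrative assumptions. The exhaustive loop over item pairs also shows why search is the hard part: the number of candidate item sets grows combinatorially with the number of items, which is what pruning algorithms such as Apriori are designed to address.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs.

    Brute-force illustration only: every pair of items is examined,
    which is feasible for a toy example but not for large item sets.
    """
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    items = sorted(set().union(*baskets))

    def count(itemset):
        # Number of baskets that contain every item in `itemset`.
        return sum(itemset <= basket for basket in baskets)

    rules = []
    for a, b in combinations(items, 2):
        joint = count({a, b})
        if joint / n < min_support:
            continue  # the pair co-occurs too rarely to be of interest
        for lhs, rhs in ((a, b), (b, a)):
            confidence = joint / count({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, joint / n, confidence))
    return rules


# Hypothetical supermarket baskets, invented for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
    ["bread", "butter", "milk"],
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Note that support and confidence here are simply score functions in the sense used above: they rank candidate patterns for an expert to inspect, and carry no claim of statistical significance.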
The difficulty of describing exactly what model is being fitted can have adverse consequences. For example, cluster analysis is widely used in data mining, but without careful thought about the nature of the procedure it can be difficult to be clear about what sort of “clusters” are being found. Thus, compact structures may be appropriate in some situations (e.g., to produce compact summarizing descriptions, with the clusters being represented by “central” points), while in others elongated shapes may be desirable, in which neighboring points in the same cluster are similar but distant ones are not (e.g., animals of a given species may differ more in some dimensions than in others). Without an awareness of the type of structure that the method reveals, inappropriate conclusions could be drawn; a species could be incorrectly partitioned on a dimension in which it has substantial variability.

In the above, we have used words such as model, structure, and pattern without defining them. In Ref. 5, a model was defined as a large-scale summary of a set of data (that is, as the standard statistical notion of a model), and a pattern as a small-scale local structure. A Box–Jenkins decomposition of a time series is a model, whereas a conjunction of values which occasionally repeats itself (for example, in an electroencephalogram (EEG) trace) is a pattern. Models are the staple of statistics, and patterns are something with which statistics has generally not been concerned. An examination of the data mining literature shows that both models and patterns are important, but narrow views of data mining sometimes fail to recognize the diversity of the tools used. Thus, for example, it is sometimes claimed that data mining is merely the application of recursive partitioning methods (e.g., tree classifiers), but this is a parody of the breadth of the field. Likewise, the viewpoint sometimes proposed in the econometric literature, that data mining is merely an elaborate and extensive form of model search, fails to recognize the various other kinds of data mining activities which go on.

Data Mining Tools

A large number of different kinds of tools are used in data mining, reflecting the eclecticism of its origins. Some recent ones in pattern recognition and detection, culled from the data mining literature with no particular objective other than to indicate the diversity of different kinds of method, are tools for characterizing, identifying, and locating patterns in multivariate response data; tools for detecting and identifying patterns in two-dimensional displays (such as fingerprints and meteorological charts); identifying sudden changes over time (such as those induced by chemical leakages, or those arising in intrusion detection in computer systems); detecting changes in behavior (for example, in animal migration routes, or in fraud detection, a common application in commercial, scientific, and governmental domains, the last including electoral fraud); and identifying logical combinations of values which differ between groups. Some examples of important tools in model building in data mining (again chosen with no particular aim other than to illustrate the range of such methods) include recursive partitioning, cluster analysis, regression modeling, segmentation of time series into a small number of segment types, techniques for condensing huge (tens of billions of data points) data sets into manageable summaries, and collaborative filtering, in which transactions are processed as they arrive so that future transactions may be treated in a more appropriate manner.

Much statistical theory is aimed at producing valid inferences from a sample to some population (real or notional) from which the sample has been drawn. This might be so that one can make comparative statements about the populations, or for environmental forecasting, or for other reasons. These methods are also appropriate in data mining, provided one has a sample and that it has been drawn in a probabilistic way (so that one knows the probability of each object appearing in the sample; see Environmental tobacco smoke (ETS)). Going further than this, in many data mining applications one has available data on the entire population (for example, all chemical molecules in a particular class), and then, in model building data mining applications, analyzing a sample from the data set may be a sensible way to proceed. In contrast, however, in a pattern detection exercise it will typically be necessary to analyze the entire data set; if one is seeking those data points which are anomalous, there is no alternative to examining every data point.

It is clear that data mining will be of increasing importance as time progresses. However, the importance should not conceal the difficulties. Finding unsuspected structures in large data sets, and identifying those which are due to phenomena of genuine
interest and not merely arising from data contamination or due to chance, is by no means a trivial exercise. Issues of theory, of data management, and of practice all arise. General descriptions of data mining are given in Refs 6, 7.

References

[1] Way, M.J., Scargle, J.D., Ali, K., & Srivastava, A.N. (2012). Advances in Machine Learning and Data Mining for Astronomy, Chapman and Hall, London.
[2] Zeljko, I. (2011). Data Mining and Machine Learning in Astronomy – A Practical Guide, Princeton University Press, Princeton.
[3] Gustafson, J.P., Shoemaker, R., & Snape, J.W. (2011). Genome Exploitation: Data Mining the Genome, Springer, New York.
[4] Denby, L. & Pregibon, D. (1987). An example of the use of graphics in regression, The American Statistician 41, 33–38.
[5] Hand, D.J., Blunt, G., Kelly, M.G., & Adams, N.M. (2000). Data mining for fun and profit, Statistical Science 15, 111–131.
[6] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R., eds, (1996). Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park.
[7] Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining, MIT Press, Cambridge.

DAVID J. HAND