Data mining

Based in part on the article “Data mining” by David Hand, which appeared in the
Encyclopedia of Environmetrics.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.
This article is © 2013 John Wiley & Sons, Ltd.
This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.
DOI: 10.1002/9780470057339.vad002.pub2

Introduction

Data mining is a new discipline which has sprung up at the confluence of several
other disciplines, driven chiefly by the growth of large databases. The basic
motivating stimulus behind data mining is that these large databases contain
information which is of value to the database owners, but this information is
concealed within the mass of uninteresting data and has to be discovered. That
is, one is seeking surprising, novel, unexpected, or valuable information, and
the aim is to extract this information. This means that the subject is closely
allied to exploratory data analysis. However, issues arising from the sizes of
the databases, as well as ideas and tools imported from other areas, mean that
there is more to data mining than merely exploratory data analysis.

Perhaps the main economic driver of the development of data mining tools and
techniques has come from the commercial world; the promise of money to be made
from data processing innovations is a familiar one, and commercial databases are
now rapidly growing in size, as well as in number. However, there is also
substantial scientific interest; philosophers of science have remarked that
advances and innovation often occur when a mismatch between the data and the
predictions of a theory occurs, and nowadays detecting such mismatches often
requires extensive analysis of large data sets. Areas of scientific application
of data mining include astronomy [1, 2] and molecular biology [3].
Environmental applications include such things as detecting the impact of
pollutants on people or ecosystems before these effects become readily apparent,
monitoring changes due to urbanization, and monitoring and detecting climate
change. Data sets for these and other environmental data mining exercises are
collected from a wide range of modalities, including high altitude and satellite
imagery, as well as surface measuring instruments, both terrestrial and marine.

Apart from the sizes of the data sets, one of the early distinguishing features
of data mining was that the data to which it was applied were typically
secondary. That is, the data were usually collected to answer some other
question, or perhaps collected secondarily in the course of pursuing some other
issue (for example, rainfall measurements may be collected to study climate
phenomena, but then used as the input to a study of groundwater distribution).
However, data mining is a rapidly evolving technology, and increasing numbers of
data sets are now collected with the specific objective of trawling through them
seeking interesting or unusual configurations. Examples are particle physics
experiments and many commercial applications, especially those involving the
World Wide Web. In the latter applications, in particular, formal adaptive
experimental designs may be used. Novel applications of data mining specifically
aimed at using the web include such things as early detection of epidemics and
the spread of disease (see Epidemic models) from the purchase patterns of
pharmaceuticals or from Twitter communications.

The excitement of data mining is also partly a consequence of this secondary
nature; it suggests that there is valuable information concealed within the data
one already has, simply waiting for someone to tease it out. Unfortunately, the
“simply” part of this exercise is rather misleading. One of the problems is that
large data sets necessarily have a great deal of structure in them, but this
structure has three major sources in addition to the target one of “important,
real, undiscovered structure.” These three sources are data contamination,
chance occurrences of data, and structure which is already known to the database
owner (or, if not explicitly articulated as known, sufficiently obvious once it
has been pointed out to be of no genuine interest or value, such as the fact
that married people come in pairs). The first and second of these are
sufficiently important to warrant some discussion.

Challenges of Data Mining

It is probably not too much of an exaggeration to say that all environmental
data sets are contaminated, though with small data sets this may be difficult to
detect. With large data sets, it means that the data miner may triumphantly
return an unusual pattern which is simply an artefact of data collection,
recording, or other inadequacies. A postgraduate student discovered some curious
anomalies in wind speed
records, in which abnormally high recordings occasionally occurred. Closer
examination revealed that these all occurred at midnight, which was also the
time at which the recording machine automatically reset itself. A (small) data
set of times and durations of geyser eruptions which shows anomalies is
described in Ref. 4. Again, closer examination revealed that these were not
caused by departures from the underlying common physical mechanism, but instead
by recording errors. Digit preference is another cause of such curiosities, and
is only detectable in large data sets.

In statistics, one is often able to cope with data inadequacies by extending the
model to cope with them. Thus, for example, one may handle distorted sampling by
including a model for the case selection process, and incomplete vectors of
measurements may be handled via the EM algorithm. However, such strategies can
only be adopted if one has some awareness (and understanding) of the data
contamination mechanism. In data mining with secondary data, this is often not
the case. Often, such problems are ignored, with obvious potential for
misleading conclusions.

Statistical hypothesis and significance tests explore the possibility of
spurious patterns arising in the data using tools that estimate the probability
of such structures arising by chance – merely as a consequence of random
variation. Unfortunately, with large data sets, and when large sets of possible
patterns are sought, the opportunity for the discovery of apparent structures is
clearly great; in particular, issues arising from multiple testing are a real
potential problem. This means that a formal statistical approach may not always
be readily applied. Instead, data miners often simply define score functions
(for the interestingness, unusualness, or some other characteristic of a
pattern), without any probability interpretation, and pass those patterns which
show the largest such scores over to an expert for evaluation.

This description reveals the process nature of data mining. Data mining is not a
“one-off” exercise, to be done and finished with. Rather, it is an ongoing
process; one examines a data set, identifies features of possible interest,
discusses them with an expert, goes back to the data in the light of these
discussions, and so on.

Data Mining and Statistics

The score functions mentioned in the previous section may be the same as the
criteria used in statistical model fitting, without the probabilistic
interpretation, or they may be other criteria. An illustration of the
differences in perspective is given by regression analysis. A statistician may
find the maximum likelihood estimates of the parameters, assuming a normal error
distribution. In contrast, a data miner may adopt the sum of squared residuals
as a score function to use in choosing the parameters. Since maximizing the
likelihood based on a normal error distribution leads to the sum of squares
criterion, these two approaches yield the same result, but they start from
different positions. The statistician has a formal model in mind, while the data
miner is simply aiming to find a good description of the data. Of course, the
distinction is not a rigid one; there is an overlap between the two
perspectives.

This example does show how central the concept of modeling is to the
statistician. In contrast, data miners tend to place much more emphasis on
algorithms. Given the essential role of computers in data mining, this
algorithmic emphasis is perhaps not surprising. Moreover, when data sets are
very large, the popular statistical algorithms may become impracticable (tools
which make repeated passes through the data, for example, may be out of the
question with a billion data points). Computers are, of course, also important
for statistics, but many statistical techniques can be applied on small data
sets without computers; in fact, many were originally developed that way. One
consequence of this is that it may be difficult to describe exactly what model
is being fitted to the data in a data mining exercise. For example, association
rules are a type of local structure which has been explored in the data mining
rather than the statistical literature. They are small local models or patterns,
describing co-occurrences of particular values of discrete variables. Originally
developed for studying purchase behavior patterns in supermarkets, they have
been applied much more widely, for example to the study of local collocation
patterns in ecological data sets. The key issue in developing association rules
is not how to define the small local models, which is straightforward, but
rather how to cope with the potentially massive search problems, and some
sophisticated algorithms have been developed by the data mining community.
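
As a rough illustration of the small local models that association rules
describe, the following Python sketch counts co-occurrences of values in a toy
data set and reports single-item rules of the form A -> B. The species lists,
the function name pair_rules, and the threshold values are all hypothetical.
Support and confidence are the measures conventionally used to score such
rules; they are an assumption here rather than something defined in this
article. The exhaustive pair count is workable only on tiny data and does not
attempt the search strategies needed for the massive problems mentioned above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set lists the species recorded at one survey site.
sites = [
    {"oak", "jay", "squirrel"},
    {"oak", "jay"},
    {"oak", "squirrel"},
    {"pine", "crossbill"},
    {"oak", "jay", "pine"},
]


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    support    = fraction of transactions containing both items
    confidence = fraction of transactions containing the antecedent
                 that also contain the consequent
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        item_counts.update(t)
        # Sort so that each unordered pair is always counted under one key.
        pair_counts.update(combinations(sorted(t), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can give a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


for x, y, s, c in pair_rules(sites):
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

With these made-up sites, the rule oak -> squirrel is dropped because only half
of the oak sites also record squirrels, while squirrel -> oak is kept. This
shows that such rules are directional even though the underlying co-occurrence
count is symmetric.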

The difficulty of describing exactly what model is being fitted can have adverse
consequences. For example, cluster analysis is widely used in data mining, but
without careful thought about the nature of the procedure it can be difficult to
be clear about what sort of “clusters” are being found. Thus, compact structures
may be appropriate in some situations (e.g., to produce compact summarizing
descriptions, with the clusters being represented by “central” points), while in
others elongated shapes may be desirable, in which neighboring points in the
same cluster are similar but distant ones are not (e.g., animals of a given
species may differ more in some dimensions than in others). Without an awareness
of the type of structure that the method reveals, inappropriate conclusions
could be drawn; a species could be incorrectly partitioned on a dimension in
which it has substantial variability.

In the above, we have used words such as model, structure, and pattern without
defining them. In Ref. 5, a model was defined as a large-scale summary of a set
of data (that is, as the standard statistical notion of a model), and a pattern
as a small-scale local structure. A Box–Jenkins decomposition of a time series
is a model, whereas a conjunction of values which occasionally repeats itself
(for example, in an electroencephalogram (EEG) trace) is a pattern. Models are
the staple of statistics, and patterns are something with which statistics has
generally not been concerned. An examination of the data mining literature shows
that both models and patterns are important, but narrow views of data mining
sometimes fail to recognize the diversity of the tools used. Thus, for example,
it is sometimes claimed that data mining is merely the application of recursive
partitioning methods (e.g., tree classifiers), but this is a parody of the
breadth of the field. Likewise, the viewpoint sometimes proposed in the
econometric literature, that data mining is merely an elaborate and extensive
form of model search, fails to recognize the various other kinds of data mining
activities which go on.

Data Mining Tools

A large number of different kinds of tools are used in data mining, reflecting
the eclecticism of its origins. Some recent ones in pattern recognition and
detection, culled from the data mining literature with no particular objective
other than to indicate the diversity of different kinds of method, are tools for
characterizing, identifying, and locating patterns in multivariate response
data; tools for detecting and identifying patterns in two-dimensional displays
(such as fingerprints and meteorological charts); tools for identifying sudden
changes over time (such as those induced by chemical leakages, or those sought
in intrusion detection in computer systems); tools for detecting changes in
behavior (such as animal migration routes; fraud detection is a common
application in various domains, including commercial, scientific, and
governmental, such as electoral fraud); and tools for identifying logical
combinations of values which differ between groups. Some examples of important
tools in model building in data mining (again chosen with no particular aim
other than to illustrate the range of such methods) include recursive
partitioning, cluster analysis, regression modeling, segmentation of time series
into a small number of segment types, techniques for condensing huge (tens of
billions of data points) data sets into manageable summaries, and collaborative
filtering, in which transactions are processed as they arrive so that future
transactions may be treated in a more appropriate manner.

Much statistical theory is aimed at producing valid inferences from a sample to
some population (real or notional) from which the sample has been drawn. This
might be so that one can make comparative statements about the populations, or
for environmental forecasting, or for other reasons. These methods are also
appropriate in data mining, provided one has a sample and that it has been drawn
in a probabilistic way (so that one knows the probability of each object
appearing in the sample; see Environmental tobacco smoke (ETS)). Going further
than this, in many data mining applications one has available data on the entire
population (for example, all chemical molecules in a particular class), and
then, in model building data mining applications, analyzing a sample from the
data set may be a sensible way to proceed. In contrast, however, in a pattern
detection exercise it will typically be necessary to analyze the entire data
set; if one is seeking those data points which are anomalous, there is no
alternative to examining every data point.

It is clear that data mining will be of increasing importance as time
progresses. However, the importance should not conceal the difficulties. Finding
unsuspected structures in large data sets, and identifying those which are due
to phenomena of genuine
interest and not merely arising from data contamination or due to chance, is by
no means a trivial exercise. Issues of theory, of data management, and of
practice all arise. General descriptions of data mining are given in Refs 6, 7.

References

[1] Way, M.J., Scargle, J.D., Ali, K., & Srivastava, A.N. (2012). Advances in
    Machine Learning and Data Mining for Astronomy, Chapman and Hall, London.
[2] Zeljko, I. (2011). Data Mining and Machine Learning in Astronomy – A
    Practical Guide, Princeton University Press, Princeton.
[3] Gustafson, J.P., Shoemaker, R., & Snape, J.W. (2011). Genome Exploitation:
    Data Mining the Genome, Springer, New York.
[4] Denby, L. & Pregibon, D. (1987). An example of the use of graphics in
    regression, The American Statistician 41, 33–38.
[5] Hand, D.J., Blunt, G., Kelly, M.G., & Adams, N.M. (2000). Data mining for
    fun and profit, Statistical Science 15, 111–131.
[6] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R., eds,
    (1996). Advances in Knowledge Discovery and Data Mining, AAAI Press,
    Menlo Park.
[7] Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining,
    MIT Press, Cambridge.

DAVID J. HAND
