772s Data Mining Concepts and Techniques 2nd Ed
Chapter 1 Introduction
Relational Databases
Data Warehouses
Transactional Databases
Cluster Analysis
Outlier Analysis
Evolution Analysis
2.3.2
Noisy Data
2.3.3
Data Integration
Data Transformation
Data Reduction
Dimensionality Reduction
Numerosity Reduction
Summary
Exercises
A Multilayer Feed-Forward Ne
Backpropagation
k-Nearest-Neighbor Classifier
6.10.2
6.10.3
Prediction
Frequent-Pattern Mining in Da
9.2
9.2.2
9.2.3
9.2.4
9.3
Multirelational Data Mining
Multidimensional Analysis of M
Automatic Classification of We
10.5.5 Web Usage Mining
Bibliography
Index
We are deluged by data—scientific data, medica
and marketing data. People have no time to l
become the precious resource. So, we must find
to automatically classify it, to automatically sum
characterize trends in it, and to automatically
active and exciting areas of the database
research ing statistics, visualization, artificial
intelligence to this field. The breadth of the field
makes it dif over the last few decades.
To the Instructor
To the Professional
This book was designed to cover a wide range o
result, it is an excellent handbook on the subjec
as stand-alone as possible, you can focus on the
can be used by application programmers and in
learn about the key ideas of data mining on thei
technical data analysis staff in banking, insuranc
are interested in applying data mining solution
may serve as a comprehensive survey of the da
researchers who would like to advance the sta
the scope of data mining applications.
The abundance of data, coupled with the need for been described as a
data rich but information poor situ dous amount of data, collected and
stored in large and far exceeded our human ability for comprehension
wit As a result, data collected in large data repositories beco that are
seldom visited. Consequently, important decisi the information-rich
data stored in data repositories, intuition, simply because the decision
maker does not able knowledge embedded in the vast amounts of da
system technologies, which typically rely on users or d knowledge into
knowledge bases. Unfortunately, this errors, and is extremely time-
consuming and costly. analysis and may uncover important data
patterns,
emphasis on mining from large amounts of da
characterizing the process that finds a small set
raw material (Figure 1.3). Thus, such a misno
ing” became a popular choice. Many other te
meaning to data mining, such as knowledge m
data/pattern analysis, data archaeology, and d
1
A popular trend in the information industry is to pe
preprocessing step, where the resulting data are stored i
2
Sometimes data transformation and consolidation ar
particularly in the case of data warehousing. Data reduc
representation of the original data without sacrificing it
Figure 1.5 Architecture of a typical data mining system. (Diagram labels: data cleaning, integration and selection; Database; Data Warehouse; World Wide Web; other repositories.)
AllElectronics).
Figure 1.7 Typical framework of a data warehouse for AllElectronics. (Diagram labels: data source in Toronto; data source in Vancouver; Integrate, Transform, Load, Refresh; Data Warehouse.)
Example 1.2 A data cube for AllElectronics.
A data cube for AllElectronics is presented in Figure 1.8(a).
The cube has thr
(Figure 1.8. Panel (a): data cube with dimensions address (cities: Chicago, New York, Toronto, Vancouver), time, and item (types: computer, security, home entertainment, phone). Panel (b): drill-down on time (months: Jan, Feb, March); a further view shows address (countries) against time (quarters: Q1 to Q4).)
Object-Relational Databases
$$\mathit{buys}(X, \text{``computer''}) \Rightarrow \mathit{buys}(X, \text{``software''})$$
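A rule of this form is commonly judged by its support and confidence. The following is a minimal sketch of how both could be computed for the computer/software rule; the transaction list and the helper name `rule_stats` are hypothetical and not taken from the text.

```python
from typing import Iterable, Set, Tuple


def rule_stats(transactions: Iterable[Set[str]],
               antecedent: str,
               consequent: str) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all transactions containing both items
    confidence = fraction of antecedent transactions that also
                 contain the consequent
    """
    transactions = list(transactions)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent in t)
    n_both = sum(1 for t in transactions
                 if antecedent in t and consequent in t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical purchase data (assumption: made up for illustration).
toy_transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

support, confidence = rule_stats(toy_transactions, "computer", "software")
```

In this toy data, two of the four transactions contain both items, and two of the three computer purchases also include software.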
In general, the class labels are not present in the traini not
known to begin with. Clustering can be used to gen clustered
or grouped based on the principle of maxim minimizing the
interclass similarity. That is, clusters of o within a cluster have
high similarity in comparison to on to objects in other clusters.
Each cluster that is formed c from which rules can be derived.
Clustering can also fac is, the organization of observations
into a hierarchy of together.
“So,” you may ask, “are all of the patterns inter tion of the patterns
potentially generated would
This raises some serious questions for data pattern interesting? Can a data
mining system ge a data mining system generate only interesting pa
Figure 1.12 Data mining as a confluence of multiple disciplines. (Diagram labels: Database technology, Statistics, Information science, Visualization, Other disciplines, Data Mining.)
relation analysis, classification, prediction, clusterin
analysis. A comprehensive data mining system
usual grated data mining functionalities.
Concept hierarchies
Simplicity
group by T.cust_ID
3
Note that in this book, query language keywords are
However, in this example, the two classes are
implici retrieved and considered examples of
“promising cus customers in the customer table are
considered as “n sification is performed based on
this training set. L results are to be displayed as a
set of rules. Several de introduced in Chapter 6.
Is it another hype?
1
Neural networks and nearest-neighbor
classifiers are de in Chapter 7.
(Figure residue. Panels: Data transformation; Data reduction. The data reduction panel shows a table with attributes A1, A2, ... as columns and transactions T2, T3, T4, ..., T2000 as rows.)
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
This corresponds to the built-in aggregate func
relational database systems.
$$\text{median} = L_1 + \left(\frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right)\text{width}$$
2
Data cube computation is described in detail in Chapters 3 and
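The approximate median of grouped data is obtained by interpolating inside the interval that contains the N/2-th observation. The sketch below illustrates that idea; the interval-width term follows the standard form of the formula and is an assumption here, and the function name and sample intervals are hypothetical.

```python
from typing import Sequence, Tuple


def grouped_median(intervals: Sequence[Tuple[float, float, int]]) -> float:
    """Approximate the median of data grouped into intervals.

    Each entry is (lower_bound, upper_bound, frequency). Intervals must
    be sorted by lower_bound and must not overlap.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # total frequency of the intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            # Median interval found: interpolate linearly inside it.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("unreachable for valid input")


# Hypothetical grouped data: (lower, upper, count).
example_median = grouped_median([(0, 10, 5), (10, 20, 10), (20, 30, 5)])
```

Here N = 20, so the median falls in the interval 10 to 20, halfway through its ten observations.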
1
$$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$$
This implies that the mode for unimodal frequ can easily
be computed if the mean and median In a unimodal
frequency curve with perfect median, and mode are all at
the same center va data in most real applications are not
symmetr skewed, where the mode occurs at a value that
is or negatively skewed, where the mode occu
(Figure 2.2(c)).
$IQR = Q_3 - Q_1$.
quartile.
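As a rough illustration of the interquartile range, the sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8 or later) with its default "exclusive" method. Quartile conventions differ between tools, so other software may return slightly different values; the helper name and sample data are hypothetical.

```python
import statistics


def interquartile_range(values):
    """Return (Q1, Q3, IQR) using the default 'exclusive' quantile method."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3, q3 - q1


# Hypothetical sample (assumption: made up for illustration).
q1, q3, iqr = interquartile_range([1, 2, 3, 4, 5, 6, 7])
```

For this sample the quartiles are 2.0 and 6.0, giving an IQR of 4.0.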
Figure 2.3 Boxplot for the unit price data for items sold
at fo time period.
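$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right]$$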
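The population variance can be computed either directly from the squared deviations from the mean or from the sums of the values and of their squares. A minimal sketch comparing the two forms follows; the function names and the sample are hypothetical.

```python
def variance_definition(values):
    """Mean squared deviation from the mean (divides by N, not N - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_shortcut(values):
    """Same quantity from the sum of squares and the squared sum."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / n


# Hypothetical sample (assumption: made up for illustration).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
var_a = variance_definition(sample)
var_b = variance_shortcut(sample)
```

The shortcut form needs only one pass over the data, but it can lose precision in floating point when the values are large relative to their spread.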
Aside from the bar charts, pie charts, and line graphs u
ical data presentation software packages, there are ot the
display of data summaries and distributions. Th plots, q-q
plots, scatter plots, and loess curves. Such grap
inspection of your data.
Figure 2.4 shows a histogram for the data set of Table equal-
width ranges representing $20 increments and th sold.
Histograms are at least a century old and are a method.
However, they may not be as effective as the qu methods for
comparing groups of univariate observati
A quantile plot is a simple and effective way to data distribution.
First, it displays all of the data for t user to assess both the
overall behavior and unusua quantile information. The
mechanism used in this st percentile computation discussed in
Section 2.2.2. Le sorted in increasing order so that x1 is the
smallest o Each observation, xi, is paired with a percentage, fi,
wh 100 fi% of the data are below or equal to the value, xi.
Figure 2.4 A histogram for the data set of Table 2.1.
Table 2.1 A set of unit price data for items sold at a bran
40
43
47
..
74
75
78
..
115
117
120
there may not be a value with exactly a fracti Note that the
0.25 quantile corresponds to qua and the 0.75 quantile is
Q3 .
Let
$$f_i = \frac{i - 0.5}{N}$$
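The pairing of each sorted observation with its fraction can be sketched as follows, assuming the convention that the i-th of N sorted values is paired with (i - 0.5)/N. The function name is hypothetical; the sample uses a few unit prices listed in Table 2.1.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N, i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# A few unit prices from Table 2.1, in arbitrary order.
points = quantile_plot_points([78, 40, 120, 74])
```

Plotting the second element of each pair against the first gives the quantile plot.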
compare their Q1, median, Q3, and other fi values at a g plot for
the unit price data of Table 2.1.
Figure 2.7 A scatter plot for the data set of Table 2.1. (Axes: unit price, horizontal; items sold, vertical.)
3
A statistical test for correlation is given in Section 2.4.1
Figure 2.9 Three cases where there is no
observed correlation betwee of the data sets.
Bin 1: 4, 4, 15
4
Data integration and the removal of redundant
data t described in Section 2.4.1.
or clustering to identify outliers. They may also
use the were described in Section 2.2.
It is likely that your data analysis task will invol from multiple
sources into a coherent data stor may include multiple
databases, data cubes, or There are a number of issues to
consider dur and object matching can be tricky. How can
equ
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where $o_{ij}$ is the observed frequency (i.e., actual count) and $e_{ij}$ is
the expected frequency of $(A_i, B_j)$, which can be comp
$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \cdots = 284.44 + 121.90 + 71.11 + 30\ldots$$
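The chi-square computation can be sketched in a few lines. The 2 x 2 table below is an assumption: only the observed counts 250 and 50 and the expected counts 90 and 210 are visible in the text, and the remaining two cells (200 and 1000) were chosen so that those expected counts are reproduced. Expected counts are taken as row total times column total divided by the grand total, the usual independence assumption; the function name is hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if the two attributes were independent.
            e_ij = row_totals[i] * col_totals[j] / n
            stat += (o_ij - e_ij) ** 2 / e_ij
    return stat


# Assumed table: 250 and 50 come from the text; 200 and 1000 are
# illustrative values consistent with expected counts of 90 and 210.
observed = [[250, 200],
            [50, 1000]]
chi2 = chi_square(observed)
```

With these counts the first two terms come out as 284.44 and 121.90, matching the figures shown above.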
$$v' = \frac{v - \min_A}{\max_A - \min_A}\left(\mathit{new\_max}_A - \mathit{new\_min}_A\right) + \mathit{new\_min}_A$$
formed to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$.
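The min-max mapping can be checked with a short sketch that uses the numbers from the example (minimum 12,000, maximum 98,000, value 73,600, target range 0 to 1.0). The function and parameter names are hypothetical.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("max_a and min_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


scaled = min_max_normalize(73_600, 12_000, 98_000)
```

Rounded to three decimals, `scaled` is 0.716. A value outside the original range would map outside the target range, so new data may need clipping or a refreshed minimum and maximum.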