Professional Documents
Culture Documents
— Chapter 1 —
2
Why Data Mining?
3
Chapter 1. Introduction
◼ Why Data Mining?
4
What Is Data Mining?
5
Knowledge Discovery (KDD) Process
◼ This is a view from typical
database systems and data
Pattern Evaluation
warehousing communities
◼ Data mining plays an essential
role in the knowledge discovery
process Data Mining
Task-relevant Data
Data Cleaning
Data Integration
Databases
6
Example: A Web Mining Framework
7
Data Mining in Business Intelligence
Increasing potential
to support
business decisions End User
Decision
Making
Data Exploration
Statistical Summary, Querying, and Reporting
9
Example: Medical Data Mining
10
Chapter 1. Introduction
◼ Why Data Mining?
11
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented, heterogeneous,
analysis, etc.
◼ Descriptive vs. predictive data mining
◼ Techniques utilized
◼ Data warehouse (OLAP), machine learning, statistics, pattern
12
Chapter 1. Introduction
◼ Why Data Mining?
13
Data Mining: On What Kinds of Data?
◼ Database-oriented data sets and applications
◼ Relational database, data warehouse, transactional database
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structure data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia database
◼ Text databases
◼ The World-Wide Web
14
Chapter 1. Introduction
◼ Why Data Mining?
15
Data Mining Function: (1) Generalization
16
Data Mining Function: (2) Association and
Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Diaper → Beer [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering,
and other applications?
17
Data Mining Function: (3) Classification
18
Data Mining Function: (4) Cluster Analysis
19
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general
behavior of the data
◼ Noise or exception? ― One person’s garbage could be another
person’s treasure
◼ Methods: by product of clustering or regression analysis, …etc.
◼ Useful in fraud detection, rare events analysis
20
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
◼ Sequence, trend and evolution analysis
◼ Trend, time-series, and deviation analysis: e.g.,
memory cards
◼ Periodicity analysis
◼ Similarity-based analysis
21
Evaluation of Knowledge
◼ Are all mined knowledge interesting?
◼ One can mine tremendous amount of “patterns” and knowledge
◼ Some may fit only certain dimension space (time, location, …)
◼ Some may not be representative, may be transient, …
◼ Evaluation of mined knowledge → directly mine only
interesting knowledge?
◼ Descriptive vs. predictive
◼ Coverage
◼ Typicality vs. novelty
◼ Accuracy
◼ Timeliness
◼ …
22
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
23
Data Mining: Confluence of Multiple Disciplines
24
Why Confluence of Multiple Disciplines?
26
Concepts and
Techniques
— Chapter 2 —
1
Chapter 2: Getting to Know Your
Data
Data Visualization
Summary
2
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix,
timeout
season
coach
game
score
team
ball
lost
pla
crosstabs
wi
n
y
Document data: text documents: term-
frequency vector
Document 1 3 0 5 0 2 6 0 2 0 2
Transaction data
Graph and network Document 2 0 7 0 2 1 0 0 3 0 0
World Wide Web
Document 3 0 1 0 0 1 2 2 0 3 0
Social or information networks
Molecular Structures
Ordered
TID Items
Video data: sequence of images
Temporal data: time-series
1 Bread, Coke, Milk
Sequential Data: transaction sequences 2 Beer, Bread
Genetic sequence data 3 Beer, Coke, Diaper, Milk
Spatial, image and multimedia: 4 Beer, Bread, Diaper, Milk
Spatial data: maps
5 Coke, Diaper, Milk
Image data:
Video data:
3
Attributes
Attribute (or dimensions, features, variables):
a data field, representing a characteristic or
feature of a data object.
E.g., customer _ID, name, address
Types:
Nominal
Binary
Numeric: quantitative
Interval-scaled: both the order and the exact
differences between the values are known.
Ratio-scaled: e.g height, money, age, weight etc.
4
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red,
white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g.,
HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude
between successive values is not known.
Size = {small, medium, large}, grades, army rankings
5
Discrete vs. Continuous
Attributes
6
Chapter 2: Getting to Know Your
Data
Data Visualization
Summary
7
Graphic Displays of Basic Statistical
Descriptions
8
Scatterplot Matrices
9
Little problem for you
10
Little problem for you
11
Chapter 2: Getting to Know Your
Data
Data Visualization
Summary
12
Similarity and Dissimilarity
13
Data Matrix and Dissimilarity
Matrix
Data matrix
n data points with x11 ... x1f ... x1p
p dimensions ... ... ... ... ...
Two modes x ... xif ... xip
i1
... ... ... ... ...
x ... xnf ... xnp
n1
Dissimilarity matrix
n data points, but
0
d(2,1) 0
registers only the
d(3,1) d ( 3,2) 0
distance
A triangular matrix : : :
d ( n,1) d ( n,2) ... ... 0
Single mode
14
Proximity Measure for Nominal
Attributes
Can take 2 or more states, e.g., red, yellow, blue,
green (generalization of a binary attribute)
Method 1: Simple matching
m: # of matches, p: total # of variables
d (i, j) p p m
Method 2: Use a large number of binary
attributes
creating a new binary attribute for each of the
M nominal states
15
Proximity Measure for Binary
Attributes
Object j
A contingency table for binary data
Object i
Distance measure for symmetric
binary variables:
Distance measure for asymmetric
binary variables:
Jaccard coefficient (similarity
measure for asymmetric binary
variables):
zif s
f
18
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
x2 x4
point attribute1 attribute2
4 x1 1 2
x2 3 5
x3 2 0
x4 4 5
2 x1
Dissimilarity Matrix
(with Euclidean Distance)
x3
0 4 x1 x2 x3 x4
2
x1 0
x2 3.61 0
x3 5.1 5.1 0
x4 4.24 1 5.39 0
19
Distance on Numeric Data: Minkowski
Distance
Minkowski distance: A popular distance measure
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
20
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that are
different between two binary vectors
d (i, j) | x x | | x x | ... | x x |
i1 j1 i2 j 2 ip jp
21
Example: Minkowski Distance
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan
x1 1 2 (L1)L x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Supremum
L x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
22
Ordinal Variables
23
Ordinal Variables
There are three states for test-2: fair, good, and
excellent, that is, Mf D 3. For step 1, if we replace each value for
test-2 by its rank, the
four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step
2 normalizes the
ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
For step 3, we can
use, say, the Euclidean distance (Eq. 2.16), which results in the
following dissimilarity
matrix:
24
Little problem for you
25
Cosine Similarity
26
Example: Cosine Similarity
cos(d1, d2) = (d1 d2) /||d1|| ||d2|| ,
where indicates vector dot product, ||d|: the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0) 0.5=(42)0.5
= 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1) 0.5=(17)0.5
= 4.12
cos(d1, d2 ) = 0.94
27
28
29
Chapter 2: Getting to Know Your
Data
Data Visualization
Summary
30
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled,
ratio-scaled
Many types of data sets, e.g., numerical, text, graph, Web,
image.
Gain insight into the data by:
Basic statistical data description: central tendency,
dispersion, graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
Above steps are the beginning of data preprocessing.
Many methods have been developed but still an active area of
research.
31