Professional Documents
Culture Documents
Tan,Steinbach, Kumar
4/18/2004
What is Data?
O
An attribute is a property or
characteristic of an object
Attributes
A collection of attributes
describe an object
Object is also known as
record, point, case, sample,
entity, or instance
Tan,Steinbach, Kumar
Taxable
Income Cheat
Yes
Single
125K
No
No
Married
100K
No
No
Single
70K
No
Yes
Married
120K
No
No
Divorced 95K
Yes
No
Married
No
Yes
Divorced 220K
No
No
Single
85K
Yes
No
Married
75K
No
10
No
Single
90K
Yes
60K
10
4/18/2004
Attribute Values
O
4/18/2004
Measurement of Length
O
B
7
2
C
D
10
E
15
Tan,Steinbach, Kumar
4/18/2004
Types of Attributes
O
Ordinal
Interval
Ratio
Tan,Steinbach, Kumar
4/18/2004
Distinctness:
Order:
Addition:
Multiplication:
Tan,Steinbach, Kumar
4/18/2004
Attribute
Type
Description
Examples
Nominal
mode, entropy,
contingency
correlation, 2 test
Ordinal
hardness of minerals,
{good, better, best},
grades, street numbers
median, percentiles,
rank correlation,
run tests, sign tests
Interval
calendar dates,
temperature in Celsius
or Fahrenheit
mean, standard
deviation, Pearson's
correlation, t and F
tests
temperature in Kelvin,
monetary quantities,
counts, age, mass,
length, electrical
current
geometric mean,
harmonic mean,
percent variation
Ratio
Operations
Attribute
Level
Transformation
Nominal
Ordinal
Interval
new_value =a * old_value + b
where a and b are constants
An attribute encompassing
the notion of good, better
best can be represented
equally well by the values
{1, 2, 3} or by { 0.5, 1,
10}.
Thus, the Fahrenheit and
Celsius temperature scales
differ in terms of where
their zero value is and the
size of a unit (degree).
Ratio
new_value = a * old_value
Comments
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Tan,Steinbach, Kumar
4/18/2004
4/18/2004
10
Record
Data Matrix
Document Data
Transaction Data
Graph
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Tan,Steinbach, Kumar
Curse of Dimensionality
Sparsity
Resolution
Tan,Steinbach, Kumar
4/18/2004
11
Record Data
O
Taxable
Income Cheat
Yes
Single
125K
No
No
Married
100K
No
No
Single
70K
No
Yes
Married
120K
No
No
Divorced 95K
Yes
No
Married
No
Yes
Divorced 220K
No
No
Single
85K
Yes
No
Married
75K
No
10
No
Single
90K
Yes
60K
10
Tan,Steinbach, Kumar
4/18/2004
12
Data Matrix
O
Projection
of y load
Distance
Load
Thickness
10.23
5.27
15.22
2.7
1.2
12.65
6.25
16.22
2.2
1.1
Tan,Steinbach, Kumar
4/18/2004
13
Document Data
O
team
coach
pla
y
ball
score
game
wi
n
lost
timeout
season
Document 1
Document 2
Document 3
Tan,Steinbach, Kumar
4/18/2004
14
Transaction Data
O
Items
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Tan,Steinbach, Kumar
4/18/2004
15
Graph Data
O
2
1
5
2
5
Tan,Steinbach, Kumar
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
4/18/2004
16
Chemical Data
O
Tan,Steinbach, Kumar
4/18/2004
17
4/18/2004
18
Ordered Data
O
Sequences of transactions
Items/Events
An element of
the sequence
Tan,Steinbach, Kumar
Ordered Data
O
Tan,Steinbach, Kumar
4/18/2004
19
4/18/2004
20
Ordered Data
O
Spatio-Temporal Data
Average Monthly
Temperature of
land and ocean
Tan,Steinbach, Kumar
Data Quality
What kinds of data quality problems?
O How can we detect problems with the data?
O What can we do about these problems?
O
Tan,Steinbach, Kumar
4/18/2004
21
Noise
O
22
Outliers
O
Tan,Steinbach, Kumar
4/18/2004
23
Missing Values
O
Tan,Steinbach, Kumar
4/18/2004
24
Duplicate Data
O
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
Tan,Steinbach, Kumar
4/18/2004
25
4/18/2004
26
Data Preprocessing
Aggregation
O Sampling
O Dimensionality Reduction
O Feature subset selection
O Feature creation
O Discretization and Binarization
O Attribute Transformation
O
Tan,Steinbach, Kumar
Aggregation
O
Purpose
Data reduction
Change of scale
Tan,Steinbach, Kumar
4/18/2004
27
Aggregation
Variation of Precipitation in Australia
28
Sampling
O
Tan,Steinbach, Kumar
4/18/2004
29
Sampling
O
Tan,Steinbach, Kumar
4/18/2004
30
Types of Sampling
O
Stratified sampling
Split the data into several partitions; then draw random samples
from each partition
Tan,Steinbach, Kumar
4/18/2004
31
Sample Size
8000 points
Tan,Steinbach, Kumar
2000 Points
500 Points
4/18/2004
32
Sample Size
O What
Tan,Steinbach, Kumar
4/18/2004
33
Curse of Dimensionality
O
When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
Tan,Steinbach, Kumar
4/18/2004
34
Dimensionality Reduction
O
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data
mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce
noise
Techniques
Principle Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
Tan,Steinbach, Kumar
4/18/2004
35
x1
Tan,Steinbach, Kumar
4/18/2004
36
x2
e
x1
Tan,Steinbach, Kumar
4/18/2004
37
Tan,Steinbach, Kumar
4/18/2004
38
Tan,Steinbach, Kumar
4/18/2004
39
Redundant features
duplicate much or all of the information contained in
one or more other attributes
Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
contain no information that is useful for the data
mining task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Tan,Steinbach, Kumar
4/18/2004
40
Techniques:
Brute-force approch:
Try
Embedded approaches:
Feature selection occurs naturally as part of the data mining
algorithm
Filter approaches:
Wrapper approaches:
Use the data mining algorithm as a black box to find best subset
of attributes
Tan,Steinbach, Kumar
4/18/2004
41
Feature Creation
O
domain-specific
combining features
Tan,Steinbach, Kumar
4/18/2004
42
transform
O Wavelet
transform
Tan,Steinbach, Kumar
Frequency
4/18/2004
43
based approach
Tan,Steinbach, Kumar
4/18/2004
44
Data
Equal frequency
Tan,Steinbach, Kumar
K-means
Introduction to Data Mining
4/18/2004
45
Attribute Transformation
O
Tan,Steinbach, Kumar
4/18/2004
46
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different are two data
objects
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Tan,Steinbach, Kumar
4/18/2004
47
Tan,Steinbach, Kumar
4/18/2004
48
Euclidean Distance
O
Euclidean Distance
dist =
( pk
k =1
qk )2
Tan,Steinbach, Kumar
4/18/2004
49
Euclidean Distance
3
point
p1
p2
p3
p4
p1
p3
p4
1
p2
0
0
y
2
0
1
1
p1
p1
p2
p3
p4
x
0
2
3
5
0
2.828
3.162
5.099
p2
2.828
0
1.414
3.162
p3
3.162
1.414
0
2
p4
5.099
3.162
2
0
Distance Matrix
Tan,Steinbach, Kumar
4/18/2004
50
Minkowski Distance
O
dist = ( | pk qk
k =1
1
|r ) r
Tan,Steinbach, Kumar
4/18/2004
51
r = 2. Euclidean distance
Tan,Steinbach, Kumar
4/18/2004
52
Minkowski Distance
point
p1
p2
p3
p4
x
0
2
3
5
y
2
0
1
1
L1
p1
p2
p3
p4
p1
0
4
4
6
p2
4
0
2
4
p3
4
2
0
2
p4
6
4
2
0
L2
p1
p2
p3
p4
p1
p2
2.828
0
1.414
3.162
p3
3.162
1.414
0
2
p4
5.099
3.162
2
0
L
p1
p2
p3
p4
p1
p2
p3
p4
0
2.828
3.162
5.099
0
2
3
5
2
0
1
3
3
1
0
2
5
3
2
0
Distance Matrix
Tan,Steinbach, Kumar
4/18/2004
53
Mahalanobis Distance
mahalanobi s ( p , q ) = ( p q ) 1 ( p q )T
is the covariance matrix of
the input data X
j ,k =
1 n
( X ij X j )( X ik X k )
n 1 i =1
4/18/2004
54
Mahalanobis Distance
Covariance Matrix:
0.3 0.2
=
0.2 0.3
A: (0.5, 0.5)
B: (0, 1)
A
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
Tan,Steinbach, Kumar
4/18/2004
55
2.
3.
Tan,Steinbach, Kumar
4/18/2004
56
2.
Tan,Steinbach, Kumar
4/18/2004
57
Tan,Steinbach, Kumar
4/18/2004
58
Tan,Steinbach, Kumar
4/18/2004
59
Cosine Similarity
O
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.245
4/18/2004
60
Tan,Steinbach, Kumar
4/18/2004
61
Correlation
Correlation measures the linear relationship
between objects
O To compute correlation, we standardize data
objects, p and q, and then take their dot product
O
4/18/2004
62
Scatter plots
showing the
similarity from
1 to 1.
Tan,Steinbach, Kumar
4/18/2004
63
Tan,Steinbach, Kumar
4/18/2004
64
Tan,Steinbach, Kumar
4/18/2004
65
Density
O
Examples:
Euclidean density
Probability density
Graph-based density
Tan,Steinbach, Kumar
4/18/2004
66
Tan,Steinbach, Kumar
4/18/2004
67
Tan,Steinbach, Kumar
4/18/2004
68