You are on page 1of 68

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Data Mining: Data


Lecture Note !or "ha#ter 2
Introduction to Data Mining
b$
Tan, Steinbach, Kumar
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2
What is Data?

"o%%ection o! data ob&ect and


their attribute

'n attribute i a #ro#ert$ or


characteritic o! an ob&ect
( )*am#%e+ e$e co%or o! a
#eron, tem#erature, etc,
( 'ttribute i a%o -no.n a
/ariab%e, !ie%d, characteritic,
or !eature

' co%%ection o! attribute


decribe an ob&ect
( 0b&ect i a%o -no.n a
record, #oint, cae, am#%e,
entit$, or intance
Tid Refund Marital
Status
Taxable
Income
Cheat
1 1e Sing%e 122K No
2 No Married 100K No
3 No Sing%e 40K No
4 1e Married 120K No
2 No Di/orced 52K Yes
6 No Married 60K No
4 1e Di/orced 220K No
8 No Sing%e 82K Yes
5 No Married 42K No
10 No Sing%e 50K Yes
10

Attributes
Objects
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3
Attribute Values

'ttribute /a%ue are number or $mbo% aigned


to an attribute

Ditinction bet.een attribute and attribute /a%ue


(
Same attribute can be ma##ed to di!!erent attribute
/a%ue

)*am#%e+ height can be meaured in !eet or meter


(
Di!!erent attribute can be ma##ed to the ame et o!
/a%ue

)*am#%e+ 'ttribute /a%ue !or ID and age are integer

7ut #ro#ertie o! attribute /a%ue can be di!!erent

ID has no limit but age has a maximum and minimum value


Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4
Measurement of Length

The way you measure an attribute is somewhat may not match


the attributes roerties!
1
2
3
2
2
4
8
12
10
4
'
7
"
D
)
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2
Types of Attributes

There are di!!erent t$#e o! attribute


(
Nomina%

)*am#%e+ ID number, e$e co%or, 8i# code


(
0rdina%

)*am#%e+ ran-ing 9e,g,, tate o! #otato chi# on a ca%e


!rom 1:10;, grade, height in <ta%%, medium, hort=
(
Inter/a%

)*am#%e+ ca%endar date, tem#erature in "e%iu or


>ahrenheit,
(
?atio

)*am#%e+ tem#erature in Ke%/in, %ength, time, count


Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6
Properties of Attribute Values

The t$#e o! an attribute de#end on .hich o! the


!o%%o.ing #ro#ertie it #oee+
(
Ditinctne+ @
(
0rder+ A B
(
'ddition+ C :
(
Mu%ti#%ication+ D /
(
Nomina% attribute+ ditinctne
(
0rdina% attribute+ ditinctne E order
(
Inter/a% attribute+ ditinctne, order E addition
(
?atio attribute+ a%% 4 #ro#ertie
Attribute
Type
Description Examples Operations
Nominal The values of a nominal attribute
are just different names, i.e.,
nominal attributes provide only
enough information to distinguish
one object from another. (, !
"ip codes, employee
ID numbers, eye color,
sex# $male, female%
mode, entropy,
contingency
correlation,
&
test
'rdinal The values of an ordinal attribute
provide enough information to order
objects. ((, )!
hardness of minerals,
$good, better, best%,
grades, street numbers
median, percentiles,
ran* correlation,
run tests, sign tests
Interval +or interval attributes, the
differences bet,een values are
meaningful, i.e., a unit of
measurement exists.
(-, . !
calendar dates,
temperature in /elsius
or +ahrenheit
mean, standard
deviation, 0earson1s
correlation, t and F
tests
2atio +or ratio variables, both differences
and ratios are meaningful. (3, 4!
temperature in 5elvin,
monetary 6uantities,
counts, age, mass,
length, electrical
current
geometric mean,
harmonic mean,
percent variation
Attribute
Level
Transformation Comments
Nominal 7ny permutation of values If all employee ID numbers
,ere reassigned, ,ould it
ma*e any difference8
'rdinal 7n order preserving change of
values, i.e.,
new_value = f(old_value)
,here f is a monotonic function.
7n attribute encompassing
the notion of good, better
best can be represented
e6ually ,ell by the values
$9, &, :% or by $ ;.<, 9,
9;%.
Interval new_value =a * old_value + b
,here a and b are constants
Thus, the +ahrenheit and
/elsius temperature scales
differ in terms of ,here
their "ero value is and the
si"e of a unit (degree!.
2atio new_value = a * old_value =ength can be measured in
meters or feet.
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5
Discrete and Continuous
Attributes

Dicrete 'ttribute
( Fa on%$ a !inite or countab%$ in!inite et o! /a%ue
( )*am#%e+ 8i# code, count, or the et o! .ord in a co%%ection
o! document
( 0!ten re#reented a integer /ariab%e,
( Note+ binar$ attribute are a #ecia% cae o! dicrete attribute

"ontinuou 'ttribute
( Fa rea% number a attribute /a%ue
( )*am#%e+ tem#erature, height, or .eight,
( Gractica%%$, rea% /a%ue can on%$ be meaured and re#reented
uing a !inite number o! digit,
( "ontinuou attribute are t$#ica%%$ re#reented a !%oating:#oint
/ariab%e,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 10
Types of data sets

Record
( "ata Matrix
( "ocument "ata
( Transaction "ata

#rah
( $orld $ide $eb
( Molecular Structures

Ordered
( Satial "ata
( Temoral "ata
( Se%uential "ata
( #enetic Se%uence "ata
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 11
mportant Characteristics of !tructured
Data
(
"imensionality

Curse of "imensionality
(
Sarsity

Only resence counts


(
Resolution

&atterns deend on the scale


Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 12
"ecord Data

Data that conit o! a co%%ection o! record, each


o! .hich conit o! a !i*ed et o! attribute
Tid Refund Marital
Status
Taxable
Income
Cheat
1 1e Sing%e 122K No
2 No Married 100K No
3 No Sing%e 40K No
4 1e Married 120K No
2 No Di/orced 52K Yes
6 No Married 60K No
4 1e Di/orced 220K No
8 No Sing%e 82K Yes
5 No Married 42K No
10 No Sing%e 50K Yes
10

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 13
Data Matri#

I! data ob&ect ha/e the ame !i*ed et o! numeric


attribute, then the data ob&ect can be thought o! a
#oint in a mu%ti:dimeniona% #ace, .here each
dimenion re#reent a ditinct attribute

Such data et can be re#reented b$ an m b$ n matri*,


.here there are m ro., one !or each ob&ect, and n
co%umn, one !or each attribute
1,1 2,2 16,22 6,22 12,62
1,2 2,4 12,22 2,24 10,23
Thic'ness (oad "istance &rojection
of y load
&rojection
of x (oad
1,1 2,2 16,22 6,22 12,62
1,2 2,4 12,22 2,24 10,23
Thic'ness (oad "istance &rojection
of y load
&rojection
of x (oad
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14
Document Data

)ach document become a HtermI /ector,


(
each term i a com#onent 9attribute; o! the /ector,
(
the /a%ue o! each com#onent i the number o! time
the corre#onding term occur in the document,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 12
Transaction Data

' #ecia% t$#e o! record data, .here


(
each record 9tranaction; in/o%/e a et o! item,
(
>or e*am#%e, conider a grocer$ tore, The et o!
#roduct #urchaed b$ a cutomer during one
ho##ing tri# contitute a tranaction, .hi%e the
indi/idua% #roduct that .ere #urchaed are the item,
TID Items
1 Bread, Coke, ilk
! Beer, Bread
" Beer, Coke, Diaper, ilk
# Beer, Bread, Diaper, ilk
$ Coke, Diaper, ilk

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16
$raph Data

)*am#%e+ Jeneric gra#h and FTML Lin-


2
2
1
2
2
Aa hre!@K#a#er/#a#er,htm%LbbbbKB
Data Mining A/aB
A%iB
Aa hre!@K#a#er/#a#er,htm%LaaaaKB
Jra#h Gartitioning A/aB
A%iB
Aa hre!@K#a#er/#a#er,htm%LaaaaKB
Gara%%e% So%ution o! S#are Linear S$tem o! )Muation A/aB
A%iB
Aa hre!@K#a#er/#a#er,htm%L!!!!KB
N:7od$ "om#utation and Dene Linear S$tem So%/er
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14
Chemical Data

7en8ene Mo%ecu%e+ "


6
F
6
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 18
%rdered Data

SeMuence o! tranaction
An element of
the se%uence
Items)*+ents
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15
%rdered Data

Jenomic eMuence data


GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20
%rdered Data

S#atio:Tem#ora% Data
A+era,e Monthly
Temerature of
land and ocean
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21
Data &uality

Nhat -ind o! data Mua%it$ #rob%emO

Fo. can .e detect #rob%em .ith the dataO

Nhat can .e do about thee #rob%emO

)*am#%e o! data Mua%it$ #rob%em+


(
Noie and out%ier
(
miing /a%ue
(
du#%icate data
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22
'oise

Noie re!er to modi!ication o! origina% /a%ue


(
)*am#%e+ ditortion o! a #eronP /oice .hen ta%-ing
on a #oor #hone and Qno.R on te%e/iion creen
Two Sine $a+es Two Sine $a+es - Noise
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 23
%utliers

0ut%ier are data ob&ect .ith characteritic that


are coniderab%$ di!!erent than mot o! the other
data ob&ect in the data et
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24
Missing Values

?eaon !or miing /a%ue


(
In!ormation i not co%%ected
9e,g,, #eo#%e dec%ine to gi/e their age and .eight;
(
'ttribute ma$ not be a##%icab%e to a%% cae
9e,g,, annua% income i not a##%icab%e to chi%dren;

Fand%ing miing /a%ue


(
)%iminate Data 0b&ect
(
)timate Miing Sa%ue
(
Ignore the Miing Sa%ue During 'na%$i
(
?e#%ace .ith a%% #oib%e /a%ue 9.eighted b$ their
#robabi%itie;
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22
Duplicate Data

Data et ma$ inc%ude data ob&ect that are


du#%icate, or a%mot du#%icate o! one another
(
Ma&or iue .hen merging data !rom heterogeou
ource

)*am#%e+
(
Same #eron .ith mu%ti#%e emai% addree

Data c%eaning
(
Groce o! dea%ing .ith du#%icate data iue
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26
Data Preprocessing

'ggregation

Sam#%ing

Dimeniona%it$ ?eduction

>eature ubet e%ection

>eature creation

Dicreti8ation and 7inari8ation

'ttribute Tran!ormation
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24
Aggregation

"ombining t.o or more attribute 9or ob&ect; into


a ing%e attribute 9or ob&ect;

Gur#oe
(
Data reduction

?educe the number o! attribute or ob&ect


(
"hange o! ca%e

"itie aggregated into region, tate, countrie, etc


(
More Qtab%eR data

'ggregated data tend to ha/e %e /ariabi%it$


Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 28
Aggregation
Standard "e+iation of A+era,e
Monthly &reciitation
Standard "e+iation of A+era,e
Yearly &reciitation
.ariation of &reciitation in Australia
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 25
!ampling

%amplin& is t'e main tec'ni(ue employed for data selection)


( *t is often used for bot' t'e preliminary investi&ation of t'e data
and t'e final data analysis)

%tatisticians sample because obtainin& t'e entire set of data


of interest is too expensive or time consumin&)

%amplin& is used in data minin& because processin& t'e


entire set of data of interest is too expensive or time
consumin&)
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 30
!ampling (

The -e$ #rinci#%e !or e!!ecti/e am#%ing i the


!o%%o.ing+
(
uing a am#%e .i%% .or- a%mot a .e%% a uing the
entire data et, i! the am#%e i re#reentati/e
(
' am#%e i re#reentati/e i! it ha a##ro*imate%$ the
ame #ro#ert$ 9o! interet; a the origina% et o! data
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 31
Types of !ampling

Sim#%e ?andom Sam#%ing


( There i an eMua% #robabi%it$ o! e%ecting an$ #articu%ar item

Sam#%ing .ithout re#%acement


( ' each item i e%ected, it i remo/ed !rom the #o#u%ation

Sam#%ing .ith re#%acement


( 0b&ect are not remo/ed !rom the #o#u%ation a the$ are
e%ected !or the am#%e,

In am#%ing .ith re#%acement, the ame ob&ect can be #ic-ed u#


more than once

Strati!ied am#%ing
( S#%it the data into e/era% #artitionT then dra. random am#%e
!rom each #artition
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 32
!ample !i)e

/000 oints 1000 &oints 200 &oints
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 33
!ample !i)e

+'at sample si,e is necessary to &et at least one


ob-ect from eac' of 1. &roups)
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 34
Curse of Dimensionality

Nhen dimeniona%it$
increae, data become
increaing%$ #are in the
#ace that it occu#ie

De!inition o! denit$ and


ditance bet.een #oint,
.hich i critica% !or
c%utering and out%ier
detection, become %e
meaning!u%
U Randomly ,enerate 200 oints
U Comute difference between max and min
distance between any air of oints
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 32
Dimensionality "eduction

Gur#oe+
(
'/oid cure o! dimeniona%it$
(
?educe amount o! time and memor$ reMuired b$ data
mining a%gorithm
(
'%%o. data to be more eai%$ /iua%i8ed
(
Ma$ he%# to e%iminate irre%e/ant !eature or reduce
noie

TechniMue
(
Grinci#%e "om#onent 'na%$i
(
Singu%ar Sa%ue Decom#oition
(
0ther+ u#er/ied and non:%inear techniMue
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 36
Dimensionality "eduction: PCA

Joa% i to !ind a #ro&ection that ca#ture the


%arget amount o! /ariation in data
x
&
x
9
e
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 34
Dimensionality "eduction: PCA

>ind the eigen/ector o! the co/ariance matri*

The eigen/ector de!ine the ne. #ace


x
&
x
9
e
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 38
Dimensionality "eduction: !%MAP

"ontruct a neighbourhood gra#h

>or each #air o! #oint in the gra#h, com#ute the hortet


#ath ditance ( geodeic ditance
>y# Tenenbaum, de ?ilva,
=angford (&;;;!
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 35
Dimensions = 10 Dimensions = 40 Dimensions = 80 Dimensions = 120 Dimensions = 160 Dimensions = 206
Dimensionality "eduction: PCA
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 40
*eature !ubset !election

'nother .a$ to reduce dimeniona%it$ o! data

?edundant !eature
(
du#%icate much or a%% o! the in!ormation contained in
one or more other attribute
(
)*am#%e+ #urchae #rice o! a #roduct and the amount
o! a%e ta* #aid

Irre%e/ant !eature
(
contain no in!ormation that i ue!u% !or the data
mining ta- at hand
(
)*am#%e+ tudentI ID i o!ten irre%e/ant to the ta- o!
#redicting tudentI JG'
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 41
*eature !ubset !election

TechniMue+
(
7rute:!orce a##roch+

Tr$ a%% #oib%e !eature ubet a in#ut to data mining a%gorithm


(
)mbedded a##roache+

>eature e%ection occur natura%%$ a #art o! the data mining


a%gorithm
(
>i%ter a##roache+

>eature are e%ected be!ore data mining a%gorithm i run


(
Nra##er a##roache+

Ve the data mining a%gorithm a a b%ac- bo* to !ind bet ubet


o! attribute
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 42
*eature Creation

"reate ne. attribute that can ca#ture the


im#ortant in!ormation in a data et much more
e!!icient%$ than the origina% attribute

Three genera% methodo%ogie+


(
>eature )*traction

domain:#eci!ic
(
Ma##ing Data to Ne. S#ace
(
>eature "ontruction

combining !eature
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 43
Mapping Data to a 'e+ !pace
Two Sine $a+es
Two Sine $a+es - Noise 3re%uency

/ourier transform

+avelet transform
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 44
Discreti)ation ,sing Class Labels

Entropy based approac'


4 cate,ories for both x and y 2 cate,ories for both x and y
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 42
Discreti)ation Without ,sing Class
Labels
"ata *%ual inter+al width
*%ual fre%uency 56means
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 46
Attribute Transformation

' !unction that ma# the entire et o! /a%ue o! a


gi/en attribute to a ne. et o! re#%acement /a%ue
uch that each o%d /a%ue can be identi!ied .ith
one o! the ne. /a%ue
(
Sim#%e !unction+ *
-
, %og9*;, e
*
, W*W
(
Standardi8ation and Norma%i8ation
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 44
!imilarity and Dissimilarity

Simi%arit$
(
Numerica% meaure o! ho. a%i-e t.o data ob&ect are,
(
I higher .hen ob&ect are more a%i-e,
(
0!ten !a%% in the range X0,1Y

Diimi%arit$
(
Numerica% meaure o! ho. di!!erent are t.o data
ob&ect
(
Lo.er .hen ob&ect are more a%i-e
(
Minimum diimi%arit$ i o!ten 0
(
V##er %imit /arie

Gro*imit$ re!er to a imi%arit$ or diimi%arit$


Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 48
!imilarity-Dissimilarity for !imple
Attributes
p and q are the attribute /a%ue !or t.o data ob&ect,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 45
.uclidean Distance
)uc%idean Ditance

Nhere n i the number o! dimenion 9attribute; and pk and qk are, re#ecti/e%$, the -th attribute 9com#onent; or data ob&ect p and q,
Standardi8ation i necear$, i! ca%e di!!er,

=
=
n
k
k k
q p dist
9
&
! (
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20
.uclidean Distance
0
1
2
3
0 1 2 3 4 2 6
#1
#2
#3 #4
point x y
p1 ; &
p! & ;
p" : 9
p# < 9
"istance Matrix
p1 p! p" p#
p1 ; &.@&@ :.9A& <.;BB
p! &.@&@ ; 9.C9C :.9A&
p" :.9A& 9.C9C ; &
p# <.;BB :.9A& & ;
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21
Min/o+s/i Distance
Min-o.-i Ditance i a genera%i8ation o! )uc%idean Ditance

Nhere r i a #arameter, n i the number o! dimenion 9attribute; and pk and qk are, re#ecti/e%$, the -th attribute 9com#onent; or data ob&ect p and q,
r
n
k
r
k k
q p dist
9
9
! D D (
=
=
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22
Min/o+s/i Distance: .#amples

r @ 1, "it$ b%oc- 9Manhattan, ta*icab, L


1
norm; ditance,
( ' common e*am#%e o! thi i the Famming ditance, .hich i &ut the
number o! bit that are di!!erent bet.een t.o binar$ /ector

r @ 2, )uc%idean ditance

r , Qu#remumR 9L
ma*
norm, L

norm; ditance,
( Thi i the ma*imum di!!erence bet.een an$ com#onent o! the /ector

Do not con!ue r .ith n, i,e,, a%% thee ditance are


de!ined !or a%% number o! dimenion,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 23
Min/o+s/i Distance
"istance Matrix
point x y
p1 ; &
p! & ;
p" : 9
p# < 9
L1 p1 p! p" p#
p1 ; C C A
p! C ; & C
p" C & ; &
p# A C & ;
L! p1 p! p" p#
p1 ; &.@&@ :.9A& <.;BB
p! &.@&@ ; 9.C9C :.9A&
p" :.9A& 9.C9C ; &
p# <.;BB :.9A& & ;
L

p1 p! p" p#
p1 ; & : <
p! & ; 9 :
p" : 9 ; &
p# < : & ;
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24
Mahalanobis Distance
T
q p q p q p s mahalanobi ! ( ! ( ! , (
9
=

3or red oints7 the *uclidean distance is 89!:7 Mahalanobis distance is ;!
is the co+ariance matrix of
the inut data X

=
n
i
k
ik
j
ij k j
X X X X
n
9
,
! !( (
9
9
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22
Mahalanobis Distance
Co+ariance Matrix<

=
: . ; & . ;
& . ; : . ;
=
A
C
A< >0!27 0!2?
=< >07 8?
C< >8!27 8!2?
Mahal>A7=? @ 2
Mahal>A7C? @ 9
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26
Common Properties of a Distance

Ditance, uch a the )uc%idean ditance,


ha/e ome .e%% -no.n #ro#ertie,
1. d(p, q) 0 !or a%% p and q and d(p, q) = 0 on%$ i!
p = q, 9Goiti/e de!initene;
2. d(p, q) = d(q, p) !or a%% p and q, 9S$mmetr$;
3, d(p, r) d(p, q) + d(q, r) !or a%% #oint p, q, and r,
9Triang%e IneMua%it$;
.here d(p, q) i the ditance 9diimi%arit$; bet.een
#oint 9data ob&ect;, p and q,

' ditance that ati!ie thee #ro#ertie i a


metric
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24
Common Properties of a !imilarity

Simi%aritie, a%o ha/e ome .e%% -no.n


#ro#ertie,
1. s(p, q) = 1 9or ma*imum imi%arit$; on%$ i! p = q,
2. s(p, q) = s(q, p) !or a%% p and q, 9S$mmetr$;
.here s(p, q) i the imi%arit$ bet.een #oint 9data
ob&ect;, p and q,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 28
!imilarity 0et+een 0inary Vectors

"ommon ituation i that ob&ect, p and q, ha/e on%$


binar$ attribute

"om#ute imi%aritie uing the !o%%o.ing Muantitie


M
01
= the number of attributes where p was 0 and q was 1
M
10
= the number of attributes where p was 1 and q was 0
M
00
= the number of attributes where p was 0 and q was 0
M
11
= the number of attributes where p was 1 and q was 1

Sim#%e Matching and Zaccard "oe!!icient


SM" @ number o! matche / number o! attribute
@ 9M
11
C M
00
; / 9M
01
C M
10
C M
11
C M
00
;
Z @ number o! 11 matche / number o! not:both:8ero attribute /a%ue
@ 9M
11
; / 9M
01
C M
10
C M
11
;
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 25
!MC 1ersus 2accard: .#ample
p @ 1 0 0 0 0 0 0 0 0 0
q @ 0 0 0 0 0 0 1 0 0 1
M
01
= 2 (the number of attributes where p was 0 and q was 1)
M
10
= 1 (the number of attributes where p was 1 and q was 0)
M
00
= 7 (the number of attributes where p was 0 and q was 0)
M
11
= 0 (the number of attributes where p was 1 and q was 1)
SM" @ 9M
11
C M
00
;/9M
01
C M
10
C M
11
C M
00
; @ 90C4; / 92C1C0C4; @ 0,4
Z @ 9M
11
; / 9M
01
C M
10
C M
11
; @ 0 / 92 C 1 C 0; @ 0
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 60
Cosine !imilarity

I! d
1
and d
2
are t.o document /ector, then
co9 d
1
, d
2
; @ 9d
1
d
2
; / WWd
1
WW WWd
2
WW ,
.here indicate /ector dot #roduct and WW d WW i the %ength o! /ector d,

)*am#%e+
d
1
@ 4 1 0 2 0 0 0 1 0 0
d
2
@ 8 0 0 0 0 0 0 8 0 1
d
1
d
2
@ 3D1 C 2D0 C 0D0 C 2D0 C 0D0 C 0D0 C 0D0 C 2D1 C 0D0 C 0D2 @ 2
WWd
1
WW @ 93D3C2D2C0D0C2D2C0D0C0D0C0D0C2D2C0D0C0D0;
0!2
@ 942;
0!2
@ 6,481
WWd
2
WW @ 91D1C0D0C0D0C0D0C0D0C0D0C0D0C1D1C0D0C2D2;
0!2
@ 96;
0!2
@ 2,242
co9 d
1
, d
2
; @ ,3120
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 61
.#tended 2accard Coe3cient
4Tanimoto5

Sariation o! Zaccard !or continuou or count


attribute
(
?educe to Zaccard !or binar$ attribute
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 62
Correlation

"orre%ation meaure the %inear re%ationhi#


bet.een ob&ect

To com#ute corre%ation, .e tandardi8e data


ob&ect, # and M, and then ta-e their dot #roduct
! ( 4 !! ( ( p std p mean p p
k k
=

! ( 4 !! ( ( q std q mean q q
k k
=

q p q p n correlatio

= ! , (
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 63
Visually .1aluating Correlation
Scatter lots
showin, the
similarity from
A8 to 8!
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 64
$eneral Approach for Combining
!imilarities

Sometime attribute are o! man$ di!!erent


t$#e, but an o/era%% imi%arit$ i needed,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 62
,sing Weights to Combine
!imilarities

Ma$ not .ant to treat a%% attribute the ame,


( Ve .eight .
-
.hich are bet.een 0 and 1 and um
to 1,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 66
Density

Denit$:baed c%utering reMuire a notion o!


denit$

)*am#%e+
(
)uc%idean denit$

)uc%idean denit$ @ number o! #oint #er unit /o%ume


(
Grobabi%it$ denit$
(
Jra#h:baed denit$
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 64
.uclidean Density 6 Cell7based

Sim#%et a##roach i to di/ide region into a


number o! rectangu%ar ce%% o! eMua% /o%ume and
de!ine denit$ a L o! #oint the ce%% contain
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 68
.uclidean Density 6 Center7based

)uc%idean denit$ i the number o! #oint .ithin a


#eci!ied radiu o! the #oint

You might also like