Professional Documents
Culture Documents
Dicrete 'ttribute
( Fa on%$ a !inite or countab%$ in!inite et o! /a%ue
( )*am#%e+ 8i# code, count, or the et o! .ord in a co%%ection
o! document
( 0!ten re#reented a integer /ariab%e,
( Note+ binar$ attribute are a #ecia% cae o! dicrete attribute
"ontinuou 'ttribute
( Fa rea% number a attribute /a%ue
( )*am#%e+ tem#erature, height, or .eight,
( Gractica%%$, rea% /a%ue can on%$ be meaured and re#reented
uing a !inite number o! digit,
( "ontinuou attribute are t$#ica%%$ re#reented a !%oating:#oint
/ariab%e,
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 10
Types of data sets
Record
( "ata Matrix
( "ocument "ata
( Transaction "ata
#rah
( $orld $ide $eb
( Molecular Structures
Ordered
( Satial "ata
( Temoral "ata
( Se%uential "ata
( #enetic Se%uence "ata
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 11
mportant Characteristics of !tructured
Data
(
"imensionality
Curse of "imensionality
(
Sarsity
SeMuence o! tranaction
An element of
the se%uence
Items)*+ents
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15
%rdered Data
S#atio:Tem#ora% Data
A+era,e Monthly
Temerature of
land and ocean
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21
Data &uality
)*am#%e+
(
Same #eron .ith mu%ti#%e emai% addree
Data c%eaning
(
Groce o! dea%ing .ith du#%icate data iue
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26
Data Preprocessing
'ggregation
Sam#%ing
Dimeniona%it$ ?eduction
>eature creation
'ttribute Tran!ormation
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24
Aggregation
Gur#oe
(
Data reduction
Strati!ied am#%ing
( S#%it the data into e/era% #artitionT then dra. random am#%e
!rom each #artition
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 32
!ample !i)e
/000 oints 1000 &oints 200 &oints
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 33
!ample !i)e
Nhen dimeniona%it$
increae, data become
increaing%$ #are in the
#ace that it occu#ie
Gur#oe+
(
'/oid cure o! dimeniona%it$
(
?educe amount o! time and memor$ reMuired b$ data
mining a%gorithm
(
'%%o. data to be more eai%$ /iua%i8ed
(
Ma$ he%# to e%iminate irre%e/ant !eature or reduce
noie
TechniMue
(
Grinci#%e "om#onent 'na%$i
(
Singu%ar Sa%ue Decom#oition
(
0ther+ u#er/ied and non:%inear techniMue
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 36
Dimensionality "eduction: PCA
?edundant !eature
(
du#%icate much or a%% o! the in!ormation contained in
one or more other attribute
(
)*am#%e+ #urchae #rice o! a #roduct and the amount
o! a%e ta* #aid
Irre%e/ant !eature
(
contain no in!ormation that i ue!u% !or the data
mining ta- at hand
(
)*am#%e+ tudentI ID i o!ten irre%e/ant to the ta- o!
#redicting tudentI JG'
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 41
*eature !ubset !election
TechniMue+
(
7rute:!orce a##roch+
domain:#eci!ic
(
Ma##ing Data to Ne. S#ace
(
>eature "ontruction
combining !eature
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 43
Mapping Data to a 'e+ !pace
Two Sine $a+es
Two Sine $a+es - Noise 3re%uency
/ourier transform
+avelet transform
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 44
Discreti)ation ,sing Class Labels
Simi%arit$
(
Numerica% meaure o! ho. a%i-e t.o data ob&ect are,
(
I higher .hen ob&ect are more a%i-e,
(
0!ten !a%% in the range X0,1Y
Diimi%arit$
(
Numerica% meaure o! ho. di!!erent are t.o data
ob&ect
(
Lo.er .hen ob&ect are more a%i-e
(
Minimum diimi%arit$ i o!ten 0
(
V##er %imit /arie
=
=
n
k
k k
q p dist
9
&
! (
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20
.uclidean Distance
0
1
2
3
0 1 2 3 4 2 6
#1
#2
#3 #4
point x y
p1 ; &
p! & ;
p" : 9
p# < 9
"istance Matrix
p1 p! p" p#
p1 ; &.@&@ :.9A& <.;BB
p! &.@&@ ; 9.C9C :.9A&
p" :.9A& 9.C9C ; &
p# <.;BB :.9A& & ;
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21
Min/o+s/i Distance
Min-o.-i Ditance i a genera%i8ation o! )uc%idean Ditance
Nhere r i a #arameter, n i the number o! dimenion 9attribute; and pk and qk are, re#ecti/e%$, the -th attribute 9com#onent; or data ob&ect p and q,
r
n
k
r
k k
q p dist
9
9
! D D (
=
=
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22
Min/o+s/i Distance: .#amples
r @ 2, )uc%idean ditance
r , Qu#remumR 9L
ma*
norm, L
norm; ditance,
( Thi i the ma*imum di!!erence bet.een an$ com#onent o! the /ector
p1 p! p" p#
p1 ; & : <
p! & ; 9 :
p" : 9 ; &
p# < : & ;
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24
Mahalanobis Distance
T
q p q p q p s mahalanobi ! ( ! ( ! , (
9
=
3or red oints7 the *uclidean distance is 89!:7 Mahalanobis distance is ;!
is the co+ariance matrix of
the inut data X
=
n
i
k
ik
j
ij k j
X X X X
n
9
,
! !( (
9
9
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22
Mahalanobis Distance
Co+ariance Matrix<
=
: . ; & . ;
& . ; : . ;
=
A
C
A< >0!27 0!2?
=< >07 8?
C< >8!27 8!2?
Mahal>A7=? @ 2
Mahal>A7C? @ 9
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26
Common Properties of a Distance
I! d
1
and d
2
are t.o document /ector, then
co9 d
1
, d
2
; @ 9d
1
d
2
; / WWd
1
WW WWd
2
WW ,
.here indicate /ector dot #roduct and WW d WW i the %ength o! /ector d,
)*am#%e+
d
1
@ 4 1 0 2 0 0 0 1 0 0
d
2
@ 8 0 0 0 0 0 0 8 0 1
d
1
d
2
@ 3D1 C 2D0 C 0D0 C 2D0 C 0D0 C 0D0 C 0D0 C 2D1 C 0D0 C 0D2 @ 2
WWd
1
WW @ 93D3C2D2C0D0C2D2C0D0C0D0C0D0C2D2C0D0C0D0;
0!2
@ 942;
0!2
@ 6,481
WWd
2
WW @ 91D1C0D0C0D0C0D0C0D0C0D0C0D0C1D1C0D0C2D2;
0!2
@ 96;
0!2
@ 2,242
co9 d
1
, d
2
; @ ,3120
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 61
.#tended 2accard Coe3cient
4Tanimoto5
! ( 4 !! ( ( q std q mean q q
k k
=
q p q p n correlatio
= ! , (
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 63
Visually .1aluating Correlation
Scatter lots
showin, the
similarity from
A8 to 8!
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 64
$eneral Approach for Combining
!imilarities
)*am#%e+
(
)uc%idean denit$