
Some Data Clustering Methods

Hai Phong Private University

TABLE OF CONTENTS

LIST OF FIGURES
ACKNOWLEDGMENTS
CHAPTER 1: OVERVIEW OF DATA MINING
  1.1 Introduction to knowledge discovery
  1.2 Data mining and related concepts
    1.2.1 The concept of data mining
    1.2.2 Data mining methods
    1.2.3 Practical application areas
    1.2.4 Basic approaches and techniques applied in data mining
CHAPTER 2: DATA CLUSTERING AND ITS APPROACHES
  2.1 General concepts
  2.2 Data types and similarity measures
    2.2.1 Data types
    2.2.2 Similarity and dissimilarity measures
  2.3 Approaches to data clustering
    2.3.1 Partitioning clustering methods
    2.3.2 Hierarchical clustering methods
    2.3.3 Density-based clustering methods
    2.3.4 Grid-based clustering methods
    2.3.5 Model-based clustering methods
    2.3.6 Constraint-based clustering methods
  2.4 Applications of data clustering
CHAPTER 3: SOME BASIC ALGORITHMS IN DATA CLUSTERING
  3.1 Partitioning clustering algorithms
    3.1.1 The K-means algorithm
    3.1.2 The K-Medoids algorithm
  3.2 Hierarchical clustering algorithms
  3.3 The COP-Kmeans algorithm
CHAPTER 4: APPLYING THE K-MEANS ALGORITHM TO IMAGE SEGMENTATION
  4.1 Overview of image segmentation
    4.1.1 Segmentation by amplitude thresholding
    4.1.2 Segmentation by homogeneous regions
    4.1.3 Segmentation by boundaries
    4.1.4 Segmentation by surface texture
  4.2 The K-means algorithm for image segmentation
    4.2.1 Problem description
    4.2.2 Main steps of the algorithm
      4.2.2.1 Finding the Top X colors
      4.2.2.2 Computing distances and clustering
      4.2.2.3 Recomputing the cluster centroids
      4.2.2.4 Checking convergence
    4.2.3 Experimental results
      4.2.3.1 Implementation environment
      4.2.3.2 Some user interface screens
CONCLUSION
REFERENCES


LIST OF FIGURES

Figure 1.1: The knowledge discovery process
Figure 2.1: Grid data structure model
Figure 3.1: Data clusters discovered by CURE
Figure 4.1: The K-means algorithm
Figure 4.2: Finding the Top X colors
Figure 4.3: Clustering
Figure 4.4: Computing the new centroids
Figure 4.5: Checking convergence


ACKNOWLEDGMENTS

First of all, I would like to sincerely thank Mr. Ngô Trường Giang, my supervisor for this graduation project. He helped me a great deal and provided many important documents for my study of the topic "A study of some data clustering methods and their applications".

Second, I sincerely thank the lecturers of the Department of Information Technology, who guided me in my studies and training over the past four years. I also thank my fellow students in class CT1002 for their companionship during my time at the university.

Finally, I sincerely thank the Board of Rectors of Hai Phong Private University for giving me the conditions to gain knowledge. The university library is where students can collect materials that support the lectures in class, and the lecturers also passed on their life experience to the students. This knowledge and experience will help me in my work and life in the future.

Thank you very much!

Hai Phong, 2010

Student

VŨ MINH ĐÔNG


CHAPTER 1:

OVERVIEW OF DATA MINING

1.1 Introduction to knowledge discovery

If electrons and electromagnetic waves are the essence of traditional electronic technology, then data, information, and knowledge are now the focus of a new field of research and application: knowledge discovery and data mining.

We usually regard data as a sequence of bits, or numbers and symbols, or "objects" that carry some meaning when sent to a program in a given form. We use bits to measure information, and regard information as data from which redundancy has been filtered out, reduced to the minimum needed to characterize the data. We can regard knowledge as integrated information, comprising facts and the relationships among them. These relationships can be understood, discovered, or learned. In other words, knowledge can be regarded as data at a high level of abstraction and organization.

Knowledge discovery in databases is the process of identifying patterns or models in data that are valid, novel, useful, and understandable. Data mining is one step in the knowledge discovery process: it consists of specialized data mining algorithms that, within acceptable limits on computational efficiency, find patterns or models in the data. In other words, the goal of knowledge discovery and data mining is to find the patterns or models that exist in databases but are still hidden under mountains of data.


The knowledge discovery process:

Figure 1.1: The knowledge discovery process

Step 1: Understand the application domain and formulate the problem. This step determines whether useful knowledge can be extracted, and it guides the choice of data mining methods suited to the purpose of the application and the nature of the data.

Step 2: Collect and preprocess the data. Preprocessing removes noise, handles missing data, and transforms and reduces the data where necessary. This step usually takes the most time in the whole knowledge discovery process.

Step 3: Mine the data, that is, extract the patterns or models hidden in the data.

Step 4: Interpret the discovered knowledge, in particular by clarifying the descriptions and predictions.

These steps may be repeated several times, and the results may be averaged over all runs.
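As a rough illustration of the preprocessing in Step 2 (handling missing data and transforming the data), the sketch below fills missing values with the column mean and rescales each column to [0, 1]. The helper name `preprocess`, the use of `None` to mark a missing value, and the choice of mean imputation with min-max scaling are assumptions made for this example; the text does not prescribe a particular technique.

```python
def preprocess(rows):
    """Fill missing values (None) with the column mean, then min-max scale each column to [0, 1].

    Hypothetical helper for illustration; assumes every column has at least one known value.
    """
    columns = list(zip(*rows))
    cleaned = []
    for col in columns:
        known = [v for v in col if v is not None]
        mean = sum(known) / len(known)
        filled = [mean if v is None else v for v in col]
        lo, hi = min(filled), max(filled)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        cleaned.append([(v - lo) / span for v in filled])
    # Transpose back so that each inner list is one data object again.
    return [list(row) for row in zip(*cleaned)]


raw = [[1.0, 10.0], [None, 20.0], [3.0, None]]
print(preprocess(raw))
```

Rescaling matters for the clustering chapters that follow, because distance-based methods would otherwise be dominated by the attribute with the largest numeric range.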

1.2 Data mining and related concepts

Data mining is an analytical process designed to explore very large amounts of data in order to discover suitable patterns or systematic relationships between variables, and then to validate the findings by applying the discovered patterns to new subsets of the data. The process comprises three basic stages: exploration, model building or pattern definition, and validation and verification.

1.2.1 The concept of data mining

Because data mining has developed rapidly, both in the range of practical application areas and in its search methods, there are many different definitions of it. In this thesis I give a brief definition: data mining is the process of discovering new and potentially useful knowledge in existing data sources.

1.2.2 Data mining methods

Data mining has two main goals, prediction and description. The following methods are commonly used:

- Classification
- Regression
- Visualization
- Clustering
- Summarization
- Dependency modeling
- Model evaluation
- Evolution and deviation analysis
- Association rules
- Search methods


1.2.3 Practical application areas

- Data analysis and decision support.
- Text classification, text summarization, Web page classification, and color image clustering.
- Diagnosis of symptoms and choice of treatment methods in medicine.
- Searching and matching gene systems and genetic information in biology.
- Analysis of the financial situation and the market, and stock price forecasting, in finance, markets, and securities.
- Insurance.

1.2.4 Basic approaches and techniques applied in data mining

Data mining techniques are usually divided into two main groups:

- Descriptive data mining techniques describe the properties or general characteristics of the data in an existing database. They include clustering, summarization, visualization, evolution and deviation analysis, and association rules.
- Predictive data mining techniques make predictions based on inferences from the current data. They include classification and regression.

Below I introduce the three most common methods: data clustering, data classification, and association rule mining.

Data classification: the goal of classification is to predict the class label of data samples. The classification process usually consists of two steps: building a model, and using the model to classify data.


Step 1: A model is built by analyzing the available data samples. Each sample belongs to one class, determined by an attribute called the class attribute. These samples are called the training dataset. The class labels of the training dataset must all be known before the model is built, so this method is called supervised learning, in contrast to data clustering, which is unsupervised learning.

Step 2: The model is used to classify data. First we must compute the accuracy of the model. If the accuracy is acceptable, the model is used to predict the class labels of other data samples in the future.

Data clustering: the main goal of clustering is to group similar objects in the data set into clusters, so that objects in the same cluster are similar to each other while objects in different clusters are dissimilar. With this method one cannot know in advance what the resulting clusters will look like. Therefore a domain expert is usually needed to evaluate the clusters obtained. Data clustering is also used as a preprocessing step for other data mining algorithms.

Association rule mining: the goal of this method is to discover relationships between data values in the database. The output of the mining algorithm is the set of association rules found.


CHAPTER 2:

DATA CLUSTERING AND ITS APPROACHES

2.1 General concepts

Data mining is the process of extracting valuable hidden information from large data sets stored in databases and data warehouses. Data clustering is defined as follows [1]: "Data clustering is a data mining technique that searches for and discovers the natural, hidden, and important clusters and patterns in large data sets, thereby providing useful information and knowledge for decision making."

Data clustering is thus the process of dividing an initial data set into clusters so that the elements within a cluster are similar to each other and the elements in different clusters are dissimilar. The number of clusters can be fixed in advance from experience or determined automatically.

2.2 Data types and similarity measures

2.2.1 Data types

Let D be a database containing n objects in a k-dimensional space, and let x, y, z be objects in D: $x = (x_1, x_2, \dots, x_k)$; $y = (y_1, y_2, \dots, y_k)$; $z = (z_1, z_2, \dots, z_k)$, where $x_i, y_i, z_i$ with $i = 1, \dots, k$ are the corresponding features or attributes of the objects x, y, z.

a) Classification by domain size

- Continuous attribute: its value domain is uncountably infinite.
- Discrete attribute: its value domain is a finite, countable set.


- Binary attribute: a special case of a discrete attribute whose value domain has only two elements, such as Yes / No or False / True.

b) Classification by measurement scale

Suppose we have two objects x, y, and let $x_i$, $y_i$ be the values of their i-th attribute. The data types are as follows:

- Nominal scale: a generalization of the binary attribute in which the value domain is discrete, unordered, and has more than two elements. If x and y are two values of the attribute, we can only determine whether $x \neq y$ or $x = y$.
- Ordinal scale: a nominal attribute with an added ordering, but without quantification. If x and y are two values of an ordinal attribute, we can determine whether $x \neq y$, $x = y$, $x > y$, or $x < y$.
- Interval scale: we can determine whether one value comes before or after another, and by how much. If $x_i > y_i$, we say that x is at a distance $x_i - y_i$ from y with respect to the i-th attribute.
- Ratio scale: an interval attribute that is measured relative to a meaningful reference point, for example height or weight, which take 0 as the reference.

Among the attribute types above, nominal and ordinal attributes are together called categorical attributes, while interval and ratio attributes are called numeric attributes.


2.2.2 Similarity and dissimilarity measures

To cluster data, one must find an appropriate way to determine the "distance" between objects, that is, a measure of data similarity. Such functions measure how alike pairs of data objects are; typically they compute either the similarity or the dissimilarity between data objects.

All the measures below are defined in a metric space. A metric space is a set in which a "distance" is defined between every pair of elements, with the usual properties of geometric distance. That is, a set X of data objects from the database D mentioned above (its elements may be arbitrary objects) is called a metric space if:

- For every pair of elements x, y in X, a real number $d(x, y)$, called the distance between x and y, is determined by some rule.
- This rule satisfies the following properties: (i) $d(x, y) > 0$ if $x \neq y$; (ii) $d(x, y) = 0$ if $x = y$; (iii) $d(x, y) = d(y, x)$ for all x, y; (iv) $d(x, y) \le d(x, z) + d(z, y)$.

The function $d(x, y)$ is called the metric of the space. The elements of X are called the points of this space.

Interval-scaled attributes: after normalization, the dissimilarity of two data objects x, y is determined by the following distance metrics.

Minkowski distance:

$$d(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$$

where q is a positive natural number.



Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$

This is the special case of the Minkowski distance with q = 2.

Manhattan distance:

$$d(x, y) = \sum_{i=1}^{k} |x_i - y_i|$$

This is the special case of the Minkowski distance with q = 1.

Maximum distance:

$$d(x, y) = \max_{i = 1, \dots, k} |x_i - y_i|$$

This is the special case of the Minkowski distance as $q \to \infty$.
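The distances for interval-scaled attributes can be sketched in a few lines of Python. The function names are illustrative only; `minkowski` takes the order q as a parameter, and the other three functions are the special cases q = 2, q = 1, and the limit of large q.

```python
import math


def minkowski(x, y, q):
    """Minkowski distance of order q (q >= 1) between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def euclidean(x, y):
    """Special case q = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Special case q = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum_distance(x, y):
    """Limit of the Minkowski distance as q grows without bound."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y), manhattan(x, y), maximum_distance(x, y))
print(minkowski(x, y, 50))  # already close to the maximum distance
```

Increasing q gives more weight to the attribute with the largest difference, which is why the Minkowski value approaches the maximum distance for large q.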

Binary attributes: for two objects x and y, let

- $\alpha$ be the number of attributes with value 1 in both x and y;
- $\beta$ be the number of attributes with value 1 in x and 0 in y;
- $\gamma$ be the number of attributes with value 0 in x and 1 in y;
- $\delta$ be the number of attributes with value 0 in both x and y;
- $\tau = \alpha + \beta + \gamma + \delta$.

The similarity measures for data with binary attributes are defined as follows.

Simple matching coefficient:

$$d(x, y) = \frac{\alpha + \delta}{\tau}$$

Here the two objects x and y play the same role, that is, the attributes are symmetric and have the same weight.

Jaccard coefficient:

$$d(x, y) = \frac{\alpha}{\alpha + \beta + \gamma}$$

(0-0 matches are ignored). This formula is used when attributes with value 1 carry much more weight than attributes with value 0, so the binary attributes here are asymmetric.
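The two coefficients for binary attributes can be computed from the four match counts (1-1, 1-0, 0-1, 0-0). The sketch below assumes the standard forms of the simple matching and Jaccard coefficients; the names `n11`, `n10`, `n01`, `n00` and the function names are chosen for this example only.

```python
def match_counts(x, y):
    """Count 1-1, 1-0, 0-1 and 0-0 attribute pairs of two binary vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return n11, n10, n01, n00


def simple_matching(x, y):
    """Symmetric case: 1-1 and 0-0 matches count equally."""
    n11, n10, n01, n00 = match_counts(x, y)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


def jaccard(x, y):
    """Asymmetric case: 0-0 matches are ignored (undefined if there is no 1 at all)."""
    n11, n10, n01, _ = match_counts(x, y)
    return n11 / (n11 + n10 + n01)


x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(simple_matching(x, y), jaccard(x, y))
```

For these two vectors there are two 1-1 pairs, one 1-0 pair, one 0-1 pair, and one 0-0 pair, so the simple matching coefficient is 3/5 and the Jaccard coefficient is 2/4.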


Nominal attributes: the dissimilarity between two objects x and y is defined as

$$d(x, y) = \frac{p - m}{p}$$

where m is the number of attributes on which x and y match and p is the total number of attributes.

Ordinal attributes: suppose i is an ordinal attribute with $M_i$ values ($M_i$ is the size of its value domain). The $M_i$ states are ordered as $[1 \dots M_i]$, so each value of the attribute can be replaced by its rank $r_i \in \{1, \dots, M_i\}$. Different ordinal attributes have different value domains, so we map them all onto the common range [0, 1] with the following transformation for each attribute:

$$z_i = \frac{r_i - 1}{M_i - 1}$$

The dissimilarity formulas for interval-scaled attributes are then applied to the values $z_i$; the result is the dissimilarity for ordinal attributes.

Ratio-scaled attributes: there are several ways to compute the similarity between ratio-scaled attributes. One is to apply a logarithm to each attribute. Others are to remove the unit of measurement by normalizing the attributes, or to assign each attribute a weight based on its mean and standard deviation. When each attribute has been assigned a weight $w_i$ ($1 \le i \le k$), the dissimilarity is computed as

$$d(x, y) = \sqrt{\sum_{i=1}^{k} w_i (x_i - y_i)^2}$$
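The three measures in this part can be sketched as follows. The function names are illustrative, and the weighted distance is written in the weighted Euclidean form, which is an assumption about the formula intended here.

```python
import math


def nominal_dissimilarity(x, y):
    """(p - m) / p, where m is the number of matching attributes and p the number of attributes."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


def ordinal_to_unit(rank, num_states):
    """Map a rank in 1..num_states onto [0, 1] via (r - 1) / (M - 1); needs num_states > 1."""
    return (rank - 1) / (num_states - 1)


def weighted_euclidean(x, y, weights):
    """Euclidean distance in which attribute i contributes with weight weights[i]."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


print(nominal_dissimilarity(("red", "small", "round"), ("red", "large", "round")))
print([ordinal_to_unit(r, 5) for r in range(1, 6)])
print(weighted_euclidean((0.0, 0.0), (3.0, 4.0), (1.0, 1.0)))
```

Once ordinal ranks are mapped onto [0, 1], they can be passed to any interval-scale distance together with the already normalized numeric attributes.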


2.3 Approaches to data clustering

Clustering techniques follow many different approaches and have many practical applications, but they all pursue two common goals: the quality of the clusters discovered and the running speed of the algorithm. Current clustering techniques can be classified into the following main approaches.

2.3.1 Partitioning clustering methods

Partitioning clustering divides a given data set of n elements into k groups such that each element belongs to exactly one group and each group contains at least one element. Finding the globally optimal solution to the clustering problem is extremely expensive, because it requires searching all possible partitions. In practice, therefore, one usually looks for a locally optimal solution by using a criterion function that both evaluates the quality of the clusters and guides the search for a partition. With this strategy, one typically starts from an initial partition of the data set, chosen at random or by a heuristic, and refines it repeatedly until a desired partition that satisfies the given constraints is obtained. Partitioning algorithms try to improve the clustering criterion by computing the similarity values between data objects and sorting these values; the algorithm then selects a value in the sorted sequence so that the criterion function reaches its minimum. The main idea of locally optimal partitioning algorithms is thus to use a greedy strategy to search for a solution.

Typical partitioning algorithms such as K-means, PAM, CLARA, and CLARANS are presented in detail in the next chapter.

2.3.2 Hierarchical clustering methods

This method builds a hierarchy over the data objects under consideration. That is, it arranges the given data set into a tree-shaped structure, and this hierarchical tree is built recursively. There are two common approaches:

- Agglomerative, usually called the bottom-up approach: the method starts with each object as its own cluster and then merges objects according to a similarity measure (such as the distance between the centers of two groups). The process continues until all groups have been merged into a single group (the top level of the hierarchy) or until a termination condition is satisfied. This approach uses a greedy strategy during clustering.
- Divisive, usually called the top-down approach: the method starts with all objects in the same cluster. In each successful iteration, a cluster is split into smaller clusters according to the value of some similarity measure, until each object is its own cluster or until a stopping condition is satisfied. This approach uses a divide-and-conquer strategy during clustering.

Typical hierarchical algorithms such as CURE and BIRCH are presented in detail in the next chapter.

In practice, partitioning and hierarchical clustering are often combined: the result of a hierarchical method can be improved by a subsequent partitioning step. Partitioning and hierarchical clustering are the two classical clustering methods, and many improved algorithms based on them are now widely used in data mining.

2.3.3 Density-based clustering methods

This method groups objects according to a given density function. Density is defined as the number of objects in the neighborhood of a data object within some threshold. In this approach, once a cluster has been identified it keeps growing by adding new data objects, as long as the number of neighbors of these objects exceeds a predefined threshold. Because clusters are determined by the density of objects, density-based methods can discover clusters of arbitrary shape. The technique copes very well with outliers and noise. However, choosing the density parameters of the algorithm is very difficult, even though these parameters strongly affect the clustering result.

Typical density-based algorithms such as DBSCAN and OPTICS are presented in detail in the next chapter.

2.3.4 Grid-based clustering methods

Density-based clustering is not suitable for high-dimensional data; grid-based clustering is used to meet this requirement. The method clusters data by means of a grid data structure and is mainly applied to spatial data, for example data represented by the geometric structure of objects in space together with their relations, attributes, and activities. The aim is to quantize the data set into cells. These cells form a grid data structure, and the clustering operations then work on the objects in each cell. The grid-based approach does not move objects between cells; instead it builds several hierarchical levels of groups of objects within a cell. In this respect the method resembles hierarchical clustering, except that it does not merge cells. The clusters are therefore not based on a distance measure (also called a similarity measure for spatial data) but are determined by a predefined parameter. The advantage of grid-based clustering is its fast processing time, which is independent of the number of data objects in the original data set and depends instead on the number of cells in each dimension of the grid space. An example of a grid data structure containing cells in space is shown in Figure 2.1.

Figure 2.1: Grid data structure model. Level 1 (the highest level) may contain only one cell; a cell at level i-1 may correspond to 4 cells at level i.

Typical grid-based clustering algorithms include STING and WaveCluster.

2.3.5 Model-based clustering methods

This method tries to find good approximations of the model parameters so that the model fits the data as well as possible. Such methods may use a partitioning or a hierarchical clustering strategy, depending on the structure or model they assume for the data set and on how they refine the model to identify the partitions. Model-based clustering tries to fit the data to a mathematical model, on the assumption that the data are generated by a mixture of underlying probability distributions. Model-based clustering algorithms follow two main approaches: statistical models and neural networks. The method resembles density-based clustering in that it grows individual clusters in order to improve a previously determined model, but it does not always start with a fixed number of clusters and does not use the same notion of density for all clusters.
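One common statistical instance of the model-based idea, which the text does not spell out, is to fit a mixture of Gaussian distributions with the expectation-maximization (EM) procedure. The sketch below is a minimal one-dimensional, two-component version under that assumption. The function name `em_two_gaussians`, the initialization at the minimum and maximum of the data, and the fixed iteration count are choices made only for this illustration, and the code has no safeguards against degenerate components.

```python
import math


def em_two_gaussians(data, iterations=50):
    """Fit a 1-D mixture of two Gaussians with EM; returns (weights, means, variances)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n or 1.0
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [
                weights[j]
                * math.exp(-((x - means[j]) ** 2) / (2 * variances[j]))
                / math.sqrt(2 * math.pi * variances[j])
                for j in range(2)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted data.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # keep the variance strictly positive
    return weights, means, variances


weights, means, variances = em_two_gaussians([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
print(weights, means, variances)
```

Each data point can then be assigned to the component with the larger responsibility, which yields a partition of the data derived from the fitted model rather than from a distance measure alone.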


2.3.6 Constraint-based clustering methods

The development of spatial data clustering on large databases has provided many convenient tools for analyzing geographic information. However, most of these algorithms give users very few ways to specify the real-world constraints that must be satisfied during clustering. To make spatial clustering more effective, further research is needed to give users the ability to incorporate constraints into the clustering algorithm.

The clustering methods above continue to be developed and applied in many different fields, and several research branches have grown out of them:

- Statistical clustering: based on the concepts of statistical analysis, this branch uses similarity measures to partition the objects, but it applies only to data with numeric attributes.
- Conceptual clustering: this technique was developed for categorical data; it clusters objects according to the concepts they carry.
- Fuzzy clustering: this branch uses fuzzy techniques to cluster data. Algorithms of this kind produce clustering schemes suited to everyday activities and handle uncertain real-world data.
- Kohonen network clustering: this type of clustering is based on neural networks. A Kohonen network has an input layer and an output layer of neurons. Each input neuron corresponds to one attribute of a record, and each input neuron is connected to all output neurons. Each connection has a weight that determines the position of the corresponding output neuron.

The clustering techniques presented above are widely used in practice, but most of them apply only to data sets whose attributes are all of the same type. Clustering data sets of mixed type is therefore an open problem in data mining at present.

2.4 Applications of data clustering

Data clustering has many applications in different fields:

- Commerce: helps businesses discover important customer groups in order to set marketing targets.
- Biology: identifies species, classifies genes with similar functions, and reveals structures in samples.
- Urban planning: identifies groups of houses by type and geographic location, providing information for urban planning.
- Libraries: groups books with similar content and meaning for readers.
- Insurance: identifies groups of policy holders with high claim costs and detects commercial fraud.
- Earth science: clusters earthquake epicenters to provide information for identifying dangerous zones.
- World Wide Web: discovers important and meaningful groups of documents on the web. These document classes support the discovery of knowledge from data.


CHAPTER 3:

SOME BASIC ALGORITHMS IN DATA CLUSTERING

3.1 Partitioning clustering algorithms

3.1.1 The K-means algorithm

The K-means partitioning algorithm was proposed by MacQueen in the field of statistics in 1967. The algorithm is based on the distances between the data objects in a cluster. In practice, it measures the distance to the mean of the data in the cluster, which is regarded as the center of the cluster. The algorithm therefore initializes a set of cluster centers and then repeats two steps: assign each object to the cluster whose center is nearest, and recompute the center of each cluster from the new assignment. The iteration stops when the cluster centers converge.

The goal of K-means is to produce k clusters $\{C_1, C_2, \dots, C_k\}$ from a data set of n objects in a d-dimensional space, $X_i = (x_{i1}, x_{i2}, \dots, x_{id})$, $i = 1, \dots, n$, such that the criterion function

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} D^2(x - m_i)$$

reaches its minimum, where $m_i$ is the centroid of cluster $C_i$ and D is the distance between two objects. The K-means algorithm consists of the following steps:


Input: the number of clusters k and the cluster centroids $\{m_j\}_{j=1}^{k}$.

Output: the clusters $C_i$ ($i = 1, \dots, k$) for which the criterion function E reaches a (local) minimum.

Begin

Step 1: Initialization. Choose k initial centroids $\{m_j\}_{j=1}^{k}$ in the space $R^d$ (d is the number of dimensions of the data). The choice may be random or based on experience.

Step 2: Distance computation. For each point $X_i$ ($1 \le i \le n$), compute its distance to each centroid $m_j$, $j = 1, \dots, k$, and then find the nearest centroid for each point.

Step 3: Centroid update. For each $j = 1, \dots, k$, update the centroid $m_j$ as the arithmetic mean of the vectors of the data objects assigned to cluster j.

Stopping condition: repeat Steps 2 and 3 until the cluster centroids no longer change.

End.
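The three steps can be sketched in plain Python as follows. This is a minimal illustration rather than a tuned implementation: the function names, the `seed` and `max_iter` parameters, and the rule of keeping the old centroid when a cluster becomes empty are assumptions added for the example. The loop stops when the assignments stop changing, which implies that the centroids no longer change either.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: choose k initial centroids at random among the data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so centroids are too
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
print(labels, centroids)
```

Because the initial centroids are random, different seeds may label the clusters differently or, on harder data, end in different local minima of the criterion function E.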


Remarks: the complexity of the algorithm is $O(3nkd \cdot \tau \cdot T^{flop})$, where n is the number of input data objects, k is the number of clusters, d is the number of dimensions, $\tau$ is the number of iterations, and $T^{flop}$ is the time needed for one basic operation such as a multiplication or division. Because K-means analyzes clusters in a simple way, it can be applied to large data sets. Its drawbacks are that it applies only to data with numeric attributes, that it discovers only spherical clusters, and that it is very sensitive to noise and outliers in the data.

3.1.2 The K-Medoids algorithm

The K-Medoids algorithm can cope with noise by choosing the object nearest the center of a cluster as the representative of that cluster (the medoid). The algorithm proceeds as follows:

Step 1: Choose any K of the N original objects as the initial medoids.

Step 2: Repeat until convergence:

- Assign each remaining object to the cluster whose medoid is nearest to it.
- Replace a current medoid with a non-medoid object if doing so improves the clustering quality. Quality is evaluated with a cost function that measures the dissimilarity between each object and the medoid of the cluster containing it.

K-Medoids is more effective than K-means when the data contain noise or outliers, but its computational complexity is higher. Both algorithms share the drawback that the number of clusters k must be supplied by the user.

Besides K-means and K-Medoids, partitioning clustering includes other algorithms such as PAM and CLARA.
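The two K-Medoids steps can be sketched as a simple swap-based search. This is one possible reading of the steps, not a reference implementation: taking the first k objects as the initial medoids, using the Manhattan distance, and accepting every swap that lowers the total cost are assumptions made for the example.

```python
def manhattan(a, b):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def total_cost(points, medoids, dist):
    """Sum over all objects of the dissimilarity to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, dist=manhattan):
    """Return (medoid indices, labels, cost) found by greedy medoid swapping."""
    medoids = list(range(k))  # Step 1: arbitrary initial medoids (the first k objects)
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:  # Step 2: repeat until no swap improves the cost
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(points, trial, dist)
                if trial_cost < cost:  # keep the swap only if quality improves
                    medoids, cost, improved = trial, trial_cost, True
    # Final assignment of every object to its nearest medoid.
    labels = [
        min(range(k), key=lambda j: dist(p, points[medoids[j]])) for p in points
    ]
    return medoids, labels, cost


data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(k_medoids(data, k=2))
```

Every accepted swap strictly lowers the cost and there are finitely many medoid sets, so the loop terminates. Each pass evaluates the cost of many candidate swaps, which reflects the higher computational cost compared with K-means noted in the text.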


3.2 Hierarchical clustering algorithms

Most clustering algorithms favor clusters that are spherical and of similar size, and they are therefore ineffective in the presence of outliers. The CURE algorithm overcomes this problem and handles outliers better. CURE defines a fixed number of representative points that are scattered throughout the data space and are chosen to describe the clusters being formed. These points are generated by selecting well-scattered objects of the cluster and then "shrinking" them, that is, moving them toward the center of the cluster by a shrinking factor. The process is repeated, which makes it possible to measure the growth rate of the cluster. At each step of the algorithm, the two clusters with the closest pair of representative points (one point from each cluster) are merged.

Having more than one representative point per cluster allows CURE to discover non-spherical clusters, and shrinking the clusters reduces the effect of outliers. The algorithm therefore handles outliers well and is effective for non-spherical shapes and for clusters of widely varying size. In addition, it scales well to large databases without reducing the clustering quality.

Figure 3.1: Data clusters discovered by CURE

To handle large databases, CURE uses random sampling and partitioning. A sample is drawn at random and then partitioned, and clustering is carried out on each partition, so that each partition is partially clustered. The resulting clusters are then clustered a second time to obtain the desired clusters. Note, however, that a random sample does not necessarily describe the whole data set.

The CURE algorithm proceeds through the following basic steps:

1. Draw a random sample S from the original data set.
2. Partition the sample S into groups of equal size: the main idea is to partition the sample into p equal groups, each of size n/p (n is the size of the sample).
3. Cluster the points of each group: perform clustering on each group until it has been divided into n/(pq) clusters (with q > 1).
4. Eliminate outliers: first, clusters are formed until the number of clusters drops to a fraction of the initial number of clusters. Then, if outliers were sampled during the initialization phase, the algorithm automatically removes the small groups.
5. Cluster the partial clusters: the representative objects of the clusters are moved toward the cluster center, that is, they are replaced by objects closer to the center.
6. Label the data with the corresponding cluster labels.

The computational complexity of CURE is $O(n^2 \log n)$. CURE is a reliable algorithm for discovering clusters of arbitrary shape, and it works well on data with outliers and on two-dimensional data sets. However, it is very sensitive to parameters such as the number of representative objects and the shrinking factor of the representatives.
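The core idea of CURE, several well-scattered representative points per cluster that are shrunk toward the centroid, can be sketched as follows. This covers only the representative-point step and the distance between clusters, not the sampling, partitioning, or merging loop. The function names, the farthest-point selection heuristic, and the default values `c=4` and `shrink=0.5` are assumptions made for the example.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def scattered_points(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    unique = list(dict.fromkeys(points))  # drop duplicates, keep order
    center = centroid(unique)
    chosen = [max(unique, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(unique)):
        remaining = [p for p in unique if p not in chosen]
        # Next representative: the point farthest from those already chosen.
        chosen.append(
            max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        )
    return chosen


def representatives(points, c=4, shrink=0.5):
    """Scattered points moved toward the cluster centroid by the factor `shrink`."""
    center = centroid(points)
    return [
        tuple(pi + shrink * (mi - pi) for pi, mi in zip(p, center))
        for p in scattered_points(points, c)
    ]


def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: the closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


square = [(0, 0), (0, 2), (2, 0), (2, 2)]
print(representatives(square))
```

With `shrink=0` the representatives stay on the cluster boundary, and with `shrink=1` they all collapse onto the centroid, so the shrinking factor controls how strongly outlying points influence which clusters are merged.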

Besides CURE, hierarchical clustering includes other algorithms such as BIRCH, AGNES, DIANA, ROCK, and CHAMELEON.

3.3 The COP-Kmeans algorithm

COP-Kmeans is a semi-supervised clustering algorithm with a search-based approach. In COP-Kmeans (proposed by Wagstaff in 2001), the side information is supplied as a set of must-link and cannot-link constraints:

- Must-link: the two data objects must be in the same cluster.
- Cannot-link: the two data objects must be in different clusters.

These constraints are applied throughout the clustering process in order to steer it toward the desired clustering result. The COP-Kmeans algorithm proceeds as follows:


Input:
- The set of data objects $X = \{x_1, \dots, x_n\}$, $x_i \in R^d$
- The number of clusters: K
- The set of must-link and cannot-link constraints

Output:
- K disjoint partitions $\{X_h\}_{h=1}^{K}$ of X such that the objective function is optimized.

Steps:
1. Initialize the clusters: the initial centers are chosen at random so that the given constraints are not violated.
2. Repeat until convergence:
- Cluster assignment: assign each data object to the nearest cluster such that no constraint is violated.
- Center estimation: update each center as the mean of all objects in its cluster.
- $t \leftarrow t + 1$.
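A compact sketch of these steps is given below. It is an illustration under stated assumptions, not the original implementation: the initial centers are passed in as `init_centers` instead of being drawn at random, constraints are given as pairs of object indices, and the function raises an error when an object has no feasible cluster. All names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def violates(i, cluster, labels, must_link, cannot_link):
    """True if placing object i in `cluster` breaks a constraint with an assigned object."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] is not None and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] == cluster:
                return True
    return False


def cop_kmeans(points, init_centers, must_link, cannot_link, max_iter=100):
    """Return (labels, centers); constraints are pairs of indices into `points`."""
    centers = [list(c) for c in init_centers]
    k = len(centers)
    previous = None
    for _ in range(max_iter):
        labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try the clusters from nearest to farthest and take the first feasible one.
            for j in sorted(range(k), key=lambda c: sq_dist(p, centers[c])):
                if not violates(i, j, labels, must_link, cannot_link):
                    labels[i] = j
                    break
            else:
                raise ValueError(f"no feasible cluster for object {i}")
        if labels == previous:  # assignments are stable: converged
            break
        previous = labels
        # Update each center as the mean of the objects assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


data = [(0.0,), (1.0,), (4.0,), (9.0,), (10.0,)]
labels, centers = cop_kmeans(
    data, init_centers=[(0.0,), (10.0,)], must_link=[(2, 3)], cannot_link=[(0, 1)]
)
print(labels, centers)
```

Because objects are processed in order and checked only against objects already assigned in the current pass, the result can depend on the order of the data, and a feasible assignment is not guaranteed to be found even when one exists.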


CHAPTER 4: APPLYING THE K-MEANS ALGORITHM TO IMAGE SEGMENTATION

4.1 Overview of image segmentation

Image segmentation is a key step in image processing. Its aim is to decompose an image into components that share some common property, based either on boundaries or on connected regions. The criterion for identifying a connected region may be the same gray level, the same color, the same roughness, and so on. Segmentation based on connected regions is called homogeneous-region segmentation; segmentation based on boundaries is called boundary segmentation. There are also other techniques, such as amplitude-based segmentation and texture segmentation.

The purpose of image analysis is to obtain a synthetic description of the many different elements that make up the raw image. The amount of information contained in an image is very large, whereas most applications need only certain characteristic information, so a process is required to reduce this huge volume of information. That process consists of image segmentation and the extraction of the main features. The techniques used for it are discussed in the following sections.

4.1.1 Segmentation by amplitude thresholding

The simplest useful feature of an image is the amplitude of one of its physical properties, such as reflectance, transmittance, color, or multispectral response. For example, in an X-ray image the gray-level amplitude represents the absorption characteristics of the parts of the body and makes it possible to distinguish bone from soft tissue, healthy cells from diseased cells, and so on.

Amplitude thresholding is very useful for binary images such as printed text and graphics, and also for color images and X-ray images. Choosing the threshold is a very important step in this technique. The following general steps are commonly used:
- Examine the gray-level histogram of the image to locate its peaks and valleys. If the histogram is undulating (many peaks and many valleys), the valleys can be used to choose the threshold.
- Choose the threshold t such that a predetermined fraction of all samples lies below t.
- Adjust the threshold by examining the gray-level histograms of neighboring points.
- Choose the threshold by examining the histogram of only those points that satisfy a selection criterion. For example, for a low-contrast image, the histogram of the points whose Laplacian amplitude g(m, n) exceeds a preset value t (chosen so that the 5% to 10% of pixels with the largest gradient are treated as boundary points) reveals the bimodal character of the image better than the histogram of the original image.
- When a probabilistic classification model is available, determine the threshold from a criterion that minimizes the probability of error, or some other quantity, according to Bayes' rule.

4.1.2 Segmentation into homogeneous regions

This technique segments the image into homogeneous regions based on some important property of a region. The choice of region properties determines the segmentation criterion. The homogeneity of an image region must also be defined precisely, because it is the main factor in how effective the segmentation is. Commonly used criteria are uniformity of gray level, of color (for color images), of texture, and of motion. For example, in aerial imaging, segmentation by color makes it possible to distinguish land cover: green or yellow fields, dark green forests, gray roads, red roofs, and so on.


For moving images, two images observed at two different times are subtracted. The parts of the image that have not changed then take the value zero, while the parts that have changed take positive or negative values corresponding to the change or displacement.

The methods used are the following.

Quadtree splitting method
This method checks the validity of the criterion globally over a large region of the image. If the criterion is satisfied, the segmentation is considered finished. Otherwise, the region under consideration is divided into four smaller regions. The method is then applied recursively to each smaller region until all regions satisfy the criterion.

Local method, or segmentation by merging
The idea of this method is to start from the smallest regions and merge them whenever the criterion is satisfied, so as to obtain a larger homogeneous region. The process continues with the resulting regions until no further merging is possible. The regions that remain give the segmentation result. The smallest region at the starting step is therefore a single pixel. The essential part of this method is the rule for merging two regions, which is as follows:
- The two regions must satisfy the criterion, for example the same color or the same gray level.
- They must be adjacent to each other.

Combined method
Both methods above have drawbacks. The splitting method builds a hierarchical structure and establishes the relationships between regions, but it divides the image too finely. The merging method reduces the number of connected regions to a minimum, but its structure is flat and does not show the relationships between regions. This motivates combining the two methods. First, the splitting method is used to build the quadtree, segmenting from the root toward the leaves. Next, the tree is traversed in the opposite direction and regions that satisfy the same criterion are merged. This yields a structural description of the image with connected regions of maximal size. The main steps are:
- Check the homogeneity criterion.
- Merge regions.

4.1.3 Boundary-based segmentation

Boundaries are one of the important features of an image, which is why many applications use boundary-based segmentation. Segmenting an image by its boundaries involves the following steps:
- Detect and enhance the boundaries.
- Thin the boundaries.
- Binarize the boundary lines.
- Describe the boundaries.

4.1.4 Segmentation by surface texture

Texture is a term for the repetition of basic texture elements (texels). The repetition may be random, periodic, or nearly periodic. A texel contains many pixels. In image analysis, texture is divided into two main types:
- Statistical.
- Structural.

When an object appears on a highly textured background, texture-based segmentation becomes quite important. The reason is that texture usually contains a high density of edges, which makes boundary-based segmentation ineffective unless the texture is removed. Homogeneous-region segmentation can also be applied to texture features and can be used to segment textured regions.

4.2 The K-means algorithm for image segmentation

The importance and the difficulty of grouping objects the way human perception does have long been studied in computer vision, and in image processing in particular. Image segmentation has powerful and wide-ranging applications in automatic image analysis and understanding, but it is also a hard problem that researchers have not yet solved completely. How can an image be divided into subsets? What are the feasible ways of doing so? These questions were posed long ago and are still awaiting answers.

Over roughly the past 30 years, a great many algorithms have been proposed for the image segmentation problem. Most of them rely on two important properties of each pixel relative to its neighbors: dissimilarity and similarity. Methods based on the similarity of pixels are called region-based methods, while methods based on the dissimilarity of pixels are called boundary-based methods. This report presents the K-means algorithm as a solution to the image segmentation problem.

4.2.1 Problem description

Input:
- An image of size m × n.
- The number of clusters (k) into which the image is to be segmented.

Output:
- The image divided into k segments of similar color.

4.2.2 Main steps of the algorithm

Given the desired number of clusters and the cluster centroids, the algorithm computes the distance between each point and every cluster centroid. It then assigns each point to the cluster whose centroid is nearest and updates the cluster centroids. The result is obtained once the cluster centroids no longer change. The general flowchart of the algorithm is:


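1. Begin.
2. Find the Top X colors and use them as the initial centroids.
3. Compute the distance $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
4. Assign the points to the clusters and update the cluster centroids.
5. Test whether the new centroids equal the old centroids. If No, return to step 3; if Yes, End.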

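Figure 4.1: The K-means algorithm.

4.2.2.1 Finding the Top X colors

First, the actual number of colors in the image is compared with the requested number of color clusters; if the actual number of colors is smaller, the number of clusters is set to the actual number of colors. A list of the colors is then built and sorted in descending order, and the first X elements of the list are taken.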


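Input: the image that has been read.

1. Initialize:
       int i = 0; int numColours;
   and read the number of distinct colors, colours.Count.
2. If colours.Count < numColours (Yes):
       numColours = colours.Count;
3. Allocate the seed array and build the sorted color list summaryList:
       _topColours = new Color[numColours];
       summaryList.AddRange(colours);
       summaryList.Sort();
4. While i < _topColours.Length (Yes):
       _topColours[i] = Color.FromArgb(summaryList[i].Value.Colour.R, summaryList[i].Value.Colour.G, summaryList[i].Value.Colour.B);
       i++;
5. When the loop ends (No), _topColours holds the initial cluster centroids.

Output: the initial cluster centroids.

Figure 4.2: Finding the Top X colors.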
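The seed selection can be sketched compactly in Python. This is an illustration only: the function name `top_colours` is introduced here, pixels are assumed to be (R, G, B) tuples, and the ranking is assumed to be by pixel count, since the text says only that the list is sorted in descending order.

```python
from collections import Counter


def top_colours(pixels, num_colours):
    """Return up to `num_colours` seed colors, most frequent first."""
    counts = Counter(pixels)
    # Never ask for more clusters than there are distinct colors.
    num_colours = min(num_colours, len(counts))
    return [colour for colour, _ in counts.most_common(num_colours)]


pixels = [(255, 0, 0)] * 3 + [(0, 255, 0)] * 2 + [(0, 0, 255)]
seeds = top_colours(pixels, 2)
print(seeds)
```

Seeding with frequent colors is deterministic, unlike random seeding, so repeated runs on the same image start from the same centroids.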


4.2.2.2 Computing distances and assigning clusters

The Euclidean distance is used to compute the color distance between each point and every cluster centroid. Based on these distances, each point is placed in the cluster whose centroid is nearest to it.
Input: the cluster centroids.

1. For the current pixel pd, create an empty dictionary distances and an iterator c over the clusters.
2. While c < _currentCluster (Yes), that is, for each cluster c:
       float d = (float)Math.Sqrt((double)Math.Pow((c.Value.CentroidR - pd.Ch1), 2) + (double)Math.Pow((c.Value.CentroidG - pd.Ch2), 2) + (double)Math.Pow((c.Value.CentroidB - pd.Ch3), 2));
       distances.Add(c.Key, new Distance(d));
       c++;
3. When no cluster remains (No), copy the distances into a list and sort it so that the nearest cluster comes first:
       list.AddRange(distances);
       list.Sort();
4. If _pixelDataClusterAllocation.ContainsKey(list[0].Key):
   - Yes: add the pixel to the existing member list:
       ((List<PixelData>)_pixelDataClusterAllocation[list[0].Key]).Add(pd);
   - No: create a new member list clrList for that cluster:
       clrList.Add(pd);
       _pixelDataClusterAllocation.Add(list[0].Key, clrList);

Output: X color clusters.

Figure 4.3: Cluster assignment.
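The assignment step can be written more directly by taking the minimum distance instead of sorting all distances. The following Python sketch is illustrative only; `colour_distance` and `assign_pixels` are names introduced here, and pixels and centroids are assumed to be three-channel tuples.

```python
import math


def colour_distance(pixel, centroid):
    """Euclidean distance between two colors, channel by channel."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(pixel, centroid)))


def assign_pixels(pixels, centroids):
    """Map each cluster index to the list of pixels nearest to its centroid."""
    allocation = {}
    for pixel in pixels:
        nearest = min(
            range(len(centroids)),
            key=lambda k: colour_distance(pixel, centroids[k]),
        )
        # Create the member list on first use, then append.
        allocation.setdefault(nearest, []).append(pixel)
    return allocation


centroids = [(0, 0, 0), (255, 255, 255)]
pixels = [(10, 10, 10), (250, 240, 245), (30, 0, 5)]
allocation = assign_pixels(pixels, centroids)
print(allocation)
```

Only the nearest centroid is needed, so `min` does the work in time linear in the number of clusters, whereas sorting the full distance list costs more for the same answer.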


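4.2.2.3 Recomputing the cluster centroids

Input: the color clusters, with an iterator cluster over the clusters and an iterator clr over the pixels of a cluster.

1. While cluster < _currentCluster (Yes), that is, for each cluster:
       List<PixelData> clrList = (List<PixelData>)_pixelDataClusterAllocation[cluster.Key];
       float cR = 0, cG = 0, cB = 0;
2. While clr < clrList (Yes), that is, for each pixel of the cluster, accumulate the channel sums:
       cR += clr.Ch1; cG += clr.Ch2; cB += clr.Ch3;
       clr++;
3. When no pixel remains (No), update the centroid:
       float count = clrList.Count + 1;
       cluster.Value.CentroidR = (cluster.Value.CentroidR + cR) / count;
       cluster.Value.CentroidG = (cluster.Value.CentroidG + cG) / count;
       cluster.Value.CentroidB = (cluster.Value.CentroidB + cB) / count;
4. If !_clusterColours.ContainsKey(clr.Name) (Yes), record the cluster color:
       _clusterColours.Add(clr.Name, Color.FromArgb((int)cluster.Value.CentroidR, (int)cluster.Value.CentroidG, (int)cluster.Value.CentroidB));
5. Move to the next cluster:
       cluster++;

Output: the new centroids, once no cluster remains (No).

Figure 4.4: Computing the new centroids.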
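A short Python sketch makes the update rule explicit. It is an illustration under assumptions: the function name `update_centroid` and its optional `previous` argument are introduced here. With `previous` omitted the function computes the plain mean of the members, which is the textbook K-means update; passing the old centroid reproduces the flowchart's variant, in which the old centroid is counted as one extra sample (`count = clrList.Count + 1`).

```python
def update_centroid(members, previous=None):
    """Per-channel mean of a cluster's pixels.

    members  -- list of (R, G, B) tuples assigned to the cluster
    previous -- optional old centroid, averaged in as one extra sample
    """
    samples = list(members)
    if previous is not None:
        samples.append(previous)
    if not samples:
        raise ValueError("cannot compute the centroid of an empty cluster")
    n = len(samples)
    return tuple(sum(channel) / n for channel in zip(*samples))


members = [(0, 0, 0), (10, 20, 30)]
plain = update_centroid(members)
with_old = update_centroid(members, previous=(20, 10, 0))
print(plain, with_old)
```

The two variants differ most for small clusters, where the old centroid carries noticeable weight; for clusters of thousands of pixels the difference is negligible.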


4.2.2.4 Checking convergence

To check whether the clustering has converged, the centroid just computed for each cluster is compared with that cluster's previous centroid.
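Input: the new centroids.

1. Initialize:
       bool match = true;
2. While cluster < _currentCluster (Yes), that is, for each cluster:
   - If Centroid != _previousCluster.Centroid (Yes):
       match = false;
   - Move to the next cluster:
       cluster++;
3. If !match (Yes), store the new centroids as the previous ones. While cluster < _currentCluster (Yes):
       _previousCluster.Centroid = Centroid;
       cluster++;
4. Set the result:
       _converged = match;

Output: _converged.

Figure 4.5: Checking convergence.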
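The convergence test, and the way it closes the loop over the previous three steps, can be sketched as follows. This Python code is an illustration only: `has_converged`, `kmeans`, `tol` and `max_iter` are names introduced here. The tolerance `tol` is an assumption added because exact equality of floating-point centroids, as in the flowchart's `!=` test, can fail to trigger on some data; `tol=0.0` reproduces the exact comparison.

```python
import math


def has_converged(previous, current, tol=0.0):
    """True if every channel of every centroid moved by at most `tol`."""
    return all(
        abs(p - c) <= tol
        for old, new in zip(previous, current)
        for p, c in zip(old, new)
    )


def kmeans(pixels, seeds, max_iter=100):
    """Assign, update, and test for convergence until the centroids settle."""
    centroids = [tuple(float(v) for v in s) for s in seeds]
    clusters = []
    for _ in range(max_iter):
        # Assignment: each pixel goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in pixels:
            k = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
            clusters[k].append(p)
        # Update: mean of the members; an empty cluster keeps its centroid.
        updated = [
            tuple(sum(ch) / len(m) for ch in zip(*m)) if m else c
            for m, c in zip(clusters, centroids)
        ]
        if has_converged(centroids, updated):
            return updated, clusters
        centroids = updated
    return centroids, clusters


pixels = [(0, 0, 0), (10, 10, 10), (200, 200, 200), (210, 210, 210)]
final, groups = kmeans(pixels, seeds=[(0, 0, 0), (200, 200, 200)])
print(final)
```

The `max_iter` cap is a safeguard so the loop terminates even if the centroids keep oscillating by tiny amounts.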


4.2.3 Experimental results

4.2.3.1 Implementation environment

The program was written in C# and was installed and tested on the Windows XP operating system.

4.2.3.2 Selected screens

[Screenshot: start-up screen]

[Screenshot: loading the input data]

[Screenshot: data processing]

[Screenshot: clustering result]


CONCLUSION

In the course of researching and completing this graduation project, "A study of several data clustering methods and their applications", I gained additional knowledge and came to see that data clustering within data mining is a broad field of research with much still to be explored.

In this project I concentrated on an overview of data mining, on data clustering and some of its algorithms, and on an overview of image segmentation. I also implemented and tested the K-means algorithm applied to image segmentation.

Because the time available was limited, I was able to study only some of the basic techniques of data clustering and to test an implementation of the K-means algorithm. A number of techniques remain that I have not yet studied, exploited, or applied to practical problems.

In the future I will try to continue this research by studying further clustering techniques, and in particular by studying and developing image segmentation techniques capable of handling moving images.

Student

V MINH NG


