
THAI NGUYEN UNIVERSITY

UNIVERSITY OF INFORMATION AND COMMUNICATION TECHNOLOGY

TRẦN ĐỨC THUẬN

DATA CLUSTERING AND ITS APPLICATION
IN PROTEIN STRUCTURE CLASSIFICATION

MASTER'S THESIS IN COMPUTER SCIENCE

THAI NGUYEN - 2012

Digitized by the Learning Resource Center, Thai Nguyen University http://www.lrc-tnu.edu.vn


THAI NGUYEN UNIVERSITY
UNIVERSITY OF INFORMATION AND COMMUNICATION TECHNOLOGY

TRẦN ĐỨC THUẬN

DATA CLUSTERING AND ITS APPLICATION
IN PROTEIN STRUCTURE CLASSIFICATION

Major: Computer Science
Code: 60.48.01

MASTER'S THESIS IN COMPUTER SCIENCE

Scientific supervisor: Assoc. Prof. Dr. ĐOÀN VĂN BAN

THAI NGUYEN - 2012


DECLARATION

I declare that the thesis "Data Clustering and Its Application in Protein Structure Classification" is my own research work. The data in the thesis are used honestly. The research results presented in this thesis have not been published in any other work.
I sincerely thank the teachers at the Institute of Information Technology; the teachers at the University of Information and Communication Technology, Thai Nguyen; Mr. Trần Đăng Hưng, lecturer at the Faculty of Information Technology and the Center for Computational Science, Hanoi National University of Education; and my friends and colleagues at the Technology Information Center of the Thai Nguyen Department of Science and Technology and at the State Reserve Department of the Bac Thai region. They helped me greatly during my studies, in collecting and searching for materials, and in my work, so that I could complete this thesis.
I express my respect and deep gratitude to Assoc. Prof. Dr. Đoàn Văn Ban, who directly supervised and helped me throughout the time I worked on this thesis.
Thai Nguyen, August 2012
Student

Trần Đức Thuận


TABLE OF CONTENTS
Declaration
Table of contents
List of tables
List of figures
Introduction
1. Reasons for choosing the topic
2. Research objectives
3. Research methods
4. Overview of the thesis
CHAPTER 1 - THEORETICAL OVERVIEW OF DATA CLUSTERING
1.1. Overview of data clustering
1.2. Clustering in data classification
1.3. Requirements of data clustering
1.4. Data types in clustering
1.4.1. Classifying data types by domain size
1.4.2. Classifying data types by measurement scale
1.5. Similarity and distance measures for the data types
1.5.1. The notions of similarity and dissimilarity
1.5.2. Interval attributes
1.5.3. Binary attributes
1.5.4. Nominal attributes
1.5.5. Ordinal attributes
1.5.6. Ratio attributes
1.6. Chapter conclusion
CHAPTER 2 - DATA CLUSTERING TECHNIQUES APPLIED TO PROTEIN STRUCTURE CLASSIFICATION
2.1. Introduction
2.2. The K-means algorithm
2.3. The PAM algorithm
2.4. The CLARA algorithm
2.5. The CLARANS algorithm
2.6. Chapter conclusion
CHAPTER 3 - BIOINFORMATICS AND PROTEIN STRUCTURE CLASSIFICATION
3.1. Overview of bioinformatics
3.1.1. The central dogma of molecular biology
3.1.2. DNA (DesoxyriboNucleic Acid)
3.1.3. RNA (RiboNucleic Acid)
3.1.4. Protein
3.1.5. Forms of protein
3.2. Methods of protein structure classification
3.2.1. Structure classification with SCOP
3.2.2. Structure classification with CATH
3.2.3. Structure classification with the Dali Domain Dictionary (DDD)
3.3. Chapter conclusion
CHAPTER 4 - DEMO PROGRAM WITH THE CLUSTER 3.0 SOFTWARE
4.1. The Cluster 3.0 software
4.1.1. Hardware requirements
4.1.2. Data source for the demo program
4.1.3. Using the clustering library
4.2. Using the K-means and K-medians algorithms
4.2.1. Initialization
4.2.2. Finding the cluster centroid
4.2.3. Finding the cluster mean or the cluster median
4.2.4. Finding the optimal solution with K-means and K-medians
4.3. The demo software
4.3.1. Program input
4.3.2. Interface of the main functions of the program
4.3.3. Program output file
CONCLUSION AND FUTURE RESEARCH
Conclusion
Directions for future research
REFERENCES
APPENDIX


TABLE OF ABBREVIATIONS

Abbreviation   English term                                                 Meaning
DNA            DesoxyriboNucleic Acid                                       Nucleic acid molecule that carries the genetic information encoding the growth and development of living organisms
RNA            RiboNucleic Acid                                             One of the two types of nucleic acid; the basis of heredity at the molecular level
PAM            Partitioning Around Medoids                                  Clustering algorithm that partitions the data around medoids
CLARA          Clustering LARge Applications                                Clustering algorithm for large applications
CLARANS        Clustering Large Applications based upon RANdomized Search   Clustering algorithm for large applications based on randomized search
rRNA           ribosomal RNA                                                RNA component of the ribosome
tRNA           transfer RNA                                                 Transfer RNA
mRNA           messenger RNA                                                Messenger RNA, which carries the information encoded in DNA
SCOP           Structural Classification of Proteins                        Structural classification of proteins
CATH           Class Architecture Topology Homologous superfamily           Protein structure classification with CATH
DDD            Dali Domain Dictionary                                       Dali domain dictionary
PDB            Protein Data Bank                                            Protein data bank
FSSP           Families of Structurally Similar Proteins                    Families of proteins with similar structure


LIST OF TABLES

Table 1.1 Contingency table for two binary objects x and y
Table 1.2 Example of dissimilarity for binary attributes
Table 2.1 Comparison of the centroid-based clustering algorithms
Table 3.1 Some resources for protein sequence classification
Table 3.2 Resources for protein structure classification
Table 3.3 The main levels of CATH

LIST OF FIGURES
Figure 1.1 Clustering of query vectors
Figure 1.2 Forming super-clusters
Figure 1.3 Different scales can lead to different clusters
Figure 2.1 Classification of clustering methods
Figure 2.2 Settings that determine the initial cluster boundaries
Figure 2.3 Computing the centroids of the new clusters
Figure 2.4 Example illustrating the K-means algorithm
Figure 2.5 Example illustrating the PAM algorithm
Figure 3.1 The central dogma of molecular biology
Figure 3.2 Structure of DNA
Figure 3.3 Types of protein structure
Figure 3.4 Common secondary structures of proteins
Figure 3.5 Two examples of membrane proteins
Figure 3.6 Growth of protein structure data
Figure 4.1 Input data
Figure 4.2 Interface for choosing the input file
Figure 4.3 Interface of the Filter Data tab
Figure 4.4 Interface of the Adjust Data tab
Figure 4.5 Interface of the K-Means tab, using K-means or K-medians for clustering
Figure 4.6 Output data


INTRODUCTION

1. REASONS FOR CHOOSING THE TOPIC
The rapid development of information technology, and in particular its application to the biological sciences, has greatly helped the study of molecular biology. This gave rise to bioinformatics, a still fairly new field that uses techniques from applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biology to solve biological problems.
The molecular basis of life rests on the activity of biological molecules: nucleic acids (DNA and RNA), carbohydrates, lipids, and proteins. Each of these plays an essential role, but proteins stand out because they are the main components that carry out the functions of the cell. The study of the structure of biological molecules has therefore emerged as a new direction, with experiments aimed at discovering those structures. This direction has advanced strongly through structural research whose goal is a comprehensive view of protein structure space. The information stored in protein structure data is a key to success: progress depends on the ability to organize and analyze the information contained in the databases and to integrate it with other efforts to resolve the mysteries of cell function.
Recognizing the practical value of this problem, and following the suggestion of my supervisor, I chose the topic "Data Clustering and Its Application in Protein Structure Classification".
2. RESEARCH OBJECTIVES
- Survey the theory of data clustering.
- Study several data clustering techniques that can be applied to protein structure classification.
- Study bioinformatics and related issues, and examine the methods of protein structure classification.
- Study and use the Cluster 3.0 software for protein structure classification.
3. RESEARCH METHODS
- Study the literature: books, e-books, articles, information on websites, and related documents.
- Analyze and synthesize the theory, and present data clustering together with several centroid-based clustering algorithms applied to protein structure classification.
- Study and use the Cluster 3.0 software, applying the K-means algorithm to classify protein structures.
4. OVERVIEW OF THE THESIS
The thesis consists of four chapters and a conclusion. It moves from basic concepts to the main topics studied in depth, so that the reader gets both an overview and a view of the main problems investigated:
- Chapter 1 - Overview: introduces the theory of data clustering.
- Chapter 2 - Data clustering techniques applied to protein structure classification.
- Chapter 3 - Bioinformatics and protein structure classification.
- Chapter 4 - Demo program with the Cluster 3.0 software.
- Conclusion - Summarizes the main content, the results obtained, and the directions for further research.


CHAPTER 1
THEORETICAL OVERVIEW OF DATA CLUSTERING

1.1. OVERVIEW OF DATA CLUSTERING

Clustering divides data into groups such that the objects in one group are similar to each other in some sense and different from the objects in other groups. Each group is called a cluster. Each object is described by a set of measurements or by its relationships to other objects. There are many definitions of a cluster; the following are the most widely used [4]:
- "A cluster is a set of objects that are similar to each other and different from objects not in the cluster."
- "A cluster is a set of points in space such that the distance between any two points in it is always smaller than the distance between any point in it and any point outside it."
- "Clusters can be described as connected regions of a multidimensional space containing a relatively high density of points, separated from other such regions by regions with a relatively low density of points."
Clustering plays an important role in many human activities, from health care and education to information processing and market analysis. It is widely used in applications such as pattern recognition, data analysis, image processing, market research, and classification in bioinformatics. In business, clustering helps market analysts find groups of customers with specific needs based on age, preferences, and consumer behavior. In biology, it can be used to classify plants and animals and to classify protein structures according to their inherent structural similarity, from which a protein data bank can be built. In information processing, clustering helps to group documents stored as text, on floppy disks, on hard disks, or on the Internet, and so helps to build and complete the enormous store of human knowledge.
As a data mining function, clustering can be used as a stand-alone tool to examine the characteristics of each cluster within the data distribution and to focus on a particular set of clusters for further analysis. It can also serve as a preprocessing step for other algorithms, such as classification and characterization, which then operate on clusters with different characteristics and properties.
1.2. CLUSTERING IN DATA CLASSIFICATION
Similar data items are grouped into clusters on the basis of some similarity measure. Each cluster is represented by its centroid, the mean feature vector of the cluster. At query time, the similarity between the query vector and each cluster (represented by its centroid) is computed. The clusters whose similarity to the query vector exceeds some threshold are selected. Then the similarity between the query vector and each feature vector in the selected clusters is computed, and the k nearest items are ranked and returned as the result.
For example, the feature vectors in Figure 1.1 are grouped into 11 clusters. During retrieval, the query vector is compared with each of the 11 cluster centroids in turn. If the centroid of cluster 2 turns out to be closest to the query vector, the distance between the query vector and each feature vector in cluster 2 is computed. The total number of distance computations required is then much smaller than the total number of feature vectors in the database.
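The retrieval procedure above can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: it assumes Euclidean distance, selects only the single nearest cluster (as in the Figure 1.1 example) instead of applying a similarity threshold, and uses made-up data; the names `cluster_search`, `centroid`, and `euclidean` are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_search(query, clusters, k=1):
    """Compare the query with every centroid, then only with the
    members of the nearest cluster. Returns the k nearest members
    and the number of distance computations performed."""
    centroids = [centroid(c) for c in clusters]
    best = min(range(len(clusters)),
               key=lambda i: euclidean(query, centroids[i]))
    ranked = sorted(clusters[best], key=lambda v: euclidean(query, v))
    comparisons = len(clusters) + len(clusters[best])
    return ranked[:k], comparisons

# Illustrative data: 3 clusters of 3 feature vectors each (9 vectors in total).
clusters = [
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[10.0, 10.0], [10.0, 11.0], [11.0, 10.0]],
    [[20.0, 0.0], [21.0, 0.0], [20.0, 1.0]],
]
results, comparisons = cluster_search([10.2, 10.1], clusters, k=2)
```

In this toy case the search needs 6 distance computations (3 centroids plus 3 members) instead of 9 for a full scan; the saving grows with the size of the database.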

Figure 1.1: Clustering of query vectors

In the cluster-based retrieval method above, similarity is computed between the query and each centroid, and then between the query and each feature vector in the selected cluster. When the total number of clusters is large, multi-level clustering is used to reduce the number of similarity computations between the query and the centroids. Similar clusters are grouped to form larger clusters (super-clusters). During retrieval, the query vector is first compared with the centroids of the super-clusters, then with the centroids of the clusters inside the chosen super-cluster, and finally with the feature vectors of the chosen sub-cluster. For the feature space in Figure 1.1, super-clusters can be formed as in Figure 1.2.
At query time, the query vector is compared with the centroid of each of the 4 super-clusters. If the centroid of super-cluster 1 is closest to the query vector, the query vector is then compared with the three sub-cluster centroids in super-cluster 1. In this two-level example, the total number of distance computations between the query vector and centroids (of super-clusters and sub-clusters) is 7 (4 + 3), fewer than the 11 computations needed with one-level clustering.
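The comparison count in this example can be checked with a few lines of Python. This is only a sketch: the text gives the number of super-clusters (4), the total number of clusters (11), and the size of super-cluster 1 (3 sub-clusters); the sizes of the other three super-clusters below are an assumption made so that the total is 11, and the function name is hypothetical.

```python
def centroid_comparisons(children_per_super, chosen):
    """Centroid distance computations in a two-level search:
    one per super-cluster, plus one per sub-cluster of the chosen super-cluster."""
    return len(children_per_super) + children_per_super[chosen]

# Hypothetical split of the 11 clusters into 4 super-clusters;
# only the first size (3) is stated in the text.
children_per_super = [3, 3, 3, 2]

flat = sum(children_per_super)                                   # one-level search
two_level = centroid_comparisons(children_per_super, chosen=0)   # two-level search
```

With these numbers `flat` is 11 and `two_level` is 7, matching the count in the text.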

Figure 1.2: Forming super-clusters

Clusters not only make retrieval efficient but also make browsing and navigation easier. For browsing and navigation, one representative item, whose feature vector is close to the centroid of its cluster, is displayed for each cluster. A user who is interested in a representative item can then look at the other items in that cluster.
Clustering techniques are used together with data structures to make search more efficient. Similar items are grouped into clusters, and the cluster centroids and/or the items within each cluster are organized in some data structure that supports efficient search.
1.3. REQUIREMENTS OF DATA CLUSTERING
Clustering is a challenging research field because its potential applications impose their own special requirements. The basic requirements of data clustering are as follows [1]:
- Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 objects, whereas a large database may contain millions of objects. Applying such algorithms to a large data set can distort the results. The question is how to develop clustering algorithms that scale well to large databases.
- Ability to handle different types of attributes: Many algorithms are designed to cluster interval (numeric) data. However, many applications require clustering other types of data, such as binary data, categorical data (nominal, unordered), ordinal data, or mixtures of these types.
- Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance. Algorithms based on such measures tend to find spherical clusters of similar density and size. A cluster, however, can have any shape, so it is important to develop algorithms that can discover clusters of arbitrary shape.
- Minimal domain knowledge needed to determine input parameters: Many clustering algorithms require the user to supply certain parameters for the cluster analysis (such as the desired number of clusters). The clustering result is often quite sensitive to these parameters. Many parameters are hard to determine, especially for data sets with a large number of objects. This not only burdens the user but also makes the quality of the clustering hard to control.
- Ability to handle noisy data: Most real databases contain outliers, erroneous data, unknown data, or incorrect data. Some clustering algorithms are sensitive to such data, which can lead to poor clustering quality.
- Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of the input data; for example, the same algorithm applied to the same data set presented in different orders may produce very different clusters. It is therefore important to develop algorithms that are insensitive to the order of the input.
- High dimensionality: A database or data warehouse can contain many dimensions or attributes. Many clustering algorithms work well on low-dimensional data involving only two or three dimensions, and people can judge the quality of a clustering visually only up to about three dimensions. Clustering data objects in a high-dimensional space is a challenge, especially when the data in such a space are highly skewed.
- Constraint-based clustering: Real applications may need to perform clustering under various kinds of constraints. Suppose the task is to choose locations for a number of automatic teller machines in a city. To decide this, households can be clustered while the city's rivers and highway networks, and the customer requirements of each area, are treated as constraints. The task is then to find groups of data that cluster well and also satisfy the constraints.
- Interpretability and usability: Users expect clustering results that are easy to understand, easy to interpret, and easy to use. That is, the clustering may need to be tied to a clear meaning and a clear application. It is important to study how the goal of the application influences the choice of clustering method.
1.4. DATA TYPES IN CLUSTERING
In clustering, data objects are usually represented by features, also called attributes (the terms "data types" and "attribute types" are treated as equivalent) [1]. These attributes are the parameters that make the clustering problem solvable, and they have a significant effect on the clustering result. Classifying the different attribute types is a problem that must be addressed for most data sets, because it provides a convenient means of recognizing the differences between data elements. There are two criteria for classification: domain size and measurement scale.
Let D be a database containing n objects in a k-dimensional space, and let x, y, z be objects in D: x = (x1, x2, ..., xk); y = (y1, y2, ..., yk); z = (z1, z2, ..., zk), where xi, yi, zi with i = 1, ..., k are the corresponding features or attributes of the objects x, y, z. The data types are then as follows [9].
1.4.1. Classifying data types by domain size
- Continuous attribute: its domain is uncountably infinite, that is, between any two values there are infinitely many other values (for example, color, temperature, or sound intensity).
- Discrete attribute: its domain is a finite or countable set (for example, count attributes). A special case of a discrete attribute is the binary attribute, whose domain has only two elements (for example, Yes/No, True/False, On/Off).
1.4.2. Classifying data types by measurement scale
- Nominal attribute: a generalization of the binary attribute in which the domain is discrete, unordered, and has more than two elements. If x and y are two values of the attribute, one can only determine whether x ≠ y or x = y.
- Ordinal attribute: a nominal attribute with an additional ordering, but whose values are not quantified. If x and y are two values of an ordinal attribute, one can determine whether x ≠ y, x = y, x > y, or x < y.
- Interval attribute: measures values on an approximately linear scale. With an interval attribute one can determine whether one value comes before or after another and by how much. If xi > yi, then x is separated from y by an interval of xi - yi with respect to the i-th attribute.
- Ratio attribute: an interval attribute whose values are measured relative to a meaningful reference point (a true zero).
Among the attributes above, nominal and ordinal attributes are together called categorical attributes, while interval and ratio attributes are called numeric attributes.
In addition, there is spatial data, a type of data with generalized numeric attributes in a multidimensional space. Spatial data describe information related to the space that contains the objects (for example, geometric information). Spatial data can be continuous or discrete.
- Continuous spatial data: cover a region of space.
- Discrete spatial data: may be a point in a multidimensional space, and allow the distance between data objects in the space to be determined.
Numeric attributes are usually measured in specific units such as kilograms or centimeters. Changing the units, however, affects the clustering result (for example, changing the unit of a height attribute from centimeters to inches can produce a different clustering). To overcome this, the data must be standardized, either by replacing each attribute with a unit-free numeric attribute or by assigning weights to the attributes.
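One common way to remove the effect of units is z-score standardization. The sketch below assumes this particular technique (the text does not name one) and uses made-up height values; the function name `standardize` is hypothetical.

```python
import statistics

def standardize(values):
    """Return z-scores: subtract the mean and divide by the population
    standard deviation. Assumes the values are not all identical."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# The same heights expressed in two different units (illustrative values).
heights_cm = [150.0, 160.0, 170.0, 180.0]
heights_in = [h / 2.54 for h in heights_cm]

z_cm = standardize(heights_cm)
z_in = standardize(heights_in)
```

After standardization the two lists agree up to floating-point rounding, so a distance computed on the standardized attribute no longer depends on whether the height was recorded in centimeters or inches.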
1.5. SIMILARITY AND DISTANCE MEASURES FOR THE DATA TYPES
1.5.1. The notions of similarity and dissimilarity
Once the features of the data have been determined, a suitable way must be found to determine the distance between objects, that is, a measure of data similarity. Such measures are functions that quantify the likeness of pairs of data objects; usually they compute either the similarity or the dissimilarity between the objects. The larger the value of a similarity function, the more alike the objects are, and conversely; a dissimilarity function varies inversely with the similarity function. Similarity and dissimilarity can be defined in many ways, and they are usually measured by the distance between objects. Every similarity measure depends on the type of attribute being analyzed. For categorical attributes, for example, an ordinary distance measure is not used; the comparison is instead based on another characterization of the data.
All the measures below are defined in a metric space. Every metric is a measure, but the converse is not true. To avoid confusion, the term measure here refers to a function that computes similarity or dissimilarity. A metric space is a set in which a distance is defined between every pair of elements, with the usual properties of geometric distance. That is, a set X of data objects from the database D mentioned above (its elements may be arbitrary objects) is called a metric space if:
- For every pair of elements x, y in X, a real number $\delta(x, y)$, called the distance between x and y, is determined by some rule.
- The rule satisfies the following properties:
+ $\delta(x, y) > 0$ if $x \ne y$;
+ $\delta(x, y) = 0$ if $x = y$;
+ $\delta(x, y) = \delta(y, x)$ for all x, y;
+ $\delta(x, y) \le \delta(x, z) + \delta(z, y)$.
The function $\delta(x, y)$ is called a metric of the space. The elements of X are called the points of this space.
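The four properties can be checked numerically on a finite sample of points. The sketch below is an illustration only: passing the check on a sample does not prove that a function is a metric, but a single failure shows that it is not. The function name `is_metric_on` and the sample points are assumptions made for the example.

```python
import itertools
import math

def is_metric_on(points, dist, tol=1e-12):
    """Check positivity, identity, symmetry and the triangle inequality
    for every combination of the given sample points."""
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if x == y and abs(d) > tol:
            return False          # delta(x, x) must be 0
        if x != y and d <= 0:
            return False          # delta(x, y) must be > 0 for x != y
        if abs(d - dist(y, x)) > tol:
            return False          # symmetry
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:
            return False          # triangle inequality
    return True

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]

def euclid(x, y):
    return math.dist(x, y)

def squared_euclid(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

euclid_ok = is_metric_on(points, euclid)
squared_ok = is_metric_on(points, squared_euclid)
```

Here the Euclidean distance passes, while the squared Euclidean distance fails the triangle inequality (for example 25 > 2 + 13 for the points (0, 0), (3, 4) and (1, 1)). It is therefore a dissimilarity measure but not a metric, which illustrates the remark that not every measure is a metric.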
1.5.2. Interval attributes
An important component of a clustering algorithm is the distance measure between two data points. If the components of the data vectors are all expressed in the same unit, the plain Euclidean distance may be enough to group similar data. However, Euclidean distance does not always give correct results. Figure 1.3 illustrates an example in which the height and width of an object are measured in the same physical unit but on different scales.

Figure 1.3: Different scales can lead to different clusters

Note, however, that this is not only a graphical issue. The problem arises from the mathematical formula used to combine the distances between the individual components of the feature vectors into a single distance measure that can be used for clustering: different formulas lead to different clusters.
Clustering algorithms need a distance or similarity measure between two objects. Domain knowledge must be used to formulate a distance measure appropriate to each application. The following distance measures are in common use [8]:
- The Minkowski distance is defined as

$$\mathrm{dist}_q(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}, \quad q \ge 1,$$

where x and y are two objects with n attributes, $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$; $\mathrm{dist}_q(x, y)$ is the distance between them, and q is a positive integer.
- The Euclidean distance:

$$\mathrm{dist}_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

is the Minkowski distance between two objects in the case q = 2.
- The Manhattan distance:

$$\mathrm{dist}_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

is the Minkowski distance between two objects in the special case q = 1.
- The Chebyshev distance:

$$\mathrm{dist}_\infty(x, y) = \max_{i = 1, \ldots, n} |x_i - y_i|$$

is the limiting case $q = \infty$. It is useful for treating objects as dissimilar when they differ in just one dimension.
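The four distances above can be written directly from their definitions. The following Python sketch uses hypothetical function names and a made-up pair of points; it is meant only to show how the special cases relate to the general Minkowski formula.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q >= 1) between equal-length vectors."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def chebyshev(x, y):
    """Limit of the Minkowski distance as q grows without bound."""
    return max(abs(a - b) for a, b in zip(x, y))

x = (0.0, 0.0)
y = (3.0, 4.0)

manhattan = minkowski(x, y, 1)    # |3| + |4|
euclid = minkowski(x, y, 2)       # sqrt(3^2 + 4^2)
cheb = chebyshev(x, y)            # max(|3|, |4|)
large_q = minkowski(x, y, 50)     # already very close to the Chebyshev value
```

For this pair the Manhattan distance is 7, the Euclidean distance is 5, and the Chebyshev distance is 4; the value for q = 50 is within a small fraction of 4, which illustrates why the Chebyshev distance is described as the case $q = \infty$.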
1.5.3. Binary attributes
A binary attribute is an attribute with exactly two possible values, such as "True" and "False". Binary variables fall into two kinds: symmetric and asymmetric. In a symmetric binary variable, the two values are equally important; an example is "male/female". A symmetric binary variable is a nominal variable. In an asymmetric variable, one of the values carries more importance than the other. For example, "yes" stands for the presence of a certain attribute and "no" for its absence.
If p nominal variables are considered, the similarity of the cases can be assessed by the number of variables that have the same value. A new binary variable can be defined from each nominal variable by grouping the nominal labels into two classes, one labeled 1 and the other labeled 0. The attributes of two objects x and y are then described by binary variables with values 0 and 1, and the possible outcomes are summarized in a contingency table [8]:

Table 1.1 Contingency table for two binary objects x and y

             y = 1     y = 0     Sum
x = 1          a         b       a + b
x = 0          c         d       c + d
Sum          a + c     b + d       p

a is the number of attributes with value 1 in both objects x and y
b is the number of attributes with value 1 in x and value 0 in y
c is the number of attributes with value 0 in x and value 1 in y
d is the number of attributes with value 0 in both objects x and y
p is the total number of attributes of the two objects, p = a + b + c + d
Symmetric distance measure for binary variables:

$$d(x, y) = \frac{b + c}{a + b + c + d}$$

Asymmetric distance measure for binary variables:

$$d(x, y) = \frac{b + c}{a + b + c}$$
Similarity between cases described by binary attributes can be measured in the following ways [8]:
+ The Jaccard coefficient:

$$d(x, y) = \frac{a}{a + b + c}$$

This coefficient ignores the 0-0 matches. It is used when attributes with value 1 carry much more weight than attributes with value 0, that is, when the binary attributes are asymmetric.
+ Further coefficients of the same kind:

$$d(x, y) = \frac{a}{p}$$

$$d(x, y) = \frac{a}{b + c}$$

$$d(x, y) = \frac{a}{2a + b + c}$$
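Example of dissimilarity for binary attributes:
Table 1.2 Example of dissimilarity for binary attributes

Name   Gender   Fever   Cough   Test 1   Test 2   Test 3   Test 4
A      Male     Yes     No      Good     Bad      Bad      Bad
B      Female   Yes     No      Good     Bad      Good     Bad
C      Male     Yes     Yes     Bad      Bad      Bad      Bad

- Gender is a symmetric attribute.
- The remaining attributes are asymmetric binary attributes.
- The values Yes and Good are set to 1, and the values Bad and No are set to 0.
Using the asymmetric distance measure on the six asymmetric attributes:

$$d(A, B) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(A, C) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(B, C) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$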
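The three values above can be reproduced with a short Python sketch. The function names are hypothetical; the vectors encode the six asymmetric attributes of Table 1.2 (fever, cough, tests 1 to 4) with Yes/Good as 1 and No/Bad as 0, as stated in the example.

```python
def contingency(x, y):
    """Return the counts (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def symmetric_dissimilarity(x, y):
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)

def asymmetric_dissimilarity(x, y):
    # Undefined (division by zero) if the two vectors share no 1 and differ nowhere.
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)

def jaccard(x, y):
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c)

# Fever, cough, test 1, test 2, test 3, test 4
A = [1, 0, 1, 0, 0, 0]
B = [1, 0, 1, 0, 1, 0]
C = [1, 1, 0, 0, 0, 0]

d_ab = round(asymmetric_dissimilarity(A, B), 2)
d_ac = round(asymmetric_dissimilarity(A, C), 2)
d_bc = round(asymmetric_dissimilarity(B, C), 2)
```

Note that the Jaccard coefficient and the asymmetric dissimilarity always sum to 1, since their numerators a and b + c together make up the common denominator a + b + c.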
1.5.4. Nominal attributes
The dissimilarity between two objects x and y is defined as follows [8]:

$$d(x, y) = \frac{p - m}{p}$$

where m is the number of attributes on which x and y match, and p is the total number of attributes.
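As a small illustration, the formula can be computed by counting matching positions. The attribute values below are invented for the example and the function name is hypothetical.

```python
def nominal_dissimilarity(x, y):
    """(p - m) / p, where m counts the attributes with identical values."""
    p = len(x)
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return (p - m) / p

# Two objects described by four nominal attributes (illustrative values).
x = ("red", "circle", "small", "wood")
y = ("red", "square", "small", "metal")

d_xy = nominal_dissimilarity(x, y)   # 2 of 4 attributes match
```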
1.5.5. Ordinal attributes
The dissimilarity between data objects with ordinal attributes is computed as follows [8]:
Suppose i is an ordinal attribute with $M_i$ values ($M_i$ is the size of its domain). The $M_i$ states are ordered as $[1 \ldots M_i]$, so each value of the attribute can be replaced by its rank $r_i$, with $r_i \in \{1, \ldots, M_i\}$.
Different ordinal attributes have domains of different sizes, so they must be mapped onto the common range [0, 1] by applying the following transformation to each attribute, where $r_i^{(j)}$ is the rank of the value of attribute i in object j:

$$z_i^{(j)} = \frac{r_i^{(j)} - 1}{M_i - 1}$$

The dissimilarity formula for interval attributes is then applied to the values $z_i^{(j)}$; the result is the dissimilarity for the ordinal attributes.
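The rank transformation can be sketched as follows. The ordered levels ("low", "medium", "high") and the function name are assumptions made for the example; the attribute must have at least two levels, otherwise the denominator is zero.

```python
def rank_normalize(value, ordered_levels):
    """Map an ordinal value to [0, 1] using (r - 1) / (M - 1),
    where r is its 1-based rank and M the number of levels (M >= 2)."""
    M = len(ordered_levels)
    r = ordered_levels.index(value) + 1
    return (r - 1) / (M - 1)

levels = ["low", "medium", "high"]            # hypothetical ordinal attribute
z = [rank_normalize(v, levels) for v in levels]

# Interval-style dissimilarity on the normalized values
d_low_high = abs(rank_normalize("low", levels) - rank_normalize("high", levels))
```

The three levels are mapped to 0, 0.5 and 1, so the lowest and highest levels are at distance 1 regardless of how many levels the attribute has.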
1.5.6. Ratio attributes
There are several ways to compute the similarity between ratio attributes. One of them is to apply a logarithm to each attribute value $x_i$, for example $q_i = \log(x_i)$; $q_i$ then plays the role of an interval attribute. This logarithmic transformation is appropriate when the attribute values grow exponentially.
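A brief sketch shows why the logarithm suits exponentially growing values. The natural logarithm and the sample values are assumptions made for the illustration; any base gives the same qualitative effect, and all values must be positive.

```python
import math

def log_transform(values):
    """Replace each positive ratio-scaled value x by q = log(x)."""
    return [math.log(v) for v in values]

growth = [1.0, 10.0, 100.0, 1000.0]   # each value is 10 times the previous one
q = log_transform(growth)
gaps = [q[i + 1] - q[i] for i in range(len(q) - 1)]
```

On the original scale the gaps are 9, 90 and 900; after the transformation every gap equals log(10), so the transformed attribute behaves like an evenly spaced interval attribute.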
In practice, when computing a similarity measure, one may consider only a subset of the characteristic attributes of the data, or assign weights to all the attributes. In some cases the measurement units of the attributes are removed by standardizing them, or each attribute is given a weight based on its mean or standard deviation. These weights can be used in the distance measures above. For example, when each attribute has been assigned a weight $w_i$ ($1 \le i \le n$), the similarity of the data is determined as follows [8]:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$$
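The weighted distance is a one-line extension of the Euclidean distance. In the sketch below the function name, the points and the weights are illustrative choices, not values from the thesis.

```python
import math

def weighted_euclidean(x, y, w):
    """sqrt(sum_i w_i * (x_i - y_i)^2) for equal-length x, y and
    non-negative weights w."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

x = (0.0, 0.0)
y = (3.0, 4.0)

d_equal = weighted_euclidean(x, y, (1.0, 1.0))    # ordinary Euclidean distance
d_first = weighted_euclidean(x, y, (1.0, 0.0))    # second attribute ignored
```

With equal unit weights the result is the ordinary Euclidean distance (5 here); setting a weight to 0 removes that attribute from the comparison (giving 3 here), which corresponds to considering only a subset of the attributes.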

It is possible to convert between the models for the data types above; for example, categorical data can be converted to binary data, and vice versa. However, this approach is computationally expensive and should be weighed carefully before it is applied. In summary, different similarity models can be used depending on the data at hand. Determining a similarity measure that is appropriate, accurate, and objective is very important, because it contributes to a clustering algorithm that is effective both in quality and in computational cost.
1.6. CHAPTER CONCLUSION
Chapter 1 presented the basic concepts of clustering, the practical requirements of data clustering, the types of data attributes used in clustering, several common distance measures, and the common attribute types handled by clustering algorithms.


CHAPTER 2
DATA CLUSTERING TECHNIQUES APPLIED TO
PROTEIN STRUCTURE CLASSIFICATION

2.1. INTRODUCTION

Clustering techniques have many approaches and many practical applications. All of them aim at two general goals: the quality of the clusters discovered and the running speed of the algorithm.
The diagram below classifies the data clustering methods [3]:

Figure 2.1 Classification of clustering methods

The methods fall into two main groups: hierarchical clustering and centroid-based clustering (K-means, PAM, CLARA, CLARANS).
Within its scope, this thesis studies in depth only some of the centroid-based clustering algorithms, with the aim of applying them to protein structure classification.
2.2. THE K-MEANS ALGORITHM
K-means is a clustering algorithm that partitions objects around cluster centers. The method is based on the distance between the data objects in a cluster and a point that is regarded as the center of that cluster. It therefore needs an initial set of cluster centers, and it then repeats two steps: assign each object to the cluster whose center is nearest, and recompute the center of each cluster from the new assignment. The process stops when the cluster centers converge.
In the K-means method, a value k is chosen and k data objects are then selected at random as centers. The distance between each data object and the mean of each cluster is computed to find the most similar cluster, and the object is added to that cluster. From these assignments the new cluster means are computed, and the process is repeated until every data object belongs to one of the k clusters and the assignment no longer changes.

Figure 2.2: Settings for determining the initial cluster boundaries


The goal of the K-means algorithm is to produce k clusters {C1, C2, ..., Ck} from a data set of n objects in d-dimensional space, Xi = {xi1, xi2, ..., xid}, i = 1, ..., n, such that the criterion function

\[ E = \sum_{i=1}^{k} \sum_{x \in C_i} (x - m_i)^2 \]

reaches its minimum [8].
Here mi is the mean (centroid) of cluster Ci and x is a data point representing an object. The centroid of a cluster is a vector in which each component is the arithmetic mean of the corresponding components of the data vectors in that cluster. The input parameter of the algorithm is the number of clusters k, and its output is the set of cluster centroids. The distance measure most commonly used between data objects is the Euclidean distance. The criterion function and the distance measure can be specified more precisely depending on the application or on the user's point of view.

Figure 2.3 Computing the centroids of the new clusters


Basic steps of the K-means algorithm

Input: the number of clusters k and the cluster centroids {m_j}, j = 1, ..., k
Output: the clusters C[i] (1 ≤ i ≤ k) for which the criterion function E reaches its minimum.

Begin
1. Initialization
Choose k initial centroids {m_j}, j = 1, ..., k, in the space R^d (d is the number of dimensions of the data). The choice can be random or based on experience.
2. Distance computation
For each point Xi (1 ≤ i ≤ n), compute its distance to each centroid mj (1 ≤ j ≤ k), then find the centroid nearest to the point.
3. Centroid update
For each j (1 ≤ j ≤ k), update the cluster centroid mj as the arithmetic mean of the data vectors assigned to that cluster.
Stopping condition:
Repeat steps 2 and 3 until the cluster centroids no longer change.
End
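The steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the thesis: the function names `kmeans` and `criterion`, the `max_iter` safeguard, the choice to keep the old centroid of an empty cluster, and the eight sample points are all assumptions introduced for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-means on tuples of floats; returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members)
                                           for col in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        if new_centroids == centroids:  # stopping condition: centroids unchanged
            break
        centroids = new_centroids
    return centroids, labels

def criterion(points, centroids, labels):
    """Criterion E: sum of squared distances from each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

# Illustrative two-dimensional points (not the data of Figure 2.4).
points = [(1.0, 3.0), (1.5, 3.2), (1.3, 2.8), (3.0, 3.0),
          (9.0, 4.0), (8.5, 4.1), (9.2, 3.8), (8.0, 5.0)]
centroids, labels = kmeans(points, [(1.0, 3.0), (9.0, 4.0)])
E = criterion(points, centroids, labels)
```

The initial centroids are passed in explicitly, which matches the "Input" line of the pseudocode; a random choice among the data points would work equally well for step 1.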
Example: suppose that 12 points (n = 12) in two-dimensional space are to be divided into two clusters (k = 2). First, two points are chosen at random, one for each cluster, say (1,3) and (9,4) (the highlighted points in Figure 2.4.a). Point (1,3) is taken as the center of cluster 1 and point (9,4) as the center of cluster 2. The distances from the other points to these two points are computed and each remaining point is assigned to one of the two clusters: the points in the lighter shade go to cluster 1 and the points in the darker shade go to cluster 2 (Figure 2.4.b). The centers of the two clusters are then adjusted; the highlighted points in Figure 2.4.c are the new centers. The distances from the points to the new centers are recomputed and the points are reassigned (Figure 2.4.d). The centers are adjusted again, and the procedure is repeated until nothing changes any more. The result at that point is the output of the problem.

Figure 2.4 Illustration of the K-means algorithm [8]


If n is the number of objects in the database, k the number of clusters and t the number of iterations of the algorithm, with k, t << n, the complexity of the K-means algorithm is O(tkn) [10].
The algorithm is relatively efficient on large databases, and it has the advantages of being clear and easy to implement. Its drawbacks are that the number of clusters must be specified in advance and that a center must be definable for the data being clustered. The algorithm is also unsuitable for discovering clusters with non-convex shapes. Many improvements can be added to K-means to make it more effective, such as changing how the initial samples are chosen or how the criterion is computed.
Later algorithms such as PAM and CLARA are all improvements on the K-means algorithm.
2.3. THE PAM ALGORITHM
PAM is an extension of K-means designed to handle noisy data and outliers effectively. PAM uses representative objects (medoids) to represent the clusters; a representative object is the most centrally located object within its cluster [8]. A representative object is therefore little affected by objects that lie far from the center, whereas the centroids of K-means are strongly influenced by such distant points. Initially, PAM selects k representative objects and distributes the remaining objects to the clusters so that each object is assigned to the representative it is most similar to. Let Oj be a non-representative object and Om a representative object. Oj is said to belong to the cluster represented by Om if d(Oj, Om) = min_Oe d(Oj, Oe), where d(Oj, Oe) is the dissimilarity between Oj and Oe and the minimum is taken over all representative objects Oe of the clusters. The quality of each discovered cluster is measured by the average dissimilarity between an object and the representative object of its cluster; in other words, the quality of the clustering is judged through the quality of all the representative objects. The dissimilarity is defined by a distance measure, and PAM is applied to spatial data. To determine the representatives, PAM starts by selecting k arbitrary representative objects. After each step, PAM tries to swap a representative object Om with a non-representative object Op, provided the swap improves the quality of the clustering; the process ends when the clustering quality no longer changes. Clustering quality is evaluated through a criterion function and is best when the criterion function reaches its minimum. PAM computes the value Cjmp for every object Oj as the basis for swapping Om and Op, where:
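Om: the current representative object to be replaced;
Op: the new representative object that replaces Om;
Oj: a data object (not a representative) that may be moved to another cluster;
Oj,2: the current representative object, other than Om, that is closest to Oj;
TCmp: the total cost of swapping the current representative Om with the new representative Op, that is, the sum of Cjmp over all non-representative objects Oj.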
Steps of the PAM algorithm
Input: a data set of n objects and the number of clusters k.
Output: k clusters such that the quality of the partition is best.
BEGIN
1. Select k arbitrary representative objects;
2. Compute TCmp for all pairs of objects Om, Op, where Om is a representative object and Op is a non-representative object;
3. Select the pair Om, Op with the minimum TCmp. If this TCmp is negative, replace Om with Op and go back to step 2. If it is not negative, go to step 4;
4. For each non-representative object, find the representative object most similar to it and label the object with that cluster.
END.
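A compact Python sketch of these steps is given below. It is an illustration under stated assumptions, not the thesis implementation: the names `total_cost` and `pam` are hypothetical, the first k points serve as the arbitrary initial representatives, and TCmp is computed by brute force as the difference between the total cost after and before a swap rather than by accumulating the individual Cjmp terms.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Swap-based PAM; returns (medoid indices, labels)."""
    medoids = list(range(k))  # step 1: arbitrary initial representatives
    while True:
        current = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        # Step 2: evaluate TC_mp for every (representative, non-representative) pair.
        for i, p in product(range(k), range(len(points))):
            if p in medoids:
                continue
            candidate = medoids[:i] + [p] + medoids[i + 1:]
            delta = total_cost(points, candidate) - current
            if delta < best_delta:
                best_delta, best_swap = delta, candidate
        # Step 3: apply the best swap only while it lowers the total cost.
        if best_swap is None:
            break
        medoids = best_swap
    # Step 4: label each point with its most similar representative.
    labels = [min(range(k), key=lambda j: math.dist(q, points[medoids[j]]))
              for q in points]
    return medoids, labels

# Illustrative data: two small groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoids, labels = pam(points, 2)
```

Recomputing the full cost for every candidate swap is simple but slow; it is acceptable here only because the example data set is tiny.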
- Example of the PAM algorithm

Figure 2.5 Illustration of the PAM algorithm [8]


If n is the number of objects in the database, k the number of clusters and t the number of iterations of the algorithm, with k, t << n, the complexity of the PAM algorithm is O(tk(n-k)^2) [10].
2.4. THE CLARA ALGORITHM
The CLARA algorithm was proposed to overcome the weakness of PAM when k and n are large [8]. CLARA draws a sample from the data set of n objects, applies PAM to that sample and finds the representative (central) objects of the sample. If the sample is drawn at random, its representatives approximate the representatives of the whole original data set. To obtain a better approximation, CLARA draws several samples, clusters each of them, and then selects the best clustering among the results. For accuracy, the quality of the clusters is measured by the average dissimilarity of all objects in the original data set, not only of the objects in the sample. Experimental results indicate that 5 samples of size 40 + 2k give good results. The steps of the CLARA algorithm are as follows:
Input: the number of objects in the database n, the number of clusters k, the sample size s.
Output: k clusters of the best quality.
BEGIN
1. For i = 1 to 5 do
2. Draw a random sample of 40 + 2k data objects from the data set and apply the PAM algorithm to this sample to find the representative objects of the clusters.
3. For each object Oj in the original data set, determine the most similar of the k representative objects.
4. Compute the average dissimilarity of the partition obtained in the previous step. If this value is smaller than the current minimum, use it as the new minimum; the set of k representative objects found in this iteration is then the best found so far.
5. Go back to step 1.
END
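The loop above can be sketched in Python as follows. The sketch makes several assumptions that are not in the thesis: the helper names (`cost`, `pam`, `clara`), a simplified first-improvement version of PAM that works directly on points, a fixed random seed for reproducibility, a sample size capped at n for small data sets, and two synthetic grid-shaped clusters as test data.

```python
import math
import random

def cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (medoids are points)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(sample, k):
    """Simplified swap-based PAM returning k medoid points chosen from `sample`."""
    medoids = sample[:k]
    improved = True
    while improved:
        improved = False
        best = cost(sample, medoids)
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                c = cost(sample, trial)
                if c < best:  # accept any swap that lowers the cost
                    best, medoids, improved = c, trial, True
    return medoids

def clara(points, k, n_samples=5, seed=0):
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)
    best_medoids, best_avg = None, math.inf
    for _ in range(n_samples):                     # step 1
        sample = rng.sample(points, size)          # step 2
        medoids = pam(sample, k)
        avg = cost(points, medoids) / len(points)  # steps 3-4, on the full data set
        if avg < best_avg:
            best_medoids, best_avg = medoids, avg
    labels = [min(range(k), key=lambda j: math.dist(p, best_medoids[j]))
              for p in points]
    return best_medoids, labels, best_avg

# Illustrative data: two 5 x 6 grids of points, far apart from each other.
cluster_a = [(float(i % 5), float(i // 5)) for i in range(30)]
cluster_b = [(x + 100.0, y + 100.0) for x, y in cluster_a]
medoids, labels, avg = clara(cluster_a + cluster_b, k=2)
```

The key point is that PAM only ever sees the sample, while the average dissimilarity used to compare samples is computed over all n objects.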
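Here a small portion of the data is selected to stand for the whole data set instead of using all of the data, and the representative objects are then chosen from this sample using PAM. If the sample is drawn at random, it should be representative of the original data set, and the representative objects (medoids) chosen from it should be similar to those that would have been chosen from the entire data set. CLARA draws several samples of the data set, applies PAM to each sample, and returns the best clustering as its output. As a result, CLARA can handle larger data sets than PAM.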

If n is the number of objects in the database, k the number of clusters and s the size of the extracted sample, with k, s << n, the complexity of the CLARA algorithm is O(ks^2 + k(n-k)) [10].
2.5. THE CLARANS ALGORITHM
The k-means and k-medoids algorithms are usually applied to small and medium-sized databases of only a few hundred to a few thousand objects. Because the demand for clustering spatial data kept growing, in 1994 Raymond T. Ng and Jiawei Han proposed a k-medoids-style clustering algorithm that is more efficient than the earlier ones. The algorithm is called CLARANS (Clustering Large Applications based on RANdomized Search). CLARANS clusters by randomly searching sets of k objects that serve as the centers of k clusters. At each search step the quality of the current set is evaluated, and the best result found is kept [2].
The authors carry out the search and evaluate the quality of each candidate by constructing a search graph as follows. For n objects to be divided into k clusters, the graph is named G_{n,k}. Each node of G_{n,k} is a set of k objects {O_{m1}, ..., O_{mk}}, meaning that each object O_{mi} is the center of one cluster. The node set of the graph is

\[ \{ \{O_{m_1}, \ldots, O_{m_k}\} \mid O_{m_1}, \ldots, O_{m_k} \in \text{database} \} \]

Two nodes of the graph are called neighbours if they differ in exactly one object. That is, for S1 = {O_{m1}, ..., O_{mk}} and S2 = {O_{w1}, ..., O_{wk}}, S1 and S2 are neighbours if and only if |S1 ∩ S2| = k - 1. Each node therefore has k(n-k) neighbours. By the definition of the graph, each node is one way of choosing the k centers of the k clusters. Each node is assigned a weight equal to the total distance from all objects to their corresponding centers, and this weight is used to evaluate the quality of each choice.

The CLARANS algorithm requires two input parameters: numlocal (the number of local minima to be found) and maxneighbour (the number of neighbouring nodes to be examined).
The algorithm proceeds as follows:
First, a node is chosen at random as the current node. A random neighbour of the current node is drawn and the weights of the two nodes are compared; if the neighbour has the smaller weight, it becomes the current node. This is repeated until maxneighbour consecutive neighbours fail to improve on the current node; the current node then has the lowest weight found, which is called a local minimum. The local minimum is compared with a value mincost; if it is smaller, mincost is set to the local minimum and the current node is saved in bestnode. The whole process is repeated numlocal times. The result is contained in bestnode.
The CLARANS algorithm is described in detail as follows:
1. Set i = 1, mincost = +∞.
2. Choose a random node of G_{n,k} as the current node, current.
3. Set j = 1.
4. Choose a random neighbour S of current. Compute the weight of S.
5. If S has a smaller weight than current, set S as the current node and return to step 3.
6. Otherwise, set j = j + 1. If j ≤ maxneighbour, return to step 4.
7. Otherwise, when j > maxneighbour, compare the weight of current with mincost; if it is smaller than mincost, set mincost to the weight of current and set bestnode = current.
8. Set i = i + 1. If i > numlocal, output bestnode and stop; otherwise return to step 2.
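The eight steps can be sketched in Python as shown below. This is a hedged illustration: the function names `node_cost` and `clarans`, the default parameter values, the fixed seed, and the six sample points are assumptions made for the example, and a node is represented simply as a list of k point indices.

```python
import math
import random

def node_cost(points, medoids):
    """Weight of a node: total distance from every point to its nearest center."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbour=50, seed=0):
    rng = random.Random(seed)
    n = len(points)
    mincost, bestnode = math.inf, None                 # step 1
    for _ in range(numlocal):                          # steps 1 and 8: numlocal searches
        current = rng.sample(range(n), k)              # step 2
        current_cost = node_cost(points, current)
        j = 1                                          # step 3
        while j <= maxneighbour:
            # Step 4: a neighbour differs from `current` in exactly one center.
            out = rng.randrange(k)
            candidates = [i for i in range(n) if i not in current]
            neighbour = current[:]
            neighbour[out] = rng.choice(candidates)
            cost = node_cost(points, neighbour)
            if cost < current_cost:                    # step 5
                current, current_cost = neighbour, cost
                j = 1
            else:                                      # step 6
                j += 1
        if current_cost < mincost:                     # step 7
            mincost, bestnode = current_cost, current
    return bestnode, mincost

# Illustrative data: two small groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
bestnode, mincost = clarans(points, k=2, numlocal=2, maxneighbour=200)
```

Because neighbours are drawn at random, a search may stop at a node that still has a better neighbour; a larger maxneighbour makes this less likely at the price of more cost evaluations.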
CLARANS is considered effective on large databases, and choosing the parameters numlocal and maxneighbour is not complicated: different values can be tried until they suit the database under study.
CLARANS nevertheless has some limitations. First, it needs the whole database to be loaded into main memory. Second, its running time can be very poor in many cases: the best case is O(kn^2) and the worst case is O(n^k), where n is the number of objects and k the number of clusters, with k << n [10].
2.6. CHAPTER SUMMARY
This chapter presented several centroid-based algorithms for data clustering. The table below evaluates and compares the algorithms in terms of input parameters, cluster structure and complexity [10]:

Table 2.1 Comparison of the centroid-based clustering algorithms

Algorithm | Input parameters                               | Optimized for                                                          | Cluster shape | Complexity
K-means   | Number of clusters                             | Partitioning into clusters                                             | Spherical     | O(tkn)
PAM       | Number of clusters                             | Partitioning into clusters; small data sets                            | Spherical     | O(tk(n-k)^2)
CLARA     | Number of clusters                             | Large data sets                                                        | Spherical     | O(ks^2 + k(n-k))
CLARANS   | Number of clusters, number of neighbour objects | Spatial data; better clustering quality than PAM and CLARA             | Spherical     | O(kn^2)


CHAPTER 3
BIOINFORMATICS AND PROTEIN STRUCTURE CLASSIFICATION

3.1. OVERVIEW OF BIOINFORMATICS


Bioinformatics is a new and important science born from the combination of two disciplines, computer science and biology. Its main objects of study are biological macromolecules [15].
The main biological macromolecules are proteins, nucleic acids, lipids and polysaccharides. The most important of these are nucleic acids, which store genetic information, and proteins, the expression of living matter. Proteins are built from 20 types of amino acids and have characteristic spatial structures. Their functions are very diverse: they take part in building cells, catalyze metabolic reactions, recognize foreign molecules, and take part in life processes, as actin and myosin do in muscle movement. Nucleic acids are built from 4 types of nucleotides (adenine, thymine, cytosine and guanine) and come in two kinds: DNA and RNA. A DNA molecule is a double helix formed by two complementary strands. RNA is a single-stranded molecule and comes in three kinds: mRNA carries the information coding for a protein, rRNA is a component of the ribosome, and tRNA takes part in translation.
This section gives an overview of the objects studied in bioinformatics, the structure and function of proteins, and the methods of protein structure classification.
3.1.1. The central dogma of molecular biology
Chromosomes are DNA macromolecules that contain a great many genes, the basic material and the units that carry hereditary function. A gene is a sequence of many nucleotides, and it carries the information that determines the structure of a protein. The human genome is very large, but only about 10% of it has been deciphered. The genome includes many regions that have not been deciphered and whose function is not yet fully understood. All living organisms contain a great many proteins.
The transcription of DNA into RNA and the translation of RNA into protein is regarded as the central dogma of molecular biology [2].

Figure 3.1 The central dogma of molecular biology


3.1.2. DNA (DeoxyriboNucleic Acid)
A DNA molecule is a double helix made of two single strands, each of which is a chain of nucleotides. Each nucleotide consists of a phosphate group, the sugar deoxyribose and one of four bases (adenine - A, cytosine - C, guanine - G and thymine - T). The two strands are held together by hydrogen bonds formed between complementary bases on the two strands: A is complementary to T, and C is complementary to G. Each strand is a directed sequence. The two strands of the double helix run in opposite directions and are therefore called antiparallel strands.

Figure 3.2 Structure of DNA


3.1.3. RNA (RiboNucleic Acid)

The RNA molecule has a structure similar to that of DNA. There are three differences between DNA and RNA:
- The RNA molecule is a single strand.
- The pentose of the RNA molecule is ribose instead of deoxyribose.
- Thymine, one of the four bases that make up DNA, is replaced by uracil in RNA.
There are 3 main types of RNA in the cell:
- Messenger RNAs (mRNA): copies of specific sequences on the DNA. They act as intermediaries that carry the information encoded in the DNA molecule to the machinery that decodes it into the corresponding protein molecule.
- Transfer RNAs (tRNA): they carry the required amino acids to the translation machinery so that the protein can be synthesized from the corresponding mRNA.
- Ribosomal RNAs (rRNA): they account for up to 80% of the total RNA in the cell. rRNAs combine with specialized proteins to form the ribosome, a component of the cell's translation machinery.
3.1.4. Protein
Amino acids are the basic building blocks of proteins; about 20 main types of amino acids take part in building proteins. Amino acids are joined together by peptide bonds. A peptide bond is formed when the amine group of one amino acid combines with the carboxyl group of the next amino acid. A peptide is a chain of consecutive amino acids (fewer than 30); a longer chain is called a polypeptide. The word "protein" refers to a complex structure in space, not merely to the sequence of amino acids.
Proteins have four levels of organization [2]:
- Primary structure (primary protein structure): the order of the amino acids in the polypeptide chain.
- Secondary structure (secondary protein structure): arises from the folding of parts of the polypeptide chain into regular spatial structures (α helices or β sheets).
- Tertiary structure (tertiary protein structure): describes how those helices and sheets combine into a three-dimensional shape in space.
- Quaternary structure (quaternary protein structure): the organization of several polypeptide chains into one protein molecule.

Figure 3.3 The levels of protein structure


3.1.5. Types of protein
Protein structures cover a wide range of sizes and shapes. They can be divided into three main groups [5]: fibrous proteins, membrane proteins and globular proteins.
- Fibrous proteins are elongated molecules in which the secondary structure is the dominant structure. Because they are insoluble in water, they play a structural role and take part in movement (muscle movement). Fibrous proteins often (but not always) have regular repeating structures. Keratin, for example, found in hair and nails, is a helix with a repeating structure. In collagen, the main protein component of connective tissue, every third residue is a glycine and many of the others are proline.

Figure 3.4 Common secondary structures of proteins [5].


- Membrane proteins are confined to the membrane that surrounds the cell and many of its organelles. These proteins cover a wide range of sizes and shapes, including globular proteins anchored in the membrane. Their function is usually to ensure the transport of ions and small molecules such as nutrients across the cell membrane. The membrane proteins whose complete structures are known can be divided into 2 main types: all-α structures such as bacteriorhodopsin, and all-β structures such as porin (see Figure 3.5).

Figure 3.5 Two examples of membrane proteins [5].

As of October 2004, there were 158 membrane protein structures in the Protein Data Bank (PDB), of which 86 were unique.
- Globular proteins have non-repetitive sequences. They range in size from 100 to several hundred residues and adopt a unique compact structure. In globular proteins, the non-polar amino acid side chains tend to cluster together to form a hydrophobic core, while the hydrophilic side chains remain accessible to the solvent on the outside of the "glob". In the tertiary structure, β strands are usually paired in parallel or antiparallel arrangements to form β sheets. On average, a protein main chain has about 25% of its residues in α helices and 25% in β strands; the remaining residues adopt less regular structural arrangements.
3.2. METHODS OF PROTEIN STRUCTURE CLASSIFICATION
In 1960, myoglobin and hemoglobin, the first two structures determined at the molecular level using X-rays, were found to have similar structures even though their sequences differ [5]. The two proteins have similar functions, being involved in the storage and the transport of oxygen, respectively. Since then, researchers have kept searching for structural and functional similarities between proteins that cannot be detected from sequence information. A logical consequence of this interest is the development of protein structure classification systems, which identify and group proteins that share similar structures in order to uncover evolutionary relationships.

Figure 3.6 Growth of protein structure data [5]


Protein structure classification has become essential because of the volume of structural data now available (see Figure 3.6). In parallel with the development of protein structure classification methods, many protein sequence classification methods have been developed and described.
Table 3.1 lists a number of resources for protein sequence classification.

Name                     | Description                                                                     | Website
Pfam                     | Clustering of protein sequences at the domain level                             | http://www.sanger.ac.uk/Software/Pfam/
PRINTS                   | Fingerprint information on protein sequences                                    | http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
PROSITE                  | Sequence definitions                                                            | http://www.expasy.org/prosite/
TIGRFAMS                 | Protein family database                                                         | http://www.tigr.org/TIGRFAMs/
PRODOM, BLOCKS, eMOTIF   | Protein domain database; complex alignment blocks from PRINTS and BLOCKS        | http://protein.toulouse.inra.fr/prodom.html, http://blocks.fhcrc.org/, http://motif.stanford.edu/emotif/
CluSTr, COGS, ProtoMap   | Protein cluster relationships                                                   | http://www.ebi.ac.uk/clustr/
TRIBES                   | Protein family database                                                         | http://maine.ebi.ac.uk:8000/services/tribes
PIR international        | Protein sequence database                                                       | http://pir.georgetown.edu/
InterPro                 | Database of protein families and domains                                        | http://www.ebi.ac.uk/interpro/
(Resources for protein sequence classification [5].)

All structure classification methods are based on a similar systematic arrangement: protein structures are divided into discrete globular domains, which are then classified at level 1 (class), level 2 (fold), level 3 (superfamily) and level 4 (family). The differences between the existing schemes come from the methods used to identify the domains and from the classification procedures. Setting aside the schemes that rely on sequence to define a classification, there are 3 main protein structure classifications: SCOP, CATH and the DALI Domain Dictionary (DDD). Links to the databases and related services are listed in Table 3.2.
Table 3.2 Resources for protein structure classification [5]

Name                     | Description                                                                                        | Website
SCOP                     | Structural Classification of Proteins: manual                                                      | http://scop.mrc-lmb.cam.ac.uk/scop/index.html
CATH                     | Class, Architecture, Topology, Homology: semi-automatic classification of protein structures       | http://www.biochem.ucl.ac.uk/bsm/cath
DALI Fold Classification | Automatic classification of domains using DALI; replaces FSSP                                      | http://www.bioinfo.biocenter.helsinki.:8080/dali/index.html
ASTRAL                   | Database and tools for analyzing structures, derived from SCOP                                     | http://astral.berkeley.edu/
HOMSTRAD                 | 3D structures of homologous proteins                                                               | http://www-cryst.bioc.cam.ac.uk/data/align

The first complication in structure classification is the fact that protein structures often consist of several distinct globular domains. Because these domains can act independently, with distinct roles and functions, proteins are usually divided into domains before classification.
How to identify and delimit these domains remains an open problem, as noted above. It is important to realize that the current domain-splitting algorithms are not always correct, and that differences in how domains are defined account for some of the differences between structure classification methods, which do not share a single definition.

Once proteins have been divided into domains, the domains are classified hierarchically. At the top level of the classification is usually the class of the protein domain, which is generally determined from its overall composition in secondary structure elements. There are three main classes of protein domains: mainly α domains, mainly β domains, and mixed α and β domains (the domains in this last class are sometimes subdivided into domains with alternating α/β secondary structures and domains with mixed α+β secondary structures). Within each class, the domains are divided into folds according to their geometry. A fold is identified from the number, arrangement and connectivity of the secondary structure elements of the domain. Folds are divided into superfamilies; a superfamily contains protein domains with similar functions, often in the absence of detectable sequence similarity. Sequence information defines the families, that is, the subclasses of a superfamily that group domains whose sequences are nearly identical.
Classification schemes can be manual or automatic. A manual classification relies on human supervision, guided by computer analysis, to define the similarity between protein structures and to group them. An automatic classification relies solely on the results of computer procedures to define similarity; these results are then processed automatically to generate the groups. The advantage of the former approach is the high quality of the resulting clusters; its disadvantage is that it is hard to scale to large volumes of data. Automatic procedures, by contrast, can generate and evaluate classifications on large data sets, but they may lack accuracy when judging structural similarity. The three most common protein structure classifications are:
- SCOP, which is almost entirely manual;
- the DALI Domain Dictionary, which is fully automatic;
- CATH, which lies between the two and uses automatic procedures supplemented by human intervention.
3.2.1. Structure classification with SCOP
SCOP is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. One motivation for this classification is to determine the evolutionary relationships between proteins. Proteins with similar shapes but little sequence or functional similarity are placed in different "superfamilies" and are assumed to have only a very distant common ancestor. Proteins with the same shape and some similarity of sequence and/or function are placed in the same "family" and are assumed to have a closer common ancestor.
The SCOP database is freely accessible on the Internet. SCOP was created in 1994. It is maintained by Alexei G. Murzin and his colleagues at the Laboratory of Molecular Biology in Cambridge, England. The current version of SCOP is 1.75, released in June 2009 [12].
The protein structures are taken from the Protein Data Bank. The unit of classification in SCOP is the protein domain.
The shapes of domains are called "folds" in SCOP. Domains belonging to the same fold have the same major secondary structures in the same arrangement with the same topological connections. There are 1195 folds in SCOP version 1.75. A short description is given for each fold. For example, the "globin-like" fold is described as core: 6 helices; folded leaf, partly opened. The fold of a domain is determined by inspection rather than by software.
The SCOP levels are as follows:
- Class: types of folds, for example β sheets.
- Fold: the different shapes of domains within a class.
- Superfamily: the domains in a fold are divided into superfamilies, which have at least a distant common ancestor.
- Family: the domains in a superfamily are divided into families, which have a more recent common ancestor.
- Protein domain: the domains in a family are divided into protein domains, which are essentially the same protein.
- Species: the domains in a protein domain are grouped according to species.
- Domain: a part of a protein. For simple proteins, it can be the entire protein.
3.2.2. Structure classification with CATH
The CATH Protein Structure Classification is a semi-automatic, hierarchical classification of protein domains published in 1997 by Christine Orengo, Janet Thornton and their colleagues [13].
CATH shares many broad features with its main rival, SCOP; however, there are also many areas in which the detailed classifications differ greatly.
The four main levels of the CATH hierarchy are as follows:

Table 3.3 The main levels of CATH

No. | Level                     | Description
1   | Class                     | The overall secondary-structure content of the domain
2   | Architecture              | High structural similarity but no evidence of homology. Equivalent to a fold in SCOP
3   | Topology                  | A large-scale grouping of topologies that share particular structural features
4   | Homologous (superfamily)  | Indicative of a demonstrable evolutionary relationship. Equivalent to the superfamily level of SCOP

CATH defines four classes: mostly alpha, mostly beta, alpha and beta, and few secondary structures.
To understand the CATH classification system better, it is useful to know how it is built: much of the work is done by automatic methods, but there are important manual elements in the classification.
The first step is to separate the proteins into domains. It is difficult to give an unambiguous definition of a domain, and this is one of the areas in which CATH and SCOP differ.
The domains are sorted automatically into classes (C) and clustered on the basis of sequence similarity. These groups form the homologous level (H) of the classification. The topology level (T) is formed by structural comparison of the homologous groups. Finally, the Architecture level (A) is assigned manually.
Classification at the class level is based on 4 criteria:
1. Secondary structure content;
2. Secondary structure contacts;
3. Secondary structure alternation score;
4. Percentage of parallel strands.
3.2.3. Structure classification with the Dali Domain Dictionary (DDD)
DDD, also called the Dali domain classification, is characterized by fully automatic definition and classification of domains. When two protein structures are compared, Dali computes a similarity measure, the S-score. The mean and standard deviation of the S-scores over all pairs of proteins are then evaluated. Shifting the S-scores by their mean and rescaling them by their standard deviation yields the statistically meaningful Z-scores.
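The shift-and-rescale step can be written as Z = (S − mean) / standard deviation. The short Python sketch below illustrates only this arithmetic; the function name `z_scores`, the use of the population standard deviation and the sample S-scores are assumptions made for the example and are not taken from the Dali program.

```python
import statistics

def z_scores(s_scores):
    """Shift scores by their mean and rescale by their standard deviation."""
    mean = statistics.mean(s_scores)
    std = statistics.pstdev(s_scores)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("all scores are identical; Z-scores are undefined")
    return [(s - mean) / std for s in s_scores]

scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up S-scores
z = z_scores(scores)
```

For these made-up scores the mean is 5 and the standard deviation is 2, so a raw score of 9 corresponds to a Z-score of 2.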
The Dali program was originally used to build the database of Families of Structurally Similar Proteins (FSSP) [5]. In FSSP, pairwise comparisons were carried out between the proteins of a representative set in which no two proteins have a sequence identity greater than 25%. For each member of the representative set, a file was generated containing all structure pairs with a Z-score greater than 2.0. A similar procedure produces a complete classification of all protein domains in the PDB90 database: the DDD. PDB90 is a representative subset of the PDB in which no 2 chains share more than 90% sequence identity. An average-linkage hierarchical clustering produces a fold tree covering the PDB90 data. The tree is cut at Z-scores of 2, 4, 8, 16, 32 and 64, which yields a six-character index for each domain. The first level (Z > 2) is used as an operational definition of folds. The lower levels must not be confused with the superfamily and family levels of CATH and SCOP, because they are not based on direct functional or evolutionary relationships. Both FSSP and DDD are updated continuously, which is possible because they are derived from a fully automatic procedure.
3.3. CHAPTER SUMMARY
This chapter gave an overview of several related concepts in bioinformatics: DNA, RNA, proteins and the levels of protein organization. It also gave a general introduction to several protein structure classification methods: SCOP, a manual classification; CATH, a semi-automatic classification combined with expert analysis; and DDD, a fully automatic classification.


CHAPTER 4
DEMONSTRATION PROGRAM WITH THE CLUSTER 3.0 SOFTWARE

4.1. THE CLUSTER 3.0 SOFTWARE


Cluster was written in Borland C++ Builder by Michael Eisen, Stanford University, USA [7]. The K-means algorithm is used and enhanced in the Cluster 3.0 program. The program runs on Windows, Mac OS, Linux and Unix.
4.1.1. Hardware requirements
- A computer with a Pentium IV processor or later.
- Minimum configuration 256 MB of RAM; recommended configuration 1024 MB of RAM.
- At least 1 GB of free hard disk space.
4.1.2. Demo data for the program
The demo data file, which is used in all of the examples:
- http://rana.lbl.gov/downloads/data/demo.txt
- http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/demo.txt.
Download the data and select it when running Cluster 3.0. The program then displays full information about the input data file.
4.1.3. Using the clustering library
Partitioning algorithms divide the items into k clusters such that the sum of the distances from the items to their cluster centers is minimal. The number of clusters k is specified by the user. Two algorithms from the C Clustering Library are used: K-means and K-medians [6].
These algorithms differ in how the cluster center is defined. In K-means clustering, the cluster center is defined as the mean data vector averaged over all items in the cluster. In K-medians clustering, the median is calculated instead of the mean for each dimension of the data vector.

4.2. USING THE K-MEANS AND K-MEDIANS ALGORITHMS

The K-means and K-medians algorithms are commonly used to find a partition into k groups. The first step of the algorithm is to create k clusters and assign the items (genes or microarrays) to them at random.
The following steps are then iterated:
- Calculate the centroid of each cluster;
- For each item, determine which cluster centroid is closest;
- Reassign the item to that cluster.
The iteration stops when no further reassignments take place.
Because the initial assignment of items to clusters is random, a different clustering solution is usually found each time the algorithm is executed. To find the optimal clustering solution, the K-means algorithm is repeated many times, each time starting from a different random initial clustering. The sum of the distances from the items to their cluster centers is saved for each run, and the solution with the smallest value of this sum is returned as the overall clustering solution.
How many times the algorithm should be run depends on the number of items being clustered. As a guide, one can look at how often the optimal solution was found. This number is returned by the partitioning algorithms implemented in the library. If the optimal solution was found many times, it is unlikely that a better solution exists. However, if the optimal solution was found only once, there may well be other solutions with a smaller sum of distances.
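The restart strategy described above can be sketched in Python as follows. This is not the library's code: the function names (`centroids_of`, `run_once`, `canonical`, `best_of`), the round-robin random initialization, the `max_iter` safeguard, the fixed seed and the sample points are assumptions chosen to keep the example short. The counter `ifound` mirrors the idea of reporting how often the best solution was found.

```python
import math
import random

def centroids_of(points, labels, k):
    """Mean vector of each of the k clusters (every cluster must be non-empty)."""
    cents = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        cents.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return cents

def run_once(points, k, rng, max_iter=100):
    """One K-means run from a random initial assignment; returns (labels, total)."""
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):  # deal items round-robin: no empty cluster
        labels[idx] = rank % k
    for _ in range(max_iter):
        cents = centroids_of(points, labels, k)
        new = [min(range(k), key=lambda c: math.dist(p, cents[c])) for p in points]
        if new == labels or len(set(new)) < k:  # converged, or a cluster would empty
            break
        labels = new
    cents = centroids_of(points, labels, k)
    total = sum(math.dist(p, cents[lab]) for p, lab in zip(points, labels))
    return labels, total

def canonical(labels):
    """Renumber clusters by first appearance so that equal partitions compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)

def best_of(points, k, npass, seed=0):
    rng = random.Random(seed)
    best_labels, best_total, ifound = None, math.inf, 0
    for _ in range(npass):
        labels, total = run_once(points, k, rng)
        labels = canonical(labels)
        if total < best_total - 1e-12:   # strictly better solution: reset the counter
            best_labels, best_total, ifound = labels, total, 1
        elif labels == best_labels:      # same partition found again
            ifound += 1
    return best_labels, best_total, ifound

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best_labels, best_total, ifound = best_of(points, k=2, npass=10)
```

On data as well separated as this, every pass ends in the same partition, so `ifound` equals the number of passes; on harder data a low `ifound` signals that more passes may be worthwhile.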
4.2.1. Initialization
The K-means algorithm is initialized by randomly assigning the items (genes or microarrays) to clusters. To make sure that no empty cluster is produced, the binomial distribution is used to choose at random the number of items in each cluster, which is one or more. The cluster assignments are then randomly permuted among the items so that each item has an equal probability of being in any cluster. Each cluster is thus guaranteed to contain at least one item.
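4.2.2. Finding the cluster centroid
The centroid of a cluster can be defined in several ways. For K-means, the centroid of a cluster is defined as the mean over all items in the cluster, computed for each dimension separately.
The C Clustering Library provides routines to calculate the cluster mean and the cluster median [6].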
4.2.3. Finding the cluster mean or the cluster median
The routine getclustercentroids calculates the centroids of the clusters by computing the mean or the median over all items in a cluster, for each dimension separately [6]. Missing data values are not included in the calculation of the mean or median. Missing values in the cluster centroids are indicated in the array cmask. If, for cluster i, the data values for dimension j are missing for all items, then cmask[i][j] (or cmask[j][i] if transpose = 1) is set to 0. Otherwise it is set to 1. The argument method determines whether the means or the medians are calculated. If a memory allocation error occurs, getclustercentroids returns 0; this can happen only when the cluster medians are being calculated. If no error occurs, getclustercentroids returns 1.
Declaration
int getclustercentroids(int nclusters, int nrows, int ncolumns, double** data,
int** mask, int clusterid[], double** cdata, int** cmask, int transpose, char
method);
with the following parameters:
- int nclusters: the number of clusters.
- int nrows: the number of rows in the data matrix, equal to the number of genes in the experiment.
- int ncolumns: the number of columns in the data matrix, equal to the number of microarrays in the experiment.
- double** data: the array containing the gene data of the experiment. Genes are stored by row, microarrays by column. Size: [nrows][ncolumns].
- int** mask: this array shows which elements of the data array, if any, are missing. If mask[i][j] = 0, then data[i][j] is missing. Size: [nrows][ncolumns].
- int clusterid[]: the cluster number of each element. Each entry must lie between 0 and nclusters-1. Size: [nrows] if transpose = 0, or [ncolumns] if transpose = 1.
- double** cdata: this matrix stores the centroid information. Space for this matrix must be allocated before calling getclustercentroids. Size: [nclusters][ncolumns] if transpose = 0 (clustering of rows), or [nrows][nclusters] if transpose = 1 (clustering of columns).
- int** cmask: this matrix shows which values in cdata are missing. If cmask[i][j] = 0, then cdata[i][j] is missing. Space for this matrix must be allocated before calling getclustercentroids. Size: [nclusters][ncolumns] if transpose = 0 (clustering of rows), or [nrows][nclusters] if transpose = 1 (clustering of columns).
- int transpose: this flag indicates whether rows (genes) or columns (microarrays) are being clustered. If transpose = 0, rows (genes) are clustered; otherwise columns (microarrays) are clustered.
- char method: if method = 'a', the cluster centroids are computed as the arithmetic mean over each dimension. With method = 'm', the cluster centroids are computed as the median over each dimension.
Return value: this routine returns 1 on success and 0 on failure.
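The handling of missing values described for getclustercentroids can be illustrated with a small, self-contained sketch of the mean case for row clustering. It is not the library routine: the name masked_cluster_means is hypothetical, and the matrices are stored as flat row-major arrays purely to keep the example short, whereas the library uses double** and int**.

```c
#include <assert.h>

/* Mean of every column within each cluster, skipping missing values.
 * data and mask are nrows x ncols; cdata and cmask are nclusters x ncols.
 * All matrices are flat row-major arrays (an assumption of this sketch). */
void masked_cluster_means(int nclusters, int nrows, int ncols,
                          const double data[], const int mask[],
                          const int clusterid[],
                          double cdata[], int cmask[])
{
    int i, j, k;
    for (i = 0; i < nclusters * ncols; i++) {
        cdata[i] = 0.0;
        cmask[i] = 0;              /* used as a counter first */
    }
    for (k = 0; k < nrows; k++) {
        i = clusterid[k];
        assert(i >= 0 && i < nclusters);
        for (j = 0; j < ncols; j++) {
            if (mask[k * ncols + j] != 0) {
                cdata[i * ncols + j] += data[k * ncols + j];
                cmask[i * ncols + j]++;
            }
        }
    }
    for (i = 0; i < nclusters * ncols; i++) {
        if (cmask[i] > 0) {
            cdata[i] /= cmask[i];
            cmask[i] = 1;          /* centroid value is present */
        }
    }
}
```

A centroid entry whose cmask stays 0 had no observed value in that cluster and dimension, so its cdata value (0.0) should be ignored by the caller.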


4.2.4. Finding the optimal solution with K-means and K-medians
The optimal solution is found by running the EM (expectation-maximization) algorithm repeatedly and keeping the best clustering that it returns. This is done automatically by calling the routine kcluster. The routine that computes the cluster centroids and the distance function are selected from the arguments passed to kcluster.
kcluster also counts how often the EM algorithm found the best solution. If it was found many times, we can assume that no other solution with a smaller sum of distances exists. If, however, it was found only once, a better solution may well exist [6].
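Counting how often a solution reappears requires treating two clusterings as equal when they differ only in the numbering of the clusters. The kmeans listing in the appendix does this with its mapping array. The sketch below isolates that idea; the function name same_partition is hypothetical and the two-directional check is a choice made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Return 1 if labelings a and b (values in 0..k-1) describe the same
 * partition of n elements up to renaming of the cluster numbers,
 * 0 if they differ, and -1 if memory could not be allocated. */
int same_partition(int k, int n, const int a[], const int b[])
{
    int i, same = 1;
    int *a_to_b = malloc(2 * (size_t)k * sizeof(int));
    int *b_to_a;
    if (!a_to_b) return -1;
    b_to_a = a_to_b + k;           /* second half of the same buffer */
    for (i = 0; i < 2 * k; i++) a_to_b[i] = -1;
    for (i = 0; i < n && same; i++) {
        assert(a[i] >= 0 && a[i] < k && b[i] >= 0 && b[i] < k);
        /* Record the first pairing seen for each label... */
        if (a_to_b[a[i]] == -1) a_to_b[a[i]] = b[i];
        if (b_to_a[b[i]] == -1) b_to_a[b[i]] = a[i];
        /* ...and reject any later pairing that contradicts it. */
        if (a_to_b[a[i]] != b[i] || b_to_a[b[i]] != a[i]) same = 0;
    }
    free(a_to_b);
    return same;
}
```

For example, the labelings {0,0,1,2} and {2,2,0,1} group the elements identically and count as the same solution, while {0,0,1,2} and {0,1,1,2} do not.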
Routine: kcluster
void kcluster (int nclusters, int nrows, int ncolumns, double** data, int**
mask, double weight[], int transpose, int npass, char method, char dist, int
clusterid[], double* error, int* ifound);
where:
- int nclusters: the number of clusters k.
- int nrows: the number of rows in the data matrix, equal to the number of genes in the experiment.
- int ncolumns: the number of columns in the data matrix, equal to the number of microarrays in the experiment.
- double** data: the array containing the gene data. Genes are stored by row, microarrays by column. Size: [nrows][ncolumns].
- int** mask: this array shows which elements of the data array, if any, are missing. If mask[i][j] = 0, then data[i][j] is missing. Size: [nrows][ncolumns].
- double weight[]: the weights used when computing the distance. Size: [ncolumns] if transpose = 0, or [nrows] if transpose = 1 (one weight per data dimension, matching ndata in the appendix listing).
- int transpose: a flag that selects whether genes or microarrays are clustered. If transpose = 0, rows (genes) are clustered; if transpose = 1, columns (microarrays) are clustered.
- int npass: the number of times the EM algorithm is run. If npass = 0, the EM algorithm is run once, starting from the initial clustering given in clusterid. To prevent empty clusters, the last remaining element of a cluster is never reassigned to another cluster.
- char method: 'a' selects K-means clustering, 'm' selects K-medians clustering.
- char dist: the distance function to be used.
- int clusterid[]: this array stores the cluster number assigned to each element by the clustering algorithm. Space for clusterid must be allocated before calling kcluster. If npass = 0, the contents of clusterid on input are used as the initial assignment of elements to clusters; on output, clusterid contains the clustering solution found by the EM algorithm. Size: [nrows] if transpose = 0, or [ncolumns] if transpose = 1.
- double* error: the sum of the distances of all elements to their cluster centers after clustering. It can be used as a criterion to compare the solutions produced by different calls to kcluster.
- int* ifound: returns how often the optimal clustering solution was found. If the input arguments are inconsistent (specifically, if nclusters is larger than the number of elements to be clustered), *ifound is set to 0. If a memory error occurs, *ifound is set to -1.
4.3. DEMO SOFTWARE
4.3.1. Program input
The input of the clustering program is a text file with the following structure:
UNIQID (string or number): the identifier column; it contains a unique identifier for each gene.
NAME (string): the name column, a text description of each gene that is used in the display.
EWEIGHT (real number): a weight for each experiment, which can be used to count certain experiments more than others.
GWEIGHT (real number): a similar weight for each gene, used when clustering arrays.
GORDER (real number): a value used for ordering nodes in the display program.
EXPID (string, for example EXP1, EXP2, ...): a text description of each experiment that is used in the display.
DATA (real number): the data for each gene in each experiment. Missing values are accepted.

Figure 4.1. Input data

The input file can be opened in Excel to inspect it in more detail.


4.3.2. User interface of the program's main functions

- Input file selection screen:

Figure 4.2. Input file selection screen

- Filter Data tab:

Figure 4.3. Filter Data tab


The Filter Data tab removes genes that lack certain desired properties from the data set. The properties currently available for filtering are:
% Present >= X: removes all genes that have missing values in more than (100 - X) percent of the columns.
SD (Gene Vector) >= X: removes all genes whose observed values have a standard deviation smaller than X.
At least X observations with abs(Val) >= Y: removes all genes that do not have at least X observations with an absolute value of Y or more.
MaxVal - MinVal >= X: removes all genes whose maximum value minus minimum value is smaller than X.
Note: pressing the Filter button does not immediately apply the filters to the data set. The program first reports how many genes pass the filters. To accept the result, press Accept; otherwise no change is made.
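Two of the filters listed above, "% Present >= X" and "MaxVal - MinVal >= X", can be sketched in C for a single gene (one row of the data matrix). This is an illustration of the rules as stated in the text, not code taken from Cluster 3.0; the function names percent_present, value_range and gene_passes are hypothetical, and mask[j] = 0 marks a missing value as in the library.

```c
#include <assert.h>

/* Percentage of non-missing observations in one gene (row). */
double percent_present(int n, const int mask[])
{
    int j, present = 0;
    assert(n > 0);
    for (j = 0; j < n; j++) if (mask[j]) present++;
    return 100.0 * present / n;
}

/* MaxVal - MinVal over the non-missing observations; 0 if there are none. */
double value_range(int n, const double values[], const int mask[])
{
    int j, seen = 0;
    double lo = 0.0, hi = 0.0;
    for (j = 0; j < n; j++) {
        if (!mask[j]) continue;
        if (!seen || values[j] < lo) lo = values[j];
        if (!seen || values[j] > hi) hi = values[j];
        seen = 1;
    }
    return hi - lo;
}

/* Keep a gene only if it passes both thresholds. */
int gene_passes(int n, const double values[], const int mask[],
                double min_present, double min_range)
{
    return percent_present(n, mask) >= min_present
        && value_range(n, values, mask) >= min_range;
}
```

A gene with values {1, 5, 9, 2} whose third value is missing is 75% present and has a range of 4, so it passes thresholds of 75 and 4 but fails if either threshold is raised.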
- Adjust Data tab:

Figure 4.4. Adjust Data tab


From the Adjust Data tab, several basic transformations can be applied to the input data. These operations are:
- Log Transform Data: replaces every data value x with log2(x).
- Center genes [mean or median]: subtracts the row mean or median from the values in each row of the data, so that the mean or median of each row is 0.
- Center arrays [mean or median]: subtracts the column mean or median from the values in each column of the data, so that the mean or median of each column is 0.
- Normalize genes: multiplies all values in each row of the data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row).
- Normalize arrays: multiplies all values in each column of the data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column).
The result depends on the order in which these operations are applied, so the order should be considered carefully before applying them. The operations are applied in the following order:
+ Log transform all values.
+ Center rows by subtracting the mean or median.
+ Normalize rows.
+ Center columns by subtracting the mean or median.
+ Normalize columns.
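The centering step can be illustrated with a short C sketch that subtracts either the mean or the median from one row. It is not taken from Cluster 3.0; the name center_row is hypothetical, and missing values are ignored here for brevity (the program itself accepts them).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Subtract the mean (use_median == 0) or the median (use_median != 0)
 * from every value in row. Returns 1 on success, 0 if out of memory. */
int center_row(int n, double row[], int use_median)
{
    int j;
    double center = 0.0;
    assert(n > 0);
    if (use_median) {
        /* Sort a copy so the caller's order is preserved. */
        double *tmp = malloc((size_t)n * sizeof(double));
        if (!tmp) return 0;
        memcpy(tmp, row, (size_t)n * sizeof(double));
        qsort(tmp, (size_t)n, sizeof(double), cmp_double);
        center = (n % 2) ? tmp[n / 2]
                         : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
        free(tmp);
    } else {
        for (j = 0; j < n; j++) center += row[j];
        center /= n;
    }
    for (j = 0; j < n; j++) row[j] -= center;
    return 1;
}
```

For the row {1, 2, 9}, mean centering gives {-3, -2, 5} while median centering gives {-1, 0, 7}, which shows why the choice between the two options matters for skewed data.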
- k-Means tab, which clusters with K-means or K-medians:

Figure 4.5. k-Means tab, using K-means and K-medians


In the two panels, Genes and Arrays, the user can set the number of clusters k and the number of runs of the algorithm, as shown in the figure. The program also lets the user choose between K-means and K-medians and select the similarity measure for each option, for example the Euclidean distance or the city-block distance.
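The two distance measures named above can be sketched in C as follows. The sketch averages over the dimensions that are present in both vectors and omits the square root in the Euclidean case, which appears to be the convention of the C Clustering Library [6]; treat that normalisation, and the function names euclid_sketch and cityblock_sketch, as assumptions of this illustration.

```c
#include <assert.h>

/* Mean squared difference over the dimensions present in both vectors.
 * Returns 0.0 when no dimension is shared (a choice made for this sketch). */
double euclid_sketch(int n, const double x[], const int xmask[],
                     const double y[], const int ymask[])
{
    int i, used = 0;
    double sum = 0.0;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (xmask[i] && ymask[i]) {
            double d = x[i] - y[i];
            sum += d * d;
            used++;
        }
    }
    return used ? sum / used : 0.0;
}

/* Mean absolute difference over the dimensions present in both vectors. */
double cityblock_sketch(int n, const double x[], const int xmask[],
                        const double y[], const int ymask[])
{
    int i, used = 0;
    double sum = 0.0;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (xmask[i] && ymask[i]) {
            double d = x[i] - y[i];
            sum += (d < 0.0) ? -d : d;
            used++;
        }
    }
    return used ? sum / used : 0.0;
}
```

Because the city-block measure does not square the differences, a single large deviation influences it less than it influences the Euclidean measure, which is one reason it is commonly paired with K-medians.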
4.3.3. Program output file
The output is a text file (which can also be opened in Excel) in which genes with similar structure are clustered and ordered according to the program's input parameters.

Figure 4.6. Output data


CONCLUSION AND FUTURE WORK

CONCLUSION
In this thesis I have studied the following topics:
- The basic theory of data clustering and several centroid-based clustering algorithms, with application to protein structure classification.
- An introduction to proteins, their structure and function, and several methods for classifying protein structure.
- The use of the Cluster 3.0 program to demonstrate data clustering with the K-means and K-medians algorithms. The program uses the C Clustering Library to apply K-means and K-medians, to find cluster means and cluster medians, and to perform the clustering based on these two algorithms.
The goal was to study how clustering algorithms can be applied to protein structure classification. The program nevertheless has several limitations:
- It only accepts .txt input files (which can be opened in Excel) and does not yet use the protein formats of the PDB protein data bank for classification.
- It is limited to two algorithms, K-means and K-medians.
FUTURE RESEARCH DIRECTIONS
In future work I plan to study protein classification, covering both sequence classification and structure classification.
I also plan to study protein data banks and to use them as a data source in the program.
Protein databases are very large and cover many species, so algorithms with better processing speed are needed.


REFERENCES

Vietnamese
[1] Luu Thi Hai Yen (2008), Master's thesis: Research on developing a multimedia system based on data clustering, Thai Nguyen University, Faculty of Information Technology.
[2] Nguyen Q.D., Tran D.H., Tran T.T.B., Pham T.H. (2011), Protein function prediction using data clustering, IT special issue, Journal of Science, Hanoi National University of Education, 56, 3-16.
English
[3] A.K. Jain, M.N. Murty and P.J. Flynn (1999), Data Clustering: A Review, ACM Computing Surveys.
[4] A.K. Jain, Richard C. Dubes (1988), Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, New Jersey 07632.
[5] Patrice Koehl (2006), Protein Structure Classification, Department of Computer Science and Genome Center, University of California, Davis, California.
[6] Michiel Jan Laurens de Hoon (2010), The C Clustering Library for cDNA microarray data, The University of Tokyo, Institute of Medical Science, Human Genome Center, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan.
[7] Michael Eisen; updated by Michiel de Hoon (2002), Cluster 3.0 Manual, Human Genome Center, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan.
[8] Micheline Kamber and Jiawei Han (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, University of Illinois at Urbana-Champaign, 500 Sansome Street, Suite 400, San Francisco, CA 94111.
[9] Michael R. Anderberg (1973), Cluster analysis for applications, Academic Press.
[10] Periklis Andritsos (2002), Data Clustering Techniques, University of Toronto, Department of Computer Science.
Internet sources:
[11] http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
[12] http://en.wikipedia.org/wiki/Structural_Classification_of_Proteins_database
[13] http://en.wikipedia.org/wiki/CATH
[14] http://rana.lbl.gov/EisenSoftware.htm
[15] http://www.agbiotech.com.vn/vn/?mnu=preview&key=475


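APPENDIX

/* Excerpt from the C Clustering Library [6]. The listings below need
 * <stdlib.h> (malloc, free), <float.h> (DBL_MAX) and <limits.h> (INT_MAX).
 * The helpers median, setmetric, randomassign, makedatamask and
 * freedatamask are defined elsewhere in the library. In a single source
 * file, getclustermeans and getclustermedians must be declared before
 * getclustercentroids. */

/* ******** FUNCTION GETCLUSTERCENTROIDS ******** */

int getclustercentroids(int nclusters, int nrows, int ncolumns,
  double** data, int** mask, int clusterid[], double** cdata, int** cmask,
  int transpose, char method)
{ switch(method)
  { case 'm':
    { const int nelements = (transpose==0) ? nrows : ncolumns;
      double* cache = malloc(nelements*sizeof(double));
      if (!cache) return 0; /* memory allocation failed */
      getclustermedians(nclusters, nrows, ncolumns, data, mask, clusterid,
                        cdata, cmask, transpose, cache);
      free(cache);
      return 1;
    }
    case 'a':
    { getclustermeans(nclusters, nrows, ncolumns, data, mask, clusterid,
                      cdata, cmask, transpose);
      return 1;
    }
  }
  return 0; /* unknown method */
}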

/* ********** PROCEDURE GETCLUSTERMEANS ********** */

static void getclustermeans(int nclusters, int nrows, int ncolumns,
  double** data, int** mask, int clusterid[], double** cdata, int** cmask,
  int transpose)
{ int i, j, k;
  if (transpose==0)
  { for (i = 0; i < nclusters; i++)
    { for (j = 0; j < ncolumns; j++)
      { cmask[i][j] = 0;
        cdata[i][j] = 0.;
      }
    }
    for (k = 0; k < nrows; k++)
    { i = clusterid[k];
      for (j = 0; j < ncolumns; j++)
      { if (mask[k][j] != 0)
        { cdata[i][j]+=data[k][j];
          cmask[i][j]++;
        }
      }
    }
    for (i = 0; i < nclusters; i++)
    { for (j = 0; j < ncolumns; j++)
      { if (cmask[i][j]>0)
        { cdata[i][j] /= cmask[i][j];
          cmask[i][j] = 1;
        }
      }
    }
  }
  else
  { for (i = 0; i < nrows; i++)
    { for (j = 0; j < nclusters; j++)
      { cdata[i][j] = 0.;
        cmask[i][j] = 0;
      }
    }
    for (k = 0; k < ncolumns; k++)
    { i = clusterid[k];
      for (j = 0; j < nrows; j++)
      { if (mask[j][k] != 0)
        { cdata[j][i]+=data[j][k];
          cmask[j][i]++;
        }
      }
    }
    for (i = 0; i < nrows; i++)
    { for (j = 0; j < nclusters; j++)
      { if (cmask[i][j]>0)
        { cdata[i][j] /= cmask[i][j];
          cmask[i][j] = 1;
        }
      }
    }
  }
}
/* ********** PROCEDURE GETCLUSTERMEDIANS ********** */

static void
getclustermedians(int nclusters, int nrows, int ncolumns, double** data,
  int** mask, int clusterid[], double** cdata, int** cmask, int transpose,
  double cache[])
{ int i, j, k;
  if (transpose==0)
  { for (i = 0; i < nclusters; i++)
    { for (j = 0; j < ncolumns; j++)
      { int count = 0;
        for (k = 0; k < nrows; k++)
        { if (i==clusterid[k] && mask[k][j])
          { cache[count] = data[k][j];
            count++;
          }
        }
        if (count>0)
        { cdata[i][j] = median(count,cache);
          cmask[i][j] = 1;
        }
        else
        { cdata[i][j] = 0.;
          cmask[i][j] = 0;
        }
      }
    }
  }
  else
  { for (i = 0; i < nclusters; i++)
    { for (j = 0; j < nrows; j++)
      { int count = 0;
        for (k = 0; k < ncolumns; k++)
        { if (i==clusterid[k] && mask[j][k])
          { cache[count] = data[j][k];
            count++;
          }
        }
        if (count>0)
        { cdata[j][i] = median(count,cache);
          cmask[j][i] = 1;
        }
        else
        { cdata[j][i] = 0.;
          cmask[j][i] = 0;
        }
      }
    }
  }
}

/* ************* PROCEDURE KCLUSTER ************* */

void kcluster (int nclusters, int nrows, int ncolumns, double** data,
  int** mask, double weight[], int transpose, int npass, char method,
  char dist, int clusterid[], double* error, int* ifound)
{ const int nelements = (transpose==0) ? nrows : ncolumns;
  const int ndata = (transpose==0) ? ncolumns : nrows;

  int i;
  int ok;
  int* tclusterid;
  int* mapping = NULL;
  double** cdata;
  int** cmask;
  int* counts;

  /* More clusters asked for than elements available */
  if (nelements < nclusters)
  { *ifound = 0;
    return;
  }

  /* Reported to the caller if a memory allocation below fails */
  *ifound = -1;

  /* This will contain the number of elements in each cluster, which is
   * needed to check for empty clusters. */
  counts = malloc(nclusters*sizeof(int));
  if(!counts) return;

  /* Find out if the user specified an initial clustering */
  if (npass<=1) tclusterid = clusterid;
  else
  { tclusterid = malloc(nelements*sizeof(int));
    if (!tclusterid)
    { free(counts);
      return;
    }
    mapping = malloc(nclusters*sizeof(int));
    if (!mapping)
    { free(counts);
      free(tclusterid);
      return;
    }
    for (i = 0; i < nelements; i++) clusterid[i] = 0;
  }

  /* Allocate space to store the centroid data */
  if (transpose==0) ok = makedatamask(nclusters, ndata, &cdata, &cmask);
  else ok = makedatamask(ndata, nclusters, &cdata, &cmask);
  if(!ok)
  { free(counts);
    if(npass>1)
    { free(tclusterid);
      free(mapping);
    }
    return; /* never continue without centroid storage */
  }

  if (method=='m')
  { double* cache = malloc(nelements*sizeof(double));
    if(cache)
    { *ifound = kmedians(nclusters, nrows, ncolumns, data, mask, weight,
                         transpose, npass, dist, cdata, cmask, clusterid,
                         error, tclusterid, counts, mapping, cache);
      free(cache);
    }
  }
  else
    *ifound = kmeans(nclusters, nrows, ncolumns, data, mask, weight,
                     transpose, npass, dist, cdata, cmask, clusterid,
                     error, tclusterid, counts, mapping);

  /* Deallocate temporarily used space */
  if (npass > 1)
  { free(mapping);
    free(tclusterid);
  }

  if (transpose==0) freedatamask(nclusters, cdata, cmask);
  else freedatamask(ndata, cdata, cmask);

  free(counts);
}

/* ********************************************************************* */

static int
kmeans(int nclusters, int nrows, int ncolumns, double** data, int** mask,
  double weight[], int transpose, int npass, char dist,
  double** cdata, int** cmask, int clusterid[], double* error,
  int tclusterid[], int counts[], int mapping[])
{ int i, j, k;
  const int nelements = (transpose==0) ? nrows : ncolumns;
  const int ndata = (transpose==0) ? ncolumns : nrows;
  int ifound = 1;
  int ipass = 0;
  /* Set the metric function as indicated by dist */
  double (*metric)
    (int, double**, double**, int**, int**, const double[], int, int, int) =
       setmetric(dist);

  /* We save the clustering solution periodically and check if it
   * reappears */
  int* saved = malloc(nelements*sizeof(int));
  if (saved==NULL) return -1;

  *error = DBL_MAX;

  do
  { double total = DBL_MAX;
    int counter = 0;
    int period = 10;

    /* Perform the EM algorithm. First, randomly assign elements to
     * clusters. */
    if (npass!=0) randomassign (nclusters, nelements, tclusterid);

    for (i = 0; i < nclusters; i++) counts[i] = 0;
    for (i = 0; i < nelements; i++) counts[tclusterid[i]]++;

    /* Start the loop */
    while(1)
    { double previous = total;
      total = 0.0;

      if (counter % period == 0) /* Save the current cluster assignments */
      { for (i = 0; i < nelements; i++) saved[i] = tclusterid[i];
        if (period < INT_MAX / 2) period *= 2;
      }
      counter++;

      /* Find the center */
      getclustermeans(nclusters, nrows, ncolumns, data, mask, tclusterid,
                      cdata, cmask, transpose);

      for (i = 0; i < nelements; i++)
      /* Calculate the distances */
      { double distance;
        k = tclusterid[i];
        if (counts[k]==1) continue;
        /* No reassignment if that would lead to an empty cluster */
        /* Treat the present cluster as a special case */
        distance = metric(ndata,data,cdata,mask,cmask,weight,i,k,transpose);
        for (j = 0; j < nclusters; j++)
        { double tdistance;
          if (j==k) continue;
          tdistance = metric(ndata,data,cdata,mask,cmask,weight,i,j,transpose);
          if (tdistance < distance)
          { distance = tdistance;
            counts[tclusterid[i]]--;
            tclusterid[i] = j;
            counts[j]++;
          }
        }
        total += distance;
      }
      if (total>=previous) break;
      /* total>=previous is FALSE on some machines even if total and previous
       * are bitwise identical. */
      for (i = 0; i < nelements; i++)
        if (saved[i]!=tclusterid[i]) break;
      if (i==nelements)
        break; /* Identical solution found; break out of this loop */
    }

    if (npass<=1)
    { *error = total;
      break;
    }

    for (i = 0; i < nclusters; i++) mapping[i] = -1;
    for (i = 0; i < nelements; i++)
    { j = tclusterid[i];
      k = clusterid[i];
      if (mapping[k] == -1) mapping[k] = j;
      else if (mapping[k] != j)
      { if (total < *error)
        { ifound = 1;
          *error = total;
          for (j = 0; j < nelements; j++) clusterid[j] = tclusterid[j];
        }
        break;
      }
    }
    if (i==nelements) ifound++; /* break statement not encountered */
  } while (++ipass < npass);

  free(saved);
  return ifound;
}

/* ---------------------------------------------------------------------- */

static int
kmedians(int nclusters, int nrows, int ncolumns, double** data, int** mask,
  double weight[], int transpose, int npass, char dist,
  double** cdata, int** cmask, int clusterid[], double* error,
  int tclusterid[], int counts[], int mapping[], double cache[])
{ int i, j, k;
  const int nelements = (transpose==0) ? nrows : ncolumns;
  const int ndata = (transpose==0) ? ncolumns : nrows;
  int ifound = 1;
  int ipass = 0;
  /* Set the metric function as indicated by dist */
  double (*metric)
    (int, double**, double**, int**, int**, const double[], int, int, int) =
       setmetric(dist);

  /* We save the clustering solution periodically and check if it
   * reappears */
  int* saved = malloc(nelements*sizeof(int));
  if (saved==NULL) return -1;

  *error = DBL_MAX;

  do
  { double total = DBL_MAX;
    int counter = 0;
    int period = 10;

    /* Perform the EM algorithm. First, randomly assign elements to
     * clusters. */
    if (npass!=0) randomassign (nclusters, nelements, tclusterid);

    for (i = 0; i < nclusters; i++) counts[i]=0;
    for (i = 0; i < nelements; i++) counts[tclusterid[i]]++;

    /* Start the loop */
    while(1)
    { double previous = total;
      total = 0.0;

      if (counter % period == 0) /* Save the current cluster assignments */
      { for (i = 0; i < nelements; i++) saved[i] = tclusterid[i];
        if (period < INT_MAX / 2) period *= 2;
      }
      counter++;

      /* Find the center */
      getclustermedians(nclusters, nrows, ncolumns, data, mask, tclusterid,
                        cdata, cmask, transpose, cache);

      for (i = 0; i < nelements; i++)
      /* Calculate the distances */
      { double distance;
        k = tclusterid[i];
        if (counts[k]==1) continue;
        /* No reassignment if that would lead to an empty cluster */
        /* Treat the present cluster as a special case */
        distance = metric(ndata,data,cdata,mask,cmask,weight,i,k,transpose);
        for (j = 0; j < nclusters; j++)
        { double tdistance;
          if (j==k) continue;
          tdistance = metric(ndata,data,cdata,mask,cmask,weight,i,j,transpose);
          if (tdistance < distance)
          { distance = tdistance;
            counts[tclusterid[i]]--;
            tclusterid[i] = j;
            counts[j]++;
          }
        }
        total += distance;
      }
      if (total>=previous) break;
      /* total>=previous is FALSE on some machines even if total and previous
       * are bitwise identical. */
      for (i = 0; i < nelements; i++)
        if (saved[i]!=tclusterid[i]) break;
      if (i==nelements)
        break; /* Identical solution found; break out of this loop */
    }

    if (npass<=1)
    { *error = total;
      break;
    }

    for (i = 0; i < nclusters; i++) mapping[i] = -1;
    for (i = 0; i < nelements; i++)
    { j = tclusterid[i];
      k = clusterid[i];
      if (mapping[k] == -1) mapping[k] = j;
      else if (mapping[k] != j)
      { if (total < *error)
        { ifound = 1;
          *error = total;
          for (j = 0; j < nelements; j++) clusterid[j] = tclusterid[j];
        }
        break;
      }
    }
    if (i==nelements) ifound++; /* break statement not encountered */
  } while (++ipass < npass);

  free(saved);
  return ifound;
}

/* *********************************************************************
*/
