TABLE OF CONTENTS

INTRODUCTION
1. Purpose of the specialized internship
2. Introduction to the internship topic
3. Requirements of the topic

CHAPTER I OVERVIEW OF WEB MINING
1. General introduction
2. Web mining
2.1 Overview
2.2 Components of Web mining and methodologies
a. Information retrieval (IR)
b. Information extraction, selection and preprocessing
c. Generalization
d. Analysis
3. Web content mining and Web structure mining
3.1 Web content mining
3.2 Web structure mining
4. Web text mining
4.1 Text Classification
4.2 Text Clustering
4.3 Association analysis
4.4 Trend Prediction

CHAPTER II DATA MINING
1. Overview of data mining
1.1 Concept
1.2 Steps of the data mining process
2. Main tasks of data mining
3. Data mining methods
4. Some main problems in data mining research

CHAPTER III TEXT AND TEXT PROCESSING
1. Concept
2. Representing text with the vector space model
2.1 Boolean model
2.2 Frequency model
a. Method based on term frequency (TF - Term Frequency)
b. Method based on inverse document frequency (IDF - Inverse Document Frequency)
c. TF x IDF method
2.3 Sparse vector processing
3. Processing problems for unstructured text
3.1 The text classification problem
3.1.1 Introduction
3.1.2 Text classification methods
a. Decision Tree
b. k-Nearest Neighbor
3.2 The text clustering problem
3.2.1 Introduction
3.2.2 Text clustering methods
a. Hierarchical Bayesian algorithm
b. Similarity-based agglomerative clustering algorithm
c. K-means algorithm

CHAPTER IV BUILDING AN EXPERIMENTAL WEB CLUSTERING APPLICATION
1. Problem statement
2. Approach
2.1 Web Crawler
a. Introduction
b. Crawl order of URLs
c. Some issues to note for a Web crawler
d. Algorithm used for the Web crawler
2.2 Applying clustering algorithms to the collected data set
2.2.1 Steps for building the vector representation of documents
- [entry illegible]
- Removing stopwords
- Stemming
- Sorting the keywords
- Building the bag-of-words
- Representing each text file as a vector
2.2.2 Applying the clustering algorithms
REFERENCES

INTRODUCTION

1. Purpose of the specialized internship

After two years of study in our major, we have absorbed a good deal of foundational knowledge thanks to the lecturers of the Faculty of Information Technology (IT). However, we have not yet really gone deep into new technologies or into specific, detailed problems in IT. This specialized internship is therefore an opportunity for us to study one concrete problem in the field. It is also a first step that prepares us for the next stage of study, the narrower specialization in Software Engineering.

2. Introduction to the internship topic

Over the past decade we have witnessed explosive growth in the information sources available on the World Wide Web (WWW). Today, Web browsers give us easy access to countless sources of text and multimedia data. With more than one billion Web pages, indexing by search engines and finding the information one needs is no longer an easy task. This abundance of information has created the need for automatic mining techniques on the WWW, known by the term "Web mining".

To obtain intelligent Web tools that limit human intervention, we need to integrate or embed artificial intelligence into these tools. The need to create intelligent server-side and client-side systems that effectively exploit knowledge on the Internet, or from specific Web pages, has attracted researchers from many different fields, such as information retrieval, knowledge discovery, machine learning and artificial intelligence (AI).
However, the problem of developing automatic tools to search, extract, filter or evaluate the information users want from Web data, which is unlabeled, distributed and heterogeneous, is still in its infancy.

In a short one-month internship, I did not have enough time to study Web mining in depth. This report is limited to a general overview and to the implementation of an application that illustrates the algorithms studied. I hope that my lecturers and classmates will offer their valuable comments so that I can continue to develop and improve this topic.

3. Requirements of the topic

1. Study Web mining in general.
2. Study Web text mining.
3. Build an application that collects information from the Internet, then apply clustering algorithms to the collected data set.

CHAPTER I OVERVIEW OF WEB MINING

With its enormous volume of online information, the World Wide Web has become a rich and fertile area for data mining research. Web mining research can be seen as a synthesis of several research fields, such as databases, information retrieval and artificial intelligence (AI), with notable contributions from machine learning and natural language processing. Even so, there is still much confusion and ambiguity when research results from these different viewpoints are compared.

1. General introduction

Today the World Wide Web is a widely used interactive medium for disseminating information. When interacting with the Web, information users usually encounter the following problems:

a. Finding relevant information: People use browsers and search services when they want to find specific information on the Web. To use a search service, the user only needs to submit a simple query. Almost immediately, the search results are displayed as a list of pages ranked by their similarity to the query.
However, today's search tools still suffer from some prominent problems. First, precision is low because many of the search results are irrelevant. Second, recall is low because the tools cannot index all the information available on the Web.

b. Creating new knowledge from the information available on the Web: This can be regarded as a sub-problem of the problem above. While the problem above processes queries (it is retrieval oriented), this problem processes data: we assume that a collection of Web data is already available and that useful latent knowledge must be extracted from it.

c. Personalization of information: When interacting with the Web, each person has their own way of presenting information, depending on their preferences.

d. Learning about consumers and individual users: This problem provides the solution to problem c above. It tells us what customers do and what they want.

2. Web mining

2.1 Overview

Web mining is essentially the use of data mining techniques to automatically discover and extract information from Web documents and services. It is currently a broad research area that attracts attention from many research fields, owing to the tremendous growth of information sources on the Web and of electronic commerce. Web mining is divided into the following subtasks:

1. Resource finding: retrieving the desired Web documents. This is the process of retrieving online or offline data from the text sources available on the Web. These information sources may be electronic newsletters, newswire items, or the text content of HTML documents obtained by removing the HTML tags.

2. Information selection and preprocessing: automatically selecting and preprocessing the information just retrieved from Web resources. These transformations may consist of removing stop words, stemming and so on, or of preprocessing to obtain a suitable representation, such as conversion to a relational model or to first-order logic form.
3. Generalization: automatically discovering general patterns from individual Web sites or from a group of Web sites. To generalize patterns, machine learning or data mining techniques are usually used. In the process of discovering information and knowledge, humans play an extremely important role because the Web is an interactive medium.

4. Analysis: validating and/or interpreting the patterns just mined.

In short, Web mining is a technique for discovering and analyzing useful information from Web data.

The Web has two main kinds of data:
- Web content data
- Web structure data

Corresponding to each kind of data to be mined, Web mining techniques are divided into:
- Web content mining
- Web structure mining

[Figure 1. Classification of Web mining: Web Mining divides into Content Mining (Text Mining, Multimedia Mining) and Structure Mining (External Structure Mining, Internal Structure Mining, URL Mining)]

Web structure mining can be subdivided into:
- External structure mining: focuses on mining the hyperlinks between Web pages.
- Internal structure mining: mines the internal structure of a Web page.
- URL mining.

Web content mining is divided into:
- Text mining: covers text files and HTML documents.
- Multimedia mining.

Although mining multimedia data is interesting and attractive in many ways, text mining plays an extremely important role, because text is currently the main information medium on the Web.

2.2 Components of Web mining and methodologies

As stated above, Web mining can be divided into four subtasks:
- Information retrieval (IR), also called resource discovery
- Information selection/extraction and preprocessing
- Generalization
- Analysis

a. Information retrieval (IR)

Information retrieval addresses the problem of automatically retrieving all relevant documents while keeping the number of useless documents to a minimum.
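The two goals in this definition, retrieving all relevant documents and few useless ones, correspond to the recall and precision measures mentioned in section 1. The following is a minimal sketch of how they are commonly computed, assuming documents are identified by hypothetical string IDs; the function name and the sample data are illustrative only and do not come from the report.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the system returned.
    relevant:  IDs of the documents that are actually relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    # Precision: share of the returned documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents returned, 3 relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

In this made-up case two of the four returned documents are relevant, and two of the three relevant documents were found.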
The IR process usually comprises document representation, indexing and document search. Basically, an index is where terms are recorded. Whenever we need documents related to some term, we look it up in the index to find all the documents that contain this term. However, indexing Web pages is rather complicated. With a huge number of Web pages that are dynamic and frequently updated, indexing seems to be an almost impossible task. At present there are four methods of indexing documents on the Web:
- manual indexing
- automatic indexing
- intelligent or agent-based indexing
- metadata-based indexing

A search engine is a program written to query and retrieve information stored in databases (fully structured), HTML pages (semi-structured) or plain text documents on the Web.

Architecture of an IR system

[Figure: architecture of an IR system, with the labels Documents, Queries, Input, Processor, Output and Feedback]

Given the documents and the user's queries as input, the processor returns the documents that satisfy the query. These results are sometimes used as feedback to the IR system in order to obtain better results in similar situations later.

b. Information extraction, selection and preprocessing

Once the documents have been retrieved, the next task is to extract knowledge and other required information without human interaction. Information extraction (IE) aims to identify the specific fragments that make up the core semantic content of a document. Until now, IE methods have more or less all involved writing wrappers that map documents to some data model. Current information integration systems usually operate by translating different sites into knowledge sources and then extracting information from them. Another method of extracting information from hypertext is to treat each page as a set of standard questions.
The problem is then to identify the text fragments that answer certain specific questions. In other words, the slots are filled with text fragments from the document. The goal of IE is thus to extract new knowledge from the retrieved documents by exploiting the structure and representation of the document, whereas IR specialists treat a document as a bag of words and pay no attention to its structure. The main obstacle for IE is the size and dynamic nature of the Web. For this reason, IE systems usually extract information only from a few specific sites and concentrate on a few well-defined domains.

To be able to extract any kind of knowledge from even medium-sized databases, we need a powerful preprocessing system. When a user requests a Web page, many different kinds of files are accessed, such as images, audio, video, CGI and HTML files. As a result, the server log contains many entries that are redundant or irrelevant for the mining task, and they must be removed by preprocessing. One preprocessing technique often used in IE is latent semantic indexing (LSI), which transforms the original document vectors into a space of lower dimension by analyzing the correlations between terms in the document collection. Other preprocessing techniques are stop word removal and stemming, which reduce the size of the input features.

c. Generalization

In this stage, machine learning and pattern recognition techniques are usually used to extract information. Most machine learning systems deployed on the Web learn about the needs of users rather than about the Web itself. The main difficulty in learning about the Web is the labeling problem. Data on the Web is indeed extremely abundant, but it is unlabeled. Meanwhile, many data mining techniques require input examples that are labeled positive or negative with respect to some concept.
For example, suppose we have a large set of Web pages labeled as positive or negative examples of the concept "homepage". It would then be easy to build a classifier that predicts whether a Web page is a homepage or not. Unfortunately, Web pages are not labeled. Techniques such as uncertainty sampling can reduce the amount of labeled data that is needed, but they cannot eliminate the labeling problem. One approach to the labeling problem relies on the fact that the Web contains many sets of linked documents. Clustering techniques do not require labeled input, and they have been applied very successfully to large document collections. Moreover, the Web is fertile ground for research on document clustering.

Association rule mining is part of this stage. Basically, association rules are expressions of the form X ⇒ Y, where X and Y are sets of items. X ⇒ Y means that whenever a transaction T contains X, T probably also contains Y. The probability, or confidence, of the rule is defined as the percentage of transactions that contain Y in addition to X, out of all transactions that contain X.

d. Analysis

Analysis is a data-driven problem that assumes sufficient data is already available. From that data, potentially useful information can be extracted and analyzed. Humans play an extremely important role in the knowledge discovery process because the Web is an interactive medium. This is especially important for validating and interpreting the mined patterns: once patterns have been discovered, analysts need appropriate tools to understand, visualize and interpret them.

3. Web content mining and Web structure mining

3.1 Web content mining

Web content mining is the process of discovering useful information from Web content, data, documents and services. Web documents may contain several kinds of data: text, images, audio, video, metadata and hyperlinks.
Some of this data is semi-structured, such as HTML documents, and some is more structured, such as data in tables or databases generated from HTML pages. However, most of the data is unstructured text. The unstructured character of Web data makes Web content mining extremely complex.

Web content mining can be viewed from two different perspectives:
- the Information Retrieval view
- the Database view

The Information Retrieval view

The table summarizes the research on unstructured text documents. Most studies use a set of single words to represent unstructured documents. This representation ignores the order in which words occur and considers only their statistical properties. One may record whether a word occurs in the document or not (Boolean), or the frequency with which the word occurs in the document. In addition, attention is paid to removing punctuation, infrequent words and stop words, and to stemming. For semi-structured documents, the HTML structure inside the document or the hyperlinks between documents are usually used.

The Database view

Database techniques on the Web relate to the problems of managing and querying information on the Web. In general, there are three kinds of tasks related to this problem: modeling and querying the Web, information integration and extraction, and creating and restructuring Web sites. From the database point of view, we basically try to infer the structure of a Web site, or to transform a Web site into a database, so that information on the Web can be managed and queried better.

[Table: summary of Web content mining research; contents not recoverable from the source]
From the table above we can see that the document representation used in the database view is quite different from that used in IR. The DB view usually uses the Object Exchange Model (OEM), in which semi-structured data is represented by a labeled graph. Data in OEM is viewed as a graph in which the objects are the vertices and each edge carries a label.

3.2 Web structure mining

Most current Web information retrieval tools use only the textual information and ignore the link information. The goal of Web structure mining is to generate structural summaries of Web sites and Web pages. While Web content mining focuses on the structure inside a document, Web structure mining tries to discover the link structure of the hyperlinks in the documents. Based on the topology of the hyperlinks, Web structure mining classifies Web pages and generates information such as the similarity and relationships between different Web sites.

Web structure mining can also discover the structure of the Web documents themselves. This kind of structure mining is used to reveal the structural schema of Web pages. The structural information extracted by Web structure mining includes:
- Information on the frequency of local links in the Web tuples of a Web table.
- Information on the frequency of Web tuples in a Web table that contain interior links, that is, links to the same document.
- Information on the frequency of Web tuples that contain global links, that is, links that span to other Web sites.
- Information on the frequency of identical Web tuples that appear in one Web table or across several Web tables.
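The link-frequency information listed above can be illustrated with a small sketch that counts the interior, local and global links of a single page. It assumes that "interior" means a link to the same document, "local" a link to another page on the same host, and "global" a link to a different host; this reading of the terms, the class and function names, and the sample URL are assumptions for illustration, and the Web table and Web tuple structures themselves are not modeled.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html):
    """Count interior, local and global links (assumed definitions)."""
    parser = LinkCollector()
    parser.feed(html)
    page_doc = urldefrag(page_url).url      # page URL without #fragment
    page_host = urlparse(page_url).netloc
    counts = Counter()
    for href in parser.hrefs:
        target = urljoin(page_url, href)    # resolve relative links
        if urldefrag(target).url == page_doc:
            counts["interior"] += 1         # points into the same document
        elif urlparse(target).netloc == page_host:
            counts["local"] += 1            # same host, different document
        else:
            counts["global"] += 1           # different host
    return counts


# Hypothetical page with one interior, two local and one global link.
sample = (
    '<a href="#top">Top</a> <a href="/b.html">B</a> '
    '<a href="c.html">C</a> <a href="http://other.example.com/">X</a>'
)
print(classify_links("http://example.org/a/index.html", sample))
```

Links without a host, such as mailto links, fall into the "global" bucket in this sketch; a real tool would probably filter them out first.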
4. Web text mining

Web text mining has four main types:
- Text Categorization
- Text Clustering
- Association analysis
- Trend prediction

4.1 Text Categorization

Researchers define text categorization as the assignment of predefined topics to documents on the basis of their content. Text categorization is used to support document browsing, document search, information extraction, document filtering, and the automatic routing of documents to predefined topics. There are many categorization algorithms, among them:
- the k-Nearest Neighbor algorithm
- Decision Tree

4.2 Text Clustering

In the text categorization problem, the requirement is to assign an unclassified document to one or more predefined groups on the basis of the document's content and the attribute sets of the groups. In the text clustering problem, by contrast, the attributes of the groups are not known; more precisely, the groups have not yet been defined. The requirement is to form a certain number of document groups from a given set of documents in the most suitable way.

Text clustering can be used to organize the results returned by a search engine. The user then only has to examine the groups related to the query. This saves considerable time and effort compared with examining the entire set of documents.

Clustering algorithms can be divided into two main groups:
- Hierarchical clustering
  + Hierarchical Bayesian Clustering (HBC) algorithm
  + G-HAC algorithm
- Partitional clustering: K-Means

4.3 Association analysis

Extracts the relationships among the phrases and words that occur in the documents.

4.4 Trend Prediction

Predicts the value of a given data item at some specified time in the future.

Web text mining is similar to mining a collection of text files, with certain extensions.
However, the additional information carried by the tags in Web documents should also be exploited to improve the quality of Web text mining.

CHAPTER II DATA MINING

1. Overview of data mining

1.1 Concept

Data mining is a concept that emerged in the late 1980s. It covers a range of techniques for discovering valuable information hidden in large data sets (data warehouses). In essence, data mining involves analyzing data and using techniques to find regular patterns in the data set.

In 1989, Fayyad, Piatetsky-Shapiro and Smyth used the term Knowledge Discovery in Databases (KDD) to refer to the whole process of discovering useful knowledge from large data sets. Within it, data mining is one particular step of the whole process, in which specific algorithms are used to extract patterns from the data.

1.2 Steps of the data mining process

Data mining algorithms are usually described as programs that operate directly on a data file. With earlier machine learning and statistical methods, the first step was usually for the algorithm to load the entire data file into memory. This model cannot cope with industrial applications that involve mining large data warehouses, not only because the data cannot all be loaded into memory but also because it is difficult to extract the data into simple files for analysis.

The data mining process begins by precisely defining the problem to be solved. The relevant data for building the solution is then identified. The next step is to collect the relevant data and process it into a form that the data mining algorithms can understand.
In theory this seems simple, but in practice it is a very difficult process that runs into many obstacles: the data must be copied several times (if it is extracted into files), the collection of data files must be managed, the whole process must be repeated many times (if the data model changes), and so on. It would be too cumbersome for a data mining algorithm to access the entire content of the database and do all of this. Besides, it is not necessary. Many data mining algorithms operate on fairly simple summary statistics of the database, since the full information in the database is far more than the mining task needs.

The next step is to choose a suitable data mining algorithm and carry out the mining in order to find meaningful patterns, in a representation corresponding to their meaning (usually classification rules, decision trees, production rules, regression expressions and so on).

The patterns must be novel, at least to the system. Novelty can be measured with respect to changes in the data (by comparing current values with previous or expected values) or with respect to knowledge (how a new finding relates to old ones). Usually the novelty of a pattern is evaluated by a logical function or by a function that measures the novelty or unexpectedness of the pattern. In addition, the patterns must be potentially useful. After being processed and interpreted, they must lead to some useful action, as evaluated by a utility function. For example, in loan data, the utility function evaluates the potential increase in profit from the loans. The mined patterns must also be valid on new data with some degree of accuracy.

[Figure: the data mining process; recoverable labels include problem definition, identification of relevant data, data collection and summary statistics]

Because data mining algorithms and tasks differ widely, the forms of the extracted patterns are also very diverse.
In the simplest case, the result of the analysis is a report of some kind (which may include statistical measures of the goodness of fit of the model, outlying data and so on). In practice the output is much more complex. An extracted pattern may be a description of a trend, possibly in text form; a graph describing the relationships in the model; or an action, for example what the user should do with what has been mined from the database.

Data mining techniques are in fact nothing new. They inherit, combine and extend basic techniques that had been studied earlier, such as machine learning, pattern recognition, statistics (regression, classification, clustering), graphical models, Bayesian networks, artificial intelligence and knowledge acquisition for expert systems. However, through its skillful combination of these techniques, data mining has clear advantages over earlier methods and offers great promise both for scientific research and for increasing profits in business.

2. Main tasks of data mining

Clearly, the purpose of data mining is that the extracted knowledge will be used for competitive advantage in the marketplace and for benefits in scientific research. The goals of data mining can therefore be regarded as description and prediction, and the patterns that data mining discovers serve these goals. Prediction involves using some variables or fields in the database to extract patterns that predict unknown or future values of the variables of interest. Description focuses on finding patterns that describe the data in a form humans can understand.

To achieve these two goals, the main tasks of data mining are as follows:
- Classification: learning a function that maps (classifies) a data item into one of several predefined classes.
- Regression: learning a function that maps a data item to a real-valued prediction variable.
- Clustering: a general descriptive task that seeks a finite set of groups or categories to describe the data. The groups may be mutually exclusive, hierarchical or overlapping; in the last case, a data item may belong to more than one group at once.
- Summarization: methods for finding a compact description of a subset of the data.
- Dependency modeling: finding a model that describes significant dependencies between variables. Dependency models exist at two levels: the structural level of the model specifies (often in graphical form) which variables are locally dependent on each other, and the quantitative level specifies the strength of the dependencies on some numerical scale.
- Change and deviation detection: discovering the most significant changes in the data relative to normative or previously measured values.

These tasks require very different amounts and kinds of information, so they usually influence the design and choice of the data mining algorithm.

3. Data mining methods

The data mining process is a process of discovering patterns, in which the data mining algorithm searches for patterns of interest in a particular form, such as rules, classification trees, regression or clustering.

Components of a data mining algorithm

A data mining algorithm has three main components: model representation, model evaluation and model search.

Model representation: The model is represented in a language L used to describe the discoverable patterns. If the representation is too limited, learning will not be possible, or no patterns will exist that yield an accurate model of the data. It is therefore important that the data analyst fully understands the representational assumptions. It is equally important that the algorithm designer states clearly which representational assumptions are made by which algorithm.
The greater the representational power of the model, the greater the danger of overfitting and the lower the ability to predict unseen data. Moreover, the search becomes more complex and the model becomes harder to interpret.

The initial model is specified by relating the output (dependent) variable to the independent variables on which it depends. The parameters on which the problem needs to focus must then be found. Model search produces a model that fits, with parameters determined from the data (in some cases the model is constructed independently of the data, while in other cases the model and its parameters are changed to fit the data).

In some cases the data set is divided into a training set and a test set. The training set is used to fit the parameters of the model to the data. The model is then evaluated by feeding the test data into it, and the parameters are adjusted again if necessary. The chosen model may be a statistical method such as SASS, ..., some machine learning algorithm, a neural network, case-based reasoning, or a classification technique.

Model evaluation: This determines whether a pattern meets the criteria of the knowledge discovery process. Predictive accuracy is assessed on the basis of cross validation. Descriptive quality is assessed in terms of predictive accuracy, novelty, usefulness and understandability of the model. Both statistical and logical criteria can be used to evaluate the model. Model evaluation is carried out by testing on data (in some cases on all the data, in other cases on the test data).

Search method: The search method has two components: parameter search and model search. In parameter search, the algorithm must search for the parameters that optimize the model evaluation criteria, given the observed data and a fixed model description.
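As a rough illustration of parameter search for a fixed model description, the sketch below assumes a linear model y = a*x + b, a sum-of-squared-errors evaluation criterion, and an exhaustive search over a small grid of candidate parameters. The model form, the criterion, the grid and all function names are illustrative assumptions and are not taken from the report.

```python
def sum_squared_error(params, data):
    """Evaluation criterion: squared error of y = a*x + b on the data."""
    a, b = params
    return sum((y - (a * x + b)) ** 2 for x, y in data)


def parameter_search(data, candidates):
    """Return the candidate parameters with the lowest criterion value."""
    return min(candidates, key=lambda p: sum_squared_error(p, data))


# Hypothetical observations generated from y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]

# Candidate grid: a and b each range over -5.0, -4.5, ..., 5.0.
grid = [i * 0.5 for i in range(-10, 11)]
candidates = [(a, b) for a in grid for b in grid]

best = parameter_search(data, candidates)
print(best)  # (2.0, 1.0)
```

The model description stays fixed while only the parameters vary; changing the model form itself would be the job of an outer search over model descriptions.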
Search is unnecessary for fairly simple problems, where the optimal parameter estimates can be obtained by simpler means. Model search occurs as a loop over the parameter search method: the model description is changed, producing a family of models. For each model description, the parameter search method is applied to evaluate the quality of the model. Model search methods usually use heuristic search techniques, because the size of the model space often rules out exhaustive search and simple solutions are not easy to obtain.

4. Some main problems in data mining research

- Classification: finding a mapping from a data item to one of the available classes.
- Regression: finding a regression mapping from a data item to a real-valued prediction variable.
- Clustering: a general descriptive task that seeks a finite set of groups or categories describing the characteristics of the data.
- Summarization: finding a compact general description of a subset of the data.

CHAPTER III TEXT AND TEXT PROCESSING PROBLEMS

1. Concept

Among the commonly used forms of data, text is one of the most widespread. Texts may be newspaper articles, business documents, economic information or scientific papers. In general, text data is divided into two kinds:
- Unstructured: the kind of text we use every day, expressed in natural human language and without any specific formatting structure.
- Semi-structured: text that is not stored as strict records but is organized through markup that indicates the main content of the text.

2. Representing text with the vector space model (Vector Space Model)

The most common way to represent text is as a vector.
This representation is relatively simple and effective. Earlier papers often argued that it is expensive in storage and processing effort, but once sparse-vector techniques are applied these drawbacks shrink considerably.

In this model, each document is represented as a vector. Each component of the vector corresponds to a distinct term in the original document collection and is assigned a value given by a function $f$ of that term in the document.

[Figure 1: Document vectors in a space with only two terms (axes: term 1, term 2; points: documents 1 to 4)]

2.1 Boolean model

A vector representation in which the function $f$ takes only two discrete values, true and false (or 1 and 0), is called the Boolean model. The function $f$ for a term $t_i$ returns true if and only if $t_i$ appears in the document.

The Boolean model is defined as follows. Suppose a database contains $m$ documents, $D = \{d_1, d_2, \ldots, d_m\}$. Each document is represented as a vector over $n$ terms $T = \{t_1, t_2, \ldots, t_n\}$. Let $W = \{w_{ij}\}$ be the weight matrix, where $w_{ij}$ is the weight of term $t_i$ in document $d_j$. The Boolean model is the simplest model and is given by:

$$w_{ij} = \begin{cases} 1 & \text{if } t_i \text{ occurs in } d_j \\ 0 & \text{otherwise} \end{cases}$$

2.2 Frequency model

In the frequency model, the matrix $W = \{w_{ij}\}$ is determined from the frequency of term $t_i$ in document $d_j$, or from the frequency of term $t_i$ in the whole database.

a. Term frequency method (TF, Term Frequency)

The values $w_{ij}$ are computed from the number of occurrences of the term in the document. Let $f_{ij}$ be the number of occurrences of term $t_i$ in document $d_j$. Then $w_{ij}$ is given by one of three formulas:

1. $w_{ij} = f_{ij}$
2. $w_{ij} = 1 + \log(f_{ij})$
3. $w_{ij} = \sqrt{f_{ij}}$

In this method the weight $w_{ij}$ grows with the number of occurrences of term $t_i$ in document $d_j$. The more often $t_i$ occurs in $d_j$, the more $d_j$ depends on $t_i$; in other words, the more information $t_i$ carries about $d_j$. For example, if the term "computer" occurs many times in a document, the document is mainly about informatics.

This reasoning does not always hold. A typical counterexample is the word "and", which occurs frequently in almost every document but carries far less meaning than its frequency suggests. The IDF method was proposed to overcome this weakness of TF.

b. Inverse document frequency method (IDF, Inverse Document Frequency)

In this method, $w_{ij}$ is computed as:

$$w_{ij} = \begin{cases} \log\dfrac{m}{h_i} = \log(m) - \log(h_i) & \text{if } t_i \text{ occurs in } d_j \\ 0 & \text{otherwise} \end{cases}$$

where $m$ is the number of documents and $h_i$ is the number of documents in which term $t_i$ occurs.

The weight $w_{ij}$ in this formula reflects the importance of term $t_i$ in document $d_j$. The fewer documents $t_i$ occurs in, the larger its weight for a document $d_j$ in which it does occur: it is a key feature for distinguishing $d_j$ from other documents, and it carries more information.

c. TF × IDF method

This method combines TF and IDF. The weight matrix is computed as:

$$w_{ij} = \begin{cases} \left[1 + \log(f_{ij})\right] \log\dfrac{m}{h_i} & \text{if } t_i \text{ occurs in } d_j \\ 0 & \text{otherwise} \end{cases}$$

The method keeps the advantages of both approaches: $w_{ij}$ depends on the frequency of term $t_i$ in document $d_j$ and on the rarity of $t_i$ across the whole database.

2.3 Sparse vector processing

In the standard vector model, the cost of vector operations depends on the size of the matrix $W_{nm}$, where $n$ is the number of terms (the dimension of the vectors) and $m$ is the number of documents in the database.
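As a hedged illustration of the weight matrix $W$ just described, the sketch below builds it for a toy collection using the TF × IDF scheme of section 2.2. The function names (`tf_weight`, `idf_weight`, `weight_matrix`), the toy documents, and the use of the natural logarithm are assumptions made for this example; the text does not fix a logarithm base.

```python
import math
from collections import Counter


def tf_weight(f, scheme="log"):
    """Term-frequency weight for a raw count f (one of the three TF formulas)."""
    if f == 0:
        return 0.0
    if scheme == "raw":
        return float(f)
    if scheme == "sqrt":
        return math.sqrt(f)
    return 1.0 + math.log(f)  # natural log is an assumption


def idf_weight(m, h):
    """Inverse document frequency: log(m / h) for a term found in h of m documents."""
    return math.log(m / h)


def weight_matrix(docs):
    """Return (terms, W) where W[i][j] is the TF x IDF weight of term i in document j."""
    m = len(docs)
    counts = [Counter(doc) for doc in docs]
    terms = sorted({t for doc in docs for t in doc})
    W = []
    for t in terms:
        h = sum(1 for c in counts if c[t] > 0)  # documents containing t
        W.append([tf_weight(c[t]) * idf_weight(m, h) for c in counts])
    return terms, W


# Hypothetical toy collection: three already tokenised documents.
docs = [
    ["computer", "internet", "computer", "data"],
    ["chicken", "grass", "wool", "data"],
    ["clothing", "wool", "data"],
]
terms, W = weight_matrix(docs)
```

In this toy run the term "data" occurs in every document, so its IDF is log(3/3) = 0 and its whole row of `W` is zero, which is the behaviour the IDF factor is meant to produce for uninformative words.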
In practice, the number of terms and the number of documents can each reach tens of thousands. The matrix $W_{nm}$ then has hundreds of millions of entries; storing it consumes too much memory, and operations on the vectors become very costly. To overcome this, sparse-vector processing can be used instead of storing and processing standard vectors.

The conditions for applying the sparse vector method are:

1. The vectors are truly sparse: the number of entries with non-zero weight is much smaller than the number of terms in the database.
2. The vector operations are as simple as possible: the number of vectors affected by one basic operation is minimal. This number is usually limited to at most 3 or 4.

In practice, the number of terms appearing in one document is usually below 1000. Long documents covering several topics may contain more. Meanwhile, the number of terms in the dictionary can reach 100,000. This is precisely the condition under which sparse vectors apply. Whether the second condition is met depends on the algorithm used to process the text.

An example of sparse vectors derived from standard vectors:

> Standard vectors:

ID | Computer | Internet | Chicken | Clothing | Grass | Wool
d0 (Information technology) | 2 | 3 | 0 | 0 | 0 | 0
d1 (Agriculture) | 0 | 0 | 4 | 0 | 1 | 1
d2 (Industry) | 0 | 0 | 0 | 6 | 0 | 2

d0 = (2, 3, 0, 0, 0, 0)
d1 = (0, 0, 4, 0, 1, 1)
d2 = (0, 0, 0, 6, 0, 2)

> Sparse vectors:

d0 = ((1, 2), (2, 3))
d1 = ((3, 4), (5, 1), (6, 1))
d2 = ((4, 6), (6, 2))

The element type of a sparse vector differs from that of a standard vector. Each element consists of two values: the code of the term and the corresponding weight. The element (6, 2) in document d2 says that the term with code 6 ("wool") has weight 2.

3. Unstructured text processing problems

3.1 Text classification (Text Classification)

3.1.1 Introduction

Problem: Given a collection of documents that people have already classified by hand into a set of predefined topics.
When a new document is supplied, the system must report the topic that fits it best.

Traditionally, text classification is done manually: someone reads each document and assigns it to the appropriate group. This costs a great deal of time and money when the number of documents is large, so tools that classify text automatically are needed.

Automatic text classifiers are usually built with machine learning algorithms. There are also more specialized algorithms for classification in particular domains. A typical example is the classification of official documents. The system looks for distinguishing features in a fairly mechanical way: when it finds the symbol "NĐ" in the upper-left margin it assigns the document to the group "Nghị định" (decree), and likewise the symbols "CV" and "QĐ" send a document to the groups "Công văn" (official dispatch) and "Quyết định" (decision). This approach is fairly effective, but it only suits groups of data with strongly distinctive features.

For documents with less distinctive features, the classifier has to work from the content of the document and compare how well it matches documents that people have already classified. This is the main idea behind the machine learning approach. In this model the documents are classified in advance, and the system must find a way to extract the features that characterize each group. The set of sample documents used for training is called the training set (train set) or pattern set; the process of classifying documents by hand is called training, and the process by which the machine finds the features of each group is called learning. Once the machine has finished learning, users supply new documents, and the machine's task is to find which of the groups it was trained on fits each document best.

One of the most important applications of text classification is text retrieval.
Starting from the classified data set, the documents are indexed under each corresponding class. Users can then specify the topic they want to search through their queries.

Another application of text classification is text understanding. Classification can be used to filter out some documents, or parts of a document, without losing the richness of natural language.

In text classification, the membership of a document in a group may be a discrete true or false value (the document either belongs to the group or does not) or a continuous value (a real number indicating how well the document fits the group). When the answer must be true or false, the continuous value can be discretized by applying a threshold. The threshold depends on the algorithm and on the user's requirements.

The text classification process consists of the following steps:

1. Indexing: the input to this step is a document, and the output is a sequence of term indices representing the content of that document, as in text retrieval. This step is very important because it determines the speed and accuracy of the steps that follow. Labelling is usually required to run in real time. The step is usually divided into smaller steps: removing superfluous and noisy characters, normalizing characters, splitting the text into words, and indexing. Word splitting and indexing are the most important of these, but all of them need attention because every one affects the speed of the whole process.

2. Determining relevance: as in text retrieval, classification needs a processing step that decides whether the document under analysis belongs to a given group, based on the content of the document or on the sequence of terms that represents it. The component that does this is called the classifier (categorizer or classifier). It plays the same role as the query formatter in text retrieval.
In text retrieval, however, this component is transient, whereas in text classification it is more stable. The result of this step contains the degree of dependence between the document under analysis and each available group. It can be regarded as the decisive step of the whole classification process.

3. Comparison: in most classifiers, each document must be judged true or false for some class. The biggest difference from a text retrieval system is that the comparison is performed only once, with a finite number of comparisons that depends on the training set. Choosing the right decision depends mainly on the relationships between the groups used for training.

4. Feedback: feedback plays two roles in a text classification system. First, classification needs a large number of documents that have been classified by hand beforehand; these are used as training samples to build the classifier. Second, a classification is not as easy to change as the feedback loop in text retrieval. Instead, users can tell the system maintainer which groups they want deleted, added, or modified.

3.1.2 Text classification methods

With the information explosion and the continuing growth of the Web, text classification has played, and continues to play, an extremely important role, simply because the amount of online text keeps increasing. Finding information becomes very difficult without good indexes and summaries of document content. In recent years the problem has usually been tackled with statistical classification methods and machine learning techniques.
Most current research on text classification addresses only binary problems, in which a document is either relevant or not relevant to a given topic. However, many documents today, especially on the Internet, are composed of several topics, which raises the multi-class classification problem. So far these problems have received very little attention.

There are many algorithms for text classification, for example:

- Rocchio algorithm
- Naive Bayes
- Decision Trees
- K-Nearest Neighbor
- Support Vector Machines

7.1 Decision Tree algorithm

The decision tree is one of the most widely used methods for inductive learning from large data sets. It is a method for approximating discrete-valued target functions. One advantage of decision trees is that they can easily be converted into a knowledge base of If-Then rules. The method was described by Mitchell in 1996.

In the tree, internal nodes are labelled with concepts, the branches leaving a node are labelled with weights of the corresponding concept in the sample document, and the leaves are labelled with group labels. Such a system classifies a document $d_j$ by recursively testing the weights of the concepts that label the internal nodes against the vector $d_j$ until a leaf is reached; the label of that leaf is then assigned to $d_j$. Most classifiers of this kind use a binary representation, and the trees are also binary.

[Figure: example decision tree with root "outlook" (branches sunny, overcast, rainy) and subtrees "humidity" (high, normal) and "windy" (false, true)]

+ Entropy

Entropy measures the homogeneity of information, or the purity of a set of samples. It is a very important quantity in information theory.
Suppose the set $S$ contains positive examples (+) and negative examples (−), so that $S$ is split into two distinct classes. The entropy of $S$ is then defined as:

$$Entropy(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$

where $p_{+}$ is the proportion of positive examples in $S$ and $p_{-}$ is the proportion of negative examples in $S$, and we define $0 \log_2 0 = 0$.

In the general case, entropy is computed as:

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the proportion of $S$ belonging to class $i$ and $c$ is the number of classes. Entropy is an information-theoretic quantity measured in bits, so the logarithm is taken to base 2; as a result, entropy can exceed 1 when $c > 2$.

+ Information Gain

Entropy characterizes the homogeneity of information. A further measure is defined to quantify how much an attribute of the samples contributes to classifying them; this measure is called Information Gain. The Information Gain of an attribute $A$ on a set $S$, written $Gain(S, A)$, is defined as:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$

where $Values(A)$ is the set of possible values of attribute $A$, and $S_v$ is the subset of $S$ whose elements have $A = v$, that is, $S_v = \{s \in S \mid A(s) = v\}$. The first term, $Entropy(S)$, is the original entropy of $S$; the second term is the expected entropy after $S$ has been partitioned according to the values of $A$. $Gain(S, A)$ is therefore the expected reduction in entropy from knowing the value of $A$. Equivalently, $Gain(S, A)$ is the number of bits saved when encoding the target value of a member of $S$, given the value of attribute $A$.

+ ID3 algorithm

Decision tree learning algorithms continue to be developed and improved, but most of them rely on a top-down approach and a greedy search through the space of decision trees. ID3 and its successor C4.5 are the most popular decision tree learning algorithms.
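The entropy and Information Gain definitions above can be checked numerically with a short sketch. The data used here, the 14-row weather table with attributes outlook, temperature, humidity and windy, is an assumption brought in for illustration, and the helper names `entropy` and `gain` are hypothetical.

```python
import math
from collections import Counter

# Assumed illustration data: outlook, temperature, humidity, windy, play.
DATA = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
ATTRS = ["outlook", "temperature", "humidity", "windy"]
EXAMPLES = [dict(zip(ATTRS + ["play"], line.split()))
            for line in DATA.strip().splitlines()]


def entropy(labels):
    """Entropy(S) = -sum p_i log2 p_i over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(examples, attr, target="play"):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder


gains = {a: gain(EXAMPLES, a) for a in ATTRS}
best = max(gains, key=gains.get)  # attribute a greedy tree learner would test first
```

Only classes that actually occur in a subset contribute to the sum, so the convention $0 \log_2 0 = 0$ never has to be handled explicitly.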
The ID3 decision tree learning algorithm was developed by Quinlan and presented in the journal Machine Learning, Vol. 1, No. 1 (1986).

In brief, the steps of ID3 are:

+ Find the attribute with the highest Information Gain on the sample set.
+ Use this attribute as the root of the tree, and create one branch for each possible value of the attribute.
+ For each branch, repeat the process on the subset of the training set assigned to that branch.

The ID3 algorithm in pseudocode:

ID3(Examples, Target_attribute, Attributes)
    Create a new node Root for the tree;
    if all members of Examples are in the same class C
        Root = single-node tree with label = C;
    else if Attributes is empty
        Root = single-node tree with label = most common value of Target_attribute in Examples;
    else
        A := member of Attributes that maximizes Gain(Examples, A);
        A is decision attribute for Root;
        for each possible value v of A
            add a new branch below Root, testing for A = v;
            Examples_v := subset of Examples with A = v;
            if Examples_v is empty
                below the new branch add a leaf with label = most common value of Target_attribute in Examples;
            else
                below the new branch add subtree ID3(Examples_v, Target_attribute, Attributes - {A});
    return Root;

> Example: comparing candidate root nodes for the weather data set

[Figure: the weather data split by each candidate attribute (outlook, temperature, humidity, windy). Recoverable values: Gain(S, temperature) = 0.03, Gain(S, humidity) = 0.15, Gain(S, windy) = 0.05]

The attribute outlook has the largest Information Gain, so it is chosen as the root of the decision tree.

[Figure: the subset with outlook = sunny split by each remaining attribute. Recoverable values: Gain(S, temperature) = 0.57, Gain(S, windy) = 0.02]

The attribute humidity has the highest Information Gain on this subset, so it is chosen as the root of the subtree for outlook = sunny. Continuing in the same way, we finally obtain the decision tree for the weather data set:

[Figure: final decision tree. Root "outlook"; the sunny branch leads to "humidity" (high, normal); the overcast branch leads to the leaf "yes"; the rainy branch leads to "windy" (false, true)]

7.2 K-Nearest Neighbor algorithm

k-NN is a machine learning method known to be effective in many fields, especially text classification. The main idea is to compute how well the document under consideration fits each topic, based on the k sample documents most similar to it. The algorithm is also used in text retrieval and text summarization.

The first question is what "near" means here and which formula measures nearness. The second question is how, once the k nearest documents have been found, to determine the group that fits the document best.

"Near" is understood here as the similarity between documents. There are many ways to compute the similarity between two documents; the weighted cosine measure is the most widely used. Documents are represented as vectors, $T = \{t_1, t_2, \ldots, t_n\}$ is the set of terms (or concepts), and $W = \{w_1, w_2, \ldots, w_n\}$ is the weight vector, where $w_i$ is the weight of term $t_i$. Consider two documents $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, where $x_i$ and $y_i$ are the frequencies of term $t_i$ in $X$ and $Y$ respectively.
The similarity between the two documents $X$ and $Y$ is then computed as:

$$sim(X, Y) = cosine(X, Y, W) = \frac{\sum_{t \in T} (x_t w_t)(y_t w_t)}{\sqrt{\sum_{t \in T} (x_t w_t)^2} \, \sqrt{\sum_{t \in T} (y_t w_t)^2}}$$

The components $x_t$ and $y_t$ of the vectors $X$ and $Y$ are usually normalized using the term frequency of the term in $X$ and $Y$ (TF; see the formulas in section 2.2a). The vector $W$ is either set by hand or computed by some greedy algorithm. Some proposals compute $W$ from the inverse document frequency IDF, in which case the documents are represented as TF × IDF vectors.

The second question, as stated above, is how to score each topic once the k nearest documents have been found. There are many ways to do this based on the similarity between documents; three of them receive the most attention.

Labelling by the nearest document

In this method the document under consideration receives the topic of the single most similar document. The approach is simple and reasonably effective, but it is not highly regarded because it gives wrong results when the sample set contains noise. A further weakness is that the result does not aggregate information from several neighbours.

Labelling by majority

To illustrate, suppose document $d$ has the five nearest documents $d_1, d_2, d_3, d_4, d_5$, with topics and similarities as follows:

 | d1 | d2 | d3 | d4 | d5
Similarity | … | … | … | … | 0.6
Topic | 1 | 2 | 2 | 2 | 3

Intuitively, $d$ should be labelled with topic 2, because three of the neighbours belong to that class. Choosing by the nearest document alone could go wrong if the sample set contains noise or errors. Majority labelling therefore compensates for errors in the sample set, but it is still not highly regarded in some situations, such as the following:

 | d1 | d2 | d3 | d4 | d5
Similarity | … | … | 0.3 | 0.3 | 0.2
Topic | 1 | 1 | 2 | 2 | 2

In this case topic 1 should be chosen as the label for document $d$.
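The weighted cosine measure and the two labelling rules described so far can be sketched as follows. This is a minimal illustration, not the report's implementation: the function names are hypothetical, and the similarity values in the two neighbour lists are invented for the example, because the values in the tables above are only partly legible.

```python
import math
from collections import Counter


def weighted_cosine(x, y, w):
    """Cosine similarity of x and y after scaling each component by its weight in w."""
    num = sum((xi * wi) * (yi * wi) for xi, yi, wi in zip(x, y, w))
    norm_x = math.sqrt(sum((xi * wi) ** 2 for xi, wi in zip(x, w)))
    norm_y = math.sqrt(sum((yi * wi) ** 2 for yi, wi in zip(y, w)))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return num / (norm_x * norm_y)


def label_by_nearest(neighbours):
    """neighbours: list of (similarity, topic). Take the topic of the closest one."""
    return max(neighbours, key=lambda pair: pair[0])[1]


def label_by_majority(neighbours):
    """Take the topic that occurs most often among the k neighbours."""
    return Counter(topic for _, topic in neighbours).most_common(1)[0][0]


# Hypothetical similarity values; only the topic rows follow the tables above.
case_1 = [(0.9, 1), (0.8, 2), (0.8, 2), (0.7, 2), (0.6, 3)]
case_2 = [(0.9, 1), (0.8, 1), (0.3, 2), (0.3, 2), (0.2, 2)]

nearest_1, majority_1 = label_by_nearest(case_1), label_by_majority(case_1)
nearest_2, majority_2 = label_by_nearest(case_2), label_by_majority(case_2)
```

With these invented numbers, the first case shows majority voting overriding a single close neighbour (topic 2 rather than topic 1), and the second case shows its weakness: the vote still returns topic 2 even though the two clearly most similar neighbours belong to topic 1.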
Labelling by topic relevance

The relevance between a document $d$ and a topic $c$ is computed as:

$$sim(d, c) = \sum_{d_i \in c} sim(d, d_i)$$

where the sum runs over those of the k nearest documents that belong to topic $c$. The method computes the relevance of $d$ to each topic among the k retrieved documents in turn, and then labels $d$ with the most relevant topic.

3.2 Text clustering problem

In text classification, the task is to assign an unclassified document to one or more predefined groups, using the content of the document and the attributes of the groups. In text clustering, the attributes of the groups are not known; more precisely, the groups have not yet been defined. The task is to form a certain number of groups from a given collection of documents so that the grouping is as appropriate as possible.

3.2.1 Introduction

Problem: Given a collection of documents and a natural number $n$, the system must partition the documents into $n$ groups so that the grouping is as appropriate as possible.

Text clustering is applied less often than text classification and text summarization. It is nevertheless an interesting problem that poses many challenges. Text clustering systems can be used in libraries and newspaper offices. In practice, records clerks usually know how many groups their organization's document store should be divided into, and each group has characteristics exemplified by a few documents; applying text classification then gives them the grouping they want. But the archivist does not always know how many groups the store should have, and the boundaries between groups can be very hard to pin down. In that case an automatic text clustering system is needed to help people separate the characteristics of the groups, or to suggest a suitable way of separating them.

3.2.2 Text clustering methods

Many algorithms have been studied for the text clustering problem.
These algorithms fall into two main groups:

> Hierarchical Clustering
    > Hierarchical Bayesian algorithm
    > Similarity-based merging algorithm
> Partitional Clustering: K-means

a. Hierarchical Bayesian algorithm

The HBC algorithm is regarded as the best agglomerative algorithm for the clustering problem. Like most other agglomerative algorithms, HBC builds a hierarchy from the bottom up by merging two clusters into one at every step. At initialization there are $N$ clusters, each containing a single element. After $N - 1$ steps a single cluster containing the whole data set remains.

At each merge, HBC chooses the pair of clusters that maximizes the potential function (posterior probability) $P(C \mid D)$, where $D$ is the collection of documents to be clustered ($D = \{d_1, d_2, \ldots, d_N\}$) and $C$ is the set of clusters ($C = \{c_1, c_2, \ldots\}$). Each cluster $c_j \in C$ is a set of data items representing one group. At initialization we set $c_i = \{d_i\}$. $P(C \mid D)$ measures how likely it is that the data set $D$ is partitioned into the set of clusters $C$.

To see the merge operation in detail, consider merge step $k + 1$ ($0 \le k \le N - 1$). At step $k + 1$, the data set $D$ is partitioned into the set of clusters $C_k$, and each item $d \in D$ belongs to some cluster $c \in C_k$. The potential function is then:

$$P(C_k \mid D) = \prod_{c \in C_k} \prod_{d \in c} P(c \mid d) = \prod_{c \in C_k} \prod_{d \in c} \frac{P(d \mid c)\,P(c)}{P(d)} = \frac{\prod_{c \in C_k} P(c)^{|c|}}{P(D)} \prod_{c \in C_k} \prod_{d \in c} P(d \mid c) = \frac{PC(C_k)}{P(D)} \prod_{c \in C_k} SC(c)$$

Here $PC(C_k)$ is the probability that the $N$ data items are grouped into the set of clusters $C_k$. It is defined as:

$$PC(C_k) = \prod_{c \in C_k} P(c)^{|c|}$$

$SC(c)$ is the probability of all the data in cluster $c$ given that cluster, and is defined as:

$$SC(c) = \prod_{d \in c} P(d \mid c)$$

After two clusters $c_x, c_y \in C_k$ are merged, the set of clusters is updated as follows:

$$C_{k+1} = C_k - \{c_x, c_y\} + \{c_x \cup c_y\}$$

The potential function is then recomputed as:

$$P(C_{k+1} \mid D) = \frac{PC(C_{k+1})}{PC(C_k)} \cdot \frac{SC(c_x \cup c_y)}{SC(c_x)\,SC(c_y)} \cdot P(C_k \mid D)$$

Finally, consider the probability $P(d \mid c)$, defined as the probability that data item $d$ belongs to cluster $c$.
By Bayes' theorem we have:

$$P(d \mid c) = P(d) \sum_{t} \frac{P(T = t \mid c)\,P(T = t \mid d)}{P(T = t)}$$

where:

+ $P(T = t \mid d)$ is the relative frequency of concept $t$ in document $d$
+ $P(T = t \mid c)$ is the relative frequency of concept $t$ in cluster $c$
+ $P(T = t)$ is the relative frequency of concept $t$ in the whole database

The HBC algorithm is expressed in programming-language form in Figure 2.

Input: D = {d_1, d_2, ..., d_N}
Initialize: C_0 = {c_1, c_2, ..., c_N}
    c_i = {d_i} for 1 <= i <= N
    compute SC(c_i) for 1 <= i <= N
    compute SC(c_i ∪ c_j) for 1 <= i < j <= N
for k = 1 to N - 1 do
    (c_x, c_y) = the pair maximizing SC(c_x ∪ c_y) / (SC(c_x) SC(c_y))
    C_k = C_{k-1} - {c_x, c_y} + {c_x ∪ c_y}
    compute SC(c_x ∪ c_z) for all c_z in C_k with z ≠ x
Function SC(c)
    return the product of P(d | c) over all d in c

Figure 2: The HBC algorithm

b. Similarity-based merging algorithm

Like HBC, the similarity-based merging algorithm is agglomerative; the only difference is how the clusters to be merged are chosen. At the start each document is assigned to its own cluster, so there is one cluster per document. At every step the two clusters that are most similar to each other are merged, and the number of clusters decreases by 1. Merging continues until exactly $k$ clusters remain.

The crux of the algorithm is how to compute the similarity between two clusters.

Similarity between two clusters of documents

The simplest way to compute the similarity between two clusters is to use the similarity of the document vectors. Let $c_1$ and $c_2$ be two clusters of documents. Their similarity is computed as:

$$sim(c_1, c_2) = \frac{1}{|c_1|\,|c_2|} \sum_{d_1 \in c_1,\; d_2 \in c_2} sim(d_1, d_2)$$

Here $sim(d_1, d_2)$ can be computed with the cosine formula or with the Euclidean distance between the two document vectors.

The similarity-based hierarchical algorithm is then written as follows:

Input: D = {d_1, d_2, ..., d_N}
Initialize: C_0 = {c_1, c_2, ..., c_N}, c_i = {d_i} for …

> Collect information from the Internet: in this step we build a web crawler to create the database.
> Apply the clustering algorithms studied above to the data set just collected.

2.1 Web Crawler

a. Introduction

Broadly speaking, a web crawler is a program that discovers web pages on the Internet and downloads them.
A crawler usually starts from an initial set of URLs $S_0$. It places $S_0$ in a queue, where URLs are ordered and later crawled. The crawler takes URLs from the queue one at a time, in an order that depends on the implementation, downloads the corresponding web pages, stores them if necessary, extracts the other URLs found on each downloaded page, and adds them to the queue. This continues until the crawler decides to stop, either because the queue is empty or for some other reason.

The crawler is an extremely important component: the data it downloads is used to build the data repository, to build the index, and to serve later needs.

[Figure: a web search system showing the Web, the spider (robot) on the back end, the database of web sites, and the search engine on the front end]

b. Order in which URLs are crawled

The crawler stores the URLs it encounters while crawling in a queue, then takes them out according to some rule in order to download the corresponding pages. There are four crawling strategies:

+ Breadth-first crawling: the crawler first visits all the nearby pages, those reachable from the starting page through few intermediate links, before moving on to pages further away. Breadth-first search is the method used by most crawlers today.
+ Depth-first crawling: the crawler follows the first link of each page in turn. It continues until there are no more links, then backtracks to the second link of the previous page.
+ Random crawling: URLs are selected at random. The crawler picks URLs to visit randomly; all URLs are treated equally, and nothing influences the order in which they are visited.
+ Priority crawling: the crawler uses a ranking system to choose which URLs to visit. It usually picks the URL with the highest rank first.
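The difference between these strategies comes down to how the next URL is taken from the frontier. The sketch below makes that concrete on an in-memory link graph instead of the real Web; the function `crawl_order`, the toy graph `LINKS`, and the `SCORE` ranking are all assumptions made for illustration, and random crawling is left out so that the result stays deterministic.

```python
import heapq
from collections import deque


def crawl_order(links, start, strategy="breadth", score=None):
    """Return the order in which pages are visited under the given strategy.

    links maps a URL to the list of URLs found on that page (a stand-in for
    downloading and parsing); score maps a URL to its rank for "priority".
    """
    visited, order = set(), []
    if strategy == "priority":
        frontier = [(-score[start], start)]  # max-rank first via negated keys
    else:
        frontier = deque([start])
    while frontier:
        if strategy == "breadth":
            url = frontier.popleft()          # FIFO queue
        elif strategy == "depth":
            url = frontier.pop()              # LIFO stack
        else:
            url = heapq.heappop(frontier)[1]  # highest rank
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        found = [u for u in links.get(url, []) if u not in visited]
        if strategy == "depth":
            found.reverse()  # so the first link on the page is followed first
        for u in found:
            if strategy == "priority":
                heapq.heappush(frontier, (-score[u], u))
            else:
                frontier.append(u)
    return order


# Hypothetical link graph and ranking.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": []}
SCORE = {"A": 0, "B": 1, "C": 5, "D": 2, "E": 4}

bfs = crawl_order(LINKS, "A", "breadth")
dfs = crawl_order(LINKS, "A", "depth")
ranked = crawl_order(LINKS, "A", "priority", SCORE)
```

The `visited` set also keeps the crawler from fetching the same page twice when several pages link to it, as happens with page D in the toy graph.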
c. Some issues to note about web crawlers

While crawling web pages, a crawler must pay attention to the following issues:

+ Avoiding redundancy: every web page usually contains links to other pages, so a careless crawler is certain to fall into an infinite loop. This is easy to avoid by keeping a list of the URLs already found and checking the list before following a new link.
+ Telling resources apart: not every link leads to an HTML page. The crawler may come across links to image files or program files, and it must be able to recognize them.
+ Limiting the scope: since a web site can link to other web sites, the crawler may jump to other web servers. By limiting the scope, the crawler can be programmed to follow only links within the same web server or the same domain.
+ Limiting the depth: the maximum depth to which the crawler may follow links.

d. Algorithm used for the web crawler

+ Flowchart

[Flowchart: CRAWL(URL): acquire URL; scan URL for links; if any links remain, call CRAWL(link URL) recursively and advance to the next link; otherwise stop]

+ Algorithm

Receive the input from the user: the starting URL. Put this URL into the (currently empty) list of URLs to be searched.
While the list of URLs to be searched is not empty
{
    Take the first URL in the list.
    Move the URL to the list of URLs already searched.
    Check the URL to make sure its protocol is http (if not, leave this iteration and return to the while loop).
    Check whether the site has a robots.txt file containing a "Disallow" statement. If so, leave this iteration and return to the while loop.
    Open the URL to fetch the document from the web page.
    If it is not an html file, leave this iteration and return to the while loop.
    While the html text still contains links
    {
        Validate the URL of the link, and make sure that robots are allowed.
        If it is an html file:
            if the URL is in neither the list of URLs to be searched nor the list of URLs already searched, add it to the list of URLs to be searched.
        Otherwise, if it is a type of file the user asked for, add it to the list of files found.
    }
}

2.2 Applying the clustering algorithms to the collected data set

2.2.1 Steps for building the document vectors

a. Word splitting

The input to this step is the set of text files (paths). The output is one word list per file.

The work is done as follows: load the text files (paths) and split each file into a list of single words (a word being an English word made up of a string of characters a to z or A to Z), with words separated by any special character that is not a letter (space, comma, full stop, and so on). This gives the word lists keyword1, keyword2, ... for the first text file, the second, and so on.

The output of this step is the input to the second step, stopword removal.

b. Stopword removal

The input is the word lists. At the end of this step, every word that is a stopword has been removed from the lists keyword1, keyword2, ... The stopwords are listed in an existing file containing about 541 words.

c. Stemming

An English word may be derived from another word, so the two have very similar meanings. For example, given the words "learn", "learned", and "learning", we strip the suffixes and keep only the stem "learn". The stripping follows predefined rules such as Porter's.

d. Sorting the keywords

Sort the elements of the lists keyword1, keyword2, ... alphabetically, then store them in new lists FT1, FT2, ..., in which each element consists of the field word (the content of the word) and TF, the number of occurrences of the word in keyword1, keyword2, ... (that is, in the original text files).
Cac phan tl trong cc dann séch FT1, FT2,...c6 noi dung cba lrudng word f& khdc nhau (céc danh sch nay Iuu trl cae tir khae ahau, thye hién bang céch loai bé cdc tir giéng nhau trong timg keyword va leu s6 lan xuAt hign cba 16). e. Xay dyng bag-of.words Tao mét danh séch FT cée phin tir I cae xau 66 néi dung cla cae word trong tat cfc dann sdch FT1, FT2,..b&ng céch ghép cée danh sch nay v6i nhau. Sau 66 sp xép cdc phan tt trong danh séch nay theo thes ty bang chi edi Tao mot mang cc ban ghi F, méi ban ghi gdm hai thanh phan la word (ndi dung cae word trong FT) va df (s6 ln tir d6 xual hign trong FT), bang cach lai b6 ce phn tir c6 word giéng nhau trong FT va lvu sé lan xuat hign cla tl 46, Mang F sé luu tré tat ca cc tir khéc nhau ddi mét trong tat ca cac file van ban dura vao cing voi s6 fle van ban chita tr do. Sau khi thye hign xong bude nay, img véi méi van ban, ta c6 céc danh sach Wr FTI tong ding, Gdng thoi, v6i tat ca ede fle vo nay, ta G6 mot danh sich tir chung ar. +. Bidu din timg file vin ban thanh eée vector (TFIDF) MBi danh sach trong céc dann séch FT1, FT2,... $6 6g biéu din bang céc vecto la cde méng tung Ung d1, d2,...ma cac phan tt cla mang sé tung Ung voi cac phan ti trang mng F. struct node { char “word; Pdi dung ti" int count 136 lan wut hign struct node “next; y Bigu din van ban béi mét vector trong 84, la mét mang sé thyc, s6 phan tir trong mang bang sé phén tu: trong danh sach Keyword, céc phan tir trong mang nay tung Ung voi cée phan ti trong mang F (phan WW th nhat cla mang éuge tinh theo trong 86 tung Gg cia phan tt thir nhl trong F. Chang han voi dt For i:=1 to [F] f |F| la sé phan ti cia mang F * If Fil] = FT( [word] then 1 (i)=TF(m, d1).IDF(wi) FFU] # FT [word] then d1()= 0 trong 66: TF(wi, ¢) la 6 lan tir wi xudt hign trong tai ligu d va: D ri) = one prm=Sf! 
In the IDF formula, DF(w_i) is the number of documents in which w_i occurs at least once.

+ The stemming algorithm

Removing the suffix of a word automatically is an operation that is especially useful in information retrieval (IR). A typical IR environment contains a collection of documents, each described by the words of its title and possibly by the words of its abstract. If we ignore exactly where those words come from and how they are arranged, we can say that a document is represented by a vector of words, or terms. Many terms carry similar meanings, for example:

    CONNECT
    CONNECTED
    CONNECTING
    CONNECTION
    CONNECTIONS

The performance of an IR system usually improves when such groups of terms are conflated into a single stem. This is normally done by removing suffixes such as ED, ING, ION, IONS, ...; in the example above the result is CONNECT.

To present the algorithm we need a few definitions.
A consonant is any letter other than A, E, I, O, U, and other than Y preceded by a consonant; it is denoted by c.
A vowel is any letter that is not a consonant; it is denoted by v.
We write C = ccc... for a sequence of consonants of length greater than 0, and V = vvv... for a sequence of vowels of length greater than 0.
Any word, or part of a word, then has one of four forms:

    CVCV...C
    CVCV...V
    VCVC...C
    VCVC...V

or, in general form:

    [C](VC){m}[V]

- [C] and [V] mean that C or V may or may not be present.
- (VC){m} means that the group VC is repeated m times (m >= 0).

A rule for changing a suffix is given in the form:

    (condition) S1 -> S2

meaning that if a word ends with the suffix S1 and the stem before S1 satisfies the given condition, S1 is replaced by S2. Besides tests on m, the condition may contain:

    *S     the stem ends with S (and similarly for other letters)
    *v*    the stem contains a vowel
    *d     the stem ends with a double consonant
    *o     the stem ends cvc, where the second consonant is not W, X or Y

The condition may also be an expression built with the logical operators and, or, not.

The algorithm

Step 1A
    Condition    S1 -> S2        Examples
                 SSES -> SS      caresses -> caress
                 IES  -> I       ponies -> poni; ties -> ti
                 SS   -> SS      caress -> caress
                 S    ->         cats -> cat

Step 1B
    Condition    S1 -> S2        Examples
    (m>0)        EED -> EE       feed -> feed; agreed -> agree
    (*v*)        ED  ->          plastered -> plaster; bled -> bled
    (*v*)        ING ->          motoring -> motor; sing -> sing

If the second or third rule of Step 1B succeeds, the following rules are then applied:

    Condition                        S1 -> S2            Examples
                                     AT -> ATE           conflat(ed) -> conflate
                                     BL -> BLE           troubl(ed) -> trouble
                                     IZ -> IZE           siz(ed) -> size
    (*d and not (*L or *S or *Z))    -> single letter    hopp(ing) -> hop; tann(ed) -> tan; fall(ing) -> fall; hiss(ing) -> hiss; fizz(ed) -> fizz
    (m=1 and *o)                     -> E                fail(ing) -> fail; fil(ing) -> file

The rule that maps to a single letter removes one letter of a double-letter pair. The -E is put back on -AT, -BL and -IZ so that the suffixes -ATE, -BLE and -IZE can be recognized in later steps. This E may be removed again in Step 4.

Step 1C
    Condition    S1 -> S2        Examples
    (*v*)        Y -> I          happy -> happi; sky -> sky

Step 1 handles plurals and past participles.
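The definitions above (consonant and vowel, the measure m, the conditions *v* and *o) together with Step 1A can be sketched in C++ as follows. This is a minimal illustration that assumes lowercase ASCII input; the function names are hypothetical and only Step 1A of the rule set is covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if word[i] is a consonant: any letter other than a, e, i, o, u,
// and other than y preceded by a consonant.
bool is_consonant(const std::string& word, std::size_t i) {
    assert(i < word.size());
    switch (word[i]) {
        case 'a': case 'e': case 'i': case 'o': case 'u':
            return false;
        case 'y':
            return i == 0 || !is_consonant(word, i - 1);
        default:
            return true;
    }
}

// Measure m of a stem written as [C](VC){m}[V]: the number of times a
// run of vowels is followed by a consonant.
int measure(const std::string& stem) {
    int m = 0;
    bool in_vowel_run = false;
    for (std::size_t i = 0; i < stem.size(); ++i) {
        if (is_consonant(stem, i)) {
            if (in_vowel_run) ++m;  // one more VC group completed
            in_vowel_run = false;
        } else {
            in_vowel_run = true;
        }
    }
    return m;
}

// Condition *v*: the stem contains a vowel.
bool contains_vowel(const std::string& stem) {
    for (std::size_t i = 0; i < stem.size(); ++i) {
        if (!is_consonant(stem, i)) return true;
    }
    return false;
}

// Condition *o: the stem ends cvc and the last consonant is not w, x or y.
bool ends_cvc(const std::string& stem) {
    std::size_t n = stem.size();
    if (n < 3) return false;
    char last = stem[n - 1];
    return is_consonant(stem, n - 3) && !is_consonant(stem, n - 2) &&
           is_consonant(stem, n - 1) &&
           last != 'w' && last != 'x' && last != 'y';
}

bool ends_with(const std::string& word, const std::string& suffix) {
    return word.size() >= suffix.size() &&
           word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Step 1A: SSES -> SS, IES -> I, SS -> SS, S -> (nothing).
// The suffixes are tested longest first, so only one rule fires.
std::string step1a(const std::string& word) {
    if (ends_with(word, "sses")) return word.substr(0, word.size() - 2);
    if (ends_with(word, "ies"))  return word.substr(0, word.size() - 2);
    if (ends_with(word, "ss"))   return word;
    if (ends_with(word, "s"))    return word.substr(0, word.size() - 1);
    return word;
}
```

For instance, measure gives 0 for "tree", 1 for "trouble" and 2 for "private", and step1a maps "caresses" to "caress" and "ponies" to "poni", as in the Step 1A table.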
The remaining steps are simpler than Step 1.

Step 2
    Condition    S1         S2      Examples
    (m>0)        ATIONAL -> ATE     relational -> relate
    (m>0)        TIONAL  -> TION    conditional -> condition; rational -> rational
    (m>0)        ENCI    -> ENCE    valenci -> valence
    (m>0)        ANCI    -> ANCE    hesitanci -> hesitance
    (m>0)        IZER    -> IZE     digitizer -> digitize
    (m>0)        ABLI    -> ABLE    conformabli -> conformable
    (m>0)        ALLI    -> AL      radicalli -> radical
    (m>0)        ENTLI   -> ENT     differentli -> different
    (m>0)        ELI     -> E       vileli -> vile
    (m>0)        OUSLI   -> OUS     analogousli -> analogous
    (m>0)        IZATION -> IZE     vietnamization -> vietnamize
    (m>0)        ATION   -> ATE     predication -> predicate
    (m>0)        ATOR    -> ATE     operator -> operate
    (m>0)        ALISM   -> AL      feudalism -> feudal
    (m>0)        IVENESS -> IVE     decisiveness -> decisive
    (m>0)        FULNESS -> FUL     hopefulness -> hopeful
    (m>0)        OUSNESS -> OUS     callousness -> callous
    (m>0)        ALITI   -> AL      formaliti -> formal
    (m>0)        IVITI   -> IVE     sensitiviti -> sensitive

Step 3
    Condition    S1         S2      Examples
    (m>0)        ICATE   -> IC      triplicate -> triplic
    (m>0)        ATIVE   ->         formative -> form
    (m>0)        ALIZE   -> AL      formalize -> formal
    (m>0)        ICITI   -> IC      electriciti -> electric
    (m>0)        ICAL    -> IC      electrical -> electric
    (m>0)        FUL     ->         hopeful -> hope
    (m>0)        NESS    ->         goodness -> good

Step 4
    Condition                  S1         S2    Examples
    (m>1)                      AL      ->       revival -> reviv
    (m>1)                      ANCE    ->       allowance -> allow
    (m>1)                      ENCE    ->       inference -> infer
    (m>1)                      ER      ->       airliner -> airlin
    (m>1)                      IC      ->       gyroscopic -> gyroscop
    (m>1)                      ABLE    ->       adjustable -> adjust
    (m>1)                      IBLE    ->       defensible -> defens
    (m>1)                      ANT     ->       irritant -> irrit
    (m>1)                      EMENT   ->       replacement -> replac
    (m>1)                      MENT    ->       adjustment -> adjust
    (m>1)                      ENT     ->       dependent -> depend
    (m>1 and (*S or *T))       ION     ->       adoption -> adopt
    (m>1)                      OU      ->       homologou -> homolog
    (m>1)                      ISM     ->       communism -> commun
    (m>1)                      ATE     ->       activate -> activ
    (m>1)                      ITI     ->       angulariti -> angular
    (m>1)                      OUS     ->       homologous -> homolog
    (m>1)                      IVE     ->       effective -> effect
    (m>1)                      IZE     ->       bowdlerize -> bowdler

The suffixes have now been removed.

Step 5A
    Condition             S1      S2    Examples
    (m>1)                 E    ->       probate -> probat; rate -> rate
    (m=1 and not *o)      E    ->       cease -> ceas

Step 5B
    Condition                  S1 -> S2            Examples
    (m>1 and *d and *L)        -> single letter    controll -> control; roll -> roll

The algorithm does not remove a suffix when the stem of the word is too short.
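The rule tables of Steps 2 to 4 share one shape: a lower bound on m, a suffix S1 and a replacement S2. The following C++ sketch shows one way such a table could be applied. It assumes lowercase input and tables ordered so that a longer suffix comes before any shorter suffix it contains (as in the tables above), so that the first match is also the longest; the names Rule, apply_rules and sample_step2_rules are hypothetical, and extra conditions such as (*S or *T) are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row of a rule table: the rule fires when word ends with s1 and
// the measure m of the remaining stem is greater than min_m.
struct Rule {
    int min_m;
    std::string s1;  // suffix to match
    std::string s2;  // replacement (may be empty)
};

// Consonant test: not a, e, i, o, u, and not y preceded by a consonant.
bool is_consonant(const std::string& w, std::size_t i) {
    switch (w[i]) {
        case 'a': case 'e': case 'i': case 'o': case 'u': return false;
        case 'y': return i == 0 || !is_consonant(w, i - 1);
        default:  return true;
    }
}

// Measure m: number of VC groups in [C](VC){m}[V].
int measure(const std::string& stem) {
    int m = 0;
    bool in_vowel_run = false;
    for (std::size_t i = 0; i < stem.size(); ++i) {
        if (is_consonant(stem, i)) {
            if (in_vowel_run) ++m;
            in_vowel_run = false;
        } else {
            in_vowel_run = true;
        }
    }
    return m;
}

// Finds the first rule whose suffix matches. If its condition on m holds,
// the suffix is replaced; otherwise the word is returned unchanged and no
// further rule is tried.
std::string apply_rules(const std::string& word, const std::vector<Rule>& rules) {
    for (const Rule& r : rules) {
        assert(!r.s1.empty());
        if (word.size() >= r.s1.size() &&
            word.compare(word.size() - r.s1.size(), r.s1.size(), r.s1) == 0) {
            std::string stem = word.substr(0, word.size() - r.s1.size());
            return measure(stem) > r.min_m ? stem + r.s2 : word;
        }
    }
    return word;
}

// A few rows of the Step 2 table, for illustration only.
std::vector<Rule> sample_step2_rules() {
    return {
        {0, "ational", "ate"},
        {0, "tional", "tion"},
        {0, "ization", "ize"},
        {0, "ation", "ate"},
    };
}
```

With these sample rows, "relational" becomes "relate", while "rational" is left unchanged because the stem "r" before -ATIONAL has m = 0. Keeping the rules as data rather than as chained if-statements makes it straightforward to add the remaining rows of each table.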
Whether a stem counts as short or long is determined by its measure m.

2.2.2 Applying the clustering algorithms

Once the documents have been represented as vectors, clustering algorithms can be applied to analyze them, namely:

+ the Bayesian hierarchical algorithm
+ the K-means algorithm

However, because the internship period was limited, I have so far built only the Web Crawler module and the representation of documents as vectors in the term space; the clustering algorithms have not yet been implemented.
