
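SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
DEPARTMENT OF SOFTWARE ENGINEERING

GRADUATION INTERNSHIP REPORT
Topic: Cloud computing, MapReduce, and their application to building a search system based on user requirements
Supervisors: Assoc. Prof. Dr. Huỳnh Quyết Thắng, MSc. Lê Quốc
Department of Software Engineering, School of Information and Communication Technology, Hanoi University of Technology
Student: Nguyễn Văn Đông Anh, 20060102

Hanoi, 02/2011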

Table of Contents
1. Search engines
   a. Search process
   b. Criteria for a search engine
2. Solr
   a. What is Solr
   b. Why Solr
   c. How it works
      i. Index files
      ii. Indexing process
      iii. Search process
3. Solr in the BKProfile project
   a. What is BKProfile
   b. Designing the index record structure
   c. Improving search quality
      i. Weighting
      ii. Grouping frequently occurring phrases
4. Demo

Cloud computing, MapReduce, and their application to building a search system based on user requirements

1. Search engines
a. Search process
A search consists of four basic steps:
- The user issues a search query, asking the search engine to look for certain keywords.
- The search engine processes the query.
- The search engine looks up the keywords in its prebuilt index.
- The search engine scores the matches, sorts them by relevance to the request, and returns the results to the user.

b. Criteria for a search engine
A search engine has to satisfy many criteria:
- Accurate results.
- Relevance ordering: the better a result matches the user's request, the higher it is ranked.
- Speed.
- Easy customization: for developers, a search engine is considered good if properties inside its core can easily be added, removed, and configured. The engine should also let developers trace the search process and the processing of a user's query, so that they can tune the system and improve the relevance of the returned results.
- Distribution: because the volume of information is enormous and grows every day, a search engine must be able to run in a distributed fashion.
- Other functions:
  o Highlighting of matches in the returned results.
  o Cluster-based search: the user can narrow the criteria step by step, from broad groups down to narrower ones, until suitable results are found.
  o Synonyms: the engine can find words that have the same meaning as the keywords the user entered.
  o Stemming: the engine can find words that are the root forms of the words in the user's keywords.
  o Spell checking: the engine checks the user's spelling and suggests searching with correctly spelled keywords.
  o Stopwords: every language has words that carry little meaning (for example interjections in Vietnamese, or words such as a, the, not, but in English); the engine should be able to ignore them.
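The four steps of the search process can be illustrated with a small, self-contained sketch. It is not how Lucene or Solr is implemented: the scoring formula is a simplified TF-IDF chosen for illustration, and the sample documents and function names are assumptions of this example.

```python
import math
from collections import Counter, defaultdict

# The English stopword examples named in the text.
STOPWORDS = {"a", "the", "not", "but"}

def analyze(text):
    """Query/document processing: lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_index(docs):
    """Prebuilt index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(analyze(text)).items():
            index[term][doc_id] = freq
    return index

def search(index, n_docs, query):
    """Look up every query term, score matches, and sort by descending score."""
    scores = defaultdict(float)
    for term in analyze(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sample data.
docs = {
    1: "the search engine builds an index",
    2: "a search server ranks search results",
    3: "cloud computing and MapReduce",
}
index = build_index(docs)
print(search(index, len(docs), "the search results"))  # [2, 1]
```

Document 2 is ranked first because it contains "search" twice and is the only one containing "results"; document 3 matches nothing and is not returned.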

2. Solr
a. What is Solr
Solr is a text search server with very fast execution. It is built on the Lucene search core, a search library that provides the following capabilities:

Nguyễn Văn Đông Anh - CNPM K51 - Hanoi University of Technology


- Inverted indexing
- Text analysis
- A good scoring algorithm

Solr is written entirely in Java and runs inside a servlet container such as Tomcat or Jetty. Its APIs are based on XML or JSON, which makes it easy to interact with from many other programming languages. Solr is configured from outside the system by editing its configuration files (XML).

Other components that Solr provides:
- Synonyms
- Highlighting of returned results
- Distribution
- Direct integration with databases (MySQL, MSSQL) to fetch data

Two processes exist in Solr:
- Indexing: builds the data set used by the search server.
- Querying: searches within the data set of the search server.

b. Why Solr
Many search servers are available today. After studying and comparing them on the time needed for indexing, the time needed to execute a query, the number of queries executed per unit of time, and the amount of disk space occupied by the index, Solr proved to be an excellent candidate for a search engine.
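As an illustration of the JSON interface described in section a, the sketch below parses a response body shaped like Solr's JSON output. The sample body is hand-written for this example, and the document fields (profile_id, name) are assumptions rather than the project's actual data.

```python
import json

# Illustrative response body, shaped like Solr's JSON output. The fields inside
# "docs" (profile_id, name) are assumptions made for this example.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"profile_id": "p1", "name": "dong anh"},
      {"profile_id": "p2", "name": "anh dong"}
    ]
  }
}
"""

def extract_hits(body):
    """Return (total number of hits, list of returned documents)."""
    data = json.loads(body)
    result = data["response"]
    return result["numFound"], result["docs"]

total, hits = extract_hits(SAMPLE_RESPONSE)
print(total, [doc["name"] for doc in hits])
```

Because the response is plain JSON, the same few lines can be written in almost any client language, which is the interoperability advantage noted above.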


c. How it works
Solr works like any other search engine:
- It builds the index data through the indexing process.
- It executes the user's search query against the index data and returns the corresponding results to the user.


i. Index files
Each record in Solr is called a doc. A doc has many fields, and each field has its own type (int, string, ...). The index files contain the following information:
- Field names: the set of all fields that occur in any record of the Solr index.
- Field values: the values returned when the user queries the data set.
- Term dictionary: contains every term in the index data, together with the number of records (docs) that contain the term and pointers to the term's frequency data and position data.
  o Frequency data: for each term in the dictionary, the number of records containing the term and the frequency of the term in each of those records.
  o Position data: for each term in the dictionary, the positions at which the term occurs in each record.
- Normalization factors: applied to each field of a record, they determine the multiplier used for that field when results are scored. They make it possible to give one field a higher weight than another.
- Term vectors: for each field, the terms of that field are stored in a vector.


- Deleted records: marks the records in the index that have been deleted. They are physically removed from the index files later, when the index is optimized.

Inverted indexing: the index data holds statistics about terms so that searching for a term is efficient. "Inverted" means that for each term the index stores the records that contain it, rather than storing the terms of each record.

Record (document): Solr uses a document-oriented store, so no relationships exist between records. All information about one entity is stored in a single record, and that record is stored in the index. A record consists of many fields.

Field: each record has fields that hold its data. For example, information about a person includes family name, given name, age, home town, and so on. Each field has a definite data type, either primitive (int, string, ...) or defined by the developer. There are also several special fields and settings:
- CopyField: takes the content of one field (source) and writes it into another field (destination).
- DynamicField: lets the user avoid naming each field explicitly by using a wildcard pattern (such as *).
- UniqueKey: specifies which field identifies a record.
- DefaultSearchField: the field that is searched by default.
- DefaultOperator: the operator used by default to combine the keywords in the user's search request.

Record type file (schema): records must follow a predefined structure that states which fields exist, what their data types are, and so on. The fields, their types, and the other information that makes up a record are stored in a configuration file (the schema). This file is in XML format and is loaded when the server starts.
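The term dictionary, frequency data, and position data described above can be sketched with nested dictionaries. This is a simplified in-memory model written for illustration, not Lucene's on-disk format; the function names and sample records are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Return {term: {doc_id: [positions]}} for lowercased, whitespace-split terms."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

def doc_frequency(index, term):
    """Number of records that contain the term."""
    return len(index[term]) if term in index else 0

def term_frequency(index, term, doc_id):
    """Number of occurrences of the term inside one record."""
    if term not in index or doc_id not in index[term]:
        return 0
    return len(index[term][doc_id])

# Hypothetical sample records.
docs = {
    "d1": "solr uses lucene",
    "d2": "lucene builds an inverted index and solr serves the index",
}
index = build_inverted_index(docs)

print(doc_frequency(index, "solr"))          # records containing "solr"
print(term_frequency(index, "index", "d2"))  # occurrences of "index" in d2
print(index["index"]["d2"])                  # positions of "index" in d2
```

Storing the positions alongside the frequencies is what later allows phrase matching and highlighting without rereading the original text.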


ii. Indexing process


Analyzer: a component that examines the fields of a record, or the user's query, and turns them into a series of tokens (keywords).

Tokenizer: a component that creates terms from the user's query or from the data of a field, according to criteria set in the configuration file. For example:
- WhitespaceTokenizer splits the input into keywords at whitespace.
- StandardTokenizer splits the input at whitespace and punctuation.
- LowerCaseTokenizer splits the input into keywords at non-letter characters and then converts every keyword to lowercase.

Filter: a component that inspects the keywords and keeps them, transforms them, drops them, or generates additional keywords. These criteria are specified by the developer in the configuration file. For example, SynonymFilter adds synonyms of the keywords found in the user's query or in the field the filter is applied to; StopFilter removes every keyword that carries no value (the list is defined in a text file).

Indexing proceeds as follows. Once the information to be indexed has been collected and placed into the fields of a record, the Tokenizer and Filters that the developer specified in the configuration file split the content of each field into individual keywords. Together they form the Analyzer, which adds, removes, or keeps keywords depending on the developer's configuration. The resulting keywords are then stored in an index specialized for searching. The original value of each field is either stored or not stored, depending on how the field is configured in the schema. If it is stored, the value is returned when the search engine finds a match (hit); otherwise no value is returned for that field.
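The relationship between Analyzer, Tokenizer, and Filter can be sketched as a simple function pipeline. The Python functions below only mimic the behaviour described above; they are not Solr classes, and the stopword and synonym lists are assumptions of this example.

```python
def whitespace_tokenizer(text):
    """Tokenizer: split the input at whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Filter: transform every token to lowercase."""
    return [t.lower() for t in tokens]

def stop_filter(stopwords):
    """Filter factory: drop tokens that appear in the stopword list."""
    def apply(tokens):
        return [t for t in tokens if t not in stopwords]
    return apply

def synonym_filter(synonyms):
    """Filter factory: keep each token and add its synonyms right after it."""
    def apply(tokens):
        out = []
        for t in tokens:
            out.append(t)
            out.extend(synonyms.get(t, []))
        return out
    return apply

def make_analyzer(tokenizer, filters):
    """Analyzer: one tokenizer followed by an ordered chain of filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for f in filters:
            tokens = f(tokens)
        return tokens
    return analyze

analyzer = make_analyzer(
    whitespace_tokenizer,
    [
        lowercase_filter,
        stop_filter({"the", "a"}),                    # assumed stopword list
        synonym_filter({"university": ["college"]}),  # assumed synonym table
    ],
)
print(analyzer("The Hanoi University of Technology"))
# ['hanoi', 'university', 'college', 'of', 'technology']
```

The order of the filters matters: lowercasing runs first so that the stopword and synonym lookups only need lowercase entries.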


iii. Search process


QueryParser: the components that process the user's query. Each parser handles the query in its own way: where to search, on which fields, whether any field is weighted during the search, and so on.

When the user submits a query command, it is handled by a request handler (a handler for search, update, or delete requests). The request handler determines the logic to be executed for the request. The query command is then processed by the QueryParser, which is responsible for separating the user's search query from the other parameters of the command. Solr has several query parsers. For example, the Standard Query Parser lets the user specify the search query precisely, while the Dismax Query Parser lets the user run simple search queries across many different fields. Finally, the Analyzers specified by the developer split the search query into individual keywords.
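The separation of the search query from the other request parameters can be sketched as follows. The example only illustrates the idea; it is not Solr's QueryParser code. The parameter names q, start, rows, and wt are standard Solr request parameters, while the field list passed to the second function is an assumption.

```python
from urllib.parse import parse_qs

def split_request(query_string):
    """Separate the user's search text (q) from the remaining request parameters."""
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    user_query = params.pop("q", "")
    return user_query, params

def expand_over_fields(user_query, fields):
    """Dismax-like idea: try every keyword against every listed field."""
    return [(field, word) for word in user_query.split() for field in fields]

user_query, params = split_request("q=dong+anh&start=0&rows=10&wt=json")
print(user_query)  # dong anh
print(params)      # {'start': '0', 'rows': '10', 'wt': 'json'}

# Hypothetical field list for a simple multi-field search.
print(expand_over_fields(user_query, ["name", "text"]))
```

The first function corresponds to splitting the command into query and parameters; the second shows why a Dismax-style parser is convenient for plain keyword input, since the user does not have to name any field.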


Next, the search engine looks up the keywords, together with their weights (if any), in the system's index data. The matches are then processed further according to the parameters passed in the user's query command, such as which record to start from, which record to end at, and which format to return. In addition, the components named in the request are applied at this stage. For example, the Highlighting component marks the words in the indexed records that match the keywords in the user's query.

Query types supported by Solr:
- Normal Query: the query contains only keywords.
- Wildcard Query: the query contains wildcard characters. ? stands for exactly one character; * stands for zero or more characters.
- Range Searches: search within a given range. For example, searching by date from 01-01-2002 to 01-01-2003 is written as: date_search:[20020101 TO 20030101]
- Boosting: assigns a weight to certain keywords in the query. For example, with name:dong^2 anh the keyword dong has weight 2, so names containing dong are ranked ahead of names containing anh.
- QueryField: specifies the field on which the query is executed. For example, name:dong anh searches the name field for the value dong anh (to bind both keywords to the field explicitly, they can be grouped as name:(dong anh)).
- Boolean Operator: the operator used to combine the user's keywords. For example, name:dong AND anh finds records whose name field contains both dong and anh.
- Datetime Query: searches fields of type datetime. For example, event_date:[* TO NOW-2YEAR] finds records whose event_date lies anywhere up to two years ago.

Parameters:
- Start: the record to start from.
- Rows: the number of records to return.
- DebugQuery: additionally returns the score computation and other internal details so that the developer can analyze and evaluate the results.
- Response Writer: the format returned to the user, either XML or JSON.
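The query syntax and parameters listed above are combined into one HTTP request. The sketch below builds such a request URL with the standard library; the base URL is an assumption about a local installation, while q, start, rows, wt, and debugQuery are Solr's request parameter names for the items described above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed address of a local Solr instance; adjust to the actual deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_search_url(query, start=0, rows=10, writer="json", debug=False):
    """Encode a query string and paging/format parameters into a request URL."""
    params = {"q": query, "start": start, "rows": rows, "wt": writer}
    if debug:
        params["debugQuery"] = "true"
    return SOLR_SELECT_URL + "?" + urlencode(params)

# Boosted query from the text: "dong" weighs twice as much as "anh".
url = build_search_url("name:dong^2 anh", start=0, rows=5, debug=True)
print(url)

# Decoding the URL again shows that special characters survive the round trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Using urlencode matters here because characters such as ^, :, [ and ] in the query syntax must be escaped before they are sent over HTTP.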

3. Solr in the BKProfile project


a. What is BKProfile
BKProfile is a search engine specialized in finding and returning information about people. It serves three main groups of users:
- Students and alumni who want to find a particular student for some purpose.
- Educators who want to look up students after graduation in order to evaluate the state of education.
- Employers who are looking for promising candidates.


Because BKProfile is a search system, it shares the basic requirements of any search system:
- Highly relevant results.
- Fast execution.
- Exactly one result per person.
- Additional features that make the system more attractive.

b. Designing the index record structure


A record consists of the following fields:
- *_indexed: stores the information of every field whose name ends in _indexed.
- Text: a copy field. The values of all other fields are copied into this field, and it is the default search field.
- Profile_id: identifies a record.

Field type design: BKText is a custom data type created specifically for the BKProfile project.
o Indexing process
- WhitespaceTokenizerFactory: splits field values into keywords at the whitespace between them.
- ASCIIFoldingFilterFactory: converts every keyword to ASCII so that searches can be typed in Vietnamese without diacritics.
- WordDelimiterFilterFactory: splits compound tokens that contain both letters and digits into a letter part and a digit part.
- LowerCaseFilterFactory: converts every keyword to lowercase.
- SynonymFilterFactory: uses synonyms so that the search yields better results.
- StopFilterFactory: removes all words that carry little meaning. These words are defined in a separate file (stopword.txt); most of them are interjections.
- ShingleFilterFactory: joins adjacent keywords into phrases (of at most 5 words) to increase search precision.
- SnowballPorterFilterFactory: reduces words derived from another word to their stem.
- RemoveDuplicatesTokenFilterFactory: removes duplicate keywords produced by the filters above.
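Two steps of the BKText chain, ASCII folding and shingling, can be sketched in a few lines. This is an approximation written for illustration, not the behaviour of Solr's ASCIIFoldingFilterFactory or ShingleFilterFactory in every detail (for instance, this sketch emits only the multi-word shingles, not the single words).

```python
import unicodedata

def ascii_fold(text):
    """Strip Vietnamese diacritics so that 'Thắng' and 'Thang' give the same token."""
    # 'đ' and 'Đ' have no canonical decomposition, so they are mapped explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category Mn) that NFD separated from the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def shingles(tokens, max_size=5):
    """Join adjacent tokens into phrases of 2..max_size words."""
    out = []
    for size in range(2, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

print(ascii_fold("Huỳnh Quyết Thắng"))  # Huynh Quyet Thang
print(ascii_fold("Đông Anh"))           # Dong Anh

tokens = ascii_fold("Công nghệ phần mềm").lower().split()
print(shingles(tokens, max_size=3))
# ['cong nghe', 'nghe phan', 'phan mem', 'cong nghe phan', 'nghe phan mem']
```

Folding at both index time and query time is what lets a user who types "huynh quyet thang" find a record stored with full diacritics.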


o Search process
- WhitespaceTokenizerFactory: splits the user's query into keywords at whitespace.
- ASCIIFoldingFilterFactory: converts the keywords to ASCII so that Vietnamese can be searched without diacritics.
- WordDelimiterFilterFactory: splits compound tokens that contain both letters and digits into a letter part and a digit part.
- LowerCaseFilterFactory: converts every keyword to lowercase.
- SynonymFilterFactory: uses synonyms so that the search yields better results.
- StopFilterFactory: removes all words that carry little meaning. These words are defined in a separate file (stopword.txt); most of them are interjections.
- ShingleFilterFactory: joins adjacent keywords into phrases (of at most 5 words) to increase search precision.
- SnowballPorterFilterFactory: reduces words derived from another word to their stem.
- RemoveDuplicatesTokenFilterFactory: removes duplicate keywords produced by the filters above.
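The letter/digit splitting and the duplicate removal applied to a query can be sketched as below. Both functions are simplifications written for illustration: Solr's WordDelimiterFilterFactory has many more options, and its RemoveDuplicatesTokenFilterFactory only removes duplicates that share the same position, whereas this sketch removes every repeated token.

```python
import re

def split_letters_and_digits(token):
    """Split a mixed token such as 'k51' into its letter part and its digit part."""
    return re.findall(r"[^\W\d_]+|\d+", token)

def remove_duplicates(tokens):
    """Drop repeated tokens while keeping the first occurrence of each."""
    seen = set()
    out = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

def analyze_query(text):
    """Lowercase, split at whitespace, split letters from digits, then deduplicate."""
    tokens = []
    for raw in text.lower().split():
        tokens.extend(split_letters_and_digits(raw))
    return remove_duplicates(tokens)

print(analyze_query("CNPM K51 k51"))  # ['cnpm', 'k', '51']
```

Splitting "K51" into "k" and "51" lets a query for the class year alone still match records that store the combined form.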

c. Improving search quality
i. Weighting
Different fields deserve different weights. For example, in a people search the name field usually matters more than the other fields. The same holds for fields such as school name and class name. Therefore, when designing the index, weights must be assigned to the fields so that the returned results are more relevant than with the default settings.

ii. Grouping frequently occurring phrases
When a phrase is searched, Solr cannot tell that it is a phrase; it simply checks all records for occurrences of the individual keywords. This is not what we want. For example, when searching for "Công nghệ phần mềm", results containing "công nghệ" or "phần mềm" are ranked ahead of results containing "công nghệ an". Therefore, a dictionary is used to define which phrases occur frequently, and matches on those phrases receive extra points, which increases the relevance of the search results.
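Both ideas, field weights and a bonus for dictionary phrases, can be sketched with a toy scoring function. The weights, the bonus value, the phrase dictionary, and the sample records are all assumptions made for this illustration; the report does not state the values used in BKProfile.

```python
# Assumed values, chosen only to make the example concrete.
FIELD_WEIGHTS = {"name": 3.0, "school": 2.0, "text": 1.0}
KNOWN_PHRASES = {"cong nghe phan mem"}  # dictionary of frequent phrases
PHRASE_BONUS = 5.0

def score(record, query):
    """Sum weighted keyword matches per field, plus a bonus for known phrases."""
    q = query.lower()
    words = q.split()
    total = 0.0
    for field, value in record.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        field_text = value.lower()
        field_words = field_text.split()
        total += weight * sum(field_words.count(w) for w in words)
        # Extra credit when a dictionary phrase from the query occurs verbatim.
        for phrase in KNOWN_PHRASES:
            if phrase in q and phrase in field_text:
                total += PHRASE_BONUS
    return total

# Hypothetical records, already folded to ASCII.
records = {
    "r1": {"name": "nguyen van a", "text": "bo mon cong nghe phan mem"},
    "r2": {"name": "tran van b", "text": "cong nghe sinh hoc va phan mem ke toan"},
}
query = "cong nghe phan mem"
ranking = sorted(records, key=lambda r: -score(records[r], query))
print([(r, score(records[r], query)) for r in ranking])
# [('r1', 9.0), ('r2', 4.0)]
```

Without the phrase bonus both records would score 4.0, because each contains all four keywords; the dictionary is what lifts the record with the exact phrase to the top.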


4. Demo
Search with the keywords: Huỳnh Quyết Thắng


Search with the keywords: Huỳnh Quyết Thắng Đại học Bách Khoa Hà Nội công nghệ thông tin công nghệ phần mềm


