
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology

Chapter 1: Overview of Data Mining

Graduate Program in Computer Science
Electronic course notes compiled by: Dr. Võ Thị Ngọc Châu (chauvtn@cse.hcmut.edu.vn)
Semester 1, 2011-2012


Contents

- Chapter 1: Overview of data mining
- Chapter 2: Data preprocessing issues
- Chapter 3: Data regression
- Chapter 4: Data classification
- Chapter 5: Data clustering
- Chapter 6: Association rules
- Chapter 7: Data mining and database technology
- Chapter 8: Data mining applications
- Chapter 9: Research topics in data mining
- Chapter 10: Review

Chapter 1: Overview of data mining

1.0. Scenarios
1.1. The knowledge discovery process
1.2. Concepts
1.3. Significance and role of data mining
1.4. Applications of data mining
1.5. Summary

1.0. Scenario 1

Is the person using the card with ID = 1234 really the cardholder, or a thief?

1.0. Scenario 2

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Is Mr. A (Tid = 100) likely to evade taxes?

1.0. Scenario 3

Will STB stock rise tomorrow?

1.0. Scenario 4

Cohort  Student ID  Course 1  Course 2  Graduated
2004    1           9.0       8.5       Yes
2004    2           6.5       8.0       Yes
2004    3           4.0       2.5       No
2004    8           5.5       3.5       No
2004    14          5.0       5.5       Yes
2005    90          7.0       6.0       Yes (80%)
2006    24          9.5       7.5       Yes (90%)
2007    82          5.5       4.5       No (45%)
2008    47          2.0       3.0       No (97%)

How can we determine the likelihood that a current student will graduate?

"We are data rich, but information poor."

"Necessity is the mother of invention." - Plato

1.1. The knowledge discovery process

[Figure: the knowledge discovery process. Data Sources -> Data Cleaning and Data Integration -> Data Warehouse -> Selection/Transformation -> Task-relevant Data -> Data Mining -> Patterns -> Pattern Evaluation/Presentation]

"Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Frawley, W. J. et al. (1991). Knowledge discovery in databases: an overview.

"Knowledge discovery from databases is the process of using the database along with any required selection, preprocessing, sub-sampling, and transformations of it; to apply data mining methods (algorithms) to enumerate patterns from it; and to evaluate the products of data mining to identify the subset of the enumerated patterns deemed knowledge."

Fayyad, U. M. et al. (1996). Advances in Knowledge Discovery and Data Mining. MIT Press.

The knowledge discovery process is an iterative sequence of the following steps:

- Data cleaning
- Data integration
- Data selection
- Data transformation
- Data mining
- Pattern evaluation
- Knowledge presentation
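The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the field names, the income threshold, and every function name (`clean`, `integrate`, `select`, `transform`, `mine`, `evaluate`, `present`) are hypothetical and do not come from the course material. The "mining" step is deliberately trivial (a frequency count).

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "income": 125}, {"id": 2, "income": None}]
source_b = [{"id": 3, "income": 70}, {"id": 4, "income": 120}]


def clean(rows):
    """Data cleaning: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: merge several sources into one collection."""
    return [row for source in sources for row in source]


def select(rows, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in rows]


def transform(rows):
    """Data transformation: discretize income into two bands."""
    return ["high" if r["income"] >= 100 else "low" for r in rows]


def mine(values):
    """Data mining: here, simply count how often each band occurs."""
    return Counter(values)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a threshold."""
    return {p: c for p, c in patterns.items() if c >= min_count}


def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{p}: {c} record(s)" for p, c in sorted(patterns.items())]


rows = integrate(clean(source_a), clean(source_b))
patterns = mine(transform(select(rows, ["income"])))
report = present(evaluate(patterns, min_count=2))
print(report)
```

In practice the sequence is iterative: a poor evaluation result usually sends the analyst back to an earlier step, such as selection or transformation.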

The knowledge discovery process is an iterative sequence of steps carried out on:

- Data sources
- Data warehouse
- Task-relevant data (the specific data to be mined)
- Patterns (the results of data mining)
- Knowledge (the knowledge obtained)

[Figure: pyramid of increasing potential to support business decisions, listed here from bottom to top]

- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
- Data Warehouses / Data Marts: OLAP, MDA
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Mining: Information Discovery
- Data Presentation: Visualization Techniques
- Making Decisions

Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA.

1.2. Concepts

- 1.2.1. Data mining
- 1.2.2. Data mining tasks/functions
- 1.2.3. Data mining processes
- 1.2.4. Data mining systems

1.2.1. Data mining

Data mining is:

- A process of extracting knowledge from large amounts of data
  - "extracting or mining knowledge from large amounts of data"
  - "knowledge mining from data"
- A nontrivial process of extracting implicit, useful, previously unknown information from data
  - "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"

Terms often used interchangeably: knowledge discovery/mining in data/databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence

Large amounts of data available for mining

- Any kind of data: stored or transient; structured, semi-structured, or unstructured
- Stored data
  - Flat files
  - Relational databases or object-relational databases
  - Transactional databases or data warehouses
  - Application-oriented databases: spatial databases, temporal databases, spatio-temporal databases, time series databases, text databases, multimedia databases, ...
  - Information repositories: the World Wide Web, ...
- Transient data: data streams

Knowledge obtained from the mining process

- Class/concept descriptions (characterization and discrimination)
- Frequent patterns, associations/correlations
- Classification and prediction models
- Clustering models
- Outliers
- Trends or regularities of objects whose behavior changes over time

The knowledge obtained may be descriptive or predictive, depending on the specific mining process.

- Descriptive: able to characterize the general properties of the mined data (Scenario 1)
- Predictive: able to make inferences from the current data in order to predict (Scenarios 2, 3, and 4)

The knowledge obtained may be structured, semi-structured, or unstructured.
The knowledge obtained may or may not be of interest to the user, which is why interestingness measures are used to evaluate it.
The knowledge obtained can be used for decision support, process control, information management, query processing, ...


[Figure: data mining as a confluence of multiple disciplines: Database Technology, Statistics, Machine Learning, Visualization, Other Disciplines]

Data mining is an interdisciplinary field in which many theories and technologies converge.

Data mining and database technology

Potential contribution of database technology

- Database technology is used to manage the data to be mined.
  - The data are very large and may exceed the capacity of main memory.
  - The data are collected over time.
- Database systems can process large amounts of data efficiently, using paging and swapping to move data into and out of main memory.
- Modern database systems can handle many complex data types (spatial, temporal, spatiotemporal, multimedia, text, Web, ...).
- Other database system functions (concurrency control, security, performance, optimization, ...) are well developed.

Current contribution of database technology

- Database management systems (DBMS) that support data mining
  - Oracle Data Mining (Oracle 9i, 10g, 11g)
  - Microsoft data mining tools (MS SQL Server 2000, 2005, 2008)
  - Intelligent Miner (IBM)
- Inductive database systems support knowledge discovery.
- The SQL/MM Part 6: Data Mining standard (ISO/IEC 13249-6:2006) supports data mining.
  - It specifies an SQL interface for data mining applications and services over relational databases.

Data mining and statistical theory

- Statistics
  - Descriptive statistics: describing the data
  - Inductive statistics: prediction and inference

Example question: do two data samples have the same distribution?
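One common way to approach the question of whether two samples share a distribution is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The sketch below is a plain-Python illustration, not a method named in the course notes. The function names `ecdf` and `ks_statistic` and the sample values are made up, and no p-value is computed.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Fraction of the sample that is <= x (empirical CDF)."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = a + b  # the gap can only change at observed values
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
shifted = ks_statistic([1, 2, 3, 4], [11, 12, 13, 14])
print(same, overlapping, shifted)
```

A value near 0 suggests similar distributions and a value near 1 suggests very different ones. Turning the statistic into a formal test requires a critical value or p-value, which this sketch leaves out.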

Data mining and machine learning

[Figure labels: Machine Learning; Unsupervised; Supervised; Reinforcement; Natural groupings]

Data mining and visualization

- Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
- Results (knowledge): pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution

[Figure panels: Pie chart; Parallel coordinates; Temporal evolution]

[Figure: Feature Selection producing a Mean Feature Image]

[Figure: labeling the classes. Isodata (K-means) Clustering turns the Mean Feature Image into a Label Image]
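The clustering step named in the figure can be illustrated with plain k-means (Lloyd's algorithm) on one-dimensional values such as pixel intensities. This is a minimal sketch under stated assumptions: the pixel values and initial centers are invented, the function name `kmeans_1d` is hypothetical, and ISODATA's extra splitting and merging of clusters is not implemented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D values with fixed initial centers."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: label each value with its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return labels, centers


# Hypothetical pixel intensities forming two visible groups.
pixels = [10, 12, 11, 200, 205, 198]
labels, centers = kmeans_1d(pixels, centers=[0, 255])
print(labels, centers)
```

The returned `labels` list plays the role of the label image: one class label per input value.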

1.2.2. Data mining tasks

- Mining class/concept descriptions (data characterization and discrimination)
- Mining association/correlation rules
- Data classification
- Prediction
- Data clustering
- Trend analysis
- Deviation and outlier analysis
- Similarity analysis

[Figure: a data table (Tid, Refund, Marital Status, Taxable Income, Cheat) surrounded by the data mining tasks Clustering, Classification, Association Rules, Anomaly Detection, and others]

Five primitives that specify a data mining task

- Task-relevant data
- Kind of knowledge to be obtained
- Background knowledge
- Interestingness measures
- Pattern visualization and knowledge presentation techniques

Task-relevant data

- The portion of the source data that is of interest
- Corresponds to the attributes or dimensions of interest
- Includes: the name of the data warehouse/database, the data tables or data cubes, the data selection conditions, the attributes or dimensions of interest, and the data grouping criteria

Kind of knowledge to be obtained

- Includes: data characterization, data discrimination, association or correlation analysis models, classification models, prediction models, clustering models, outlier analysis models, evolution analysis models
- Corresponds to the specific data mining task to be carried out

Background knowledge

- Corresponds to the specific domain being mined
- Guides the knowledge discovery process
  - Supports mining at multiple levels of abstraction
- Helps evaluate the patterns found
- Includes: concept hierarchies and user beliefs about relationships in the data

Interestingness measures

- Usually accompanied by threshold values
- Guide the mining process or evaluate the patterns found
- Correspond to the kind of knowledge to be obtained and therefore to the specific data mining task to be carried out
- Assess: simplicity, certainty, utility, novelty
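As an illustration of measures paired with thresholds, the sketch below computes support and confidence for an association rule and compares both against minimum values. Support and confidence are widely used measures for association rules, but the course text does not name them here, so treat that choice as an assumption. The transactions, the threshold defaults, and the function name `is_interesting` are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule A -> C."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


def is_interesting(antecedent, consequent, transactions,
                   min_support=0.5, min_confidence=0.6):
    """Apply both thresholds to the rule antecedent -> consequent."""
    s = support(antecedent | consequent, transactions)
    c = confidence(antecedent, consequent, transactions)
    return s >= min_support and c >= min_confidence


rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
print(rule_support, rule_confidence)
print(is_interesting({"milk"}, {"bread"}, transactions))
print(is_interesting({"bread"}, {"butter"}, transactions))
```

Raising either threshold prunes more rules. The right values depend on the task and the data, which is why the thresholds are left as parameters.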

Pattern visualization and knowledge presentation techniques

- Determine the form in which the discovered patterns/knowledge are shown to the user
- Include: rules, tables, reports, charts, graphs, trees, and cubes

Data mining tasks and example algorithms

- Data classification
  - Decision tree classification algorithm
  - Bayesian network classification algorithm
- Data clustering
  - k-means clustering algorithm
  - Hierarchical clustering algorithm
- Association rule mining
  - Apriori algorithm
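To make one of the listed algorithms concrete, here is a compact sketch of the level-wise idea behind Apriori: count candidate itemsets, keep the frequent ones, join them into larger candidates, and prune any candidate that has an infrequent subset. It is a simplified teaching version rather than an optimized or authoritative implementation. The baskets and the `min_count` value are invented, and rule generation from the frequent itemsets is omitted.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset: count} for all frequent itemsets."""
    items = {item for t in transactions for item in t}
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Count each candidate and keep those meeting the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(kept)
        # Join step: build (k+1)-item candidates from frequent k-itemsets.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical transactions.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
]
frequent = apriori(baskets, min_count=2)
print(sorted((sorted(s), n) for s, n in frequent.items()))
```

The prune step is what distinguishes Apriori from brute-force counting: the three-item candidate is discarded without scanning the data because one of its two-item subsets is not frequent.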

[Figure: within data mining, a data mining task applies algorithms to Task-relevant Data to produce Interesting Patterns (Knowledge)]

Four basic components of a data mining algorithm

- Model or pattern structure
- Score function
- Optimization and search method
- Data management strategy

Model or pattern structure

- A model is a high-level, global description of the data set.
- A pattern is a local characteristic (feature) of the data that holds only for a few records/objects or a few variables.
- A structure represents the general functional form, with parameter values not yet determined.
- A model structure is a global summary of the data.
  - Example: Y = aX + b is a model structure, and Y = 3X + 2 is a specific model defined from this structure.
- A pattern structure concerns a relatively small part of the data or of the data space.
  - Example: p(Y>y1|X>x1) = p1 is a pattern structure, and p(Y>5|X>10) = 0.5 is a pattern determined from this structure.
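The two examples above can be made concrete in a few lines: fitting the parameters a and b turns the structure Y = aX + b into a specific model, and estimating a conditional frequency turns p(Y>y1|X>x1) = p1 into a specific pattern. This is a sketch under stated assumptions: the data are generated artificially from Y = 3X + 2, ordinary least squares is just one possible fitting method, and the function names `fit_line` and `conditional_rate` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in the structure Y = aX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def conditional_rate(xs, ys, x1, y1):
    """Estimate p(Y > y1 | X > x1) as a relative frequency."""
    selected = [y for x, y in zip(xs, ys) if x > x1]
    return sum(1 for y in selected if y > y1) / len(selected)


# Hypothetical data generated exactly from Y = 3X + 2.
xs = [0, 1, 2, 3, 4]
ys = [3 * x + 2 for x in xs]

a, b = fit_line(xs, ys)                     # global model: uses all records
p = conditional_rate(xs, ys, x1=1, y1=10)   # local pattern: only X > 1
print(a, b, p)
```

The model summarizes every record, while the pattern describes only the subset with X > 1, which mirrors the global versus local distinction in the text. Note that `conditional_rate` assumes at least one record satisfies X > x1.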

Score function

- A score function measures how well a model/pattern structure fits the given data set.
- It tells whether one model is better than the others.
- It should not depend heavily on the particular data set and should not take much time to compute.
- Some common score functions: likelihood, sum of squared errors, misclassification rate, ...
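Two of the score functions named above are simple enough to write out directly. The sketch below compares two candidate models with the sum of squared errors and scores a set of predicted labels with the misclassification rate. The observations, the candidate models, and the labels are all invented for illustration.

```python
def sum_squared_errors(actual, predicted):
    """SSE: a common score for regression models (lower is better)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def misclassification_rate(actual, predicted):
    """Fraction of wrong labels (lower is better)."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Two hypothetical candidate models for the same observations.
xs = [1, 2, 3]
ys = [5, 8, 11]
model_1 = [3 * x + 2 for x in xs]   # Y = 3X + 2
model_2 = [2 * x + 4 for x in xs]   # Y = 2X + 4

sse_1 = sum_squared_errors(ys, model_1)
sse_2 = sum_squared_errors(ys, model_2)
error_rate = misclassification_rate(
    ["No", "Yes", "No", "Yes"], ["No", "No", "No", "Yes"])
print(sse_1, sse_2, error_rate)
```

Here the lower SSE of the first model is what lets the score function say it fits these observations better than the second.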

Optimization and search method

- The goal of the optimization and search method is to determine the structure and the parameter values that best satisfy the score function, given the available data.
- Searching for patterns and models
  - State space: a discrete set of states
  - Search problem: start at a specific node (state) and move through the state space to find the node whose state best satisfies the score function.
  - Search methods: greedy strategy, heuristics, branch-and-bound strategy
- Parameter optimization
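The search problem described above can be sketched with a greedy strategy: start at one state, score its neighbors, move to the best one, and stop when no neighbor improves the score. In this illustration a state is a pair of integer parameters (a, b) for Y = aX + b and the score is the sum of squared errors. The data, the neighborhood definition, and the function names `sse` and `greedy_search` are assumptions made for the example.

```python
def sse(params, xs, ys):
    """Score function: sum of squared errors of Y = aX + b."""
    a, b = params
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def greedy_search(start, xs, ys, max_steps=100):
    """Move to the best neighboring state until none improves the score."""
    state = start
    for _ in range(max_steps):
        a, b = state
        neighbors = [(a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)]
        best = min(neighbors, key=lambda s: sse(s, xs, ys))
        if sse(best, xs, ys) >= sse(state, xs, ys):
            break  # local optimum: no neighbor scores better
        state = best
    return state


# Hypothetical data generated from Y = 3X + 2.
xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]
found = greedy_search(start=(0, 0), xs=xs, ys=ys)
print(found, sse(found, xs, ys))
```

Greedy search is cheap but can stop at a local optimum. On this small example it reaches the generating parameters, which is not guaranteed in general.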

Data management strategy

- Data to be mined
  - Small: all of it is processed at once in main memory
  - Large: it resides on disk, and only part of it is processed in main memory at a time
- The data management strategy covers how the data are stored, indexed, and accessed.
  - This leads to data mining algorithms that are efficient and scalable with respect to the data being mined.
  - This is where database technology contributes.
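The large-data case above can be illustrated by computing statistics block by block, so that only one block is in memory at a time. This is a minimal sketch: the iterator stands in for sequential reads from disk, the chunk size is arbitrary, and the function names `chunks` and `streaming_mean_variance` are hypothetical. The values reuse the taxable incomes (in thousands) from Scenario 2.

```python
def chunks(iterable, size):
    """Yield lists of at most `size` items, one block at a time."""
    block = []
    for item in iterable:
        block.append(item)
        if len(block) == size:
            yield block
            block = []
    if block:
        yield block


def streaming_mean_variance(stream, chunk_size=3):
    """Mean and population variance computed block by block."""
    count, total, total_sq = 0, 0.0, 0.0
    for block in chunks(stream, chunk_size):
        # Only this block is held in memory at any moment.
        count += len(block)
        total += sum(block)
        total_sq += sum(x * x for x in block)
    mean = total / count
    return mean, total_sq / count - mean * mean


# Hypothetical stand-in for values read sequentially from disk.
incomes = iter([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])
mean, variance = streaming_mean_variance(incomes)
print(mean, variance)
```

The sum-of-squares formula keeps the example short but can lose precision on large or widely spread values. A numerically safer alternative such as Welford's method would be a reasonable substitute.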

1.2.3. Data mining processes

A data mining process is an iterative (and interactive) sequence of steps (phases) that starts with raw data and ends with knowledge of interest to the user.

- Cross Industry Standard Process for Data Mining (CRISP-DM at www.crisp-dm.org)
- SEMMA (Sample, Explore, Modify, Model, Assess) at the SAS Institute

The need for a data mining process

- It provides a systematic way to carry out (plan and manage) a data mining project.
- It ensures that the effort spent on a data mining project is optimized.
- It ensures that the models in the project are evaluated and updated continuously.

1.2.3. The CRISP-DM process

An industry-standard process

- Initiated in September 1996 and supported by more than 200 members
- An open standard
- Supports existing industries/applications and data mining tools
- Focuses on business issues as well as technical analysis
- Provides a framework to guide the data mining process
- Grounded in experience from application domains


CRISP-DM is an iterative process that allows backtracking and consists of 6 phases:

- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment

1.2.4. Data mining systems

Data mining systems are developed on the basis of the broad concept of data mining.

- Data mining is a process of discovering knowledge of interest from large amounts of data in databases, data warehouses, or other information repositories.

Main components that a system may have

- Database, data warehouse, World Wide Web, and information repositories
- Database or data warehouse server
- Knowledge base
- Data mining engine
- Pattern evaluation module
- User interface


[Figure: architecture of a data mining system]

Database, data warehouse, World Wide Web, and information repositories

- This component consists of the data/information sources to be mined.
- In specific situations, this component is the input to data integration and data cleaning techniques.

Database or data warehouse server

- This component is responsible for preparing the appropriate data for data mining requests.

Knowledge base

- This component holds the domain knowledge used to guide the search and to evaluate the resulting patterns.
- Domain knowledge may consist of concept hierarchies, user beliefs, constraints or threshold values, metadata, ...

Data mining engine

- This component contains the functional modules that carry out the data mining tasks.

Pattern evaluation module

- This component works with interestingness measures (and threshold values) to support searching for and evaluating patterns, so that the patterns found are those of interest to the user.
- This component may be integrated into the data mining engine.

User interface

- This component supports interaction between the user and the data mining system.
  - The user can specify a query or a data mining task.
  - The user can be given information that helps focus the search and mine the data more deeply, based on intermediate mining results.
  - The user can also browse database/data warehouse schemas and data structures, evaluate the mined patterns, and visualize these patterns in different forms.

Characteristics used to survey a data mining system

- Data types
- System issues
- Data sources
- Data mining tasks and methodologies
- Coupling with data warehouse/database systems
- Scalability
- Visualization tools
- Data mining query language and graphical user interface

Some data mining systems:

- Intelligent Miner (IBM)
- Microsoft data mining tools (Microsoft SQL Server 2000/2005/2008)
- Oracle Data Mining (Oracle 9i/10g/11g)
- Enterprise Miner (SAS Institute)
- Weka (the University of Waikato, New Zealand, www.cs.waikato.ac.nz/ml/weka)

Data mining systems should be distinguished from:

- Statistical data analysis systems
- Machine learning systems
- Information retrieval systems
- Deductive database systems
- Database systems

1.3. Significance and role of data mining

The evolution of database system technology

- Data Collection and Database Creation (1960s and earlier)
- Database Management Systems (1970s-early 1980s)
- Advanced Database Systems (mid-1980s-present)
- Advanced Data Analysis: Data Warehousing and Data Mining (late 1980s-present)
- Web-based Database Systems (1990s-present)
- New Generation of Integrated Data and Information Systems (present-future)

A modern technology in the field of information management

- Ubiquitous and invisible in many aspects of daily life
  - Working, shopping, searching for information, leisure, ...
- Applied in many applications across many different fields
- Supports scientists, educators, economists, businesses, customers, ...

1.4. Applications of data mining

- In business
- In finance and sales marketing
- In commerce and banking
- In insurance
- In science and biomedicine
- In control and telecommunication

1.5. Summary

- Data mining is the process of discovering patterns of interest from large amounts of data.
  - Mined patterns represent knowledge if they are easy to understand, valid with some degree of certainty, useful, and novel to the user.
  - The large amounts of data come from traditional/modern databases, data warehouses, or other information sources (spatial, time series, text, multimedia, web, ...).
- Data mining tasks include mining class/concept descriptions (data characterization and discrimination), mining association/correlation rules, classification, prediction, clustering, trend analysis, deviation and outlier analysis, similarity analysis, ...
- Five primitives specify a data mining task: the task-relevant data, the kind of knowledge to be obtained, background knowledge, interestingness measures, and knowledge presentation/visualization techniques.
- Four basic components make up a data mining algorithm: model or pattern structure, score function, optimization and search method, and data management strategy.

- Data mining is regarded as one part of the knowledge discovery process.
  - The knowledge discovery process is an iterative sequence of steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
- Many different fields are related to data mining: database technology, statistics, machine learning, information science, visualization, ...
- Related issues: data mining methodology, user interaction, scalability and performance, handling large numbers of different data types, and the exploitation of data mining applications together with their social impact.

Q & A
