
DATA MINING AND APPLICATIONS (KHAI THÁC DỮ LIỆU & ỨNG DỤNG)

Lecturer: Nguyễn Hoàng Tú Anh, M.Sc.

LECTURE 1: OVERVIEW

CONTENTS
1. Why do we need data mining?
2. What is data mining (DM)?
3. The KDD process
4. Main data mining tasks
5. Data mining techniques
6. Issues in data mining

1. THE NEED FOR DATA MINING
Commercial perspective
- Large volumes of data are being collected and stored:
  o Web data, e-commerce
  o Purchase receipts at supermarkets and shopping malls
  o Bank and credit card transactions
- Computers are becoming more powerful and cheaper.
- Competitive pressure is very strong:
  o Provide diverse, high-quality services (CRM: Customer Relationship Management)

Scientific perspective
- Data is collected and stored at high speed (GB/hour):
  o Remote sensors on satellites
  o Telescopes scanning the sky
  o Microarrays generating gene expression data
  o Scientific experiments generating terabytes of data

- Data contains a great deal of valuable information that benefits decision making.
- Data cannot be analyzed by hand:
  o People need weeks to discover useful information.
  o Most data is never analyzed at all.
- "The gap between the ability to generate data and the ability to use it" (Usama Fayyad)
- 10^6 to 10^12 bytes: a data set this large can never be viewed in full or loaded into a computer's memory.
- Traditional techniques cannot cope with raw data.
- Data mining can help scientists:
  o Classify and segment data
  o Formulate hypotheses

THE EMERGENCE OF DATA MINING
[Figure: the data gap. Total data collected (TB) since 1995 compared with the amount of data analyzed; x-axis: 1995 to 1999, y-axis: 0 to 4,000,000.]

WHEN SHOULD DATA MINING BE USED?
- There is too much data.
- The data is large (in dimensionality and in size):
  o Image data (size)
  o Gene data (number of dimensions)
- Little is known about the data.

APPLICATION AREAS OF DATA MINING
Business information
- Market and sales analysis
- Investment analysis
- Loan approval
- Fraud detection
Production information
- Control and planning
- Network management
- Analysis of experimental results
Personal information
Scientific information
- Astronomy
- Biological databases
- Geoscience: earthquake detectors

2. WHAT IS DATA MINING?
"Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in databases." U. Fayyad (1996)

The terms in this definition:
- Non-trivial process: multi-step processing
- Valid: the correctness of the pattern or model can be demonstrated
- Novel: not known beforehand
- Useful: can be put to use
- Understandable: by humans and machines

DATA MINING
What is a pattern?
- A relationship in the data, for example:
  o People who buy trousers often also buy shirts.
  o People with good credit ratings tend to have fewer accidents.
  o Men, 37+, income 50K-75K -> spend about $25-$50 per catalog order.

What is not data mining?
- Looking up a phone number in a phone directory
- Searching for information about "Amazon" with a search engine

What is data mining?
- Finding names that are prevalent in particular regions of the US (O'Brien, O'Rurke, O'Reilly in the Boston area)
- Grouping similar documents returned by a search engine according to their content (e.g. Amazon rainforest, Amazon.com)

3. THE KDD PROCESS
THE KNOWLEDGE DISCOVERY PROCESS
- Data mining: an important step in the KDD (knowledge discovery in databases) process

[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

THE KDD PROCESS
Data warehousing: data is organized by function.
Step 1
- Create or select the target database
- Select a sampling technique and sample the data
- Replace missing values
- Remove noisy data
Step 2
- Normalize values
- Transform values
- Create derived attributes
- Find the important attributes and their value ranges
Step 3
- Select the DM task
- Select the DM method
- Extract knowledge
- Test the knowledge
Step 4
- Refine the knowledge
Step 5
- Transform to a different representation
- Generate queries and reports
- Aggregation and sequences
- Advanced methods

TYPICAL ARCHITECTURE OF A DATA MINING SYSTEM
[Figure: layers from top to bottom]
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning, data integration, and filtering
- Databases and data warehouse
The figure also shows a knowledge base component.
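Two of the preparation steps named in the KDD process (replacing missing values and normalizing values) can be sketched as follows. This is only a minimal illustration: the slides do not say which replacement or normalization method is intended, so mean imputation and min-max scaling are assumptions, and the function names and the `ages` data are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values (assumed strategy)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1] (assumed strategy)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute with one missing value.
ages = [20, None, 40, 60]
filled = fill_missing_with_mean(ages)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Other choices (for example median imputation or z-score standardization) fit the same step; which one is appropriate depends on the data and on the mining method chosen in step 3.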

4. MAIN DATA MINING TASKS
- Classification: discover a description of several predefined classes and assign data to one of those classes.
- Clustering: find a finite set of groups or clusters that describe the data.
- Regression: map a data sample to a real-valued prediction variable.
- Dependency modeling: discover a model that describes the most significant dependencies between variables.
- Change and deviation detection: discover the most significant changes in the data.
- Summarization: discover a compact description of a subset of the data.

CLASSIFICATION EXAMPLE
Verizon Wireless: the largest provider of wireless equipment and services in the US
- Number of customers: 30.3 million
- Covers 90% of the US population
- Problem: high customer churn, 2% per month (about 600,000 customers leaving each month)
- Replacement cost: hundreds of millions of dollars per year
- Average cost of acquiring each new customer: $320

Conventional solution:
- Offer promotions to all customers before their contracts expire
- Too costly and wasteful
Data mining solution:
- Build a prediction model
- Use the model to identify the customers who are likely to leave
- Then:
  o Offer promotions (e.g. a new phone) to the customers most likely to leave
  o Develop new plans that meet customers' needs
- Result: the churn rate fell below 1.5% per month

GROUP EXERCISE
- Discussion time: 15 minutes
- Discuss a data mining scenario within the group and send one representative to present for the group
- Presentation time: at most 5 minutes
  o Present the scenario
  o Present the approach and its benefits

Scenario 1: Retail market
Groups: 3C, 4, G7, Miner2A, MyLove, Hoa D ng
- What data is collected?
- What kind of knowledge do we need about customers?
- Do we need to know which items customers buy?
- Do we need to classify customers?

Scenario 2: Product advertising
Groups: K07, WOI, GIT, DataMiner, Tuấn Anh, Tran
- Send the product advertisement to all customers, or only to a selected group?
- Estimate the customers' likely response relative to the cost of sending the advertisement

CLASSIFICATION: APPLICATION 1
Fraud detection
- Goal: predict fraudulent cases in credit card transactions
- Approach:
  o Use credit card transactions and cardholder information as attributes: what the customer buys, when, and how often the card is used
  o Label past transactions as fraudulent or legitimate; this forms the class attribute
  o Build a model for the class of the transactions
  o Use the model to detect fraud in credit card transactions

CLASSIFICATION: APPLICATION 2
Advertising
- Goal: reduce mailing costs by targeting the customers most likely to buy a new mobile phone product
- Approach:
  o Use the data for a similar product introduced earlier
  o Use the decision {buy, do not buy} as the class attribute
  o Collect personal, lifestyle, and relationship information about all customers
  o Use this information as input data to build a classification model
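The classification workflow in the fraud detection application (label past transactions, build a model, apply it to new transactions) can be sketched with a very small example. The slides do not name a classification algorithm, so the 1-nearest-neighbor rule used here is only an assumption chosen for brevity; the attributes (amount, hour of day), the `history` data, and the function name are all invented for illustration. `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_neighbor_label(labeled, query):
    """Return the class label of the labeled example closest to `query`."""
    _, label = min(labeled, key=lambda item: math.dist(item[0], query))
    return label


# Hypothetical labeled history: (amount in dollars, hour of day) -> class.
history = [
    ((25.0, 12.0), "legitimate"),
    ((40.0, 18.0), "legitimate"),
    ((15.0, 9.0), "legitimate"),
    ((900.0, 3.0), "fraud"),
    ((1200.0, 2.0), "fraud"),
]

# Classify two new, unlabeled transactions.
print(nearest_neighbor_label(history, (1000.0, 4.0)))
print(nearest_neighbor_label(history, (30.0, 13.0)))
```

In this toy data the amount dominates the distance because the attributes are not rescaled; a realistic model would normalize the attributes first and would be evaluated on held-out labeled transactions.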

CLASSIFICATION: APPLICATION 3
Astronomy research
- Goal: predict the type of sky objects (star or galaxy), especially faint objects, based on telescope images
  o 3000 images, 23040 x 23040 pixels per image
- Approach:
  o Segment the image
  o Determine the image attributes (features): 40 features per image
  o Build a model based on the features
- Result: 16 quasars were found, very distant objects that are difficult to see

CLASSIFYING GALAXIES
Source: http://aps.umn.edu
- Class: stages of formation (Early, Intermediate, Late)
- Attributes: image features, characteristics of the light waves, ...
- Data size:
  o 72 million stars, 20 million galaxies
  o Object catalog: 9 GB
  o Image database: 150 GB

CLUSTERING: ILLUSTRATION
- Clustering based on Euclidean distance in 3-D space
- Intracluster distances are minimized
- Intercluster distances are maximized

CLUSTERING: APPLICATION 1
Customer segmentation
- Goal: divide customers into distinct groups (clusters) so that different advertising measures can be applied to each
- Approach:
  o Collect personal and lifestyle information about all customers
  o Identify clusters (groups) of similar customers
  o Check the quality of the clusters by comparing the buying behavior of customers in the same cluster with that of customers in other clusters
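Clustering by Euclidean distance in 3-D space, as in the illustration above, can be sketched with a small k-means loop. The slides only mention distance-based clustering, so k-means is an assumed choice of algorithm here, and the points, the initial centroids, and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one for an empty cluster.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 3-D points forming two well-separated groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 8), (9, 8, 9), (8, 9, 9)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters)
```

Points inside each resulting cluster are close to their centroid (small intracluster distances), while the two centroids are far apart (large intercluster distance), which is the property the illustration describes.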

CLUSTERING: APPLICATION 2
Document clustering
- Goal: find groups of similar documents based on the important terms that appear in them
- Approach:
  o Determine how frequently each term occurs in each document
  o Build a similarity measure based on term frequencies and use it to cluster
- Benefit: in information retrieval (IR), the clusters can be used to relate a new document to the clustered documents

Document clustering illustration
- 3024 articles from the LA Times
- Similarity measure: how many words are commonly used in these documents
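A similarity measure built from term frequencies, as described for document clustering, might look like the sketch below. The slides do not specify the measure, so cosine similarity over raw term counts is an assumption; the three short documents and the function names are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs in the text."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about the rainforest, one about the online store.
d1 = term_frequencies("amazon rainforest trees river")
d2 = term_frequencies("rainforest river wildlife trees")
d3 = term_frequencies("amazon online store books")

sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
print(sim_12, sim_13)
```

The two rainforest documents share three of four terms and score higher than the pair that shares only the word "amazon", so a clustering step using this measure would tend to group them together.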

Clustering S&P 500 stock data
- Observe the daily movement of stock prices
- Data: Stock-{UP/DOWN}
- Similarity measure: events that frequently occur together on the same day

Discovered clusters and their industry groups:
- Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
- Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
- Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
- Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

ASSOCIATION RULE MINING
- Itemset X = {x1, ..., xk}
- Find relationships between attributes that frequently occur together

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Rules with (support, confidence):
- A -> C (50%, 66.7%)
- C -> A (50%, 100%)

[Figure: overlapping sets "Customer buys diaper" and "Customer buys beer"; the overlap is "Customer buys both". Example rule: buy diapers on Friday night, then buy beer.]
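The support and confidence figures quoted for the rules A -> C and C -> A can be reproduced from the four-transaction table. The sketch below uses the standard definitions (support as the fraction of transactions containing all items of the rule, confidence as the support of the whole rule divided by the support of the antecedent); the function and variable names are illustrative, not taken from the slides.

```python
# Transaction database copied from the slide: id -> items bought.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)


def confidence(antecedent, consequent, db):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)


print(support({"A", "C"}, transactions))       # support of A -> C and of C -> A
print(confidence({"A"}, {"C"}, transactions))  # confidence of A -> C
print(confidence({"C"}, {"A"}, transactions))  # confidence of C -> A
```

A and C appear together in 2 of 4 transactions (support 50%). A appears in 3 transactions, so A -> C has confidence 2/3 (66.7%); C appears in 2 transactions, both with A, so C -> A has confidence 100%.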

ASSOCIATION RULE MINING: APPLICATION 1
Advertising and promotion
- Suppose the following rule is found: {Beer, ...} -> {Potato chips}
- Potato chips as the consequent: helps decide what should be done to promote them
- Beer in the antecedent: shows which products would be affected if beer were no longer sold
- Beer in the antecedent and potato chips in the consequent: shows which products should be sold together with beer to encourage purchases of potato chips

ASSOCIATION RULE MINING: APPLICATION 2
Supermarket shelf management
- Goal: identify the items that many customers buy together
- Approach:
  o Process sales data to find relationships between items
  o A classic rule: if a customer buys diapers and milk, they are likely to buy beer

ASSOCIATION RULE MINING: APPLICATION 3
Inventory management
- Goal: a company that services consumer appliances wants to anticipate the causes of repairs for its consumer products and equip its service vehicles with the necessary parts, reducing the number of visits to customers' homes
- Approach:
  o Process data on the tools and parts required in previous repairs to find co-occurrence patterns

REGRESSION
- Predict the value of a variable based on the values of other variables
- Examples:
  o Predict the sales volume of a new product based on advertising expenditure
  o Predict wind speed as a function of temperature, humidity, air pressure, ...
  o Predict stock market indices
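The regression task (predicting one variable from another, such as sales volume from advertising expenditure) can be sketched with ordinary least squares for a single predictor. The slides name no specific method, so simple linear regression is an assumption, and the `ad_spend` and `sales` figures are invented so that they lie exactly on the line y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales volume.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5.0 + intercept  # forecast for a new expenditure of 5.0
print(slope, intercept, predicted)
```

With several predictors (temperature, humidity, and air pressure in the wind speed example) the same idea extends to multiple linear regression, and nonlinear relationships call for nonlinear models.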

DEVIATION AND ANOMALY DETECTION
- Identify significant deviations from normal behavior
- Applications:
  o Credit card fraud detection
  o Network intrusion detection
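One simple way to flag a significant deviation from normal behavior is to measure how many standard deviations a value lies from the mean. The slides do not prescribe a method, so this z-score rule, the threshold of 2.0, the function name, and the transaction amounts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts with one unusually large purchase.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
outliers = find_deviations(amounts)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so in practice more robust statistics (for example the median and the median absolute deviation) or model-based detectors are often preferred.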

5. DATA MINING TECHNIQUES
DATA MINING COMBINES METHODS FROM MANY FIELDS
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Visualization, Algorithm, Other Disciplines]

SOME DATA MINING TECHNIQUES
- Decision trees, rule induction
- Association rule discovery
- Genetic algorithms
- Neural networks, fuzzy sets
- Linear and nonlinear regression
- Rough sets
- Statistics
- Bayesian networks
- ...

6. ISSUES IN DATA MINING
Usefulness
- How should usefulness be measured?
- Visualization and interaction
Efficiency
- Develop fast DM algorithms
- Implementable methods: parallel, distributed, and incremental mining
- Integration into production systems: DBMS, DW
- Very large data sets with high dimensionality (efficiency, scalability)
Application
- Noisy and missing data
- Complex, heterogeneous data
- Privacy preservation
- Handling different data types with different levels of management
- Different data sources (distributed and homogeneous databases, asynchronous data, noisy and missing data, etc.)
Theory
- Knowledge representation
- DM language and algebra
- DM query optimization

WHY STUDY DATA MINING?
- Discuss and give your own answer

SUMMARY
- Data mining discovers useful, previously unknown patterns in large volumes of data
- KDD process:
  o Data collection and preprocessing -> data mining -> pattern evaluation -> knowledge representation
- Mining can be applied to many kinds of data and information
- Types of patterns to mine:
  o Association rules, sequential patterns, classification, clustering, rare patterns, special patterns, deviations

REFERENCES
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- http://vi.wikipedia.org/wiki/Khai_ph%C3%A1_d%E1%BB%AF_li%E1%BB%87u : Wikipedia, the free encyclopedia
- Some slides used in this lecture are taken from the slide sets of data mining textbooks.

THE DEVELOPMENT OF DATA MINING
- 1989: IJCAI Workshop on Knowledge Discovery in Databases
  o Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994: Workshops on Knowledge Discovery in Databases
  o Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  o Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998, and SIGKDD Explorations
- Many other data mining conferences: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001)
- ACM Transactions on KDD since 2007

EXERCISES
1. What is data mining?
2. What kinds of data and information can be used in the KDD process?
3. Give an example of a data mining application that led to business success (other than the examples in the lecture). Which data mining task was used? Could it have been replaced by data querying or simple statistical analysis?

Q&A
