LESSON 1: OVERVIEW
CONTENTS
1. Why mine data?
2. What is data mining (DM)?
3. The KDD process
4. Main data mining tasks
5. Data mining techniques
6. Open issues in data mining
THE NEED FOR DATA MINING
Commercial perspective
- Large volumes of data are collected and stored.
- Computers are more powerful and cheaper.
- Competitive pressure is very strong.
- Companies must provide diverse, high-quality services (CRM: Customer Relationship Management).

Scientific perspective
- Data is collected and stored at high speed (GB/h):
  - remote sensors on satellites
  - telescopes scanning the sky
  - microarrays generating gene expression data
  - scientific experiments generating terabytes of data
THE EMERGENCE OF DATA MINING
[Figure: the data gap. Amount of data collected (TB) since 1995 versus amount of data analyzed; vertical axis 0 to 4,000,000, horizontal axis 1995 to 1999.]
APPLICATION AREAS OF DATA MINING
Commercial information
- Market and sales analysis
- Investment analysis
- Loan approval
- Fraud detection
Manufacturing information
- Control and planning
- Network management
- Analysis of experimental results
Personal information
Scientific information
- Astronomy
- Biological databases
- Earth science: earthquake detectors
C t tri th c v d li u
2. What is data mining?
Data mining is the nontrivial process of identifying valid, novel, useful, and ultimately understandable patterns in databases (U. Fayyad, 1996).
- Nontrivial process: multi-step processing
- Valid: the correctness of the pattern or model can be demonstrated
- Novel: not known beforehand
- Useful: can be put to use
- Understandable: by humans and machines
WHAT IS A PATTERN?
A pattern is a relationship in the data, for example:
- People who buy trousers often also buy shirts.
- People with good credit ratings tend to have fewer accidents.
- Men, 37+, income 50K-75K -> spend about $25-$50 per catalog order.
3. The KDD process
[Figure: the KDD process. Recoverable labels: Databases, 1 Data Integration, 3 Data Mining, 5 Pattern Evaluation.]
TYPICAL DATA MINING SYSTEM ARCHITECTURE
[Figure: system architecture. Recoverable labels: Graphical user interface, Pattern evaluation, Knowledge-base, Filtering, Data warehousing, Databases, Data Warehouse.]

Steps in the KDD process:
- Select representative techniques and sample data; replace missing values; remove noise from the data.
- Choose the DM task; choose the DM method; extract knowledge; verify the knowledge.
- Refine the knowledge.
- Generate queries and reports; improve the methods by combining and iterating.
4. Main data mining tasks
- Classification
- Clustering: find a finite set of groups or clusters that describe the data.
- Regression
- Dependency modeling: find a model that describes the most significant dependencies between variables.
- Change detection: find the most significant changes in the data.
- Summarization: find a compact description of a subset of the data.
CLASSIFICATION EXAMPLE
Company: Verizon Wireless, the largest provider of wireless equipment and services in the US.
- Number of customers: 30.3 million
- 90% of the US population
Problem:
- High customer churn: 2% per month (600,000 customers leave each month)
- Replacement cost: hundreds of millions of dollars per year
- Average cost of each new customer: $320
Data mining solution:
- Build a predictive model to identify the customers who are likely to leave.
- Then offer promotions (for example, a new phone) to the customers most likely to leave.
- Develop new plans that meet customer needs.
Result: the churn rate fell below 1.5% per month.
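The churn figures in the Verizon Wireless example can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide. The avoided-cost line rests on an assumption that is not in the slides: that each retained customer saves exactly one new-customer acquisition. Because the slide reports a rate "below 1.5%", the second figure is an upper bound on the customers still lost.

```python
# Figures quoted on the slide.
customers = 30_300_000          # 30.3 million customers
cost_per_new_customer = 320     # average cost (USD) of one new customer


def monthly_churn(total, rate_per_mille):
    """Customers lost per month; the rate is given in tenths of a percent."""
    return total * rate_per_mille // 1000


lost_before = monthly_churn(customers, 20)   # 2.0% per month
lost_after = monthly_churn(customers, 15)    # 1.5% per month (upper bound)
retained = lost_before - lost_after

# Assumption (not from the slides): each retained customer avoids
# one new-customer acquisition at the average cost.
avoided_cost = retained * cost_per_new_customer

print(lost_before, lost_after, retained, avoided_cost)
```

The first value, 606,000, matches the slide's rounded figure of 600,000 customers lost per month.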
GROUP EXERCISE
- Discussion time: 15 minutes. Discuss a DM scenario within the group; one representative will be called to present for the group.
- Presentation time: at most 5 minutes. Present the scenario, the proposed approach, and its benefits.

Scenario 1: Retail market
Groups: 3C, 4, G7, Miner2A, MyLove, Hoa
- What kind of data is collected?
- What kind of knowledge do we need about customers?
- Do we need to know which items customers buy?
- Do we need to segment customers?

Scenario 2: Product advertising
Groups: K07, WOI, GIT, DataMiner, Tuấn Anh, Tran
- Send the product flyer to all customers, or only to a selected group?
- Estimate the likely customer response against the cost of sending the advertisement.
CLASSIFICATION: APPLICATION 1
... credit card transactions

CLASSIFICATION: APPLICATION 2
Advertising
Goal: reduce mailing costs by targeting the group of customers most likely to buy a new mobile phone product.
Approach:
- Use data from a similar earlier product.
- Use the decision {buy, not buy} as the class attribute.
- Collect personal, lifestyle, and relationship information for all customers.
- Use this information as input data to build a classification model.
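The approach in the advertising application (a {buy, not buy} class attribute learned from customer attributes) can be sketched with a small naive Bayes classifier. This is one possible classifier chosen for illustration; the slides do not say which method was used. The records, attribute names, and values below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records from a similar earlier product:
# (age group, residence, existing contract) -> class label
records = [
    (("young", "urban", "yes"), "buy"),
    (("young", "urban", "no"), "buy"),
    (("young", "rural", "yes"), "buy"),
    (("old", "urban", "yes"), "buy"),
    (("old", "rural", "no"), "no_buy"),
    (("old", "rural", "yes"), "no_buy"),
    (("young", "rural", "no"), "no_buy"),
    (("old", "urban", "no"), "no_buy"),
]


def train(records):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (label, attribute index) -> counts
    for attrs, label in records:
        for i, value in enumerate(attrs):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def predict(model, attrs):
    """Return the class with the highest naive Bayes score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior probability of the class
        for i, value in enumerate(attrs):
            # Laplace smoothing; assumes two possible values per attribute.
            score *= (value_counts[(label, i)][value] + 1) / (n + 2)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(records)
print(predict(model, ("young", "urban", "yes")))
print(predict(model, ("old", "rural", "no")))
```

In a mailing campaign, only customers predicted as "buy" would receive the advertisement, which is how the model reduces mailing costs.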
CLASSIFICATION: APPLICATION 3
Galaxy classification
... especially astronomical ...
Source: http://aps.umn.edu
[Figure label: Early]
Approach:
- Segment the image.
- Determine the image attributes (features): 40 features per image.
- Build a model based on these features.
Result: 16 quasars were found, very distant objects that are hard to observe.
Data size:
CLUSTERING: ILLUSTRATION
[Figure: clusters of points. Recoverable label: intracluster distances are minimized.]

CLUSTERING: APPLICATION 1
Customer segmentation
Goal: divide customers into distinct groups/clusters so that different advertising measures can be applied to each.
Approach:
- Collect personal and lifestyle information for all customers.
- Identify clusters/groups of similar customers.
- Check cluster quality by comparing the buying patterns of customers in the same cluster with those of customers in other clusters.
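Grouping similar customers can be sketched with k-means, one common clustering algorithm; the slides do not name a specific method, so this choice is an assumption. The customer data (age, monthly spend) is hypothetical, and the starting centroids are simply the first k points so that the run is repeatable.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # deterministic start: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(xs) / len(cluster) for xs in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (age, monthly spend in USD)
customers = [(22, 30), (25, 35), (24, 28), (58, 120), (61, 110), (60, 130)]
centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Minimizing the distance of each point to its own centroid is what the illustration calls minimizing intracluster distances. In practice the attributes would usually be rescaled first so that no single attribute dominates the distance.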
CLUSTERING: APPLICATION 2
Identify ... important ...

Illustration of document clustering
- 3024 articles from the LA Times.
- Similarity measure: how many words the documents have in common.
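The similarity measure for document clustering, the number of words two documents share, can be sketched as follows. The three short texts are hypothetical stand-ins for news articles, and the whitespace tokenization is a simplifying assumption.

```python
def shared_words(doc_a, doc_b):
    """Number of distinct words that appear in both documents."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b)


# Hypothetical article snippets
finance_1 = "stock market falls as oil prices rise"
finance_2 = "oil prices rise after market report"
sports = "local team wins the final game"

print(shared_words(finance_1, finance_2))
print(shared_words(finance_1, sports))
```

Documents with many shared words would be placed in the same cluster; the two finance snippets share four words, while the sports snippet shares none with either.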
ASSOCIATION RULE MINING

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram. Recoverable labels: Customer buys diaper, Customer buys both.]
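Association rules are usually scored by support and confidence. The slide shows only the transaction table, so the two measures are introduced here as the standard definitions rather than as the slide's own text. The sketch computes them for the four transactions (ids 10 to 40) listed above.

```python
# The four transactions from the table (transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"A", "C"}))         # A and C occur together in 2 of 4 transactions
print(confidence({"A"}, {"C"}))    # rule A -> C
print(confidence({"C"}, {"A"}))    # rule C -> A
```

The rule C -> A holds in every transaction that contains C, while A -> C holds in two of the three transactions that contain A.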
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
ASSOCIATION RULES: APPLICATION 1
Advertising and promotion
Rule: {Beer, ...} -> {Potato chips}
- ... determine what should be done ...
- Beer as the antecedent: shows which products would be affected if the store stopped selling beer.
- Beer and potato chips appearing together: shows which products should be sold with beer to encourage purchases of potato chips.

ASSOCIATION RULES: APPLICATION 2
Approach:
- Process sales data to find relationships between items.
- Classic rule: if a customer buys diapers and milk, the customer is likely to buy beer.
ASSOCIATION RULES: APPLICATION 3
Approach:
- Process data on the tools and parts required in previous repairs to find co-occurrence patterns.

REGRESSION
Predict the value of a variable from the values of other variables. Examples:
- Forecast the sales volume of a new product from advertising expenditure.
- Predict wind speed as a function of temperature, humidity, air pressure, ...
- Predict stock market indices.
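The first regression example, forecasting sales from advertising expenditure, can be sketched with an ordinary least-squares line. The slides also list nonlinear regression, so a straight line is only the simplest case. The data values are hypothetical and chosen to lie exactly on a line so the fitted coefficients are easy to verify.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past products: advertising spend and resulting sales volume
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 5.0 + intercept   # predicted sales for a spend of 5.0
print(slope, intercept, forecast)
```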
5. Data mining techniques
[Figure: data mining and related fields. Recoverable labels: Data Mining, Visualization, Algorithm, Other Disciplines.]
- Decision trees, rule induction
- Association rule discovery
- Genetic algorithms
- Neural networks, fuzzy sets
- Linear and nonlinear regression
- Rough sets
- Statistics
- Bayesian networks
- ...
11
N I DUNG
1. T i sao c n khai thc d li u (DM) ? 2. DM l g ? 3. Qui trnh KDD 4. Cc nhi m v chnh c a KTDL 5. Cc k thu t KTDL
NH NG V N
Tnh c ch Tnh hi u qu ng d ng L thuy t
C A KTDL
6. Cc v n
c a KTDL
45 46
NH NG V N
Tnh c ch
o tnh c ch ? Tr c quan v tng tc
C A KTDL
Cc tp d liu cc ln V c s chiu ln (Tnh hiu qa, tnh co dn)
NH NG V N
ng d ng
C A KTDL
Tnh hi u qu
Pht tri n thu t ton DM nhanh Thi hnh c phng php : khai thc song song, phn tn, tng c ng Tch h p vo h th ng s n ph m : DBMS, DW
L thuy t
X l cc kiu d liu khc nhau vi mc qun tr khc nhau
47
Cc ngun d liu khc nhau (Cc CSDL Phn tn v thun nht, d liu khng ng b, c nhiu v b mt mt,v.v.)
48
12
SUMMARY
- Data mining discovers useful, previously unknown patterns in large volumes of data.
- The KDD process: data collection and preprocessing -> data mining -> pattern evaluation -> knowledge representation
REFERENCES
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- http://vi.wikipedia.org/wiki/Khai_ph%C3%A1_d%E1%BB%AF_li%E1%BB%87u : Wikipedia, the free encyclopedia
- Some slides in this lesson are taken from the slide sets of data mining textbooks.

- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998, and SIGKDD Explorations
- Many other DM conferences: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001)
- ACM Transactions on KDD since 2007
EXERCISES
1. What is data mining?
2. What kinds of data and information can be used in the KDD process?
3. Give an example of a successful application of DM in business (other than the examples in the lecture). Which type of DM task was used? Could it have been replaced by a data query method or a simple statistical analysis?

Q&A