Lecture 1 - 2015
DATA MINING & APPLICATIONS
Instructor: Dr. NGUYỄN HOÀNG TÚ ANH
LECTURE 1
OVERVIEW
CONTENTS
1. Why do we need data mining?
2. What is data mining (DM)?
3. The knowledge discovery (KDD) process
4. The main tasks of data mining
5. Data mining techniques
6. Challenges of data mining
THE NEED FOR DATA MINING
Data is collected and stored:
Shopping centers
Bank transactions / credit cards
THE EMERGENCE OF DATA MINING
Data mining emerged in a DATA-RICH context
THE NEED FOR DATA MINING
Data contains a great deal of valuable information
10^6 to 10^12 bytes:
We can never view the data set in full or fit it into a computer's memory
Evolution of Sciences:
New Data Science Era
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
The Internet and computing Grid that makes all these archives universally accessible
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
THE NEED FOR DATA MINING
[Figure: "The data gap", 1995-1999. Vertical axis from 0 to 4,000,000; labeled series: amount of data analyzed.]
Customer Relationship Management (CRM)
To build relationships with customers, companies need to know:
1.
2.
3.
4.
WHAT IS DATA MINING?
Data mining is the non-trivial process of identifying hidden patterns in databases that are valid, novel, useful, and ultimately understandable (U. Fayyad, 1996)
Multiple processing steps
Non-trivial process
Valid
Novel
Useful
Understandable
Usable
By humans and machines
DATA MINING
What is a hidden pattern?
Looking up a phone number in a phone directory.
Names that are common in particular regions of the US (O'Brien, O'Rurke, O'Reilly in the Boston area).
Looking up information about Amazon on a search engine.
Data mining: an important step in the KDD (knowledge discovery in databases) process
Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
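The stages of the KDD flow can be sketched in a few lines of Python. This is only an illustrative sketch: the two record lists, the field names, the helper functions, and the frequency-counting "mining" step are all hypothetical stand-ins for real databases and a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw sources (stand-ins for two operational databases).
store_db = [
    {"customer": "c1", "item": "phone", "amount": 300.0},
    {"customer": "c2", "item": "case", "amount": None},   # missing value
    {"customer": "c1", "item": "case", "amount": 15.0},
]
web_db = [
    {"customer": "c3", "item": "phone", "amount": 310.0},
    {"customer": "c3", "item": "case", "amount": 14.0},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: merge the cleaned sources into one 'warehouse' list."""
    warehouse = []
    for source in sources:
        warehouse.extend(clean(source))
    return warehouse

def select(warehouse, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in warehouse]

def mine(task_data):
    """Data mining (toy version): count how often each item is bought."""
    return Counter(r["item"] for r in task_data)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {item: n for item, n in patterns.items() if n >= min_count}

warehouse = integrate(store_db, web_db)
task_data = select(warehouse, ["customer", "item"])
patterns = evaluate(mine(task_data), min_count=2)
print(patterns)
```

Each function corresponds to one arrow in the flow, so a stage can be swapped out without touching the others.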
[Figure: steps of the KDD process. Labels: data warehousing; (1) removing noise from the data; normalizing values; transforming values; choosing the DM method; extracting knowledge; (2) finding important attributes and value domains; (3) choosing the DM task; transforming to another representation; verifying knowledge; refining knowledge; data cleaning; data mining.]
[Figure: system components, top to bottom: pattern evaluation; data mining engine; knowledge base; database or data warehouse server; filtering; data warehouse; databases.]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Data Preprocessing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
Postprocessing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
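Two of the preprocessing steps named here, normalization and feature selection, can be illustrated with a minimal sketch. It assumes min-max scaling and a simple variance threshold, which are only one possible choice for each step; the customer table and the threshold value are hypothetical.

```python
def min_max_normalize(column):
    """Normalization: rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: nothing to rescale
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def variance(column):
    """Population variance of a numeric column."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)

def select_features(table, min_variance):
    """Feature selection: keep columns whose normalized variance is large enough."""
    kept = {}
    for name, column in table.items():
        scaled = min_max_normalize(column)
        if variance(scaled) >= min_variance:
            kept[name] = scaled
    return kept

# Hypothetical customer table, stored column by column.
table = {
    "age": [20, 30, 40, 50],
    "monthly_calls": [10, 200, 50, 400],
    "country_code": [84, 84, 84, 84],   # constant, carries no information
}
features = select_features(table, min_variance=0.01)
print(sorted(features))
```

The constant column is dropped because it cannot help distinguish one customer from another.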
Predictive:
Classification
Regression
Change/deviation detection
Descriptive:
Clustering
Summarization
Dependency modeling
Clustering: find a set of groups or clusters that describe the data
Classification: ?
Regression: map a data item to a real-valued prediction variable
Dependency modeling: find a model that describes the most significant dependencies among variables
Change/deviation detection: find the most significant changes in the data
Summarization: find a compact description for a subset of the data
Periodicity analysis
Motifs and biological sequence analysis
Similarity-based analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family, classmates, ...
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, ...
Evaluation of Knowledge
Coverage
Accuracy
Timeliness
CLASSIFICATION EXAMPLE
Build a predictive model
Then:
Offer promotions and incentives (e.g., a new phone) to the customers most likely to leave.
Develop new plans that meet customers' needs.
Result: the customer churn rate is reduced to below 1.5% per month.
CLASSIFICATION EXAMPLE
Training data (customer characteristics & cell phone usage behavior) -> Model/Pattern
Consumer i -> Model -> Probability that the customer would terminate the contract
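As a rough illustration of this training-then-scoring flow, the sketch below fits a small logistic regression model by gradient descent and then scores one consumer. Logistic regression is only one possible choice of model; the slide does not say which one was used. The two features (number of complaints, fraction of months with declining usage) and all data values are hypothetical.

```python
import math

def predict(weights, x):
    """Probability that a customer with features x terminates the contract."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict(weights, x) - y
            grad[0] += error
            for j, xj in enumerate(x):
                grad[j + 1] += error * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Hypothetical training data: [complaints, fraction of months with declining usage]
rows = [[0, 0.1], [1, 0.2], [0, 0.0], [4, 0.8], [5, 0.9], [3, 0.7]]
labels = [0, 0, 0, 1, 1, 1]          # 1 = customer terminated the contract

model = train_logistic(rows, labels)
p_at_risk = predict(model, [4, 0.8])  # "consumer i" with many complaints
p_loyal = predict(model, [0, 0.1])
print(round(p_at_risk, 3), round(p_loyal, 3))
```

Customers can then be ranked by predicted probability so that retention offers go to the highest-risk group first.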
Advertising:
Goal: reduce mailing costs by targeting the group of customers most likely to buy a new mobile phone product.
CLUSTERING DATA
Clustering/grouping based on Euclidean distance in 3-D space
Intracluster distances are minimized
Intercluster distances are maximized
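A minimal sketch of distance-based clustering in 3-D follows. It uses k-means, which is one common algorithm for this goal and is an assumption here, since the slide names only the Euclidean distance criterion. The six points and the choice of initial centers are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and recomputing centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical 3-D points forming two well-separated groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (8, 8, 8), (9, 8, 8), (8, 9, 9)]
centers, clusters = k_means(points, centers=[points[0], points[3]])
print(clusters)
```

Points inside each resulting cluster are close together, while the two cluster centers are far apart, which matches the two criteria listed on the slide.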
Cluster   Industry Group
1         Technology1-DOWN
2         Technology2-DOWN
3         Financial-DOWN
4         Oil-UP
Transaction   Items bought
10            A, B, C
20            A, C
30            A, D
40            B, E, F
[Figure: customer buys diaper; customer buys beer; customer buys both.]
Buy diapers on Friday night -> then buy beer
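The four transactions in the table are enough to compute the two standard measures of an association rule, support and confidence. These measures are not named on the slide, so treat the sketch below as supplementary; the function names are illustrative.

```python
# The four transactions from the table, keyed by transaction number.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"A", "C"}))          # A and C appear together in 2 of 4 transactions
print(confidence({"A"}, {"C"}))     # rule A -> C
print(confidence({"C"}, {"A"}))     # rule C -> A
```

Here the rule C -> A holds in every transaction that contains C, while A -> C holds in only two of the three transactions that contain A.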
REGRESSION
Predict the value of a variable based on the values of other variables
Examples:
Forecast the sales volume of a new product based on advertising expenditure.
Predict wind speed as a function of temperature, humidity, air pressure, ...
Predict stock market indices.
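The first example, sales as a function of advertising cost, can be sketched with ordinary least squares on a single variable. The five data points are made up for illustration, and a straight-line model is an assumption, not something the slide prescribes.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising cost and resulting sales volume.
ad_cost = [1, 2, 3, 4, 5]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_cost, sales)
forecast = slope * 6 + intercept      # predicted sales for an ad cost of 6
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```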
Identify significant deviations from normal behavior
Applications:
Credit card fraud detection
Network intrusion detection
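One simple way to flag a deviation from normal behavior is a z-score test, sketched below on a list of card transaction amounts. The amounts and the threshold of 2 standard deviations are hypothetical, and real fraud or intrusion detection systems use far richer models.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:                      # all values identical: nothing deviates
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical card transaction amounts; the last one is unusually large.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
outliers = find_outliers(amounts)
print(outliers)
```

Because a single extreme value also inflates the mean and the standard deviation, the threshold has to be chosen with the sample size in mind.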
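DATA MINING TECHNIQUES
Data mining draws ideas from fields such as machine learning, statistics, pattern recognition, and database systems
Traditional techniques may be unsuitable because of:
The large size of the data
The high dimensionality of the data
The heterogeneous nature of the data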
[Figure: Data Mining at the center, surrounded by Applications, Algorithm, Pattern Recognition, Database Technology, Statistics, Visualization, and High-Performance Computing.]
SUMMARY
Discovering useful, previously unknown patterns from large volumes of data
The knowledge discovery (KDD) process
Data collection and preprocessing -> data mining -> pattern evaluation -> knowledge presentation
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Int. Conf. on Web Search and Data Mining (WSDM)
PR conferences: CVPR,
Journals
KDD Explorations
Where to find references?
DBLP, CiteSeer, Google
Statistics
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
Visualization
Group exercise 1
Discuss a data mining scenario within the group; one representative presents on behalf of the group.
Presentation time: at most 3 minutes.
Present the scenario
Proposed approach and benefits
Hints:
What kind of data is collected? Which data mining task should be used?
What information do we need to know about customers?
Do we need to know which items customers buy?
Do we need to classify customers? ...
Group exercise 1
Time: 15 minutes
Discuss a data mining scenario within the group; one representative presents on behalf of the group
Presentation time: at most 3 minutes
Present the scenario
Proposed approach and benefits
Scenario 2: Product advertising (e.g., choosing the form and target audience of advertising to reduce costs and increase profit)
Hints:
What data needs to be collected? Which data mining task should be used?
Is it necessary to send product flyers to all customers, or only to a selected group?
Can the likelihood of customer response be estimated against the cost of sending the advertisements?
TO DO
Submit your group's results for group exercise 1 via the link on the Moodle page within one week from today.
Content to present:
Objectives and specific requirements (a specific product)
Required data, e.g., data on products, customers, competitors, and earlier results related to the requirements of the problem.
Which data mining function is used: descriptive, predictive, or a combination, ...
Expected results.
EXERCISES
1. What is data mining? Give an illustrative example.
2. What kinds of data and information can be used in the KDD process?
3. Give a real-world example of a successful application of data mining in business (other than the examples in the lecture).