Chap1 Introduction
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
Computers have become cheaper and more powerful
Expectations
Gathered data will have value either for the purpose collected or for a purpose not envisioned.
Data rich but information poor!
What do those data mean?
How to analyze data?
Traditional techniques are infeasible for raw data
Origins of Data Mining
We are drowning in data but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and
database systems
A key component of the emerging field of data science and data-driven discovery
Data Mining and Related Fields
Why Data Mining?
Great opportunities to improve productivity in all walks of life
Great opportunities to solve society’s major problems
Content
Why data mining?
What is data mining?
Knowledge Discovery (KDD) Process
A Multi-Dimensional View of Data Mining
Data Mining Tasks
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting, business intelligence,
etc.
Is everything “data mining”?
When is data mining used?
What Is Data Mining?
1. Display the amount of money for Mr. Smith on January 5.
2. How many foreign investors bought stock X last month?
3. Display every stock in the database whose price has increased.
4. What characteristics do stocks with rising prices have?
5. What can be expected of stock X next week?
6. Next month, how many union members will be unable to repay their debts?
7. What characteristics do people who buy product Y have?
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), market basket analysis, cross
selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive
analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
Bioinformatics and bio-data analysis
Knowledge Discovery (KDD) Process
[Figure: a typical view of the KDD process, from data sources through data cleaning and data integration up to pattern evaluation]
Knowledge Discovery (KDD) Process (cont.)
Learning the application domain
relevant prior knowledge and goals of application
Identifying a target data set: data selection
Data processing
Data cleaning (remove noise and inconsistent data)
Data integration (multiple data sources may be combined)
Data selection (data relevant to the analysis task are retrieved from database)
Data transformation (data transformed or consolidated into forms appropriate for mining)
(Done with data preprocessing)
Data mining (an essential process where intelligent methods are applied to extract
data patterns)
Pattern evaluation (identify the truly interesting patterns)
Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
Use of discovered knowledge
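The steps listed above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the two record lists, the field names, and the counting step that stands in for "data mining" are all hypothetical and not part of the lecture material.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "item": "milk", "amount": 2.5},
           {"customer": "c2", "item": None, "amount": 1.0}]
store_b = [{"customer": "c3", "item": "milk", "amount": 3.0},
           {"customer": "c4", "item": "bread", "amount": 1.5}]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: reduce each record to its item name."""
    return [r["item"] for r in records]

def mine(items):
    """Data mining (toy stand-in): count how often each item occurs."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that reach a minimum count."""
    return {k: v for k, v in patterns.items() if v >= min_count}

data = integrate(clean(store_a), clean(store_b))
data = select(data, ["item"])
patterns = mine(transform(data))
interesting = evaluate(patterns, min_count=2)
print(interesting)  # {'milk': 2}
```

In a real project each function would be far richer, but the order of the calls mirrors the process on the slide.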
KDD Process: A View from ML and Statistics
Data Mining Tasks
Prediction Tasks
Use some variables to predict unknown or future values of other variables
Description Tasks
Find human-interpretable patterns that describe the data.
Scenario 1
Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Is Mr. A (Tid = 100) likely to evade taxes?
Scenario 3
Will STB stock rise tomorrow?
Scenario 4
Cohort  StudentID  Course1  Course2  …  Graduated
2004    1          9.0      8.5      …  Yes
2004    2          6.5      8.0      …  Yes
2004    3          4.0      2.5      …  No
2004    8          5.5      3.5      …  No
2004    14         5.0      5.5      …  Yes
…       …          …        …        …  …
2005    90         7.0      6.0      …  Yes (80%)
2006    24         9.5      7.5      …  Yes (90%)
2007    82         5.5      4.5      …  No (45%)
2008    47         2.0      3.0      …  No (97%)
…       …          …        …        …  …
Main data mining tasks: Classification
Classification is the task of generalizing known structure to apply to new
data. For example, an e-mail program might attempt to classify an e-mail as
"legitimate" or as "spam".
In machine learning and statistics, classification is the problem of identifying
to which of a set of categories a new observation belongs, on the basis of a
training set of data containing observations whose category membership is
known.
In the terminology of machine learning, classification is considered an
instance of supervised learning, i.e. learning where a training set of correctly
identified observations is available. The corresponding unsupervised
procedure is known as clustering, and involves grouping data into categories
based on some measure of inherent similarity or distance.
Main data mining tasks: Classification
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model for class attribute as a function of the
values of other attributes.
Goal: previously unseen records should be assigned a
class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is
divided into training and test sets, with training set used to build the model and test set
used to validate it.
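The train/test procedure described above can be illustrated with a very small example. This is a sketch under assumptions: the records and class labels are hypothetical, and a 1-nearest-neighbor rule is used only because it is short, not because the lecture prescribes it.

```python
import math

# Hypothetical labeled records: (attribute vector, class label).
records = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
    ((1.1, 0.9), "A"), ((5.1, 5.2), "B"),
]

# Divide the data set: the first six records build the model,
# the last two are held out as the test set.
training_set, test_set = records[:6], records[6:]

def classify(attributes):
    """Predict the class of the nearest training record (1-NN)."""
    nearest = min(training_set, key=lambda rec: math.dist(attributes, rec[0]))
    return nearest[1]

# Validate the model: compare predictions with the known test labels.
predictions = [classify(attrs) for attrs, _ in test_set]
correct = sum(pred == label for pred, (_, label) in zip(predictions, test_set))
accuracy = correct / len(test_set)
print(predictions, accuracy)  # ['A', 'B'] 1.0
```

In practice the split is usually randomized, and the test set must never be used while building the model.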
Classification Example
Model for predicting credit worthiness
Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                            Yes
2    Yes       High School         2                            No
3    No        Undergrad           1                            No
4    Yes       High School         10                           Yes
…    …         …                   …                            …
[Figure: decision tree that splits on Employed (No / Yes), then on Education ({High school, Undergrad} / Graduate), then on Number of years, ending in Yes / No leaves]
Classification Example
Records with unknown class:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                            ?
2    No        Graduate            3                            ?
3    Yes       High School         2                            ?
…    …         …                   …                            …
[Figure: the training set (e.g. Tid 1: Yes, Graduate, 5, Yes) is used to learn a classifier, which produces the model]
Examples of Classification Task
Classifying credit card transactions as legitimate or
fraudulent
Classifying land covers (water bodies, urban areas,
forests, etc.) using satellite data
Categorizing news stories as finance, weather,
entertainment, sports, etc.
Identifying intruders in cyberspace
Predicting tumor cells as benign or malignant
Classifying secondary structures of protein as
alpha-helix, beta-sheet, or random coil
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder
as attributes.
When a customer buys, what they buy, how often they pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the
class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions
on an account.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost
to a competitor.
Approach:
Use detailed records of transactions with each past and present
customer to find attributes.
How often the customer calls, where they call, what time of day they call most,
their financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Main data mining tasks: Classification
Common classification algorithms include:
Linear classifiers: Fisher's linear discriminant, Logistic regression, Naive
Bayes classifier, Perceptron
Support vector machines: Least squares support vector machines
Quadratic classifiers
Kernel estimation: k-nearest neighbor
Boosting (meta-algorithm)
Decision trees: Random forests
Neural networks
Learning vector quantization
Main data mining tasks - Deviation/Anomaly/Change Detection
Detect significant deviations from normal
behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
Identify anomalous behavior from sensor
networks for monitoring and surveillance.
Detecting changes in the global forest cover.
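One simple way to flag a "significant deviation from normal behavior" is a z-score test. The sketch below is only one of many possible techniques and is not taken from the lecture; the transaction amounts and the cutoff of 2 standard deviations are assumptions chosen for illustration.

```python
import statistics

# Hypothetical transaction amounts; one value is far from the rest.
amounts = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 95.0]

mean = statistics.mean(amounts)
spread = statistics.pstdev(amounts)  # population standard deviation

def z_score(value):
    """Number of standard deviations between value and the mean."""
    return (value - mean) / spread

THRESHOLD = 2.0  # assumed cutoff; tune it for the application
outliers = [a for a in amounts if abs(z_score(a)) > THRESHOLD]
print(outliers)  # [95.0]
```

Because the outlier itself inflates the mean and the standard deviation, robust statistics such as the median are often preferred on real data.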
Main data mining tasks - Association rule
Association rule learning (dependency modelling) searches for
relationships between variables.
Association rule learning is a method for discovering interesting
relations between variables in large databases. It is intended to
identify strong rules discovered in databases using some
measures of interestingness.
For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the
supermarket can determine which products are frequently bought
together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Main data mining tasks - Association rule
In order to select interesting rules from the set of all possible rules,
constraints on various measures of significance and interest are used. The
best-known constraints are minimum thresholds on support and confidence.
Association rules are usually required to satisfy a user-specified minimum
support and a user-specified minimum confidence at the same time.
Association rule generation is usually split up into two separate steps:
1. A minimum support threshold is applied to find all frequent item-sets in a
database.
2. A minimum confidence constraint is applied to these frequent item-sets in
order to form rules.
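For reference, the two thresholds above are usually defined as follows. The notation is an assumption introduced here: T is the set of transactions, and X and Y are disjoint item-sets.

```latex
\[
\mathrm{supp}(Z) = \frac{|\{\, t \in T : Z \subseteq t \,\}|}{|T|},
\qquad
\mathrm{supp}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y),
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
\]
```

Step 1 keeps every item-set Z whose support is at least the minimum support; step 2 keeps every rule formed from those item-sets whose confidence is at least the minimum confidence.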
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of
items from a given collection
Produce dependency rules which will predict occurrence of an item
based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
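The two discovered rules can be checked against the five transactions by computing support and confidence directly. This is a minimal sketch: the function names are illustrative, and the brute-force counting is not how Apriori or FP-growth work internally.

```python
from typing import FrozenSet, Iterable, List

# The five transactions from the table above (TID 1 to 5).
transactions: List[FrozenSet[str]] = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diaper", "Milk"}),
    frozenset({"Beer", "Bread", "Diaper", "Milk"}),
    frozenset({"Coke", "Diaper", "Milk"}),
]

def count(itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def support(itemset: Iterable[str]) -> float:
    """Fraction of all transactions that contain the item-set."""
    return count(itemset) / len(transactions)

def confidence(antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Fraction of transactions with the antecedent that also hold the consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return count(both) / count(antecedent)

print(support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))  # 0.6 0.75
print(support({"Diaper", "Milk", "Beer"}))                        # 0.4
print(round(confidence({"Diaper", "Milk"}, {"Beer"}), 2))         # 0.67
```

So {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.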
Main data mining tasks - Association rule
Many algorithms for generating association rules have been proposed over time.
Apriori algorithm
Eclat algorithm (Equivalence Class Transformation)
FP-growth algorithm (FP: Frequent Pattern), AprioriDP
…
Other types of association mining
Multi-Relation Association Rules
Context Based Association Rules
…
Association Analysis: Applications
Market-basket analysis
Rules are used for sales promotion, shelf management, and inventory management
Medical Informatics
Rules are used to find combinations of patient symptoms and test results associated
with certain diseases
Association Rule Discovery: Application
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many
customers.
Approach: Process the point-of-sale data collected with barcode scanners
to find dependencies among items.
A classic rule:
If a customer buys diapers and milk, then he is very likely to buy beer.
Main data mining tasks - Clustering
Clustering is the task of discovering groups and structures in the data that
are in some way or another "similar", without using known structures in the
data.
Cluster analysis or clustering is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters).
Clustering is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including
machine learning, pattern recognition, image analysis, information retrieval,
and bioinformatics.
Main data mining tasks - Clustering
Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in other
groups
[Figure: Euclidean distance based clustering in 3-D space; intra-cluster distances are minimized and inter-cluster distances are maximized]
Main data mining tasks – Clustering (cont.)
Clustering algorithms can be categorized based on their cluster model
Connectivity based clustering (hierarchical clustering)
Centroid-based clustering
Distribution-based clustering
Density-based clustering
In recent years considerable effort has been put into improving the
performance of existing algorithms.
What research has been done on clustering algorithms?
Main data mining tasks – Clustering
Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
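With continuous attributes and Euclidean distance, the definition above can be illustrated with k-means, a centroid-based method. The sketch below makes several assumptions: the 2-D points are hypothetical, the initial centroids are fixed by hand instead of chosen at random, and the iteration count is arbitrary.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Tiny k-means: assign points to the nearest centroid, then recompute."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid (Euclidean distance).
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Points in the same cluster end up close to a shared centroid, which matches the requirement that data points within one cluster be more similar to one another than to points in other clusters.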
Applications of Cluster Analysis
Understanding
Customer profiling for targeted marketing
Group related documents for browsing
Group genes and proteins that have similar functionality
Group stocks with similar price fluctuations
Summarization
Reduce the size of large data sets
[Figure courtesy of Michael Eisen]
[Figure: Clusters for Raw SST and Raw NPP, plotted by longitude, with legend entries Land Cluster 2, Ice or No NPP, Sea Cluster 2, and Sea Cluster 1]
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle
related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same
cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the frequencies
of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
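A similarity measure based on term frequencies can be as simple as the cosine of the angle between two term-count vectors. The sketch below assumes whitespace tokenization and three hypothetical documents; real systems typically add stop-word removal and weighting such as TF-IDF.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents.
doc_mining = term_frequencies("data mining finds patterns in data")
doc_related = term_frequencies("mining data for patterns")
doc_weather = term_frequencies("the weather is sunny today")

sim_related = cosine_similarity(doc_mining, doc_related)
sim_unrelated = cosine_similarity(doc_mining, doc_weather)
print(sim_related > sim_unrelated)  # True
```

Any clustering algorithm that accepts a pairwise similarity can then group the documents; a new document or search term is compared with the clusters in the same way.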
Main data mining tasks - Regression
Predict a value of a given continuous valued variable based on the
values of other variables, assuming a linear or nonlinear model of
dependency.
Extensively studied in the statistics and neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising
expenditure.
Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
Time series prediction of stock market indices.
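For the linear case with a single input variable, the model can be fitted by ordinary least squares. The sketch below uses made-up advertising and sales figures that lie exactly on a line, so the numbers are purely illustrative.

```python
# Hypothetical data: advertising expenditure (x) and sales amount (y).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(advertising, sales)

def predict(x):
    """Predict the continuous target value for a new input."""
    return slope * x + intercept

print(slope, intercept, predict(6.0))  # 2.0 1.0 13.0
```

Real data will not fit a line exactly, so the fitted model is normally judged by its error on held-out data.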
Major Issues in Data Mining
Mining methodology and User interaction
Mining different kinds of knowledge
DM should cover a wide spectrum of data analysis and knowledge discovery tasks
Enable the database to be used in different ways
Require the development of numerous data mining techniques
Interactive mining of knowledge at multiple levels of abstraction
Difficult to know exactly what will be discovered
Allow users to focus the search, refine data mining requests
Incorporation of background knowledge
Guide the discovery process
Allow discovered patterns to be expressed in concise terms and different levels of abstraction
Data mining query languages and ad hoc data mining
High-level query languages need to be developed
Should be integrated with a DB/DW query language
Major Issues in Data Mining
Presentation and visualization of results
Knowledge should be easily understood and directly usable
High level languages, visual representations or other expressive forms
Require the DM system to adopt the above techniques
Major Issues in Data Mining
Performance Issues
Efficiency and scalability
Huge amount of data
Running time must be predictable and acceptable
Parallel, distributed and incremental mining algorithms
Divide the data into partitions and process them in parallel
Incorporate database updates without having to mine the entire data again from
scratch
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining
(KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM
(2008), etc.
ACM Transactions on KDD (2007)
Conferences and Journals on Data Mining
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and
Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and practices of Knowledge Discovery and Data
Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
ACM SIGMOD, VLDB, (IEEE) ICDE
WWW, SIGIR, ICML, CVPR, NIPS
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. On Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
Summary
Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association,
classification, clustering, trend and outlier analysis, etc.
Data mining technologies and applications
Major issues in data mining
References
1. Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition, 2018.
2. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers, 2012.
3. Fayyad et al. Advances in Knowledge Discovery and Data Mining, 1996.
Exercises
1. What is data mining? Give an illustrative example.
2. What kinds of data and information can be used in the KDD process?
3. Give a real-world example in which applying data mining led to business
success (other than the examples already given in the lecture).
Hint: the problem of increasing revenue in the retail market; the problem of
planning advertising and promotions.
What kind of data is collected? Which data mining task is used?
Could it be replaced by data querying or simple statistical
analysis?
Note: find a real-world application example and include the address of a document or website
that describes the application.