
Chap1 Introduction


Data Mining and Application

Lecture 01. Introduction to Data Mining


Content
 Why data mining?
 What is data mining?
 Knowledge Discovery (KDD) Process
 A Multi-Dimensional View of Data Mining
 Data Mining Tasks
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary

Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 Computers have become cheaper and more powerful
 Expectations
 Gathered data will have value either for the purpose collected or for a purpose not envisioned.
 Data rich but information poor!
 What do those data mean?
 How to analyze data?
 Traditional techniques infeasible for raw data
Origins of Data Mining
We are drowning in data but starving for knowledge!
We are data rich, but information poor.
“Necessity is the mother of invention” (Plato): data mining is the automated analysis of massive data sets.

Origins of Data Mining
 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
 Traditional techniques may be unsuitable due to data that is
 Large-scale
 High dimensional
 Heterogeneous
 Complex
 Distributed
 A key component of the emerging field of data science and data-driven discovery

Data Mining and Related Fields
[Figure: data mining at the center, surrounded by related fields: machine learning, pattern recognition, statistics, applications, visualization, algorithms, database technology, high-performance computing]

Why Data Mining?
 Great opportunities to improve productivity in all walks of life
 Great opportunities to solve society’s major problems

What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting, business intelligence,
etc.
 Is everything “data mining”?
 When is data mining used?

What Is Data Mining?
 1. Show the amount of money for Mr. Smith on January 5.
 2. How many foreign investors bought stock X last month?
 3. Show every stock in the database whose face value increased.
 4. What characterizes the stocks whose price increased?
 5. What can be expected of stock X in the coming week?
 6. Next month, how many trade union members will be unable to repay their debts?
 7. What characterizes the people who buy product Y?
Potential Applications
 Data analysis and decision support
 Market analysis and management
 Target marketing, customer relationship management (CRM), market basket analysis, cross
selling, market segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality control, competitive
analysis
 Fraud detection and detection of unusual patterns (outliers)
 Other Applications
 Text mining (news group, email, documents) and Web mining
 Stream data mining
 Bioinformatics and bio-data analysis

Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process
[Figure: the KDD process as a pipeline, from bottom to top: Data Sources; Data Cleaning and Data Integration; Data Warehouse; Selection/Transformation; Task-relevant Data; Data Mining; Patterns; Pattern Evaluation/Presentation]

Knowledge Discovery (KDD) Process (cont.)
 Learning the application domain
 relevant prior knowledge and goals of application
 Identifying a target data set: data selection
 Data processing
 Data cleaning (remove noise and inconsistent data)
 Data integration (multiple data sources may be combined)
 Data selection (data relevant to the analysis task are retrieved from database)
 Data transformation (data transformed or consolidated into forms appropriate for mining)
(Done with data preprocessing)
 Data mining (an essential process where intelligent methods are applied to extract
data patterns)
 Pattern evaluation (identify the truly interesting patterns)
 Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
 Use of discovered knowledge
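
The four preprocessing steps listed above can be sketched on a toy example. The two sources, the field names (`id`, `age`, `income`, `city`) and all values are hypothetical, and each step is reduced to a single illustrative operation rather than a realistic implementation.

```python
# Two hypothetical data sources describing the same customers.
sales = [
    {"id": 1, "income": 50_000},
    {"id": 2, "income": None},      # missing value
    {"id": 3, "income": 70_000},
]
profiles = [
    {"id": 1, "age": 25, "city": "Hanoi"},
    {"id": 2, "age": 40, "city": "Hue"},
    {"id": 3, "age": 35, "city": "Hanoi"},
]

# 1. Data cleaning: drop records with missing values.
clean_sales = [r for r in sales if r["income"] is not None]

# 2. Data integration: join the two sources on the shared key.
profile_by_id = {p["id"]: p for p in profiles}
integrated = [{**profile_by_id[r["id"]], **r} for r in clean_sales]

# 3. Data selection: keep only the attributes relevant to the task.
selected = [
    {"id": r["id"], "age": r["age"], "income": r["income"]}
    for r in integrated
]

# 4. Data transformation: min-max normalize income into [0, 1].
lo = min(r["income"] for r in selected)
hi = max(r["income"] for r in selected)
transformed = [
    {**r, "income": (r["income"] - lo) / (hi - lo)} for r in selected
]

print(transformed)
```

The result of step 4 is the task-relevant data that would be handed to the data mining step.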
KDD Process: A View from ML and Statistics
Input Data --> Data Pre-Processing --> Data Mining --> Post-Processing
 Data pre-processing: data integration, normalization, feature selection, dimension reduction
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
 This is a view from typical machine learning and statistics communities

Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs
& social and information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification, clustering, trend/deviation, outlier
analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.
Data Mining Tasks
 Prediction Tasks
 Use some variables to predict unknown or future values of other variables
 Description Tasks
 Find human-interpretable patterns that describe the data.

Common data mining tasks


 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]

Scenario 1
Is the person using card ID = 1234 really the card's owner, or a thief?
Scenario 2
Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Is Mr. A (Tid = 100) likely to evade taxes?
Scenario 3
Will STB stock rise tomorrow?
Scenario 4
Cohort  StudentID  Course1  Course2  …  Graduated
2004    1          9.0      8.5      …  Yes
2004    2          6.5      8.0      …  Yes
2004    3          4.0      2.5      …  No
2004    8          5.5      3.5      …  No
2004    14         5.0      5.5      …  Yes
…       …          …        …        …  …
2005    90         7.0      6.0      …  Yes (80%)
2006    24         9.5      7.5      …  Yes (90%)
2007    82         5.5      4.5      …  No (45%)
2008    47         2.0      3.0      …  No (97%)
…       …          …        …        …  …
How can we determine the likelihood that a current student will graduate?
Data Mining Tasks

Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Main data mining tasks: Classification
 Classification is the task of generalizing known structure to apply to new
data. For example, an e-mail program might attempt to classify an e-mail as
"legitimate" or as "spam".
 In machine learning and statistics, classification is the problem of identifying
to which of a set of categories a new observation belongs, on the basis of a
training set of data containing observations whose category membership is
known.
 In the terminology of machine learning, classification is considered an
instance of supervised learning, i.e. learning where a training set of correctly
identified observations is available. The corresponding unsupervised
procedure is known as clustering, and involves grouping data into categories
based on some measure of inherent similarity or distance.
Main data mining tasks: Classification
 Given a collection of records (training set)
 Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given data set is
divided into training and test sets, with training set used to build the model and test set
used to validate it.
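
As a minimal sketch of the training/test protocol described above, the following code holds out part of a labeled data set, builds a model from the rest, and measures accuracy on the held-out records. The records are hypothetical, and the 1-nearest-neighbor rule is chosen only because it is short; any classifier could take its place.

```python
import math

# Hypothetical labeled records: (attributes, class).
records = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"), ((1.1, 1.3), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B"), ((3.1, 3.3), "B"),
]

# Split the data: the training set builds the model,
# the test set is used only to validate it.
train = records[0:3] + records[4:7]
test = [records[3], records[7]]

def predict(x, training_set):
    """1-nearest-neighbor: return the class of the closest training record."""
    nearest = min(training_set, key=lambda rec: math.dist(x, rec[0]))
    return nearest[1]

# Accuracy = fraction of test records whose predicted class is correct.
correct = sum(predict(x, train) == label for x, label in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the test records play no part in building the model, the accuracy measured on them is an estimate of how the model behaves on previously unseen records.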

Classification Example
Model for predicting credit worthiness

Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: decision tree]
Employed?
  No: No
  Yes: Education?
    Graduate: Number of years > 3 yr: Yes; < 3 yr: No
    {High school, Undergrad}: Number of years > 7 yrs: Yes; < 7 yrs: No

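
A decision tree such as the credit-worthiness model is just a nested set of tests. The sketch below assumes the tree reads as follows: unemployed applicants are classified "No"; employed graduates are "Yes" when they have more than 3 years at the present address; employed high-school or undergraduate applicants are "Yes" when they have more than 7 years. The function name and the treatment of the boundary values (exactly 3 or 7 years) are assumptions.

```python
def credit_worthy(employed, education, years_at_address):
    """Hand-coded decision tree for credit worthiness.

    Boundary values (exactly 3 or 7 years) are classified "No" here;
    that choice is an assumption, not something the slide specifies.
    """
    if employed == "No":
        return "No"
    if education == "Graduate":
        return "Yes" if years_at_address > 3 else "No"
    # Remaining education levels: High School or Undergrad.
    return "Yes" if years_at_address > 7 else "No"

# The four training records shown in the table:
# (Employed, Level of Education, # years at present address, Credit Worthy).
training = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

for employed, education, years, label in training:
    assert credit_worthy(employed, education, years) == label
print("tree reproduces all four training labels")
```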
Classification Example

Training Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set]

Examples of Classification Task
 Classifying credit card transactions as legitimate or
fraudulent
 Classifying land covers (water bodies, urban areas,
forests, etc.) using satellite data
 Categorizing news stories as finance, weather,
entertainment, sports, etc
 Identifying intruders in the cyberspace
 Predicting tumor cells as benign or malignant
 Classifying secondary structures of protein as
alpha-helix, beta-sheet, or random coil

Classification: Application 1
 Direct Marketing
 Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
 Approach:
 Use the data for a similar product introduced before.
 We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
 Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
 Type of business, where they stay, how much they earn, etc.
 Use this information as input attributes to learn a classifier model.
Classification: Application 2
 Fraud Detection
 Goal: Predict fraudulent cases in credit card transactions.
 Approach:
 Use credit card transactions and the information on its account-holder
as attributes.
 When does a customer buy, what does he buy, how often he pays on time, etc
 Label past transactions as fraud or fair transactions. This forms the
class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit card transactions
on an account.

Classification: Application 3
 Customer Attrition/Churn:
 Goal: To predict whether a customer is likely to be lost
to a competitor.
 Approach:
 Use detailed record of transactions with each of the past and present
customers, to find attributes.
 How often the customer calls, where he calls, what time-of-the day he calls most,
his financial status, marital status, etc.
 Label the customers as loyal or disloyal.
 Find a model for loyalty.
Main data mining tasks: Classification
Common classification algorithms include:
 Linear classifiers: Fisher's linear discriminant, Logistic regression, Naive
Bayes classifier, Perceptron
 Support vector machines: Least squares support vector machines
 Quadratic classifiers
 Kernel estimation: k-nearest neighbor
 Boosting (meta-algorithm)
 Decision trees: Random forests
 Neural networks
 Learning vector quantization

Main data mining tasks - Deviation/Anomaly/Change Detection
 Detect significant deviations from normal
behavior
 Applications:
 Credit Card Fraud Detection
 Network Intrusion Detection
 Identify anomalous behavior from sensor
networks for monitoring and surveillance.
 Detecting changes in the global forest cover.
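
One simple way to flag "significant deviations from normal behavior" is to compare each value with a robust estimate of what is typical. The sketch below uses the median and the median absolute deviation (MAD); the transaction amounts and the cutoff of 5 are hypothetical, and real fraud or intrusion detectors use far richer models.

```python
import statistics

# Hypothetical daily transaction amounts for one credit card.
amounts = [20, 25, 22, 19, 24, 21, 23, 500, 18, 26]

# Median and MAD describe "normal" behavior and are barely affected
# by the anomalies themselves, unlike the mean and standard deviation.
median = statistics.median(amounts)
mad = statistics.median([abs(a - median) for a in amounts])

THRESHOLD = 5  # assumed cutoff, in units of MAD

# A value is flagged when it lies more than THRESHOLD MADs from the median.
anomalies = [a for a in amounts if abs(a - median) / mad > THRESHOLD]
print(anomalies)
```

On this data only the amount 500 is flagged; every other amount lies within two MADs of the median.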

Main data mining tasks - Association rule
 Association rule learning (Dependency modelling) - Searches for
relationships between variables.
 Association rule learning is a method for discovering interesting
relations between variables in large databases. It is intended to
identify strong rules discovered in databases using some
measures of interestingness.
 For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the
supermarket can determine which products are frequently bought
together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Main data mining tasks - Association rule
 In order to select interesting rules from the set of all possible rules,
constraints on various measures of significance and interest are used. The
best-known constraints are minimum thresholds on support and confidence.
 Association rules are usually required to satisfy a user-specified minimum
support and a user-specified minimum confidence at the same time.
Association rule generation is usually split up into two separate steps:
1. A minimum support threshold is applied to find all frequent item-sets in a
database.
2. A minimum confidence constraint is applied to these frequent item-sets in
order to form rules.

Association Rule Discovery: Definition
 Given a set of records, each of which contains some number of items from a given collection
 Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
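
The two-step procedure (find frequent item-sets, then form rules from them) can be sketched on the five transactions of this slide. The thresholds `MIN_SUPPORT` and `MIN_CONFIDENCE` are assumed values, and the brute-force enumeration is only workable for a handful of items; it is not Apriori or any other optimized algorithm.

```python
from itertools import combinations

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of transactions that contain itemset."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

MIN_SUPPORT = 0.4      # assumed threshold
MIN_CONFIDENCE = 0.6   # assumed threshold

# Step 1: enumerate all item-sets and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if support(frozenset(c)) >= MIN_SUPPORT
]

# Step 2: split each frequent item-set into lhs --> rhs
# and keep the rules that are confident enough.
rules = [
    (lhs, itemset - lhs)
    for itemset in frequent if len(itemset) > 1
    for size in range(1, len(itemset))
    for lhs in map(frozenset, combinations(sorted(itemset), size))
    if confidence(lhs, itemset - lhs) >= MIN_CONFIDENCE
]

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For the two rules on the slide this gives support 0.60 and confidence 0.75 for {Milk} --> {Coke}, and support 0.40 and confidence 0.67 for {Diaper, Milk} --> {Beer}, so both pass the assumed thresholds.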

Main data mining tasks - Association rule
 Many algorithms for generating association rules have been proposed over time.
 Apriori algorithm
 Eclat algorithm (Equivalence Class Transformation)
 FP-growth algorithm (FP: Frequent Pattern), AprioriDP
 …
 Other types of association mining
 Multi-Relation Association Rules
 Context Based Association Rules
 …

Association Analysis: Applications
 Market-basket analysis
 Rules are used for sales promotion, shelf management, and inventory management

 Telecommunication alarm diagnosis


 Rules are used to find combination of alarms that occur together frequently in the
same time period

 Medical Informatics
 Rules are used to find combination of patient symptoms and test results associated
with certain diseases

Association Rule Discovery: Application
 Supermarket shelf management.
 Goal: To identify items that are bought together by sufficiently many
customers.
 Approach: Process the point-of-sale data collected with barcode scanners
to find dependencies among items.
 A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer.

Main data mining tasks - Clustering
 Clustering is the task of discovering groups and structures in the data that
are in some way or another "similar", without using known structures in the
data.
 Cluster analysis or clustering is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters).
 Clustering is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including
machine learning, pattern recognition, image analysis, information retrieval,
and bioinformatics.

Main data mining tasks - Clustering
 Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in other
groups
[Figure: Euclidean-distance-based clustering in 3-D space; intra-cluster distances are minimized, inter-cluster distances are maximized]

Main data mining tasks – Clustering (cont.)
 Clustering algorithms can be categorized based on their cluster model
 Connectivity based clustering (hierarchical clustering)
 Centroid-based clustering
 Distribution-based clustering
 Density-based clustering
 In recent years considerable effort has been put into improving the
performance of existing algorithms.
 What research is being done on clustering algorithms?

Main data mining tasks – Clustering
 Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
 Data points in one cluster are more similar to one another.
 Data points in separate clusters are less similar to one another.
 Similarity Measures:
 Euclidean Distance if attributes are continuous.
 Other Problem-specific Measures.
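
As an illustration of clustering with Euclidean distance as the similarity measure, here is a minimal k-means sketch (a centroid-based method). The points, the initial centroids and the fixed iteration count are hypothetical choices; practical implementations add smarter initialization and a convergence test.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means with Euclidean distance.

    `centroids` holds the initial cluster centers; the function returns
    the final centroids and the clusters from the last assignment step.
    """
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (9, 8)])

for center, members in zip(centroids, clusters):
    print(center, members)
```

Points within each resulting cluster are close to one another and far from the points of the other cluster, which is the property the definition asks for.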

Applications of Cluster Analysis
Understanding
 Customer profiling for targeted marketing
 Group related documents for browsing
 Group genes and proteins that have similar functionality
 Group stocks with similar price fluctuations
Summarization
 Reduce the size of large data sets
Courtesy: Michael Eisen
[Figure: Clusters for Raw SST and Raw NPP, plotted by longitude and latitude, with regions labeled Land Cluster 1, Land Cluster 2, Sea Cluster 1, Sea Cluster 2, and Ice or No NPP]
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.

Clustering: Application 1
 Market Segmentation:
 Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
 Approach:
 Collect different attributes of customers based on their geographical and lifestyle
related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying patterns of customers in same
cluster vs. those from different clusters.

Clustering: Application 2
 Document Clustering:
 Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
 Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the frequencies
of different terms. Use it to cluster.
 Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
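
The similarity measure "based on the frequencies of different terms" can take several forms. The sketch below assumes one common choice, cosine similarity between term-frequency vectors; the three documents and the whitespace tokenizer are hypothetical simplifications.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Bag-of-words term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical documents: two about finance, one about sports.
docs = {
    "d1": "stock market prices rise",
    "d2": "stock prices fall as market closes",
    "d3": "team wins football match",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

print("d1 vs d2:", round(cosine_similarity(tf["d1"], tf["d2"]), 3))
print("d1 vs d3:", round(cosine_similarity(tf["d1"], tf["d3"]), 3))
```

The two finance documents share three terms and score about 0.61, while the finance and sports documents share none and score 0, so a clustering algorithm using this measure would group d1 with d2.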

Main data mining tasks - Regression
 Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
 Extensively studied in statistics and neural network fields.
 Examples:
 Predicting sales amounts of a new product based on advertising expenditure.
 Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
 Time series prediction of stock market indices.
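
For the linear case, the first example (sales as a function of advertising expenditure) can be sketched with ordinary least squares on one input variable. The data values are hypothetical and in arbitrary units.

```python
# Hypothetical observations: advertising expenditure and resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(advertising)
mean_x = sum(advertising) / n
mean_y = sum(sales) / n

# Ordinary least squares for the model: sales = intercept + slope * advertising.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(advertising, sales))
s_xx = sum((x - mean_x) ** 2 for x in advertising)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(x):
    """Predicted sales for advertising expenditure x."""
    return intercept + slope * x

print(f"sales = {intercept:.2f} + {slope:.2f} * advertising")
print(f"predicted sales at advertising = 6.0: {predict(6.0):.2f}")
```

With these numbers the fitted line has slope 1.97 and intercept 1.09, giving a predicted sales value of 12.91 for an expenditure of 6.0.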
Major Issues in Data Mining
 Mining methodology and User interaction
 Mining different kinds of knowledge
 DM should cover a wide spectrum of data analysis and knowledge discovery tasks
 Enable the database to be used in different ways
 Require the development of numerous data mining techniques
 Interactive mining of knowledge at multiple levels of abstraction
 Difficult to know exactly what will be discovered
 Allow users to focus the search, refine data mining requests
 Incorporation of background knowledge
 Guide the discovery process
 Allow discovered patterns to be expressed in concise terms and different levels of abstraction
 Data mining query languages and ad hoc data mining
 High-level query languages need to be developed
 Should be integrated with a DB/DW query language
Major Issues in Data Mining
 Presentation and visualization of results
 Knowledge should be easily understood and directly usable
 High level languages, visual representations or other expressive forms
 Require the DM system to adopt the above techniques

 Handling noisy or incomplete data
 Require data cleaning methods and data analysis methods that can handle noise
 Pattern evaluation: the interestingness problem
 How to develop techniques to assess the interestingness of discovered patterns, especially with subjective measures based on user beliefs or expectations

Major Issues in Data Mining
 Performance Issues
 Efficiency and scalability
 Huge amount of data
 Running time must be predictable and acceptable
 Parallel, distributed and incremental mining algorithms
 Divide the data into partitions that are processed in parallel
 Incorporate database updates without having to mine the entire data again from
scratch
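
The idea behind partitioned and incremental mining can be sketched with simple item counting, used here as a stand-in for a real mining algorithm. The transactions are hypothetical; the point is only that per-partition results can be merged, and that an update requires scanning just the new data.

```python
from collections import Counter

def count_items(partition):
    """Count how many transactions in one partition contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

# Hypothetical transaction database split into two partitions.
partition_1 = [["milk", "bread"], ["milk", "beer"]]
partition_2 = [["bread", "beer"], ["milk"]]

# Each partition can be counted independently (for example on a separate
# worker); the partial results are then merged by addition.
total = count_items(partition_1) + count_items(partition_2)

# Incremental update: when new transactions arrive, only they are scanned
# and their counts are added to the existing result.
new_batch = [["milk", "bread"]]
total += count_items(new_batch)

print(total)
```

This works because counts are additive across partitions; mining results that are not additive need more elaborate merging strategies.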

 Diversity of Database Types
 Other databases that contain complex data objects, multimedia data, spatial data, etc.
 Expect to have different DM systems for different kinds of data
 Heterogeneous databases and global information systems
 Web mining becomes a very challenging and fast-evolving field in data mining

A Brief History of Data Mining Society
 1989 IJCAI Workshop on Knowledge Discovery in Databases
 Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
 1991-1994 Workshops on Knowledge Discovery in Databases
 Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy, 1996)
 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining
(KDD’95-98)
 Journal of Data Mining and Knowledge Discovery (1997)
 ACM SIGKDD conferences since 1998 and SIGKDD Explorations
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM
(2008), etc.
 ACM Transactions on KDD (2007)
Conferences and Journals on Data Mining
 KDD Conferences
 ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and
Data Mining (KDD)
 SIAM Data Mining Conf. (SDM)
 (IEEE) Int. Conf. on Data Mining (ICDM)
 Conf. on Principles and practices of Knowledge Discovery and Data
Mining (PKDD)
 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
 Other related conferences
 ACM SIGMOD, VLDB, (IEEE) ICDE
 WWW, SIGIR, ICML, CVPR, NIPS
 Journals
 Data Mining and Knowledge Discovery (DAMI or DMKD)
 IEEE Trans. on Knowledge and Data Eng. (TKDE)
 KDD Explorations
 ACM Trans. on KDD
Summary
 Data mining: Discovering interesting patterns and knowledge from massive amounts of data
 A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination, association,
classification, clustering, trend and outlier analysis, etc.
 Data mining technologies and applications
 Major issues in data mining
References
1. Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition, 2018.
2. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers, 2012.
3. Fayyad et al. Advances in Knowledge Discovery and Data Mining, 1996.

Exercises
1. What is data mining? Give an illustrative example.
2. What kinds of data and information can be used in the KDD process?
3. Give a real-world example in which applying data mining led to business success (other than the examples already given in the lecture).
 Hint: the problem of increasing revenue in the retail market; the problem of planning advertising and promotions.
 What kind of data is collected? Which data mining task is used? Could it be replaced by data querying or simple statistical analysis?
Note: find a real application, and include the address of a document or website that describes it.