1
Data explosion problem
Automated data collection tools and mature
database technology lead to tremendous amounts
of data stored in databases, data warehouses and
other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules,
regularities, patterns, constraints) from data in
large databases
2
Data mining / knowledge discovery in databases
(KDD)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases
Alternative names and their “inside stories”:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs
3
Database analysis and decision support
◦ Market analysis and management
target marketing, customer relationship management,
market basket analysis, cross selling, market
segmentation
◦ Risk analysis and management
Forecasting, customer retention, quality control
etc.
◦ Fraud detection and management
Other Applications
◦ Text mining (news group, email, documents) and
Web analysis.
◦ Intelligent query answering
4
Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons, customer
complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Customer profiling
data mining can tell you what types of customers buy what
products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new customers
5
Finance planning and asset evaluation
◦ cash flow analysis and prediction
◦ contingent claim analysis to evaluate assets
◦ cross-sectional and time series analysis (financial-
ratio, trend analysis, etc.)
Resource planning:
◦ summarize and compare the resources and
spending
Competition:
◦ monitor competitors and market directions
◦ group customers into classes and set up a class-based
pricing procedure
◦ set pricing strategy in a highly competitive market
6
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions
medical insurance: detect professional patients and rings
of doctors and referrals
Detecting inappropriate medical treatment
Detecting telephone fraud
7
Sports
Astronomy
Internet Web Surf-Aid
◦ IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer
preferences and behavior, analyze the effectiveness of
Web marketing, improve Web site organization, etc.
8
Data mining: the core of the knowledge discovery process.
[Figure: the KDD process. Databases and flat files -> data cleaning and integration -> data warehouse -> data selection and transformation -> task-relevant data -> data mining -> pattern evaluation]
9
Learning the application domain: relevant prior knowledge and
goals of application
Data cleaning: removes noise and inconsistent data
Data integration: multiple data sources are combined
Data selection: data relevant to analysis task are retrieved
Data transformation: data are transformed or consolidated into
forms appropriate for mining
Data mining: an essential process where intelligent methods
are applied in order to extract data patterns of interest
Pattern evaluation: identify the truly interesting patterns
representing knowledge
Knowledge presentation: visualization and knowledge
representation techniques are used to present the mined
knowledge to the user
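The steps above can be read as a pipeline. A minimal sketch in Python, assuming two hypothetical in-memory sources (`store_a`, `store_b`) with made-up records. The function names and the toy "mining" step (counting buyers per age band) are illustrative assumptions, not something taken from the slides:

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [
    {"id": 1, "age": 23, "item": "PC"},
    {"id": 2, "age": None, "item": "printer"},  # missing age: treated as noise
    {"id": 3, "age": 27, "item": "PC"},
]
store_b = [
    {"id": 4, "age": 41, "item": "PC"},
    {"id": 5, "age": 25, "item": "software"},
]

def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for source in sources for r in source]

def select(records, item):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["item"] == item]

def transform(records):
    """Data transformation: consolidate exact ages into decade bands."""
    return [f"{(r['age'] // 10) * 10}s" for r in records]

def mine(age_bands):
    """Data mining (toy version): count buyers per age band."""
    return Counter(age_bands)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a count threshold."""
    return {band: n for band, n in patterns.items() if n >= min_count}

data = integrate(clean(store_a), clean(store_b))
patterns = mine(transform(select(data, "PC")))
interesting = evaluate(patterns, min_count=2)
print(interesting)  # knowledge presentation, in its simplest form
```

Each function maps to one step of the list, so a step can be swapped for a more realistic implementation without touching the others.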
10
[Figure: data mining in business decision support. Labels: End User, Making Decisions, Pattern Evaluation, Data Exploration (Statistical Analysis, Querying and Reporting), Data Warehouse, Databases. Axis: increasing potential to support business decisions]
12
Database, data warehouse or other information repository: This
is a set of databases, data warehouses, spreadsheets, etc.
Data cleaning and data integration techniques are applied to
this data.
Database or data warehouse server: responsible for fetching
the relevant data based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to
guide the search, or evaluate the interestingness of resulting
patterns.
Data mining engine: Consists of a set of functional modules
for tasks such as characterization, association,
classification, cluster analysis, and evolution and deviation
analysis.
13
Pattern evaluation module: This component typically employs
interestingness measures and interacts with the data mining
modules so as to focus the search towards interesting
patterns.
Graphical user interface: This module communicates between
users and the data mining system. It allows the user to specify
a data mining query or task, browse database and data
warehouse schemas or data structures, evaluate mined
patterns and visualize the patterns in different forms.
14
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
15
Data mining tasks can be classified into two categories:
descriptive and predictive.
Descriptive: characterize the general properties of the data
in the database
Predictive: perform inference on the current data in order to
make predictions
Concept/Class description: Characterization and
discrimination
Data can be associated with classes or concepts
Class/concept descriptions can be derived via i) data
characterization ii) data discrimination, or iii) both
Data characterization: summarization of the general
characteristics of a target class of data (Example 1.4)
Data discrimination: comparison of the general features of
target class data objects with the general features of objects
from one or a set of contrasting classes (Example 1.5)
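As a rough illustration of the two ideas, the sketch below summarizes a target class by attribute means (characterization) and then compares it with a contrasting class (discrimination). The customer records, the `big_spender` flag and the choice of means as the summary are all assumptions made for the example:

```python
from statistics import mean

# Hypothetical customer records (illustrative data only).
customers = [
    {"age": 45, "income": 80, "big_spender": True},
    {"age": 50, "income": 90, "big_spender": True},
    {"age": 25, "income": 30, "big_spender": False},
    {"age": 35, "income": 50, "big_spender": False},
]
attributes = ["age", "income"]

def characterize(records, attributes):
    """Characterization: summarize a class by the mean of each attribute."""
    return {a: mean(r[a] for r in records) for a in attributes}

def discriminate(target, contrast, attributes):
    """Discrimination: target-class mean minus contrasting-class mean."""
    t = characterize(target, attributes)
    c = characterize(contrast, attributes)
    return {a: t[a] - c[a] for a in attributes}

target = [r for r in customers if r["big_spender"]]
contrast = [r for r in customers if not r["big_spender"]]

summary = characterize(target, attributes)
difference = discriminate(target, contrast, attributes)
print(summary)
print(difference)
```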
16
Association analysis (correlation and causality)
It is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data.
Used in transaction data analysis.
Uses rules of the form X ⇒ Y, interpreted as: database tuples
that satisfy the conditions in X are also likely to satisfy the
conditions in Y.
Multi-dimensional association rule: Association between
multiple attributes or predicates
Single-dimensional association rule: Association rule involves
single attribute or predicate
Example 1.6
age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”)
[support = 2%, confidence = 60%]
contains(T, “computer”) ⇒ contains(T, “software”)
[support = 1%, confidence = 50%]
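To make support and confidence concrete, here is a small sketch that computes both for the rule contains(T, "computer") ⇒ contains(T, "software"). The five transactions are invented for illustration, so the resulting percentages differ from those quoted in Example 1.6:

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing lhs that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

lhs, rhs = {"computer"}, {"software"}
rule_support = support(lhs | rhs, transactions)
rule_confidence = confidence(lhs, rhs, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Support measures how often X and Y occur together in the whole data set, while confidence measures how often Y occurs among the transactions that already contain X.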
17
Classification and Prediction (Tan, Kumar and Steinbach-330)
It is the process of finding a set of models (functions) that
describe and distinguish data classes or concepts to
predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is
known)
E.g., classify countries based on climate, or classify cars
based on gas mileage
Presentation: classification (IF-THEN) rules, mathematical
formulae, decision-tree, neural network
A decision tree is a flow-chart-like tree structure, where
each node denotes a test on an attribute value, each
branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
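The structure described above can be sketched as a nested dictionary, using the "classify cars based on gas mileage" example. The tree is written by hand rather than learned from training data, and the attribute names, thresholds and class labels are assumptions chosen for illustration:

```python
# Internal nodes test one attribute; leaves hold a class label.
tree = {
    "attribute": "weight_kg",
    "test": lambda v: "heavy" if v > 1500 else "light",
    "branches": {
        "heavy": {"label": "low mileage"},
        "light": {
            "attribute": "cylinders",
            "test": lambda v: "many" if v > 4 else "few",
            "branches": {
                "many": {"label": "medium mileage"},
                "few": {"label": "high mileage"},
            },
        },
    },
}

def classify(node, record):
    """Follow test outcomes from the root until a leaf is reached."""
    while "label" not in node:
        outcome = node["test"](record[node["attribute"]])
        node = node["branches"][outcome]
    return node["label"]

small_car = {"weight_kg": 1100, "cylinders": 4}
truck = {"weight_kg": 2400, "cylinders": 8}
print(classify(tree, small_car))
print(classify(tree, truck))
```

Each path from the root to a leaf corresponds to one IF-THEN rule, which is why decision trees and classification rules are often presented together.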
18
Neural network: When used for classification, a neural network is
typically a collection of neuron-like processing units with
weighted connections between the units.
Prediction: Predict some unknown or missing numerical values
See Example 1.7
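One simple way to predict a missing numerical value is to fit a straight line to known data by least squares. The slides do not name a method, so linear regression is used here as an assumed stand-in, and the experience/salary figures are invented:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

slope, intercept = fit_line(years, salary)
predicted = slope * 5 + intercept  # estimate the unknown salary at 5 years
print(slope, intercept, predicted)
```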
Cluster analysis
Clustering analyzes data objects without consulting a known
class label. It groups data to form new classes, e.g., cluster
houses to find distribution patterns
The objects are clustered or grouped based on the principle of
maximizing the intra-class similarity and minimizing the
interclass similarity.
19
Each cluster can be viewed as a class of objects from which
new rules can be derived (See Example 1.8)
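A small sketch of clustering without class labels, using the house example. It uses k-means on a single numeric attribute, which is one common algorithm chosen here as an assumption. The prices and the starting centers are made up:

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign to nearest center, recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices in thousands (illustrative data only).
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers)
print(clusters)
```

Objects within a cluster end up close to their own center and far from the other one, which is the intra-class versus inter-class similarity principle in its simplest form.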
Outlier analysis
Outlier: a data object that does not comply with the general
behavior or model of the data
It can be considered as noise or exception but is quite useful
in fraud detection
The rare events can be more interesting than the more
regularly occurring ones. The analysis of outlier data is
referred to as outlier mining.
Deviation-based methods identify outliers by examining
differences in the main characteristics of objects in a group.
See example 1.9.
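As a rough illustration of flagging objects that do not fit the general behavior of the data, the sketch below uses a z-score test. This is a simple statistical check, not necessarily the deviation-based method the slide refers to. The transaction amounts and the threshold of 2 are assumptions:

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` std deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts (illustrative data only).
amounts = [52, 48, 50, 51, 49, 50, 500]
outliers = find_outliers(amounts)
print(outliers)
```

In a fraud detection setting the flagged value would be passed on for review rather than discarded as noise.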
20
A data mining system/query may generate thousands of
patterns, but not all of them are interesting. A pattern is
interesting if it is
i) easily understood by humans
ii) valid on new or test data with some degree of certainty
iii) potentially useful
iv) novel
A pattern is also interesting if it validates a hypothesis that
the user seeks to confirm.
Interestingness measures
Objective interestingness measures: based on the structure
of discovered patterns and the statistics underlying them.
E.g., measure for association rules: support, confidence.
Subjective interestingness measures: based on user’s belief
in the data. These measures find patterns interesting if they
are unexpected or offer strategic information on which the
user can act (actionable).
21
Find all of the interesting patterns
Can a data mining system generate all of the interesting
patterns? This refers to the completeness of a data mining
algorithm.
User-provided constraints and interestingness measures
should be used to focus the search, which is often sufficient to
ensure completeness.
Search for only interesting patterns: Optimization
Can a data mining system generate only interesting patterns?
Generating only interesting patterns would be much more
efficient for users and data mining systems, but it remains a
challenging issue.
Approaches
First generate all of the patterns and then filter out the
uninteresting ones.
Generate only the interesting patterns—mining query
optimization
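The two approaches can be contrasted on frequent itemsets. The sketch assumes that "interesting" simply means "meets a minimum support", which is a simplification. The second function is a level-wise search in the spirit of Apriori that only extends itemsets already known to be frequent, and it omits Apriori's subset-pruning check. The transactions and the threshold are invented:

```python
from itertools import combinations

# Hypothetical transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
MIN_SUPPORT = 0.5
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def generate_then_filter():
    """Approach 1: enumerate every itemset, then drop the infrequent ones."""
    candidates = [frozenset(c)
                  for k in range(1, len(items) + 1)
                  for c in combinations(items, k)]
    return {c for c in candidates if support(c) >= MIN_SUPPORT}

def level_wise():
    """Approach 2: build larger itemsets only from frequent smaller ones."""
    frequent = set()
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= MIN_SUPPORT}
    while level:
        frequent |= level
        # Join frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = {c for c in candidates if support(c) >= MIN_SUPPORT}
    return frequent

all_then_filtered = generate_then_filter()
grown = level_wise()
print(sorted(sorted(s) for s in grown))
```

Both functions return the same itemsets here. The difference is how many candidates each one has to examine, which grows exponentially with the number of items for the first approach.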
22
[Figure: data mining as a confluence of multiple disciplines: database technology, statistics, machine learning, visualization, information science, other disciplines]
23
Kinds of knowledge to be mined
◦ Based on data mining functionalities: Characterization, discrimination,
association, classification, clustering, outlier and evolution analysis,
etc.
◦ An advanced data mining system should facilitate the discovery of
knowledge at multiple levels of abstraction.
Kinds of techniques utilized
◦ Based on underlying data analysis techniques: Database-oriented,
data warehouse (OLAP), machine learning, statistics, visualization,
neural network, etc.
◦ Based on the degree of user interaction: autonomous, interactive,
query-driven system.
Based on the applications adapted
◦ There could be data mining systems tailored specifically for finance,
telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, e-mail etc.
24
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances
in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2000.
T. Imielinski and H. Mannila. A database perspective on knowledge
discovery. Communications of the ACM, 39:58-64, 1996.
G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to
knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in
Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.
AAAI/MIT Press, 1991.
25