
Reference Books:

1. Data Mining: Concepts and Techniques
By Jiawei Han and Micheline Kamber
Simon Fraser University, Canada
2. Book Slides:
Data Mining: Concepts and Techniques
By Jiawei Han and Micheline Kamber
Simon Fraser University, Canada
3. Introduction to Data Mining
By P N Tan, M Steinbach and V Kumar

 Data explosion problem
 Automated data collection tools and mature
database technology lead to tremendous amounts
of data stored in databases, data warehouses and
other information repositories
 We are drowning in data, but starving for knowledge!
 Solution: Data warehousing and data mining
 Data warehousing and on-line analytical processing
 Extraction of interesting knowledge (rules,
regularities, patterns, constraints) from data in
large databases

 Data mining / knowledge discovery in databases (KDD)
 Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases
 Alternative names and their “inside stories”:
 Data mining: a misnomer?
 Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
 What is not data mining?
 (Deductive) query processing.
 Expert systems or small ML/statistical programs

 Database analysis and decision support
◦ Market analysis and management
 Target marketing, customer relationship management,
market basket analysis, cross-selling, market
segmentation
◦ Risk analysis and management
 Forecasting, customer retention, quality control
etc.
◦ Fraud detection and management
 Other Applications
◦ Text mining (news group, email, documents) and
Web analysis.
◦ Intelligent query answering

 Where are the data sources for analysis?
 Credit card transactions, loyalty cards, discount coupons, customer
complaint calls, plus (public) lifestyle studies
 Target marketing
 Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
 Conversion of a single to a joint bank account (e.g., on marriage)
 Cross-market analysis
 Associations/correlations between product sales
 Prediction based on the association information
 Customer profiling
 Data mining can tell you what types of customers buy what
products (clustering or classification)
 Identifying customer requirements
 Identifying the best products for different customers
 Using prediction to find what factors will attract new customers

 Finance planning and asset evaluation
◦ cash flow analysis and prediction
◦ contingent claim analysis to evaluate assets
◦ cross-sectional and time series analysis (financial-
ratio, trend analysis, etc.)
 Resource planning:
◦ summarize and compare the resources and
spending
 Competition:
◦ monitor competitors and market directions
◦ group customers into classes and set up a class-based
pricing procedure
◦ set pricing strategy in a highly competitive market

 Applications
 widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
 Approach
 use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
 Examples
 auto insurance: detect a group of people who stage
accidents to collect on insurance
 money laundering: detect suspicious money transactions
 medical insurance: detect professional patients and rings
of doctors and rings of references
 Detecting inappropriate medical treatment
 Detecting telephone fraud

 Sports
 Astronomy
 Internet Web Surf-Aid
◦ IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer
preferences and behavior, analyze the effectiveness of
Web marketing, improve Web site organization, etc.

Available Data Mining/Data Warehousing tools

◦ WEKA
◦ DBMiner
◦ NeuralWare
◦ RapidMiner

[Figure: Data mining as the core of the knowledge discovery process. Stages from bottom to top: databases and flat files; data cleaning and integration; data warehouse; selection and transformation; task-relevant data; data mining; pattern evaluation.]

 Learning the application domain: relevant prior knowledge and
goals of application
 Data cleaning: removes noise and inconsistent data
 Data integration: multiple data sources are combined
 Data selection: data relevant to analysis task are retrieved
 Data transformation: data are transformed or consolidated into
forms appropriate for mining
 Data mining: an essential process where intelligent methods
are applied in order to extract data patterns of interest
 Pattern evaluation: identify the truly interesting patterns
representing knowledge
 Knowledge presentation: visualization and knowledge
representation techniques are used to present the mined
knowledge to the user
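
The steps listed above can be sketched as a chain of small functions. This is only an illustrative toy, not a method taken from the reference books: the record layout, the two sources `store_a` and `store_b`, the age bands, and the `min_count` threshold are all assumptions made for the example.

```python
# Hypothetical toy data: purchase records from two separate sources.
store_a = [
    {"customer": "c1", "age": 25, "item": "PC"},
    {"customer": "c2", "age": None, "item": "printer"},  # missing value
    {"customer": "c3", "age": 41, "item": "PC"},
]
store_b = [
    {"customer": "c4", "age": 23, "item": "PC"},
    {"customer": "c5", "age": 58, "item": "software"},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: consolidate raw ages into coarse age bands."""
    def band(age):
        return "under_30" if age < 30 else "30_plus"
    return [{"age_band": band(r["age"]), "item": r["item"]} for r in records]

def mine(records):
    """Data mining: count how often each (age_band, item) pair occurs."""
    counts = {}
    for r in records:
        key = (r["age_band"], r["item"])
        counts[key] = counts.get(key, 0) + 1
    return counts

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

combined = integrate(clean(store_a), clean(store_b))
prepared = transform(select(combined, ["age", "item"]))
patterns = evaluate(mine(prepared), min_count=2)
print(patterns)
```

Knowledge presentation is reduced here to a single `print`; a real system would use visualization or a report instead.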

[Figure: Layers with increasing potential to support business decisions, from bottom to top:
Data Sources (paper, files, information providers, database systems, OLTP);
Data Warehouses / Data Marts (OLAP, MDA), shown alongside the DBA;
Data Exploration (statistical analysis, querying and reporting);
Data Mining (information discovery), shown alongside the data analyst;
Data Presentation (visualization techniques), shown alongside the business analyst;
Making Decisions, shown alongside the end user.]

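[Figure: Components of a data mining system, from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration and filtering; databases and data warehouse. The knowledge base is drawn to the side.]
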
 Database, data warehouse or other information repository: This
is a set of databases, data warehouses, spreadsheets, etc.
Data cleaning and data integration techniques are applied to
this data.
 Database or data warehouse server: responsible for fetching
the relevant data based on the user’s data mining request.
 Knowledge base: This is the domain knowledge that is used to
guide the search, or evaluate the interestingness of resulting
patterns.
 Data mining engine: Consists of a set of functional modules
for tasks such as characterization, association,
classification, cluster analysis, and evolution and deviation
analysis.

 Pattern evaluation module: This component typically employs
interestingness measures and interacts with the data mining
modules so as to focus the search towards interesting
patterns.
 Graphical user interface: This module communicates between
users and the data mining system. It lets the user specify a data
mining query or task, browse database and data warehouse
schemas or data structures, evaluate mined patterns and
visualize the patterns in different forms.

 Relational databases
 Data warehouses
 Transactional databases
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Spatial databases
 Time-series data and temporal data
 Text databases and multimedia databases
 Heterogeneous and legacy databases
 WWW

 Data mining tasks can be classified into two categories:
descriptive and predictive.
 Descriptive: characterize the general properties of the data
in the database
 Predictive: perform inference on the current data in order to
make predictions
 Concept/Class description: Characterization and
discrimination
 Data can be associated with classes or concepts
 Class/concept descriptions can be derived via i) data
characterization ii) data discrimination, or iii) both
 Data characterization: summarization of the general
characteristics of a target class of data (Example 1.4)
 Data discrimination: comparison of the general features of
target class data objects with the general features of objects
from one or a set of contrasting classes (Example 1.5)
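
As a rough illustration of the two ideas above, the sketch below summarizes a target class by attribute means (characterization) and then subtracts the summary of a contrasting class (discrimination). The `customers` records, the `segment` labels, and the choice of the mean as the summary are assumptions for this example only.

```python
# Hypothetical customer records; "segment" plays the role of the class.
customers = [
    {"segment": "big_spender", "age": 45, "income": 90},
    {"segment": "big_spender", "age": 50, "income": 110},
    {"segment": "budget", "age": 25, "income": 30},
    {"segment": "budget", "age": 35, "income": 50},
]

def characterize(records, target, attributes):
    """Summarize a target class by the mean of each numeric attribute."""
    members = [r for r in records if r["segment"] == target]
    return {a: sum(r[a] for r in members) / len(members) for a in attributes}

def discriminate(records, target, contrast, attributes):
    """Compare the target-class summary with that of a contrasting class."""
    t = characterize(records, target, attributes)
    c = characterize(records, contrast, attributes)
    return {a: t[a] - c[a] for a in attributes}

profile = characterize(customers, "big_spender", ["age", "income"])
gap = discriminate(customers, "big_spender", "budget", ["age", "income"])
print(profile)
print(gap)
```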
Association analysis (correlation and causality)
 It is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data.
Used in transaction data analysis.
 Uses rules of the form X ⇒ Y, interpreted as: database tuples
that satisfy the conditions in X are also likely to satisfy the
conditions in Y.
 Multi-dimensional association rule: Association between
multiple attributes or predicates
 Single-dimensional association rule: Association rule involves
single attribute or predicate
 Example 1.6
 age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”)
[support = 2%, confidence = 60%]
 contains(T, “computer”) ⇒ contains(T, “software”)
[support = 1%, confidence = 50%]
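
The support and confidence figures attached to the rules above can be computed as follows, using the usual definitions: support is the fraction of transactions containing both sides of the rule, and confidence is the fraction of transactions containing X that also contain Y. The five transactions are invented for illustration and do not reproduce the percentages in Example 1.6.

```python
# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction also containing Y."""
    return support(x | y, transactions) / support(x, transactions)

x, y = {"computer"}, {"software"}
rule_support = support(x | y, transactions)
rule_confidence = confidence(x, y, transactions)
print(f"computer => software [support = {rule_support:.0%}, "
      f"confidence = {rule_confidence:.0%}]")
```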
Classification and Prediction (Tan, Kumar and Steinbach-330)
 It is the process of finding a set of models (functions) that
describe and distinguish data classes or concepts to
predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is
known)
 E.g., classify countries based on climate, or classify cars
based on gas mileage
 Presentation: classification (IF-THEN) rules, mathematical
formulae, decision-tree, neural network
 A decision tree is a flow-chart-like tree structure, where
each node denotes a test on an attribute value, each
branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
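
A decision tree of the kind described above can be sketched as nested dictionaries: each internal node names the attribute it tests, each branch is an outcome of that test, and each leaf is a class label. The tree itself (attributes `mileage` and `engine`, and the class names) is hand-written and hypothetical; learning a tree from training data is not shown.

```python
# Internal node: {"attribute": name, "branches": {outcome: subtree}}
# Leaf: a class label (plain string).
tree = {
    "attribute": "mileage",
    "branches": {
        "high": "economy",
        "low": {
            "attribute": "engine",
            "branches": {"large": "performance", "small": "standard"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

label = classify(tree, {"mileage": "low", "engine": "large"})
print(label)
```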
Neural network: When used for classification, a neural network is
typically a collection of neuron-like processing units with
weighted connections between the units.
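
A minimal sketch of such a network is given below: two hidden units feed one output unit, and each unit computes a weighted sum of its inputs. The sigmoid activation, the specific weights, and the 0.5 decision threshold are assumptions for illustration; the weights are fixed by hand rather than learned.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def unit(inputs, weights, bias):
    """One neuron-like unit: weighted sum of inputs, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_unit):
    """Pass inputs through the hidden units, then through the output unit."""
    hidden = [unit(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_unit
    return unit(hidden, w_out, b_out)

# Hypothetical, hand-picked weights: (weights, bias) per unit.
hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output_unit = ([1.2, -0.7], 0.05)

score = forward([1.0, 0.0], hidden_layer, output_unit)
predicted_class = "positive" if score >= 0.5 else "negative"
print(round(score, 3), predicted_class)
```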
Prediction: Predict some unknown or missing numerical values
(see Example 1.7).
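
One common way to predict a numerical value is to fit a straight line by least squares and extrapolate. The slides do not name a technique, so simple linear regression is used here as an assumed stand-in, and the `years` and `sales` figures are invented.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales observed in years 1 to 4.
years = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(years, sales)
predicted = slope * 5 + intercept  # estimate the unknown value for year 5
print(predicted)
```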

Cluster analysis
 Clustering analyzes data objects without consulting a known
class label. It groups data to form new classes, e.g., cluster
houses to find distribution patterns
 The objects are clustered or grouped based on the principle of
maximizing the intra-class similarity and minimizing the
interclass similarity.

 Each cluster can be viewed as a class of objects from which
new rules can be derived (See Example 1.8)
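
The grouping principle above can be illustrated with k-means, one widely used clustering algorithm (the slides do not name a specific algorithm, so this choice is an assumption). The house coordinates and the two starting centroids are made up, and the starting centroids are supplied by the caller so the run is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            (sum(p[0] for p in members) / len(members),
             sum(p[1] for p in members) / len(members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical house locations forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(houses, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

No class labels are consulted at any point; the two groups emerge from distances alone.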

Outlier analysis
 Outlier: a data object that does not comply with the general
behavior or model of the data
 It can be considered as noise or exception but is quite useful
in fraud detection
 The rare events can be more interesting than the more
regularly occurring ones. The analysis of outlier data is
referred to as outlier mining.
 Deviation-based methods identify outliers by examining
differences in the main characteristics of objects in a group.
See example 1.9.
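
A very simple way to flag values that do not comply with the general behavior of the data is to measure how many standard deviations each value lies from the mean. This is a statistical stand-in chosen for brevity, not the deviation-based method of the referenced example; the transaction amounts and the threshold of 2 are assumptions.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one unusual charge.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 500]
outliers = find_outliers(amounts)
print(outliers)
```

In a fraud detection setting the flagged value would be passed on for review rather than discarded as noise.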

 A data mining system/query may generate thousands of
patterns, but not all of them are interesting. A pattern is
interesting if
i) it is easily understood by humans
ii) it is valid on new or test data with some degree of certainty
iii) it is potentially useful
iv) it is novel, or it validates some hypothesis that a user seeks to
confirm.
Interestingness measures
 Objective interestingness measures: based on the structure
of discovered patterns and the statistics underlying them.
E.g., measure for association rules: support, confidence.
 Subjective interestingness measures: based on user’s belief
in the data. These measures find patterns interesting if they
are unexpected or offer strategic information on which the
user can act (actionable).

 Find all of the interesting patterns
 Can a data mining system generate all of the interesting
patterns? This refers to the completeness of a data mining
algorithm.
 User-provided constraints and interestingness measures
should be used to focus the search, which is often sufficient to
ensure completeness.
 Search for only interesting patterns: Optimization
 Can a data mining system generate only interesting patterns?
Generating only interesting patterns would be much more
efficient for users and data mining systems, but it remains a
challenging issue.
 Approaches
 First generate all of the patterns and then filter out the
uninteresting ones.
 Generate only the interesting patterns (mining query
optimization)
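
The two approaches can be contrasted on frequent itemsets, treating "interesting" as "appears in at least `MIN_COUNT` transactions". The first function enumerates every itemset and filters afterwards; the second is a level-wise search that only extends itemsets already found frequent. The level-wise strategy, the transactions, and the threshold are assumptions chosen for illustration, not a technique named in the slides.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "jam"},
]
MIN_COUNT = 2
items = sorted(set().union(*transactions))

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter():
    """Approach 1: enumerate all itemsets, then drop the infrequent ones."""
    candidates = [frozenset(c) for k in range(1, len(items) + 1)
                  for c in combinations(items, k)]
    frequent = {c for c in candidates if count(c) >= MIN_COUNT}
    return frequent, len(candidates)

def generate_only_frequent():
    """Approach 2: grow candidates only from itemsets already frequent."""
    level = {frozenset([i]) for i in items}
    frequent, examined = set(), 0
    while level:
        examined += len(level)
        kept = {c for c in level if count(c) >= MIN_COUNT}
        frequent |= kept
        # Join frequent k-itemsets into (k+1)-item candidates.
        level = {a | b for a in kept for b in kept
                 if len(a | b) == len(a) + 1}
    return frequent, examined

filtered, n_all = generate_then_filter()
pruned, n_pruned = generate_only_frequent()
print(n_all, n_pruned, filtered == pruned)
```

Both functions return the same frequent itemsets, but the second examines fewer candidates because nothing is built on top of an infrequent itemset.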

[Figure: Data mining at the center of several contributing disciplines: database technology, statistics, machine learning, visualization, information science, and other disciplines.]

 Kinds of knowledge to be mined
◦ Based on data mining functionalities: Characterization, discrimination,
association, classification, clustering, outlier and evolution analysis,
etc.
◦ An advanced data mining system should facilitate the discovery of
knowledge at multiple levels of abstraction.
 Kinds of techniques utilized
◦ Based on underlying data analysis techniques: Database-oriented,
data warehouse (OLAP), machine learning, statistics, visualization,
neural network, etc.
◦ Based on the degree of user interaction: autonomous, interactive,
query-driven system.
 Based on the applications adapted
◦ There could be data mining systems tailored specifically for finance,
telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, e-mail etc.

 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances
in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2000.
 T. Imielinski and H. Mannila. A database perspective on knowledge
discovery. Communications of the ACM, 39:58-64, 1996.
 G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to
knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in
Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.
AAAI/MIT Press, 1991.
