Professional Documents
Culture Documents
Unit I
Unit II
2
Syllabus
Unit III
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts,
Mining Methods: The Apriori Algorithm: Finding Frequent Itemsets Using
Candidate Generation, Generating Association Rules from Frequent
Itemsets,
Improving the Efficiency of Apriori, Mining Frequent Itemsets without Candidate
Generation. Mining various kinds of Association Rules: Mining Multilevel Association
Rules, Mining Multidimensional Association Rules from Relational Databases and Data
Warehouses.
Unit IV
Classification and Prediction: Classification, Prediction, Regarding
Classification and Prediction, Decision Tree Induction, Attribute Selection
Measures. Bayesian Classification: Naïve Bayesian Classification
Rule-Based Classification: Using IF-THEN Rules for Classification, Rule
Extraction from a Decision Tree, Rule Induction Using a Sequential
Covering Algorithm.
Unit V
Cluster Analysis: Cluster Analysis, Types of Data in Cluster Analysis, A
Categorization of Major Clustering Methods, Partitioning Methods.
Hierarchical Methods: Agglomerative and Divisive Hierarchical Clustering,
Density-Based Methods, Outlier Analysis
3
Data Mining Activities / Kinds of Patterns
5
Class/Concept Descriptions
Also known as Market Basket Analysis for its wide use in retail sales,
Association analysis aims to discover associations between items occurring
together frequently. Association analysis is based on rules having 2 parts:
antecedent (if)
consequent(then)
An antecedent is an item in a collection, which, when found, also indicates a
certain chance of finding a consequent in the collection. In other words, they
are associated. From the data association inferences like,
If a customer buys potato chips, he / she is about 70% likely to buy soft drinks along with it.
So, over a big chunk of data, you see this association with a certain degree of confidence.
It analyses the set of items that generally occur together in a transactional
dataset. Two parameters are used for determining the association rules:
It provides which identifies the common item set in the database.
Confidence is the conditional probability that an item occurs when another
item occurs in a transaction.
Multi-dimensional vs. single-dimensional association
age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
contains(T, “computer”) => contains(x, “software”) [support=1%, confidence=75%]
products using a training data set where the class for products is
known.
Numerical prediction
E.g., Weather, Stock Price, Traffic
Task-relevant Data
Data Selection
Warehouse
Data Cleaning
Data Integration
Databases
16
Views of DM
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics,
21
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and
multidimensional data model
Data cube technology
Scalable methods for computing (i.e., materializing)
multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: Characterization
and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
22
Data Mining Function: (2) Association
and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your
Online Purchase Platforms?
Association, correlation vs. causality
A typical association rule
Diaper Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering,
and other applications?
23
Data Mining Function: (3) Classification
24
Data Mining Function: (4) Cluster Analysis
25
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
26
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression
cards
Periodicity analysis
Similarity-based analysis
27
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
family, classmates, …
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
28
Evaluation of Knowledge
Are all mined knowledge interesting?
One can mine tremendous amount of “patterns” and knowledge
Some may fit only certain dimension space (time, location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only
interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
29
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technology Are Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
30
Data Mining: Confluence of Multiple Disciplines
31
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle such as tera-bytes of
data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
32
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technology Are Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
33
Applications of Data Mining
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
34
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technology Are Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
35
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
36
Major Issues in Data Mining (2)
37
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technology Are Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
38
A Brief History of Data Mining Society
41