KDD - Knowledge Discovery in Databases

Lecture Knowledge Discovery in Databases
Chapter 1. Introduction
Prof. Dr. Klaus Meyer-Wegener
Summer Semester 2018
Lehrstuhl für Informatik 6 (Datenmanagement)

Friedrich-Alexander-Universität Erlangen-Nürnberg
Chapter 1. Introduction 1-2
 Why data mining?

 What is data mining?
 A multi-dimensional view of data mining
 What kinds of data can be mined?
 What kinds of patterns can be mined?
 What technologies are used?
 What kinds of applications are targeted?
 Major issues in data mining
 A brief history of data mining
 Summary

KDD, 1 Introduction
Why Data Mining? 1-3
 The Explosive Growth of Data:

from Terabytes to Petabytes and more
 Data collection and data availability
 Automated data-collection tools, database systems, Web, computerized
society, digitization
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 New buzzword: "Big Data"
 We are drowning in data, but starving for knowledge!
 "Necessity is the mother of invention"
 Data mining—Automated analysis of massive data sets

KDD, 1 Introduction
Evolution of Sciences 1-4
 Before 1600, Empirical Science

 1600–1950s, Theoretical Science
 Each discipline has grown a theoretical component.
 Theoretical models often motivate experiments and generalize our
understanding.
 1950s–1990s, Computational Science
 Over the last 50 years, most disciplines have grown a third, computational
branch
 E.g. empirical, theoretical, and computational ecology,
 or physics, or linguistics, …
 Computational Science traditionally meant simulation.
 It grew out of our inability to find closed-form solutions for complex
mathematical models.

KDD, 1 Introduction
Evolution of Sciences (2) 1-5
 1990-now, Data Science

 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and Computing Grid that makes all these archives universally
accessible
 Scientific-information management, acquisition, organization, query, and
visualization tasks scale almost linearly with data volumes.
 Data mining is a major new challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for
Online Science, Comm. ACM 45(11): 50-54, Nov. 2002

KDD, 1 Introduction
Evolution of Database Technology 1-6
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS products
 Advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s:
 Stream-data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
KDD, 1 Introduction
Chapter 1. Introduction 1-7

 Summary

KDD, 1 Introduction
What Is Data Mining? 1-8
 Data mining (knowledge discovery from data):

 Extraction of
interesting (non-trivial, implicit, previously unknown
and potentially useful)
patterns or knowledge
from huge amounts of data
 Data mining: a misnomer?
 Alternative names:
 Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
 Watch out: Is everything "data mining"?
 Simple search and query processing is not.
 (Deductive) expert systems are neither.

KDD, 1 Introduction
Knowledge Discovery (KDD) Process 1-9
 This is a view from typical Visualization,

database-systems and data- Pattern Evaluation
warehousing communities.
 Data mining plays an essential
role in the knowledge-discovery Data Mining
process.
Task-relevant Data
Data Warehouse Selection
Data Cleaning
Data Integration

Databases
KDD, 1 Introduction
Example: A Web-Mining Framework 1 - 10
 Web mining usually involves

 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data-cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used
or stored in a knowledge base

KDD, 1 Introduction
Data Mining in Business Intelligence 1 - 11
Increasing potential
to support
business decisions End User
Decision
Making
Data Presentation Business

Analyst
Visualization Techniques
Data Mining Data
Information Discovery Analyst
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses

DBA
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems

KDD, 1 Introduction
Example: Mining vs. Data Exploration 1 - 12
 Business-intelligence view
 Warehouse, data cube, reporting
 But not much mining
 Business objects vs. data-mining tools
 Supply-chain example: tools
 Data presentation
 Exploration

KDD, 1 Introduction
KDD Process: A Typical View from Machine 1 - 13
Learning and Statistics
Input Data Data Pre- Data Post-

Processing Mining Processing
Data integration Pattern discovery Pattern evaluation

Normalization Association & correlation Pattern selection
Feature selection Classification Pattern interpretation
Dimension reduction Clustering Pattern visualization
Outlier analysis
…
 This is a view from typical machine-learning and statistics communities.

KDD, 1 Introduction
Example: Medical Data Mining 1 - 14
 Health-care & medical data mining

 often adopted such a view
in statistics and machine learning.
 Preprocessing of the data
 includes feature extraction and dimension reduction.
 Classification or/and clustering processes
 Post-processing for presentation

KDD, 1 Introduction
CRISP-DM 1 - 15
 "CRoss-Industry Standard Process for Data Mining"

Business Understanding Data Understanding Data Preparation
• Determine Business • Collect Initial Data • Data Set
Objectives • Describe Data • Select Data
• Assess Situation • Explore Data • Clean Data
• Determine Data-Mining • Verify Data Quality • Construct Data
Goals • Integrate Data
• Produce Project Plan • Format Data
Modeling Evaluation Deployment

• Select Modeling • Evaluate Results • Plan Deployment
Technique • Review Process • Plan Monitoring and
• Generate Test Design • Determine Next Steps Maintenance
• Build Model • Produce Final Report
• Assess Model • Review Project

KDD, 1 Introduction
Chapter 1. Introduction 1 - 16

 Summary

KDD, 1 Introduction
A Multi-Dimensional View of Data Mining 1 - 17
 Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous, legacy),
data warehouse, transactional data, stream, spatiotemporal, time-series,
sequence, text and Web, multi-media, graphs & social and information
networks
 Knowledge to be mined (or: Data-mining functions)
 Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Database, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance computing, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining, stock-
market analysis, text mining, Web mining, etc.

KDD, 1 Introduction

 Summary

KDD, 1 Introduction
Data Mining: On What Kinds of Data? 1 - 19
 Database-oriented data sets and applications

 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structure data, graphs, social networks, and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 NoSQL databases
 Spatial data and spatiotemporal data
 Multimedia databases
 Text databases
 The World-Wide Web

KDD, 1 Introduction

 Summary

KDD, 1 Introduction
Data-Mining Function: 1 - 21
(1) Generalization
 Information integration and data-warehouse construction

 Data cleaning, transformation, integration, and multidimensional data model
 Data-cube technology
 Scalable methods for computing (i.e., materializing) multidimensional
aggregates
 OLAP (online analytical processing)
 Multidimensional concept description:
Characterization and discrimination
 Generalize, summarize, and contrast data characteristics
 E.g., dry vs. wet regions from numerical humidity values

KDD, 1 Introduction
Data-Mining Function: 1 - 22
(2) Association and Correlation Analysis
 Frequent patterns (or frequent itemsets)

 What items are frequently purchased together in your supermarket?
 Association, correlation vs. causality
 A typical association rule
 Diapers → Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly correlated?
 How to mine such patterns and rules efficiently in large datasets?
 How to use such patterns for classification, clustering, and other
applications?

KDD, 1 Introduction
Data Mining Function: 1 - 23
(3) Classification
 Classification and (class-) label prediction

 Construct models (functions) based on training examples
 Hence: "supervised"
 Describe and distinguish classes or concepts for future prediction
 E.g., classify countries based on climate,
or classify cars based on gas mileage
 Predict unknown class labels
 Typical methods:
 Decision trees, naïve Bayesian classification, support-vector machines,
(deep) neural networks, rule-based classification, pattern-based
classification, logistic regression, …
 Typical applications:
 Credit-card-fraud detection, direct marketing, classifying stars, diseases,
Web pages, …

KDD, 1 Introduction
(4) Cluster Analysis
 Unsupervised learning
 I.e., class (cluster) label is unknown
 Group data to form new categories (i.e., clusters)
 E.g., cluster houses to find distribution patterns
 Principle:
Maximize intra-class similarity &
minimize interclass similarity
 Many methods and applications
 Definition of similarity or proximity as input!

KDD, 1 Introduction
(5) Outlier Analysis
 Outlier:
 A data object that does not comply with the general behavior of the data
 Noise or exception?
 One person's garbage could be another person's treasure.
 Methods:
 By-product of clustering or regression analysis, …
 Useful in fraud detection, rare-events analysis

KDD, 1 Introduction
Time and Ordering: 1 - 26
Sequential Pattern, Trend, and Evolution Analysis
 Sequence, trend, and evolution analysis

 Trend, time-series, and deviation analysis
 E.g., regression and value prediction (forecasting)
 Sequential-pattern mining
 E.g., customers first buy digital camera, then buy large SD memory cards
 Periodicity analysis
 Motifs and biological-sequence analysis
 Approximate and consecutive motifs
 Similarity-based analysis
 Mining data streams
 Ordered, time-varying, potentially infinite (unbounded)

KDD, 1 Introduction
Structure and Network Analysis 1 - 27
 Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (Web fragments)
 Information-network analysis
 Social networks: actors (objects, nodes) and relationships (edges)
 E.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks
 A person could be in multiple information networks: friends, family,
classmates, …
 Links carry a lot of semantical information: Link mining
 Web mining
 Web is a big information network: from PageRank to Google
 Analysis of Web information networks
 Web community discovery, opinion mining, usage mining, …

KDD, 1 Introduction
Evaluation of Knowledge 1 - 28
 Is all mined knowledge interesting?

 One can mine tremendous amounts of "patterns" and knowledge.
 Some may fit only certain dimension space (time, location, … ).
 Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only interesting
knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …

KDD, 1 Introduction

 Summary

KDD, 1 Introduction
Data Mining: Confluence of Multiple Disciplines 1 - 30
Pattern
Machine Recognition
Statistics
Learning
Applications Visualization
Data Mining
High-Performance
Algorithms
Database Computing
Technology

KDD, 1 Introduction
Why Confluence of Multiple Disciplines? 1 - 31
 Tremendous amount of data

 Algorithms must be highly scalable
to handle also terabytes of data.
 High dimensionality of data
 DNA microarrays may have tens of thousands of dimensions.
 Collections of microscopic DNA spots attached to a solid surface
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social networks, and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
KDD, 1 Introduction

 Summary

KDD, 1 Introduction
Applications of Data Mining 1 - 33
 Web-page analysis:
 From Web-page classification, clustering to PageRank & HITS algorithms
 HITS = Hyperlink-Induced Topic Search
 Collaborative analysis & recommender systems
 Basket-data analysis for targeted marketing
 Biological and medical data analysis:
 Classification, cluster analysis (microarray data analysis), biological
sequence analysis, biological network analysis
 Data mining and software engineering
 E.g., IEEE Computer, Aug. 2009 issue
 From major dedicated data-mining systems/tools
 E.g., SAS, MS SQL-Server Analysis Manager, Oracle Data-Mining Tools
 To invisible data mining

KDD, 1 Introduction

 Summary

KDD, 1 Introduction
Major Issues in Data Mining 1 - 35
 Mining methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results

KDD, 1 Introduction
Major Issues in Data Mining (2) 1 - 36
 Efficiency and Scalability

 Efficiency and scalability of data-mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining

KDD, 1 Introduction

 Summary

KDD, 1 Introduction
A Brief History of Data-Mining Society 1 - 38
 1989 IJCAI Workshop on Knowledge Discovery in Databases

 Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
 1991-1994 Workshops on Knowledge Discovery in Databases
 Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
 1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
 Journal of Data Mining and Knowledge Discovery (1997)
 ACM SIGKDD conferences since 1998 and SIGKDD Explorations
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
 Journal ACM Transactions on KDD starting in 2007
KDD, 1 Introduction
Conferences and Journals on Data Mining 1 - 39
 KDD Conferences  Other related conferences

 ACM SIGKDD Int. Conf. on  DB conferences: ACM
Knowledge Discovery in
Databases and Data Mining SIGMOD, VLDB, ICDE, EDBT,
(KDD) ICDT, …
 SIAM Data Mining Conf. (SDM)  Web and IR conferences:
 (IEEE) Int. Conf. on Data Mining WWW, SIGIR, WSDM
(ICDM)  ML conferences: ICML, NIPS
 European Conf. on Machine
 PR conferences: CVPR
Learning and Principles and
Practices of Knowledge Discovery  Journals
and Data Mining (ECML-PKDD)  Data Mining and Knowledge
 Pacific-Asia Conf. on Knowledge Discovery (DAMI or DMKD)
Discovery and Data Mining
(PAKDD)  IEEE Trans. On Knowledge and
 Int. Conf. on Web Search and Data Eng. (TKDE)
Data Mining (WSDM)  KDD Explorations
 ACM Trans. on KDD
KDD, 1 Introduction
Where to Find References? DBLP, CiteSeer, Google 1 - 40
 Data mining and KDD (SIGKDD: CD-ROM)

 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
 Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM
TKDD
 KDnuggets: www.kdnuggets.com
 Database systems (SIGMOD: ACM SIGMOD Anthology—CD-ROM)
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT,
DASFAA
 Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys.,
etc.
 AI & Machine Learning
 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory),
CVPR, NIPS, etc.
 Journals: Machine Learning, Artificial Intelligence, Knowledge and
Information Systems, IEEE-PAMI, etc.
KDD, 1 Introduction
Where to Find References? (2) 1 - 41
 Web and IR
 Conferences: SIGIR, WWW, CIKM, etc.
 Journals: WWW: Internet and Web Information Systems
 Statistics
 Conferences: Joint Stat. Meeting, etc.
 Journals: Annals of Statistics, etc.
 Visualization
 Conferences: CHI, ACM-SIGGraph, etc.
 Journals: IEEE Trans. Visualization and Computer Graphics, etc.

KDD, 1 Introduction

 Summary

KDD, 1 Introduction
Summary 1 - 43
 Data mining:
 Discovering interesting patterns and knowledge
from massive amounts of data
 A natural evolution of database technology
 In great demand, with wide applications
 KDD process includes:
 Data cleaning, data integration, data selection, transformation, data mining,
pattern evaluation, and knowledge presentation
 Mining can be performed in a variety of data.
 Data-mining functionalities:
 Characterization, discrimination, association, classification, clustering, outlier
and trend analysis, etc.
 Data-mining technologies and applications
KDD, 1 Introduction
Recommended Reference Books 1 - 44
 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-

Structured Data. Morgan Kaufmann, 2002
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John
Wiley & Sons, 2003
 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. 2ed., Wiley-
Interscience, 2000
 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy.
Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data
Mining and Knowledge Discovery, Morgan Kaufmann, 2001
 J. Han, M. Kamber, and J. Pei. Data Mining : Concepts and Techniques.
Morgan Kaufmann, 3rd ed., 2012
 D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining. MIT Press,
2001

KDD, 1 Introduction
Recommended Reference Books (2) 1 - 45
 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical

Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer-Verlag,
2009
 B. Liu, Web Data Mining. Springer 2006.
 T. M. Mitchell, Machine Learning. McGraw Hill, 1997
 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.
AAAI/MIT Press, 1991
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining. Wiley,
2005
 S. M. Weiss and N. Indurkhya, Predictive Data Mining. Morgan Kaufmann,
1998
 I. H. Witten, E. Frank and M. A. Hall, Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, 3rd ed. 2011

KDD, 1 Introduction
Other Literature 1 - 46
 Colin Shearer: The CRISP-DM Model: The New Blueprint for Data Mining.
Journal of Data Warehousing, vol. 5, no. 4, Fall 2000, pp. 13-22
 Tao Xie, Suresh Thummalapenta, David Lo, and Chao Liu: Data Mining for
Software Engineering. IEEE Computer, August 2009, pp. 55-62
 Rob J. Hyndman and George Athanasopoulos: Forecasting: Principles and
Practice. 2nd ed. Monash University, Australia, April 2018.
https://otexts.org/fpp2/

KDD, 1 Introduction
Chapter 2: Getting to Know Your Data


Chapter 2: Getting to Know Your Data 2-2
 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary

KDD, 2 Data
Types of Data Sets 2-3
 Record
 Relational records
 Data matrix, e.g., numerical matrix,
timeout
season
coach
game
score
crosstabs
team
ball
lost
play
win
 Document data: text documents;
term-frequency vector
 Transaction data
 Graph and network Document 1 3 0 5 0 2 6 0 2 0 2
 World-wide Web Document 2 0 7 0 2 1 0 0 3 0 0

 Social or information networks
 Molecular Structures Document 3 0 1 0 0 1 2 2 0 3 0
 Ordered
 Video data: sequence of images
 Temporal data: time-series TID Items
 Sequential Data: transaction sequences
 Genetic sequence data 1 Bread, Coke, Milk
 Spatial, image and multimedia:
2 Beer, Bread
 Spatial data: maps
 Image data 3 Beer, Coke, Diapers, Milk
 Video data
4 Beer, Bread, Diapers, Milk
5 Coke, Diapers, Milk
KDD, 2 Data
Important Characteristics of Structured Data 2-4
 Dimensionality
 Curse of dimensionality (sparse high-dimensional data spaces)
 Sparsity
 Only presence counts
 Resolution
 Patterns depend on the scale
 Distribution
 Centrality and dispersion

KDD, 2 Data
Data Objects 2-5
 Data sets are made up of data objects.

 A data object represents an entity.
 Examples:
 Sales database: customers, store items, sales
 Medical database: patients, treatments
 University database: students, professors, courses
 Also called:
 Samples, examples, instances, data points, objects, tuples, …
 Data objects are described by attributes.

 Database rows → data objects
 Columns → attributes

KDD, 2 Data
Attributes 2-6
 Attribute
 Also called: field, dimension, feature, variable, …
 A data field, representing a characteristic or feature of a data object
 E.g., customer_ID, name, address
 Types:
 Nominal
 Binary
 Ordinal
 Numerical: quantitative
 Interval-scaled
 Ratio-scaled

KDD, 2 Data
Attribute Types 2-7
 Nominal
 Categories, states, or "names of things"
 E.g., hair_color = {auburn, black, blond, brown, grey, red, white}
 Other examples: marital status, occupation, ID number, ZIP code
 Binary
 Nominal attribute with only two states (0 and 1)
 Symmetric binary: both outcomes equally important
 E.g., gender
 Asymmetric binary: outcomes not equally important
 E.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV positive)
 Ordinal
 Values have a meaningful order (ranking),
but magnitude between successive values is not known.
 E.g., size = {small, medium, large}, grades, army rankings

KDD, 2 Data
Numerical Attribute Types 2-8
 Numerical: Quantity (integer- or real-valued)

 Interval-scaled
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in C˚ or F˚, calendar dates
 No true zero-point
 Ratio-scaled
 Inherent zero-point
 We can speak of values as being an order of magnitude larger than the unit
of measurement (10 K˚ is twice as high as 5 K˚).
 E.g., temperature in Kelvin, length, counts, monetary quantities

KDD, 2 Data
Discrete vs. Continuous Attributes 2-9
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., ZIP code, profession, or the set of words in a collection of
documents
 Sometimes represented as integer variables
 Note: Binary attributes are a special case of discrete attributes.
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite
number of digits.
 Continuous attributes are typically represented as floating-point variables.

KDD, 2 Data
Chapter 2: Getting to Know Your Data 2 - 10

 Summary

KDD, 2 Data
Basic Statistical Descriptions of Data 2 - 11
 Motivation
 To better understand the data: central tendency, variation, and spread
 Data-dispersion characteristics
 Median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals.
 Data dispersion: analyzed with multiple granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube

KDD, 2 Data
Measuring the Central Tendency 2 - 12
 Mean (algebraic measure):

 Note: n is sample (data-set) size and N is population size.
 Weighted arithmetic mean:

1 n
x   xi x
n i 1 N
 Trimmed mean: chopping extreme values
n
w x i i
x i 1
n
w
i 1
i

KDD, 2 Data
Measuring the Central Tendency (2) 2 - 13
 Median:
 Middle value if odd number of values,
or average of the middle two values otherwise
 Estimated by interpolation (for grouped data):
N / 2  ( freq) l
median  L1  ( ) width median
freqmedian
with L1 lower boundary of median interval,
( freq)l sum of frequencies of all intervals
lower than median
 Mode:
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula: mean  mode  3  (mean  median )
KDD, 2 Data
Symmetrical vs. Skewed Data 2 - 14
 Median, mean, and mode of symmetrical

symmetrical, positively and
negatively skewed data
positively skewed negatively skewed

KDD, 2 Data
Measuring the Dispersion of Data 2 - 15
 Quartiles, outliers, and boxplots

 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five-number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked;
add whiskers and plot outliers individually
 Outlier: usually assigned to values higher/lower than 1.5 x IQR
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
1 n 1 n 2 1 n 1 n 1 n 2
s 
2

n  1 i 1
( xi  x ) 
2
[ xi  ( xi ) 2 ]
n  1 i 1 n i 1
   ( xi   )   xi   2
2
N i 1
2
N i 1
 Standard deviation s (or σ) is the square root of variance s2 (or σ2)

KDD, 2 Data
Boxplot Analysis 2 - 16
 Five-number summary of a
distribution
 Minimum, Q1, median, Q3,
maximum
 Boxplot
 Data is represented with a box.
 The ends of the box are at the
first and third quartiles, i.e., the
height of the box is IQR.
 The median is marked by a line
within the box.
 Whiskers: two lines outside the
box extended to minimum and
maximum
 Outliers: points beyond a
specified outlier threshold,
plotted individually
KDD, 2 Data
Visualization of Data Dispersion: 3-D Boxplots 2 - 17

KDD, 2 Data
Properties of Normal Distribution Curve 2 - 18
 The normal (distribution) curve

 From μ – σ to μ + σ: contains about 68% of the measurements
 μ: mean, σ: standard deviation
 From μ – 2σ to μ + 2σ: contains about 95% of it
 From μ – 3σ to μ + 3σ: contains about 99.7% of it

KDD, 2 Data
Graphic Displays of Basic Statistical Descriptions 2 - 19
 Boxplot:
 Graphic display of five-number summary
 Histogram:
 X-axis are values, y-axis repres. frequencies
 Quantile plot:
 Each value xi is paired with fi indicating that approximately 100 fi % of data
are  xi
 Quantile-quantile (q-q) plot:
 Graphs the quantiles of one univariant distribution against the corresponding
quantiles of another
 Scatter plot:
 Each pair of values is a pair of coordinates and plotted as points in the plane.

KDD, 2 Data
Histogram Analysis 2 - 20
 Histogram: Graph display of

tabulated frequencies, shown as
bars
40
 It shows what proportion of cases
35
fall into each of several
30
categories.
25
 Differs from a bar chart in that it
20
is the area of the bar that denotes
15
the value, not the height as in bar
10
charts, a crucial distinction when
5
the categories are not of uniform
0
width. 10000 30000 50000 70000 90000
 The categories are usually

specified as non-overlapping
intervals of some variable. The
categories (bars) must be
adjacent.

KDD, 2 Data
Histograms Often Tell More than Boxplots 2 - 21
 The two histograms shown to

the left may have the same
boxplot representation.
 The same values for: min, Q1,
median, Q3, max
 But they have rather different
data distributions.

KDD, 2 Data
Quantile Plot 2 - 22
 Displays all of the data

 Allowing the user to assess both the overall behavior and unusual
occurrences
 Plots quantile information
 For a data xi sorted in increasing order, fi indicates that approximately
100 fi % of the data are below or equal to the value xi .

KDD, 2 Data
Quantile-Quantile (Q-Q) Plot 2 - 23
 Graphs the quantiles of one univariate distribution

against the corresponding quantiles of another
 View: Is there is a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.

KDD, 2 Data
Scatter plot 2 - 24
 Provides a first look at bivariate data

to see clusters of points, outliers, etc.
 Each pair of values is treated as a pair of coordinates
and plotted as points in the plane.

KDD, 2 Data
Positively and Negatively Correlated Data 2 - 25
 The left half fragment is positively

correlated.
 The right half is negative
correlated.

KDD, 2 Data
Uncorrelated Data 2 - 26

KDD, 2 Data
Data Profiling 2 - 27
(Naumann 2013)
 More from the database perspective
 Derive metadata such as
 data types and value patterns,
 completeness and uniqueness of columns,
 keys and foreign keys, and
 occasionally functional dependencies and association rules;
 discovery of inclusion dependencies and conditional functional
dependencies.
 Statistics:
 number of null values, and distinct values in a column,
 data types,
 most frequent patterns of values.

KDD, 2 Data

 Summary

KDD, 2 Data
Data Visualization 2 - 29
 Why data visualization?

 Gain insight into an information space
by mapping data onto graphical primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, relationships among data
 Help find interesting regions and suitable parameters for further
quantitative analysis
 Provide a visual proof of computer representations derived
 Categorization of visualization methods:
 Pixel-oriented
 Geometric projection
 Icon-based
 Hierarchical
 Visualizing complex data and relations

KDD, 2 Data
Pixel-Oriented Visualization Techniques 2 - 30
 For a data set of m dimensions, create m windows on the screen,

one for each dimension.
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows.
 The colors of the pixels reflect the corresponding values.
(a) Summer
IncomeSemester 2018 (b) Credit Limit (c) Transaction Volume (d) Age
KDD, 2 Data
Laying Out Pixels in Circle Segments 2 - 31
 To save space and show the connections among multiple dimensions, space
filling is often done in a circle segment.
(a) Representing a data record in

circle segment (b) Laying out pixels in circle segment
KDD, 2 Data
Geometric Projection Visualization Techniques 2 - 32
 Visualization of geometric transformations and projections of the

data
 Methods:
 Scatter plot and scatter-plot matrices
 Landscapes
 Projection pursuit technique:
Help users find meaningful projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates

KDD, 2 Data
Scatter-plot Matrices 2 - 33
 Matrix of scatter plots

(x-y diagrams)
of k-dim. data
[total of (k2/2–k)
scatter plots]
 each dimension
with every other
dimension
Used by permission of M. Ward,
Worcester Polytechnic Institute

KDD, 2 Data
Landscapes 2 - 34
 Visualization of the data as perspective landscape news articles

visualized as
 The data needs to be a landscape
transformed into a
(possibly artificial)
2D spatial
representation
which preserves
the characteristics
of the data.
Used by permission of B.
Wright, Visible Decisions
Inc.

KDD, 2 Data
Parallel Coordinates of a Data Set 2 - 35

KDD, 2 Data
Icon-Based Visualization Techniques 2 - 36
 Visualization of the data values as features of icons

 Typical visualization methods
 Chernoff Faces
 Stick Figures
 General techniques
 Shape coding:
 Use shape to represent certain information encoding
 Color icons:
 Use color icons to encode more information
 Tile bars:
 Use small icons to represent the relevant feature vectors in document
retrieval

KDD, 2 Data
Chernoff Faces 2 - 37
 A way to display variables on a two-dimensional surface

 E.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics (head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size,
mouth shape, mouth size, and mouth opening). Each assigned one of 10
possible values, generated using Mathematica (S. Dickson)
References:
 Gonick, L. and Smith, W.
The Cartoon Guide to Statistics.
New York: Harper Perennial, 1993, p. 212
 Weisstein, Eric W. "Chernoff Face."
From MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
KDD, 2 Data
Stick Figure 2 - 38
A census data
figure showing age,
income, gender,
education, etc.
A 5-piece stick
figure (1 body and 4
limbs w. different
angle/length)

KDD, 2 Data
Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs. Look at texture pattern
Hierarchical Visualization Techniques 2 - 39
 Visualization of the data

using a hierarchical partitioning into subspaces
 Methods
 Worlds-within-Worlds
 Tree-Map
 Cone Trees
 InfoCube

KDD, 2 Data
Worlds-within-Worlds 2 - 40
 Assign the function and two most important parameters

to innermost world
 Fix all other parameters at constant values –
draw other (1 or 2 or 3) dimensional worlds choosing these as the
axes
 Software that uses
this paradigm:
 N–vision:
Dynamic interaction
through data glove and
stereo displays,
including rotation,
scaling (inner) and
translation (inner/outer)
 Auto Visual:
Static interaction by
means of queries
KDD, 2 Data
Tree-Map 2 - 41
 Screen-filling method
 Uses a hierarchical partitioning of the screen
into regions depending on the attribute values
 X- and y-dimension of the screen partitioned alternately
according to the attribute values (classes)
MSR Netscan Image

KDD, 2 Data
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
Tree-Map of a File System (Schneiderman) 2 - 42

KDD, 2 Data
InfoCube 2 - 43
 3-D visualization technique

 Hierarchical information is displayed
as nested semi-transparent cubes
 The outermost cubes correspond to the top level data,
while the subnodes or the lower level data
are represented as smaller cubes inside the outermost cubes, and so on

KDD, 2 Data
Three-D Cone Trees 2 - 44
 3D cone-tree visualization
technique
 works well for up to a thousand
nodes or so
 First build a 2D circle tree that
arranges its nodes in concentric
circles centered on the root node
 Cannot avoid overlaps when
projected to 2D
 G. Robertson, J. Mackinlay, S.
Card. “Cone Trees: Animated 3D
Visualizations of Hierarchical
Information”, ACM SIGCHI'91
 Graph from Nadeau Software
Consulting website: Visualize a
social network data set that
models the way an infection
spreads from one person to the
next

KDD, 2 Data Ack.: http://nadeausoftware.com/articles/visualization
Visualizing Complex Data and Relations 2 - 45
 Visualizing non-numerical data: text and social networks

 Tag cloud: visualizing user-generated tags
 The importance of tag
is represented by
font size/color
 Besides text data,
there are also methods
to visualize relationships,
such as visualizing
social networks.

KDD, 2 Data Newsmap: Google News Stories in 2005

 Summary

KDD, 2 Data
Similarity and Dissimilarity 2 - 47
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike.
 Often falls in the range [0,1]
 Dissimilarity
 E.g., distance
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity
 refers to similarity or dissimilarity

KDD, 2 Data
Data Matrix and Dissimilarity Matrix 2 - 48
 Data matrix
 n data points with p dimensions  x11 ... x1f ... x1p 
 Two-mode matrix  
 ... ... ... ... ... 
x ... x if ... x ip 
 i1 
 ... ... ... ... ... 
x ... x nf ... x np 
 n1 
 Dissimilarity matrix
 n data points, but registers only  0 
the distance  d(2,1) 0 
 A triangular matrix  
 d(3,1) d ( 3,2) 0 
 One-mode matrix  
 : : : 
d ( n,1) d ( n,2) ... ... 0
KDD, 2 Data
Proximity Measure for Nominal Attributes 2 - 49
 Can take two or more states

 E.g., red, yellow, blue, green
 Generalization of a binary attribute
 Values can be the same (distance 0) or different (1)
 More options for sets of nominal attributes (variables)
 Method 1: Simple matching
 m: # of matches, p: total # of variables
d (i, j)  p pm
 Method 2: Use a large number of binary attributes

 Creating a new binary attribute for each of the M nominal states

KDD, 2 Data
Proximity Measure for Binary Attributes 2 - 50
Object j
 A contingency table
for binary data Object i
 Counting matches
 Distance measure for
symmetrical binary variables:
 Distance measure for
asymmetrical binary variables:
 Jaccard coefficient
(similarity measure for
asymmetrical binary variables):
 Note: Jaccard coefficient is the
same as "coherence":

KDD, 2 Data
Dissimilarity between Binary Variables 2 - 51
 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
 Gender is a symmetrical attribute.

 The remaining attributes are asymmetrical binary.
 Let the values Y and P be 1, and the value N be 0
01
d ( jack , mary )   0.33
2 01
11
d ( jack , jim )   0.67
111
1 2
d ( jim , mary )   0.75
11 2
KDD, 2 Data
Standardizing Numerical Data 2 - 52

z  x
 Z-score:
 x: raw score to be standardized, μ: mean of the population,
σ: standard deviation
 The distance between the raw score and the population mean in units of the
standard deviation.
 Negative when the raw score is below the mean, "+" when above
 An alternative way: Calculate the mean absolute deviation
s  1 (| x  m |  | x  m | ... | x  m |)
f n 1f f 2f f nf f
 where m  1 (x  x  ...  x ).
f n 1f 2f nf
x m
if f
 Standardized measure (z-score): z 
if s
f
 Using mean absolute deviation is more robust than using standard

deviation.
KDD, 2 Data
Example: 2 - 53
Data Matrix and Dissimilarity Matrix
Data Matrix
x2 x4 point attribute1 attribute2
x1 1 2
x2 3 5
4
x3 2 0
x4 4 5
Dissimilarity Matrix
2 (with Euclidean Distance)
x1
x1 x2 x3 x4
x1 0
x2 3,61 0
x3 x3 5,1 5,1 0
0 2 4 x4 4,24 1 5,39 0

KDD, 2 Data
Distance on Numerical Data: Minkowski Distance 2 - 54
 Minkowski distance: A popular distance measure
 where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order
 The distance so defined is also called Lh norm.
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric.

KDD, 2 Data
Special Cases of Minkowski Distance 2 - 55
 h = 1: Manhattan (city block, L1 norm) distance

 E.g., the Hamming distance: the number of bits that are different between
two binary vectors
d (i, j) | x  x |  | x  x | ... | x  x |
i1 j1 i2 j 2 ip jp
 h = 2: Euclidean (L2 norm) distance
d (i, j)  (| x  x |2  | x  x |2 ... | x  x |2 )
i1 j1 i2 j 2 ip jp
 h  : "supremum" (Lmax norm, L norm) distance
 This is the maximum difference between any component (attribute) of the
vectors.

KDD, 2 Data
Example: Minkowski Distance 2 - 56
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan (L1)
x1 1 2
L1 x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
x2 x4 Euclidean (L2)
L2 x1 x2 x3 x4
4 x1 0
x2 3,61 0
x3 2,24 5,1 0
x4 4,24 1 5,39 0
2 x1
Supremum
L x1 x2 x3 x4
x3 x1 0
0 2 4 x2 3 0
x3 2 5 0
KDD, 2 Data
x4 3 1 5 0
Ordinal Variables 2 - 57
 An ordinal variable can be discrete or continuous.

 Order is important, e.g., rank
 Can be treated like interval-scaled
 Replace xif by their rank
rif {1,..., M f }
 Map the range of each variable onto [0, 1]
by replacing i-th object in the f-th variable by
rif 1
zif 
M f 1
 Compute the dissimilarity using methods for interval-scaled variables

KDD, 2 Data
Attributes of Mixed Type 2 - 58
 A database may contain all attribute types:

 Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One can use a weighted formula to combine their effects
 pf  1 ij( f )dij( f )
d (i, j) 
 pf  1 ij( f )
 f is binary or nominal:
 dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
 f is numeric:
 Use the normalized distance
 f is ordinal:
 Compute ranks rif and
zif 
r 1
if
 Treat zif as interval-scaled
M 1 f

KDD, 2 Data
Cosine Similarity 2 - 59
 A document can be represented by thousands of attributes, each

recording the frequency of a particular word (such as keywords) or
phrase in the document.
 Other vector objects: gene features in microarrays, …

 Applications: information retrieval, biologic taxonomy, gene-feature mapping,
...
 Cosine measure:
 If d1 and d2 are two vectors (e.g., term-frequency vectors), then
sim(d1, d2) = (d1  d2) / ||d1|| ||d2|| ,
where  indicates vector dot product and ||d|| the length of vector d
 Ignores zero-matches
KDD, 2 Data
Example: Cosine Similarity 2 - 60
sim(d1, d2) = (d1  d2) / ||d1|| ||d2||
 Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5 = (42)0.5
= 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5 =(17)0.5
= 4.12
sim(d1, d2) = 0.94

KDD, 2 Data

 Summary

KDD, 2 Data
Summary 2 - 62
 Data attribute types:

 Nominal, binary, ordinal, interval-scaled, ratio-scaled
 Many types of data sets
 E.g., numerical, text, graph, Web, image
 Gain insight into the data by:
 Basic statistical data description:
 Central tendency, dispersion, graphical displays
 Data visualization:
 Map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed,
but still an active area of research.

KDD, 2 Data
References 2 - 63
W. Cleveland, Visualizing Data, Hobart Press, 1993

T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning, John Wiley,
2003
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster
Analysis, John Wiley & Sons, 1990.
H. V. Jagadish et al., Special Issue on Data Reduction Techniques, Bulletin of the
Tech. Committee on Data Eng., 20(4), Dec. 1997
D. A. Keim, Information visualization and visual data mining, IEEE Trans. on
Visualization and Computer Graphics, 8(1), 2002
F. Naumann, Data Profiling Revisited, ACM SIGMOD Record, 32(4), 2013, 40-49
D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999
S. Santini and R. Jain, Similarity measures, IEEE Trans. on Pattern Analysis and
Machine Intelligence, 21(9), 1999
E. R. Tufte, The Visual Display of Quantitative Information, 2nd ed., Graphics Press,
2001
C. Yu et al., Visual data mining of multimedia data for social and behavioral studies,
Information Visualization, 8(1), 2009

KDD, 2 Data
Chapter 3: Data Preprocessing


Chapter 3: Data Preprocessing 3-2
 Data Preprocessing: An Overview

 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary

KDD, 3 Preprocessing
Data Quality: Why Preprocess the Data? 3-3
 Measures for data quality: A multidimensional view

 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling refs, …
 Timeliness: timely update?
 Believability: how trustable that the data are correct?
 Interpretability: how easily can the data be understood?
 And many more!

Major Tasks in Data Preprocessing 3-4
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept-hierarchy generation

Chapter 3: Data Preprocessing 3-5

 Data Quality
 Data Cleaning
 Data Reduction
 Summary

Data Cleaning 3-6
 Data in the real world is dirty:

Lots of potentially incorrect data
 E.g., instrument faulty, human or computer error, transmission error
 Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
 E.g., occupation = " " (missing data)
 Noisy: containing noise, errors, or outliers
 Stochastic deviation, imprecision
 E.g., measurements
 Inconsistent: containing discrepancies in codes or names
 E.g., age = "42", birthday = "03/07/2010"
 Was rating "1, 2, 3", now rating "A, B, C"
 Discrepancy between duplicate records (e.g. address old and new)
 Intentional (only default value, e.g., disguised missing data)
 Jan. 1 as everyone's birthday?
Incomplete (Missing) Data 3-7
 Data is not always available.

 E.g., many tuples have no recorded value for several attributes.
 Such as customer income in sales data
 Missing data may be due to:
 Equipment malfunction
 Inconsistency with other recorded data and thus deleted
 Data not entered due to misunderstanding
 Certain data may not be considered important at the time of entry
 Not registered history or changes of the data
 Missing data may need to be inferred.

How to Handle Missing Data? 3-8
 Ignore the tuple

 Usually done when class label is missing (when doing classification)
 Not effective when the percentage of missing values per attribute varies
considerably
 Fill in the missing value manually
 Tedious + infeasible?
 Fill in automatically with:
 A global constant
 E.g., "unknown", a new class?!
 The attribute mean
 The attribute mean for all samples belonging to the same class (smarter)
 The most probable value
 Inference-based such as Bayesian formula or decision tree

Noisy Data 3-9
 Noise:
 Random error or variance in a measured variable
 Stored value a little bit off the real value, up or down
 Leads to (slightly) incorrect attribute values
 May be due to:
 Faulty data-collection instruments
 Data-entry problems
 Data-transmission problems
 Technology limitation
 Inconsistency in naming conventions

How to Handle Noisy Data? 3 - 10
 Binning
 First sort data and partition into (equal-frequency) bins
 Then smooth by bin mean, by bin median, or by bin boundaries
 Regression
 Smooth by fitting the data to regression functions
 Clustering
 Detect and remove outliers
 Combined computer and human inspection
 Detect suspicious values and check by human
 E.g., deal with possible outliers

Data Cleaning as a Process 3 - 11
 Data-discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule, and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-
check) to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationships to
detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data-migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., the Potter's Wheel tool)
Chapter 3: Data Preprocessing 3 - 12

 Data Quality
 Data Cleaning
 Data Reduction
 Summary

Data Integration 3 - 13
 Data integration
 Combine data from multiple sources into a coherent store
 Schema integration
 E.g., A.cust-id  B.cust-#
 Integrate metadata from different sources
 Entity-identification problem:
 Identify the same real-world entities from multiple data sources
 E.g., Bill Clinton = William Clinton
 Detecting and resolving data-value conflicts
 For the same real world entity, attribute values from different sources are
different.
 Possible reasons:
 Different representations (coding)
 Different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration 3 - 14
 Redundant data often occur when integrating multiple databases.

 Object (entity) identification:
 The same attribute or object may have different names in different
databases.
 Derivable data:
 One attribute may be a "derived" attribute in another table.
 E.g., annual revenue
 Redundant attributes
 Can be detected by correlation analysis and covariance analysis
 Careful integration of the data from multiple sources

 Helps to reduce/avoid redundancies and inconsistencies
and improve mining speed and quality

Correlation Analysis for Nominal Data 3 - 15
 Two attributes:
 A has c distinct values: a1, a2, …, ac
 B has r distinct values: b1, b2, …, br
 Contingency table:
 Columns: the c values of A
 Rows: the r values of B
 Cells: counts of records with A = ai and B = bj
 Expected count in cell (i, j):
count ( A  ai )  count ( B  b j )
eij 
n
 where n total number of records,

count(A = ai) the number of records having ai as value for A
(count(B = bj) similarly)

Correlation Analysis for Nominal Data (2) 3 - 16
 2 (chi-square) test
(Observed  Expected ) 2
 
2
Expected
 Summing over all cells of the contingency table

 No correlation (i.e. independence of attributes) yields 2 value of zero.
 The larger the 2 value, the more likely the variables are related.
 The cells that contribute the most to the 2 value
are those whose actual count is very different from the expected count.
 Correlation does not imply causality!

 E.g., # of hospitals and # of car-thefts in a city are correlated.
 Both are causally linked to the third variable: population.

Chi-Square Calculation: An Example 3 - 17
Play chess Not play chess Sum (row)

Like science fiction 250 (90) 200 (360) 450
Not like science fiction 50 (210) 1000 (840) 1050
Sum (col.) 300 1200 1500
 Numbers in parenthesis are expected counts calculated based on the data

distribution in the two categories.
 2 (chi-square) calculation
(250  90) 2 (50  210 ) 2 (200  360 ) 2 (1000  840 ) 2
 2
    507 .93
90 210 360 840
 It shows that "like science fiction" and "play chess" are correlated in the
group.

Correlation Analysis for Numerical Data 3 - 18
 Correlation coefficient
 Also called Pearson's product-moment coefficient
i1 (ai  A)(bi  B) 

n n
(ai bi )  n AB
rA, B   i 1
(n  1) A B (n  1) A B
 where n is the number of tuples, A and B are the respective means of A

and B, σA and σB are the respective standard deviations of A and B, and
Σ(aibi) is the sum of the AB cross product.
 If rA,B > 0, A and B are positively correlated
(A's values increase with B's).
 The higher, the stronger the correlation.
 rA,B = 0: independent
 rA,B < 0: negatively correlated

Visually Evaluating Correlation 3 - 19
Scatter plots
showing the
similarity
from –1 to 1

Covariance of Numerical Data 3 - 20
 Covariance is similar to correlation.

σ𝑛𝑖=1(𝑎𝑖 − 𝐴)(𝑏
ҧ 𝑖 − 𝐵)
ത
ҧ ത
𝐶𝑜𝑣 𝐴, 𝐵 = 𝐸( 𝐴 − 𝐴 𝐵 − 𝐵 ) =
𝑛
 Correlation coefficient:
𝐶𝑜𝑣(𝐴, 𝐵)
𝑟𝐴,𝐵 =
𝜎𝐴 𝜎𝐵
 where n is the number of tuples, 𝐴ҧ and 𝐵ത are the respective means or expected values of A
and B, σA and σB are the respective standard deviations of A and B.
 Positive covariance:
 If Cov(A,B) > 0, then A and B tend to be either both larger or both smaller than their
expected values.
 Negative covariance:
 If Cov(A,B) < 0 then if A is larger than its expected value, B is likely to be smaller than its
expected value, and vice versa.
 Independence:
 Cov(A,B) = 0
 But the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent. Only
under some additional assumptions (e.g., the data follow multivariate normal distributions)
does a covariance of 0 imply independence.

Covariance: An Example 3 - 21
σ𝑛𝑖=1(𝑎𝑖 − 𝐴)(𝑏
ҧ 𝑖 − 𝐵)
ത
𝐶𝑜𝑣 𝐴, 𝐵 = 𝐸( 𝐴 − 𝐴ҧ 𝐵 − 𝐵ത ) =
𝑛
 Can be simplified in computation as:
𝐶𝑜𝑣 𝐴, 𝐵 = 𝐸 𝐴 ∙ 𝐵 − 𝐴ҧ𝐵ത
 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
 If the stocks are affected by the same industry trends, will their prices rise or
fall together?
E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A,B) > 0.


 Data Quality
 Data Cleaning
 Data Reduction
 Summary

Data-Reduction Strategies 3 - 23
 Obtain a reduced representation of the data set

that is much smaller in volume
but yet produces the same (or almost the same) analytical results
 Why data reduction?
 A database/data warehouse may store terabytes of data.
 Complex data analysis may take a very long time to run on the complete data set.
 Data-reduction strategies
 Dimensionality reduction, i.e., remove unimportant attributes
 Wavelet transforms
 Principal Component Analysis (PCA)
 Attribute-subset selection, attribute creation
 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data-cube aggregation

Data Reduction 1: Dimensionality Reduction 3 - 24
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse.
 Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful.
 The possible combinations of subspaces will grow exponentially.
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality-reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

Mapping Data to a New Space 3 - 25
 Fourier transform
 Wavelet transform
Two Sine Waves Two Sine Waves + Noise Frequency

What Is Wavelet Transform? 3 - 26
 Decomposes a signal
into different frequency
subbands
 Applicable to n-dimensional
signals
 Data transformed
to preserve relative distance
between objects
at different levels of resolution
 Allow natural clusters to
become more distinguishable
 Used for image compression

Wavelet Transformation 3 - 27
 Discrete wavelet transform (DWT):

 Transforms data vector X
into numerically different vector X'
of wavelet coefficients (same length)
 Compressed approximation: Haar2 Daubechie4
 Store only a small fraction
of the strongest of the wavelet coefficients
 Similar to Discrete Fourier Transform (DFT),
but better lossy compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0's, if necessary).
 Each transform has two functions: smoothing and difference.
 Applied to pairs of data, resulting in two sets of data of length L/2
 The two functions are applied recursively, until reaching the desired length.

Wavelet Decomposition 3 - 28
 Example:
 X = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
X' = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
 Compression:
 Many small detail coefficients can be replaced by 0's,
and only the significant coefficients are retained.

Haar Wavelet Coefficients 3 - 29
Coefficient "Supports"
Hierarchical
decomposition 2.75
2.75 +
structure
(a.k.a. "error tree")
+
-1.25
-1.25 + -
+
-
0.5 0
0.5 + -
+ - + - 0 + -
0 -1 -1 0
+
- + - + - + -
0 + -
2 2 0 2 3 5 4 4
-1 + -
-1 + -
Original frequency distribution 0 + -
Why Wavelet Transform? 3 - 30
 Use hat-shaped filters

 Emphasize region where points cluster
 Suppress weaker information in their boundaries
 Effective removal of outliers
 Insensitive to noise, insensitive to input order
 Multi-resolution
 Detect arbitrary shaped clusters at different scales
 Efficient
 Complexity O(N)

Principal-Component Analysis (PCA) 3 - 31
 Find a projection
that captures the largest amount of variation in data.
 The original data are projected onto a much smaller space, resulting in
dimensionality reduction.
 We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space.
x2

x1
Principal-Component Analysis – Steps 3 - 32
 Given N data vectors from n dimensions, find k ≤ n orthogonal

vectors (principal components) that can be best used to represent
data.
1. Normalize input data: Each attribute falls within the same range.
2. Compute k orthonormal (unit) vectors, i.e., the principal components.
 Each input data (vector) is a linear combination of the k principal-
component vectors.
3. The principal components are sorted in order of decreasing "significance" or
strength.
4. Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance
 Using the strongest principal components,
it is possible to reconstruct a good approximation of the original data.
 Works only for numerical data

Attribute-Subset Selection 3 - 33
 Another way to reduce dimensionality of data

 Redundant attributes
 duplicate much or all of the information contained in other attributes
 E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
 contain no information that is useful
for the data-mining task at hand
 E.g., students' ID is often irrelevant to the task of predicting students'
GPA.

Heuristic Search in Attribute Selection 3 - 34
 There are 2d possible attribute combinations of d attributes.

 Typical heuristic attribute-selection methods:
 Best single attribute under the attribute-independence assumption:
choose by significance tests (e.g. t-test, see Chapter 6)
 Best step-wise feature selection:
 The best single attribute is picked first.
 Then next best attribute condition to the first, ...
 Step-wise attribute elimination:
 Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound:
 Use attribute elimination and backtracking

Attribute Creation (Feature Generation) 3 - 35
 Create new attributes (features)

that can capture the important information in a data set
more effectively than the original ones.
 Three general methodologies
 Attribute extraction
 Domain-specific
 Mapping data to new space (see: data reduction)
 E.g., Fourier transformation, wavelet transformation, manifold
approaches (not covered)
 Attribute construction
 Combining features (see: discriminative frequent patterns in Chapter 5)
 Data discretization

Data Reduction 2: Numerosity Reduction 3 - 36
 Reduce data volume by

choosing alternative, smaller forms of data representation
 Parametric methods (e.g., regression)
 Assume the data fits some model (e.g. a function)
 Estimate model parameters
 Store only the parameters
 Discard the data (except possible outliers)
 Ex.: Log-linear models—obtain value at a point in m-D space as the
product of appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log- 3 - 37
Linear Models
 Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
 Multiple regression
 Allows a response variable Y to be modeled as a linear function
of a multidimensional feature vector (x1, …, xn)
 Log-linear model
 Approximates discrete multidimensional probability distributions

Regression Analysis 3 - 38
y
 A collective name
for techniques for the modeling
and analysis of numerical data Y1
consisting of values of a
dependent variable (also called
response variable or
measurement) and of one or Y1' y=x+1
more independent variables
(aka. explanatory variables or
predictors)
 The parameters are estimated x
so as to give a "best fit" of the X1
data.
 Most commonly the best fit is  Used for prediction (including
evaluated by using the least- forecasting of time-series data),
squares method, but other inference, hypothesis testing,
criteria have also been used. and modeling of causal
relationships
Regression Analysis and Log-Linear Models 3 - 39
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be estimated
by using the data at hand.
 Using the least-squares criterion to the known values of Y1, Y2, …, X1, X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above.
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional space
for a set of discretized attributes,
based on a smaller subset of dimensional combinations
 Useful also for dimensionality reduction and data smoothing

Histogram Analysis 3 - 40
 Divide data into buckets

and store average (sum)
for each bucket
 Partitioning rules:
 Equal-width:
equal bucket range
40
 Equal-frequency
35
(or equal-depth)
30
25
20
15
10
5
0
10000 30000 50000 70000 90000

Clustering 3 - 41
 Partition data set into clusters based on similarity,

and store cluster representation (e.g., centroid and diameter) only
 Can be very effective if data is clustered,
but not if data is "smeared"
 Can have hierarchical clustering
and be stored in multidimensional index-tree structures
 There are many choices of clustering definitions and clustering algorithms.
 Cluster analysis will be studied in depth in Chapter 7.

Sampling 3 - 42
 Obtain a small sample s

to represent the whole data set N
 Allow a mining algorithm to run in complexity
that is potentially sub-linear to the size of the data
 Key principle: Choose a representative subset of the data
 Simple random sampling may have very poor performance
in the presence of skew.
 Develop adaptive sampling methods, e.g., stratified sampling
 Note: Sampling may not reduce database I/Os
 One page at a time

Types of Sampling 3 - 43
 Simple random sampling

 There is an equal probability of selecting any particular item.
 Sampling without repetition
 Once an object is selected, it is removed from the population.
 Sampling with repetition
 A selected object is not removed from the population.
 Stratified sampling:
 Partition the data set,
and draw samples from each partition
 Proportionally,
i.e., approximately the same percentage of the data
 Used in conjunction with skewed data

Data-Cube Aggregation 3 - 44
 The lowest level of a data cube (base cuboid)

 The aggregated data for an individual entity of interest
 E.g., a customer in a phone-calling data warehouse
 Number of calls per hour, day, or week
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to solve the task
 Queries regarding aggregated information should be answered
using the data cube, if possible.

Data Reduction 3: Data Compression 3 - 45
 String compression
 There are extensive theories and well-tuned algorithms.
 Typically lossless, but only limited manipulation is possible without expansion
 Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of signal can be reconstructed
without reconstructing the whole.
 Time sequence is not audio
 Typically short and varies slowly with time
 Dimensionality and numerosity reduction may also be considered
as forms of data compression.


 Data Quality
 Data Cleaning
 Data Reduction
 Summary

Data Transformation 3 - 47
 A function
that maps the entire set of values of a given attribute
to a new set of replacement values
s.t. each old value can be identified with one of the new values
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction: New attributes constructed from the given ones
 Aggregation: Summarization, data-cube construction
 Normalization: Scaled to fall within a smaller, specified range
 Min-max normalization
 Z-score normalization
 Normalization by decimal scaling
 Discretization: Concept-hierarchy climbing

Normalization 3 - 48
 Min-max normalization:
 to [new_minA, new_maxA]
v  minA
v'  (new _ max A  new _ minA)  new _ minA
max A  minA
 Ex. Let income range from $12,000 to $98,000 normalized to [0.0, 1.0].
Then $73,600 is mapped to 73,600  12,000
(1.0  0)  0  0.716
98,000  12,000
 Z-score normalization:
 (μ: mean, σ: standard deviation):
v  A
v' 
A
73,600  54,000
 Ex. Let μ = 54,000, σ = 16,000. Then  1.225
16,000
 Normalization by decimal scaling:
v
v' 
10 j where j is the smallest integer such that Max(|ν'|) < 1
Discretization 3 - 49
 Three types of attributes:

 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic rank
 Numerical—numbers, e.g., integer or real numbers
 Discretization:
Divide the value range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values.
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute.
 Prepare for further analysis, e.g., classification

Data-Discretization Methods 3 - 50
 Typical methods:
 All the methods can be applied recursively.
 Binning
 Unsupervised, top-down split
 Histogram analysis
 Unsupervised, top-down split
 Clustering analysis
 Unsupervised, top-down split or bottom-up merge
 Decision-tree analysis
 Supervised, top-down split
 Correlation (e.g., 2) analysis
 Unsupervised, bottom-up merge

Simple Discretization: Binning 3 - 51
 Equal-width (distance) partitioning

 Divides the range into N intervals of equal size: uniform grid
 If A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B – A) / N .
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately same
number of samples
 Good data scaling
 Managing categorical attributes can be tricky

Binning Methods for Data Smoothing 3 - 52
 Sorted data for price (in dollars):

 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Partition into equal-frequency (equal-depth) bins:
 Bin 1: 4, 8, 9, 15
 Bin 2: 21, 21, 24, 25
 Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
 Bin 1: 9, 9, 9, 9
 Bin 2: 23, 23, 23, 23
 Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
 Bin 1: 4, 4, 4, 15
 Bin 2: 21, 21, 25, 25
 Bin 3: 26, 26, 26, 34
Discretization Without Using Class Labels 3 - 53
(Binning vs. Clustering)
Data Equal interval width (binning)

Equal interval width (binning)
Equal frequency (binning) K-means clustering leads to better results

Discretization by Classification & Correlation 3 - 54
Analysis
 Classification
 E.g., decision-tree analysis
 Supervised: Class labels given for training set
 E.g., cancerous vs. benign
 Using entropy to determine split point (discretization point)
 Top-down, recursive split
 Details will be covered in Chapter 6
 Correlation analysis
 E.g., Chi-merge: 2-based discretization
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals (those having similar
distributions of classes, i.e., low 2 values) to merge
 Merge performed recursively, until a predefined stopping condition

Concept-Hierarchy Generation 3 - 55
 Concept hierarchy
 Organizes concepts (i.e., attribute values) hierarchically
 Usually associated with each dimension in a data warehouse
 Facilitates drilling and rolling in data warehouses
to view data at multiple granularity
 Concept-hierarchy formation
 Recursively reduce the data
by collecting and replacing low-level concepts
(such as numerical values for age)
by higher-level concepts
(such as youth, adult, or senior)
 Can be explicitly specified by domain experts and/or data-warehouse
designers
 Can be automatically formed for both numerical and nominal data
 For numerical data, use discretization methods shown.

Concept-Hierarchy Generation 3 - 56
for Nominal Data
 Specification of a partial/total ordering of attributes

explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values
by explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 Only street < city, not others
 Automatic generation of hierarchies (or attribute levels)
by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
 See below

Automatic Concept-Hierarchy Generation 3 - 57
 Some hierarchies can be automatically generated

based on the analysis of the number of distinct values per attribute.
 The attribute with the most distinct values
is placed at the lowest level of the hierarchy.
 Exceptions, e.g., weekday, month, quarter, year
country 15 distinct values
province_or_state 365 distinct values
city 3567 distinct values
street 674,339 distinct values


 Data Quality
 Data Cleaning
 Data Reduction
 Summary

Summary 3 - 59
 Data quality
 Accuracy, completeness, consistency, timeliness, believability, interpretability
 Data cleaning
 E.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem
 Remove redundancies
 Detect inconsistencies
 Data reduction
 Data transformation and data discretization
 Normalization
 Concept-hierarchy generation

References 3 - 60
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse

environments. Comm. of ACM, 42:73-78, 1999
A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct.
1996
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John
Wiley, 2003
J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data.
Duxbury Press, 1997
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative
data cleaning: Language, model, and algorithms. VLDB'01
M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach.
KDD'07
H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of
the Technical Committee on Data Engineering, 20(4), Dec. 1997

References (2) 3 - 61
H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A

Data Mining Perspective. Kluwer Academic, 1998
J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data
Cleaning and Transformation, VLDB'01
T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality
research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995

Chapter 4: Data Warehousing and

Online Analytical Processing

Chapter 4: Data Warehousing and 4-2
 Data Warehouse: Basic Concepts

 Data-Warehouse Modeling: Data Cube and OLAP
 Data-Warehouse Design and Usage
 Data-Warehouse Implementation
 Data Generalization by Attribute-Oriented Induction
 Summary

KDD, 4 OLAP
What is a Data Warehouse? 4-3
 Defined in many different ways, but not rigorously:

 A decision-support database
that is maintained separately
from the organization's operational database
 Support information processing
by providing a solid platform
of consolidated, historical data for analysis
 Famous: "A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management's decision-making process."—W. H. Inmon
 (Highlighting by me)
 Data warehousing:
 The process of constructing and using data warehouses

KDD, 4 OLAP
Data Warehouse—Subject-Oriented 4-4
 Organized around major subjects

 Such as customer, product, sales
 Focusing on the modeling and analysis of data
for decision makers
 Not on daily operations or transaction processing
 Provide a simple and concise view
around particular subject issues
 by excluding data that are not useful
in the decision-support process

KDD, 4 OLAP
Data Warehouse—Integrated 4-5
 Constructed by integrating multiple heterogeneous data sources

 Relational databases, flat files, online transaction records, …
 Data-cleaning and data-integration techniques are applied.
 Ensure consistency in naming conventions, encoding structures, attribute
measures, etc. among different data sources
 E.g., hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.
 ETL—Extraction, Transformation, Loading, see below

KDD, 4 OLAP
Data Warehouse—Time Variant 4-6
 The time horizon for a data warehouse

is significantly longer
than that of operational systems.
 Operational database:
 Current-value data
 Data warehouse:
 Provide information from a historical perspective
 E.g., past 5-10 years
 Every key structure in the data warehouse
contains an element of time, explicitly or implicitly.
 But the key of operational data may or may not contain a "time element."

KDD, 4 OLAP
Data Warehouse—Nonvolatile 4-7
 A physically separate store of data

 Transformed from the operational environment
 By copying
 No operational update of data
 Hence, does not require transaction processing
 I.e., no logging, recovery, concurrency control, etc.
 Requires only three operations:
 Initial loading of data
 Refresh (update, often periodically, e.g., over night)
 Access of data

KDD, 4 OLAP
OLTP vs. OLAP 4-8
OLTP OLAP
users clerk, IT professional knowledge worker
function day-to-day operations decision support
DB design application-oriented subject-oriented
current, up-to-date; historical;
data detailed, flat relational; summarized, multidimensional;
isolated integrated, consolidated
usage repetitive ad-hoc
read/write;
access lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
#records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

KDD, 4 OLAP
Why a Separate Data Warehouse? 4-9
 High performance for both systems

 DBMS—tuned for OLTP:
 Access methods, indexing, concurrency control, recovery
 Warehouse—tuned for OLAP:
 Complex OLAP queries, multidimensional view, consolidation
 Different functions and different data:
 Missing data:
 Decision support (DS) requires historical data
which operational DBs do not typically maintain.
 Data consolidation:
 DS requires consolidation (aggregation, summarization) of data
from heterogeneous sources.
 Data quality:
 Different sources typically use inconsistent data representations, codes and
formats which have to be reconciled.
 Note: There are more and more systems which perform OLAP analysis
directly on relational databases.
KDD, 4 OLAP
Data Warehouse: A Multi-Tiered Architecture 4 - 10
Monitor
& OLAP Server
Other Metadata
sources Integrator
Operational Extract Analysis

DBs Transform Data Serve Query
Load Warehouse Reports
Refresh
Data mining
Data Marts
Data Sources Data Storage OLAP Engine Front-End Tools

KDD, 4 OLAP
Three Data-Warehouse Models 4 - 11
 Enterprise Warehouse
 Collects all of the information about subjects
spanning the entire organization
 Data Mart
 A subset of corporate-wide data
that is of value to a specific group of users.
 Its scope is confined to specific, selected groups,
such as marketing data mart.
 Independent vs. dependent (directly from warehouse) data mart
 Virtual Warehouse
 A set of views over operational databases
 Only some of the possible summary views may be materialized.

KDD, 4 OLAP
Extraction, Transformation, and Loading (ETL) 4 - 12
 Extraction
 Get data from multiple, heterogeneous, and external sources
 Cleaning
 Detect errors in the data and rectify them if possible
 Transformation
 Convert data from legacy or host format to warehouse format
 Loading
 Sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
 Refresh
 Propagate only the updates from the data sources to the warehouse

KDD, 4 OLAP
Metadata Repository 4 - 13
 Metadata: the data defining data-warehouse objects

 Description of the structure of the data warehouse:
 Schema, view, dimensions, hierarchies, derived-data definition, data-mart
locations and contents
 Operational metadata:
 Data lineage (history of migrated data and transformation path)
 Currency of data (active, archived, or purged)
 Monitoring information (warehouse-usage statistics, error reports, audit trails)
 Algorithms used for summarization
 Mapping from operational environment to data warehouse
 Data related to system performance:
 Warehouse schema, view and derived-data definitions
 Business data:
 Business terms and definitions, ownership of data, charging policies
KDD, 4 OLAP
Chapter 4: Data Warehousing and 4 - 14

 Summary

KDD, 4 OLAP
From Tables and Spreadsheets to Data Cubes 4 - 15
 Data warehouse
 Based on a multidimensional data model
which views data in the form of a data cube
 Data cube
 Allows data (here: sales) to be modeled and viewed in multiple dimensions
 Dimension tables
 Such as: item (item_name, brand, type),
or: time (day, week, month, quarter, year)
 Fact table
 Contains measures (such as dollars_sold)
and references (foreign keys) to each of the related dimension tables
 n-D base cube
 Called a base cuboid in data-warehousing literature
 Top most 0-D cuboid
 Holds the highest-level of summarization
 Called the apex cuboid
 Lattice of cuboids
 Forms a data cube

KDD, 4 OLAP
Cube: A Lattice of Cuboids 4 - 16
all
0-D (apex) cuboid
time item location supplier

1-D cuboids
time, location item, location location, supplier

2-D cuboids
time, item time, supplier item, supplier
time, item, location time, location, supplier

3-D cuboids
time, item, supplier item, location, supplier
4-D (base) cuboid

time, item, location, supplier
KDD, 4 OLAP
Conceptual Modeling of Data Warehouses 4 - 17
 Modeling data warehouses: dimensions & measures

 Star schema:
 A fact table in the middle
connected to a set of dimension tables
 Snowflake schema:
 A refinement of the star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables,
forming a shape similar to a snowflake
 Fact constellations:
 Multiple fact tables sharing dimension tables,
viewed as a collection of stars,
therefore called galaxy schema or fact constellation

KDD, 4 OLAP
Example of Star Schema 4 - 18
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key
type
year item_key supplier_type
branch_key
branch location_key location
branch_key units_sold location_key
branch_name dollars_sold street
branch_type city
avg_sales
province_or_street
country
Measures
KDD, 4 OLAP
Example of Snowflake Schema 4 - 19
time
time_key item supplier
day item_key
day_of_the_week supplier_key
Sales Fact Table item_name
month supplier_type
brand
quarter time_key
supplier_key
year item_key
branch_key
branch location_key location
branch_key units_sold location_key
branch_type city_key
avg_sales city
city_key
city
Measures province_or_street
Summer Semester 2018 country
KDD, 4 OLAP
Example of Fact Constellation 4 - 20
time Shipping
time_key item Fact Table
day item_key
day_of_the_week time_key
Sales Fact Table item_name
month brand item_key
quarter time_key
type shipper_key
year item_key supplier_type from_location
branch_key
to_location
branch location_key location dollars_cost
branch_key units_sold location_key units_shipped
branch_type city
avg_sales
province_or_street shipper
country
shipper_key
shipper_name
Measures location_key
Summer Semester 2018 shipper_type
KDD, 4 OLAP
A Concept Hierarchy: Dimension (Location) 4 - 21
all all
Europe ... North_America

region
Germany ... Spain Canada ... Mexico

country
... Vancouver ...

city Frankfurt Toronto
...
office L. Chan M. Wind
KDD, 4 OLAP
Data-Cube Measures: Three Categories 4 - 22
 Distributive:
 If the result derived by applying the function to the n aggregate values
obtained for n partitions of the dataset
is the same as that derived by applying the function
on all the data without partitioning
 E.g., count(), sum(), min(), max()
 Algebraic:
 If it can be computed by an algebraic function with M arguments
(where M is a bounded integer),
each of which is obtained by applying a distributive aggregate function
 E.g., avg(), min_N(), standard_deviation()
 Holistic:
 If there is no constant bound on the storage size needed to describe a
subaggregate
 E.g., median(), mode(), rank()

KDD, 4 OLAP
Aggregation Type 4 - 23
 Non-trivial Property
 Next to name and value range
 Defines the set of aggregation operations
that can be executed on a measure (a fact)
 FLOW
 Any aggregation
 E.g., sales, turnover
 STOCK
 No temporal aggregation
 E.g., stock, inventory
 VPU (Value per Unit)
 No summarization
 E.g., price, tax, in general factors
 (Always applicable: minimum, maximum, and average.)
KDD, 4 OLAP
View of Warehouses and Hierarchies 4 - 24
 Specification
of hierarchies
 Schema hierarchy
 day <
{month < quarter; week} <
year
 Set-grouping
hierarchy
KDD, 4 OLAP
 {1..10} < inexpensive
Multidimensional Data 4 - 25
 Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time

• Hierarchical summarization paths
Industry Region Year
Category Country Quarter
Product City Month Week

Product
Office Day
Month
KDD, 4 OLAP
A Sample Data Cube 4 - 26
Date Total annual sales

1Qtr 2Qtr of TVs in U.S.A.
3Qtr 4Qtr sum
TV
PC U.S.A
VCR
Country
sum
Canada
Mexico
sum

KDD, 4 OLAP
Cuboids Corresponding to the Cube 4 - 27
all
0-D (apex) cuboid
product date country
1-D cuboids
product, date product, country date, country

2-D cuboids
3-D (base) cuboid

product, date, country

KDD, 4 OLAP
Typical OLAP Operations 4 - 28
 Roll up (drill up): summarize data

 By climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll up
 From higher-level summary to lower-level summary or detailed data,
or introducing new dimensions
 Slice and dice: project and select
 Pivot (rotate):
 Reorient the cube, visualization, 3D to series of 2D planes
 Other operations:
 Drill across: involving (across) more than one fact table
 Drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)

KDD, 4 OLAP
Typical OLAP Operations
A Star-Net Query Model 4 - 30
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location
Promotion Each circle is Organization
KDD, 4 OLAP
called a footprint.
Browsing a Data Cube 4 - 31
 Visualization
 OLAP capabilities
 Interactive manipulation
KDD, 4 OLAP

 Summary

KDD, 4 OLAP
Design of Data Warehouse: 4 - 33
A Business-Analysis Framework
 Four views regarding the design of a data warehouse

 Top-down view
 Allows selection of the relevant information
necessary for the data warehouse
 Data-source view
 Exposes the information
being captured, stored, and managed
by operational systems
 Data-warehouse view
 Consists of fact tables and dimension tables
 Business-query view
 Sees the perspectives of data in the warehouse
from the view of the end-user

KDD, 4 OLAP
Data-Warehouse Design Process 4 - 34
 Top-down, bottom-up approaches or a combination of both

 Top-down: starts with overall design and planning (mature)
 Bottom-up: starts with experiments and prototypes (rapid)
 From software-engineering point of view
 Waterfall: structured and systematic analysis at each step
before proceeding to the next
 Spiral: rapid generation of increasingly functional systems,
short turn-around time, quick turn-around
 Typical data-warehouse design process
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact-table record
 Choose the measure that will populate each fact-table record

KDD, 4 OLAP
Data-Warehouse Development: 4 - 35
A Recommended Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts
Enterprise
Data Data
Data
Mart Mart
Warehouse
Model
refinement Model refinement
Define a high-level corporate data model

KDD, 4 OLAP
Data-Warehouse Usage 4 - 36
 Three kinds of data-warehouse applications

 Information processing
 Supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
 Analytical processing
 Multidimensional analysis of data warehouse data
 Supports basic OLAP operations, slice-dice, drilling, pivoting
 Data mining
 Knowledge discovery from hidden patterns
 Supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools

KDD, 4 OLAP
From Online Analytical Processing (OLAP) 4 - 37
to Online Analytical Mining (OLAM)
 Why online analytical mining?

 High quality of data in data warehouses
 DW contains integrated, consistent, cleaned data
 Available information-processing structure surrounding data warehouses
 ODBC, OLEDB, Web access, service facilities, reporting, and OLAP tools
 OLAP-based exploratory data analysis
 Mining with drilling, dicing, pivoting, etc.
 Online selection of data-mining functions
 Integration and swapping
of multiple mining functions, algorithms, and tasks

KDD, 4 OLAP

 Summary

KDD, 4 OLAP
Efficient Data-Cube Computation 4 - 39
 Data cube can be viewed as a lattice of cuboids

 The bottom-most cuboid is the base cuboid.
 The top-most cuboid (apex) contains only one cell.
 How many cuboids in an n-dimensional cube with Li levels associated with
dimension i?
n
T   ( Li  1)
i 1
 Materialization of data cube

 Materialize
each (cuboid) (full materialization),
none (no materialization), or
some (partial materialization)
 Selection of cuboids to materialize
 Based on size, sharing, access frequency, etc.

KDD, 4 OLAP
The "Compute Cube" Operator 4 - 40
 Cube definition and computation in DMQL

DEFINE CUBE sales [item, city, year]:
SUM (sales_in_dollars);
COMPUTE CUBE sales;
 Transform it into an SQL-like language
 with a new operator CUBE BY (introduced by Gray et al. '96)()
SELECT item, city, year, SUM (amount)
FROM sales
CUBE BY item, city, year; (city) (item) (year)
 Need to compute the following Group-Bys
(date, product, customer),
(date, product), (date, customer), (city, item) (city, year) (item, year)
(product, customer),
(date), (product), (customer)
() (city, item, year)
KDD, 4 OLAP
Indexing OLAP Data: Bitmap Index 4 - 41
 Index on a particular column

 Each value in the column has a bit vector: bit-op is fast
 Length of bit vector: # of records in base table
 i-th bit set, if i-th row of base table has value of bit vector
 Not suitable for high-cardinality domains
 A bit compression technique called Word-Aligned Hybrid (WAH) makes it
work for high-cardinality domain as well [Wu et al., TODS'06].
Base table Index on Region Index on Type
Cust Region Type RecIDAsia Europe America RecID Retail Dealer
C1 Asia Retail 1 1 0 0 1 1 0
C2 Europe Dealer 2 0 1 0 2 0 1
C3 Asia Dealer 3 1 0 0 3 0 1
C4 America Retail 4 0 0 1 4 1 0
C5 Europe Dealer 5 0 1 0 5 0 1
KDD, 4 OLAP
Indexing OLAP Data: Join Indices 4 - 42
 Join index: JI(R-id, S-id)

where R (R-id, …)  S (S-id, …)
 Traditional indices map the values
to a list of record ids.
 It materializes relational join in JI file
and speeds up relational join.
 In data warehouses,
join index relates the values
of the dimensions of a star schema
to rows in the fact table.
 E.g. fact table: Sales and
two dimensions location and item
 A join index on location
maintains for each distinct location
a list of R-ids of the tuples
recording the Sales in that location
 Join indices can span multiple dimensions.

KDD, 4 OLAP
Efficient Processing of OLAP Queries 4 - 43
 Determine which operations should be performed on the available

cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP operations
 E.g., dice = selection + projection
 Determine which materialized cuboid(s) should be selected for
OLAP operation
 Let the query to be processed be on {brand, province_or_state} with the
condition "year = 2004", and there are 4 materialized cuboids available:
 1) {year, item_name, city}
 2) {year, brand, country}
 3) {year, brand, province_or_state}
 4) {item_name, province_or_state} where year = 2004
 Which should be selected to process the query?
 Explore indexing structures and compressed vs. dense-array
structures in MOLAP
KDD, 4 OLAP
OLAP Server Architectures 4 - 44
 Relational OLAP (ROLAP)

 Use relational or extended-relational DBMS
to store and manage warehouse data
and OLAP middleware
 Include optimization of DBMS backend,
implementation of aggregation navigation logic,
and additional tools and services
 Greater scalability
 Multidimensional OLAP (MOLAP)
 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQL-Server)
 Flexibility, e.g., low level: relational, high-level: array
 Specialized SQL servers (e.g., Redbricks)
 Specialized support for SQL queries over star/snowflake schemas
KDD, 4 OLAP

 Summary

KDD, 4 OLAP
Data Generalization 4 - 46
 Summarize data
 By replacing relatively low-level values
 E.g., numerical values for the attribute age
 with higher-level concepts
 E.g., young, middle-aged, and senior
 By reducing the number of dimensions

 E.g., removing birth_date and telephone_number
when summarizing the behavior of a group of students
 Describe concepts in concise and succinct terms

at generalized (rather than low) levels of abstractions
 Facilitates users in examining the general behavior of the data
 Makes dimensions of a data cube easier to grasp

KDD, 4 OLAP
Attribute-Oriented Induction 4 - 47
 Proposed in 1989
 KDD'89 workshop
 Not confined to categorical data nor to particular measures
 How is it done?
 Collect the task-relevant data (initial relation)
using a relational database query
 Perform generalization
by attribute removal or attribute generalization
 Apply aggregation
by merging identical, generalized tuples
and accumulating their respective counts
 Interaction with users
for knowledge presentation

KDD, 4 OLAP
Attribute-Oriented Induction: An Example 4 - 48
 Example:
 Describe general characteristics of graduate students
in a University database
 Step 1:
 Fetch relevant set of data using an SQL statement, e.g.,
SELECT name, gender, major, birth_place, birth_date,
residence, phone#, gpa)
FROM student
WHERE student_status IN {"Msc", "MBA", "PhD"};
 Step 2:
 Perform attribute-oriented induction
 Step 3:
 Present results in generalized-relation, cross-tab, or rule forms

KDD, 4 OLAP
Class Characterization: An Example 4 - 49
Name Gender Major Birth_Place Birth_Date Residence Phone# GPA

Jim M CS Vancouver, 08-12-76 3511 Main 687-4598 3.67
Initial Woodman BC, Canada St.,
Relation Richmond
Scott M CS Montreal, 28-07-75 345 1st 253-9106 3.70
Lachance Que, Canada Ave.,
Richmond
Laura Lee F Physics Seattle, WA, 25-08-70 125 Austin 420-5232 3.83
USA Ave.,
Burnaby
Removed Retained Sci, Country Age range City Removed Excl,
Eng, VG,..
Bus
Prime Gender Major Birth_region Age_range Residence GPA Count

Generalized M Science Canada 20-25 Richmond Very good 16
Relation F Science Foreign 25-30 Burnaby Excellent 22
… … … … … … …

KDD, 4 OLAP
Class Characterization: An Example (2) 4 - 50
Birth_Region
Cross-table
Gender Canada Foreign Total
M 16 14 30
F 10 22 32
Total 26 36 62

KDD, 4 OLAP
Basic Principles of Attribute-Oriented Induction 4 - 51
 Data focusing:
 Task-relevant data, including dimensions
 The result is the initial relation.
 Attribute removal:
 Remove attribute A, if there is a large set of distinct values for A
but (1) there is no generalization operator on A,
or (2) A's higher-level concepts are expressed in terms of other attributes.
 Attribute generalization:
 If there is a large set of distinct values for A,
and there exists a set of generalization operators on A,
then select an operator and generalize A.
 Attribute-threshold control:
 Typical 2-8, specified/default
 Generalized-relation-threshold control:
 Control the final relation/rule size
KDD, 4 OLAP
Attribute-Oriented Induction: Basic Algorithm 4 - 52
 InitialRel:
 Query processing of task-relevant data, deriving the initial relation
 PreGen:
 Based on the analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute:
removal? or how high to generalize?
 PrimeGen:
 Based on the PreGen plan,
perform generalization to the right level
to derive a "prime generalized relation",
accumulating the counts.
 Presentation:
 User interaction:
(1) adjust levels by drilling,
(2) pivoting,
(3) mapping into rules, cross tabs, visualization presentations.
KDD, 4 OLAP
Presentation of Generalized Results 4 - 53
 Generalized relation:
 Relations where some or all attributes are generalized,
with counts or other aggregation values accumulated
 Cross tabulation:
 Mapping results into cross-tabulation form (similar to contingency tables)
 Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms
 Quantitative characteristic rules:
 Mapping generalized result into characteristic rules
with quantitative information associated with it, e.g.,
grad ( x)  male( x) 
birth _ region( x) "Canada"[t :53%] birth _ region( x) " foreign"[t : 47%].

KDD, 4 OLAP
Mining-Class Comparisons 4 - 54
 Comparison: Comparing two or more classes

 Method:
 Partition the set of relevant data
into the target class and the contrasting class(es)
 Generalize both classes to the same high-level concepts (i.e. AOI)
 Including aggregation
 Compare tuples with the same high-level concepts
 Present for each tuple its description and two measures
 Support - distribution within single class (counts, percentage)
 Comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different classes

KDD, 4 OLAP
Concept Description vs. Cube-Based OLAP 4 - 55
 Similarity:
 Data generalization
 Presentation of data summarization at multiple levels of abstraction
 Interactive drilling, pivoting, slicing and dicing
 Differences:
 OLAP has systematic preprocessing, query independent, and can drill down
to rather low level
 AOI has automated desired-level allocation,
and may perform dimension-relevance analysis/ranking
when there are many relevant dimensions
 AOI works on data which are not in relational forms

KDD, 4 OLAP

 Data Warehouse Modeling: Data Cube and OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Summary

KDD, 4 OLAP
Summary 4 - 57
 Data Warehousing: Multi-dimensional model of data

 A data cube consists of dimensions & measures.
 Star schema, snowflake schema, fact constellations
 OLAP operations: drilling, rolling, slicing, dicing, and pivoting
 Data-Warehouse Architecture, Design, and Usage
 Multi-tiered architecture
 Business-analysis design framework
 Information processing, analytical processing, data mining, OLAM (Online
Analytical Mining)
 Implementation: Efficient computation of data cubes
 Partial vs. full vs. no materialization
 Indexing OALP data: Bitmap index and join index
 OLAP query processing
 OLAP servers: ROLAP, MOLAP, HOLAP
 Data generalization: Attribute-oriented induction

KDD, 4 OLAP
References 4 - 58
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R.

Ramakrishnan, and S. Sarawagi. On the computation of multidimensional
aggregates. VLDB'96
D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in
data warehouses. SIGMOD'97
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases.
ICDE'97
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. ACM SIGMOD Record, 26:65-74, 1997
E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer
World, 27, July 1993
J. Gray, et al. Data cube: A relational aggregation operator generalizing group-
by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54,
1997
A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations,
and Applications. MIT Press, 1999
KDD, 4 OLAP
J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD

Record, 27:97-107, 1998
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes
efficiently. SIGMOD'96
J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design:
Relational and Dimensional Techniques. John Wiley, 2003
W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling. 2ed. John Wiley, 2002
P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. ACM
SIGMOD Record, 24:8–11, Sept. 1995
P. O'Neil and D. Quass. Improved query performance with variant indexes.
SIGMOD'97
Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
KDD, 4 OLAP
S. Sarawagi and M. Stonebraker. Efficient organization of large

multidimensional arrays. ICDE'94
A. Shoshani. OLAP and statistical databases: Similarities and differences.
PODS'00
D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with
aggregation using views. VLDB'96
P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987
J. Widom. Research problems in data warehousing. CIKM'95
K. Wu, E. Otoo, and A. Shoshani. Optimal bitmap indices with efficient
compression. ACM Trans. on Database Systems, 31:1-38, 2006

KDD, 4 OLAP
Chapter 5: Mining Frequent Patterns,

Associations and Correlations

Chapter 5: Mining Frequent Patterns, Associations, 5-2
and Correlations
 Basic Concepts
 Scalable Frequent-Itemset-Mining Methods
 Apriori: A Candidate-Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent-Pattern-Growth Approach
 ECLAT: Frequent-Pattern Mining with Vertical Data Format
 Mining Closed Itemsets and Max-Itemsets
 Generating Association Rules from Frequent Itemsets
 Which Patterns Are Interesting?—Pattern-Evaluation Methods
 Summary

KDD, 5 Frequent Patterns
What Is Frequent-Pattern Analysis? 5-3
 Frequent pattern:
 A pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a dataset
 First proposed by (Agrawal, Imielinski & Swami, AIS'93)
in the context of frequent itemsets and association-rule mining
 Motivation: Finding inherent regularities in data
 What products are often purchased together?—Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 "Who bought this has often also bought … "
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify Web documents?
 Applications:
 Basket-data analysis, cross-marketing, catalog design, sale-campaign
analysis, Web-log (click-stream) analysis, and DNA-sequence analysis

Why Is Frequent-Pattern Mining Important? 5-4
 A frequent pattern is an intrinsic and important property

of a dataset.
 Foundation for many essential data-mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: discriminative, frequent-pattern analysis
 Cluster analysis: frequent-pattern-based clustering
 Data warehousing: iceberg cube and cube gradient
 Semantic data compression: fascicles (Jagadish, Madar, and Ng, VLDB'99)
 Broad applications

An Example 5-5
• from: Martin Lindstrom: Brandwashed. Random House, 2011

• "It is by crunching these numbers that the data-mining industry has
uncovered some even more surprising factoids: Did you know, for example,
that at Walmart a shopper who buys a Barbie doll is 60 percent more likely to
purchase one of three types of candy bars? Or that toothpaste is most often
bought alongside canned tuna? Or that a customer who buys a lot of meat is
likely to spend more money in a health-food store than a non-meat-eater? Or
what about the data revealed to one Canadian grocery chain that customers
who bought coconuts also tended to buy prepaid calling cards? At first, no
one in store management could figure out what was going on. What could
coconuts possibly have to do with calling cards? Finally it occurred to them
that the store served a huge population of shoppers from the Caribbean
islands and Asia, both of whose cuisines use coconuts in their cooking. Now
it made perfect sense that these Caribbean and Asian shoppers were buying
prepaid calling cards to check in with their extended families back home."

Basic Concepts: Frequent Itemsets 5-6
Tid Items bought  Itemset:

10 Beer, Nuts, Diapers  A set of one or more items
20 Beer, Coffee, Diapers  k-itemset X = {x1, …, xk}
30 Beer, Diapers, Eggs  (Absolute) Support,
40 Nuts, Eggs, Milk or support count of X:
50 Nuts, Coffee, Diapers, Eggs, Milk  Frequency or occurrence of X
 (Relative) Support s:
Customer Customer
 The fraction of the transactions
buys both buys
that contain X
diapers
 I.e., the probability that a
transaction contains X
 An itemset X is frequent,
if X's support is no less than a
Customer min_sup threshold.
buys beer
Basic Concepts: Association Rules 5-7
Tid Items bought  Find all the rules X  Y with

minimum support and
10 Beer, Nuts, Diapers confidence
20 Beer, Coffee, Diapers  Support s: probability that a
transaction contains X  Y
30 Beer, Diapers, Eggs  Confidence c: conditional
40 Nuts, Eggs, Milk probability that a transaction
having X also contains Y
50 Nuts, Coffee, Diapers, Eggs, Milk  Example:
 Let min_sup = 50% and
Customer Customer min_conf = 50%
buys both buys  Frequent itemsets:
diapers Beer:3, Nuts:3, Diapers:4, Eggs:3,
{Beer, Diapers}:3
 Association rules:
(many more!)
 Beer  Diapers (60%, 100%)
 Diapers  Beer (60%, 75%)
Customer
buys beer
Basic Concepts: Association Rules (2) 5-8
 Implication of the form A  B

 Where A  Ø, B  Ø, and A  B = Ø
 Strong rule:
 Satifies both min_sup and min_conf
support(A  B) = P(A  B)
confidence(A  B) = P(B|A)
= support(A  B) / support(A)
= support_count(A  B) / support_count(A)
 I.e., confidence of rule can be easily derived

from the support counts of A and A  B
 Association-rule mining:
 Find all frequent itemsets
 Generate strong association rules from the frequent itemsets

Closed Itemsets and Max-Itemsets 5-9
 A long itemset contains a combinatorial number of sub-itemsets.

 E.g., {a1, …, a100} contains
100  100  100  100
      ...     2  1  1.27 10 30 sub-itemsets!
 1   2  100 
 Solution:
 Mine closed itemsets and max-itemsets instead
 An itemset X is closed, if X is frequent
and there exists no super-itemset Y  X with the same support as X.
 Proposed by (Pasquier et al., ICDT'99)
 An itemset X is a max-itemset, if X is frequent
and there exists no frequent super-itemset Y  X.
 Proposed by (Bayardo, SIGMOD'98)
 Closed itemset is a lossless compression of frequent itemsets.
 Reducing the number of itemsets (and rules)

Closed Itemsets and Max-Itemsets (2) 5 - 10
 Example:
 DB = {<a1, …, a100>, < a1, …, a50>}
 I.e., just two transactions
 min_sup = 1
 What are the closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 Number behind the colon: support_count
 What are the max-itemsets?
 <a1, …, a100>: 1
 What is the set of all frequent itemsets?
 !!

Chapter 5: Mining Frequent Patterns, Associations 5 - 11
and Correlations
 Basic Concepts
 Summary

The Downward-Closure Property 5 - 12
and Scalable Mining Methods
 The downward-closure property of frequent patterns:

 Any subset of a frequent itemset must also be frequent.
 If {Beer, Diapers, Nuts} is frequent, so is {Beer, Diapers}.
 I.e., every transaction having {Beer, Diapers, Nuts}
also contains {Beer, Diapers}.
 Scalable mining methods: three major approaches
 Apriori
 (Agrawal & Srikant, VLDB'94)
 Frequent-pattern growth (FPgrowth)
 (Han, Pei & Yin, SIGMOD'00)
 Vertical-data-format approach (CHARM)
 (Zaki & Hsiao, SDM'02)

Apriori: A Candidate Generation & Test Approach 5 - 13
 Apriori pruning principle:

 If there is any itemset which is infrequent,
its supersets should not be generated/tested!
(Agrawal & Srikant, VLDB'94; Mannila et al., KDD'94)
 Method:
 Initially, scan DB once
to get frequent 1-itemsets.
 Generate length-(k+1) candidate itemsets
from length-k frequent itemsets.
 Test the candidates against DB,
discard those that are infrequent.
 Terminate when no further candidate or frequent itemset can be generated.

The Apriori Algorithm—An Example 5 - 14
min_sup = 2
Itemset sup
Itemset sup
Database TDB C1 {A} 2
Tid Items
L1 {A} 2
{B} 3
{B} 3
10 A, C, D {C} 3
1st scan {C} 3
20 B, C, E {D} 1
{E} 3
30 A, B, C, E {E} 3
40 B, E
C2 Itemset sup
C2 Itemset
{A, B} 1
L2 Itemset sup
{A, C} 2 {A, B}
{A, C} 2 {A, C}
{B, C} 2
{A, E} 1 2nd scan
{B, C} 2 {A, E}
{B, E} 3
{B, E} 3 {B, C}
{C, E} 2
{C, E} 2 {B, E}
{C, E}
C3 Itemset
{B, C, E} 3rd scan L3 Itemset sup
{B, C, E} 2
The Apriori Algorithm (Pseudo-Code) 5 - 15
Ck: candidate itemsets of size k

Lk : frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are contained in t;
Lk+1 = candidates in Ck+1 with min_sup;
end;
return k Lk;

Implementation of Apriori 5 - 16
 How to generate candidates?

 Step 1: self-joining Lk
 Step 2: pruning
 Example of candidate generation:
 L3 = {abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}

How to Count Supports of Candidates? 5 - 17
 Why is counting supports of candidates a problem?

 The total number of candidates can be huge.
 One transaction may contain many candidates.
 Method:
 Candidate itemsets are stored in a hash tree.
 Leaf node of hash tree contains a list of itemsets and counts.
 Interior node contains a hash table.
 Subset function: finds all the candidates contained in a transaction

Counting Supports of Candidates Using Hash Tree 5 - 18
Subset function
1,4,7 3,6,9 Transaction: 1 2 3 5 6
2,5,8
1+2356
13+56 234
567
145 345 356 367
136 368
357
12+356
689
124
125 159
457
458

Candidate Generation: An SQL Implementation 5 - 19
 SQL implementation of candidate generation

Suppose the items in Lk–1 are listed in order
Step 1: self-joining Lk–1
insert into Ck
select p.item1, p.item2, …, p.itemk–1, q.itemk–1
from Lk–1 p, Lk–1 q
where p.item1 = q.item1, …, p.itemk–2 = q.itemk–2, p.itemk–1 < q.itemk–1;
Step 2: pruning
forall itemsets c in Ck do
forall (k–1)-subsets s of c do
if (s is not in Lk–1) then delete c from Ck;
 Use object-relational extensions like UDFs, BLOBs, and table
functions for efficient implementation
 (Sarawagi, Thomas & Agrawal, SIGMOD'98)

and Correlations
 Basic Concepts
 Summary

Further Improvement of the Apriori Method 5 - 21
 Major computational challenges

 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction-database scans
 Shrink number of candidates
 Facilitate support counting of candidates

Hashing: Reduce the Number of Candidates 5 - 22
 A k-itemset whose corresponding hashing-bucket count is below

the threshold cannot be frequent.
 Candidates: a, b, c, d, e count itemsets
 While scanning DB for frequent 1-itemsets, 35 {ab, ad, ae}
create hash entries for 2-itemsets: 88 {bd, be, de}
 {ab, ad, ae}
. .
 {bd, be, de} . .
 … . .
 Frequent 1-itemset: a, b, d, e
 ab is not a candidate 2-itemset, 102 {yz, qs, wt}
if the sum of count of {ab, ad, ae} Hash Table
is below support threshold.
(Park, Chen & Yu, SIGMOD'95)

Partition: Scan Database Only Twice 5 - 23
 Any itemset that is potentially frequent in DB must be frequent in at

least one of the partitions of DB.
 Scan 1: partition database and find local frequent patterns
 min_supi = min_sup [%] x | DBi |
 Scan 2: consolidate global frequent patterns
(Savasere, Omiecinski & Navathe, VLDB'95)
DB1 + DB2 + + DBk = DB

sup1(i) < σDB1 sup2(i) < σDB2 supk(i) < σDBk sup(i) < σDB

Sampling for Frequent Patterns 5 - 24
 Select a sample of original database,

mine frequent patterns within sample using Apriori
 Scan database once
to verify frequent itemsets found in sample,
only borders of closure of frequent patterns are checked
 Example: check abcd instead of ab, ac, …, etc.
 Scan database again
to find missed frequent patterns
(Toivonen, VLDB'96)

Dynamic Itemset Counting: 5 - 25
Reduce Number of Scans
 Adding candidate itemsets at different points during a scan

 DB partitioned into blocks marked by start points
 New candidate itemsets can be added at any start point during a scan.
 E.g., if A and B are already found to be frequent,
AB are also counted from that starting point on
 Uses the count-so-far as the lower bound of the actual count
 If count-so-far passes minimum support,
itemset is added to frequent-itemset collection.
 Can then be used to generate even longer candidates

Dynamic Itemset Counting: 5 - 26
Reduce Number of Scans (2)
 Once both A and D are

ABCD determined frequent, the
counting of AD begins.
ABC ABD ACD BCD  Once all length-2 subsets of
BCD are determined frequent,
the counting of BCD begins.
AB AC BC AD BD CD
Transactions
1-itemsets
A B C D
Apriori: 2-itemsets
…
{}
Itemset lattice 1-itemsets
2-itemsets
(Brin, Motwani, Ullman & DIC: 3-itemsets
Tsur, SIGMOD'97)
and Correlations
 Basic Concepts
 Summary

27
Pattern-Growth Approach: Mining Frequent 5 - 28
Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach

 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 The FPGrowth Approach
(Han, Pei & Yin, SIGMOD'00)
 Depth-first search
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones
using local frequent items only
 abc is a frequent pattern.
 Get all transactions having abc, i.e., restrict DB on abc: DB|abc
 d is a local frequent item in DB|abc  abcd is a frequent pattern.
 local frequent item: counted locally, but above global min_sup

Construct FP-tree from a Transaction Database 5 - 29
TID Items bought (ordered) frequent items

100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b} min_sup = 3
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
{}
1. Scan DB once, find frequent
1-itemsets (single-item Header Table
patterns) f:4 c:1
Item frequency head
2. Sort frequent items in f 4
frequency-descending order, c 4 c:3 b:1 b:1
creating the f-list a 3
b 3
3. Scan DB again, construct m 3
a:3 p:1
FP-tree p 3
m:2 b:1
F-list = f-c-a-b-m-p
Summer Semester 2018 p:2 m:1
Partition Itemsets and Databases 5 - 30
 Frequent itemsets can be partitioned into subsets

according to f-list.
 F-list = f-c-a-b-m-p
 Patterns containing p
 The least-frequent item (at the end of the f-list, suffix)
 Patterns having m but not p
 …
 Patterns having c but not a nor b, m, p
 Pattern f
 This processing order guarantees completeness and non-
redundancy.

Find Itemsets Having Item p 5 - 31
From p's Conditional Pattern Base
 Starting at the frequent-item header table in the FP-tree

 Traverse the FP-tree by following the link of frequent item p
 Accumulate all transformed prefix paths of item p to form p's
conditional pattern base
{}
Header Table
f:4 c:1 Conditional pattern bases
Item frequency head
f 4 item cond. pattern base
c 4 c:3 b:1 b:1 c f:3
a 3
b 3 a fc:3
m 3
a:3 p:1
b fca:1, f:2, c:1
p 3
m:2 b:1 m fca:3, fcab:1
p fcam:2, cb:1
Summer Semester 2018 p:2 m:1
p's Conditional Pattern Base 5 - 32
{}
Header Table
f:2 c:1
Item frequency head
f 4
c 4 c:2 b:1 b:1
a 3
b 3
m 3
a:2 p:1
p 3
m:2 b:1
p:2 m:1
Hence, p's conditional pattern base is
fcam: 2, cb: 1;
both below min_sup

m's Conditional Pattern Base 5 - 33
{}
Header Table
f:3 c:1
Item frequency head
f 4
c 4 c:3 b:1 b:1
a 3
b 3
m 3
a:3 p:1
p 3
m:2 b:1
p:2 m:1
Hence, m's conditional pattern base is
fca: 3, fcab: 1;
fca has min_sup

b's Conditional Pattern Base 5 - 34
{}
Header Table
f:2 c:1
Item frequency head
f 4
c 4 c:1 b:1 b:1
a 3
b 3
m 3
a:1 p:1
p 3
m:2 b:1
p:2 m:1
Hence, b's conditional pattern base is
fca: 1, f: 2, c: 1;
all below min_sup

a's Conditional Pattern Base 5 - 35
{}
Header Table
f:3 c:1
Item frequency head
f 4
c 4 c:3 b:1 b:1
a 3
b 3
m 3
a:3 p:1
p 3
m:2 b:1
p:2 m:1
Hence, a's conditional pattern base is
fc: 3;
has min_sup (could have used the m-conditional
pattern base here (recursion))
From Conditional Pattern Bases 5 - 36
to Conditional FP-Trees
 For each conditional pattern base:

 Accumulate the count for each item in the base
 Construct the conditional FP-tree for the frequent items of the pattern base
(min_sup = 3)
{}
Header Table
m's conditional pattern base:
f:4 c:1 fca:3, fcab:1
Item frequency head
f 4 All frequent
c 4 c:3 b:1 b:1 patterns related to
{} m:
a 3
b 3 m,
m 3
a:3 p:1  f:3  fm, cm, am,
p 3
m:2 b:1 c:3 fcm, fam, cam,
fcam
p:2 m:1 a:3
m's conditional FP-tree
Recursion: Mining Each Conditional FP-tree 5 - 37
{}
{} Cond. pattern base of "am": (fc:3) f:3
c:3
f:3
am's conditional FP-tree
c:3 {}
Cond. pattern base of "cm": (f:3)
a:3 f:3
m's conditional FP-tree
cm's conditional FP-tree
{}
Cond. pattern base of "cam": (f:3)
f:3
cam's conditional FP-tree
A Special Case: Single Prefix Path in FP-tree 5 - 38
 Suppose a (conditional) FP-tree T has a shared single prefix-path P

 Mining can be decomposed into two parts
 Reduction of the single prefix path into one node
 Concatenation of the mining results of the two parts
{}
a1:n1
{} r1
a2:n2
a3:n3 a1:n1
 r1 = + b1:m1 C1:k1
a2:n2
b1:m1 C1:k1
a3:n3 C2:k2 C3:k3
C :k
Summer2Semester
2 C :k3
2018 3
Benefits of the FP-tree Structure 5 - 39
 Completeness
 Preserve complete information for frequent-pattern mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency-descending order:
 The more frequently occurring, the more likely to be shared
 Never larger than the original database
 Not counting node links and the count fields

The Frequent-Pattern-Growth Mining Method 5 - 40
 Idea: Frequent-pattern growth

 Recursively grow frequent patterns by pattern and database partition
 Method:
 For each frequent item, construct its conditional pattern base,
and then its conditional FP-tree
 Repeat the process on each newly created conditional FP-tree
 Until the resulting FP-tree is empty,
or it contains only one path
 Single path will generate all the combinations of its sub-paths,
each of which is a frequent pattern.

Scaling FP-Growth by Database Projection 5 - 41
 What if FP-tree does not fit in memory?

 DB projection
 First partition database into a set of projected DBs
 Then construct and mine FP-tree for each projected DB
 Parallel-projection vs. partition-projection techniques
 Parallel projection
 Project the DB in parallel for each frequent item
 Parallel projection is space costly.
 All the partitions can be processed in parallel.
 Partition projection
 Partition the DB based on the ordered frequent items
 Passing the unprocessed parts to the subsequent partitions

Partition-Based Projection 5 - 42
 Parallel projection needs a lot of disk space.

 Partition projection saves it.
Tran. DB
fcamp
fcabm
fb
cbp
fcamp
p-proj DB m-proj DB b-proj DB a-proj DB c-proj DB f-proj DB

fcam fcab f fc f …
cb fca cb … …
fcam fca …
am-proj DB cm-proj DB
fc f …
fc f
Summer Semester 2018 fc f
Performance of FP-Growth in Large Datasets 5 - 43
100
140
90 D1 FP-grow th runtime D2 FP-growth
80
D1 Apriori runtime 120 D2 TreeProjection
70 100
Runtime (sec.)
Run time(sec.)
60
80
50 Data set T25I20D10K Data set T25I20D100K
40 60
30 40
20
20
10
0 0
0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2
Support threshold(%)
Support threshold (%)
FP-Growth vs. Apriori FP-Growth vs. Tree-Projection

Advantages of the Pattern-Growth Approach 5 - 44
 Divide-and-conquer:
 Decompose both the mining task and DB according to the frequent patterns
obtained so far
 Leading to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local frequent items and building sub FP-tree, no pattern
search and matching
 A good open-source implementation and refinement of FP-Growth:
 FPGrowth+
(Grahne & Zhu, FIMI'03)

Further Improvements of Mining Methods 5 - 45
 AFOPT
(Liu et al., KDD'03)
 A "push-right" method for mining condensed frequent-pattern (CFP) tree
 Carpenter
(Pan et al., KDD'03)
 Mine datasets with small rows but numerous columns
 Construct a row-enumeration tree for efficient mining
 FPgrowth+
(Grahne & Zhu, FIMI'03)
 Efficiently using prefix-trees in mining frequent itemsets
 TD-Close
(Liu et al., SDM'06)

Extension of Pattern-Growth Mining Methodology 5 - 46
 Mining closed frequent itemsets and max-patterns

 CLOSET (DMKD'00), FPclose, and FPMax (Grahne & Zhu, FIMI'03)
 Mining sequential patterns
 PrefixSpan (ICDE'01), CloSpan (SDM'03), BIDE (ICDE'04)
 Mining graph patterns
 gSpan (ICDM'02), CloseGraph (KDD'03)
 Constraint-based mining of frequent patterns
 Convertible constraints (ICDE'01), gPrune (PAKDD'03)
 Computing iceberg data cubes with complex measures
 H-tree, H-cubing, and Star-cubing (SIGMOD'01, VLDB'03)
 Pattern-growth-based Clustering
 MaPle (Pei et al., ICDM'03)
 Pattern-growth-based Classification
 Mining frequent and discriminative patterns (Cheng et al., ICDE'07)
and Correlations
 Basic Concepts
 Summary

ECLAT: Mining by Exploring Vertical Data Format 5 - 48
 Vertical format: t(AB) = {T11, T25, … }

 Tid-list: list of transaction ids containing an itemset
 Deriving frequent itemsets based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X)  t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 ECLAT
(Zaki et al., KDD'97)
 Mining closed itemsets using vertical format: CHARM
(Zaki & Hsiao, SDM'02)

and Correlations
 Basic Concepts
 Summary

Mining Closed Itemsets: CLOSET 5 - 50
 Flist: list of all frequent items in support-ascending order

 Flist: d-a-f-e-c
 Divide search space min_sup = 2
 Itemsets having d TID Items
 Itemsets having d but not a, etc. 10 a, c, d, e, f
 Find closed itemsets recursively 20 a, b, e
 Every transaction having d also has cfa 30 c, e, f
 cfad is a closed itemset 40 a, c, d, f
50 c, e, f
(Pei, Han & Mao, DMKD'00)

CLOSET+: Mining Closed Itemsets by Pattern- 5 - 51
Growth
 Itemset merging:
 If Y appears in each occurrence of X, then Y is merged with X.
 Sub-itemset pruning:
 if Y ‫ כ‬X, and sup(X) = sup(Y), X and all of X’s descendants in the set-
enumeration tree can be pruned.
 Hybrid tree projection
 Bottom-up physical tree projection
 Top-down pseudo tree projection
 Item skipping:
 If a local frequent item has the same support in several header tables at
different levels, one can prune it from the header table at higher levels.
 Efficient subset checking

MaxMiner: Mining Max-Itemsets 5 - 52
 1st scan: find frequent items Tid Items

 A, B, C, D, E 10 A, B, C, D, E
 2nd scan: find support for 20 B, C, D, E,
 AB, AC, AD, AE, ABCDE 30 A, C, D, F
 BC, BD, BE, BCDE
 CD, CE, CDE, DE Potential
max-itemsets
 Since BCDE is a max-itemset,

no need to check BCD, BDE, CDE in later scan
(Bayardo, SIGMOD'98)

Chapter 5: Mining Frequent Patterns, Association 5 - 53
and Correlations
 Basic Concepts
 Summary

Generating Association Rules 5 - 54
from Frequent Itemsets
 Once frequent itemsets from transactions in database D found:

 Generate strong association rules from them
 Where "strong" = satisfying both minimum support and minimum
confidence
confidence(A  B) = P(B|A)
= support(A  B) / support(A)
= support_count(A  B) / support_count(A)
 For each frequent itemset l:
 Generate all nonempty subsets of l
 For every s in l:
 Output the rule "s  (l – s)", if
support_count(l) / support_count(s)  min_conf
 min_sup is satisfied, because only frequent itemsets used

Visualization of Association Rules: Plane Graph 5 - 55

Visualization of Association Rules: Rule Graph 5 - 56

Visualization of Association Rules 5 - 57
(SGI/MineSet 3.0)

Chapter 5: Mining Frequent Patterns, Association 5 - 58
and Correlations
 Basic Concepts
 Summary

Interestingness Measure: Correlation (Lift) 5 - 59
 (play) basketball  (eat) cereal (40%, 66.7%) misleading

 The overall % of students eating cereal is 75% > 66.7%.
 basketball  no cereal (20%, 33.3%) more accurate
 Although with lower support and confidence
 Reason: correlation
 Choice of one item decreases likelihood of choosing the other
 Measure of dependent/correlated events: lift
 value 1: independence; value < 1: negatively correlated
P ( A B )
lift  basketball no basketball sum (row)
P ( A) P ( B )
cereal 2000 1750 3750
2000 / 5000
lift( B, C )   0.89 no cereal 1000 250 1250
3000 / 5000 * 3750 / 5000
sum (col.) 3000 2000 5000
1000 / 5000
lift( B, C )   1.33
3000 / 5000 *1250 / 5000

Are Lift and 2 Good Measures of Correlation? 5 - 60
 Support and
confidence are
not good
to indicate
correlation.
 Over 20
interestingness
measures
have been
proposed.
(Tan, Kumar &
Sritastava,
KDD'02)
 Which are
good ones?
Null-Invariant Measures 5 - 61
 Null-transaction:
 A transaction that does not contain any of the itemsets being examined
 Can outweigh the number of individual itemsets
 A measure is null-invariant,
 if its value is free from the influence of null-transactions.
 Lift and 2 are not null-invariant.

Null-Invariant Measures (2) 5 - 62

Comparison of Interestingness Measures 5 - 63
 Null-(transaction) invariance is crucial for correlation analysis.

 5 null-invariant measures:
Milk No Milk Sum (row)

Coffee m, c ~m, c c
No Coffee m, ~c ~m, ~c ~c
Sum(col.) m ~m 
Null-transactions
Kulczynski measure
w.r.t. m and c (1927) Null-invariant

Subtle: They disagree
Analysis of DBLP Coauthor Relationships 5 - 64
 Recent DB conferences, removing balanced associations, low sup,

etc.
Advisor-advisee relation:
Kulc: high, coherence: low, cosine: middle
(Wu, Chen & Han, PKDD'07)
Which Null-Invariant Measure Is Better? 5 - 65
 Imbalance Ratio (IR):

 Measure the imbalance of two itemsets A and B in rule implications
 Kulczynski and IR together present a clear picture for all the three
datasets D4 through D6
 D4 is balanced & neutral
 D5 is imbalanced & neutral
 D6 is very imbalanced & neutral

and Correlations
 Basic Concepts
 Summary

Summary 5 - 67
 Basic concepts:
 Association rules
 Support-confidence framework
 Closed and max-itemsets
 Scalable frequent-itemset-mining methods
 Apriori
 Candidate generation & test
 Projection-based
 FPgrowth, CLOSET+, ...
 Vertical-format approach
 ECLAT, CHARM, ...
 Association rules generated from frequent itemsets
 Which patterns are interesting?
 Pattern-evaluation methods
References: Basic Concepts of Frequent-Pattern 5 - 68
Mining
(Association Rules)
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between
sets of items in large databases. SIGMOD'93
(Max-Itemset)
R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98
(Closed Itemsets)
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99
(Sequential Pattern)
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95

References: Apriori and Its Improvements 5 - 69
R. Agrawal and R. Srikant. Fast algorithms for mining association rules.

VLDB'94
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
H. Toivonen. Sampling large databases for association rules. VLDB'96
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: alternatives and implications. SIGMOD'98

References: Depth-First, Projection-Based FP 5 - 70
Mining
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for

generation of frequent itemsets. J. Parallel and Distributed Computing, 2002
G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent
Itemsets. FIMI'03
B. Goethals and M. Zaki. An introduction to workshop on frequent itemset
mining implementations. FIMI'03
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. SIGMOD'00
J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent itemsets by opportunistic
projection. KDD'02
J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-K frequent closed patterns
without minimum support. ICDM'02
J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for
mining frequent closed itemsets. KDD'03

References: Vertical Format and Row Enumeration 5 - 71
Methods
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for

discovery of association rules. DAMI'97
M. J. Zaki and C. J. Hsiao. CHARM: An efficient algorithm for closed itemset
mining. SDM'02
C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A dual-pruning
algorithm for itemsets with constraints. KDD'02
F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki. CARPENTER: Finding
closed patterns in long biological datasets. KDD'03
H. Liu, J. Han, D. Xin, and Z. Shao. Mining interesting patterns from very high
dimensional data: a top-down row enumeration approach. SDM'06

References: Mining Correlations and Interesting 5 - 72
Rules
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: generalizing

association rules to correlations. SIGMOD'97
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo.
Finding interesting rules from large sets of discovered association rules.
CIKM'94
R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of
Interest. Kluwer Academic, 2001
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for
mining causal structures. VLDB'98
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness
measure for association patterns. KDD'02
E. Omiecinski. Alternative interest measures for mining associations. TKDE'03
T. Wu, Y. Chen and J. Han. Association mining in large databases: a re-
examination of its measures. PKDD'07
T. Wu, Y. Chen, and J. Han. Re-examination of interestingness measures in
pattern mining: a unified framework. Data Mining and Knowledge Discovery,
21(3):371-397, 2010

Chapter 6: Classification

Chapter 6: Classification 6-2
 Classification: Basic Concepts

 Decision-Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy: Ensemble Methods
 Summary

KDD, 6 Classification
Supervised vs. Unsupervised Learning 6-3
 Supervised learning (classification)

 Supervision:
The training data (observations, measurements, etc.)
are accompanied by labels
indicating the class of the observations.
 New data is classified
based on a model
created from the training data.
 Unsupervised learning (clustering)
 The class labels of training data are unknown.
 Or rather, there are no training data.
 Given a set of measurements, observations, etc.,
the goal is to find classes or clusters in the data.
 See next chapter.

Prediction Problems: 6-4
Classification vs. Numerical Prediction
 Classification:
 Predicts categorical class labels (discrete, nominal)
 Constructs a model
based on the training set
and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numerical Prediction:
 Models continuous-valued functions
 I.e., predicts missing or unknown (future) values
 Typical applications of classification:
 Credit/loan approval: Will it be paid back?
 Medical diagnosis: Is a tumor cancerous or benign?
 Fraud detection: Is a transaction fraudulent or not?
 Web-page categorization: Which category is it?

Classification—A Two-Step Process 6-5
 Model construction: describing a set of predetermined classes

 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class-label attribute.
 The set of tuples used for model construction is the training set.
 The model is represented as classification rules, decision trees, or
mathematical formulae.
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model:
 The known label of test samples is compared with the result from the
model.
 Accuracy rate is the percentage of test-set samples that are correctly
classified by the model.
 Test set is independent of training set (otherwise overfitting).
 If the accuracy is acceptable, use the model to classify data tuples whose
class labels are not known.
Process (1): Model Construction 6-6
Classification
Algorithms
Training
Data
NAME RANK YEARS TENURED Classifier

Mike Assistant Prof 3 no (Model)
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
if RANK = 'Professor'
Anne Associate Prof 3 no
or YEARS > 6
then TENURED = 'yes'
Process (2): Using the Model in Prediction 6-7
Classifier
Testing Unseen Data

Data
(Jeff, Professor, 4)
NAME RANK YEARS TENURED

Tom Assistant Prof 2 no Tenured?
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes

Chapter 6: Classification 6-8

 Summary

Decision-Tree Induction: An Example 6-9
 Training dataset: buys_computer age income student credit_rating buys_computer

<=30 high no fair no
 The dataset follows
<=30 high no excellent no
an example of Quinlan's ID3 31…40 high no fair yes
(playing tennis). >40 medium no fair yes
 Resulting tree: >40
>40
low
low
yes fair
yes excellent
yes
no
31…40 low yes excellent yes
age? <=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 31..40 >40 <=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
student? yes credit rating? >40 medium no excellent no
no yes excellent fair
no yes no yes

Algorithm for Decision-Tree Induction 6 - 10
 Basic algorithm (a greedy algorithm)

 Tree is constructed in a top-down recursive divide-and-conquer manner.
 Attributes are categorical.
 If continuous-valued, they are discretized in advance.
 At start, all the training examples are at the root.
 Examples are partitioned recursively based on selected attributes.
 Test attributes are selected on the basis of a heuristic or statistical measure.
 E.g., information gain—see below
 Conditions for stopping partitioning:
 All samples for a given node belong to the same class.
 There are no remaining attributes for further partitioning.
 Majority voting is employed for classifying the leaf.
 There are no samples left (i.e. partition for particular value is empty)

Attribute-Selection Measure: 6 - 11
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain.

 Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci | / |D|; 1 ≤ i ≤ m
 Expected information (entropy) needed to classify a tuple in D:
m
Info( D)   pi log 2 ( pi )
i 1
 Information needed (after using attribute A to split D into v partitions) to

classify D: v | Dj |
InfoA ( D)    Info( D j )
j 1 |D|
 Information gained by branching on A:
Gain(A)  Info(D)  InfoA(D)

Attribute Selection: Information Gain 6 - 12
Class P: buys_computer = "yes" 5 4

Infoage ( D )  I ( 2,3)  I ( 4,0)
Class N: buys_computer = "no" 14 14
9 9 5 5 5
Info( D)  I (9,5)   log 2 ( )  log 2 ( ) 0.940  I (3,2)  0.694
14 14 14 14
14
age pi ni I(pi,ni) 5
I (2,3) means "age <= 30" has 5
<= 30 2 3 0.971 14 out of 14 samples, with 2
31 .. 40 4 0 0 yes’es and 3 no’s. Hence
> 40 3 2 0.971
Gain(age)  Info( D)  Infoage ( D)  0.246
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high
31…40 high
no excellent
no fair
no
yes
 Similarly,
>40 medium no fair yes
>40 low yes fair yes Gain(income)  0.029
>40 low yes excellent no
31…40 low yes excellent yes Gain( student)  0.151
<=30 medium no fair no
<=30 low yes fair yes Gain(credit _ rating )  0.048
<=30 medium yes excellent yes
>40 medium no excellent no
Partitioning in the Example 6 - 13
age?
<=30 31..40 >40
income student credit_rating buys_computer

high no fair no income student credit_rating buys_computer
high no excellent no medium no fair yes
medium no fair no low yes fair yes
low yes fair yes low yes excellent no
medium yes excellent yes medium yes fair yes
medium no excellent no
income student credit_rating buys_computer

high no fair yes
low yes excellent yes
medium no excellent yes
high yes fair yes
Computing Information Gain 6 - 14
for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute.

 Must determine the best split point for A
 Sort the values of A in increasing order.
 Typically, the midpoint between each pair of adjacent values is considered as
a possible split point.
 (ai + ai+1) / 2 is the midpoint between the values of ai and ai+1.
 The point with the minimum expected information requirement for A
is selected as the split-point for A.
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point,
and D2 is the set of tuples in D satisfying A > split-point.
 So to say: Discretization as you go along

 For this particular purpose

Gain Ratio for Attribute Selection (C4.5) 6 - 15
 Information-gain measure is biased towards attributes with a large

number of values.
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain).
v | Dj | | Dj |
SplitInfoA ( D)    log 2 ( )
j 1 |D| |D|
GainRatio(A) = Gain(A) / SplitInfoA(D)
 Example:
4 4 6 6 4 4
SplitInfoincome ( D)    log 2     log 2     log 2    1.557
14  14  14  14  14  14 
GainRatio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index 6 - 16
 Corrado Gini (1884 – 1965)

 Italian statistician and sociologist
 Also called Gini coefficient
 Measures statistical dispersion
 Zero expresses perfect equality where all values are the same.
 One expresses maximal inequality among values.
 Based on the Lorenz curve
 Plots the proportion of the total sum of values (y axis)
that is cumulatively assigned to the bottom x% of the population
 Line at 45 degrees thus represents perfect equality of value distribution
 Gini coefficient then is
 the ratio of the area that lies between the line of equality and the Lorenz
curve
over the total area under the line of equality

Gini Index (2) 6 - 17
Example: Distribution of incomes
Source: Wikipedia
Gini Index (CART, IBM IntelligentMiner) 6 - 18
 If a dataset D contains examples from n classes,

Gini index gini(D) is defined as
n
gini ( D)  1   p j
2
j 1
 where pj is the relative frequency of class j in D

 If a dataset D is split on A into two subsets D1 and D2, the Gini index
giniA(D) is defined as
| D1 | |D |
gini A ( D )  gini ( D1)  2 gini ( D 2)
|D| |D|
 Reduction in impurity:
gini ( A)  gini ( D)  gini A ( D)
 The attribute A that provides the smallest giniA(D) (or the largest
reduction in impurity) is chosen to split the node.
 Need to enumerate all the possible splitting points for each attribute.

Computation of Gini Index 6 - 19
 Example:
 D has 9 tuples in buys_computer = "yes" and 5 in "no"
2 2
9 5
gini ( D)  1        0.459
 14   14 
 Suppose the attribute income partitions D

into 10 in D1: {low, medium} and 4 in D2: {high}
 10   4
giniincome{low ,medium } ( D )    gini ( D1 )    gini ( D2 )
 14   14 
10   7   3   4   2 2  2 2 
2 2
 1        1       

14   10   10   14   4   4  
  
 0.443
 giniincome{high } ( D )

Computation of Gini Index (2) 6 - 20
 Example (cont.):
 giniincome  {low,high} is 0.458;
giniincome  {medium,high} is 0.450.
 Thus, split on the {low,medium} and {high} since it has the lowest gini index.
 All attributes are assumed continuous-valued.

 May need other tools,
e.g., clustering,
to get the possible split values.
 Can be modified for categorical attributes.

Comparing Attribute-Selection Measures 6 - 21
 The three measures, in general, return good results, but

 Information gain:
 Biased towards multi-valued attributes.
 Gain ratio:
 Tends to prefer unbalanced splits
in which one partition is much smaller than the others.
 Gini index:
 Biased to multi-valued attributes.
 Has difficulty when number of classes is large.
 Tends to favor tests that result in equal-sized partitions
and purity in both partitions.

Other Attribute-Selection Measures 6 - 22
 CHAID:
 A popular decision-tree algorithm, measure based on 2 test for independence
 C-SEP:
 Performs better than information gain and Gini index in certain cases
 G-statistic:
 Has a close approximation to 2 distribution
 MDL (Minimal Description Length) principle:
 I.e., the simplest solution is preferred.
 The best tree is the one that requires the fewest number of bits
to both (1) encode the tree and (2) encode the exceptions to the tree.
 Multivariate splits:
 Partitioning based on multiple variable combinations
 CART: finds multivariate splits based on a linear combination of attributes.
 Which attribute-selection measure is the best?
 Most give good results, none is significantly superior to others.
Overfitting and Tree Pruning 6 - 23
 Overfitting: An induced tree may overfit the training data.

 Too many branches, some may reflect anomalies due to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning:
 Halt tree construction early
 Do not split a node, if this would result in the goodness measure
falling below a threshold.
 Difficult to choose an appropriate threshold
 Postpruning:
 Remove branches from a "fully grown" tree
 Get a sequence of progressively pruned trees.
 Use a set of data different from the training data
to decide which is the "best pruned tree."

Enhancements to Basic Decision-Tree Induction 6 - 24
 Allow for continuous-valued attributes

 Dynamically define new discrete-valued attribute
that partitions the values of continuous-valued attribute
into a discrete set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are sparsely represented.
 This reduces fragmentation, repetition, and replication.

Classification in Large Databases 6 - 25
 Classification—a classical problem extensively studied by

statisticians and machine-learning researchers
 Scalability:
 Classifying datasets with millions of examples and hundreds of attributes with
reasonable speed
 Why is decision-tree induction popular?
 Relatively fast learning speed (compared to other classification methods)
 Convertible to simple and easy-to-understand classification rules
 Can use SQL queries for accessing databases
 Classification accuracy comparable with other methods
 RainForest (Gehrke, Ramakrishnan & Ganti, VLDB'98)
 Builds an AVC-list (attribute, value, class label)

Scalability Framework for RainForest 6 - 26
 Separates the scalability aspects

from the criteria that determine the quality of the tree
 Builds an AVC-list:
 AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X):
 Projection of training dataset onto the attribute X and class label
where counts of individual class label are aggregated.
 AVC-group (of a node n):
 Set of AVC-sets of all predictor attributes at the node n

RainForest: Training Set and Its AVC-Sets 6 - 27
Training Set
age income student credit_rating
buys_computer AVC-set on age
<=30 high no fair no buys_computer
<=30 high no excellent no age yes no
31…40 high no fair yes <=30 2 3
>40 medium no fair yes 31..40 4 0
>40 low yes fair yes >40 3 2
31…40 low yes excellent yes
<=30 low yes fair yes AVC-set on income
>40 medium yes fair yes buys_computer
<=30 medium yes excellent yes income yes no

31…40 medium no excellent yes high 2 2
31…40 high yes fair yes medium 4 2
>40 medium no excellent no low 3 1

RainForest: Training Set and Its AVC-Sets (2) 6 - 28
Training Set
age buys_computer AVC-set on student
income student credit_rating
<=30 high no fair no buys_computer
<=30 high no excellent no student yes no
31…40 high no fair yes yes 6 1
>40 medium no fair yes no 3 4
>40 low yes fair yes
31…40 low yes excellent yes AVC-set on credit_rating
buys_computer
<=30 low yes fair yes credit_
rating yes no
fair 6 2
excellent 3 3

BOAT (Bootstrapped Optimistic Algorithm for Tree 6 - 29
Construction)
 Use a statistical technique called bootstrapping

to create several smaller samples (subsets),
each fitting in memory
 See below
 Each subset is used to create a tree, resulting in several trees.
 These trees are examined and used to construct a new tree T'.
 It turns out that T' is very close to the tree that would be generated using the
whole data set together.
 Advantages:
 Requires only two scans of DB
 An incremental algorithm:
 Take insertions and deletions of training data
and update the decision tree.

Presentation of Classification Results 6 - 30

Visualization of a Decision Tree in SGI/MineSet 3.0 6 - 31

Interactive Visual Mining 6 - 32
by Perception-Based Classification (PBC)

Chapter 6: Classification 6 - 33

 Summary

Bayesian Classification: Why? 6 - 34
 A statistical classifier:
 Performs probabilistic prediction,
i.e., predicts class-membership probabilities
 Foundation: Bayes' Theorem
 Performance:
 A simple Bayesian classifier (naïve Bayesian classifier) has performance
comparable with decision tree and selected neural-network classifiers.
 Incremental:
 Each training example can incrementally increase/decrease the probability
that a hypothesis is correct—prior knowledge can be combined with
observed data.
 Standard:
 Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods
can be measured.
Bayes' Theorem: Basics 6 - 35
 Let X be a data sample ("evidence").

 Class label is unknown.
 Let H be the hypothesis that X belongs to class C.
 Classification is to determine P(H | X):
 Posteriori probability: the probability that the hypothesis holds given the
observed data sample X
 P(H):
 Prior probability: the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X):
 Probability that sample data is observed
 P(X | H):
 Likelihood: the probability of observing the sample X, given that the
hypothesis holds
 E.g., given that X buys computer, the prob. that X is 31..40, medium income

Bayes' Theorem (2) 6 - 36
 Given training data X, posteriori probability P(H | X) of a hypothesis

H follows from the Bayes' theorem:
P(X | H ) P( H )
P ( H | X)   P ( X | H )  P ( H ) / P ( X)
P ( X)
 Informally, this can be written as:

posteriori = likelihood x prior / evidence
 Predicts that X belongs to Ci
iff the probability P(Ci | X) is the highest among all the P(Ck | X)
for all the k classes
 Practical difficulty:
 Requires initial knowledge of many probabilities
 Significant computational cost

Towards Naïve Bayesian Classifier 6 - 37
 Let D be a training set of tuples and their associated class labels.

 Each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn).
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori probability.
 I.e., the maximal P(Ci | X)
 This can be derived from Bayes' theorem:
P ( X | C i ) P (C i )
P (C i | X ) 
P ( X)
 Since P(X) is constant for all classes, we must maximize only:
P ( X | Ci ) P (Ci )

Derivation of Naïve Bayes Classifier 6 - 38
 A simplifying assumption: attributes are conditionally independent.

 I.e., no dependence relation between attributes.
n
P( X | C i )   P( xk | C i )  P( x1 | C i )  P( x2 | C i )  ... P( xn | C i )
k 1
 This greatly reduces the computation cost:

Only count the class distribution.
 If Ak is categorical,
 P(xk | Ci) is the number of tuples in Ci having value xk for Ak divided by
|Ci,D| (the number of tuples of Ci in D).
 If Ak is continuous-valued,
 P(xk | Ci) is usually computed based on Gaussian distribution with a
mean μ and standard deviation σ: ( x )2
1 
g ( x,  ,  )  e 2
2
2 
 and P(xk | Ci) is:
P( xk | Ci)  g ( xk , Ci , Ci )
Naïve Bayesian Classifier: Training Dataset 6 - 39
 Classes:
 C1: buys_computer = yes age income student credit_ratingbuys_co
 C2: buys_computer = no <=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
 Data sample:
>40 medium no fair yes
 X = (age <=30,
>40 low yes fair yes
income = medium,
student = yes,
credit_rating = fair) 31…40 low yes excellent yes
<=30 low yes fair yes
Naïve Bayesian Classifier: An Example 6 - 40
 P(Ci):
 P(buys_computer = yes) = 9/14 = 0.643
 P(buys_computer = no) = 5/14= 0.357
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
 Compute P(X | Ci) for each class:
 P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
 P(age <= 30 | buys_computer = no) = 3/5 = 0.6
 P(income = medium | buys_computer = yes) = 4/9 = 0.444
 P(income = medium | buys_computer = no) = 2/5 = 0.4
 P(student = yes | buys_computer = yes) = 6/9 = 0.667
 P(student = yes | buys_computer = no) = 1/5 = 0.2
 P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
 P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
 P(X | Ci) :
 P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
 P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
 P(X | Ci) x P(Ci):
 P(X | buys_computer = yes) x P(buys_computer = yes) = 0.028
 P(X | buys_computer = no) x P(buys_computer = no) = 0.007
 Therefore, X belongs to class C1 (buys_computer = yes)
Avoiding the Zero-Probability Problem 6 - 41
 Naïve Bayesian prediction requires

each conditional probability to be non-zero.
 Otherwise, the predicted probability will be zero.
n
P( X | C i )   P( xk | C i )
k 1
 Example:
 Suppose a dataset with 1000 tuples,
income = low (0), income = medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Add 1 to each case:
 P(income = low) = 1 / 1003
 P(income = medium) = 991 / 1003
 P(income = high) = 11 / 1003
 The "corrected" probability estimates are close to their "uncorrected"
counterparts.
Naïve Bayesian Classifier: Comments 6 - 42
 Advantages:
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages:
 Assumption: class conditional independence, therefore loss of accuracy
 Practically, dependencies exist among variables.
 E.g., hospital patients:
 Profile: age, family history, etc.
 Symptoms: fever, cough, etc.
 Disease: lung cancer, diabetes, etc.
 Cannot be modeled by Naïve Bayesian Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks
 See textbook


 Summary

Using IF-THEN Rules for Classification 6 - 44
 Represent the knowledge in the form of IF-THEN rules
 E.g., IF age <= 30 AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule R: coverage and accuracy

 ncovers = # of tuples covered by R (antecedent is true)
 ncorrect = # of tuples correctly classified by R
 coverage(R) = ncovers / |D| with D training dataset
 accuracy(R) = ncorrect / ncovers

Using IF-THEN Rules for Classification (2) 6 - 45
 If more than one rule are triggered,

need conflict resolution.
 Size ordering:
 Assign the highest priority to the triggered rule
that has the "toughest" requirement
(i.e., the most attribute tests).
 Class-based ordering:
 Decreasing order of prevalence or misclassification cost per class
 Rule-based ordering (decision list):
 Rules are organized into one long priority list,
according to some measure of rule quality or by experts.

Rule Extraction from a Decision Tree 6 - 46
 Rules are easier to understand than large trees.

 Rule can be created for each path from the root to a leaf.
 Each attribute-value pair along the path forms a conjunction:
 The leaf holds the class prediction.
 Rules are mutually exclusive and exhaustive.
 Example:
 Rule extraction from our buys_computer age?
decision-tree
 IF age <=30 AND student = no <=30 31..40 >40
THEN buys_computer = no
 IF age <= 30 AND student = yes student? credit rating?
yes
THEN buys_computer = yes
 IF age 31..40 no yes excellent fair
no yes no yes
 IF age >40 AND credit_rating = excellent
THEN buys_computer = no
 IF age >40 AND credit_rating = fair

Rule Induction: Sequential Covering Method 6 - 47
 Sequential covering algorithm:

 Extracts rules directly from training data
 Typical sequential covering algorithms:
 FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially.
 Each rule for a given class Ci will cover many tuples of Ci,
but none (or few) of the tuples of other classes.
 Steps:
 Rules are learned one at a time.
 Each time a rule is learned, the tuples covered by the rule are removed.
 The process repeats on the remaining tuples unless termination condition,
e.g., when no more training examples left
or when the quality of a rule returned is below a user-specified threshold.
 Compare with decision-tree induction:
 That was learning a set of rules simultaneously.
Sequential Covering Algorithm 6 - 48
while (enough target tuples left):

generate a rule;
remove positive target tuples satisfying this rule;
Examples covered
Examples covered by rule 2
by rule 1 Examples covered
by rule 3
Positive
examples

Rule Generation 6 - 49
 To generate a rule:
while(true):
find the best predicate p (attribute = value);
if FOIL_gain(p) > threshold
then add p to current rule;
else break;
A3=1&&A1=2
A3=1&&A1=2
&&A8=5A3=1
Positive Negative
examples examples
How to Learn One Rule? 6 - 50
 Start with the most general rule possible:

 condition = empty
 Add new attributes by adopting a greedy depth-first strategy.
 Pick the one that improves the rule quality most.
 Current rule R: IF condition THEN class = c
 New rule R': IF condition' THEN class = c,
where condition' = (condition AND new attribute test)
 Pos/neg are # of positive/negative tuples covered by R.
 Rule-quality measures:
 Must consider both coverage and accuracy
 FOIL_Gain (from FOIL – First-Order Inductive Learner):
pos ' pos
FOIL _ Gain  pos '(log 2  log 2 )
pos ' neg ' pos  neg
 Favors rules that have high accuracy and cover many positive tuples

Rule Pruning 6 - 51
 Danger of overfitting
 Removing a conjunct (attribute test)
 If pruned version of rule has greater quality,
assessed on an independent set of test tuples (called "pruning set")
 FOIL uses:
pos  neg
FOIL _ Prune ( R ) 
pos  neg
 If FOIL_Prune is higher for the pruned version of R,
prune R.


 Summary

Model Evaluation and Selection 6 - 53
 Evaluation metrics:
 How can we measure accuracy?
 Other metrics to consider?
 Use test set of class-labeled tuples
instead of training set
when assessing accuracy
 Methods for estimating a classifier's accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC curves

Classifier-Evaluation Metrics: Confusion Matrix 6 - 54
 Confusion Matrix:
Actual class \ Predicted class: C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
 Example of Confusion Matrix:

Actual class \ Predicted class: buys_computer = yes buys_computer = no Total
buys_computer = yes 6954 46 7000
buys_computer = no 412 2588 3000
Total 7366 2634 10000
 Given m classes, an entry CMi,j in an m x m confusion matrix indicates # of

tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
Classifier-Evaluation Metrics: 6 - 55
Accuracy, Error Rate, Sensitivity, and Specificity
A\P C ¬C  Class-Imbalance Problem:

C TP FN P  One class may be rare.
¬C FP TN N  E.g. fraud, or HIV-positive
P' N' All  Significant majority of the
negative class and minority of
the positive class
 Classifier accuracy,  Sensitivity: True-Positive
or recognition rate: recognition rate
 Percentage of test set tuples  Sensitivity = TP / P
that are correctly classified  Specificity: True-Negative
 Accuracy = (TP + TN) / All recognition rate
 Error rate:  Specificity = TN / N
 1 – accuracy, or
 Error rate = (FP + FN) / All

Classifier-Evaluation Metrics: 6 - 56
Precision, Recall, and F-Measures
 Precision:
 Exactness – the % of tuples that are actually positive in those that the
classifier labeled as positive
TP
precision 
TP  FP
 Recall:
 Completeness – the % of tuples that the classifier labeled as positive
in all positive tuples TP
recall 
 Perfect score is 1.0 TP  FN
 Inverse relationship between precision and recall
 F-measure (F1 or F-score): 2  precision  recall
 Gives equal weight F
precision  recall
to precision and recall
 Fβ: weighted measure of precision and recall
 Assigns β times as much weight (1   ) 2  precision  recall
to recall as to precision F 
 2  precision  recall
Classifier-Evaluation Metrics: Example 6 - 57
Actual Class \ Predicted class cancer = yes cancer = no Total Recognition(%)

cancer = yes 90 210 300 30.00 (sensitivity)
cancer = no 140 9560 9700 98.56 (specificity)
Total 230 9770 10000 96.40 (accuracy)
 Precision = 90 / 230 = 39.13%

 Recall = 90 / 300 = 30.00%
 F-measure = 2  0.3913  0.3 / (0.3913 + 0.3) = 33.96%

Evaluating Classifier Accuracy: 6 - 58
Holdout & Cross-Validation Methods
 Holdout method
 Given data is randomly partitioned into two independent sets:
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
 Random sampling: a variation of holdout
 Repeat holdout k times, accuracy = avg. of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
 At i-th iteration, use Di as test set and the others as training set.
 Leave-one-out:
 k folds, where k = # of tuples; for small-sized data
 Stratified cross-validation:
 Folds are stratified so that class distribution in each fold is approx. the
same as that in the initial data.
Evaluating Classifier Accuracy: Bootstrap 6 - 59
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 I.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set.
 Several bootstrap methods, and a common one is .632 bootstrap
 Data set with d tuples sampled d times, with replacement,
resulting in a training set of d samples.
 The data tuples that did not make it into the training set end up forming the
test set.
 About 63.2% of the original data end up in the bootstrap, and the
remaining 36.8% form the test set (since (1 – 1/d)d ≈ e–1 = 0.368).
 Repeat the sampling procedure k times; overall accuracy of the model:

Estimating Confidence Intervals: 6 - 60
Classifier Models M1 vs. M2
 Suppose we have 2 classifiers, M1 and M2,

which one is better?
 Use 10-fold cross-validation to obtain and
 Recall: error rate is 1 – accuracy(M).
 Mean error rates:
 Just estimates of error
on the true population of future data cases.
 What if the difference between the 2 error rates
is just attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates

Null Hypothesis
 Perform 10-fold cross-validation

 10 times
 Assume samples follow a t-distribution
with k–1 degrees of freedom
 Here, k = 10
 Use t-test
 Student's t-test
 Null hypothesis:
 M1 & M2 are the same
 If we can reject the null hypothesis, then
 Conclude that difference between M1 & M2 is statistically significant
 Select model with lower error rate

Estimating Confidence Intervals: t-test 6 - 62
 If only one test set available: pairwise comparison

 For i-th round of 10-fold cross-validation,
the same cross partitioning is used to obtain err(M1)i and err(M2)i
 Average over 10 rounds to get and
 t-test computes t-statistic with k–1 degrees of freedom:
 where
 If two test sets available: use nonpaired t-test
 where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
Table for t-distribution
 Symmetrical
 Significance level
 E.g., sig = 0.05 or 5%
means M1 & M2 are
significantly different for
95% of population
 Confidence limit
 z = sig / 2

Statistical Significance
 Are M1 & M2 significantly different?

 Compute t. Select significance level (e.g. sig = 5%).
 Consult table for t-distribution:
 Find t value corresponding to k–1 degrees of freedom (here, 9).
 t-distribution is symmetrical:
 Typically upper % points of distribution shown
→ look up value for confidence limit z = sig/2 (here, 0.025)
 If t > z or t < –z, then t value lies in rejection region:
 Reject null hypothesis that mean error rates of M1 & M2 are same
 Conclude: statistically significant difference between M1 & M2
 Otherwise, conclude that any difference is chance.

Model Selection: ROC Curves 6 - 65
M1
 ROC (Receiver Operating
Characteristics) curves:
 For visual comparison of classification
models M2
 Originated from signal-detection theory
 Shows the trade-off between the true-
positive rate and the false-positive rate
 The area under the ROC curve is a
measure of the accuracy of the
model.
 Rank the test tuples in decreasing  Vertical axis represents the true
order: positive rate.
 The one that is most likely to belong to
the positive class appears at the top of  Horizontal axis repr. the false
the list positive rate.
 The closer to the diagonal line  The plot also shows a diagonal
(i.e., the closer the area is to 0.5), line.
the less accurate is the model.  A model with perfect accuracy will
have an area of 1.0.

Issues Affecting Model Selection 6 - 66
 Accuracy
 Classifier accuracy: predicting class label
 Speed
 Time to construct the model (training time)
 Time to use the model (classification/prediction time)
 Robustness
 Handling noise and missing values
 Scalability
 Efficiency in disk-resident databases
 Interpretability
 Understanding and insight provided by the model
 Other measures
 E.g., goodness of rules, such as decision-tree size or compactness of
classification rules

 Summary

Ensemble Methods: Increasing the Accuracy 6 - 68
 Ensemble methods
 Use a combination of models
to increase accuracy
 Combine a series
of k learned models,
M1, M2, …, Mk,
with the aim of creating
an improved model M*
 Popular ensemble methods
 Bagging:
 Averaging the prediction over a collection of classifiers
 Boosting:
 Weighted vote with a collection of classifiers
 Ensemble:
 Combining a set of heterogeneous classifiers

Bagging: Boostrap Aggregation 6 - 69
 Analogy:
 Diagnosis based on multiple doctors' majority vote
 Training
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X
 Each classifier Mi returns its class prediction.
 The bagged classifier M* counts the votes
and assigns the class with the most votes to X
 Prediction
 Can be applied to the prediction of continuous values
by taking the average value of each prediction for a given test tuple
 Accuracy
 Often significantly better than a single classifier derived from D
 For noisy data: not considerably worse, more robust
 Proved improved accuracy in prediction

Boosting 6 - 70
 Analogy:
 Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
 How boosting works:
 Weights are assigned to each training tuple.
 A series of k classifiers is iteratively learned.
 After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1 to pay more attention to the training tuples that
were misclassified by Mi.
 The final M* combines the votes of each individual classifier, where the
weight of each classifier's vote is a function of its accuracy.
 Boosting algorithm can be extended for numeric prediction.
 Comparing with bagging:
 Boosting tends to have greater accuracy, but it also risks overfitting the
model to misclassified data.
AdaBoost 6 - 71
"Adaptive Boosting" (Freund and Schapire, 1997)

 Given a set of d class-labeled tuples: (X1, y1), …, (Xd, yd)
 Initially, all the weights of tuples are set the same: (1/d)
 Generate k classifiers in k rounds. At round i,
 Tuples from D are sampled (with replacement)
to form a training set Di of the same size.
 Each tuple's chance of being selected is based on its weight.
 A classification model Mi is derived from Di.
 Its error rate is calculated using Di as a test set.
 If a tuple is misclassified, its weight is increased, otherwise it is decreased.
 Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier
Mi error rate is the sum of the weights of the misclassified tuples:
d
error ( M i )   w j  err ( X j )
j
1  error (M i )
 Summer
The Semester
weight2018
of classifier Mi's vote is: log
error (M i )
Random Forest 6 - 72
(Breiman 2001)
 Random Forest:
 Each classifier in the ensemble is a decision-tree classifier
and is generated using a random selection of attributes at each node
to determine the split.
 During classification, each tree votes and the most popular class is returned.
 Two Methods to construct Random Forest:
 Forest-RI (random input selection):
 Randomly select, at each node, F attributes as candidates for the split at the
node. The CART methodology is used to grow the trees to maximum size.
 Forest-RC (random linear combinations):
 Creates new attributes (or features) that are a linear combination of the
existing attributes (reduces the correlation between individual classifiers).
 Comparable in accuracy to AdaBoost, but more robust to errors and
outliers
 Insensitive to the number of attributes selected for consideration at
each split, and faster than bagging or boosting
Classification of Class-Imbalanced Data Sets 6 - 73
 Class-imbalance problem:
 Rare positive example but numerous negative ones
 E.g., medical diagnosis, fraud, oil-spill, fault, etc.
 Traditional methods assume a balanced distribution of classes and equal
error costs: not suitable for class-imbalanced data
 Typical methods for imbalanced data in 2-class classification:
 Oversampling:
 Re-sampling of data from positive class
 Undersampling:
 Randomly eliminate tuples from negative class
 Threshold-moving:
 Moves the decision threshold, t, so that the rare-class tuples are easier to
classify, and hence, less chance of costly false-negative errors
 Ensemble techniques:
 Ensemble multiple classifiers introduced above
 Still difficult on multi-class tasks

 Summary

Summary 6 - 75
 Classification
 A form of data analysis
that extracts models describing important data classes
 Effective and scalable methods
 Developed for decision-tree induction, naive Bayesian classification, rule-
based classification, and many other classification methods
 Evaluation metrics:
 Accuracy, sensitivity, specificity, precision, recall, F-measure, and Fß
measure
 Stratified k-fold cross-validation
 Recommended for accuracy estimation
 Significance tests and ROC curves
 Useful for model selection

75
Summary (2) 6 - 76
 Ensemble methods
 Bagging and boosting can be used to increase overall accuracy by learning
and combining a series of individual models.
 Numerous comparisons of the different classification methods
 Matter remains a research topic
 No single method has been found to be superior over all others for all data
sets.
 Issues such as accuracy, training time, robustness, scalability, and
interpretability must be considered and can involve trade-offs, further
complicating the quest for an overall superior method.

76
Reference: Books on Classification 6 - 77
E. Alpaydin. Introduction to Machine Learning. 2nd ed., MIT Press, 2011

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth International Group, 1984
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd ed., John
Wiley, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag, 2001
H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A
Data Mining Perspective. Kluwer Academic, 1998
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
S. Marsland. Machine Learning: An Algorithmic Perspective. Chapman and
Hall/CRC, 2009
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan
Kaufmann, 1990
Reference: Books on Classification (2) 6 - 78
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison

Wesley, 2005
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets, Machine
Learning, and Expert Systems. Morgan Kaufman, 1991
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques. 2nd ed., Morgan Kaufmann, 2005

Reference: Decision Trees 6 - 79
M. Ankerst, C. Elsen, M. Ester, and H.-P. Kriegel. Visual classification: An

interactive approach to decision tree construction. KDD'99
C. Apte and S. Weiss. Data mining with decision trees and decision rules.
Future Generation Computer Systems 13, 1997
C. E. Brodley and P. E. Utgoff. Multivariate decision trees. Machine Learning
19:45–77, 1995
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from
partitioned data for scaling machine learning. KDD'95
U. M. Fayyad. Branching on attribute values in decision tree generation.
AAAI'94
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data
mining. EDBT'96
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast
decision tree construction of large datasets. VLDB'98
J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT – Optimistic
Decision Tree Construction. SIGMOD'99
Reference: Decision Trees (2) 6 - 80
S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-

Disciplinary Survey. Data Mining and Knowledge Discovery 2(4):345–389, 1998
J. R. Quinlan. Induction of decision trees. Machine Learning 1:81–106, 1986
J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum
description length principle. Information and Computation 80:227–248, Mar.
1989
S. K. Murthy. Automatic construction of decision trees from data: A multi-
disciplinary survey. Data Mining and Knowledge Discovery 2:345–389, 1998
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building
and pruning. VLDB'98
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for
data mining. VLDB'96
Y.-S. Shih. Families of splitting criteria for classification trees. Statistics and
Computing 9:309–315, 1999

Reference: Neural Networks 6 - 81
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University

Press, 1995
Y. Chauvin and D. Rumelhart. Backpropagation: Theory, Architectures, and
Applications. Lawrence Erlbaum, 1995
J. W. Shavlik, R. J. Mooney, and G. G. Towell. Symbolic and neural learning
algorithms: An experimental comparison. Machine Learning 6:111–144, 1991
S. Haykin. Neural Networks and Learning Machines. Prentice Hall, Saddle
River, NJ, 2008
J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural
Computation. Addison Wesley, 1991
R. Hecht-Nielsen. Neurocomputing. Addison Wesley, 1990
B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University
Press, 1996

Reference: Support Vector Machines 6 - 82
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition.

Data Mining and Knowledge Discovery 2(2):121–168, 1998
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines
and Other Kernel-Based Learning Methods. Cambridge Univ. Press, 2000
H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. N. Vapnik. Support
vector regression machines. NIPS, 1997
J. C. Platt. Fast training of support vector machines using sequential minimal
optimization. In B. Schoelkopf, C. J. C. Burges, and A. Smola (eds.). Advances
in Kernel Methods|Support Vector Learning, pages 185–208. MIT Press, 1998
B. Schoelkopf, P. L. Bartlett, A. Smola, and R. Williamson. Shrinking the tube: A
new support vector regression algorithm. NIPS, 1999
H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with
hierarchical clusters. KDD'03

Reference: Pattern-Based Classification 6 - 83
H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern

Analysis for Effective Classification. ICDE'07
H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct Discriminative Pattern Mining for
Effective Classification. ICDE'08
G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule
groups for gene expression data. SIGMOD'05
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99
H. S. Kim, S. Kim, T. Weninger, J. Han, and T. Abdelzaher. NDPMine:
Efficiently mining discriminative numerical features for pattern-based
classification. ECMLPKDD'10
W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on
Multiple Class-Association Rules. ICDM'01
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining.
KDD'98
J. Wang and G. Karypis. HARMONY: Efficiently mining the best rules for
classification. SDM'05

References: Rule Induction 6 - 84
P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning 3:261–
283, 1989
W. Cohen. Fast effective rule induction. ICML'95
S. L. Crawford. Extensions to the CART algorithm. Int. J. Man-Machine Studies
31:197–217, Aug. 1989
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93
P. Smyth and R. M. Goodman. An information theoretic approach to rule
induction. IEEE Trans. Knowledge and Data Engineering 4:301–316, 1992
X. Yin and J. Han. CPAR: Classification based on predictive association rules.
SDM'03

References: K-NN & Case-Based Reasoning 6 - 85
A. Aamodt and E. Plazas. Case-based reasoning: Foundational issues,

methodological variations, and system approaches. AI Comm. 7:39–52, 1994
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans.
Information Theory 13:21–27, 1967
B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification
Techniques. IEEE Computer Society Press, 1991
J. L. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993
A. Veloso, W. Meira, and M. Zaki. Lazy associative classification. ICDM'06

References: Bayesian Method & Statistical Models 6 - 86
A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall,

1990
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks:
The combination of knowledge and statistical data. Machine Learning, 1995
G. Cooper and E. Herskovits. A Bayesian method for the induction of
probabilistic networks from data. Machine Learning 9:309–347, 1992
A. Darwiche. Bayesian networks. Comm. ACM 53:80–90, 2010
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. J. Royal Statistical Society, Series B,
39:1–38, 1977
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks:
The combination of knowledge and statistical data. Machine Learning 20:197–
243, 1995
F. V. Jensen. An Introduction to Bayesian Networks. Springer Verlag, 1996
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and
Techniques. The MIT Press, 2009
86
References: Bayesian Method & Statistical Models 6 - 87
(2)
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kauffman, 1988

S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic
networks with hidden variables. IJCAI'95
V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998

References: Semi-Supervised & Multi-Class 6 - 88
Learning
O. Chapelle, B. Schoelkopf, and A. Zien. Semi-supervised Learning. MIT Press,

2006
T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-
correcting output codes. J. Articial Intelligence Research 2:263–286, 1995
W. Dai, Q. Yang, G. Xue, and Y. Yu. Boosting for transfer learning. ICML'07
S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. on
Knowledge and Data Engineering 22:1345–1359, 2010
B. Settles. Active learning literature survey. CS Tech. Rep. 1648, Univ.
Wisconsin-Madison, 2010
X. Zhu. Semi-supervised learning literature survey. CS Tech. Rep. 1530, Univ.
Wisconsin-Madison, 2005

References: Genetic Algorithms & Rough/Fuzzy 6 - 89
Sets
D. Goldberg. Genetic Algorithms in Search, Optimization, and Machine

Learning. Addison-Wesley, 1989
S. A. Harp, T. Samad, and A. Guha. Designing application-specific neural
networks using the genetic algorithm. NIPS, 1990
Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs.
Springer Verlag, 1992
M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996
Z. Pawlak. Rough Sets, Theoretical Aspects of Reasoning about Data. Kluwer
Academic, 1991
S. Pal and A. Skowron (eds.). Fuzzy Sets, Rough Sets and Decision Making
Processes. New York, 1998
R. R. Yager and L. A. Zadeh. Fuzzy Sets, Neural Networks and Soft Computing.
Van Nostrand Reinhold, 1994

References: Model Evaluation, Ensemble Methods 6 - 90
L. Breiman. Bagging predictors. Machine Learning 24:123–140, 1996

L. Breiman. Random forests. Machine Learning 45:5–32, 2001
C. Elkan. The foundations of cost-sensitive learning. IJCAI'01
B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall,
1993
J. Friedman and E. P. Bogdan. Predictive learning via rule ensembles. Ann.
Applied Statistics 2:916–954, 2008
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification
algorithms. Machine Learning, 2000
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi (ed.). Advanced Methods of
Marketing Research, Blackwell Business, 1994
J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96

90
References: Model Evaluation, Ensemble Methods 6 - 91
(2)
G. Seni and J. F. Elder. Ensemble Methods in Data Mining: Improving Accuracy

Through Combining Predictions. Morgan and Claypool, 2010
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. J. Computer and System Sciences,
1997

Classification methods
Decision Tree
Laszlo Kovacs, 2018
Classification
General model: {((x1,x2,x3,...,xm):y} ---> f(x1,x2,x3,...,xm) -> Y
Xi : attributes (visible)
Y, Y : variable (not visible)
Application areas:
Spam Filtering Web Search Postal Mail Routing
Movie
Fraud Detection OCR
Recommendations
Supervised Both attributes and output variables are known in
learning the knowledge base (experiences)
Unsupervised Only the attributes are known in the knowledge

learning base (experiences)
Regression The outcome (variable) is continuous

(numerical)
Classification The outcome is a category (discrete value)

Two phases in classification / regression
Data Set empty

fitting model
(Experiences) model
output
Data Set model prediction
variables
training set
target
feature
label
example
Data preparation is an important step
Features may be converted from numerical to categorical (intervals)
Decision Tree
Elementary decision Expensive?
condition
outputs (labels) Not to
To buy buy
For complex cases:
complex condition : hard to implement in computers

more decision steps : more practical
ID3 Decision Tree algorithm
A hierarchy of elementary decisions

Condition: an attribute (feature)
Output: the distinct values
Leaf node: category label
Construction of an optimal decision tree
Goals: time cost efficiency
high precision Expensive?
DT optimization:
Do you Not to
minimal depth
need it? buy
minimal classification error
Objective functions:
Calculation of entropy
X = (a,b,b,a,a,c,c,c)
pa = 3/8 pb = 2/8 pc = 3/8
log pa = -1.41 pb = -2 pc = -1.41
H = 0.53 + 0.5 + 0.53 = 1.56
X = (a,b,b,a,a,b,b,b)
pa = 3/8 pc = 5/8
log pa = -1.41 pb = -0.67
H = 0.53 + 0.43 = 0.95
ID3 algorithm
T : training set
TC: current training set
A: feature (attribute) set
AC : current attribute set
Initially:
AC = A, TC = T
Method:
Select the best attribute from AC using entropy measure and
add this node to the tree
Selection of the winner attribute:
for a in AC loop
V = determine the set of values of a in TC
split TC into disjoint sets according to V
calculate the entropy of each partition
Ea = the aggregated entropy of a
end loop
return the a with best entropy
Why entropy?
maximum homogeneity
Yes: 8
No : 4
Yes: 8
No : 4
Yes: 2 Yes: 6
No : 2 No : 2
Benefits of DT:
Easy to interpret and implement—"if … then … else" logic

Handle any data category—binary, ordinal, continuous
No preprocessing or scaling required
Drawbacks of DT:
Only categorical attributes are supported
Too expensive to modify
Low generalization level of the data → tree pruning
Regression tree
Similar to decision tree but
its output is value or range
Quality of the DT model
measurements usually contain some random / error

components Source: Intel Data Mining Slides
Underfitting vs. overfitting
Underfitting Just Right Overfitting

Split of training set
Different data sets for training and testing to avoid overfitting
Training
Testing
Cross-validation
All data items are used for both training and testing but not at the same time
Random forest
An ensemble learning method for classification,

regression
A multitude of low-correlated /
random decision trees
The essential idea is to average many noisy
but approximately unbiased models, and
hence reduce the variance.
Random attribute set / random training set
RF-algorithm
For every tree instance (1..M) do
draw a bootstrap sample from the training data

build a random forest tree for this data using the
following steps:
select a random subset of the free attributes
pick the best of the candidate attributes
split the node into child nodes
add this tree to the ensemble of trees
Example in R
library (rpart)
head(kyphosis)
table(kyphosis$Kyphosis)
fit <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
plot(fit, uniform=TRUE, main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
plotcp(fit)
printcp(fit)
Example in RapidMiner
Source CSV
Chapter 7: Cluster Analysis


Chapter 7: Cluster Analysis 7-2
 Cluster Analysis: Basic Concepts

 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

KDD, 7 Cluster Analysis
What is Cluster Analysis? 7-3
 Cluster: A collection of data objects within a larger set that are

 similar (or related) to one another within the same group and
 dissimilar (or unrelated) to the objects outside the group.
 Cluster analysis (or clustering, data segmentation, … )
 Define similarities among data
based on the characteristics found in the data
 Group similar data objects into clusters
 Unsupervised learning:
 No predefined classes
 I.e., learning by observations (vs. learning by examples: supervised)
 Typical applications:
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms

Clustering for Data Understanding and 7-4
Applications
 Biology:
 Taxonomy of living things:
kingdom, phylum, class, order, family, genus, and species
 Information retrieval:
 Document clustering
 Land use:
 Identification of areas of similar land use in an earth-observation database
 Marketing:
 Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
 City planning:
 Identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies:
 Observed earthquake epicenters should be clustered along continent faults.
 Climate:
 Understanding earth climate, find patterns of atmosphere and ocean
 Economic Science:
 Market research

Clustering as a Preprocessing Tool (Utility) 7-5
 Summarization
 Preprocessing for regression, PCA, classification, and association analysis
 Compression
 Image processing: vector quantization
 Finding k-nearest neighbors
 Localizing search to one cluster or a small number of clusters
 Outlier detection
 Outliers are often viewed as those "far away" from any cluster.

Quality: What Is Good Clustering? 7-6
 A good clustering method will produce high-quality clusters.

 High intra-class similarity:
 Cohesive within clusters
 Low inter-class similarity:
 Distinctive between clusters
 The quality of a clustering method depends on:

 the similarity measure used by the method,
 its implementation, and
 its ability to discover some or all of the hidden patterns.

Measure the Quality of Clustering 7-7
 Dissimilarity/similarity metric
 Similarity is expressed in terms of a distance function,
typically a metric: d(i, j)
 The definitions of distance functions are usually rather different for interval-
scaled, Boolean, categorical, ordinal, ratio, and vector variables.
 See Chapter 2!
 Weights should be associated with different variables
based on applications and data semantics.
 Quality of clustering:
 There is usually a separate "quality" function
that measures the "goodness" of a cluster.
 It is hard to define "similar enough" or "good enough."
 The answer is typically highly subjective.

Considerations for Cluster Analysis 7-8
 Partitioning criteria
 Single level vs. hierarchical partitioning
 Often, multi-level hierarchical partitioning is desirable.
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs.
 Non-exclusive (e.g., one document may belong to more than one class)
 Similarity measure
 Distance-based (e.g., Euclidian, road network, vector) vs.
 Connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low-dimensional) vs.
 Subspaces (often in high-dimensional clustering)

Requirements and Challenges 7-9
 Scalability
 Clustering all the data instead of only on samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixture of these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality

Major Clustering Approaches 7 - 10
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion
 E.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARA, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: AGNES, DIANA, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DENCLUE
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (2) 7 - 11
 Model-based approach:
 A model is hypothesized for each of the clusters
and tries to find the best fit of that model to each other.
 Typical methods: EM, SOM, COBWEB
 Frequent-pattern-based approach:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based approach:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways.
 Massive links can be used to cluster objects: SimRank, LinkClus

Chapter 7: Cluster Analysis 7 - 12

 Summary

Partitioning Algorithms: Basic Concept 7 - 13
 Partitioning method:
 Partition a database D of n objects o into a set of k clusters Ci, 1 ≤ i ≤ k
such that the sum of squared distances to ci is minimized
(where ci is the centroid or medoid of cluster Ci)
E   ik1 oCi d (o, ci ) 2

 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion.
 Globally optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen'67, Lloyd'57/'82):
 Each cluster is represented by the center of the cluster.
 k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw'87):
 Each cluster is represented by one of the objects in the cluster.

The K-Means Clustering Method 7 - 14
 Given k, the k-means algorithm is implemented in four steps:

1. Partition database into k nonempty subsets.
 E.g. the first n/k objects, then the next n/k objects, …
2. Compute the centroids of the clusters of the current partitioning.
 The centroid is the center, i.e., mean point, of the cluster.
 For each attribute (or dimension), calculate the average value.
3. Assign each object to the cluster with the nearest centroid.
 That is, for each object calculate distance to each of the k centroids
and pick the one with the smallest distance.
4. If any object has changed its cluster, go back to step 2.
Otherwise stop.
 Variant:
 Start with arbitrarily chosen k objects as initial centroids in step 1.
 Continue with step 3.
An Example of K-Means Clustering 7 - 15
k=2
Arbitrarily Calculate
partition the cluster
objects into centroids
k groups
The initial data set Loop if

needed Reassign objects
 Partition database into k
nonempty subsets;
 Repeat
 Compute centroid (i.e., mean
point) for each cluster; Update the
 Assign each object to the cluster
cluster of its nearest centroids
centroid;
 Until no change;
Comments on the K-Means Method 7 - 16
 Strength:
 Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
 Comparing: PAM: O(k(n–k)2), CLARA: O(ks2 + k(n–k))
 Comment:
 Often terminates at a local optimum.
 Weakness:
 Applicable only to objects in a continuous n-dimensional space
 Using the k-modes method for categorical data
 In comparison, k-medoids can be applied to a wide range of data.
 Need to specify k, the number of clusters, in advance
 There are ways to automatically determine the best k
(see Hastie et al., 2009).
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method 7 - 17
 Most of the variants of the k-means

differ in:
 Selection of the initial k subsets (or centroids)
 Dissimilarity calculations
 Strategies to calculate cluster centroids
 Handling categorical data: k-modes
 Replacing centroids with modes
 See Chapter 2: mode = value that occurs most frequently in the data
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method? 7 - 18
 The k-means algorithm is sensitive to outliers !

 Since an object with an extremely large value may substantially distort the
distribution of the data
 K-medoids:
 Instead of taking the mean value of the objects in a cluster as a reference
point, medoids can be used, which is the most centrally located object in a
cluster.
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

The K-Medoids Clustering Method 7 - 19
 K-Medoids Clustering:
 Find representative objects (medoids) in clusters
 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw, 1987)
 Starts from an initial set of k medoids and iteratively replaces one of the
medoids by one of the non-medoids, if it improves the total distance of
the resulting clustering
 PAM works effectively for small data sets, but does not scale well for
large data sets (due to the computational complexity)
 Efficiency improvement on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized re-sampling

PAM: A Typical K-Medoids Algorithm 7 - 20
k=2 Total Cost = 20

10 10 10
9 9 9
8 8 8
7 7 7
6
Arbitrarily 6
Assign 6
5
choose k 5
each 5
4 objects as 4 remaining 4
3 initial 3
object to 3
2
medoids 2
nearest 2
1 1 1
0 0
medoid 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Randomly select a
Total Cost = 26 nonmedoid object, orandom
10 10
Do loop 9
Compute
9
8 8
Swapping o total cost of
Until no change and orandom
7 7
6
swapping 6
5 5
if quality is 4 4
improved. 3 3
2 2
1 1
Summer Semester 2018 0

0 1 2 3 4 5 6 7 8 9 10
0
0 1 2 3 4 5 6 7 8 9 10
PAM (Partitioning Around Medoids) 8 - 21
 Use real objects to represent the clusters

 Algorithm:
1. Arbitrarily choose k objects as the initial mediods
2. Repeat
3. Assign each remaining object to the cluster with the nearest mediod oi
4. Randomly select a non-medoid object oh
5. Compute the total cost TCih of swapping oi with oh
6. If TCih < 0 then swap oi with oh to form the new set of k medoids
7. Until no change
 TCih = j Cjih
 with Cjih as the cost for object oj if oi is swapped with oh
 That is, distance to new medoid minus distance to old medoid

PAM Clustering (2) 8 - 22
 Case 1:
oj currently belongs to medoid oi. If oi is replaced with oh as a medoid, and oj is
closest to oh, then oj is reassigned to oh (same cluster, different distance).
Cjih = d(j, h) – d(j, i)
10
7
t
6
j
5
4
i h
3
0
0 1 2 3 4 5 6 7 8 9 10

 Case 2:
oj currently belongs to medoid ot, t  j. If oi is replaced with oh as a medoid, and
oj is still closest to ot, then the assignment does not change.
Cjih = d(j, i) – d(j, i) = 0
10
t j
8
4 h
3
i
2
0
0 1 2 3 4 5 6 7 8 9 10

 Case 3:
oj currently belongs to medoid oi. If oi is replaced with oh as a medoid, and oj is
closest to medoid ot of one of the other clusters, then oj is reassigned to ot (new
cluster, different distance).
Cjih = d(j, t) – d(j, i)
10
8 h
7 j
6
5
i
4
t
3
0
0 1 2 3 4 5 6 7 8 9 10

 Case 4:
oj currently belongs to medoid ot, t  j. If oi is replaced with oh as a medoid (from
a different cluster!), and oj is closest to oh, then oj is reassigned to oh (new
cluster, different distance).
Cjih = d(j, h) – d(j, t)
10
5 i
4
h j
3
2
t
1
0
0 1 2 3 4 5 6 7 8 9 10

CLARA (Clustering Large Applications) 8 - 26
(Kaufmann and Rousseeuw, 1990)

 Built in statistical-analysis packages, such as S+
 Draws multiple samples of the data set,
applies PAM on each sample,
and gives the best clustering as the output
 Strength:
 Deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased.

CLARANS ("Randomized" CLARA) 8 - 27
 A Clustering Algorithm based on Randomized Search

(Ng & Han, 1994)
 Samples
 drawn dynamically with some randomness in each step of the search
 Clustering process
 can be presented as searching a graph
where each node is a potential solution, that is, a set of k medoids
 If local optimum found,
 start with new randomly selected node in search for a new local optimum
 More efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may further
improve its performance
(Ester et al., 1995)


 Summary

Hierarchical Clustering 7 - 29
 Use distance matrix as clustering criteria.

 This method does not require the number of clusters k as an input, but needs
a termination condition.
Step 0 Step 1 Step 2 Step 3 Step 4

agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
AGNES (Agglomerative Nesting) 7 - 30
 Introduced in (Kaufmann & Rousseeuw, 1990)

 Implemented in statistical packages, e.g., S+
 Use the single-link method and the dissimilarity matrix
 See below
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
10 10 10
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

Dendrogram: Shows How Clusters Are Merged 7 - 31
Decompose data objects into a several levels of nested partitioning

(tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting the

dendrogram at the desired level, then each connected component
forms a cluster.

DIANA (Divisive Analysis) 7 - 32
 Introduced in (Kaufmann & Rousseeuw, 1990)

 Implemented in statistical analysis packages, e.g., S+
 Inverse order of AGNES
 Eventually each node forms a cluster of its own.
10 10 10
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

Distance between Clusters 7 - 33
X X
 Minimum distance:
 Smallest distance between an object in one cluster and an object in the
other, i.e., distmin(Ci, Cj) = minoipCi,ojq Cj d(oip, ojq)
 Maximum distance:
 Largest distance between an object in one cluster and an object in the other,
i.e., distmax(Ci, Cj) = maxoipCi,ojq Cj d(oip, ojq)
 Average distance:
 Average distance between an object in one cluster and an object in the other,
i.e., distavg(Ci, Cj) = 1 / (ni  nj) oipCi,ojq Cj d(oip, ojq)
 Mean distance:
 Distance between the centroids of two clusters,
i.e., distmean(Ci, Cj) = d(ci, cj)

Distance between Clusters (2) 7 - 34
 Nearest-neighbor clustering algorithm:

 Uses minimum distance to measure distance between clusters.
 Single-linkage algorithm:
 Terminates if distance between nearest clusters exceeds user-defined
threshold.
 Minimal spanning-tree algorithm:
 View objects (data points) as nodes of a graph
 Edges form a path between nodes in a cluster.
 Merging of two clusters corresponds to adding an edge
between the nearest pair of nodes.
 Because edges linking clusters always go between distinct clusters,
resulting graph will be a tree.
 Thus, agglomerative hierarchical clustering that uses minimum distance
produces minimal spanning tree.

Distance between Clusters (3) 7 - 35
 Farthest-neighbor clustering algorithm:

 Uses maximum distance to measure distance between clusters.
 Complete-linkage algorithm:
 Terminates if maximum distance between nearest clusters exceeds user-
defined threshold.
 Good if true clusters are rather compact and approx. equal in size.

Extensions to Hierarchical Clustering 7 - 36
 Major weakness of agglomerative clustering methods:

 Can never undo what was done previously
 Do not scale well:
 Time complexity of at least O(n2), where n is the number of objects
 Integration of hierarchical & distance-based clustering
 BIRCH (1996):
 Uses CF-tree and incrementally adjusts the quality of sub-clusters
 CHAMELEON (1999):
 Hierarchical clustering using dynamic modeling

BIRCH (Balanced Iterative Reducing and Clustering 7 - 37
Using Hierarchies)
(Zhang, Ramakrishnan & Livny, SIGMOD'96)

 Incrementally construct a CF (Clustering Feature) tree
 A hierarchical data structure for multiphase clustering
 Phase 1: Scan DB to build an initial in-memory CF-tree
 A multi-level compression of the data
that tries to preserve the inherent clustering structure of the data
 Phase 2: Use an arbitrary clustering algorithm to cluster the leaf nodes of the
CF-tree
 Scales linearly:
 Finds a good clustering with a single scan
and improves the quality with a few additional scans
 Weakness:
 Handles only numerical data, and sensitive to the order of the data records

Clustering Feature in BIRCH 7 - 38
 Clustering Feature CF = (n, LS, SS)

 3D vector summarizing statistics about clusters
 n: number of data points n
 LS: linear sum of N points  xi
i 1
n
 xi
2
 SS: square sum of N points
i 1
10 CF = (5, (16, 30), (54, 190))

(3,4) 9
(2,6) 7
(4,5) 4
(4,7) 2
(3,8) 0
0 1 2 3 4 5 6 7 8 9 10

Clustering Feature in BIRCH (2) 7 - 39
 Allows to derive many useful statistics of a cluster:

n
o i
LS
 E.g. centroid: ci  i 1

n n
 Additive:
 For two disjoint clusters C1 and C2
with clustering features CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2),
the clustering feature of the cluster that is formed by merging C1 and C2
is simply:
CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)

CF-Tree in BIRCH 7 - 40
 Height-balanced tree
 Stores the clustering features for a hierarchical clustering
 Non-leaf nodes store sums of the CFs of their children.
 Two parameters:
 Branching factor B: max # of children
 Threshold T: maximum diameter of sub-clusters stored at leaf nodes
 Diameter: average pairwise distance within a cluster,
reflects the tightness of the cluster around the centroid
σ𝑛𝑖=1 σ𝑛𝑗=1(𝑜𝑖 − 𝑜𝑗 )2 2𝑛𝑆𝑆 − 2𝐿𝑆 2

𝐷= =
𝑛(𝑛 − 1) 𝑛(𝑛 − 1)

CF-Tree Structure 7 - 41
Root
B=7 CF1 CF2 CF3 CF6
T=6 child1 child2 child3 child6
Non-leaf node
CF1 CF2 CF3 CF5
child1 child2 child3 child5
Leaf node Leaf node

prev CF1 CF2 CF6 next prev CF1 CF2 CF4 next

The BIRCH Algorithm 7 - 42
Phase 1:
 For each point in the input:
 Find closest leaf-node entry
 Add point to leaf-node entry and update CF
 If entry_diameter > max_diameter, then split leaf node, and possibly parents
 Information about new point is passed toward the root of the tree.
 Algorithm is O(n)
 And incremental
 Concerns
 Sensitive to insertion order of data points
 Since we fix the size of leaf nodes, clusters may not be so natural.
 Clusters tend to be spherical given the radius and diameter measures.

CHAMELEON: Hierarchical Clustering Using 7 - 43
Dynamic Modeling
(Karypis, Han & Kumar, 1999)

 Measures the similarity based on a dynamic model
 Two clusters are merged only
if the interconnectivity and closeness (proximity) between two clusters
are high
relative to the internal interconnectivity of the clusters
and the closeness of items within the clusters.
 Graph-based, and a two-phase algorithm
 Use a graph-partitioning algorithm:
 Cluster objects into a large number of relatively small sub-clusters
 Use an agglomerative hierarchical clustering algorithm:
 Find the genuine clusters by repeatedly combining these sub-clusters

Overall Framework of CHAMELEON 7 - 44
Construct (k-NN)
sparse graph
Partition the
graph
Data Set
k-NN graph:
p and q are connected if q
Merge partitions
is among the top-k closest
neighbors of p
Relative interconnectivity:
connectivity of C1 and C2
over internal connectivity
Final clusters
Relative closeness:
closeness of C1 and C2 over
KDD, 7 Cluster Analysis internal closeness
CHAMELEON (Clustering Complex Objects) 7 - 45

Probabilistic Hierarchical Clustering 7 - 46
 Algorithmic hierarchical clustering

 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model:
 Regard the set of data objects to be clustered as a sample of the
underlying data-generation mechanism to be analyzed
 Easy to understand, same efficiency as algorithmic agglomerative clustering
method, can handle partially observed data
 In practice, assume the generative models adopt common distribution
functions, e.g., Gaussian distribution or Bernoulli distribution, governed by
parameters.

Generative Model 7 - 47
 Given a set of 1-D points X = {x1, …, xn} for clustering analysis

and assuming they are generated by a Gaussian distribution:
( x )2
1 
N ( , 2 )  e 2 2
2 2
 The probability that a point xi ∈ X is generated by the model:
( xi   ) 2
1 
P( xi  ,  2 )  e 2 2
2 2
 The likelihood that X is generated by the model:
n ( xi   ) 2
1 
L( N (  ,  2 ) : X )  P ( X  ,  2 )   e 2 2
i 1 2 2
 The task of learning the generative model:

the maximum likelihood
find the parameters μ and σ2 such that

Ν (0 ,  02 )  arg max L( N ( ,  2 ) : X ) 
A Probabilistic Hierarchical Clustering Algorithm 7 - 48
 For a set of objects partitioned into m clusters C1, …, Cm,

the quality can be measured by: m
Q({C1 ,...,Cm })   P(Ci )
i 1
where P() is the maximum likelihood.

P(C1  C2 )
 Distance between clusters C1 and C2: dist (C1 , C2 )   log
P(C1 ) P(C2 )
 Algorithm: Progressively merge points and clusters
 Input: D = {o1, ..., on}: a dataset containing n objects
 Output: A hierarchy of clusters
 Method:
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
Find a pair of clusters Ci and Cj such that
Ci, Cj = argmaxi ≠ j ( log ( P(Ci ∪ Cj) / (P(Ci) P(Cj) ) );
If log ( P(Ci ∪ Cj) / (P(Ci) P(Cj) ) > 0 then merge Ci and Cj ;
Else stop };

 Summary

Density-Based Clustering Methods 7 - 50
 Clustering based on density (local cluster criterion),

such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN (Ester et al., KDD'96)
 OPTICS (Ankerst et al., SIGMOD'99).
 DENCLUE (Hinneburg & Keim, KDD'98)
 CLIQUE (Agrawal et al., SIGMOD'98) (more grid-based)

Density-Based Clustering: Basic Concepts 7 - 51
 Two parameters:
 Eps: Maximum radius of a neighborhood
 Defines the neighborhood of a point p:
NEps(p) = { q  D | dist(p,q) ≤ Eps }
 Distance function still needed
 MinPts: Minimum number of points in an Eps-neighborhood of a point p
 Core-point condition:
| NEps(p) | ≥ MinPts
 Only these neighborhoods are considered.
q MinPts = 5
Eps = 1 cm
p

Directly Density-Reachable 7 - 52
 Directly density-reachable:
 A point q
is directly density-reachable
from a core point p
w.r.t. Eps, MinPts,
if q belongs to NEps(p). MinPts = 5
q Eps = 1 cm
p

Density-Reachable and Density-Connected 7 - 53
 Density-reachable:
 A point q is density-reachable
from a core point p w.r.t. Eps, MinPts,
if there is a chain of points
q
p1, …, pn, p2
p1 = p, pn = q p
such that pi+1 is directly density-reachable
from pi.
 Density-connected:
 A point p is density-connected
to a point q w.r.t. Eps, MinPts,
if there is a point o
such that both p and q
p q
are density-reachable from o.
o

DBSCAN: Density-Based Spatial Clustering of 7 - 54
Applications with Noise
 Relies on a density-based notion of cluster:

 A cluster is defined as a maximal set of density-connected points.
 Discovers clusters of arbitrary shape
in spatial databases with noise
Outlier
Border
Eps = 1cm
Core MinPts = 5

DBSCAN: The Algorithm 7 - 55
 All objects in database D are marked "unvisited."

 Randomly select unvisited object p and mark as "visited."
 If Eps-neighborhood of p contains less than MinPts objects:
 Mark p as "noise."
 Otherwise (that is, p is core point):
 Create new cluster C for point p.
 Add all objects in Eps-neighborhood of p to candidate set N.
 For each p' in N that does not yet belong to a cluster:
 Add p' to C.
 If p' is unvisited, mark as "visited."
 If p' is core point, add all objects in its Eps-neighborhood to N.
 Ends when N is empty, that is, C can no longer be expanded.
 Continue the process (randomly select next unvisited object in D)
until all points have been visited.

DBSCAN: Sensitive to Parameters 7 - 56

OPTICS: A Cluster-Ordering Method 7 - 57
 OPTICS: Ordering Points To Identify the Clustering Structure

 (Ankerst, Breunig, Kriegel & Sander, SIGMOD'99)
 Avoid difficulty of finding the right parameters
 Only MinPts must be given, Eps remains open.
 Produces a special order of the database
w.r.t. its density-based clustering structure
 This cluster-ordering contains info equivalent to the density-based clusterings
corresponding to a broad range of parameter settings.
 Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
 Can be represented graphically or using visualization techniques
 Process point in order:

Select a point that is density-reachable w.r.t. the lowest Eps value,
so that clusters with higher density (lower Eps) will be finished first.

OPTICS: Some Extension of DBSCAN 7 - 58
 Core Distance:
 of a point p
 min Eps s.t. p is core point
 Reachability Distance r:
 of a point p from q
 Minimum radius value that makes p
directly density-reachable from q
 That is, r(q, p) := p1
max {core-distance(q), dist(q, p)}
 Example: q
 MinPts = 5,
core_distance(q) = 3 cm
p2
 dist(q, p1) = 2.8 cm  r(q, p1) = 3 cm
 dist(q, p2) = 4 cm  r(q, p2) = 4 cm

OPTICS: Algorithm 7 - 59
 Maintains a list of OrderSeeds:

 Points sorted by reachability-distance from their resp. closest core points
 Begin with arbitrary point from input DB
 For each point p under consideration:
 Retrieve the closest MinPts points of p,
determine core-distance,
set reachability-distance to undefined.
 Write p to output.
 If p is core point, for each point q in the Eps-neighborhood of p:
 Update reachability-distance from p
 Insert q into OrderSeeds (if q has not yet been processed)
 Move to next object in OrderSeeds list
with smallest reachability-distance
(or input DB if OrderSeeds is empty)
 Continue until input DB fully consumed (and OrderSeeds empty)
OPTICS: Visualization 7 - 60
Reachability-distance
undefined
Eps
Eps'
Cluster-order of the points

DENCLUE: Using Statistical Density Functions 7 - 61
 DENsity-based CLUstEring
 (Hinneburg & Keim, KDD'98) total influence
 Using statistical density functions: on x
d ( x , y )2 d ( x , xi ) 2
 
f Gaussian ( x, y)  e ( x)  i 1 e
2 2 N
D 2 2
f Gaussian
influence of y d ( x , xi ) 2

( x, xi )  i 1 ( xi  x)  e
on x N
f Gaussian
D 2 2
 Major features
gradient of x in
 Solid mathematical foundation the direction of
 Good for data sets with large amounts of noise xi
 Allows a compact mathematical description of arbitrarily shaped clusters in

high-dimensional data sets
 Significantly faster than existing algorithms (e.g., DBSCAN)
 But needs a large number of parameters

DENCLUE: Technical Essence 7 - 62
 Density estimation:
 Estimation of an unobservable underlying probability density function
based on a set of observed data
 Regarded as a sample
 Kernel density estimation:
 Treat an observed object as an indicator of high-probability density in the
surrounding region
 Probability density at a point depends on the distances from this point to the
observed objects
 Density attractor:
 Local maximum of the estimated density function
 Must be above threshold ξ

DENCLUE: Technical Essence (2) 7 - 63
 Clusters
 Can be determined mathematically by identifying density attractors
 That is, local maxima of the overall density function
 Center-defined clusters:
 Assign to each density attractor the points density-attracted to it
 Arbitrary shaped cluster:
 Merge density attractors that are connected through paths of high density
(> threshold)

Density Attractor 7 - 64

Center-Defined and Arbitrary 7 - 65


 Summary

Grid-Based Clustering Method 7 - 67
 Using multi-resolution grid data structure

 Several interesting methods
 STING (a STatistical INformation Grid approach)
 (Wang, Yang & Muntz, VLDB'97)
 WaveCluster
 (Sheikholeslami, Chatterjee & Zhang, VLDB'98)
 A multi-resolution clustering approach using wavelet method
 Not shown here
 CLIQUE
 (Agrawal et al., SIGMOD'98)
 Both grid-based and subspace clustering

STING: A Statistical Information Grid Approach 7 - 68
(Wang, Yang & Muntz, VLDB'97)

 The spatial area is divided into rectangular cells.
 There are several levels of cells
corresponding to different levels of resolution.

The STING Clustering Method 7 - 69
 Each cell at a higher level

 Partitioned into a number of smaller cells at next lower level
 Statistical info of each cell
 Calculated and stored beforehand and used to answer queries
 count, plus for each attribute: mean, standard dev, min, max and type of
distribution—normal, uniform, etc.
 Parameters of higher-level cells
 Can easily be calculated from parameters of lower-level cells
 Use a top-down approach
 To answer spatial data queries (or: cluster definitions)
 Start from a pre-selected layer
 Typically with a small number of cells
 For each cell at the current level
 Compute the confidence interval
The STING Clustering Method (2) 7 - 70
 Cells labeled relevant or not relevant

 At the specified confidence level
 Irrelevant cells removed from further consideration
 When finished examining the current layer,
proceed to the next lower level
 Only look at cells that are children of relevant cells
 Repeat this process until bottom layer is reached
 Find regions (clusters) that satisfy the density specified
 Breadth-first search, at bottom layer
 Examine cells within a certain distance from center of current cell
(often just the neighbors)
 If average density within this small area is greater than density specified,
mark area and put relevant cells just examined into queue.
 Examine next cell from queue and repeat procedure, until end of queue
 Then one region has been identified.
STING Algorithm: Analysis 7 - 71
 Advantages:
 Query-independent, easy to parallelize, incremental update
 O(K), where K is the number of grid cells at the lowest level
 Disadvantages:
 All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected.

CLIQUE (CLustering In QUEst) 7 - 72
(Agrawal, Gehrke, Gunopulos & Raghavan, SIGMOD'98)

 Using subspaces (lower-dimensional)
of a high-dimensional data space
that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-based.
 Partitions each dimension into non-overlapping intervals,
thereby partitioning the entire data space into cells
 Uses density threshold to identify dense cells.
 A cell is dense, if the number of data points mapped to it exceeds the
density threshold.

CLIQUE: The Major Steps 7 - 73
 Monotonicity of dense cells w.r.t. dimensionality:

 Based on the Apriori property
used in frequent-pattern and association-rule mining (see Chapter 5)
 k-dimensional cell c can have at least l points only,
if every (k–1)-dimensional projection of c
(which is a cell in a (k–1)-dimensional subspace)
has at least l points, too.
 Clustering step 1:
 Partition each dimension into intervals,
identify intervals containing at least l points.
 Iteratively join k-dimensional dense cells c1 and c2
in subspaces (Di1, … Dik) and (Dj1, …, Djk)
with Di1 = Dj1 and Di2 = Dj2 and … Dik–1 = Djk–1
and c1 and c2 share the same intervals to those dimensions.
 New (k+1)-dimensional candidate cell c in space (Di1, … Dik, Djk)
tested for density.
CLIQUE: The Major Steps (2) 7 - 74
 Clustering step 1 (cont.):

 Iteration terminates
when no more candidate cells can be generated
or no candidate cells are dense.
 Clustering step 2:
 Use dense cells in each subspace to assemble clusters
 Apply Minimum Description Length (MDL) principle
to use the maximal regions to cover connected dense cells
 Maximal region: hyper rectangle where every cell falling into the regions
is dense, and region cannot be extended further in any dimension.
 Simple greedy approach:
 Start with arbitrary dense cell
 Find maximum region covering that cell
 Work on remaining dense cells that have not yet been covered

CLIQUE: Example 7 - 75
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
Vacation (week)
Salary (10,000)
age age
20 30 40 50 60
Vacation 20 30 40 50 60
30 50 age

Strength and Weakness of CLIQUE 7 - 76
 Strength
 Automatically finds subspaces of the highest dimensionality
such that high-density clusters exist in those subspaces
 Insensitive to the order of records in input
and does not presume any canonical data distribution
 Scales linearly with the size of input
and has good scalability as the number of dimensions in the data is
increased
 Weakness
 Dependent on proper grid size
and density threshold
 Accuracy of clustering result may be degraded
at the expense of simplicity of the method


 Summary

Assessing Clustering Tendency 7 - 78
 Assess if non-random structure exists in the data

by measuring the probability that the data is generated by a uniform
data distribution
 Data with random structure:
 Points uniformly distributed in data space
 Clustering may return clusters, but:
 Artificial partitioning
 Meaningless

Hopkins Statistic 7 - 79
 Data set:
 Let X = { xi | i =1 to n } be a collection of n patterns
in a d-dimensional space such that xi = (xi1, xi2, …, xid).
 Random sample of data space:
 Let Y = { yj | j =1 to m } be m sampling points placed at random in the d-
dimensional space, m << n.
 m out of the available n patterns are marked at random.
 Two types of distances defined:
 uj as minimum distance from yj to its nearest pattern in X and
 wj as minimum distance from a randomly selected pattern in X to its nearest
neighbor in X
 The Hopkins statistic in d dimensions is defined as:

m
j 1
uj
H
 j 1 u j   j 1 w j
m m

Hopkins Statistic (2) 7 - 80
 Compares nearest-neighbor distribution

of randomly selected locations (points)
to that for randomly selected patterns.
 Under the null hypothesis, H0, of uniform distribution:
 Distances from sampling points to nearest patterns should, on the average,
be the same as the interpattern nearest-neighbor distances, implying
randomness.
 H should be about 0.5.
 When patterns are aggregated or clustered:
 Distances from sampling points to nearest patterns should, on the average,
be larger than interpattern nearest-neighbor distances.
 H should be larger than 0.5.
 Almost equal to 1.0 for very well defined clustered data.

Determine the Number of Clusters 7 - 81
 Empirical method
 # of clusters ≈ √n / 2 for a dataset of n points
 Elbow method
 Use the turning point in the curve of sum of within-cluster variance
w.r.t. the # of clusters.
 Cross-validation method
 Divide a given data set into m parts.
 Use m – 1 parts to obtain a clustering model.
 Use the remaining part to test the quality of the clustering.
 E.g., for each point in the test set, find the closest centroid, and use the
sum of squared distances between all points in the test set and the
closest centroids to measure how well the model fits the test set.
 For any k > 0, repeat it m times, compare the overall quality measure w.r.t.
different k's, and find # of clusters that fits the data the best.

Measuring Clustering Quality 7 - 82
Two methods:
 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth
using certain clustering quality measure.
 Ex. BCubed precision and recall metrics
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering
how well the clusters are separated,
and how compact the clusters are.
 Ex. Silhouette coefficient

Measuring Clustering Quality: Extrinsic Methods 7 - 83
 Clustering-quality measure: Q(C, Cg)

 For a clustering C given the ground truth Cg.
 Q is good, if it satisfies the following four essential criteria:
 Cluster homogeneity:
 The purer, the better
 Cluster completeness:
 Should assign objects that belong to the same category in the ground
truth to the same cluster
 Rag bag:
 Putting a heterogeneous object into a pure cluster should be penalized
more than putting it into a rag bag (i.e., "miscellaneous" or "other"
category)
 Small cluster preservation:
 Splitting a small category into pieces is more harmful than splitting a
large category into pieces.

 Summary

Summary 7 - 85
 Cluster analysis
 Groups objects based on their similarity and has wide applications
 Measure of similarity
 Can be computed for various types of data
 Clustering algorithms can be categorized into
 Partitioning methods (k-means and k-medoids)
 Hierarchical methods (BIRCH and CHAMELEON; probabilistic hierarchical
clustering)
 Density-based methods (DBSCAN, OPTICS, and DENCLUE)
 Grid-based methods (STING, CLIQUE)
 (Model-based methods)
 Quality of clustering results
 Can be evaluated in various ways.

References 7 - 86
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace

clustering of high dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points
to identify the clustering structure, SIGMOD'99
A. Banerjee and R. N. Davé. Validating Clusters using the Hopkins Statistic.
Fuzzy Systems 2004
F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-
Based Local Outliers. SIGMOD 2000
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95
D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987
L. Fu and E. Medico. FLAME, a novel fuzzy clustering method for the analysis
of DNA microarray data. BMC Bioinformatics 2007, 8:3
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB'98
V. Ganti, J. Gehrke, and R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB'98
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. ICDE'99
A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD'98
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988

G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering

Algorithm Using Dynamic Modeling. IEEE Computer, 32(8): 68-75, 1999
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB'98
G. J. McLachlan and K.E. Bkasford. Mixture Models: Inference and Applications
to Clustering. John Wiley and Sons, 1988
R. Ng and J. Han. Efficient and effective clustering method for spatial data
mining. VLDB'94
L. Parsons, E. Haque, and H. Liu, Subspace Clustering for High Dimensional
Data: A Review. SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very
large data sets. Pattern Recognition 1996
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-
resolution clustering approach for very large spatial databases. VLDB'98
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based

Clustering in Large Databases. ICDT'01
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of
Obstacles. ICDE'01
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in
large data sets, SIGMOD'02
W. Wang, Yang, and R. Muntz. STING: A Statistical Information Grid Approach
to Spatial Data Mining. VLDB'97
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering
method for very large databases. SIGMOD'96
X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient Clustering via Heterogeneous
Semantic Links. VLDB'06

Chapter 8: Outlier Analysis


Chapter 8: Outlier Analysis 8-2
 Outlier and Outlier Analysis

 Outlier-Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Summary

KDD, Outlier Analysis
What Are Outliers? 8-3
 Outlier:
 A data object that deviates significantly from the normal objects
as if it were generated by a different mechanism
 Ex.: unusual credit card purchase,
sports: Michael Jordon, Wayne Gretzky, ...
 Outliers are different from noise.
 Noise is a random error or variance in a measured variable.
 Noise should be removed before outlier detection.
 Outliers are interesting.
 They violate the mechanism that generates the normal data.
 Outlier detection vs. novelty detection:
 Early stage: outlier;
but later merged into the model

Outliers (2) 8-4
 Applications:
 Credit-card-fraud detection
 Telecom-fraud detection
 Customer segmentation
 Medical analysis

Types of Outliers 8-5
 Three kinds: global, contextual, and collective outliers

 Global outlier (or point anomaly):
 Significantly deviates from the rest of the data set
 Ex.: intrusion detection in computer networks
 Issue: Find an appropriate measurement of deviation
 Contextual outlier (or conditional outlier): Global Outlier
 Deviates significantly based on a selected context
 Ex.: 80o F in Urbana—outlier? (depending on summer or winter)
 Attributes of data objects divided into two groups:
 Contextual attributes: define the context, e.g., time & location
 Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
 Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
 Issue: How to define or formulate meaningful context?
Types of Outliers (2) 8-6
 Collective Outliers
 A subset of data objects
that collectively deviates significantly
from the whole data set,
even if the individual data objects may not be outliers.
 Ex.: intrusion detection—A number of computers Collective Outlier
keep sending denial-of-service packages to each other.
 Detection of collective outliers
 Consider not only behavior of individual objects,
but also that of groups of objects
 Need to have the background knowledge
on the relationship among data objects,
such as a distance or similarity measure on objects
 A data set may have multiple types of outliers.
 One object may belong to more than one type of outlier.
Challenges of Outlier Detection 8-7
 Modeling normal objects and outliers properly

 Hard to enumerate all possible normal behaviors in an application
 The border between normal and outlier objects is often a grey area.
 Application-specific outlier detection
 Choice of distance measure among objects and the model of relationship
among objects are often application-dependent.
 E.g. clinical data: a small deviation could be an outlier;
while in marketing analysis: larger fluctuations
 Handling noise in outlier detection
 Noise may distort the normal objects
and blur the distinction between normal objects and outliers.
 It may hide outliers and reduce the effectiveness of outlier detection.
 Understandability
 Understand why these are outliers: justification of the detection
 Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
Chapter 8: Outlier Analysis 8-8

 Summary

Outlier Detection I: Supervised Methods 8-9
 Two ways to categorize outlier-detection methods:

 Based on whether user-labeled examples of outliers can be obtained:
 Supervised, semi-supervised vs. unsupervised methods
 Based on assumptions about normal data and outliers:
 Statistical, proximity-based, and clustering-based methods
 Outlier Detection I: Supervised Methods
 Modeling outlier detection as a classification problem
 Samples examined by domain experts used for training & testing
 Methods for learning a classifier for outlier detection effectively:
 Model normal objects & report those not matching the model as outliers, or
 Model outliers and treat those not matching the model as normal
 Challenges
 Imbalanced classes, i.e., outliers are rare:
Boost the outlier class and make up some artificial outliers
 Catch as many outliers as possible, i.e., recall is more important than
accuracy (i.e., not mislabeling normal objects as outliers)

Outlier Detection II: Unsupervised Methods 8 - 10
 Assume the normal objects are somewhat "clustered" into multiple

groups, each having some distinct features.
 An outlier is expected to be far away from any group of normal objects.
 Weakness: Cannot detect collective outliers effectively
 Normal objects may not share any strong pattern,
but the collective outliers may have high similarity in a small area.
 Ex.: In some intrusion or virus detection, normal activities are diverse.
 Unsupervised methods may have a high false-positive rate, but still miss many
real outliers.
 Supervised methods can be more effective,
e.g., identify attacking some key resources.
 Many clustering methods can be adapted for unsupervised methods.
 Find clusters, then outliers: not belonging to any cluster
 Problem 1: Hard to distinguish noise from outliers
 Problem 2: Costly since first clustering, but far less outliers than normal objects
 Newer methods: tackle outliers directly

Outlier Detection III: Semi-Supervised Methods 8 - 11
 Situation:
 In many applications, the number of labeled data objects is small:
Labels could be on outliers only, on normal objects only, or on both.
 Semi-supervised outlier detection
 Regarded as application of semi-supervised learning
 If some labeled normal objects are available:
 Use the labeled examples and the proximate unlabeled objects
to train a model for normal objects.
 Those not fitting the model of normal objects are detected as outliers.
 If only some labeled outliers are available,
that small number may not cover the possible outliers well.
 To improve the quality of outlier detection:
get help from models for normal objects learned from unsupervised methods.

Outlier Detection (1): Statistical Methods 8 - 12
 (Also known as model-based methods)

 Assume that the normal data follow some statistical model.
 The data not following the model are outliers.
 Example (right figure):
 First use Gaussian distribution gD
to model the normal data.
 For each object y in region R, estimate gD(y),
the probability that y fits the Gaussian distribution
 If gD(y) is very low, y is unlikely generated
by the Gaussian model, thus an outlier.
 Effectiveness of statistical methods
 Highly depends on whether the assumption of statistical model holds in the
real data
 There are many kinds of statistical models.
 E.g., parametric vs. non-parametric
Outlier Detection (2): Proximity-Based Methods 8 - 13
 An object is an outlier if the nearest neighbors of the object are far away, i.e.,
the proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set.
 Model the proximity of an object
using its 3 nearest neighbors.
 Objects in region R are substantially different
from other objects in the data set.
 Thus the objects in R are outliers.
 Effectiveness of proximity-based methods
 Highly relies on the proximity measure.
 In some applications, proximity or distance measures cannot be obtained easily.
 Often have a difficulty in finding a group of outliers which are close to each other.
 Two major types of proximity-based outlier detection:
 Distance-based vs. density-based

Outlier Detection (3): Clustering-Based Methods 8 - 14
 Normal data belong to large and dense clusters,

whereas outliers belong to small or sparse clusters,
or do not belong to any cluster.
 Example (right figure): two clusters
 All points not in R form a large cluster
 The two points in R form a tiny cluster,
thus are outliers.
 Many clustering methods

 Thus also many clustering-based outlier detection methods
 Clustering is expensive.
 Straightforward adaptation of a clustering method for outlier detection
can be costly
and does not scale up well for large data sets.

Chapter 8: Outlier Analysis 8 - 15

 Summary

Statistical Approaches 8 - 16
 Assume that the objects in a data set are generated by a stochastic

process (a generative model).
 Idea:
 Learn a generative model fitting the given data set,
and then identify the objects in low-probability regions of the model
as outliers.
 Methods divided into two categories:
 Parametric vs. non-parametric
 Parametric method
 Assumes that the normal data is generated by a parametric distribution with
parameter θ.
 The probability density function of the parametric distribution f(x, θ) gives the
probability that object x is generated by the distribution.
 The smaller this value, the more likely x is an outlier.

Statistical Approaches (2) 8 - 17
 Non-parametric method
 Do not assume an a-priori statistical model
and determine the model from the input data.
 Not completely parameter-free,
but consider number and nature of the parameters to be flexible
and not fixed in advance.
 Examples: histogram and kernel density estimation

Parametric Methods I: Detection of Univariate 8 - 18
Outliers Based on Normal Distribution
 Univariate data:
 A data set involving only one attribute or variable
 Assumption:
 Data are generated from a normal distribution
 Learn the parameters from the input data, and identify the points
with low probability as outliers.
 Use the maximum-likelihood method to estimate μ and σ:
 Taking derivatives with respect to μ and σ2, we derive the following

maximum-likelihood estimates:

Parametric Methods I: Detection of Univariate 8 - 19
Outliers Based on Normal Distribution (2)
 Example
 Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
 For these data with n = 10, we have
 Then the most deviating value 24.0 is 4.61 away form the estimated mean.
 μ  3σ contains 99.7% of the data under the assumption of normal
distribution
 Because 4.61 / 1.51 = 3.04 > 3, probability that 24.0 is generated by
normal distribution is less than 0.15%.
 Each "tail" to the left and to the right of the 99.7% has 0.15%.
 Hence, 24.0 identified as an outlier.

Parametric Methods I: The Grubb's Test 8 - 20
 Univariate outlier detection:

The Grubb's test (maximum normed residual test)
 Another statistical method under normal distribution
 For each object x in a data set, compute its z-score: z = | x – xm | / s
 where xm is the mean and s the standard deviation of the input data.
 x is an outlier, if
 where is the value taken by a t-distribution at a
significance level of α / (2N), and N is the # of objects in the data set.

Parametric Methods II: 8 - 21
Detection of Multivariate Outliers
 Multivariate data:
 A data set involving two or more attributes or variables
 Transform the multivariate outlier-detection task into a univariate
outlier-detection problem.
 Method 1. Compute Mahalanobis distance
 Let ō be the mean vector for a multivariate data set. Mahalanobis distance
for an object o to ō is MDist(o, ō) = (o – ō)T S –1 (o – ō) where S is the
covariance matrix.
 Use the Grubb's test on this measure to detect outliers.
 Method 2. Use 2 statistic
 where Ei is the mean of the i-dimension among all objects, and n is the
dimensionality.
 If 2 statistic is large, then object oi is an outlier.
Parametric Methods III: 8 - 22
Using Mixture of Parametric Distributions
 Assuming that data are generated

by a normal distribution
could sometimes be overly simplified.
 The objects between the two clusters
cannot be captured as outliers
since they are close to the estimated mean.
 Assume the normal data is generated by two normal distributions.
 For any object o in the data set, the probability that o is generated by the
mixture of the two distributions is given by
where fθ1 and fθ2 are the probability density functions of θ1 and θ2 .

 Then use expectation-maximization (EM) algorithm to learn the parameters
μ1, σ1, μ2, σ2 from data.
 An object o is an outlier if it does not belong to any cluster.

Non-Parametric Methods: 8 - 23
Detection Using Histogram
 The model of normal data is learned from the input data

without any apriori structure.
 Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios.
 Outlier detection using histogram:
 Figure shows the histogram
of purchase amounts in transactions.
 A transaction with the amount of $7,500
is an outlier,
since only 0.2% of the transactions
have an amount higher than $5,000.

Non-Parametric Methods: 8 - 24
Detection Using Histogram (2)
 Problem:
 Hard to choose an appropriate bin size for histogram
 Too small bin size → normal objects in empty/rare bins, false positive
 Too big bin size → outliers in some frequent bins, false negative
 Solution:
 Adopt kernel density estimation
to estimate the probability density distribution of the data.
 If the estimated density function is high,
the object is likely normal.
 Otherwise, it is likely an outlier.


 Summary

Proximity-Based Approaches: 8 - 26
Distance-Based vs. Density-Based Outlier
Detection
 Intuition:
 Objects that are far away from the others are outliers.
 Assumption of proximity-based approach:
 The proximity of an outlier deviates significantly from that of most of the
others in the data set.
 Two types of proximity-based outlier-detection methods
 Distance-based outlier detection:
 An object o is an outlier, if its neighborhood does not have enough other
points.
 Density-based outlier detection:
 An object o is an outlier, if its density is relatively much lower than that of
its neighbors.

Distance-Based Outlier Detection 8 - 27
 For each object o, examine the # of other objects in the r-neighborhood of o,

where r is a user-specified distance threshold.
 An object o is an outlier
if most (taking π as a fraction threshold) of the objects in D
are far away from o, i.e., not in the r-neighborhood of o
 An object o is a DB(r, π) outlier if
 Equivalently, one can check the distance between o and its k-th nearest
neighbor ok, where .
 o is an outlier, if dist(o, ok) > r

Distance-Based Outlier Detection (2) 8 - 28
 Efficient computation: Nested-loop algorithm

 For any object oi,
calculate its distance from other objects,
and count the # of other objects in the r-neighborhood.
 If π∙n other objects are within r distance, terminate the inner loop.
 Otherwise, oi is a DB(r, π) outlier.
 Efficiency:
 Actually, CPU time is not O(n2) but linear to the data set size,
since for most non-outlier objects, the inner loop terminates early.

Distance-Based Outlier Detection (3) 8 - 29
 Why is efficiency still a concern?

 If the complete set of objects cannot be held in main memory,
there is significant cost for I/O swapping.
 The major cost:
(1) Each object is tested against the whole data set—
why not only against its close neighbors?
(2) Objects are checked one by one—
why not group by group?

Distance-Based Outlier Detection: 8 - 30
A Grid-Based Method
 CELL:
 Data space is partitioned into a multi-D grid.
 Each cell is a hyper cube with diagonal length r/2.
 r distance threshold parameter
 l dimensions: edge of each cell r / (2√l) long
 Level-1 cells:
 Immediately next to C
 For any possible point x in cell C and
any possible point y in a level-1 cell: dist(x,y) ≤ r
 Level-2 cells:
 One or two cells away from C
 For any possible point x in cell C
and any point y such that dist(x,y) ≥ r,
y is in a level-2 cell.

Distance-Based Outlier Detection: 8 - 31
A Grid-Based Method (2)
 Total number of objects in cell C: a

 Total number of objects in level-1 cells: b1
 Total number of objects in level-2 cells: b2
 Level-1 cell pruning rule:
 if a + b1 >  π n ,
then every object o in C is not a DB(r, π) outlier,
because all objects in C and the level-1 cells are in the r-neighborhood of o,
and there are at least  π n  such objects.
 Level-2 cell pruning rule:
 If a + b1 + b2 <  π n  + 1,
then all objects in C are DB(r, π) outliers,
because all of their r-neighborhoods have less than  π n  other objects.
 Only need to check the objects that cannot be pruned
 Even for such an object o, only need to compute the distance between o and
the objects in level-2 cells
 Since beyond level-2, distance from o is more than r

Density-Based Outlier Detection 8 - 32
 Local outliers:
 Outliers compared to their local neighborhoods,
not to global data distribution
 In Fig., O1 and O2 are local outliers to C1,
O3 is a global outlier, but O4 is not an outlier.
 However, distance of O1 and O2 to objects
in dense cluster C1 is smaller than average
distance in sparse cluster C2.
 Hence, O1 and O2 are not distance-based outliers.
 Intuition:
 Density around outlier object significantly different from density around its
neighbors.

Density-Based Outlier Detection (2) 8 - 33
 Method:
 Use the relative density of an object against its neighbors
as the indicator of the degree of the object being outliers
 k-distance of an object o
 distk(o)
 distance dist(o, p) between o and its k-nearest neighbor p
 at least k objects o' ϵ D – {o} such that dist (o, o') ≤ dist(o, p)
 at most k – 1 objects o'' ϵ D – {o} such that dist(o, o'') > dist(o, p)
 k-distance neighborhood of o:
 Nk(o) = {o' | o' ϵ D, dist(o, o') ≤ distk(o)}
 Nk(o) could be bigger than k
since multiple objects may have identical distance to o

Local Outlier Factor 8 - 34
 Reachability distance from o' to o:
 where k is a user-specified parameter

 Local reachability density of o:
 LOF (Local Outlier Factor) of o:

 The average of the ratio of local reachability of o and those of o's k-nearest
neighbors
 The lower the local reachability density of o, and the higher the local
reachability density of the kNN of o, the higher LOF
 This captures a local outlier whose local density is relatively low comparing to
the local densities of its kNN.

 Summary

Summary 8 - 36
 Types of outliers
 Global, contextual & collective outliers
 Outlier detection
 Supervised, semi-supervised, or unsupervised
 Statistical (or model-based) approaches
 Proximity-based approaches
 Not covered here:
 Clustering-based approaches
 Classification approaches
 Mining contextual and collective outliers
 Outlier detection in high dimensional data

References 8 - 37
B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time
series. Biometrika, 66:229–248, 1979.
M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric
and symbolic outlier mining techniques. Intell. Data Anal., 10:521–538, 2006.
F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–
147, 1960.
D. Agarwal. Detecting anomalies in cross-classified streams: a Bayesian
approach. Knowl. Inf. Syst., 11:29–44, 2006.
F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets.
TKDE, 2005.
C.C. Aggarwal and P.S. Yu. Outlier detection for high dimensional data.
SIGMOD'01.
R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.
I. Ben-Gal. Outlier detection. In: O. Maimon and L. Rockach (eds.), Data Mining
and Knowledge Discovery Handbook: A Complete Guide for Practitioners and
Researchers, Kluwer Academic, 2005.
M.M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-
based local outliers. SIGMOD'00.
D. Barbara, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data
mining intrusion detection system. SAC'03.
Z.A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for
outlier detection techniques in data mining. IEEE Conf. on Cybernetics and
Intelligent Systems, 2006.
S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear
time with randomization and a simple pruning rule. KDD'03.
D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusion using
Bayesian estimators. SDM'01.
V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM
Computing Surveys, 41:1–58, 2009.
D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data
using negative selection algorithm. CEC'02.

E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric

framework for unsupervised anomaly detection: Detecting intrusions in
unlabeled data. Int. Conf. of Data Mining for Security Applications, 2002.
E. Eskin. Anomaly detection over noisy data using learned probability
distributions. ICML'00.
T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and
Knowledge Discovery, 1:291–316, 1997.
V.J. Hodge and J. Austin. A survey of outlier detection methodologies. Artif.
Intell. Rev., 22:85–126, 2004.
D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.
Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern
Recogn. Lett., 24, June, 2003.
W. Jin, K. H. Tung, and J. Han. Mining top-n local outliers in large databases.
KDD'01.

W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric

neighborhood relationship. PAKDD'06.
E. Knorr and R. Ng. A unified notion of outliers: Properties and computation.
KDD'97.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB'98.
E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and
applications. VLDB J., 8:237–253, 2000.
H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-
dimensional data. KDD'08.
M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical
approaches. Signal Process., 83:2481–2497, 2003.
M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network
based approaches. Signal Process., 83:2499–2521, 2003.
C. C. Noble and D. J. Cook. Graph-based anomaly detection. KDD'03.

S. Papadimitriou, H. Kitagawa, P.B. Gibbons, and C. Faloutsos. Loci: Fast

outlier detection using the local correlation integral. ICDE'03.
A. Patcha and J.-M. Park. An overview of anomaly detection techniques:
Existing solutions and latest technological trends. Comput. Netw., 51, 2007.
X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection.
IEEE Trans. on Knowl. and Data Eng., 19, 2007.
Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large
databases in any metric space. KDD'06.
N. Ye and Q. Chen. An anomaly detection technique based on a chi-square
statistic for detecting intrusions into information systems. Quality and Reliability
Engineering International, 17:105–112, 2001.
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A.
Biliris. Online data mining for co-evolving time sequences. ICDE'00.


KDD - Knowledge Discovery in Databases

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

KDD - Knowledge Discovery in Databases

Uploaded by

Copyright:

Available Formats

Lecture Knowledge Discovery in Databases

Lehrstuhl für Informatik 6 (Datenmanagement)

 Why data mining?

Summer Semester 2018

 The Explosive Growth of Data:

Summer Semester 2018

 Before 1600, Empirical Science

Summer Semester 2018

 1990-now, Data Science

Summer Semester 2018

 Why data mining?

Summer Semester 2018

 Data mining (knowledge discovery from data):

Summer Semester 2018

 This is a view from typical Visualization,

Data Warehouse Selection

Summer Semester 2018

 Web mining usually involves

Summer Semester 2018

Data Presentation Business

Data Preprocessing/Integration, Data Warehouses

Summer Semester 2018

Summer Semester 2018

Input Data Data Pre- Data Post-

Data integration Pattern discovery Pattern evaluation

 This is a view from typical machine-learning and statistics communities.

Summer Semester 2018

 Health-care & medical data mining

Summer Semester 2018

 "CRoss-Industry Standard Process for Data Mining"

Modeling Evaluation Deployment

Summer Semester 2018

 Why data mining?

Summer Semester 2018

Summer Semester 2018

 Why data mining?

Summer Semester 2018

 Database-oriented data sets and applications

Summer Semester 2018

 Why data mining?

Summer Semester 2018

 Information integration and data-warehouse construction

Summer Semester 2018

 Frequent patterns (or frequent itemsets)

Summer Semester 2018

 Classification and (class-) label prediction

Summer Semester 2018

 Definition of similarity or proximity as input!

Summer Semester 2018

Summer Semester 2018

 Sequence, trend, and evolution analysis

Summer Semester 2018

Summer Semester 2018

 Is all mined knowledge interesting?

Summer Semester 2018

 Why data mining?

Summer Semester 2018

Summer Semester 2018

 Tremendous amount of data

 Why data mining?

Summer Semester 2018

Summer Semester 2018

 Why data mining?

Summer Semester 2018