Chapter 1. Introduction
Prof. Dr. Klaus Meyer-Wegener
Summer Semester 2018
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for
Online Science, Comm. ACM 45(11): 50-54, Nov. 2002
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS products
Advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream-data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Figure: the knowledge-discovery pyramid — data cleaning and data integration at the bottom, selection of task-relevant data above it, then data exploration (statistical summary, querying, and reporting), up to decision making by the end user; the potential to support business decisions increases toward the top.
Business-intelligence view
Warehouse, data cube, reporting
But not much mining
Business objects vs. data-mining tools
Supply-chain example: tools
Data presentation
Exploration
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy),
data warehouse, transactional data, stream, spatiotemporal, time-series,
sequence, text and Web, multi-media, graphs & social and information
networks
Knowledge to be mined (or: Data-mining functions)
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance computing, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock-
market analysis, text mining, Web mining, etc.
Unsupervised learning
I.e., class (cluster) label is unknown
Group data to form new categories (i.e., clusters)
E.g., cluster houses to find distribution patterns
Principle:
Maximize intra-class similarity &
minimize interclass similarity
Many methods and applications
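A toy k-means sketch (Python) of this principle: points are assigned to their nearest centroid (high intra-cluster similarity), and centroids are then recomputed. The 2-D "house" coordinates and k are made up for illustration.

```python
# Minimal k-means sketch: assign points to the nearest centroid, recompute centroids.
def kmeans(points, k, iterations=10):
    centroids = points[:k]                      # naive initialization: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

houses = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(houses, k=2)
print(centroids)   # one centre per group of houses
```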
Outlier:
A data object that does not comply with the general behavior of the data
Noise or exception?
One person's garbage could be another person's treasure.
Methods:
By-product of clustering or regression analysis, …
Useful in fraud detection, rare-events analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (Web fragments)
Information-network analysis
Social networks: actors (objects, nodes) and relationships (edges)
E.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family,
classmates, …
Links carry a lot of semantical information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, …
Data mining sits at the confluence of multiple disciplines: machine learning, pattern recognition, statistics, visualization, applications, algorithms, high-performance computing, and database technology.
Web-page analysis:
From Web-page classification, clustering to PageRank & HITS algorithms
HITS = Hyperlink-Induced Topic Search
Collaborative analysis & recommender systems
Basket-data analysis for targeted marketing
Biological and medical data analysis:
Classification, cluster analysis (microarray data analysis), biological
sequence analysis, biological network analysis
Data mining and software engineering
E.g., IEEE Computer, Aug. 2009 issue
From major dedicated data-mining systems/tools
E.g., SAS, MS SQL-Server Analysis Manager, Oracle Data-Mining Tools
To invisible data mining
Mining methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Web and IR
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conferences: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Data mining:
Discovering interesting patterns and knowledge
from massive amounts of data
A natural evolution of database technology
In great demand, with wide applications
KDD process includes:
Data cleaning, data integration, data selection, transformation, data mining,
pattern evaluation, and knowledge presentation
Mining can be performed in a variety of data.
Data-mining functionalities:
Characterization, discrimination, association, classification, clustering, outlier
and trend analysis, etc.
Data-mining technologies and applications
Major issues in data mining
Recommended Reference Books
Colin Shearer: The CRISP-DM Model: The New Blueprint for Data Mining.
Journal of Data Warehousing, vol. 5, no. 4, Fall 2000, pp. 13-22
Tao Xie, Suresh Thummalapenta, David Lo, and Chao Liu: Data Mining for
Software Engineering. IEEE Computer, August 2009, pp. 55-62
Rob J. Hyndman and George Athanasopoulos: Forecasting: Principles and
Practice. 2nd ed. Monash University, Australia, April 2018.
https://otexts.org/fpp2/
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents, represented as term-frequency vectors
(e.g., counts of terms such as team, coach, play, ball, score, game, win, lost, timeout, season)
Transaction data
Graph and network
Dimensionality
Curse of dimensionality (sparse high-dimensional data spaces)
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
Attribute
Also called: field, dimension, feature, variable, …
A data field, representing a characteristic or feature of a data object
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numerical: quantitative
Interval-scaled
Ratio-scaled
Nominal
Categories, states, or "names of things"
E.g., hair_color = {auburn, black, blond, brown, grey, red, white}
Other examples: marital status, occupation, ID number, ZIP code
Binary
Nominal attribute with only two states (0 and 1)
Symmetric binary: both outcomes equally important
E.g., gender
Asymmetric binary: outcomes not equally important
E.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking),
but magnitude between successive values is not known.
E.g., size = {small, medium, large}, grades, army rankings
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., ZIP code, profession, or the set of words in a collection of
documents
Sometimes represented as integer variables
Note: Binary attributes are a special case of discrete attributes.
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite
number of digits.
Continuous attributes are typically represented as floating-point variables.
Motivation
To better understand the data: central tendency, variation, and spread
Data-dispersion characteristics
Median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals.
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Mean (sample and population):
x̄ = (1/n) Σ_{i=1}^{n} x_i        μ = (Σ x) / N
Trimmed mean: chopping extreme values
Weighted arithmetic mean:
x̄ = ( Σ_{i=1}^{n} w_i x_i ) / ( Σ_{i=1}^{n} w_i )
Median:
Middle value if odd number of values,
or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
median = L1 + ( (N/2 − (Σ freq)_l) / freq_median ) × width
with L1 = lower boundary of the median interval,
(Σ freq)_l = sum of the frequencies of all intervals lower than the median interval,
freq_median = frequency of the median interval, width = width of the median interval
Mode:
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: mean − mode ≈ 3 × (mean − median)
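A small Python sketch of these measures of central tendency; the sample values are made up for illustration.

```python
# Mean, weighted mean, median, and mode on a made-up sample.
from statistics import mean, median, mode

values  = [30, 36, 47, 50, 52, 52, 52, 56, 60, 70, 70, 110]
weights = [1] * len(values)

weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

print(mean(values), weighted_mean)   # identical here, since all weights are 1
print(median(values))                # even number of values -> average of the middle two
print(mode(values))                  # 52, the most frequent value
```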
Symmetrical vs. Skewed Data
Variance and standard deviation (population):
σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)² = (1/N) Σ_{i=1}^{N} x_i² − μ²
Five-number summary of a
distribution
Minimum, Q1, median, Q3,
maximum
Boxplot
Data is represented with a box.
The ends of the box are at the
first and third quartiles, i.e., the
height of the box is IQR.
The median is marked by a line
within the box.
Whiskers: two lines outside the
box extended to minimum and
maximum
Outliers: points beyond a
specified outlier threshold,
plotted individually
Visualization of Data Dispersion: 3-D Boxplots
Boxplot:
Graphic display of five-number summary
Histogram:
X-axis are values, y-axis repres. frequencies
Quantile plot:
Each value xi is paired with fi, indicating that approximately 100·fi % of the data
are ≤ xi
Quantile-quantile (q-q) plot:
Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
Scatter plot:
Each pair of values is a pair of coordinates and plotted as points in the plane.
(Naumann 2013)
More from the database perspective
Derive metadata such as
data types and value patterns,
completeness and uniqueness of columns,
keys and foreign keys, and
occasionally functional dependencies and association rules;
discovery of inclusion dependencies and conditional functional
dependencies.
Statistics:
number of null values, and distinct values in a column,
data types,
most frequent patterns of values.
Figure: pixel-oriented visualization of four attributes — (a) Income, (b) Credit Limit, (c) Transaction Volume, (d) Age

Laying Out Pixels in Circle Segments
To save space and show the connections among multiple dimensions, space
filling is often done in a circle segment.
References:
Gonick, L. and Smith, W.
The Cartoon Guide to Statistics.
New York: Harper Perennial, 1993, p. 212
Weisstein, Eric W. "Chernoff Face."
From MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
Stick Figure
A census data
figure showing age,
income, gender,
education, etc.
A 5-piece stick
figure (1 body and 4
limbs w. different
angle/length)
Screen-filling method
Uses a hierarchical partitioning of the screen
into regions depending on the attribute values
X- and y-dimension of the screen partitioned alternately
according to the attribute values (classes)
3D cone-tree visualization
technique
works well for up to a thousand
nodes or so
First build a 2D circle tree that
arranges its nodes in concentric
circles centered on the root node
Cannot avoid overlaps when
projected to 2D
G. Robertson, J. Mackinlay, S.
Card. “Cone Trees: Animated 3D
Visualizations of Hierarchical
Information”, ACM SIGCHI'91
Graph from Nadeau Software
Consulting website: Visualize a
social network data set that
models the way an infection
spreads from one person to the
next
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
E.g., distance
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity
refers to similarity or dissimilarity
Data matrix
n data points with p dimensions; a two-mode matrix:

    [ x_11 … x_1f … x_1p ]
    [  …    …   …   …    ]
    [ x_i1 … x_if … x_ip ]
    [  …    …   …   …    ]
    [ x_n1 … x_nf … x_np ]

Dissimilarity matrix
n data points, but registers only distances; a triangular, one-mode matrix:

    [ 0                              ]
    [ d(2,1)   0                     ]
    [ d(3,1)   d(3,2)   0            ]
    [   :        :      :            ]
    [ d(n,1)   d(n,2)   …   …   0    ]
Proximity Measure for Nominal Attributes
Simple matching:
d(i, j) = (p − m) / p
where m = number of attributes on which i and j match, p = total number of attributes

Proximity Measure for Binary Attributes
A contingency table for binary data (counting matches and mismatches):
                    Object j
                     1    0
Object i      1      q    r
              0      s    t
Distance measure for symmetric binary variables:   d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables:  d(i, j) = (r + s) / (q + r + s)
Jaccard coefficient (similarity measure for asymmetric binary variables):
sim_Jaccard(i, j) = q / (q + r + s)
Note: the Jaccard coefficient is the same as "coherence".
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
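These asymmetric-binary distances can be reproduced with a few lines of Python (Y/P coded as 1, N as 0; the symmetric attribute gender is left out):

```python
# Asymmetric binary distance d = (r + s) / (q + r + s); 0/0 matches are ignored.
def asym_binary_distance(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)   # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)   # 1 in a, 0 in b
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)   # 0 in a, 1 in b
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # fever, cough, test-1 .. test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_distance(jack, mary), 2))  # 0.33
print(round(asym_binary_distance(jack, jim), 2))   # 0.67
print(round(asym_binary_distance(jim, mary), 2))   # 0.75
```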
Standardizing Numerical Data
Z-score:
z = (x − μ) / σ
x: raw score to be standardized, μ: mean of the population, σ: standard deviation
The distance between the raw score and the population mean in units of the standard deviation
Negative when the raw score is below the mean, "+" when above
An alternative way: calculate the mean absolute deviation
s_f = (1/n) ( |x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f| )
where m_f = (1/n) ( x_1f + x_2f + … + x_nf )
Standardized measure (z-score):
z_if = (x_if − m_f) / s_f
Data Matrix
point   attribute1   attribute2
x1          1            2
x2          3            5
x3          2            0
x4          4            5

Dissimilarity Matrix (with Euclidean distance)
       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.10   0
x4     4.24   1.00   5.39   0
Minkowski distance:
d(i, j) = ( |x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h )^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order
The distance so defined is also called the L_h norm.
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric.
Dissimilarity Matrices
point   attribute 1   attribute 2
x1          1             2
x2          3             5
x3          2             0
x4          4             5

Manhattan (L1)
       x1    x2    x3    x4
x1     0
x2     5     0
x3     3     6     0
x4     6     1     7     0

Euclidean (L2)
       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.10   0
x4     4.24   1.00   5.39   0

Supremum (L∞)
       x1    x2    x3    x4
x1     0
x2     3     0
x3     2     5     0
x4     3     1     5     0
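A short Python sketch that reproduces these matrices from the Minkowski distance (h = 1, 2, ∞):

```python
# Minkowski distance for h = 1 (Manhattan), 2 (Euclidean), and infinity (supremum).
def minkowski(p, q, h):
    if h == float("inf"):                       # supremum (L_max) norm
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for name, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum", float("inf"))]:
    print(name)
    for i in points:
        row = [round(minkowski(points[i], points[j], h), 2) for j in points]
        print(i, row)
```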
Ordinal Variables
Replace each value x_if by its rank r_if ∈ {1, …, M_f} and map onto [0, 1] by
z_if = (r_if − 1) / (M_f − 1)
Treat z_if as interval-scaled.

Attributes of mixed types can be combined with a weighted formula:
d(i, j) = ( Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} δ_ij^(f) )
f is binary or nominal:
d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
f is numeric: use the normalized distance
f is ordinal: compute ranks r_if, set z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled
Cosine similarity: sim(d1, d2) = (d1 · d2) / (‖d1‖ × ‖d2‖)
Example with two term-frequency vectors:
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
‖d1‖ = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.48
‖d2‖ = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 = 4.12
sim(d1, d2) = 25 / (6.48 × 4.12) ≈ 0.94
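A small Python sketch of the same computation:

```python
# Cosine similarity of two term-frequency vectors.
from math import sqrt

def cosine_sim(a, b):
    dot    = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_sim(d1, d2), 2))   # 0.94
```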
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept-hierarchy generation
Noise:
Random error or variance in a measured variable
Stored value a little bit off the real value, up or down
Leads to (slightly) incorrect attribute values
May be due to:
Faulty data-collection instruments
Data-entry problems
Data-transmission problems
Technology limitation
Inconsistency in naming conventions
Binning
First sort data and partition into (equal-frequency) bins
Then smooth by bin mean, by bin median, or by bin boundaries
Regression
Smooth by fitting the data to regression functions
Clustering
Detect and remove outliers
Combined computer and human inspection
Detect suspicious values and check by human
E.g., deal with possible outliers
Data-discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule, and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-
check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and relationships to
detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
Data-migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., the Potter's Wheel tool)
Chapter 3: Data Preprocessing
Data integration
Combine data from multiple sources into a coherent store
Schema integration
E.g., A.cust-id B.cust-#
Integrate metadata from different sources
Entity-identification problem:
Identify the same real-world entities from multiple data sources
E.g., Bill Clinton = William Clinton
Detecting and resolving data-value conflicts
For the same real world entity, attribute values from different sources are
different.
Possible reasons:
Different representations (coding)
Different scales, e.g., metric vs. British units
Two attributes:
A has c distinct values: a1, a2, …, ac
B has r distinct values: b1, b2, …, br
Contingency table:
Columns: the c values of A
Rows: the r values of B
Cells: counts of records with A = ai and B = bj
Expected count in cell (i, j):
e_ij = ( count(A = a_i) × count(B = b_j) ) / n
χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
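A Python sketch of this computation; the 2×2 contingency-table counts below are made up for illustration.

```python
# Chi-square statistic from a contingency table (rows: values of B, columns: values of A).
observed = [[250, 200],          # hypothetical counts
            [ 50, 1000]]

n          = sum(sum(row) for row in observed)
col_totals = [sum(row[j] for row in observed) for j in range(len(observed[0]))]
row_totals = [sum(row) for row in observed]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = col_totals[j] * row_totals[i] / n   # e_ij = count(A=a_j) * count(B=b_i) / n
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 1))   # a large value suggests A and B are correlated
```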
Correlation coefficient
Also called Pearson's product-moment coefficient
r_{A,B} = ( Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄) ) / ( (n − 1) σ_A σ_B )
Scatter plots show the correlation, ranging from −1 to +1.
Covariance:
Cov(A, B) = E( (A − Ā)(B − B̄) ) = ( Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄) ) / n
Can be simplified in computation as:
Cov(A, B) = E(A · B) − Ā·B̄
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
If the stocks are affected by the same industry trends, will their prices rise or
fall together?
E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A,B) > 0.
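The same covariance computation in a few lines of Python:

```python
# Covariance via Cov(A, B) = E(A*B) - mean(A)*mean(B) for the stock example.
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
n = len(a)

mean_a = sum(a) / n
mean_b = sum(b) / n
cov    = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

print(mean_a, mean_b, cov)   # 4.0  9.6  4.0  -> positive, so A and B tend to rise together
```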
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse.
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful.
The possible combinations of subspaces will grow exponentially.
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality-reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Fourier transform
Wavelet transform
Decomposes a signal
into different frequency
subbands
Applicable to n-dimensional
signals
Data transformed
to preserve relative distance
between objects
at different levels of resolution
Allow natural clusters to
become more distinguishable
Used for image compression
Example:
X = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
X' = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
Compression:
Many small detail coefficients can be replaced by 0's,
and only the significant coefficients are retained.
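A minimal Python sketch of the (unnormalized) Haar wavelet transform that reproduces this example:

```python
# Unnormalized Haar transform: repeatedly replace pairs by their average and half-difference.
def haar(signal):
    data, details = list(signal), []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs    = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details  = diffs + details      # finer-level detail coefficients go to the right
        data     = averages
    return data + details               # [overall average, detail coefficients ...]

print(haar([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```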
Coefficient "Supports"
Hierarchical decomposition structure (a.k.a. "error tree"):
Figure: the overall average 2.75 at the root; the detail coefficients −1.25, 0.5, 0, 0, −1, −1, 0 at the
inner nodes; the original values 2, 2, 0, 2, 3, 5, 4, 4 at the leaves; "+" and "−" edges indicate how
each coefficient contributes to the values below it.

Why Wavelet Transform?
Find a projection
that captures the largest amount of variation in data.
The original data are projected onto a much smaller space, resulting in
dimensionality reduction.
We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space.
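A minimal PCA sketch along these lines (centre the data, take the eigenvectors of the covariance matrix, project); the toy dataset is made up for illustration.

```python
# PCA via the eigenvectors of the covariance matrix.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)          # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)         # symmetric matrix, eigenvalues ascending
order = np.argsort(eigvals)[::-1]              # largest variance first
components = eigvecs[:, order]

projected = X_centred @ components[:, :1]      # keep only the first principal component
print(projected.ravel())
```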
Linear regression
Data modeled to fit a straight line
Often uses the least-square method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a linear function
of a multidimensional feature vector (x1, …, xn)
Log-linear model
Approximates discrete multidimensional probability distributions
Regression analysis:
A collective name for techniques for the modeling and analysis of numerical data
consisting of values of a dependent variable (also called response variable or
measurement) and of one or more independent variables (a.k.a. explanatory
variables or predictors)
The parameters are estimated so as to give a "best fit" of the data.
Most commonly the best fit is evaluated by using the least-squares method,
but other criteria have also been used.
Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
Figure: data points (X1, Y1) with a fitted regression line y = x + 1

Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be estimated
by using the data at hand.
Using the least-squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above.
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional space
for a set of discretized attributes,
based on a smaller subset of dimensional combinations
Useful also for dimensionality reduction and data smoothing
String compression
There are extensive theories and well-tuned algorithms.
Typically lossless, but only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole.
Time sequence is not audio
Typically short and varies slowly with time
Dimensionality and numerosity reduction may also be considered
as forms of data compression.
A function
that maps the entire set of values of a given attribute
to a new set of replacement values
s.t. each old value can be identified with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction: New attributes constructed from the given ones
Aggregation: Summarization, data-cube construction
Normalization: Scaled to fall within a smaller, specified range
Min-max normalization
Z-score normalization
Normalization by decimal scaling
Discretization: Concept-hierarchy climbing
Min-max normalization:
to [new_minA, new_maxA]
v' = ( (v − min_A) / (max_A − min_A) ) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
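A small Python sketch reproducing the three normalizations for the income example:

```python
# Min-max, z-score, and decimal-scaling normalization of income v = 73,600.
v, min_a, max_a = 73_600, 12_000, 98_000
mu, sigma = 54_000, 16_000

min_max = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0       # -> 0.716
z_score = (v - mu) / sigma                                        # -> 1.225

j = 0
while max(abs(x) for x in (12_000, 98_000)) / 10 ** j >= 1:       # decimal scaling
    j += 1
decimal_scaled = v / 10 ** j                                      # j = 5 -> 0.736

print(round(min_max, 3), round(z_score, 3), decimal_scaled)
```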
Discretization
Typical methods:
All the methods can be applied recursively.
Binning
Unsupervised, top-down split
Histogram analysis
Unsupervised, top-down split
Clustering analysis
Unsupervised, top-down split or bottom-up merge
Decision-tree analysis
Supervised, top-down split
Correlation (e.g., χ²) analysis
Unsupervised, bottom-up merge
Classification
E.g., decision-tree analysis
Supervised: Class labels given for training set
E.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
Details will be covered in Chapter 6
Correlation analysis
E.g., Chi-merge: χ²-based discretization
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those having similar
distributions of classes, i.e., low χ² values) to merge
Merge performed recursively, until a predefined stopping condition
Concept hierarchy
Organizes concepts (i.e., attribute values) hierarchically
Usually associated with each dimension in a data warehouse
Facilitates drilling and rolling in data warehouses
to view data at multiple granularity
Concept-hierarchy formation
Recursively reduce the data
by collecting and replacing low-level concepts
(such as numerical values for age)
by higher-level concepts
(such as youth, adult, or senior)
Can be explicitly specified by domain experts and/or data-warehouse
designers
Can be automatically formed for both numerical and nominal data
For numerical data, use discretization methods shown.
Data quality
Accuracy, completeness, consistency, timeliness, believability, interpretability
Data cleaning
E.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept-hierarchy generation
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical; summarized,
                    detailed, flat relational;    multidimensional; integrated,
                    isolated                      consolidated
usage               repetitive                    ad-hoc
access              read/write;                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100 MB - GB                   100 GB - TB
metric              transaction throughput        query throughput, response time
Figure: data-warehouse architecture — other sources are brought in via a monitor and integrator component; metadata, the enterprise warehouse, data marts, and OLAP servers sit on top.
Enterprise Warehouse
Collects all of the information about subjects
spanning the entire organization
Data Mart
A subset of corporate-wide data
that is of value to a specific group of users.
Its scope is confined to specific, selected groups,
such as marketing data mart.
Independent vs. dependent (directly from warehouse) data mart
Virtual Warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized.
Extraction
Get data from multiple, heterogeneous, and external sources
Cleaning
Detect errors in the data and rectify them if possible
Transformation
Convert data from legacy or host format to warehouse format
Loading
Sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
Refresh
Propagate only the updates from the data sources to the warehouse
Data warehouse
Based on a multidimensional data model
which views data in the form of a data cube
Data cube
Allows data (here: sales) to be modeled and viewed in multiple dimensions
Dimension tables
Such as: item (item_name, brand, type),
or: time (day, week, month, quarter, year)
Fact table
Contains measures (such as dollars_sold)
and references (foreign keys) to each of the related dimension tables
n-D base cube
Called a base cuboid in data-warehousing literature
Top most 0-D cuboid
Holds the highest-level of summarization
Called the apex cuboid
Lattice of cuboids
Forms a data cube
Figure: lattice of cuboids forming the data cube — from the 0-D apex cuboid ("all") down to the n-D base cuboid

Example of Star Schema
Sales fact table: time_key, item_key, branch_key, location_key (foreign keys to the dimension tables);
measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)

Example of Snowflake Schema
Like the star schema, but with normalized dimension tables, e.g.:
item (item_key, item_name, brand, supplier_key) referencing supplier (supplier_key, supplier_type)
location (location_key, street, city_key) referencing city (city_key, city, province_or_state, country)
Sales fact table and measures (units_sold, dollars_sold, avg_sales) as before

Example of Fact Constellation
Multiple fact tables share dimension tables, e.g.:
Sales fact table (time_key, item_key, branch_key, location_key; units_sold, dollars_sold, avg_sales)
Shipping fact table (time_key, item_key, shipper_key, from_location, to_location; dollars_cost, units_shipped)
Shared dimensions: time, item, location; additional dimension: shipper (shipper_key, shipper_name, location_key, shipper_type)

A Concept Hierarchy: Dimension (Location)
Figure: hierarchy from "all" down to country, province_or_state, city, and office (e.g., L. Chan, M. Wind)

Data-Cube Measures: Three Categories
Distributive:
If the result derived by applying the function to the n aggregate values
obtained for n partitions of the dataset
is the same as that derived by applying the function
on all the data without partitioning
E.g., count(), sum(), min(), max()
Algebraic:
If it can be computed by an algebraic function with M arguments
(where M is a bounded integer),
each of which is obtained by applying a distributive aggregate function
E.g., avg(), min_N(), standard_deviation()
Holistic:
If there is no constant bound on the storage size needed to describe a
subaggregate
E.g., median(), mode(), rank()
Non-trivial Property
Next to name and value range
Defines the set of aggregation operations
that can be executed on a measure (a fact)
FLOW
Any aggregation
E.g., sales, turnover
STOCK
No temporal aggregation
E.g., stock, inventory
VPU (Value per Unit)
No summarization
E.g., price, tax, in general factors
(Always applicable: minimum, maximum, and average.)
View of Warehouses and Hierarchies
Specification of hierarchies:
Schema hierarchy
day < {month < quarter; week} < year
Set-grouping hierarchy
{1..10} < inexpensive

Multidimensional Data
Figure: sales volume as a function of product, month, and region, with hierarchical summarization paths along each dimension (e.g., office, day, month)

A Sample Data Cube
Figure: a data cube of sales with dimensions date, product, and country, and "sum" aggregations along each dimension and combination of dimensions (e.g., totals for Canada and Mexico)
Cuboids corresponding to the cube: the 0-D (apex) cuboid "all"; the 1-D cuboids product, date, and country; the 2-D cuboids; and the 3-D base cuboid

Figure: a starnet query model — each radial line is a dimension (time, product, location, organization, customer, shipping method, promotion) with its hierarchy levels (e.g., daily < quarterly < annually; product item < product group < product line; city, country, district, region, division; sales person; order, contracts; truck, air-express); each circle on a line is called a footprint.

Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
Chapter 4: Data Warehousing and Online Analytical Processing
Figure: data-warehouse development — from the enterprise data warehouse and distributed data marts to a multi-tier data warehouse, with repeated model refinement.
Summarize data
By replacing relatively low-level values
E.g., numerical values for the attribute age
with higher-level concepts
E.g., young, middle-aged, and senior
Proposed in 1989
KDD'89 workshop
Not confined to categorical data nor to particular measures
How is it done?
Collect the task-relevant data (initial relation)
using a relational database query
Perform generalization
by attribute removal or attribute generalization
Apply aggregation
by merging identical, generalized tuples
and accumulating their respective counts
Interaction with users
for knowledge presentation
Example:
Describe general characteristics of graduate students
in a University database
Step 1:
Fetch relevant set of data using an SQL statement, e.g.,
SELECT name, gender, major, birth_place, birth_date,
       residence, phone#, gpa
FROM student
WHERE student_status IN {"MSc", "MBA", "PhD"};
Step 2:
Perform attribute-oriented induction
Step 3:
Present results in generalized-relation, cross-tab, or rule forms
Cross-table (Gender × Birth_Region):
Gender   Canada   Foreign   Total
M          16        14       30
F          10        22       32
Total      26        36       62
Data focusing:
Task-relevant data, including dimensions
The result is the initial relation.
Attribute removal:
Remove attribute A, if there is a large set of distinct values for A
but (1) there is no generalization operator on A,
or (2) A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization:
If there is a large set of distinct values for A,
and there exists a set of generalization operators on A,
then select an operator and generalize A.
Attribute-threshold control:
Typical 2-8, specified/default
Generalized-relation-threshold control:
Control the final relation/rule size
Attribute-Oriented Induction: Basic Algorithm
InitialRel:
Query processing of task-relevant data, deriving the initial relation
PreGen:
Based on the analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute:
removal? or how high to generalize?
PrimeGen:
Based on the PreGen plan,
perform generalization to the right level
to derive a "prime generalized relation",
accumulating the counts.
Presentation:
User interaction:
(1) adjust levels by drilling,
(2) pivoting,
(3) mapping into rules, cross tabs, visualization presentations.
Presentation of Generalized Results
Generalized relation:
Relations where some or all attributes are generalized,
with counts or other aggregation values accumulated
Cross tabulation:
Mapping results into cross-tabulation form (similar to contingency tables)
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms
Quantitative characteristic rules:
Mapping generalized result into characteristic rules
with quantitative information associated with it, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
Similarity:
Data generalization
Presentation of data summarization at multiple levels of abstraction
Interactive drilling, pivoting, slicing and dicing
Differences:
OLAP uses systematic preprocessing, is query-independent, and can drill down
to rather low levels.
AOI automatically allocates the desired generalization level
and may perform dimension-relevance analysis/ranking
when there are many relevant dimensions.
AOI also works on data that are not in relational form.
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Frequent pattern:
A pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a dataset
First proposed by (Agrawal, Imielinski & Swami, AIS'93)
in the context of frequent itemsets and association-rule mining
Motivation: Finding inherent regularities in data
What products are often purchased together?—Beer and diapers?!
What are the subsequent purchases after buying a PC?
"Who bought this has often also bought … "
What kinds of DNA are sensitive to this new drug?
Can we automatically classify Web documents?
Applications:
Basket-data analysis, cross-marketing, catalog design, sale-campaign
analysis, Web-log (click-stream) analysis, and DNA-sequence analysis
Solution:
Mine closed itemsets and max-itemsets instead
An itemset X is closed, if X is frequent
and there exists no super-itemset Y ⊃ X with the same support as X.
Proposed by (Pasquier et al., ICDT'99)
An itemset X is a max-itemset, if X is frequent
and there exists no frequent super-itemset Y ⊃ X.
Proposed by (Bayardo, SIGMOD'98)
Closed itemset is a lossless compression of frequent itemsets.
Reducing the number of itemsets (and rules)
Example:
DB = {<a1, …, a100>, < a1, …, a50>}
I.e., just two transactions
min_sup = 1
What are the closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
Number behind the colon: support_count
What are the max-itemsets?
<a1, …, a100>: 1
What is the set of all frequent itemsets?
All non-empty sub-itemsets of <a1, …, a100> — far too many to enumerate!
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Example (min_sup = 2):

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (frequent 1-itemsets): {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
L2 (frequent 2-itemsets): {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}: 2

The Apriori Algorithm (Pseudo-Code)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_sup;
end;
return ∪k Lk;
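A minimal Python sketch of this algorithm, run on the transaction database TDB from the example above (min_sup = 2); an illustration, not an optimized implementation.

```python
# Apriori: level-wise candidate generation and support counting.
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def support_count(itemsets):
    return {s: sum(1 for t in transactions if s <= t) for s in itemsets}

items = {item for t in transactions for item in t}
counts = support_count({frozenset([i]) for i in items})
L = {s for s, c in counts.items() if c >= min_sup}       # L1
all_frequent = {s: counts[s] for s in L}

k = 1
while L:
    # join step + prune step: every k-subset of a candidate must be frequent
    candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(sub) in L for sub in combinations(c, k))}
    counts = support_count(candidates)
    L = {s for s, c in counts.items() if c >= min_sup}
    all_frequent.update((s, counts[s]) for s in L)
    k += 1

for itemset, sup in sorted(all_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)        # reproduces L1, L2, and L3 from the example
```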
Figure: candidate counting with a hash tree — candidate itemsets are hashed on each item (e.g., items 1,4,7 / 2,5,8 / 3,6,9 into three branches), and the subset function walks the tree for transaction {1, 2, 3, 5, 6} to find the candidates it contains.
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Construct an FP-tree from the transaction database (min_sup = 3):
1. Scan the DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order, creating the f-list
3. Scan the DB again, construct the FP-tree
F-list = f-c-a-b-m-p
Header table (item : frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
Figure: the resulting FP-tree with root {} and node links from the header table into the tree

Partition Itemsets and Databases
Conditional pattern bases, derived by following the node links in the FP-tree:
item   conditional pattern base
c      f: 3
a      fc: 3
b      fca: 1, f: 2, c: 1
m      fca: 3, fcab: 1
p      fcam: 2, cb: 1

p's Conditional Pattern Base
Figure: the FP-tree restricted to p's prefix paths
Hence, p's conditional pattern base is fcam: 2, cb: 1; both below min_sup

Figure: the FP-tree restricted to m's prefix paths
Hence, m's conditional pattern base is fca: 3, fcab: 1; fca has min_sup

Figure: the FP-tree restricted to b's prefix paths
Hence, b's conditional pattern base is fca: 1, f: 2, c: 1; all below min_sup

Figure: the FP-tree restricted to a's prefix paths
Hence, a's conditional pattern base is fc: 3; has min_sup
(could have used the m-conditional pattern base here (recursion))

From Conditional Pattern Bases to Conditional FP-Trees
(min_sup = 3)
m's conditional pattern base: fca: 3, fcab: 1
m's conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped: its count 1 is below min_sup)
All frequent patterns related to m:
m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree
Conditional pattern base of "am": (fc:3) → am's conditional FP-tree: {} → f:3 → c:3
Conditional pattern base of "cm": (f:3) → cm's conditional FP-tree: {} → f:3
Conditional pattern base of "cam": (f:3) → cam's conditional FP-tree: {} → f:3

A Special Case: Single Prefix Path in FP-tree
Figure: a (conditional) FP-tree whose upper part is a single prefix path a1:n1 → a2:n2 → a3:n3
leading into a multipath part r1 (branches b1:m1, C1:k1, C2:k2, C3:k3); the prefix-path part and
the multipath part can be mined separately and the results combined.

Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent-pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency-descending order:
The more frequently occurring, the more likely to be shared
Never larger than the original database
Not counting node links and the count fields
Example projected databases: am-proj DB = {fc, fc, fc}, cm-proj DB = {f, f, f}, …

Performance of FP-Growth in Large Datasets
Figure: runtime (seconds) vs. support threshold (%) on two synthetic datasets —
on data set T25I20D10K, FP-growth runs clearly faster than Apriori;
on data set T25I20D100K, FP-growth also outperforms TreeProjection.
Divide-and-conquer:
Decompose both the mining task and DB according to the frequent patterns
obtained so far
Leading to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local frequent items and building sub FP-tree, no pattern
search and matching
A good open-source implementation and refinement of FP-Growth:
FPGrowth+
(Grahne & Zhu, FIMI'03)
AFOPT
(Liu et al., KDD'03)
A "push-right" method for mining condensed frequent-pattern (CFP) tree
Carpenter
(Pan et al., KDD'03)
Mine datasets with small rows but numerous columns
Construct a row-enumeration tree for efficient mining
FPgrowth+
(Grahne & Zhu, FIMI'03)
Efficiently using prefix-trees in mining frequent itemsets
TD-Close
(Liu et al., SDM'06)
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Itemset merging:
If Y appears in each occurrence of X, then Y is merged with X.
Sub-itemset pruning:
If Y ⊇ X and sup(X) = sup(Y), then X and all of X's descendants in the set-enumeration
tree can be pruned.
Hybrid tree projection
Bottom-up physical tree projection
Top-down pseudo tree projection
Item skipping:
If a local frequent item has the same support in several header tables at
different levels, one can prune it from the header table at higher levels.
Efficient subset checking
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Support and confidence are not good indicators of correlation.
Over 20 interestingness measures have been proposed
(Tan, Kumar & Srivastava, KDD'02).
Which are the good ones?

Null-Invariant Measures
Null-transaction:
A transaction that does not contain any of the itemsets being examined
Can outweigh the number of individual itemsets
A measure is null-invariant,
if its value is free from the influence of null-transactions.
Lift and χ² are not null-invariant.
Figure: transaction counts for milk (m) and coffee (c) with varying numbers of null-transactions;
the Kulczynski measure (1927), Kulc(A, B) = (P(A|B) + P(B|A)) / 2, is null-invariant.
Advisor-advisee relation:
Kulc: high, coherence: low, cosine: middle
(Wu, Chen & Han, PKDD'07)
Which Null-Invariant Measure Is Better?
Kulczynski and IR together present a clear picture for all the three
datasets D4 through D6
D4 is balanced & neutral
D5 is imbalanced & neutral
D6 is very imbalanced & neutral
Basic Concepts
Scalable Frequent-Itemset-Mining Methods
Apriori: A Candidate-Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent-Pattern-Growth Approach
ECLAT: Frequent-Pattern Mining with Vertical Data Format
Mining Closed Itemsets and Max-Itemsets
Generating Association Rules from Frequent Itemsets
Which Patterns Are Interesting?—Pattern-Evaluation Methods
Summary
Basic concepts:
Association rules
Support-confidence framework
Closed and max-itemsets
Scalable frequent-itemset-mining methods
Apriori
Candidate generation & test
Projection-based
FPgrowth, CLOSET+, ...
Vertical-format approach
ECLAT, CHARM, ...
Association rules generated from frequent itemsets
Which patterns are interesting?
Pattern-evaluation methods
References: Basic Concepts of Frequent-Pattern Mining
(Association Rules)
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between
sets of items in large databases. SIGMOD'93
(Max-Itemset)
R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98
(Closed Itemsets)
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99
(Sequential Pattern)
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95
Chapter 6: Classification
Prof. Dr. Klaus Meyer-Wegener
Summer Semester 2018
Classification:
Predicts categorical class labels (discrete, nominal)
Constructs a model
based on the training set
and the values (class labels) in a classifying attribute
and uses it in classifying new data
Numerical Prediction:
Models continuous-valued functions
I.e., predicts missing or unknown (future) values
Typical applications of classification:
Credit/loan approval: Will it be paid back?
Medical diagnosis: Is a tumor cancerous or benign?
Fraud detection: Is a transaction fraudulent or not?
Web-page categorization: Which category is it?
Classification
Algorithms
Training
Data
Classifier
Figure: using the model — a new, unseen tuple such as (Jeff, Professor, 4) is handed to the learned classifier; a decision tree splits first on age, with yes/no leaves below.

Example:
SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
GainRatio(income) = Gain(income) / SplitInfo_income(D) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
Source: Wikipedia
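A small Python sketch reproducing the SplitInfo and gain-ratio numbers above; Gain(income) = 0.029 is taken as given from the information-gain computation on the same data.

```python
# Gain ratio: income splits the 14 training tuples into partitions of size 4, 6, and 4.
from math import log2

partition_sizes = [4, 6, 4]
n = sum(partition_sizes)

split_info = -sum(s / n * log2(s / n) for s in partition_sizes)
gain_ratio = 0.029 / split_info        # Gain(income) = 0.029 from the slides

print(round(split_info, 3), round(gain_ratio, 3))   # 1.557  0.019
```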
Gini Index (CART, IBM IntelligentMiner)
If a dataset D contains examples from n classes, the Gini index is
gini(D) = 1 − Σ_{j=1}^{n} p_j²   where p_j is the relative frequency of class j in D
If D is split on attribute A into two subsets D1 and D2:
gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
The attribute A that provides the smallest gini_A(D) (or the largest
reduction in impurity) is chosen to split the node.
Need to enumerate all the possible splitting points for each attribute.
Example:
D has 9 tuples in buys_computer = "yes" and 5 in "no"
gini(D) = 1 − (9/14)² − (5/14)² = 0.459
Splitting on income ∈ {low, medium} vs. {high}:
gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2)
                            = (10/14) (1 − (7/10)² − (3/10)²) + (4/14) (1 − (2/4)² − (2/4)²)
                            = 0.443
                            = gini_income∈{high}(D)
Example (cont.):
gini_income∈{low,high}(D) is 0.458 and gini_income∈{medium,high}(D) is 0.450.
Thus, split on {low, medium} vs. {high} since it has the lowest Gini index.
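A small Python sketch reproducing these Gini values:

```python
# Gini index for the buys_computer data (9 "yes", 5 "no") and the income split.
def gini(class_counts):
    n = sum(class_counts)
    return 1 - sum((c / n) ** 2 for c in class_counts)

gini_D = gini([9, 5])                                  # whole dataset
# income in {low, medium}: 10 tuples (7 yes, 3 no); income = high: 4 tuples (2 yes, 2 no)
gini_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])

print(round(gini_D, 3), round(gini_split, 3))          # 0.459  0.443
```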
CHAID:
A popular decision-tree algorithm, measure based on the χ² test for independence
C-SEP:
Performs better than information gain and Gini index in certain cases
G-statistic:
Has a close approximation to the χ² distribution
MDL (Minimal Description Length) principle:
I.e., the simplest solution is preferred.
The best tree is the one that requires the fewest number of bits
to both (1) encode the tree and (2) encode the exceptions to the tree.
Multivariate splits:
Partitioning based on multiple variable combinations
CART: finds multivariate splits based on a linear combination of attributes.
Which attribute-selection measure is the best?
Most give good results, none is significantly superior to others.
Overfitting and Tree Pruning

Training Set (class attribute: buys_computer)
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no

AVC-set on age                     AVC-set on income
        buys_computer                        buys_computer
age      yes  no                   income    yes  no
<=30      2    3                   high       2    2
31..40    4    0                   medium     4    2
>40       3    2                   low        3    1

AVC-set on student                 AVC-set on credit_rating
         buys_computer                            buys_computer
student   yes  no                  credit_rating  yes  no
yes        6    1                  fair            6    2
no         3    4                  excellent       3    3
A statistical classifier:
Performs probabilistic prediction,
i.e., predicts class-membership probabilities
Foundation: Bayes' Theorem
Performance:
A simple Bayesian classifier (naïve Bayesian classifier) has performance
comparable with decision tree and selected neural-network classifiers.
Incremental:
Each training example can incrementally increase/decrease the probability
that a hypothesis is correct—prior knowledge can be combined with
observed data.
Standard:
Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods
can be measured.
Bayes' Theorem: Basics
P(H | X) = P(X | H) P(H) / P(X)
Classification maximizes the posterior: predict the class Ci with the largest P(X | Ci) P(Ci).
If an attribute Ak is continuous-valued, P(xk | Ci) is usually estimated with a Gaussian
distribution with mean μ_Ci and standard deviation σ_Ci:
P(xk | Ci) = g(xk, μ_Ci, σ_Ci),  with  g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x−μ)² / (2σ²))

Naïve Bayesian Classifier: Training Dataset
Classes:
C1: buys_computer = yes
C2: buys_computer = no
Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Training data: the 14-tuple buys_computer dataset shown above
(attributes age, income, student, credit_rating).

Naïve Bayesian Classifier: An Example
P(Ci):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14= 0.357
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
Compute P(X | Ci) for each class:
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
P(X | Ci) :
P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X | Ci) x P(Ci):
P(X | buys_computer = yes) x P(buys_computer = yes) = 0.028
P(X | buys_computer = no) x P(buys_computer = no) = 0.007
Therefore, X belongs to class C1 (buys_computer = yes)
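A small Python sketch of the same computation, using the relative frequencies listed above:

```python
# Naive Bayes: multiply the class-conditional probabilities and the prior for each class.
priors = {"yes": 9 / 14, "no": 5 / 14}

cond = {   # P(attribute value | class), read off the training data
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

for cls in ("yes", "no"):
    p_x_given_c = 1.0
    for p in cond[cls].values():
        p_x_given_c *= p                      # naive assumption: attributes are independent
    print(cls, round(p_x_given_c * priors[cls], 3))
# yes 0.028   no 0.007  ->  X is classified as buys_computer = yes
```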
Avoiding the Zero-Probability Problem
Example:
Suppose a dataset with 1000 tuples,
income = low (0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
Add 1 to each case:
P(income = low) = 1 / 1003
P(income = medium) = 991 / 1003
P(income = high) = 11 / 1003
The "corrected" probability estimates are close to their "uncorrected"
counterparts.
Naïve Bayesian Classifier: Comments
Advantages:
Easy to implement
Good results obtained in most of the cases
Disadvantages:
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables.
E.g., hospital patients:
Profile: age, family history, etc.
Symptoms: fever, cough, etc.
Disease: lung cancer, diabetes, etc.
Cannot be modeled by Naïve Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
See textbook
Size ordering:
Assign the highest priority to the triggered rule
that has the "toughest" requirement
(i.e., the most attribute tests).
Class-based ordering:
Decreasing order of prevalence or misclassification cost per class
Rule-based ordering (decision list):
Rules are organized into one long priority list,
according to some measure of rule quality or by experts.
Figure: sequential covering — rule 1, rule 2, and rule 3 each cover a subset of the positive examples.

To generate a rule:
    while (true):
        find the best predicate p (attribute = value);
        if FOIL_gain(p) > threshold
        then add p to the current rule;
        else break;

Figure: general-to-specific rule growing — conjuncts are added step by step
(A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5), separating positive from negative examples.

How to Learn One Rule?
Danger of overfitting
Removing a conjunct (attribute test)
If pruned version of rule has greater quality,
assessed on an independent set of test tuples (called "pruning set")
FOIL uses:
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by rule R.
If FOIL_Prune is higher for the pruned version of R, prune R.
Evaluation metrics:
How can we measure accuracy?
Other metrics to consider?
Use test set of class-labeled tuples
instead of training set
when assessing accuracy
Methods for estimating a classifier's accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Comparing classifiers:
Confidence intervals
Cost-benefit analysis and ROC curves
Confusion Matrix:
Actual class \ Predicted class: C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
Precision:
Exactness — the fraction of tuples that are actually positive among those that the
classifier labeled as positive
precision = TP / (TP + FP)
Recall:
Completeness — the fraction of all positive tuples that the classifier labeled as positive
recall = TP / (TP + FN)
Perfect score is 1.0
Inverse relationship between precision and recall
F-measure (F1 or F-score): gives equal weight to precision and recall
F = 2 × precision × recall / (precision + recall)
Fβ: weighted measure of precision and recall;
assigns β times as much weight to recall as to precision
Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
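A small Python sketch of these metrics; the confusion-matrix counts are made up for illustration.

```python
# Precision, recall, F1, and F_beta from hypothetical confusion-matrix counts.
TP, FP, FN, TN = 90, 140, 210, 9560

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

beta = 2                                        # weight recall beta times as much as precision
f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(f_beta, 2))
```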
Classifier-Evaluation Metrics: Example
Holdout method
Given data is randomly partitioned into two independent sets:
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
At i-th iteration, use Di as test set and the others as training set.
Leave-one-out:
k folds, where k = # of tuples; for small-sized data
Stratified cross-validation:
Folds are stratified so that class distribution in each fold is approx. the
same as that in the initial data.
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
I.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set.
Several bootstrap methods, and a common one is .632 bootstrap
Data set with d tuples sampled d times, with replacement,
resulting in a training set of d samples.
The data tuples that did not make it into the training set end up forming the
test set.
About 63.2% of the original data end up in the bootstrap, and the
remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368).
Repeat the sampling procedure k times; overall accuracy of the model:
Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set )
(When two models M1 and M2 are compared with a t-test, k1 and k2 denote the numbers of
cross-validation samples used for M1 and M2, respectively.)

Estimating Confidence Intervals: Table for t-distribution
Symmetrical
Significance level
E.g., sig = 0.05 or 5%
means M1 & M2 are
significantly different for
95% of population
Confidence limit
z = sig / 2
M1
ROC (Receiver Operating Characteristics) curves:
  For visual comparison of classification models (e.g., M1 vs. M2)
  Originated from signal-detection theory
  Shows the trade-off between the true-positive rate and the false-positive rate
  The area under the ROC curve is a measure of the accuracy of the model.
  Rank the test tuples in decreasing order:
  the one that is most likely to belong to the positive class appears at the top of the list.
  The vertical axis represents the true-positive rate,
  the horizontal axis the false-positive rate;
  the plot also shows a diagonal line.
  A model with perfect accuracy has an area of 1.0.
  The closer the curve is to the diagonal line
  (i.e., the closer the area is to 0.5), the less accurate is the model.
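A plain-Python sketch of building a ROC curve by ranking test tuples on their predicted score and computing the area under it (toy scores and labels):

    def roc_points(scores, labels):
        # scores: higher = more likely positive; labels: 1 = positive, 0 = negative
        ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
        P = sum(labels)
        N = len(labels) - P
        tp = fp = 0
        points = [(0.0, 0.0)]                  # (false-positive rate, true-positive rate)
        for _, label in ranked:
            if label == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / N, tp / P))
        return points

    def auc(points):
        # area under the curve via the trapezoidal rule
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
    print(auc(pts))   # 1.0 = perfect model, 0.5 = the diagonal line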
Accuracy
Classifier accuracy: predicting class label
Speed
Time to construct the model (training time)
Time to use the model (classification/prediction time)
Robustness
Handling noise and missing values
Scalability
Efficiency in disk-resident databases
Interpretability
Understanding and insight provided by the model
Other measures
E.g., goodness of rules, such as decision-tree size or compactness of
classification rules
Ensemble methods
Use a combination of models
to increase accuracy
Combine a series
of k learned models,
M1, M2, …, Mk,
with the aim of creating
an improved model M*
Popular ensemble methods
Bagging:
Averaging the prediction over a collection of classifiers
Boosting:
Weighted vote with a collection of classifiers
Ensemble:
Combining a set of heterogeneous classifiers
Analogy:
Diagnosis based on multiple doctors' majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction.
The bagged classifier M* counts the votes
and assigns the class with the most votes to X
Prediction
Can be applied to the prediction of continuous values
by taking the average value of each prediction for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
For noisy data: not considerably worse, more robust
Proven to improve accuracy in prediction
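A sketch of bagging with decision trees (NumPy and scikit-learn assumed; the data set and the number of models are arbitrary):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    rng = np.random.default_rng(0)
    models = []
    for _ in range(25):                                   # k learned models M1..Mk
        idx = rng.integers(0, len(y_tr), size=len(y_tr))  # bootstrap sample Di
        models.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

    # M*: each Mi votes, the class with the most votes is assigned to X
    votes = np.array([m.predict(X_te) for m in models])   # shape (k, n_test)
    majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
    print("bagged accuracy:", (majority == y_te).mean())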
Analogy:
Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
How boosting works:
Weights are assigned to each training tuple.
A series of k classifiers is iteratively learned.
After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training tuples that
were misclassified by Mi.
The final M* combines the votes of each individual classifier, where the
weight of each classifier's vote is a function of its accuracy.
The boosting algorithm can be extended for numeric prediction.
Comparing with bagging:
Boosting tends to have greater accuracy, but it also risks overfitting the
model to misclassified data.
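A short sketch of boosting via scikit-learn's AdaBoostClassifier (assumed available); the tuple weights and the weighted vote are handled internally:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # 50 weak learners; after each round, misclassified tuples receive larger weights,
    # and each learner's vote is weighted according to its accuracy
    boost = AdaBoostClassifier(n_estimators=50, random_state=0)
    boost.fit(X_tr, y_tr)
    print("boosted accuracy:", boost.score(X_te, y_te))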
Random Forest (Breiman 2001):
Each classifier in the ensemble is a decision-tree classifier
and is generated using a random selection of attributes at each node
to determine the split.
During classification, each tree votes and the most popular class is returned.
Two methods to construct a random forest:
Forest-RI (random input selection):
Randomly select, at each node, F attributes as candidates for the split at the
node. The CART methodology is used to grow the trees to maximum size.
Forest-RC (random linear combinations):
Creates new attributes (or features) that are a linear combination of the
existing attributes (reduces the correlation between individual classifiers).
Comparable in accuracy to AdaBoost, but more robust to errors and
outliers
Insensitive to the number of attributes selected for consideration at
each split, and faster than bagging or boosting
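A usage sketch of a Forest-RI-style random forest via scikit-learn (assumed available): each tree is grown on a bootstrap sample and considers a random subset of F attributes at every split:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=100,      # number of trees in the ensemble
                                    max_features="sqrt",   # F = sqrt(#attributes) candidates per split
                                    random_state=0)
    forest.fit(X_tr, y_tr)
    print("random-forest accuracy:", forest.score(X_te, y_te))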
Classification of Class-Imbalanced Data Sets
Class-imbalance problem:
Rare positive examples but numerous negative ones
E.g., medical diagnosis, fraud, oil-spill, fault, etc.
Traditional methods assume a balanced distribution of classes and equal
error costs: not suitable for class-imbalanced data
Typical methods for imbalanced data in 2-class classification:
Oversampling:
Re-sampling of data from positive class
Undersampling:
Randomly eliminate tuples from negative class
Threshold-moving:
Moves the decision threshold, t, so that the rare-class tuples are easier to
classify, and hence, less chance of costly false-negative errors
Ensemble techniques:
Ensemble multiple classifiers introduced above
Still difficult on multi-class tasks
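A small sketch of two of these remedies, random oversampling of the positive class and threshold-moving, on synthetic data (NumPy and scikit-learn assumed; all numbers are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def oversample_positive(X, y, rng):
        # duplicate random positive tuples until both classes are equally frequent
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
        idx = np.concatenate([neg, pos, extra])
        return X[idx], y[idx]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (rng.random(1000) < 0.05).astype(int)         # ~5% rare positive examples

    X_bal, y_bal = oversample_positive(X, y, rng)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    scores = clf.predict_proba(X)[:, 1]
    t = 0.3                                           # decision threshold moved below 0.5
    y_pred = (scores >= t).astype(int)                # rare-class tuples become easier to predict
    print("predicted positives:", int(y_pred.sum()))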
Classification
A form of data analysis
that extracts models describing important data classes
Effective and scalable methods
Developed for decision-tree induction, naive Bayesian classification, rule-
based classification, and many other classification methods
Evaluation metrics:
Accuracy, sensitivity, specificity, precision, recall, F-measure, and Fß
measure
Stratified k-fold cross-validation
Recommended for accuracy estimation
Significance tests and ROC curves
Useful for model selection
Ensemble methods
Bagging and boosting can be used to increase overall accuracy by learning
and combining a series of individual models.
Numerous comparisons of the different classification methods
The matter remains a research topic.
No single method has been found to be superior over all others for all data
sets.
Issues such as accuracy, training time, robustness, scalability, and
interpretability must be considered and can involve trade-offs, further
complicating the quest for an overall superior method.
P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning 3:261–
283, 1989
W. Cohen. Fast effective rule induction. ICML'95
S. L. Crawford. Extensions to the CART algorithm. Int. J. Man-Machine Studies
31:197–217, Aug. 1989
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93
P. Smyth and R. M. Goodman. An information theoretic approach to rule
induction. IEEE Trans. Knowledge and Data Engineering 4:301–316, 1992
X. Yin and J. Han. CPAR: Classification based on predictive association rules.
SDM'03
Decision Tree
Laszlo Kovacs, 2018
Classification
General model: a set of labeled examples {((x1, x2, x3, …, xm), y)} is used to learn a function f(x1, x2, x3, …, xm) → Y
  xi : attributes (visible)
  Y : target variable (not directly visible; to be predicted)
Application areas: movie recommendations, fraud detection, OCR
Supervised learning: both attributes and output variables are known in
the knowledge base (the experiences).
[Figure: a data set (training set of examples, each with features and a target/label)
is used to build a model, which then yields predictions of the output variables.]
Data preparation is an important step
Features may be converted from numerical to categorical (intervals)
Decision Tree
  Elementary decision: a condition on an attribute, tested at an inner node
  Outputs (labels) at the leaves
  [Example tree with the conditions "Expensive?" and "Do you need it?" and the
  leaf labels "to buy" / "not to buy".]
  DT optimization:
    minimal depth
    minimal classification error
Objective functions:
Calculation of entropy
X = (a,b,b,a,a,c,c,c)
pa = 3/8 pb = 2/8 pc = 3/8
log pa = -1.41 pb = -2 pc = -1.41
H = 0.53 + 0.5 + 0.53 = 1.56
X = (a,b,b,a,a,b,b,b)
pa = 3/8  pb = 5/8
log pa = −1.41  log pb = −0.68
H = 0.53 + 0.42 = 0.95
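The same calculation as a small Python sketch:

    from collections import Counter
    from math import log2

    def entropy(values):
        n = len(values)
        return -sum((c / n) * log2(c / n) for c in Counter(values).values())

    print(round(entropy("abbaaccc"), 2))   # 1.56  (pa = 3/8, pb = 2/8, pc = 3/8)
    print(round(entropy("abbaabbb"), 2))   # 0.95  (pa = 3/8, pb = 5/8)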
ID3 algorithm
T : training set
TC: current training set
A: feature (attribute) set
AC : current attribute set
Initially:
AC = A, TC = T
Method:
Select the best attribute from AC using entropy measure and
add this node to the tree
Selection of the winner attribute:
for a in AC loop
V = determine the set of values of a in TC
split TC into disjoint sets according to V
calculate the entropy of each partition
Ea = the aggregated entropy of a
end loop
return the a with best entropy
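A plain-Python sketch of this winner-attribute selection on a hypothetical toy training set (the attribute names and labels are made up):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_attribute(rows, labels, attributes):
        # pick the attribute whose value-based split has the lowest aggregated entropy Ea
        def aggregated_entropy(a):
            total = 0.0
            for v in set(r[a] for r in rows):
                part = [lab for r, lab in zip(rows, labels) if r[a] == v]
                total += len(part) / len(labels) * entropy(part)
            return total
        return min(attributes, key=aggregated_entropy)

    rows = [{"expensive": "yes", "needed": "yes"}, {"expensive": "yes", "needed": "no"},
            {"expensive": "no",  "needed": "yes"}, {"expensive": "no",  "needed": "no"}]
    labels = ["not buy", "not buy", "buy", "buy"]
    print(best_attribute(rows, labels, ["expensive", "needed"]))   # -> "expensive"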
Why entropy?
  It rewards maximum homogeneity of the resulting partitions.
  [Example: a node with (Yes: 8, No: 4) is split into two children
  with (Yes: 2, No: 2) and (Yes: 6, No: 2).]
Benefits of DT:
Drawbacks of DT:
Only categorical attributes are supported
Too expensive to modify
Low generalization level of the data → tree pruning
Regression tree
  Similar to a decision tree, but its output is a value or a range
Quality of the DT model
Training
Testing
Cross-validation
All data items are used for both training and testing but not at the same time
Random forest
Example in R: printcp(fit) and plotcp(fit) (from the rpart package) print and plot
the complexity-parameter table of a fitted tree fit, used to choose a pruning level.
Example in RapidMiner
Source CSV
Lecture Knowledge Discovery in Databases
Biology:
Taxonomy of living things:
kingdom, phylum, class, order, family, genus, and species
Information retrieval:
Document clustering
Land use:
Identification of areas of similar land use in an earth-observation database
Marketing:
Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
City planning:
Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies:
Observed earthquake epicenters should be clustered along continent faults.
Climate:
Understanding earth climate; finding patterns in atmosphere and ocean data
Economic Science:
Market research
Summarization
Preprocessing for regression, PCA, classification, and association analysis
Compression
Image processing: vector quantization
Finding k-nearest neighbors
Localizing search to one cluster or a small number of clusters
Outlier detection
Outliers are often viewed as those "far away" from any cluster.
Dissimilarity/similarity metric
Similarity is expressed in terms of a distance function,
typically a metric: d(i, j)
The definitions of distance functions are usually rather different for interval-
scaled, Boolean, categorical, ordinal, ratio, and vector variables.
See Chapter 2!
Weights should be associated with different variables
based on applications and data semantics.
Quality of clustering:
There is usually a separate "quality" function
that measures the "goodness" of a cluster.
It is hard to define "similar enough" or "good enough."
The answer is typically highly subjective.
Partitioning criteria
Single level vs. hierarchical partitioning
Often, multi-level hierarchical partitioning is desirable.
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs.
Non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs.
Connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low-dimensional) vs.
Subspaces (often in high-dimensional clustering)
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Partitioning approach:
Construct various partitions and then evaluate them by some criterion
E.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARA, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: AGNES, DIANA, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based approach:
A model is hypothesized for each of the clusters,
and the idea is to find the best fit of the data to the given model.
Typical methods: EM, SOM, COBWEB
Frequent-pattern-based approach:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based approach:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways.
Massive links can be used to cluster objects: SimRank, LinkClus
Partitioning method:
Partition a database D of n objects o into a set of k clusters Ci, 1 ≤ i ≤ k
such that the sum of squared distances to ci is minimized
(where ci is the centroid or medoid of cluster Ci)
Variant:
Start with arbitrarily chosen k objects as initial centroids in step 1.
Continue with step 3.
An Example of K-Means Clustering
[Figure, k = 2: arbitrarily partition the objects into k groups, then calculate the cluster centroids.]
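A minimal NumPy sketch of the k-means iteration described above (toy data; not a production implementation):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial centroids
        for _ in range(iters):
            # assign every object to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each cluster's centroid (keep the old one if a cluster is empty)
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
    labels, centroids = kmeans(X, k=2)
    print(labels, centroids, sep="\n")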
Strength:
Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
Comment:
Often terminates at a local optimum.
Weakness:
Applicable only to objects in a continuous n-dimensional space
Using the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of data.
Need to specify k, the number of clusters, in advance
There are ways to automatically determine the best k
(see Hastie et al., 2009).
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
K-Medoids Clustering:
Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw, 1987)
Starts from an initial set of k medoids and iteratively replaces one of the
medoids by one of the non-medoids, if it improves the total distance of
the resulting clustering
PAM works effectively for small data sets, but does not scale well for
large data sets (due to the computational complexity)
Efficiency improvement on PAM
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling
[Figure: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid.]
[Figure, continued: randomly select a non-medoid object o_random; compute the total cost of swapping a medoid o with o_random; swap if the quality is improved; repeat until no change (total cost in the example: 26).]
TCih = Σj Cjih
with Cjih as the cost for object oj if oi is swapped with oh
That is, distance to new medoid minus distance to old medoid
Case 1:
oj currently belongs to medoid oi. If oi is replaced with oh as a medoid, and oj is
closest to oh, then oj is reassigned to oh (same cluster, different distance).
Cjih = d(j, h) – d(j, i)
Case 2:
oj currently belongs to medoid ot, t ≠ i. If oi is replaced with oh as a medoid, and
oj is still closest to ot, then the assignment does not change.
Cjih = d(j, t) − d(j, t) = 0
Case 3:
oj currently belongs to medoid oi. If oi is replaced with oh as a medoid, and oj is
closest to medoid ot of one of the other clusters, then oj is reassigned to ot (new
cluster, different distance).
Cjih = d(j, t) – d(j, i)
Case 4:
oj currently belongs to medoid ot, t ≠ i. If oi is replaced with oh as a medoid (from
a different cluster!), and oj is closest to oh, then oj is reassigned to oh (new
cluster, different distance).
Cjih = d(j, h) − d(j, t)
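The four cases only differ in which medoid an object is closest to before and after the swap; in each of them the contribution is "distance to new closest medoid minus distance to old closest medoid". A plain-Python sketch of the total swap cost TCih built on that observation (toy points, Euclidean distance):

    import math

    def swap_cost(objects, medoids, dist, o_i, o_h):
        # cost of replacing medoid o_i by non-medoid o_h, summed over all objects o_j
        new_medoids = [m for m in medoids if m != o_i] + [o_h]
        total = 0.0
        for o_j in objects:
            d_old = min(dist(o_j, m) for m in medoids)       # distance to current closest medoid
            d_new = min(dist(o_j, m) for m in new_medoids)   # distance after the swap
            total += d_new - d_old                           # Cjih
        return total                                         # swap if TCih < 0

    objects = [(1, 1), (2, 1), (8, 8), (9, 8), (9, 9)]
    print(swap_cost(objects, medoids=[(1, 1), (9, 9)], dist=math.dist,
                    o_i=(9, 9), o_h=(9, 8)))                 # negative -> the swap improves the clustering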
Minimum distance:
  Smallest distance between an object in one cluster and an object in the other,
  i.e., distmin(Ci, Cj) = min{ d(oip, ojq) | oip ∈ Ci, ojq ∈ Cj }
Maximum distance:
  Largest distance between an object in one cluster and an object in the other,
  i.e., distmax(Ci, Cj) = max{ d(oip, ojq) | oip ∈ Ci, ojq ∈ Cj }
Average distance:
  Average distance between an object in one cluster and an object in the other,
  i.e., distavg(Ci, Cj) = (1 / (ni nj)) Σ_{oip ∈ Ci, ojq ∈ Cj} d(oip, ojq)
Mean distance:
  Distance between the centroids of two clusters,
  i.e., distmean(Ci, Cj) = d(ci, cj)
Good if true clusters are rather compact and approx. equal in size.
Clustering Feature of a cluster of n points: CF = (n, LS, SS)
  LS = Σ_{i=1..n} xi : linear sum of the n points
  SS = Σ_{i=1..n} xi² : square sum of the n points
  E.g., centroid: ci = LS / n
[Figure: example cluster with the points (2,6), (4,5), (4,7), (3,8).]
Additive:
For two disjoint clusters C1 and C2
with clustering features CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2),
the clustering feature of the cluster that is formed by merging C1 and C2
is simply:
CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
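A small Python sketch of clustering features and their additivity (using the toy 2-D points from the example above):

    def cf(points):
        # CF = (n, LS, SS) for a set of d-dimensional points
        n = len(points)
        dim = len(points[0])
        ls = tuple(sum(p[i] for p in points) for i in range(dim))   # linear sum
        ss = sum(x * x for p in points for x in p)                  # square sum
        return n, ls, ss

    def merge(cf1, cf2):
        # CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
        (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
        return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

    c1 = cf([(2, 6), (4, 5)])
    c2 = cf([(4, 7), (3, 8)])
    print(merge(c1, c2))                              # equals ...
    print(cf([(2, 6), (4, 5), (4, 7), (3, 8)]))       # ... the CF of all four points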
Height-balanced tree
Stores the clustering features for a hierarchical clustering
Non-leaf nodes store sums of the CFs of their children.
Two parameters:
Branching factor B: max # of children
Threshold T: maximum diameter of sub-clusters stored at leaf nodes
Diameter: average pairwise distance within a cluster,
reflects the tightness of the cluster around the centroid
[Figure: a CF-tree with branching factor B = 7; the root and the non-leaf nodes store entries CF1, CF2, CF3, …]
Phase 1:
For each point in the input:
Find closest leaf-node entry
Add point to leaf-node entry and update CF
If entry_diameter > max_diameter, then split leaf node, and possibly parents
Information about new point is passed toward the root of the tree.
Algorithm is O(n)
And incremental
Concerns
Sensitive to insertion order of data points
Since we fix the size of leaf nodes, clusters may not be so natural.
Clusters tend to be spherical given the radius and diameter measures.
CHAMELEON:
  Construct a sparse k-NN graph from the data set
    (p and q are connected if q is among the top-k closest neighbors of p).
  Partition the graph.
  Merge the partitions into the final clusters, guided by
    Relative interconnectivity: connectivity of C1 and C2 over internal connectivity
    Relative closeness: closeness of C1 and C2 over internal closeness
The probability that a point xi ∈ X is generated by the model N(μ, σ²):
  P(xi | μ, σ²) = (1 / √(2πσ²)) · e^(−(xi − μ)² / (2σ²))
The likelihood that X is generated by the model:
  L(N(μ, σ²) : X) = P(X | μ, σ²) = Π_{i=1..n} (1 / √(2πσ²)) · e^(−(xi − μ)² / (2σ²))
The generative model is learned by maximizing the likelihood:
  N(μ0, σ0²) = argmax L(N(μ, σ²) : X)
Two parameters:
Eps: Maximum radius of a neighborhood
Defines the neighborhood of a point p:
NEps(p) = { q ∈ D | dist(p, q) ≤ Eps }
Distance function still needed
MinPts: Minimum number of points in an Eps-neighborhood of a point p
Core-point condition:
| NEps(p) | ≥ MinPts
Only these neighborhoods are considered.
Directly density-reachable:
A point q
is directly density-reachable
from a core point p
w.r.t. Eps, MinPts,
if q belongs to NEps(p).
Density-reachable:
A point q is density-reachable
from a core point p w.r.t. Eps, MinPts,
if there is a chain of points p1, …, pn
with p1 = p, pn = q,
such that pi+1 is directly density-reachable
from pi.
Density-connected:
A point p is density-connected
to a point q w.r.t. Eps, MinPts,
if there is a point o
such that both p and q
are density-reachable from o.
[Figure: core, border, and outlier (noise) points for Eps = 1 cm, MinPts = 5.]
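A short usage sketch of DBSCAN with the two parameters above, assuming scikit-learn is available (synthetic toy data):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    cluster1 = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(40, 2))
    cluster2 = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(40, 2))
    noise = rng.uniform(low=-2.0, high=7.0, size=(5, 2))
    X = np.vstack([cluster1, cluster2, noise])

    db = DBSCAN(eps=0.8, min_samples=5).fit(X)   # Eps and MinPts
    print(set(db.labels_))                       # label -1 marks noise/outlier points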
Core distance of a point p:
  the minimum Eps such that p is a core point
Reachability distance of a point p from q:
  the minimum radius value that makes p directly density-reachable from q,
  that is, r(q, p) := max { core-distance(q), dist(q, p) }
Example (MinPts = 5, core_distance(q) = 3 cm):
  dist(q, p1) = 2.8 cm → r(q, p1) = 3 cm
  dist(q, p2) = 4 cm → r(q, p2) = 4 cm
[Figure: reachability-distance plot over the cluster order of the objects
(the first value is undefined); thresholds Eps and Eps' expose clusters at
different density levels.]
DENCLUE (DENsity-based CLUstEring, Hinneburg & Keim, KDD'98)
Using statistical density functions:
  Influence of y on x:
    f_Gaussian(x, y) = e^(−d(x, y)² / (2σ²))
  Total influence on x:
    f_Gaussian^D(x) = Σ_{i=1..N} e^(−d(x, xi)² / (2σ²))
  Gradient of x in the direction of xi:
    ∇f_Gaussian^D(x, xi) = Σ_{i=1..N} (xi − x) · e^(−d(x, xi)² / (2σ²))
Major features:
  Solid mathematical foundation
  Good for data sets with large amounts of noise
Density estimation:
Estimation of an unobservable underlying probability density function
based on a set of observed data
Regarded as a sample
Kernel density estimation:
Treat an observed object as an indicator of high-probability density in the
surrounding region
Probability density at a point depends on the distances from this point to the
observed objects
Density attractor:
Local maximum of the estimated density function
Must be above threshold ξ
Clusters
Can be determined mathematically by identifying density attractors
That is, local maxima of the overall density function
Center-defined clusters:
Assign to each density attractor the points density-attracted to it
Arbitrary shaped cluster:
Merge density attractors that are connected through paths of high density
(> threshold)
Advantages:
Query-independent, easy to parallelize, incremental update
O(K), where K is the number of grid cells at the lowest level
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected.
[Figure: CLIQUE example over the dimensions age, salary (in 10,000), and vacation (in weeks).]
Strength
Automatically finds subspaces of the highest dimensionality
such that high-density clusters exist in those subspaces
Insensitive to the order of records in input
and does not presume any canonical data distribution
Scales linearly with the size of input
and has good scalability as the number of dimensions in the data is
increased
Weakness
Dependent on proper grid size
and density threshold
Accuracy of the clustering result may be degraded;
this is the price paid for the simplicity of the method.
Data set:
Let X = { xi | i =1 to n } be a collection of n patterns
in a d-dimensional space such that xi = (xi1, xi2, …, xid).
Random sample of data space:
Let Y = { yj | j =1 to m } be m sampling points placed at random in the d-
dimensional space, m << n.
m out of the available n patterns are marked at random.
Two types of distances defined:
uj as minimum distance from yj to its nearest pattern in X and
wj as minimum distance from a randomly selected pattern in X to its nearest
neighbor in X
The Hopkins statistic in d dimensions is defined as:
H = Σ_{j=1..m} uj / ( Σ_{j=1..m} uj + Σ_{j=1..m} wj )
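A NumPy sketch of this statistic (with uj in the numerator as above, values near 0.5 suggest uniformly distributed data, while values near 1 suggest a clustered data set; the data here are synthetic):

    import numpy as np

    def hopkins(X, m, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        Y = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))   # m random sampling points
        marked = X[rng.choice(n, size=m, replace=False)]             # m randomly marked patterns

        def nn_dist(points, exclude_self=False):
            dists = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
            if exclude_self:
                dists[dists == 0] = np.inf                           # ignore the point itself
            return dists.min(axis=1)

        u = nn_dist(Y)                           # y_j -> nearest pattern in X
        w = nn_dist(marked, exclude_self=True)   # marked pattern -> nearest other pattern in X
        return u.sum() / (u.sum() + w.sum())

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
    print(hopkins(X, m=10))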
Empirical method
# of clusters ≈ √(n/2) for a dataset of n points
Elbow method
Use the turning point in the curve of sum of within-cluster variance
w.r.t. the # of clusters.
Cross-validation method
Divide a given data set into m parts.
Use m – 1 parts to obtain a clustering model.
Use the remaining part to test the quality of the clustering.
E.g., for each point in the test set, find the closest centroid, and use the
sum of squared distances between all points in the test set and the
closest centroids to measure how well the model fits the test set.
For any k > 0, repeat it m times, compare the overall quality measure w.r.t.
different k's, and find # of clusters that fits the data the best.
Two methods:
Extrinsic: supervised, i.e., the ground truth is available
Compare a clustering against the ground truth
using certain clustering quality measure.
Ex. BCubed precision and recall metrics
Intrinsic: unsupervised, i.e., the ground truth is unavailable
Evaluate the goodness of a clustering by considering
how well the clusters are separated,
and how compact the clusters are.
Ex. Silhouette coefficient
Cluster analysis
Groups objects based on their similarity and has wide applications
Measure of similarity
Can be computed for various types of data
Clustering algorithms can be categorized into
Partitioning methods (k-means and k-medoids)
Hierarchical methods (BIRCH and CHAMELEON; probabilistic hierarchical
clustering)
Density-based methods (DBSCAN, OPTICS, and DENCLUE)
Grid-based methods (STING, CLIQUE)
(Model-based methods)
Quality of clustering results
Can be evaluated in various ways.
L. Fu and E. Medico. FLAME, a novel fuzzy clustering method for the analysis
of DNA microarray data. BMC Bioinformatics 2007, 8:3
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB'98
V. Ganti, J. Gehrke, and R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. ICDE'99
A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD'98
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988
Outlier:
A data object that deviates significantly from the normal objects
as if it were generated by a different mechanism
Ex.: unusual credit card purchase,
sports: Michael Jordan, Wayne Gretzky, ...
Outliers are different from noise.
Noise is a random error or variance in a measured variable.
Noise should be removed before outlier detection.
Outliers are interesting.
They violate the mechanism that generates the normal data.
Outlier detection vs. novelty detection:
Early stage: outlier;
but later merged into the model
Applications:
Credit-card-fraud detection
Telecom-fraud detection
Customer segmentation
Medical analysis
Collective Outliers
A subset of data objects
that collectively deviates significantly
from the whole data set,
even if the individual data objects may not be outliers.
Ex.: intrusion detection—a number of computers
keep sending denial-of-service packets to each other.
Detection of collective outliers
Consider not only behavior of individual objects,
but also that of groups of objects
Need to have the background knowledge
on the relationship among data objects,
such as a distance or similarity measure on objects
A data set may have multiple types of outliers.
One object may belong to more than one type of outlier.
Situation:
In many applications, the number of labeled data objects is small:
Labels could be on outliers only, on normal objects only, or on both.
Semi-supervised outlier detection
Regarded as application of semi-supervised learning
If some labeled normal objects are available:
Use the labeled examples and the proximate unlabeled objects
to train a model for normal objects.
Those not fitting the model of normal objects are detected as outliers.
If only some labeled outliers are available,
that small number may not cover the possible outliers well.
To improve the quality of outlier detection:
get help from models for normal objects learned from unsupervised methods.
An object is an outlier if the nearest neighbors of the object are far away, i.e.,
the proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set.
Example (right figure):
Model the proximity of an object
using its 3 nearest neighbors.
Objects in region R are substantially different
from other objects in the data set.
Thus the objects in R are outliers.
Effectiveness of proximity-based methods
Highly relies on the proximity measure.
In some applications, proximity or distance measures cannot be obtained easily.
Often have difficulty finding a group of outliers that are close to each other.
Two major types of proximity-based outlier detection:
Distance-based vs. density-based
Non-parametric method
Do not assume an a-priori statistical model;
instead, determine the model from the input data.
Not completely parameter-free,
but consider number and nature of the parameters to be flexible
and not fixed in advance.
Examples: histogram and kernel density estimation
Univariate data:
A data set involving only one attribute or variable
Assumption:
Data are generated from a normal distribution
Learn the parameters from the input data, and identify the points
with low probability as outliers.
Use the maximum-likelihood method to estimate μ and σ:
  μ̂ = (1/n) Σ_{i=1..n} xi,   σ̂² = (1/n) Σ_{i=1..n} (xi − μ̂)²
Example
Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
For these data with n = 10, we have μ̂ = 28.61 and σ̂ = 1.51.
Then the most deviating value, 24.0, is 4.61 away from the estimated mean.
μ ± 3σ contains 99.7% of the data under the assumption of a normal
distribution.
Because 4.61 / 1.51 = 3.04 > 3, probability that 24.0 is generated by
normal distribution is less than 0.15%.
Each "tail" to the left and to the right of the 99.7% has 0.15%.
Hence, 24.0 identified as an outlier.
Multivariate data:
A data set involving two or more attributes or variables
Transform the multivariate outlier-detection task into a univariate
outlier-detection problem.
Method 1. Compute Mahalanobis distance
Let ō be the mean vector for a multivariate data set. Mahalanobis distance
for an object o to ō is MDist(o, ō) = (o – ō)T S –1 (o – ō) where S is the
covariance matrix.
Use Grubbs' test on this measure to detect outliers.
Method 2. Use the χ² statistic:
  χ² = Σ_{i=1..n} (oi − Ei)² / Ei
where oi is the value of o in the i-th dimension, Ei is the mean of the i-th
dimension among all objects, and n is the dimensionality.
If the χ² statistic is large, then object o is an outlier.
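A NumPy sketch of Method 1 above, the Mahalanobis distance of every object to the mean vector (synthetic data with one artificially deviating object; the Grubbs test on the resulting values is not shown):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[0] = [8.0, 8.0, 8.0]                               # an artificially deviating object

    o_bar = X.mean(axis=0)                               # mean vector
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))       # inverse covariance matrix
    diff = X - o_bar
    mdist = np.einsum("ij,jk,ik->i", diff, S_inv, diff)  # MDist(o, o_bar) for every object o
    print("most deviating object:", int(mdist.argmax())) # -> 0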
Parametric Methods III: Using a Mixture of Parametric Distributions
where fθ1 and fθ2 are the probability density functions of θ1 and θ2 .
Then use expectation-maximization (EM) algorithm to learn the parameters
μ1, σ1, μ2, σ2 from data.
An object o is an outlier if it does not belong to any cluster.
Problem:
Hard to choose an appropriate bin size for histogram
Too small bin size → normal objects in empty/rare bins, false positive
Too big bin size → outliers in some frequent bins, false negative
Solution:
Adopt kernel density estimation
to estimate the probability density distribution of the data.
If the estimated density function is high,
the object is likely normal.
Otherwise, it is likely an outlier.
Equivalently, one can check the distance between o and its k-th nearest
neighbor ok: o is a distance-based outlier if dist(o, ok) > r.
CELL:
Data space is partitioned into a multi-D grid.
Each cell is a hypercube with diagonal length r/2:
  r is the distance-threshold parameter;
  in l dimensions, each cell has an edge of length r / (2√l).
Level-1 cells:
Immediately next to C
For any possible point x in cell C and
any possible point y in a level-1 cell: dist(x,y) ≤ r
Level-2 cells:
One or two cells away from C
For any possible point x in cell C
and any point y such that dist(x,y) ≥ r,
y is in a level-2 cell.
Local outliers:
Outliers compared to their local neighborhoods,
not to global data distribution
In Fig., O1 and O2 are local outliers to C1,
O3 is a global outlier, but O4 is not an outlier.
However, distance of O1 and O2 to objects
in dense cluster C1 is smaller than average
distance in sparse cluster C2.
Hence, O1 and O2 are not distance-based outliers.
Intuition:
Density around outlier object significantly different from density around its
neighbors.
Method:
Use the relative density of an object against its neighbors
as the indicator of the degree to which the object is an outlier
k-distance of an object o
distk(o)
distance dist(o, p) between o and its k-nearest neighbor p
at least k objects o' ϵ D – {o} such that dist (o, o') ≤ dist(o, p)
at most k – 1 objects o'' ϵ D – {o} such that dist(o, o'') < dist(o, p)
k-distance neighborhood of o:
Nk(o) = {o' | o' ϵ D, dist(o, o') ≤ distk(o)}
|Nk(o)| could be bigger than k,
since multiple objects may have identical distance to o.
The lower the local reachability density of o, and the higher the local
reachability densities of the kNN of o, the higher the LOF value.
This captures a local outlier whose local density is relatively low compared to
the local densities of its kNN.
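A usage sketch of LOF via scikit-learn (assumed available), reproducing the situation of a dense and a sparse cluster with a few local outliers:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    dense = rng.normal(loc=0.0, scale=0.2, size=(100, 2))    # dense cluster C1
    sparse = rng.normal(loc=5.0, scale=2.0, size=(30, 2))    # sparse cluster C2
    near_c1 = np.array([[1.2, 1.2], [1.4, -1.3]])            # candidate local outliers near C1
    X = np.vstack([dense, sparse, near_c1])

    lof = LocalOutlierFactor(n_neighbors=20)                 # k nearest neighbors
    labels = lof.fit_predict(X)                              # -1 marks detected outliers
    scores = -lof.negative_outlier_factor_                   # the LOF values (larger = more outlying)
    print("flagged indices:", np.where(labels == -1)[0])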
Types of outliers
Global, contextual & collective outliers
Outlier detection
Supervised, semi-supervised, or unsupervised
Statistical (or model-based) approaches
Proximity-based approaches
Not covered here:
Clustering-based approaches
Classification approaches
Mining contextual and collective outliers
Outlier detection in high dimensional data
B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time
series. Biometrika, 66:229–248, 1979.
M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric
and symbolic outlier mining techniques. Intell. Data Anal., 10:521–538, 2006.
F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–
147, 1960.
D. Agarwal. Detecting anomalies in cross-classified streams: a Bayesian
approach. Knowl. Inf. Syst., 11:29–44, 2006.
F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets.
TKDE, 2005.
C.C. Aggarwal and P.S. Yu. Outlier detection for high dimensional data.
SIGMOD'01.
R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.
I. Ben-Gal. Outlier detection. In: O. Maimon and L. Rockach (eds.), Data Mining
and Knowledge Discovery Handbook: A Complete Guide for Practitioners and
Researchers, Kluwer Academic, 2005.
M.M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-
based local outliers. SIGMOD'00.
D. Barbara, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data
mining intrusion detection system. SAC'03.
Z.A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for
outlier detection techniques in data mining. IEEE Conf. on Cybernetics and
Intelligent Systems, 2006.
S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear
time with randomization and a simple pruning rule. KDD'03.
D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusion using
Bayesian estimators. SDM'01.
V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM
Computing Surveys, 41:1–58, 2009.
D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data
using negative selection algorithm. CEC'02.