
Data Warehousing and Mining

Data
Book: Introduction to Data Mining, by Tan | Steinbach | Kumar

Instructor: BIDYUT KUMAR PATRA

Tuesday[10:00-11:00], Thursday[09:00-10:00], Friday[08:00-09:00]

Saturday[4.15pm-5.15pm]

https://sites.google.com/site/patrabidyutkr/teaching/data-warehousing-and-mining-cs6312

17/09/2020
Outline

 Attributes and Objects

 Types of Data

 Data Quality

 Similarity and Distance

 Data Preprocessing

17/09/2020
What is Data?

 Collection of data objects and their attributes

 An attribute is a property or characteristic of an object
   – Examples: eye color of a person, temperature, etc.
   – Attribute is also known as variable, field, characteristic, dimension, or feature

 A collection of attributes describes an object
   – Object is also known as record, point, case, sample, entity, or instance

 Example (each row is an object, each column is an attribute):

   Tid  Refund  Marital Status  Taxable Income  Cheat
    1   Yes     Single          125K            No
    2   No      Married         100K            No
    3   No      Single          70K             No
    4   Yes     Married         120K            No
    5   No      Divorced        95K             Yes
    6   No      Married         60K             No
    7   Yes     Divorced        220K            No
    8   No      Single          85K             Yes
    9   No      Married         75K             No
   10   No      Single          90K             Yes
A More Complete View of Data

 Attributes (objects) may have relationships with


other attributes (objects)

 More generally, data may have structure

 Data can be incomplete

 We will discuss this in more detail later

17/09/2020
Attribute Values

 Attribute values are numbers or symbols


assigned to an attribute for a particular object

 Distinction between attributes and attribute values


– Same attribute can be mapped to different attribute
values
 Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of


values
 Example: Attribute values for ID and age are integers
 But properties of attribute values can be different

17/09/2020
Measurement of Length
 The way you measure an attribute may not match the attribute's properties.

[Figure: the same set of line segments measured on two different scales — one scale preserves only the ordering property of length, the other preserves both the ordering and additivity properties of length.]
Types of Attributes

 There are different types of attributes


– Nominal
 Examples: ID numbers, eye color, zip codes
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height {tall, medium, short}
– Interval
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, counts,
elapsed time (e.g., time to run a race)
17/09/2020
Properties of Attribute Values

 The type of an attribute depends on which of the


following properties/operations it possesses:
– Distinctness: =, ≠
– Order: <, >
– Differences are meaningful: +, −
– Ratios are meaningful: *, /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful
differences
– Ratio attribute: all 4 properties/operations

17/09/2020
Difference Between Ratio and Interval

 Is it physically meaningful to say that a


temperature of 10 ° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?

 Consider measuring the height above average


– If Bill’s height is three inches above average and
Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
– Is this situation analogous to that of temperature?

17/09/2020
Attribute Type | Description | Examples | Operations

Categorical (Qualitative):

  Nominal  – Nominal attribute values only distinguish (=, ≠).
             Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
             Operations: mode, entropy, contingency correlation, χ² test

  Ordinal  – Ordinal attribute values also order objects (<, >).
             Examples: hardness of minerals, {good, better, best}, grades, street numbers
             Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative):

  Interval – For interval attributes, differences between values are meaningful (+, −).
             Examples: calendar dates, temperature in Celsius or Fahrenheit
             Operations: mean, standard deviation, Pearson's correlation, t and F tests

  Ratio    – For ratio variables, both differences and ratios are meaningful (*, /).
             Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
             Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens


Attribute Type | Transformation | Comments

Categorical (Qualitative):

  Nominal  – Transformation: any permutation of values
             Comment: if all employee ID numbers were reassigned, would it make any difference?

  Ordinal  – Transformation: an order-preserving change of values, i.e.,
             new_value = f(old_value), where f is a monotonic function
             Comment: an attribute encompassing the notion of good, better, best can be
             represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative):

  Interval – Transformation: new_value = a * old_value + b, where a and b are constants
             Comment: the Fahrenheit and Celsius temperature scales differ in terms of where
             their zero value is and the size of a unit (degree).

  Ratio    – Transformation: new_value = a * old_value
             Comment: length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens


Discrete and Continuous Attributes

 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
17/09/2020
Asymmetric Attributes
 Only presence (a non-zero attribute value) is regarded as
important
 Words present in documents
 Items present in customer transactions

 If we met a friend in the grocery store would we ever say the


following?
“I see our purchases are very similar since we didn’t buy most of the
same things.”

 We need two asymmetric binary attributes to represent one


ordinary binary attribute
– Association analysis uses asymmetric attributes

 Asymmetric attributes typically arise from objects that are


sets

17/09/2020
Key Messages for Attribute Types

 The types of operations you choose should be


“meaningful” for the type of data you have
– Distinctness, order, meaningful intervals, and meaningful ratios
are only four properties of data

– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are not
present

– Analysis may depend on these other properties of the data


 Many statistical analyses depend only on the distribution

– Many times what is meaningful is measured by statistical


significance

– But in the end, what is meaningful is measured by the domain

17/09/2020
Types of data sets
 Record
– Data Matrix
– Document Data
– Transaction Data
 Graph
– World Wide Web
– Molecular Structures
 Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

17/09/2020
Important Characteristics of Data

– Dimensionality (number of attributes)


 High dimensional data brings a number of challenges

– Sparsity
 Only presence counts

– Resolution
 Patterns depend on the scale

– Size
 Type of analysis may depend on size of data

17/09/2020
Record Data

 Data that consists of a collection of records, each


of which consists of a fixed set of attributes
   Tid  Refund  Marital Status  Taxable Income  Cheat
    1   Yes     Single          125K            No
    2   No      Married         100K            No
    3   No      Single          70K             No
    4   Yes     Married         120K            No
    5   No      Divorced        95K             Yes
    6   No      Married         60K             No
    7   Yes     Divorced        220K            No
    8   No      Single          85K             Yes
    9   No      Married         75K             No
   10   No      Single          90K             Yes

17/09/2020
Data Matrix

 If data objects have the same fixed set of numeric


attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

 Such a data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute
   Projection of x Load   Projection of y Load   Distance   Load   Thickness
   10.23                  5.27                   15.22      2.7    1.2
   12.65                  6.25                   16.22      2.2    1.1

17/09/2020
Document Data

 Each document becomes a ‘term’ vector


– Each term is a component (attribute) of the vector
– The value of each component is the number of times
the corresponding term occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
 Document 1    3     0      5     0     2      6     0    2      0        2
 Document 2    0     7      0     2     1      0     0    3      0        0
 Document 3    0     1      0     0     1      2     2    0      3        0

17/09/2020
Transaction Data

 A special type of data, where


– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
– Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
17/09/2020
Graph Data

 Examples: Generic graph, a molecule, and webpages

[Figures: a generic graph with labeled nodes and edges, linked webpages, and the benzene molecule C6H6.]


17/09/2020
Ordered Data

 Sequences of transactions
Items/Events

An element of
the sequence
17/09/2020
Ordered Data

 Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

17/09/2020
Ordered Data

 Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean

17/09/2020
Data Quality

 Poor data quality negatively affects many data processing


efforts
“The most important point is that poor data quality is an unfolding
disaster.
– Poor data quality costs the typical company at least ten
percent (10%) of revenue; twenty percent (20%) is
probably a better estimate.”
Thomas C. Redman, DM Review, August 2004

 Data mining example: a classification model for detecting


people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
17/09/2020
Data Quality …

 What kinds of data quality problems?


 How can we detect problems with the data?
 What can we do about these problems?

 Examples of data quality problems:


– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
– Fake data
17/09/2020
Noise

 For objects, noise is an extraneous object


 For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise


17/09/2020
Outliers

 Outliers are data objects with characteristics that


are considerably different than most of the other
data objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis

– Case 2: Outliers are


the goal of our analysis
 Credit card fraud
 Intrusion detection

 Causes?
17/09/2020
Missing Values

 Reasons for missing values


– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values


– Eliminate data objects or variables
– Estimate missing values
 Example: time series of temperature
 Example: census results

– Ignore the missing value during analysis

17/09/2020
Missing Values …

 Missing completely at random (MCAR)


– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
 Missing at Random (MAR)
– Missingness is related to other variables
– Fill in values based other values
– Almost always produces a bias in the analysis
 Missing Not at Random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness
 Not possible to know the situation from the data
17/09/2020
Duplicate Data

 Data set may include data objects that are


duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

 Examples:
– Same person with multiple email addresses

 Data cleaning
– Process of dealing with duplicate data issues

 When should duplicate data not be removed?


17/09/2020
Similarity and Dissimilarity Measures

 Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
 Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
 Proximity refers to a similarity or dissimilarity
17/09/2020
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity


between two objects, x and y, with respect to a single, simple
attribute.

17/09/2020
Euclidean Distance

 Euclidean Distance

    dist(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )

 where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

 Standardization is necessary if scales differ.

17/09/2020
Euclidean Distance

   point   x   y
   p1      0   2
   p2      2   0
   p3      3   1
   p4      5   1

          p1      p2      p3      p4
   p1     0       2.828   3.162   5.099
   p2     2.828   0       1.414   3.162
   p3     3.162   1.414   0       2
   p4     5.099   3.162   2       0
Distance Matrix
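A minimal NumPy sketch (not from the slides) that reproduces the distance matrix above for the four points:

    import numpy as np

    # The four points from the slide: p1(0,2), p2(2,0), p3(3,1), p4(5,1)
    points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

    # Pairwise Euclidean distances: sqrt(sum_k (x_k - y_k)^2)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    print(np.round(dist, 3))
    # [[0.    2.828 3.162 5.099]
    #  [2.828 0.    1.414 3.162]
    #  [3.162 1.414 0.    2.   ]
    #  [5.099 3.162 2.    0.   ]]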
17/09/2020
Minkowski Distance

 Minkowski Distance is a generalization of Euclidean Distance:

    dist(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}

 where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

17/09/2020
Minkowski Distance: Examples

 r = 1. City block (Manhattan, taxicab, L 1 norm) distance.


– A common example of this for binary vectors is the
Hamming distance, which is just the number of bits that are
different between two binary vectors

 r = 2. Euclidean distance

 r → ∞. "supremum" (L_max norm, L_∞ norm) distance.


– This is the maximum difference between any component of
the vectors

 Do not confuse r with n, i.e., all these distances are


defined for all numbers of dimensions.

17/09/2020
Minkowski Distance

   point   x   y
   p1      0   2
   p2      2   0
   p3      3   1
   p4      5   1

   L1     p1   p2   p3   p4
   p1     0    4    4    6
   p2     4    0    2    4
   p3     4    2    0    2
   p4     6    4    2    0

   L2     p1      p2      p3      p4
   p1     0       2.828   3.162   5.099
   p2     2.828   0       1.414   3.162
   p3     3.162   1.414   0       2
   p4     5.099   3.162   2       0

   L∞     p1   p2   p3   p4
   p1     0    2    3    5
   p2     2    0    1    3
   p3     3    1    0    2
   p4     5    3    2    0

Distance Matrix
17/09/2020
Mahalanobis Distance

    mahalanobis(x, y) = (x − y)ᵀ Σ⁻¹ (x − y)

 Σ is the covariance matrix

For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.


17/09/2020
Mahalanobis Distance

Covariance Matrix:

    Σ = | 0.3  0.2 |
        | 0.2  0.3 |

Points:  A: (0.5, 0.5)   B: (0, 1)   C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
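A small NumPy check of the two Mahalanobis distances quoted above (a sketch, not part of the slides):

    import numpy as np

    cov = np.array([[0.3, 0.2], [0.2, 0.3]])
    cov_inv = np.linalg.inv(cov)

    A = np.array([0.5, 0.5])
    B = np.array([0.0, 1.0])
    C = np.array([1.5, 1.5])

    def mahalanobis(x, y, cov_inv):
        d = x - y
        return d @ cov_inv @ d   # (x - y)^T Sigma^{-1} (x - y)

    print(mahalanobis(A, B, cov_inv))   # -> 5.0
    print(mahalanobis(A, C, cov_inv))   # -> 4.0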

17/09/2020
Common Properties of a Distance

 Distances, such as the Euclidean distance,


have some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between


points (data objects), x and y.

 A distance that satisfies these properties is a


metric
17/09/2020
Common Properties of a Similarity

 Similarities, also have some well known


properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y.


(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data


objects), x and y.

17/09/2020
Similarity Between Binary Vectors
 Common situation is that objects, x and y, have only
binary attributes

 Compute similarities using the following quantities


f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

 Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of 11 matches / number of non-zero attributes


= (f11) / (f01 + f10 + f11)

17/09/2020
SMC versus Jaccard: Example

x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
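A short sketch (not from the slides) that reproduces the SMC and Jaccard values for these two vectors:

    def smc_and_jaccard(x, y):
        """Simple Matching Coefficient and Jaccard for two binary vectors."""
        f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
        f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
        f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
        f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
        smc = (f11 + f00) / (f11 + f00 + f10 + f01)
        jac = f11 / (f11 + f10 + f01) if (f11 + f10 + f01) > 0 else 0.0
        return smc, jac

    x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
    print(smc_and_jaccard(x, y))   # (0.7, 0.0)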

17/09/2020
Cosine Similarity

 If d1 and d2 are two document vectors, then


cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot
product of vectors, d1 and d2, and || d || is the length of
vector d.
 Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2 ) = 0.3150
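A quick sketch (not from the slides) that verifies the cosine value above:

    import math

    def cosine(d1, d2):
        dot = sum(a * b for a, b in zip(d1, d2))
        len1 = math.sqrt(sum(a * a for a in d1))
        len2 = math.sqrt(sum(b * b for b in d2))
        return dot / (len1 * len2)

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
    print(round(cosine(d1, d2), 4))   # 0.315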

17/09/2020
Extended Jaccard Coefficient (Tanimoto)

 Variation of Jaccard for continuous or count attributes
   – Reduces to Jaccard for binary attributes

    EJ(x, y) = (x · y) / ( ||x||² + ||y||² − x · y )

17/09/2020
Correlation measures the linear relationship between objects:

    corr(x, y) = covariance(x, y) / ( std(x) · std(y) ) = s_xy / (s_x s_y)

17/09/2020
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

17/09/2020
Drawback of Correlation

 x = (-3, -2, -1, 0, 1, 2, 3)


 y = (9, 4, 1, 0, 1, 4, 9)

y i = x i2

 mean(x) = 0, mean(y) = 4
 std(x) = 2.16, std(y) = 3.74

 corr = [ (−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5) ] / ( 6 × 2.16 × 3.74 ) = 0
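A one-line NumPy check (a sketch, not from the slides) that correlation misses this perfect nonlinear relationship:

    import numpy as np

    x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
    y = x ** 2   # perfect, but nonlinear, relationship

    print(np.corrcoef(x, y)[0, 1])   # 0.0 -- correlation does not detect y = x^2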

17/09/2020
Comparison of Proximity Measures

 Domain of application
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
 However, one can talk about various properties that
you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
 The measure must be applicable to the data and
produce results that agree with domain knowledge
17/09/2020
Information Based Measures

 Information theory is a well-developed and


fundamental discipline with broad applications

 Some similarity measures are based on


information theory
– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related
measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute

17/09/2020
Information and Probability

 Information relates to possible outcomes of an event


– transmission of a message, flip of a coin, or measurement
of a piece of data

 The more certain an outcome, the less information


that it contains and vice-versa
– For example, if a coin has two heads, then an outcome of
heads provides no information
– More quantitatively, the information is related the
probability of an outcome
 The smaller the probability of an outcome, the more information it
provides and vice-versa
– Entropy is the commonly used measure
17/09/2020
Entropy

 For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X , H(X), is given by
    H(X) = − Σ_{i=1}^{n} p_i log₂ p_i

 Entropy is between 0 and log2n and is measured in


bits
– Thus, entropy is a measure of how many bits it takes to
represent an observation of X on average
17/09/2020
Entropy Examples

 For a coin with probability p of heads and


probability q = 1 – p of tails
    H = − p log₂ p − q log₂ q
– For p= 0.5, q = 0.5 (fair coin) H = 1
– For p = 1 or q = 1, H = 0

 What is the entropy of a fair four-sided die?

17/09/2020
Entropy for Sample Data: Example

   Hair Color   Count   p      −p log₂ p
   Black        75      0.75   0.3113
   Brown        15      0.15   0.4105
   Blond        5       0.05   0.2161
   Red          0       0.00   0
   Other        5       0.05   0.2161
   Total        100     1.0    1.1540

Maximum entropy is log₂ 5 = 2.3219
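A quick check of the entropy computation above (a sketch, not part of the slides):

    import math

    counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
    m = sum(counts.values())

    # H(X) = -sum_i (m_i/m) log2(m_i/m); zero-count categories contribute 0
    H = -sum((c / m) * math.log2(c / m) for c in counts.values() if c > 0)
    print(round(H, 4))              # 1.154
    print(round(math.log2(5), 4))   # 2.3219  (maximum entropy for 5 categories)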

17/09/2020
Entropy for Sample Data

 Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– And the number of observation in the ith category is mi
– Then, for this sample
    H(X) = − Σ_{i=1}^{n} (m_i / m) log₂ (m_i / m)

 For continuous data, the calculation is harder


17/09/2020
Mutual Information

 Information one variable provides about another

 Formally, I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy of X and Y:

    H(X, Y) = − Σ_i Σ_j p_ij log₂ p_ij

 where p_ij is the probability that the ith value of X and the jth value of Y occur together

 For discrete variables, this is easy to compute

 Maximum mutual information for discrete variables is


log₂(min(nX, nY)), where nX (nY) is the number of values of X (Y)

17/09/2020
Mutual Information Example

   Student Status   Count   p      −p log₂ p
   Undergrad        45      0.45   0.5184
   Grad             55      0.55   0.4744
   Total            100     1.00   0.9928

   Grade   Count   p      −p log₂ p
   A       35      0.35   0.5301
   B       50      0.50   0.5000
   C       15      0.15   0.4105
   Total   100     1.00   1.4406

   Student Status   Grade   Count   p      −p log₂ p
   Undergrad        A       5       0.05   0.2161
   Undergrad        B       30      0.30   0.5211
   Undergrad        C       10      0.10   0.3322
   Grad             A       30      0.30   0.5211
   Grad             B       20      0.20   0.4644
   Grad             C       5       0.05   0.2161
   Total                    100     1.00   2.2710

Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
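A sketch (not from the slides) that reproduces the mutual-information value from the counts above:

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    status = {"Undergrad": 45, "Grad": 55}
    grade = {"A": 35, "B": 50, "C": 15}
    joint = {("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
             ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5}
    n = 100

    H_status = H([c / n for c in status.values()])   # 0.9928
    H_grade = H([c / n for c in grade.values()])     # 1.4406
    H_joint = H([c / n for c in joint.values()])     # 2.2710

    print(round(H_status + H_grade - H_joint, 4))    # 0.1624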

17/09/2020
Maximal Information Coefficient
 Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter
J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel
associations in large data sets." science 334, no. 6062 (2011): 1518-1524.

 Applies mutual information to two continuous


variables
 Consider the possible binnings of the variables into
discrete categories
– nX × nY ≤ N^0.6, where
 nX is the number of values of X
 nY is the number of values of Y
 N is the number of samples (observations, data objects)
 Compute the mutual information
– Normalized by log2(min( nX, nY )
 Take the highest value
17/09/2020
General Approach for Combining Similarities

 Sometimes attributes are of many different types, but an


overall similarity is needed.
1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0,
           or if one of the objects has a missing value for the kth attribute
   δ_k = 1 otherwise
3: Compute

    similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k

17/09/2020
Using Weights to Combine Similarities

 May not want to treat all attributes the same.


– Use non-negative weights ω_k

    similarity(x, y) = Σ_{k=1}^{n} ω_k δ_k s_k(x, y) / Σ_{k=1}^{n} ω_k δ_k

 Can also define a weighted form of distance

17/09/2020
Density

 Measures the degree to which data objects are close to


each other in a specified area
 The notion of density is closely related to that of proximity
 Concept of density is typically used for clustering and
anomaly detection
 Examples:
– Euclidean density
 Euclidean density = number of points per unit volume
– Probability density
 Estimate what the distribution of the data looks like
– Graph-based density
 Connectivity

17/09/2020
Euclidean Density: Grid-based Approach

 Simplest approach is to divide region into a


number of rectangular cells of equal volume and
define density as # of points the cell contains

Grid-based density. Counts for each cell.


17/09/2020
Euclidean Density: Center-Based

 Euclidean density is the number of points within a


specified radius of the point

Illustration of center-based density.


17/09/2020
Data Preprocessing

 Aggregation
 Sampling
 Dimensionality Reduction
 Feature subset selection
 Feature creation
 Discretization and Binarization
 Attribute Transformation

17/09/2020
Aggregation

 Combining two or more attributes (or objects) into


a single attribute (or object)

 Purpose
– Data reduction
 Reduce the number of attributes or objects
– Change of scale
 Cities aggregated into regions, states, countries, etc.
 Days aggregated into weeks, months, or years

– More “stable” data


 Aggregated data tends to have less variability
17/09/2020
Example: Precipitation in Australia

 This example is based on precipitation in


Australia from the period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average
monthly precipitation for 3,030 0.5◦ by 0.5◦ grid cells in
Australia, and
– A histogram for the standard deviation of the average
yearly precipitation for the same locations.
 The average yearly precipitation has less
variability than the average monthly precipitation.
 All precipitation measurements (and their
standard deviations) are in centimeters.
17/09/2020
Example: Precipitation in Australia …

Variation of Precipitation in Australia

Standard Deviation of Average Standard Deviation of


Monthly Precipitation Average Yearly Precipitation
17/09/2020
Sampling
 Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary investigation of
the data and the final data analysis.

 Statisticians often sample because obtaining the


entire set of data of interest is too expensive or
time consuming.

 Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.

17/09/2020
Sampling …

 The key principle for effective sampling is the


following:

– Using a sample will work almost as well as using the


entire data set, if the sample is representative

– A sample is representative if it has approximately the


same properties (of interest) as the original set of data

17/09/2020
Sample Size

8000 points 2000 Points 500 Points

17/09/2020
Types of Sampling
 Simple Random Sampling
– There is an equal probability of selecting any particular
item
– Sampling without replacement
 As each item is selected, it is removed from the
population
– Sampling with replacement
 Objects are not removed from the population as they
are selected for the sample.
 In sampling with replacement, the same object can
be picked up more than once
 Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition

17/09/2020
Sample Size
 What sample size is necessary to get at least one
object from each of 10 equal-sized groups.

17/09/2020
Curse of Dimensionality

 When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies

 Definitions of density and


distance between points,
which are critical for
clustering and outlier
detection, become less
meaningful • Randomly generate 500 points
• Compute difference between max and
min distance between any pair of points
17/09/2020
Dimensionality Reduction

 Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise

 Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques

17/09/2020
Dimensionality Reduction: PCA

 Goal is to find a projection that captures the


largest amount of variation in data
x2

x1
17/09/2020
Dimensionality Reduction: PCA

17/09/2020
Feature Subset Selection

 Another way to reduce dimensionality of data


 Redundant features
– Duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
 Irrelevant features
– Contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
 Many techniques developed, especially for
classification
17/09/2020
Feature Creation

 Create new attributes that can capture the


important information in a data set much more
efficiently than the original attributes

 Three general methodologies:


– Feature extraction
 Example: extracting edges from images
– Feature construction
 Example: dividing mass by volume to get density
– Mapping data to new space
 Example: Fourier and wavelet analysis

17/09/2020
Mapping Data to a New Space

 Fourier and wavelet transform

Frequency

Two Sine Waves + Noise Frequency

17/09/2020
Discretization

 Discretization is the process of converting a


continuous attribute into an ordinal attribute
– A potentially infinite number of values are mapped
into a small number of categories
– Discretization is commonly used in classification
– Many classification algorithms work best if both
the independent and dependent variables have
only a few values
– We give an illustration of the usefulness of
discretization using the Iris data set

17/09/2020
Iris Sample Data Set

 Iris Plant data set.


– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Douglas Fisher
– Three flower types (classes):
 Setosa
 Versicolour
 Virginica
– Four (non-class) attributes
 Sepal width and length
 Petal width and length Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
17/09/2020
Discretization: Iris Example

Petal width low or petal length low implies Setosa.


Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …

 How can we tell what the best discretization is?


– Unsupervised discretization: find breaks in the data values
   Example: histogram of Petal Length (counts vs. petal length)

– Supervised discretization: Use class labels to find


breaks
17/09/2020
Discretization Without Using Class Labels

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.

17/09/2020
Discretization Without Using Class Labels

Equal interval width approach used to obtain 4 values.

17/09/2020
Discretization Without Using Class Labels

Equal frequency approach used to obtain 4 values.

17/09/2020
Discretization Without Using Class Labels

K-means approach to obtain 4 values.

17/09/2020
Binarization

 Binarization maps a continuous or categorical


attribute into one or more binary variables

 Typically used for association analysis

 Often convert a continuous attribute to a


categorical attribute and then convert a
categorical attribute to a set of binary attributes
– Association analysis needs asymmetric binary
attributes
– Examples: eye color and height measured as
{low, medium, high}
17/09/2020
Attribute Transformation

 An attribute transform is a function that maps the


entire set of values of a given attribute to a new
set of replacement values such that each old
value can be identified with one of the new values
– Simple functions: xk, log(x), ex, |x|
– Normalization
 Refers to various techniques to adjust to
differences among attributes in terms of frequency
of occurrence, mean, variance, range
 Take out unwanted, common signal, e.g.,
seasonality
– In statistics, standardization refers to subtracting off
the means and dividing by the standard deviation
17/09/2020
Example: Sample Time Series of Plant Growth
Minneapolis

Net Primary
Production (NPP)
is a measure of
plant growth used
by ecosystem
scientists.

Correlations between time series


Correlations between time series
Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.7591 -0.7581
Atlanta 0.7591 1.0000 -0.5739
Sao Paolo -0.7581 -0.5739 1.0000

17/09/2020
Seasonality Accounts for Much Correlation
Minneapolis
Normalized using
monthly Z Score:
Subtract off monthly
mean and divide by
monthly standard
deviation

Correlations between time series


Correlations between time series
Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.0492 0.0906
Atlanta 0.0492 1.0000 -0.0154
Sao Paolo 0.0906 -0.0154 1.0000
17/09/2020
Data Warehousing and Mining
(CS6312)
From August 12, 2019:TC Slot (Room No. PPA)

10.00-11.00 (Monday)
09.00-10.00 (Tuesday)
08.00-09.00 (Thursday)

Computing Gini Index for a Collection of
Nodes
When a node p is split into k partitions (children)
    GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)

where, ni = number of records at child i,


n = number of records at parent node p.

Choose the attribute that minimizes weighted average


Gini index of the children

Gini index is used in decision tree algorithms such as


CART, SLIQ, SPRINT
Binary Attributes: Computing GINI Index

Splits into two partitions (child nodes)

Effect of weighing partitions:
– Larger and purer partitions are sought for.

Split on B? (Yes → Node N1, No → Node N2)

Parent: C1 = 7, C2 = 5, Gini = 0.486

        N1   N2
  C1    5    2
  C2    1    4

Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444

Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
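A small sketch (not from the slides) that reproduces these Gini numbers:

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    parent = [7, 5]             # C1, C2 at the parent
    n1, n2 = [5, 1], [2, 4]     # class counts in the two children of split B?

    g1, g2 = gini(n1), gini(n2)
    weighted = 6 / 12 * g1 + 6 / 12 * g2

    print(round(gini(parent), 3))              # 0.486
    print(round(g1, 3), round(g2, 3))          # 0.278 0.444
    print(round(weighted, 3))                  # 0.361
    print(round(gini(parent) - weighted, 3))   # gain = 0.125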
4/28/2021 28
Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions

Multi-way split:
             CarType
          Family  Sports  Luxury
   C1       1       8       1
   C2       3       0       7
   Gini   0.163

Two-way split (find best partition of values):
               CarType                          CarType
        {Sports, Luxury}  {Family}       {Sports}  {Family, Luxury}
   C1          9             1               8            2
   C2          7             3               0           10
   Gini      0.468                         0.167

Which of these is the best?

4/28/2021 29
Continuous Attributes: Computing Gini Index

Use binary decisions based on one value

   ID  Home Owner  Marital Status  Annual Income  Defaulted
    1  Yes         Single          125K           No
    2  No          Married         100K           No
    3  No          Single          70K            No
    4  Yes         Married         120K           No
    5  No          Divorced        95K            Yes
    6  No          Married         60K            No
    7  Yes         Divorced        220K           No
    8  No          Single          85K            Yes
    9  No          Married         75K            No
   10  No          Single          90K            Yes

Several choices for the splitting value
– Number of possible splitting values = number of distinct values

Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v

Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

   Annual Income?      ≤ 80   > 80
   Defaulted = Yes       0      3
   Defaulted = No        3      4

4/28/2021 30
Continuous Attributes: Computing Gini Index...

For efficient computation: for each attribute,


– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix
and computing gini index
– Choose the split position that has the least gini index

Cheat             No    No    No    Yes   Yes   Yes   No    No    No    No
Annual Income
(sorted values)   60    70    75    85    90    95    100   120   125   220

Split positions   55    65    72    80    87    92    97    110   122   172   230
                  ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>
Yes               0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No                0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini              0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
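A sketch (not from the slides) of the sorted linear scan; candidate positions are taken as midpoints between consecutive sorted values, which the slide's table rounds:

    income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    n = len(income)

    def gini(yes, no):
        tot = yes + no
        return 0.0 if tot == 0 else 1.0 - (yes / tot) ** 2 - (no / tot) ** 2

    total_yes, total_no = cheat.count("Yes"), cheat.count("No")

    # Candidate positions: below the minimum, between consecutive values, above the maximum
    positions = [income[0] - 5] + [(a + b) / 2 for a, b in zip(income, income[1:])] + [income[-1] + 10]

    best = None
    left_yes = left_no = 0
    for i, v in enumerate(positions):
        # left partition = first i records (Annual Income <= v), right = the remaining records
        right_yes, right_no = total_yes - left_yes, total_no - left_no
        w_gini = ((left_yes + left_no) / n) * gini(left_yes, left_no) \
                 + ((right_yes + right_no) / n) * gini(right_yes, right_no)
        if best is None or w_gini < best[1]:
            best = (v, w_gini)
        if i < n:                          # move one record into the left partition
            if cheat[i] == "Yes":
                left_yes += 1
            else:
                left_no += 1

    print(best)   # (97.5, 0.3) -- matches the minimum Gini of 0.300 in the table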

4/28/2021 31
Measure of Impurity: Entropy

Entropy at a given node t:

    Entropy(t) = − Σ_j p(j | t) log p(j | t)

(NOTE: p(j | t) is the relative frequency of class j at node t.)

 Maximum (log nc) when records are equally distributed


among all classes implying least information
 Minimum (0.0) when all records belong to one class,
implying most information

– Entropy based computations are quite similar to


the GINI index computations
4/28/2021 37
Gain Ratio

Gain Ratio:

    GainRatio_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σ_{i=1}^{k} (n_i / n) log (n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i

             CarType                      CarType                          CarType
       Family Sports Luxury       {Sports,Luxury}  {Family}         {Sports}  {Family,Luxury}
  C1     1      8      1                9             1                 8           2
  C2     3      0      7                7             3                 0          10
  Gini        0.163                        0.468                           0.167
  SplitINFO = 1.52                SplitINFO = 0.72                  SplitINFO = 0.97
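A short sketch (not from the slides) that reproduces the three SplitINFO values from the partition sizes:

    import math

    def split_info(partition_sizes):
        n = sum(partition_sizes)
        return -sum((ni / n) * math.log2(ni / n) for ni in partition_sizes if ni > 0)

    # Partition sizes (C1 + C2) for the three CarType splits above
    print(round(split_info([4, 8, 8]), 2))   # 1.52  multi-way: {Family}, {Sports}, {Luxury}
    print(round(split_info([16, 4]), 2))     # 0.72  {Sports,Luxury} vs {Family}
    print(round(split_info([8, 12]), 2))     # 0.97  {Sports} vs {Family,Luxury}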

4/28/2021 42
Misclassification Error vs Gini Index

Split on A? (Yes → Node N1, No → Node N2)

Parent: C1 = 7, C2 = 3, Gini = 0.42

        N1   N2
  C1    3    4
  C2    0    3
  Gini(Children) = 0.342

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same!
4/28/2021 45
Misclassification Error vs Gini Index

Split on A? (Yes → Node N1, No → Node N2)

Parent: C1 = 7, C2 = 3, Gini = 0.42

        N1   N2                  N1   N2
  C1    3    4             C1    3    4
  C2    0    3             C2    1    2
  Gini = 0.342             Gini = 0.416

Misclassification error for all three cases = 0.3!

4/28/2021 46
Comparison among Impurity Measures

For a 2-class problem:

4/28/2021 47
Decision Tree Based Classification
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
4/28/2021 Introduction to Data Mining, 2nd Edition 50
Outline
Distance Based Classification Algorithms

Distance Based Classification Algorithms

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Nearest Neighbor Classifier(Cover and Hart, 1967)

Let T = {(x_i, y_i)}_{i=1}^{n} be a set of labeled patterns (the training set),
where x_i is a pattern and y_i is its class label. Let x be a
pattern with unknown class label (a test pattern).
NN Rule is :
Let x ′ ∈ T be the pattern nearest to a test pattern x.
l (x) = l (x ′ )
Complexity:
Time: O(|T |)
Space: O(|T |)

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Condensed NNC (Hart, 1968)

INPUT: Training Set T


OUTPUT: A condensed set S.
1 Start with a condensed set S = {x}.

2 For each x ∈ T \ S
1 Classify x using NN considering S as training set.

2 if x is misclassified then S = S ∪ {x}


3 Repeat Step 2 until no change found in Condensed Set
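A minimal sketch of Hart's CNN procedure above, assuming Euclidean distance and starting from the first training pattern:

    import numpy as np

    def condensed_nn(X, y):
        """Keep a subset S that classifies every training pattern correctly with 1-NN."""
        S = [0]                                   # start with one (arbitrary) pattern
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                if i in S:
                    continue
                # 1-NN classification of X[i] using S as the training set
                nearest = min(S, key=lambda j: np.linalg.norm(X[i] - X[j]))
                if y[nearest] != y[i]:            # misclassified -> add to the condensed set
                    S.append(i)
                    changed = True
        return S

    # Tiny illustration with two well-separated classes
    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
    y = np.array([0, 0, 0, 1, 1, 1])
    print(condensed_nn(X, y))   # indices of the condensed set, e.g. [0, 3]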

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Modified CNN(Devi and Murthy, 2003)


1 Start with condensed set S. S contains one pattern from each
class.

2 G=∅
3 For each x ∈ T
1 Classify x using NN considering S as training set.

2 if x is misclassified then G = G ∪ {x}


4 Find a representative pattern from each class in G; Let
representative set is R.

5 S =S ∪R
6 G=∅
7 Repeat Step 2 to Step 6 until there is no misclassification.
Dr. Bidyut Kr. Patra Classification Technique (Contd..)
Outline
Distance Based Classification Algorithms

MCNN is an order-independent algorithm, but it may require many iterations to converge.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

K-Nearest Neighbor Classifier

Let T = {(xi , yi )}ni=1 be a Training set.


Let x be a pattern with unknown class label (test pattern).

Algorithm:
  KNN = ∅
  For each t ∈ T
    1  if |KNN| < K then
         KNN = KNN ∪ {t}
    2  else
         1  Find an x′ ∈ KNN such that dist(x, x′) > dist(x, t) (if one exists)
         2  KNN = KNN − {x′};  KNN = KNN ∪ {t}
The pattern x is assigned to the class to which most of the
patterns in KNN belong.
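A compact sketch of the K-NN rule (it sorts all distances with NumPy rather than maintaining the incremental set described above):

    import numpy as np
    from collections import Counter

    def knn_classify(T_X, T_y, x, K=3):
        """Majority vote among the K training patterns closest to x."""
        dists = np.linalg.norm(T_X - x, axis=1)     # distance from x to every training pattern
        knn_idx = np.argsort(dists)[:K]             # indices of the K nearest neighbours
        votes = Counter(T_y[i] for i in knn_idx)
        return votes.most_common(1)[0][0]           # class with the most votes

    T_X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]], dtype=float)
    T_y = np.array(["A", "A", "A", "B", "B", "B"])
    print(knn_classify(T_X, T_y, np.array([2.0, 2.0]), K=3))   # 'A'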

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

How to find the value of K

r-fold Cross Validation

  Partition the training set into r blocks; let these be T_1, T_2, ..., T_r.
  For i = 1 to r do
    1  Consider T − T_i as the training set and T_i as the validation set.
    2  For a range of K values (say from 1 to m), find the error rates on the validation set.
    3  Let these error rates be e_{i1}, e_{i2}, ..., e_{im}.
  Take ē_j = mean of {e_{1j}, e_{2j}, ..., e_{rj}}, for j = 1 to m.
  The value of K = argmin_j {ē_1, ē_2, ..., ē_m}.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Weighted k-NNC(Dudani, 1976)


k-NNC gives equal importance to the first NN and to the last NN.
Instead, a weight is assigned to each nearest neighbour of a query pattern q.
Let X = {x_1, x_2, ..., x_k} ⊆ T be the set of k nearest neighbours of q, whose
class label is to be determined.
Let D = {d_1, d_2, ..., d_k} be an ordered set, where d_i = ||x_i − q|| and d_i ≤ d_j for i < j.
The weight w_j assigned to the jth nearest neighbour is

    w_j = (d_k − d_j) / (d_k − d_1)   if d_j ≠ d_1
    w_j = 1                           if d_j = d_1

Calculate the weighted sum of the neighbours belonging to each class.
Classify q to the class for which the weighted sum is maximum.
Dr. Bidyut Kr. Patra Classification Technique (Contd..)
Outline
Distance Based Classification Algorithms

Editing Techniques

The larger the training set, the higher the computational cost.

Another class of techniques eliminates (edits) training prototypes
(patterns) that are erroneously labeled, commonly outliers, and at the
same time "cleans" the possible overlapping between
regions of different classes.

Wilson's editing relies on the idea that, if a prototype is
erroneously classified using the k-NN rule, it has to be eliminated
from the training set.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Edited Nearest Neighbor (Willson, 1976)

INPUT: T is a training set


OUTPUT: S is an edited set
1 for each x ∈ T do
1 classify x using k-NN Classifier (break the ties randomly)
2 if x is misclassified then mark x
2 Delete all marked patterns from T; let the reduced training set
be S.
3 Output S

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Repeated ENN (Tomek, 1976)

Apply the ENN method repeatedly until there is no change in the training set T.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Confusion Matrix and ROC Plot

 Confusion Matrix

 Receiver Operating Characteristic (ROC)

1
Which model is better?

PREDICTED
A Class=Yes Class=No
ACTUAL Class=Yes 0 10
Class=No 0 990

PREDICTED
B Class=Yes Class=No
ACTUAL Class=Yes 10 0
Class=No 90 900

2
Which model is better?

PREDICTED
A Class=Yes Class=No
ACTUAL Class=Yes 5 5
Class=No 0 990

PREDICTED
B Class=Yes Class=No
ACTUAL Class=Yes 10 0
Class=No 90 900

3
Alternative Measures

PREDICTED CLASS
Class=Yes Class=No

Class=Yes a b
ACTUAL
CLASS Class=No c d

    Precision (p) = a / (a + c)

    Recall (r) = a / (a + b)

    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
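A small helper (a sketch, not from the slides) that computes these measures from the a, b, c, d counts above:

    def prf(a, b, c, d):
        """a = TP, b = FN, c = FP, d = TN (layout of the confusion matrix above)."""
        precision = a / (a + c)
        recall = a / (a + b)
        f = 2 * recall * precision / (recall + precision)
        accuracy = (a + d) / (a + b + c + d)
        return precision, recall, f, accuracy

    # Example: 40 TP, 10 FN, 10 FP, 40 TN (as in a later slide)
    print(prf(40, 10, 10, 40))   # precision 0.8, recall 0.8, F 0.8, accuracy 0.8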
4
Alternative Measures
10
PREDICTED CLASS Precision (p) = = 0.5
10 + 10
10
Class=Yes Class=No Recall (r) = =1
10 + 0
Class=Yes 10 0 2 *1* 0.5
ACTUAL F - measure (F) = = 0.62
CLASS Class=No 10 980 1 + 0.5
990
Accuracy = = 0.99
1000

5
Alternative Measures
10
PREDICTED CLASS Precision (p) = = 0.5
10 + 10
10
Class=Yes Class=No Recall (r) = =1
10 + 0
Class=Yes 10 0 2 *1* 0.5
ACTUAL F - measure (F) = = 0.62
CLASS Class=No 10 980 1 + 0.5
990
Accuracy = = 0.99
1000

PREDICTED CLASS 1
Precision (p) = =1
1+ 0
Class=Yes Class=No
1
Recall (r) = = 0.1
Class=Yes 1 9 1+ 9
ACTUAL 2 * 0.1*1
CLASS Class=No 0 990 F - measure (F) = = 0.18
1 + 0.1
991
Accuracy = = 0.991
1000
6
Alternative Measures

PREDICTED CLASS
Precision (p) = 0.8
Class=Yes Class=No
Recall (r) = 0.8
Class=Yes 40 10 F - measure (F) = 0.8
ACTUAL
CLASS Class=No 10 40 Accuracy = 0.8

7
Alternative Measures

PREDICTED CLASS
Precision (p) = 0.8
Class=Yes Class=No
Recall (r) = 0.8
A Class=Yes 40 10 F - measure (F) = 0.8
ACTUAL
CLASS Class=No 10 40 Accuracy = 0.8

PREDICTED CLASS
B Class=Yes Class=No Precision (p) =~ 0.04
Class=Yes 40 10 Recall (r) = 0.8
ACTUAL F - measure (F) =~ 0.08
CLASS Class=No 1000 4000
Accuracy =~ 0.8

8
Measures of Classification Performance

PREDICTED CLASS
Yes No
ACTUAL
Yes TP FN
CLASS
No FP TN

α is the probability that we reject


the null hypothesis when it is
true. This is a Type I error or a
false positive (FP).

β is the probability that we


accept the null hypothesis when
it is false. This is a Type II error
or a false negative (FN).

9
Alternative Measures

PREDICTED CLASS Precision (p) = 0.8


TPR = Recall (r) = 0.8
Class=Yes Class=No FPR = 0.2
F−measure (F) = 0.8
Class=Yes 40 10 Accuracy = 0.8
ACTUAL
CLASS Class=No 10 40
TPR
=4
FPR

PREDICTED CLASS Precision (p) = 0.038


TPR = Recall (r) = 0.8
Class=Yes Class=No
FPR = 0.2
Class=Yes 40 10 F−measure (F) = 0.07
ACTUAL Accuracy = 0.8
CLASS Class=No 1000 4000
TPR
=4
FPR

10
Alternative Measures

PREDICTED CLASS
Class=Yes Class=No
Precision (p) = 0.5
Class=Yes 10 40
TPR = Recall (r) = 0.2
ACTUAL
Class=No 10 40
FPR = 0.2
CLASS
F − measure = 0.28

PREDICTED CLASS
Precision (p) = 0.5
Class=Yes Class=No
TPR = Recall (r) = 0.5
Class=Yes 25 25
ACTUAL FPR = 0.5
Class=No 25 25
CLASS F − measure = 0.5

PREDICTED CLASS Precision (p) = 0.5


Class=Yes Class=No
TPR = Recall (r) = 0.8
Class=Yes 40 10
ACTUAL FPR = 0.8
Class=No 40 10
CLASS
F − measure = 0.61
11
ROC (Receiver Operating Characteristic)

 A graphical approach for displaying trade-off


between detection rate and false alarm rate
 Developed in 1950s for signal detection theory to
analyze noisy signals
 ROC curve plots TPR against FPR
– Performance of a model represented as a
point in an ROC curve
– Changing the threshold parameter of classifier
changes the location of the point

12
ROC Curve

(TPR,FPR):
 (0,0): declare everything
to be negative class
 (1,1): declare everything
to be positive class
 (1,0): ideal

 Diagonal line:
– Random guessing
– Below diagonal line:
 prediction is opposite
of the true class

13
ROC (Receiver Operating Characteristic)

 To draw ROC curve, classifier must produce


continuous-valued output
– Outputs are used to rank test records, from the most
likely positive class record to the least likely positive
class record

 Many classifiers produce only discrete outputs (i.e.,


predicted class)
– How to get continuous-valued outputs?
 Decision trees, rule-based classifiers, neural networks,
Bayesian classifiers, k-nearest neighbors, SVM

14
ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any points located at x > t is classified as positive

At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
15
Using ROC for Model Comparison

 No model consistently
outperforms the other
 M1 is better for
small FPR
 M2 is better for
large FPR

 Area Under the ROC


curve
 Ideal:
 Area =1
 Random guess:
 Area = 0.5

16
How to Construct an ROC curve

• Use a classifier that produces a continuous-valued score for each instance
  – The more likely it is for the instance to be in the + class, the higher the score
• Sort the instances in decreasing order according to the score
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
  – TPR = TP / (TP + FN)
  – FPR = FP / (FP + TN)

   Instance   Score   True Class
      1       0.95        +
      2       0.93        +
      3       0.87        -
      4       0.85        -
      5       0.85        -
      6       0.85        +
      7       0.76        -
      8       0.53        +
      9       0.43        -
     10       0.25        +

17
How to construct an ROC curve
Class + - + - - - + - + +
Threshold >= 0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00

TP 5 4 4 3 3 3 3 2 2 1 0

FP 5 5 4 4 3 2 1 1 0 0 0

TN 0 0 1 1 2 3 4 4 5 5 5

FN 0 1 1 2 2 2 2 3 3 4 5

TPR 1 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.2 0

FPR 1 1 0.8 0.8 0.6 0.4 0.2 0.2 0 0 0

ROC Curve:
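A small sketch (not from the slides) that reproduces the TPR/FPR values of the table for each unique threshold:

    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    P, N = labels.count('+'), labels.count('-')

    # One threshold per unique score, plus 1.00 above the maximum score
    thresholds = sorted(set(scores) | {1.00}, reverse=True)
    for t in thresholds:
        tp = sum(s >= t and y == '+' for s, y in zip(scores, labels))
        fp = sum(s >= t and y == '-' for s, y in zip(scores, labels))
        print(f"threshold >= {t:.2f}   TPR={tp / P:.1f}   FPR={fp / N:.1f}")
    # Plotting TPR against FPR for these thresholds gives the ROC curve.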

18
Classification Technique (Contd..)
Outline

Bayesian Classifier
Classification Technique (Contd..)
Bayesian Classifier

Bayes Rule

Relationship between the attribute set and the class is


non-deterministic.
A doctor knows that meningitis causes stiff neck 50% of the
time.
1 Prior probability of any patient having meningitis is 1/50, 000
2 Prior probability of any patient having stiff neck is 1/20

Bayesian Classifier or Bayes Rule models probabilistic


relationships between the features and classes.
Classification Technique (Contd..)
Bayesian Classifier

Conditional Probability

Let A and C be two random variable.


A conditional probability is the probability that a random
variable will take a particular value given that value of another
random variable is known.
Conditional probability:

    P(A | C) = P(A, C) / P(C)

    P(C | A) = P(A, C) / P(A)
Classification Technique (Contd..)
Bayesian Classifier

Bayes Theorem, 1763

    P(C | A) = P(A | C) P(C) / P(A)

If a patient has a stiff neck, what is the probability that s/he has meningitis?

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
Classification Technique (Contd..)
Bayesian Classifier

How to use Bayes Rule in Classification

Consider a two-class problem with classes ω_1, ω_2.
Given a pattern X = (a_1, a_2, a_3, ..., a_m), the goal is to predict the class label (ω_1 / ω_2) of the pattern X.

Compute:   P(ω_1 | X) = P(X | ω_1) × P(ω_1) / P(X)
Compute:   P(ω_2 | X) = P(X | ω_2) × P(ω_2) / P(X)

Assign class label ω_1 if P(ω_1 | X) > P(ω_2 | X), otherwise ω_2.
Equivalently, choose the class C which maximizes P(X | C) P(C).
Classification Technique (Contd..)
Bayesian Classifier

Naive Bayes Classifier

Let T be a training set and X = (a_1, a_2, a_3, ..., a_m) be an unknown object.
Suppose there are k classes, C_1, C_2, ..., C_k.
Naive Bayes predicts class C_i if and only if P(C_i | X) > P(C_j | X) for 1 ≤ j ≤ k, j ≠ i.

    P(C_i | X) = P(X | C_i) × P(C_i) / P(X)

How to estimate P(C_i)?

    Prior probability P(C_i) = |C_{i,T}| / |T|,
    where |C_{i,T}| = number of training patterns in class C_i
Classification Technique (Contd..)
Bayesian Classifier

Naive Bayes (Contd..)

How to estimate P(X | C_i) = P(a_1, a_2, a_3, ..., a_m | C_i)?
Assume the features are independent given the class:

    P(a_1, a_2, a_3, ..., a_m | C_i) = P(a_1 | C_i) × P(a_2 | C_i) × ... × P(a_m | C_i) = Π_j P(a_j | C_i)

1  If A_j is categorical, then

    P(a_j | C_i) = (# patterns in C_i with value a_j) / |C_{i,T}|
Classification Technique (Contd..)
Bayesian Classifier

How to Estimate Probabilities from Data?

If A_j is continuous-valued, the attribute is assumed to follow a Gaussian distribution:

    G(x, µ, σ) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )

    P(a_j | C_i) = G(a_j, µ_{C_i}, σ_{C_i})
Classification Technique (Contd..)
Bayesian Classifier

How to Estimate Probabilities from Data?

Table: Training Set (Source: Tan, Kumar and Steinbach, Introduction to Data Mining)

Tid Refund Marital Income Defaulter


Status (Class)
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

    P(a_j | C_i) = (1 / (√(2π) σ)) exp( −(a_j − µ)² / (2σ²) )
Classification Technique (Contd..)
Bayesian Classifier

How to Estimate Probabilities from Data?

P(Income | Class = No):
  1  Sample mean = 110, sample variance = 2975 (standard deviation ≈ 54.54)

    P(Income = 120K | No) = (1 / (√(2π) × 54.54)) exp( −(120 − 110)² / (2 × 2975) ) = 0.0072

X = (Refund = No, Married, Income = 120K). What will be the class label of X?

    P(X | Class = No) = P(Refund = No | Class = No) × P(Married | Class = No) × P(Income = 120 | No)
                      = 4/7 × 4/7 × 0.0072 = 0.0024

    P(X | Class = Yes) = P(Refund = No | Class = Yes) × P(Married | Class = Yes) × P(Income = 120 | Yes) = 0

Since P(X | Class = No) × P(Class = No) > P(X | Class = Yes) × P(Class = Yes), X is assigned to Class = No.
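A sketch of the same Naive Bayes computation on the ten training records above (categorical attributes by frequency counts, Annual Income by a Gaussian):

    import math

    refund  = ["Yes","No","No","Yes","No","No","Yes","No","No","No"]
    marital = ["Single","Married","Single","Married","Divorced","Married",
               "Divorced","Single","Married","Single"]
    income  = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    label   = ["No","No","No","No","Yes","No","No","Yes","No","Yes"]

    def categorical_prob(attr, value, cls):
        idx = [i for i, c in enumerate(label) if c == cls]
        return sum(attr[i] == value for i in idx) / len(idx)

    def gaussian_prob(x, cls):
        vals = [income[i] for i, c in enumerate(label) if c == cls]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)   # sample variance
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    # X = (Refund = No, Married, Income = 120K)
    for cls in ["No", "Yes"]:
        prior = label.count(cls) / len(label)
        likelihood = (categorical_prob(refund, "No", cls)
                      * categorical_prob(marital, "Married", cls)
                      * gaussian_prob(120, cls))
        print(cls, likelihood, likelihood * prior)
    # Class "No": P(X|No) ~ 0.0023 (the slide's 0.0024 uses the rounded 0.0072); class "Yes": 0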
Data Mining

Chapter 4

Artificial Neural Networks

Introduction to Data Mining , 2nd Edition

10/12/2020 1
Artificial Neural Networks (ANN)

 Basic Idea: A complex non-linear function can be


learned as a composition of simple processing units
 ANN is a collection of simple processing units
(nodes) that are connected by directed links (edges)
– Every node receives signals from incoming edges,
performs computations, and transmits signals to
outgoing edges
– Analogous to human brain where nodes are neurons
and signals are electrical impulses
– Weight of an edge determines the strength of
connection between the nodes
– Simplest ANN: Perceptron (single neuron)
10/12/2020 2
Basic Architecture of Perceptron

[Figure: a perceptron. Inputs x_1, x_2, x_3 plus a constant input 1, with weights (w_1, w_2, w_3, b = w_0); the weighted sum is passed through an activation function to produce the output.]

 Learns linear decision boundaries

 Similar to logistic regression (activation function is sign instead of sigmoid)
10/12/2020 3
Perceptron Example

   X1  X2  X3    Y
   1   0   0    -1
   1   0   1     1
   1   1   0     1
   1   1   1     1
   0   0   1    -1
   0   1   0    -1
   0   1   1     1
   0   0   0    -1

[Figure: a black-box model with input nodes X1, X2, X3 and output node Y.]

Output Y is 1 if at least two of the three inputs are equal to 1.

10/12/2020 4
Perceptron Example

   X1  X2  X3    Y
   1   0   0    -1
   1   0   1     1
   1   1   0     1
   1   1   1     1
   0   0   1    -1
   0   1   0    -1
   0   1   1     1
   0   0   0    -1

[Figure: a perceptron with weight 0.3 on each input node and threshold t = 0.4 at the output node.]

    Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4)

    where sign(x) = +1 if x ≥ 0, −1 if x < 0
10/12/2020 5
Perceptron Learning Rule

 Initialize the weights (w0, w1, ..., wd)

 Repeat
   – For each training example (xi, yi)
      Compute the predicted output ŷ_i
      Update the weights:  w_j^{(k+1)} = w_j^{(k)} + λ (y_i − ŷ_i^{(k)}) x_ij

 Until stopping condition is met

 k: iteration number; λ: learning rate

10/12/2020 6
Perceptron Learning Rule

 Weight update formula:  w_j^{(k+1)} = w_j^{(k)} + λ (y_i − ŷ_i^{(k)}) x_ij

 Intuition:
   – Update the weight based on the error e = y − ŷ
   – If y = ŷ, e = 0: no update needed
   – If y > ŷ, e = 2: the weight must be increased so that ŷ will increase
   – If y < ŷ, e = −2: the weight must be decreased so that ŷ will decrease
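A minimal sketch of the perceptron learning rule (assuming the sign activation and the update formula above) on the "at least two inputs equal to 1" data set:

    import numpy as np

    # Truth table: Y = 1 if at least two of X1, X2, X3 are 1, else -1
    X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]], dtype=float)
    y = np.array([-1, 1, 1, 1, -1, -1, 1, -1], dtype=float)

    X = np.hstack([np.ones((len(X), 1)), X])   # prepend a constant input for the bias w0
    w = np.zeros(X.shape[1])
    lam = 0.1                                  # learning rate

    for epoch in range(100):
        for xi, yi in zip(X, y):
            y_hat = 1.0 if xi @ w >= 0 else -1.0
            w = w + lam * (yi - y_hat) * xi    # perceptron weight update
        errors = sum((1.0 if xi @ w >= 0 else -1.0) != yi for xi, yi in zip(X, y))
        if errors == 0:
            break

    print(epoch, w)   # epoch at which all 8 examples are classified correctly, and the learned weights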
10/12/2020 7
Example of Perceptron Learning

  0.1
X1 X2 X3 Y w0 w1 w2 w3 Epoch w0 w1 w2 w3
1 0 0 -1 0 0 0 0 0 0 0 0 0 0
1 0 1 1 1 -0.2 -0.2 0 0 1 -0.2 0 0.2 0.2
2 0 0 0 0.2 2 -0.2 0 0.4 0.2
1 1 0 1
3 0 0 0 0.2
1 1 1 1 3 -0.4 0 0.4 0.2
4 0 0 0 0.2
0 0 1 -1 5 -0.2 0 0 0 4 -0.4 0.2 0.4 0.4
0 1 0 -1 6 -0.2 0 0 0 5 -0.6 0.2 0.4 0.2
0 1 1 1 7 0 0 0.2 0.2 6 -0.6 0.4 0.4 0.2
0 0 0 -1 8 -0.2 0 0.2 0.2
Weight updates over
Weight updates over first epoch all epochs

10/12/2020 8
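A minimal Python sketch of the perceptron learning rule on the three-input example (an illustration assuming the update w ← w + λ(y − ŷ)x with λ = 0.1, zero initial weights, and a constant input of 1 for the bias; the exact weight sequence depends on the order of updates, so it is not claimed to reproduce the table above step by step):

import numpy as np

# Truth table: Y = +1 if at least two of X1, X2, X3 are 1, else -1
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]], dtype=float)
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1], dtype=float)

Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend constant 1 for the bias weight w0
w = np.zeros(Xb.shape[1])                   # (w0, w1, w2, w3) initialised to zero
lam = 0.1                                   # learning rate

for epoch in range(20):
    for xi, yi in zip(Xb, y):
        y_hat = 1.0 if xi @ w >= 0 else -1.0
        w += lam * (yi - y_hat) * xi        # perceptron update rule
    print(epoch, w)

preds = np.where(Xb @ w >= 0, 1, -1)
print("all correct:", np.array_equal(preds, y))

Since the data are linearly separable, the weights stop changing after a few epochs and all eight examples are classified correctly.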
Perceptron Learning

 Since y is a linear
combination of input
variables, decision
boundary is linear

10/12/2020 9
Perceptron Learning

 Since y is a linear
combination of input
variables, decision
boundary is linear

 For nonlinearly separable problems, perceptron


learning algorithm will fail because no linear
hyperplane can separate the data perfectly

10/12/2020 10
Nonlinearly Separable Data

XOR Data

y = x1 ⊕ x2 (XOR)

x1 x2  y
 0  0 -1
 1  0  1
 0  1  1
 1  1 -1

10/12/2020 11
Multi-layer Neural Network

 More than one hidden layer of computing nodes

 Every node in a hidden layer operates on activations from the
preceding layer and transmits activations forward to nodes of the
next layer

 Also referred to as “feedforward neural networks”

Figure: A multi-layer network with an input layer (x1 . . . x5), hidden
layers, and an output layer.

10/12/2020 12
Multi-layer Neural Network

 Multi-layer neural networks with at least one


hidden layer can solve any type of classification
task involving nonlinear decision surfaces
XOR Data

Figure: A multi-layer network for the XOR data: input nodes n1, n2 (for x1, x2),
hidden nodes n3, n4, output node n5, with weights w31, w41, w32, w42, w53, w54.
10/12/2020 13
Why Multiple Hidden Layers?

 Activations at hidden layers can be viewed as features


extracted as functions of inputs
 Every hidden layer represents a level of abstraction
– Complex features are compositions of simpler features

 Number of layers is known as depth of ANN


– Deeper networks express complex hierarchy of features

10/12/2020 14
Multi-Layer Network Architecture


Figure: The activation value at node i of layer l is obtained by applying an
activation function to a linear predictor (a weighted sum of the activations
from the previous layer).

10/12/2020 15
Activation Functions

10/12/2020 16
Learning Multi-layer Neural Network

 Can we apply perceptron learning rule to each


node, including hidden nodes?
– Perceptron learning rule computes error term
e = y - 𝑦 and updates weights accordingly
 Problem: how to determine the true value of y for
hidden nodes?
– Approximate error in hidden nodes by error in
the output nodes
 Problem:
– Not clear how adjustment in the hidden nodes affect overall
error
– No guarantee of convergence to optimal solution

10/12/2020 17
Gradient Descent

 Loss function to measure errors across all training points

Squared loss: E(w) = Σi (yi − ŷi)²

 Gradient descent: update parameters in the direction of
“maximum descent” of the loss function across all points

w ← w − λ ∂E/∂w,   λ: learning rate

 Stochastic gradient descent (SGD): update the weights for every
instance (mini-batch SGD: update over mini-batches of instances)

10/12/2020 18
Computing Gradients
ŷ = aL (the activation at the output layer L)

 Using the chain rule of differentiation (on a single instance), the gradient
of the loss with respect to a weight wij at layer l can be written in terms of
an error term δil at that layer.

 For the sigmoid activation function, the derivative is a(1 − a).

 How can we compute δil for every layer?


10/12/2020 19
Backpropagation Algorithm

 At output layer L:

 At a hidden layer 𝑙 (using chain rule):

– Gradients at layer l can be computed using gradients at layer l + 1


– Start from layer L and “backpropagate” gradients to all previous
layers
 Use gradient descent to update weights at every epoch
 For next epoch, use updated weights to compute loss fn. and its gradient
 Iterate until convergence (loss does not change)
10/12/2020 20
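A compact from-scratch sketch of backpropagation for a network with one hidden layer, sigmoid activations, and squared loss, trained on the XOR data from the earlier slide (the hidden-layer size, learning rate, and number of epochs are illustrative choices, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)

# XOR data (labels mapped from {-1, +1} to {0, 1} for a sigmoid output)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 nodes, one output node
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lam = 0.5  # learning rate

for epoch in range(20000):
    # Forward pass
    a1 = sigmoid(X @ W1 + b1)          # hidden activations
    a2 = sigmoid(a1 @ W2 + b2)         # output activation (y_hat)

    # Backward pass: error term at the output layer, then backpropagate
    d2 = (a2 - y) * a2 * (1 - a2)      # squared loss; sigmoid derivative is a(1 - a)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)   # error terms at the hidden layer

    # Gradient descent updates at every epoch
    W2 -= lam * a1.T @ d2
    b2 -= lam * d2.sum(axis=0, keepdims=True)
    W1 -= lam * X.T @ d1
    b1 -= lam * d1.sum(axis=0, keepdims=True)

print(np.round(a2.ravel(), 2))   # predictions should approach [0, 1, 1, 0]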
Design Issues in ANN

 Number of nodes in input layer


– One input node per binary/continuous attribute
– k or log2 k nodes for each categorical attribute with k
values
 Number of nodes in output layer
– One output for binary class problem
– k or log2 k nodes for k-class problem
 Number of hidden layers and nodes per layer
 Initial weights and biases
 Learning rate, max. number of epochs, mini-batch size for
mini-batch SGD, …

10/12/2020 21
Characteristics of ANN

 Multilayer ANN are universal approximators but could


suffer from overfitting if the network is too large
 Gradient descent may converge to local minimum
 Model building can be very time consuming, but testing
can be very fast
 Can handle redundant and irrelevant attributes because
weights are automatically learnt for all attributes
 Sensitive to noise in training data
 Difficult to handle missing attributes

10/12/2020 22
Deep Learning Trends

 Training deep neural networks (more than 5-10 layers)
has become possible only in recent times with:
– Faster computing resources (GPUs)
– Larger labeled training sets
– Algorithmic improvements in Deep Learning
 Recent Trends:
– Specialized ANN Architectures:
Convolutional Neural Networks (for image data)
Recurrent Neural Networks (for sequence data)

Residual Networks (with skip connections)

– Unsupervised Models: Autoencoders


– Generative Models: Generative Adversarial Networks
10/12/2020 23
Vanishing Gradient Problem

 The sigmoid activation function easily saturates (shows nearly zero
gradient with respect to z) when z is too large or too small
 This leads to small (or zero) gradients of the squared loss with respect to
the weights, especially at hidden layers, resulting in slow (or no) learning

10/12/2020 24
Handling Vanishing Gradient Problem

 Use of Cross-entropy loss function

 Use of Rectified Linear Unit (ReLU) Activations:

10/12/2020 25
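A tiny numeric illustration of the saturation effect (the z values are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
sig_grad = sigmoid(z) * (1 - sigmoid(z))   # nearly 0 for large |z|: the unit saturates
relu_grad = (z > 0).astype(float)          # ReLU'(z) = 1 for z > 0, 0 otherwise
print(np.round(sig_grad, 5))
print(relu_grad)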
Outline
Ensemble of Classifiers
Bagging

Advanced Classification Techniques

August 25, 2017

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Ensemble of Classifiers

Bagging

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Model Ensembles

TWO HEADS ARE BETTER THAN ONE


1 Construct a set of classifiers from adapted versions of the
training data (resampled/reweighted).

2 Combine the predictions of these classifiers in some way, often
by simple averaging or voting.

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Figure: General Idea of Ensemble

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Rationale for Ensemble Method

1 Suppose there are 25 base classifiers.


2 Each classifier has error rate ǫ = 0.35.
3 Assume classifiers are independent.
4 Probability that the ensemble classifier (majority vote, i.e., at least
13 of the 25 classifiers wrong) makes a wrong prediction:

Σ_{i=13}^{25} C(25, i) ǫ^i (1 − ǫ)^(25−i) = 0.06

Advanced Classification Techniques
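The 0.06 figure can be checked with a one-line binomial sum (Python sketch; math.comb requires Python 3.8 or later):

from math import comb

eps, n = 0.35, 25
# The majority vote is wrong when 13 or more of the 25 independent base classifiers err
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 4))   # approximately 0.06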


Outline
Ensemble of Classifiers
Bagging

Bootstrap aggregating (Bagging)

INPUT: Training Set D, a learning algorithm A, T = ensemble


size.
OUTPUT: Ensemble of Classifiers.

for t = 1 to T do
1 Build a bootstrap sample Dt by sampling |D| points with replacement.
2 Run A on Dt to build a classifier Mt
Return {Mt | 1 ≤ t ≤ T }
Comment: The probability that a particular data point is not selected in a
bootstrap sample is (1 − 1/n)^n ≈ e^(−1) ≈ 0.368 for large n.

Advanced Classification Techniques
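A minimal Python sketch of the bagging procedure above, using decision stumps as the base learner A on the toy data of the following example (the stump learner and the random bootstrap samples are illustrative, so the individual rounds will not match the slides exactly):

import numpy as np

rng = np.random.default_rng(1)

# Toy data from the bagging example
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_stump(xs, ys):
    """Pick the threshold/orientation of a one-split stump minimising training error."""
    best = None
    for thr in np.unique(xs):
        for left in (1, -1):
            pred = np.where(xs <= thr, left, -left)
            err = np.mean(pred != ys)
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best[1], best[2]

T = 10
stumps = []
for t in range(T):
    idx = rng.integers(0, len(x), len(x))   # bootstrap sample D_t (with replacement)
    thr, left = fit_stump(x[idx], y[idx])
    stumps.append((thr, left))

votes = sum(np.where(x <= thr, left, -left) for thr, left in stumps)
print(np.sign(votes))                       # majority vote (sign of the vote sum)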


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 5: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1.0 1.0 1.0
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques




Outline
Ensemble of Classifiers
Bagging

Bagging
Bagging Round 6: x ≤ 0.75 : y = −1, y = +1 else.

x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1.0
y 1 -1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 7: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1.0
y 1 -1 -1 -1 -1 1 1 1 1 1

Bagging Round 8: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 9: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1.0 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 10: x ≤ 0.05 : y = −1, y = +1 else.

x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y 1 1 1 1 1 1 1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1

Stump                   Round  x=0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
x ≤ 0.35: +1, −1 else     1      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.65: +1, +1 else     2      1    1    1    1    1    1    1    1    1    1
x ≤ 0.35: +1, −1 else     3      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.3:  +1, −1 else     4      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.35: +1, −1 else     5      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.75: −1, +1 else     6     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.75: −1, +1 else     7     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.75: −1, +1 else     8     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.75: −1, +1 else     9     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.05: −1, +1 else    10      1    1    1    1    1    1    1    1    1    1
SUM                              2    2    2   -6   -6   -6   -6    2    2    2
Class (sign of SUM)              1    1    1   -1   -1   -1   -1    1    1    1
Actual                           1    1    1   -1   -1   -1   -1    1    1    1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Random Forest

Bagging is useful in combination with decision tree models.


Build each tree from a different random subset of features
(subspace sampling).

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Random Forest

INPUT: Training Set D, subspace dimension=d, T = ensemble


size.
OUTPUT: Ensemble of Classifiers.

for t = 1 to T do
1 Build a bootstrap sample Dt by sampling with replacement.
2 select d features randomly and reduce dimensionality of Dt
accordingly.
3 Build a decision tree Mt using Dt .
Return {Mt |1 ≤ t ≤ T }.

Advanced Classification Techniques
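A compact sketch of the Random Forest pseudocode above, using scikit-learn's DecisionTreeClassifier as the base learner (the dataset here is a randomly generated placeholder, and the values of T and d are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder dataset: 200 points, 10 features, binary labels
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

T, d = 25, 4                      # ensemble size and subspace dimension
forest = []                       # list of (feature subset, fitted tree)

for t in range(T):
    idx = rng.integers(0, len(X), len(X))              # bootstrap sample D_t
    feats = rng.choice(X.shape[1], d, replace=False)   # random feature subspace
    tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
    forest.append((feats, tree))

def predict(Xnew):
    votes = np.array([tree.predict(Xnew[:, feats]) for feats, tree in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)      # majority vote

print("training accuracy:", (predict(X) == y).mean())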


Outline
Ensemble of Classifiers
Bagging

Boosting

Boosting assigns a weight to each training example and
adaptively changes the weights at the end of each boosting
round.
How do the weights change?
Half of the total weight is assigned to the misclassified examples
and the other half to the remaining (correctly classified) examples.

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Boosting
Initial weights are uniform and sum to 1.
The total weight currently assigned to the misclassified samples is exactly
the error rate ǫ.
Multiply the weights of misclassified instances by 1/(2ǫ).
Multiply the weights of correctly classified samples by 1/(2(1 − ǫ)).

Table: Confusion Matrix

                         ACTUAL CLASS
                       Class=+   Class=-
PREDICTED  Class=+       24         9
CLASS      Class=-       16        51

ǫ = (9 + 16)/100 = 0.25; weight update factor = 1/(2ǫ) = 2 for misclassified
samples, and 1/(2(1 − ǫ)) = 1/1.5 = 2/3 for correctly classified samples.
Advanced Classification Techniques
Outline
Ensemble of Classifiers
Bagging

Boosting

Confidence/weightage of each boosting model (α):

αt = (1/2) ln( (1 − ǫt) / ǫt )

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

AdaBoost

INPUT: Training Set D; ensemble size T; learning algorithm A
OUTPUT: Weighted ensemble of models
w1i = 1/|D| for all xi ∈ D
for t = 1 to T do
1 Run A on D weighted by wti to produce a model Mt
2 Calculate the weighted error ǫt .
3 if ǫt ≥ 1/2 then T = t − 1; break;
4 αt = (1/2) ln( (1 − ǫt) / ǫt )
5 w(t+1)i = (wti / Zt) × exp(−αt ) for correctly classified examples;
  w(t+1)i = (wti / Zt) × exp(+αt ) for misclassified examples.
  (Zt is a normalization factor so that the weights sum to 1.)
Return M(x) = Σ_{t=1}^{T} αt × Mt (x)

Advanced Classification Techniques
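A minimal AdaBoost sketch following the pseudocode above, with weighted decision stumps on the 1-D toy data of the next example (the stump learner is an implementation choice, and the update is written as exp(−α y ŷ), which combines the two cases of step 5; the rounds are not guaranteed to match the slides exactly):

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_weighted_stump(xs, ys, w):
    """Decision stump minimising the weighted error."""
    best = None
    for thr in np.unique(xs):
        for left in (1, -1):
            pred = np.where(xs <= thr, left, -left)
            err = np.sum(w * (pred != ys))
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best

T = 3
w = np.full(len(x), 1.0 / len(x))        # w_1i = 1/|D|
models, alphas = [], []

for t in range(T):
    eps, thr, left = fit_weighted_stump(x, y, w)
    if eps >= 0.5:
        break
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = np.where(x <= thr, left, -left)
    w = w * np.exp(-alpha * y * pred)    # down-weight correct, up-weight wrong examples
    w = w / w.sum()                      # Z_t normalisation
    models.append((thr, left)); alphas.append(alpha)

score = sum(a * np.where(x <= thr, left, -left) for a, (thr, left) in zip(alphas, models))
print(np.sign(score))                    # weighted vote; should recover the true labels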


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2: x ≤ 0.05 : y = +1, y = +1 else.

x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2: x ≤ 0.05 : y = +1, y = +1 else.

x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3: x ≤ 0.3 : y = +1, y = −1 else.

x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example

Round Split Point α


1 0.75 1.738
2 0.05 2.7784
3 0.3 4.1195

Round x=1 2 3 4 5 6 7 8 9 10
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Sign 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques
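The final combination in the table above can be verified with a few lines (the stump orientations and α values are read off the earlier slides):

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# (threshold, label for x <= threshold, label otherwise, alpha) for the three rounds
stumps = [(0.75, -1, +1, 1.738),
          (0.05, +1, +1, 2.7784),
          (0.30, +1, -1, 4.1195)]

score = sum(a * np.where(x <= thr, left, right) for thr, left, right, a in stumps)
print(np.round(score, 2))   # about [5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.4 0.4 0.4]
print(np.sign(score))       # matches the Sign row of the table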


Outline
Ensemble of Classifiers
Bagging

THANK YOU

Advanced Classification Techniques


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Introduction to Clustering Methods

Dr. Bidyut Kr. Patra

Assistant Professor, National Institute of Technology Rourkela


Rourkela, Orissa

October 24, 2016

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Introduction to Classification
Introduction to Clustering
Similarity Measures
Category of clustering methods
Partitional Clustering
Soft Clustering
Hierarchical Clustering
Density Based Clustering Method
Clustering Methods for Large Datasets
Major Approaches to Clustering Large Datasets
Hybrid Clustering Method
Data Summarization
BIRCH
Application of Clustering Methods to Image Processing
Conclusions and Research Directions
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Classification
Definition
The task of classification is to assign an object to one of the
predefined categories.

Working Principle

A set of objects along with their categories (the Training Set) is
provided.
A classifier captures the relationship between objects and their
categories (i.e., finds a suitable model).
The model assigns a class label (category) to an unknown object.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Examples of Classification

Predicting tumor cells as non-malignant or malignant.

Classifying e-mails as spam or genuine.
Classifying credit card transactions as legitimate or fraudulent.
Categorizing news stories as finance, weather, entertainment,
sports, etc.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawback

Non-availability of a proper training set: it is difficult to obtain
correctly labeled data.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Introduction to Clustering
Cluster Analysis
Cluster analysis is to discover the natural grouping(s) of a set of
patterns, points, or objects [A. K. Jain. Data clustering: 50 years
beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010].

Operational definition of clustering:


Cluster Analysis
Clustering activity or cluster analysis is to find group(s) of patterns,
called cluster (s) in a dataset in such a way that patterns in a cluster
are more similar to each other than patterns in distinct clusters.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Set Theoretic Definition

Definition
Let D = {x1 , x2 , . . . , xn } be a set of patterns called the dataset, where
each xi is a pattern of N dimensions. A clustering π of D can be
defined as follows.
π = {C1 , C2 , . . . , Ck }, such that
  ∪_{i=1}^{k} Ci = D
  Ci ≠ ∅, i = 1..k
  Ci ∩ Cj = ∅, i ≠ j, i, j = 1..k
  sim(x1 , x2 ) > sim(x1 , y1 ), where x1 , x2 ∈ Ci and y1 ∈ Cj , i ≠ j

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Application Domains:
1 Biology: Clustering has been applied to genomic data to
group functionally similar genes.
2 Information Retrieval: Search results obtained by various
search engines such as Google, Yahoo can be clustered so that
related documents appear together in a cluster.
(Vivisimo search engine (http://vivisimo.com/) groups related
documents.)
3 Market Research: Entities (people, market, organizations) can
be clustered based on common features or characteristics.
4 Geological mapping, Bio-informatics, Climate, Web mining,
Image Processing

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Similarity
Similarity between a pair of patterns x and y in D is a
mapping sim(x, y) : D × D → [0, 1].
The closer the value of sim(·) is to 1, the higher the similarity; it
equals 1 when x = y.

Simple Matching Coefficient (SMC): Let x and y be two N-dimensional
binary vectors, i.e. x, y ∈ {0, 1}^N .

SMC(x, y) = ( Σ_{i=1}^{N} t ) / N,   where t = 1 if xi = yi and t = 0 otherwise,

and xi and yi are the i-th feature values of x and y, respectively.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Similarity(Contd..)
Jaccard Coefficient (J):

J = a11 / (N − a00),                                          (1)

where a11 = Σ_{i=1}^{N} t with t = 1 if xi = yi = 1 and t = 0 otherwise,
and   a00 = Σ_{i=1}^{N} t with t = 1 if xi = yi = 0 and t = 0 otherwise.

Let x = (1, 0, 0, 1, 1) and y = (0, 1, 0, 1, 0).
SMC = 2/5;   Jaccard Coefficient = 1/4

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Similarity(Contd..)
Cosine Similarity: Let x and y be two document vectors. The similarity
between x and y is expressed as the cosine of the angle between them.

cosine(x, y) = (x • y) / (||x|| ||y||)

where • is the dot product and ||·|| is the L2-norm.

Let x = (3, 0, 2, 0, 0, 1) and y = (2, 1, 3, 0, 1, 0) be two document vectors.

cosine(x, y) = (6 + 0 + 6 + 0 + 0 + 0) / ( √(3² + 0² + 2² + 0² + 0² + 1²) × √(2² + 1² + 3² + 0² + 1² + 0²) )
             = 12 / ( √14 × √15 ) ≈ 0.828

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
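A small Python sketch computing SMC, Jaccard, and cosine similarity for the vectors used in the examples above (illustrative helper functions, numpy only):

import numpy as np

def smc(x, y):
    return np.mean(x == y)                       # matching positions / N

def jaccard(x, y):
    a11 = np.sum((x == 1) & (y == 1))
    a00 = np.sum((x == 0) & (y == 0))
    return a11 / (len(x) - a00)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

xb, yb = np.array([1, 0, 0, 1, 1]), np.array([0, 1, 0, 1, 0])
print(smc(xb, yb), jaccard(xb, yb))              # 0.4 and 0.25

xd, yd = np.array([3, 0, 2, 0, 0, 1]), np.array([2, 1, 3, 0, 1, 0])
print(round(cosine(xd, yd), 3))                  # about 0.828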


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Dissimilarity
Many clustering methods use dissimilarity measures to find clusters
instead of similarity measure.

The Euclidean distance between a pair of N-dimensional points (patterns) can
be expressed as follows:

d(x, y) = √( Σ_{i=1}^{N} (xi − yi)² )

A generalization of the Euclidean distance is known as the Minkowski distance:

Lp = ( Σ_{i=1}^{N} |xi − yi|^p )^(1/p)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Metric Space

Definition
M = (D, d) is said to be metric space if d is a metric on D, i.e.,
d : D × D → R≥0 , which satisfies following conditions.
For three patterns x, y , z ∈ D,
Non-negativity: d(x, y ) ≥ 0
Reflexivity: d(x, y ) = 0, if x = y
Symmetry: d(x, y ) = d(y , x)
Triangle inequality: d(x, y ) + d(y , z) ≥ d(x, z)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Classification of Clustering Methods

Partitional Clustering: Partitional clustering creates a single clustering of
a given dataset. Let C1 and C2 be two clusters in a clustering. Then
(i) C1 ⊄ C2 and C2 ⊄ C1, (ii) C1 ∩ C2 = ∅.

Fuzzy and rough clustering approaches violate constraint (ii): C1 ∩ C2 ≠ ∅.

Hierarchical Clustering method: A hierarchical clustering method creates a
sequence of partitional clusterings of a dataset.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DATASET Clustering

Figure: Partitional Clustering

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DATASET

Figure: Hierarchical Clustering

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Partitional Clustering Methods


Basic Sequential Algorithmic Scheme (BSAS): BSAS needs
two user-specified parameters: a distance threshold (τ) and the
maximum number of clusters (k).

i = 1, Ci = {x1 }
For each x ∈ D \ {x1 }:
1 Find the nearest existing cluster Cmin such that
d(x, Cmin ) = min_{j=1..i} d(x, Cj )
2 if d(x, Cmin ) > τ and i < k, then
i = i + 1; Ci = {x}
ELSE Cmin = Cmin ∪ {x}
3 Repeat Step 1 and Step 2 until all patterns are assigned to
clusters.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Advantages:
1 Single dataset scan method.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Advantages:
1 Single dataset scan method.
2 Time Complexity= O(kn)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Advantages:
1 Single dataset scan method.
2 Time Complexity= O(kn)

Disadvantages:
1 Number of clusters is to be provided.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Leaders clustering method (Hartigan, 1975)

INPUT: D, τ
1 L ← {x1 }
2 For each pattern x ∈ D \ {x1 }, if there is a leader l ∈ L such that
||l − x|| ≤ τ, then x is assigned to the cluster represented by l.
If there is no such leader, then x becomes a new leader and is
added to L.
3 Output the leaders set L.
Advantages:
Single-scan method
Time complexity O(mn), where m = |L|

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
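A minimal sketch of the leaders method (an illustration that assigns each point to the nearest leader within τ, which satisfies the "some leader within τ" condition above; the threshold and the toy points are arbitrary):

import numpy as np

def leaders(D, tau):
    """Single-scan leaders clustering: returns leaders and the cluster of each point."""
    L = [D[0]]                      # the first pattern becomes the first leader
    labels = [0]
    for x in D[1:]:
        dists = [np.linalg.norm(x - l) for l in L]
        j = int(np.argmin(dists))
        if dists[j] <= tau:         # follow the nearest leader within tau
            labels.append(j)
        else:                       # otherwise x becomes a new leader
            L.append(x)
            labels.append(len(L) - 1)
    return np.array(L), np.array(labels)

D = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])
leaders_set, labels = leaders(D, tau=1.0)
print(leaders_set)
print(labels)      # e.g. [0 0 1 1 2]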


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Figure: Leaders find semi-spherical clusters.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawbacks
Clustering results are order dependent.
The distance between followers of different clusters (leaders) may
be less than the distance between the corresponding leaders.

Figure: Distance between followers (x, y) of leaders lx and ly is less
than τ

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

The k-means Clustering Method (MacQueen, 1967)

The k-means method optimizes the Sum of Squared Error (SSE),
defined as follows. Let C1 , . . . , Ck be k clusters of the
dataset. Then,

SSE = Σ_{j=1}^{k} Σ_{xi ∈ Cj} ||xi − x̄j||² ,   where x̄j = (1/|Cj|) Σ_{xi ∈ Cj} xi .

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

k-means Clustering(MacQueen,1967)

Select k points as initial centroids.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

k-means Clustering(MacQueen,1967)

Select k points as initial centroids.


repeat
1 Form k clusters by assigning each point to its closest centroid.
2 Recompute the centroid of each cluster.
until Centroids do not change.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}
mC1 = 4.75, mC2 = 19.6

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}
mC1 = 4.75, mC2 = 19.6
C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 30, 25}
mC1 = 7, mC2 = 25

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}
mC1 = 4.75, mC2 = 19.6
C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 30, 25}
mC1 = 7, mC2 = 25
C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 30, 25} (no change: the centroids are stable, so the algorithm terminates)
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
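A minimal 1-D k-means sketch on the dataset of the example, starting from the same initial centroids 2 and 4 (ties are broken by numpy's argmin, so intermediate assignments could in principle differ from the trace above):

import numpy as np

D = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
centroids = np.array([2.0, 4.0])            # initial centroids from the slide

for it in range(20):
    # Assign each point to its closest centroid (L2 distance)
    labels = np.argmin(np.abs(D[:, None] - centroids[None, :]), axis=1)
    # Recompute each centroid as its cluster mean
    new_centroids = np.array([D[labels == j].mean() for j in range(len(centroids))])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(sorted(D[labels == 0]), sorted(D[labels == 1]))  # {2,3,4,10,11,12} and {20,25,30}
print(centroids)                                       # [7.0, 25.0]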
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Time and Space Complexity


Time=O(I ∗ k ∗ n) , Space=O(k + n)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawbacks

1 It cannot detect outliers.


2 It can find only convexed shaped clusters.
3 It is applicable to only numeric dataset.
4 With different initial points, it produces different clustering
results.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DATASET CLUSTERING by k−means

Figure: Result produced by k-means clustering method

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Bisecting k-means(Steinbach, 2000)

Algorithm
1 Initialize the list of clusters L = {C1 }, where C1 = D.
2 repeat
1 Remove a cluster from the list of clusters.
2 Bisect the selected cluster using k-means clustering method for
a number of times.
3 Select two clusters from a bisection with lowest total SSE.
4 Add these two clusters to the list of clusters L.
3 until number of clusters in the list L is k.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

K-means as an Optimization Problem

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

K-means as an Optimization Problem

Minimize SSE = Σ_{i=1}^{k} Σ_{x ∈ Ci} (ci − x)² ,   where ck = (1/|Ck|) Σ_{x ∈ Ck} x

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

K-means as an Optimization Problem

Minimize SSE = Σ_{i=1}^{k} Σ_{x ∈ Ci} (ci − x)² ,   where ck = (1/|Ck|) Σ_{x ∈ Ck} x   (the cluster mean)

Minimize SAE = Σ_{i=1}^{k} Σ_{x ∈ Ci} | ci − x | ,  where ck = the median of the objects in the cluster

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Triangle Inequality to Accelerate k-means (Elkan, 2003)

Let x be a point and let b and c be centers. If


d(b, c) ≥ 2d(x, b) then d(x, c) ≥ d(x, b).

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Soft Clustering

In many application domains clusters do not have crisp


(exact) boundary.
Fuzzy Set Theory [Zadeh, 1965], Rough Set Theory
[Pawlak, 1982] and hybridizations of both approaches.

Fuzzy c-Means Method (FCM) [Dunn, 1973]

Rough k-means (Lingras and West (2004))

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of given dataset D.
It can produce inherent nested structures (hierarchical) of
clusters in a data.
Hierarchical Clustering obtained in two ways
1 Divisive (Top-down) approach:

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of given dataset D.
It can produce inherent nested structures (hierarchical) of
clusters in a data.
Hierarchical Clustering obtained in two ways
1 Divisive (Top-down) approach:
Start with one cluster containing all points.
At each step, split a cluster until each cluster contains a point.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of given dataset D.
It can produce inherent nested structures (hierarchical) of
clusters in a data.
Hierarchical Clustering obtained in two ways
1 Divisive (Top-down) approach:
Start with one cluster containing all points.
At each step, split a cluster until each cluster contains a point.
2 Agglomerative (Bottom-up) approach :

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of a given dataset D.
They can produce the inherent nested (hierarchical) structure of
clusters in a dataset.
A hierarchical clustering can be obtained in two ways:
1 Divisive (Top-down) approach:
Start with one cluster containing all points.
At each step, split a cluster until each cluster contains a single point.
2 Agglomerative (Bottom-up) approach:
Start with the points as individual clusters.
Merge the closest pair of clusters until only one cluster remains.
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

π5 = {{p1, p5, p2, p3, p4}}

π4 = {{p1, p5}, {p2, p3, p4}}


π3 = {{p1}, {p2, p3, p4}, {p5}}
π2 = {{p1}, {p2, p3}, {p4}, {p5}}

p1 p2 p3 p4 p5 π1 = {{p1}, {p2}, {p3}, {p4 }, {p5}}

Figure: Dendrogram produced by a hierarchical clustering method

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Agglomerative (Bottom-up) approach

Compute distance matrix for the dataset.


Let each data point be a cluster
repeat
1 Merge the two closest clusters.
2 Update the distance matrix.
until only a single cluster remains

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Popular Agglomerative Methods

Single-link:   Dis(C1 , C2 ) = min{ ||xi − xj || | xi ∈ C1 , xj ∈ C2 }
Complete-link: Dis(C1 , C2 ) = max{ ||xi − xj || | xi ∈ C1 , xj ∈ C2 }
Average-link:  Dis(C1 , C2 ) = (1 / (|C1| × |C2|)) Σ_{xi ∈ C1} Σ_{xj ∈ C2} ||xi − xj ||

(a) Single-link (b) Complete-link (c) Average-link

Figure: Distance between a pair of clusters in three hierarchical clustering
methods.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
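A short sketch comparing single, complete, and average link using scipy's hierarchical clustering routines (the two-blob dataset is illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose blobs as a toy dataset
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(5, 0.5, size=(10, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # builds the full merge sequence (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)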


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Complexity

Space Complexity : O(n2 )


Time Complexity: O(n2 )

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Single-link Clustering Method

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Dendrogram produced by single-link method

Figure: Dendrogram for single-link

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Distance Updation

Let C = Cx ∪ Cy be the new cluster formed by merging clusters Cx and Cy .
Let Co be another cluster.

Single-link: d(Co , (Cx , Cy )) = min{ d(Co , Cx ), d(Co , Cy ) }

Lance and Williams (1967) generalize the distance updation:

d(Co , (Cx , Cy )) = αi × d(Co , Cx ) + αj × d(Co , Cy )
                   + β × d(Cx , Cy ) + γ × |d(Co , Cx ) − d(Co , Cy )|

where d(·, ·) is a distance function and the values of αi , αj , β and γ (∈ R) depend on the method used.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Table: Lance-Williams parameters for different hierarchical methods

Method          αi                 αj                 β                            γ
Single-link     1/2                1/2                0                            -1/2
Complete-link   1/2                1/2                0                            1/2
Average-link    mCx/(mCx + mCy)    mCy/(mCx + mCy)    0                            0
Centroid        mCx/(mCx + mCy)    mCy/(mCx + mCy)    −(mCx · mCy)/(mCx + mCy)²    0

(mCx and mCy denote the number of points in Cx and Cy , respectively.)

Complete Link:

d(Co , (Cx , Cy )) = (1/2) × d(Co , Cx ) + (1/2) × d(Co , Cy ) + 0 × d(Cx , Cy ) + (1/2) × |d(Co , Cx ) − d(Co , Cy )|
                   = max{ d(Co , Cx ), d(Co , Cy ) }

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Density Based Clustering Method

Density based clustering method views clusters as dense


regions in the feature space which are separated by relatively
less dense regions.
Density based clustering approach optimizes local criteria,
which is based on density distribution of the dataset.
DBSCAN (Density Based Spatial Clustering of Applications
with Noise) [Ester et al. (1996)] is a very popular density
based partitional clustering method.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN (M. Ester et al., SIG-KDD 1996)

DBSCAN is a density based clustering method, which can find
arbitrarily shaped clusters.
It classifies each point into one of three categories:
1 Core Point: If a hyper-sphere of radius ǫ centered at point x contains
more than MinPts data points, then x is called a core point.
2 Border Point: The point x is called a border point if the
hyper-sphere has fewer than MinPts points but there is a nearby
core point CP (||x − CP|| < ǫ).
3 Noisy Point: The hyper-sphere has fewer than MinPts points and
there is no nearby core point.
DBSCAN starts with a core point and expands it by
recursively merging nearby core points.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN(contd..)

Figure: Classification of points into core, border, and noise points (MinPts = 4)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN(Contd..)

Directly density-reachable: A pattern y is directly
density-reachable from a pattern x if ||x − y || ≤ ǫ and x is a
core point.
Density reachable: A pattern y is density reachable from
another pattern x if there is a sequence of patterns
x1 , x2 , . . . , xn with x1 = x, xn = y such that xi+1 is directly
density reachable from xi , i = 1..(n − 1).
Density connected: A pattern x is density connected to y if
there is a pattern xc such that both x and y are density
reachable from xc .

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Figure: Cluster produced by DBSCAN (MinPts = 3): points x and y are density
connected through xc ; point x is density reachable from xc .

A cluster C ⊂ D holds following properties.


∀x, y ∈ C , x and y are density connected.
If x ∈ C is a core point, all density-reachable points from x
are included in C .
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN

1 Let x be a pattern in the dataset.

1 If x is a core point (CP), then explore its ǫ-neighbors in a
recursive manner and mark the points as “seen”. They form a
cluster.
(At any stage, if any of the neighbors is a border point, there is no
need to explore that point, but it is included in the cluster.)
2 If x is neither a CP nor part of any cluster, then mark x as “seen”
and temporarily mark it as a Noisy Point.
2 Repeat Step 1 until all points are marked as “seen”.
3 If a temporarily marked “Noisy Point” does not belong to any
cluster, then the point is declared a final Noisy Point.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
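A brief sketch using scikit-learn's DBSCAN on two dense blobs plus an isolated point (the values of eps and min_samples are illustrative, not prescriptive):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),    # dense blob 1
               rng.normal(5, 0.3, size=(30, 2)),    # dense blob 2
               [[2.5, 2.5]]])                       # an isolated (noisy) point

model = DBSCAN(eps=1.0, min_samples=4).fit(X)       # eps plays the role of the radius, min_samples of MinPts
print(set(model.labels_))                           # cluster ids; -1 marks noise points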


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Selection of DBSCAN Parameters

Compute the k-dist (distance to the k-th nearest neighbor) for each
data point, for some value of k.

Sort the points in ascending order of k-dist and plot the sorted values.

We expect a sharp change (knee) in the plot; the k-dist value at that
point corresponds to a suitable value of ǫ (with MinPts = k).

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawbacks

Figure: Dataset (left) and the clusters obtained by the DBSCAN method
(right): Cluster1, Cluster2, Cluster3, and Noise.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN fails to detect clusters of arbitrary shapes with


highly variable density.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Neighborhood Based Clustering (NBC)

S. Zhou, Y. Zhao, J. Guan, and J. Z. Huang.


A Neighborhood-Based Clustering Algorithm.
Appeared in Pacific-Asia Conference on Knowledge Discovery and
Data Mining, (PAKDD 2005)
Neighborhood Based Clustering (NBC) discovers clusters
based on the neighborhood characteristics of data.
NBC is effective in discovering clusters of arbitrary shape and
different densities.
NBC needs fewer input parameters than the DBSCAN
clustering method.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

NBC

The core concept of NBC is the Neighborhood Density Factor (NDF):

NDF(x) = |R-KNN(x)| / |KNN(x)|
       = (number of reverse K nearest neighbors of x) / (number of K nearest neighbors of x)

Reverse K-Nearest Neighbors set of x (R-KNN): the set of objects
whose KNN contains x.

R-KNN(x) = { p ∈ D | x ∈ KNN(p) }

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

It classifies a point x into any of the three categories:


1 Core Point: If NDF (x) > 1

2 Even Point: If NDF (x) = 1

3 Noisy Point: If NDF (x) < 1

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Definition (Neighborhood-based directly density reachable)


Let x, y be two points in D and NNK (x) be the KNN of x. The
point y is directly reachable from x if
x is a CP or EP and y ∈ NNK (x).

Definition ( Neighborhood-based density reachable )

Let x, y be two points in D and NNK (x) be the KNN of x. The
point y is reachable from x if there is a chain of patterns
p1 = x, p2 , . . . , pn = y such that pi+1 is directly reachable from
pi , for i = 1..(n − 1).

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Definition (Neighborhood-based density connected)


Let x, y be two points in D and NNK (x) be the KNN of x. The
points x, y are neighborhood connected if any of the following
conditions holds.
y is reachable from x, or vice-versa.
x and y are reachable from any other pattern in the dataset.

Definition (Neighborhood-based Cluster)


Let D be a dataset. C ⊆ D is a cluster such that
If p, q ∈ C , then p and q are neighborhood connected.
If p ∈ C and q ∈ D are neighborhood connected, then q also
belongs to cluster C .

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Neighborhood based Clustering (NBC) method


1 Calculate NDF for all patterns of the dataset.
2 Let x be a pattern in the dataset.
1 If x is a cluster point (CP/EP), then explore its neighbors in a
recursive manner and mark the points as “seen”. They form a
cluster.
(At any stage, if any of the neighbors is not a CP/EP, there is no
need to explore that point, but it is included in the cluster.)
2 If x is not a CP/EP, then mark x as “seen” and temporarily mark it
as a Noisy Point.
3 Repeat Step 2 until all points are marked as “seen”.
4 If a temporarily marked “Noisy Point” does not belong to any
cluster, then the point is declared a final Noisy Point.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
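A small sketch that computes NDF(x) for every point with brute-force k-NN and reverse k-NN using numpy (k and the toy points are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(15, 2)),
               rng.normal(4, 1.5, size=(15, 2))])   # one dense and one sparse group
k = 5

# Pairwise distances, brute force
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)

# k nearest neighbours of every point (indices)
knn = np.argsort(dist, axis=1)[:, :k]

# Reverse-kNN count: how many points list i among their k nearest neighbours
rknn_count = np.zeros(len(X), dtype=int)
for i in range(len(X)):
    rknn_count[knn[i]] += 1

ndf = rknn_count / k      # NDF(x) = |R-KNN(x)| / |KNN(x)|
print(np.round(ndf, 2))   # >1: core, =1: even, <1: noisy (by the NBC definitions)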


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Figure: Clusters obtained by the NBC method (Cluster1-Cluster5 and noise points on a 2-D dataset, both axes ranging from 0 to 100).

Major approaches to clustering large datasets

Many clustering approaches have been proposed to handle large-scale data.
These approaches are mainly classified into the following groups [1]:

1 Sampling-Based Approach

2 Hybrid Clustering

3 Data Summarization

4 Nearest Neighbor Search

[1] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

Hybrid Method

A partitional clustering method is combined with a hierarchical clustering method.

The first hybrid method was proposed by Murthy and Krishna in 1980. It combines k-means clustering with the single-link method.

Many clustering methods (Lin and Chen (2005), Liu et al. (2009), Chaoji et al. (2009)) have been developed along this line.


A typical hybrid scheme (a sketch in code follows):

Divide the dataset into a number of sub-clusters using the k-means clustering method.
Compute the similarity/dissimilarity between each pair of sub-clusters.
Apply an agglomerative method to these sub-clusters to obtain the final clustering.
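A minimal sketch of this scheme with scikit-learn; the number of sub-clusters (50), the use of centroid distances as the sub-cluster dissimilarity, and single-link merging are illustrative assumptions.

from sklearn.cluster import KMeans, AgglomerativeClustering

def hybrid_clustering(X, n_subclusters=50, n_clusters=5):
    # Step 1: partition the data into many small sub-clusters with k-means
    km = KMeans(n_clusters=n_subclusters, n_init=10, random_state=0).fit(X)
    # Steps 2-3: merge the sub-clusters with an agglomerative (single-link) method;
    # here the pairwise centroid distances act as the sub-cluster dissimilarity
    agg = AgglomerativeClustering(n_clusters=n_clusters, linkage="single")
    agg.fit(km.cluster_centers_)
    # each point inherits the final cluster of its k-means sub-cluster
    return agg.labels_[km.labels_]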


Vijaya et al. (2006) used a hybrid clustering method for protein sequence classification:

1 Leaders clustering is applied to the whole dataset to obtain a set of leaders.
2 Single-link/complete-link clustering is applied to the leaders to obtain k clusters.
3 The median of each cluster is selected as the representative of the cluster.


Data Summarization

These approaches create a summary of a large dataset.

The summary is then used intelligently to scale up an expensive clustering method.


BIRCH (Zhang et al., 1996)

BIRCH is designed for clustering large datasets.
It can work with limited main memory.
It is an incremental method.

Two-Phase Method (a usage sketch follows)
1 Create a summary of the given dataset in the form of a CF tree.
2 Apply a conventional hierarchical clustering method to the summary.
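A hedged usage sketch: scikit-learn's Birch follows the same two-phase idea; the threshold (the leaf diameter bound T), branching_factor (B), the toy data and n_clusters are illustrative values.

import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(500, 2))            # toy data
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)     # phase 1 builds the CF tree incrementally,
                                  # phase 2 clusters the CF-tree leaf entries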



CF tree

A CF-tree is a height-balanced tree with branching factor B.

Each internal node of the CF tree has at most B entries of the form [CFi, childi], i = 1, . . . , B, where CFi is the clustering feature (CF) of the sub-cluster represented by childi.

A leaf node has at most Lmax entries of the form [CFi], i = 1, . . . , Lmax.

The diameter of each sub-cluster in a leaf node must be less than a threshold T.


Figure: CF-tree


Clustering Features (CF)

A CF is a triplet that contains summarized information about a sub-cluster.

Let C1 = {X1, X2, . . . , Xk} be a sub-cluster. The CF of C1 is

    CF = (k, LS, ss),

where LS is the linear sum of the patterns in C1, i.e., LS = Σi Xi, and ss is the square sum of the data points, i.e., ss = Σi Xi².

CF values satisfy the additive property: merging two disjoint sub-clusters gives CF3 = CF1 + CF2 (component-wise).
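A small illustration of the CF triplet and its additive property, assuming CF is stored as a (k, LS, ss) tuple; the toy sub-clusters are arbitrary.

import numpy as np

def cf(points):
    P = np.asarray(points, dtype=float)
    return (len(P), P.sum(axis=0), (P ** 2).sum())     # (k, LS, ss)

def cf_add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])     # component-wise addition

A = [[1.0, 2.0], [2.0, 3.0]]
B = [[4.0, 0.0]]
cf3, cf_union = cf_add(cf(A), cf(B)), cf(A + B)
assert cf3[0] == cf_union[0]
assert np.allclose(cf3[1], cf_union[1]) and np.isclose(cf3[2], cf_union[2])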


Application of Clustering Methods to Image Segmentation

Image segmentation is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest.

Common approaches are:
1 region-based,
2 edge-based,
3 cluster-based.

In cluster-based segmentation, the idea is to define a feature vector at every image location (pixel), composed of both functions of the image intensity and functions of the pixel location itself (a sketch follows).
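A minimal sketch of cluster-based segmentation along these lines, assuming a greyscale image stored as a 2-D NumPy array, scikit-learn's k-means, and an arbitrary weight on the spatial coordinates.

import numpy as np
from sklearn.cluster import KMeans

def segment(img, n_segments=4, spatial_weight=0.5):
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # feature vector at each pixel: (intensity, weighted row, weighted column)
    feats = np.stack([img.ravel().astype(float),
                      spatial_weight * ys.ravel(),
                      spatial_weight * xs.ravel()], axis=1)
    labels = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit_predict(feats)
    return labels.reshape(h, w)          # a segment label for every pixel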


Image Segmentation via Clustering

Hoffman and Jain (1987) employ squared-error clustering in a six-dimensional feature space extracted from a range image. The technique was enhanced by Flynn and Jain in 1991.

An enhancement of the k-means algorithm called CLUSTER is used to obtain segment labels for each pixel.
Each pixel in the range image is assigned the segment label of the nearest cluster center.
The segments are then refined.
Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Image Segmentation via Clustering

Nguyen and Cohen [1993] used fuzzy c-means clustering for segmenting textured images.

The k-means algorithm was applied to segmenting LANDSAT imagery by Solberg et al. [1996].


Sriparna Saha and Sanghamitra Bandyopadhyay [2008] proposed a new symmetry-based clustering method for segmenting satellite images.

Sriparna Saha and Sanghamitra Bandyopadhyay [2007] proposed MRI brain image segmentation using a fuzzy clustering approach.

Automatic MR brain image segmentation using a multiseed-based multiobjective clustering approach was proposed by Saha and Bandyopadhyay in 2011.


Conclusions

1 Many data clustering methods are discussed.

2 Various versions of the k-means clustering method are applied for segmenting images.

3 Density-based clustering methods can be explored for finding disjoint regions with different densities.

4 Rough-set-theory-based clustering methods can be useful for finding overlapping segments in a given image.


References

1 S. Theodoridis and K. Koutroumbas. Pattern Recognition, 3rd ed. Academic Press, Inc., Orlando, 2006.
2 A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999.
3 Z. Pawlak. Rough sets. International Journal of Computer and Information Sciences, 11(5):341–356, 1982.
4 P. A. Vijaya, M. N. Murty, and D. K. Subramanian. Efficient bottom-up hybrid hierarchical clustering techniques for protein sequence classification. Pattern Recognition, 39(12):2344–2355, 2006.


Questions
THANK YOU



Linear SVM

Margin

Assuming linearly separable data, the SVM tries to find the best separating hyperplane.

© P. Viswanath, Indian Institute of Technology-Guwahati, India



Linear SVM

We would like to draw two parallel hyperplanes such that

    w · xi + b ≥ +1   if yi = +1
    w · xi + b ≤ −1   if yi = −1

or, equivalently, for all i,

    yi (w · xi + b) ≥ 1.

The two parallel hyperplanes are

    w · x + b = +1   and   w · x + b = −1.




Linear SVM

The distance between the origin and the hyperplane w · x + b = +1 is |1 − b| / ||w||.
The distance between the origin and the hyperplane w · x + b = −1 is |1 + b| / ||w||.

Then, the margin is 2 / ||w||.

Then the problem is:

    Minimize (objective function)   (1/2) ||w||²
    Subject to the constraints      yi (w · xi + b) ≥ 1   for all i.

Note that the objective is convex and the constraints are linear, so the Lagrangian method can be applied.


Constrained Optimization Problem

Minimize f(w) subject to the constraints gi(w) ≥ 0, i = 1, . . . , n.

The Lagrangian is

    L(w, α) = f(w) − Σi αi gi(w),

where w is called the primal variable and α = (α1, . . . , αn) are the Lagrange multipliers, which are also called the dual variables.

L has to be minimized with respect to the primal variables and maximized with respect to the dual variables.


Constrained Optimization Problem

The K.K.T. (Karush-Kuhn-Tucker) conditions, which are "necessary" at the optimum, are:

1. ∂L/∂w = 0
2. αi ≥ 0 for all i
3. αi gi(w) = 0 for all i

If f is convex and gi is linear for all i, then it turns out that the K.K.T. conditions are "necessary and sufficient" for the optimal w.


Convex Function

A real-valued function f defined on an interval is said to be convex if

    f(λa + (1 − λ)b) ≤ λ f(a) + (1 − λ) f(b)

for all a, b in the interval and 0 ≤ λ ≤ 1.

[Figure: a convex function; the chord between the points a and b lies above the function.]

This definition can be extended to functions on higher-dimensional spaces.
Lagrangian

Minimize (objective function)   (1/2) ||w||²
Subject to the constraints      yi (w · xi + b) − 1 ≥ 0   for all i.

The Lagrangian is

    L(w, b, α) = (1/2) ||w||² − Σi αi [ yi (w · xi + b) − 1 ].

Here f corresponds to (1/2) ||w||² and gi corresponds to yi (w · xi + b) − 1.


K.K.T. Conditions

    ∂L/∂w = 0  ⇒  w = Σi αi yi xi                              (1)

    ∂L/∂b = 0  ⇒  Σi αi yi = 0                                 (2)

    αi ≥ 0                            for i = 1 to n           (3)

    αi [ yi (w · xi + b) − 1 ] = 0    for i = 1 to n           (4)

Solve these equations to get w, b and α.

While it is possible to do this, it is tedious!


Wolfe Dual Formulation

Other, easier and more advantageous, ways to solve the optimization problem do exist, and they can be easily extended to non-linear SVMs.

The idea is to obtain a dual objective W(α) in which w and b are eliminated and which involves only α.

We know that L has to be maximized with respect to the dual variables α.


Wolfe Dual Formulation

The Lagrangian is:

    L(w, b, α) = (1/2) ||w||² − Σi αi [ yi (w · xi + b) − 1 ].

Substituting w = Σi αi yi xi and Σi αi yi = 0 (from the K.K.T. conditions) gives the dual objective

    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj).





Wolfe Dual Formulation

Maximize with respect to α:

    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj)

such that Σi αi yi = 0 and αi ≥ 0 for all i.

We need to find the Lagrange multipliers α only.
The primal variables w and b are eliminated.
There exist various numerical iterative methods to solve this constrained convex quadratic optimization problem.
Sequential minimal optimization (SMO) is one such technique, which is simple and relatively fast.


The Optimization Problem

Let α = (α1, . . . , αn)ᵀ. Then the dual problem can be written as

    Maximize    W(α) = Σi αi − (1/2) αᵀ Q α
    subject to  Σi αi yi = 0 and αi ≥ 0 for all i,

where Q is the n × n matrix with its (i, j)-th entry being yi yj (xi · xj).

If αi ≠ 0 then, from (4), xi is on a hyperplane, i.e., xi is a support vector.
Note: such an xi lies on the hyperplane yi (w · xi + b) = 1.
Similarly, if xi does not lie on a hyperplane, then αi = 0.
That is, for interior points, αi = 0.


The Solution

Once α is known, we can find w = Σi αi yi xi.

The classifier is

    f(x) = sign(w · x + b) = sign( Σi αi yi (xi · x) + b ).

b can be found from (4): for any αj > 0 we have yj (w · xj + b) = 1. Multiplying both sides by yj (and using yj² = 1) we get w · xj + b = yj. So,

    b = yj − Σi αi yi (xi · xj).

(A numerical sketch of solving the dual follows.)
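A hedged numerical sketch of solving this hard-margin dual with SciPy's SLSQP solver; the toy data, the negated objective, and the way b is recovered are illustrative choices, not part of the original slides.

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                                    # minimise -W(alpha)
    return -(a.sum() - 0.5 * a @ Q @ a)

n = len(y)
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                                 # w = sum_i alpha_i y_i x_i
j = int(np.argmax(alpha))                           # a support vector (alpha_j > 0)
b = y[j] - w @ X[j]                                 # from y_j (w . x_j + b) = 1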



Some observations

In the dual problem formulation and in its solution we have only dot products between some of the training patterns.

Once we have the matrix Q, the problem and its solution are independent of the dimensionality of the input space.


Non-linear SVM

We know that every non-linear function in the X-space (input space) can be seen as a linear function in an appropriate Y-space (feature space).

Let the mapping be φ : X → Y.

Once φ is defined, one has to replace, in the problem as well as in the solution, certain products, as explained below.

Whenever we see (xi · xj), replace it by (φ(xi) · φ(xj)).

While it is possible to explicitly define φ, generate the training set in the Y-space, and then obtain the solution,
it is tedious and, amazingly, unnecessary as well.


Kernel Function

For certain mappings φ, the dot product in the Y-space can be obtained as a function in the X-space itself: φ(x) · φ(z) = K(x, z). There is no need to explicitly generate the patterns in the Y-space.

Example: consider a two-dimensional problem with x = (x1, x2) and the mapping φ(x) = (x1², √2 x1 x2, x2²). Let K(x, z) = (x · z)². Then

    φ(x) · φ(z) = x1² z1² + 2 x1 x2 z1 z2 + x2² z2² = (x1 z1 + x2 z2)² = (x · z)².

This kernel trick is one of the reasons for the success of SVMs.
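A quick numerical check, assuming the standard mapping φ(x) = (x1², √2 x1 x2, x2²), that this mapping reproduces the kernel K(x, z) = (x · z)²; the test vectors are arbitrary.

import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)    # both sides equal 1.0 here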


Kernel Function

K(x, z) is called the kernel function.

We say K is a valid kernel iff there exists a φ such that K(x, z) = φ(x) · φ(z) for all x and z.

Mercer's Theorem gives the necessary conditions for a kernel to be valid.

While Mercer's theorem is a mathematically involved one, some of the properties of kernels can be used to verify whether a kernel is valid.


Kernel Function

The following three properties are satisfied by any kernel:

1. K(x, z) = K(z, x) (symmetry).
2. K(x, x) ≥ 0.
3. K(x, z)² ≤ K(x, x) K(z, z) (a Cauchy-Schwarz-type inequality).


Some ways to generate new Kernels

If K1, K2 are kernels, then K1 + K2 is a kernel, and c K1 (c ≥ 0) is a kernel.

If K1, K2 are kernels, then their product is also a valid kernel. That is, K(x, z) = K1(x, z) K2(x, z) is a valid kernel.

For any symmetric, positive semi-definite matrix B, K(x, z) = xᵀ B z is a valid kernel.

Let p be a polynomial with positive coefficients. If K is a kernel, then p(K(x, z)) is a valid kernel.

If K is a kernel, then exp(K(x, z)) (called the exponential kernel) is also a valid kernel.

K(x, z) = exp(−||x − z||² / σ²) (called the Gaussian kernel) is a kernel.


Soft Margin Formulation

Until now, we assumed that the data is linearly (or non-linearly) separable.

The SVM derived so far is sensitive to noise.

The soft margin formulation allows violation of the constraints to some extent. That is, we allow some of the boundary patterns to be misclassified.


Soft Margin Formulation

    w · xi + b ≥ +1 − ξi   if yi = +1
    w · xi + b ≤ −1 + ξi   if yi = −1

or, equivalently, for all i,

    yi (w · xi + b) ≥ 1 − ξi.

The ξi are called slack variables, and ξi ≥ 0 for all i.

Now, the objective to minimize is:

    (1/2) ||w||² + C Σi ξi,

where C is called the penalty parameter and C > 0.



Soft Margin Formulation

The Lagrangian is

    L(w, b, ξ, α, β) = (1/2) ||w||² + C Σi ξi − Σi αi [ yi (w · xi + b) − 1 + ξi ] − Σi βi ξi,

where the αi and βi are Lagrange multipliers.


Soft Margin: KKT Conditions

    ∂L/∂w = 0  ⇒  w = Σi αi yi xi

    ∂L/∂b = 0  ⇒  Σi αi yi = 0

    ∂L/∂ξi = 0  ⇒  C − αi − βi = 0   for all i



Soft Margin: KKT Conditions

    αi ≥ 0,  βi ≥ 0,  ξi ≥ 0   for all i

    αi [ yi (w · xi + b) − 1 + ξi ] = 0   for all i

    βi ξi = 0   for all i




Wolfe Dual Formulation

Maximize with respect to α:

    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj)

such that Σi αi yi = 0 and 0 ≤ αi ≤ C for all i.

A very large C approaches the hard margin.
A very small C gives a very, very soft margin.

(A usage sketch with a soft-margin SVM follows.)
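A hedged usage sketch with scikit-learn's SVC: C is the soft-margin penalty parameter discussed above and 'rbf' is the Gaussian kernel; the toy data and parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X, y)
print(clf.support_vectors_)         # the support vectors found by the solver
print(clf.predict([[1.5, 1.0]]))    # classify a new point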




Training Methods

Any convex quadratic programming technique can be applied.

But with larger training sets, most of the standard techniques can become very slow and memory-hungry. For example, many techniques need to store the kernel matrix, whose size is n × n, where n is the number of training patterns.

These considerations have driven the design of specific algorithms for SVMs that can exploit the sparseness of the solution, the convexity of the optimization problem, and the implicit mapping into feature space.

One such simple and fast method is Sequential Minimal Optimization (SMO).

Next Class ...
SMO algorithm.


Advanced Classification Techniques

August 25, 2017

Outline
Ensemble of Classifiers
Bagging


Model Ensembles

TWO HEADS ARE BETTER THAN ONE

1 Construct a set of classifiers from adapted versions of the training data (resampled/reweighted).

2 Combine the predictions of these classifiers in some way, often by simple averaging or voting.


Figure: General Idea of Ensemble


Rationale for Ensemble Method

1 Suppose there are 25 base classifiers.
2 Each classifier has error rate ε = 0.35.
3 Assume the classifiers are independent.
4 The probability that the ensemble (majority-vote) classifier makes a wrong prediction is

    Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) = 0.06
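A quick check of this number in code (illustrative only): a majority vote of 25 independent classifiers, each with error rate 0.35, is wrong only when 13 or more of them err.

from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i) for i in range(13, n + 1))
print(round(p_wrong, 2))    # 0.06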


Bootstrap aggregating (Bagging)

INPUT: Training set D, a learning algorithm A, T = ensemble size.
OUTPUT: Ensemble of classifiers.

for t = 1 to T do
  1 Build a bootstrap sample Dt by sampling |D| points from D with replacement.
  2 Run A on Dt to build a classifier Mt.
Return {Mt | 1 ≤ t ≤ T}

Comments: the probability that a particular data point is not selected in a bootstrap sample is (1 − 1/n)^n, which approaches 1/e ≈ 0.368 for large n. (A sketch in code follows.)
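A minimal sketch of bagging (not the slides' code); decision stumps as the base learner, +1/-1 labels, and voting by summing predictions are illustrative choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=10, seed=0):
    rng = np.random.default_rng(seed)
    n, models = len(X), []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample of |D| points
        models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = sum(m.predict(X) for m in models)            # labels assumed to be +1 / -1
    return np.sign(votes)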



Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1- 1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 5: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1.0 1.0 1.0
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = +1 else (this stump predicts +1 everywhere).

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 5: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1.0 1.0 1.0
y 1 1 1 -1 -1 -1 -1 1 1 1


Bagging Example (continued)
Bagging Round 6: x ≤ 0.75 : y = −1, y = +1 else.

x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1.0
y 1 -1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 7: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1.0
y 1 -1 -1 -1 -1 1 1 1 1 1

Bagging Round 8: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 9: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1.0 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 10: x ≤ 0.05 : y = −1, y = +1 else.

x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y 1 1 1 1 1 1 1 1 1 1


Bagging: combining the 10 rounds

Training data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Round  Stump                      x=0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
1      x ≤ 0.35: +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
2      x ≤ 0.65: +1, else +1        1    1    1    1    1    1    1    1    1    1
3      x ≤ 0.35: +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
4      x ≤ 0.3:  +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
5      x ≤ 0.35: +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
6      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
7      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
8      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
9      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
10     x ≤ 0.05: -1, else +1        1    1    1    1    1    1    1    1    1    1
SUM                                 2    2    2   -6   -6   -6   -6    2    2    2
Class (sign of SUM)                 1    1    1   -1   -1   -1   -1    1    1    1
Actual                              1    1    1   -1   -1   -1   -1    1    1    1


Random Forest

Bagging is useful in combination with decision tree models.


Build each tree from a different random subset of features
(subspace sampling).


Random Forest

INPUT: Training set D, subspace dimension d, T = ensemble size.
OUTPUT: Ensemble of classifiers.

for t = 1 to T do
  1 Build a bootstrap sample Dt by sampling with replacement.
  2 Select d features randomly and reduce the dimensionality of Dt accordingly.
  3 Build a decision tree Mt using Dt.
Return {Mt | 1 ≤ t ≤ T}.

(A sketch in code follows.)
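A minimal sketch of this random-subspace variant (one random feature subset per tree); d, T, unpruned trees, and +1/-1 labels are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, d, T=10, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(T):
        rows = rng.integers(0, n, size=n)                # bootstrap sample
        cols = rng.choice(p, size=d, replace=False)      # random feature subspace
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def random_forest_predict(forest, X):
    votes = sum(tree.predict(X[:, cols]) for tree, cols in forest)   # +1 / -1 labels
    return np.sign(votes)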


Boosting

Boosting assigns a weight to each training example and adaptively changes the weights at the end of each boosting round.

How do the weights change?
Half of the total weight is assigned to the misclassified examples and the other half to the remaining (correctly classified) examples.


Boosting

The initial weights are uniform and sum to 1.
The total weight currently assigned to the misclassified samples is exactly the error rate ε.
Multiply the weights of the misclassified instances by 1/(2ε).
Multiply the weights of the correctly classified samples by 1/(2(1 − ε)).

Table: Confusion Matrix

                         ACTUAL CLASS
                         Class=+   Class=-
PREDICTED   Class=+         24        9
CLASS       Class=-         16       51

Here ε = (9 + 16)/100 = 0.25, so the weight update factor is 1/(2ε) = 2 for misclassified samples and 1/(2(1 − ε)) = 1/1.5 = 2/3 for correctly classified samples. (A small check in code follows.)
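A small arithmetic check of this reweighting rule for the confusion matrix above (illustrative only): 25 of the 100 points are misclassified, so ε = 0.25.

import numpy as np

w = np.full(100, 1 / 100)              # uniform initial weights, summing to 1
mis = np.zeros(100, dtype=bool)
mis[:25] = True                        # 9 + 16 = 25 misclassified points
eps = w[mis].sum()                     # 0.25
w[mis] *= 1 / (2 * eps)                # factor 2
w[~mis] *= 1 / (2 * (1 - eps))         # factor 2/3
print(w[mis].sum(), w[~mis].sum())     # 0.5 and 0.5: each group now holds half the weight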

Boosting

Confidence/weight of each boosting model (α):

    αt = (1/2) ln( (1 − εt) / εt )


AdaBoost

INPUT: Training set D; ensemble size T; learning algorithm A.
OUTPUT: Weighted ensemble of models.

w1i = 1/|D| for all xi ∈ D
for t = 1 to T do
  1 Run A on D with weights wti to produce a model Mt.
  2 Calculate the weighted error εt.
  3 if εt ≥ 1/2 then T = t − 1; break.
  4 αt = (1/2) ln( (1 − εt) / εt )
  5 w(t+1)i = (wti / Zt) × exp(−αt) for correctly classified examples;
    w(t+1)i = (wti / Zt) × exp(+αt) for misclassified examples.
Return M(x) = Σ_{t=1}^{T} αt Mt(x)

(A sketch in code follows.)
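A minimal sketch of AdaBoost with decision stumps (not the slides' code); +1/-1 labels and scikit-learn's sample_weight support are assumptions made for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                                # w_1i = 1/|D|
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                           # weighted error
        if eps >= 0.5:                                     # weak learner no better than chance
            break
        eps = max(eps, 1e-10)                              # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)                     # exp(-alpha) if correct, exp(+alpha) if not
        w /= w.sum()                                       # Z_t normalisation
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))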


Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2: x ≤ 0.05 : y = +1, y = +1 else.

x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3: x ≤ 0.3 : y = +1, y = −1 else.

x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1


Example (combining the three boosting rounds)

Round   Split Point   α
1       0.75          1.738
2       0.05          2.7784
3       0.3           4.1195

Round   x=0.1  0.2   0.3   0.4    0.5    0.6    0.7    0.8     0.9     1.0
1        -1    -1    -1    -1     -1     -1     -1      1       1       1
2         1     1     1     1      1      1      1      1       1       1
3         1     1     1    -1     -1     -1     -1     -1      -1      -1
Sum      5.16  5.16  5.16  -3.08  -3.08  -3.08  -3.08   0.397   0.397   0.397
Sign      1     1     1    -1     -1     -1     -1      1       1       1


THANK YOU
