
Data Warehousing and Mining

Data
Book: Introduction to Data Mining, by Tan | Steinbach | Kumar

Instructor: BIDYUT KUMAR PATRA

Tuesday[10:00-11:00], Thursday[09:00-10:00], Friday[08:00-09:00]

Saturday[4.15pm-5.15pm]

https://sites.google.com/site/patrabidyutkr/teaching/data-warehousing-and-mining-cs6312

17/09/2020
Outline

 Attributes and Objects

 Types of Data

 Data Quality

 Similarity and Distance

 Data Preprocessing

17/09/2020
What is Data?

 Collection of data objects and their attributes

 An attribute is a property or characteristic of an object
   – Examples: eye color of a person, temperature, etc.
   – Attribute is also known as variable, field, characteristic, dimension, or feature

 A collection of attributes describes an object
   – Object is also known as record, point, case, sample, entity, or instance

 Example (each row is an object, each column is an attribute):

   Tid  Refund  Marital Status  Taxable Income  Cheat
    1   Yes     Single          125K            No
    2   No      Married         100K            No
    3   No      Single          70K             No
    4   Yes     Married         120K            No
    5   No      Divorced        95K             Yes
    6   No      Married         60K             No
    7   Yes     Divorced        220K            No
    8   No      Single          85K             Yes
    9   No      Married         75K             No
   10   No      Single          90K             Yes
A More Complete View of Data

 Attributes (objects) may have relationships with


other attributes (objects)

 More generally, data may have structure

 Data can be incomplete

 We will discuss this in more detail later

17/09/2020
Attribute Values

 Attribute values are numbers or symbols


assigned to an attribute for a particular object

 Distinction between attributes and attribute values


– Same attribute can be mapped to different attribute
values
 Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of


values
 Example: Attribute values for ID and age are integers
 But properties of attribute values can be different

17/09/2020
Measurement of Length
 The way you measure an attribute may not match the attribute's properties.

[Figure: the same set of line segments measured on two different scales — one scale preserves only the ordering property of length, the other preserves both the ordering and additivity properties of length.]
Types of Attributes

 There are different types of attributes


– Nominal
 Examples: ID numbers, eye color, zip codes
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height {tall, medium, short}
– Interval
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, counts,
elapsed time (e.g., time to run a race)
17/09/2020
Properties of Attribute Values

 The type of an attribute depends on which of the


following properties/operations it possesses:
– Distinctness: =, ≠
– Order: <, >
– Differences are meaningful: +, −
– Ratios are meaningful: *, /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful
differences
– Ratio attribute: all 4 properties/operations

17/09/2020
Difference Between Ratio and Interval

 Is it physically meaningful to say that a


temperature of 10 ° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?

 Consider measuring the height above average


– If Bill’s height is three inches above average and
Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
– Is this situation analogous to that of temperature?

17/09/2020
Attribute Type | Description | Examples | Operations

Categorical (Qualitative):

  Nominal  – Nominal attribute values only distinguish (=, ≠).
             Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
             Operations: mode, entropy, contingency correlation, χ² test

  Ordinal  – Ordinal attribute values also order objects (<, >).
             Examples: hardness of minerals, {good, better, best}, grades, street numbers
             Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative):

  Interval – For interval attributes, differences between values are meaningful (+, −).
             Examples: calendar dates, temperature in Celsius or Fahrenheit
             Operations: mean, standard deviation, Pearson's correlation, t and F tests

  Ratio    – For ratio variables, both differences and ratios are meaningful (*, /).
             Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
             Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens


Attribute Type | Transformation | Comments

Categorical (Qualitative):

  Nominal  – Transformation: any permutation of values
             Comment: if all employee ID numbers were reassigned, would it make any difference?

  Ordinal  – Transformation: an order-preserving change of values, i.e.,
             new_value = f(old_value), where f is a monotonic function
             Comment: an attribute encompassing the notion of good, better, best can be
             represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative):

  Interval – Transformation: new_value = a * old_value + b, where a and b are constants
             Comment: the Fahrenheit and Celsius temperature scales differ in terms of where
             their zero value is and the size of a unit (degree).

  Ratio    – Transformation: new_value = a * old_value
             Comment: length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens


Discrete and Continuous Attributes

 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
17/09/2020
Asymmetric Attributes
 Only presence (a non-zero attribute value) is regarded as
important
 Words present in documents
 Items present in customer transactions

 If we met a friend in the grocery store would we ever say the


following?
“I see our purchases are very similar since we didn’t buy most of the
same things.”

 We need two asymmetric binary attributes to represent one


ordinary binary attribute
– Association analysis uses asymmetric attributes

 Asymmetric attributes typically arise from objects that are


sets

17/09/2020
Key Messages for Attribute Types

 The types of operations you choose should be


“meaningful” for the type of data you have
– Distinctness, order, meaningful intervals, and meaningful ratios
are only four properties of data

– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are not
present

– Analysis may depend on these other properties of the data


 Many statistical analyses depend only on the distribution

– Many times what is meaningful is measured by statistical


significance

– But in the end, what is meaningful is measured by the domain

17/09/2020
Types of data sets
 Record
– Data Matrix
– Document Data
– Transaction Data
 Graph
– World Wide Web
– Molecular Structures
 Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

17/09/2020
Important Characteristics of Data

– Dimensionality (number of attributes)


 High dimensional data brings a number of challenges

– Sparsity
 Only presence counts

– Resolution
 Patterns depend on the scale

– Size
 Type of analysis may depend on size of data

17/09/2020
Record Data

 Data that consists of a collection of records, each


of which consists of a fixed set of attributes
   Tid  Refund  Marital Status  Taxable Income  Cheat
    1   Yes     Single          125K            No
    2   No      Married         100K            No
    3   No      Single          70K             No
    4   Yes     Married         120K            No
    5   No      Divorced        95K             Yes
    6   No      Married         60K             No
    7   Yes     Divorced        220K            No
    8   No      Single          85K             Yes
    9   No      Married         75K             No
   10   No      Single          90K             Yes

17/09/2020
Data Matrix

 If data objects have the same fixed set of numeric


attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

 Such a data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute
   Projection of x Load   Projection of y Load   Distance   Load   Thickness
   10.23                  5.27                   15.22      2.7    1.2
   12.65                  6.25                   16.22      2.2    1.1

17/09/2020
Document Data

 Each document becomes a ‘term’ vector


– Each term is a component (attribute) of the vector
– The value of each component is the number of times
the corresponding term occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
 Document 1    3     0      5     0     2      6     0    2      0        2
 Document 2    0     7      0     2     1      0     0    3      0        0
 Document 3    0     1      0     0     1      2     2    0      3        0

17/09/2020
Transaction Data

 A special type of data, where


– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
– Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
17/09/2020
Graph Data

 Examples: Generic graph, a molecule, and webpages

[Figures: a generic graph with labeled nodes and edges, linked webpages, and the benzene molecule C6H6.]


17/09/2020
Ordered Data

 Sequences of transactions
Items/Events

An element of
the sequence
17/09/2020
Ordered Data

 Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

17/09/2020
Ordered Data

 Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean

17/09/2020
Data Quality

 Poor data quality negatively affects many data processing


efforts
“The most important point is that poor data quality is an unfolding
disaster.
– Poor data quality costs the typical company at least ten
percent (10%) of revenue; twenty percent (20%) is
probably a better estimate.”
Thomas C. Redman, DM Review, August 2004

 Data mining example: a classification model for detecting


people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
17/09/2020
Data Quality …

 What kinds of data quality problems?


 How can we detect problems with the data?
 What can we do about these problems?

 Examples of data quality problems:


– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
– Fake data
17/09/2020
Noise

 For objects, noise is an extraneous object


 For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise


17/09/2020
Outliers

 Outliers are data objects with characteristics that


are considerably different than most of the other
data objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis

– Case 2: Outliers are


the goal of our analysis
 Credit card fraud
 Intrusion detection

 Causes?
17/09/2020
Missing Values

 Reasons for missing values


– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values


– Eliminate data objects or variables
– Estimate missing values
 Example: time series of temperature
 Example: census results

– Ignore the missing value during analysis

17/09/2020
Missing Values …

 Missing completely at random (MCAR)


– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
 Missing at Random (MAR)
– Missingness is related to other variables
– Fill in values based other values
– Almost always produces a bias in the analysis
 Missing Not at Random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness
 Not possible to know the situation from the data
17/09/2020
Duplicate Data

 Data set may include data objects that are


duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

 Examples:
– Same person with multiple email addresses

 Data cleaning
– Process of dealing with duplicate data issues

 When should duplicate data not be removed?


17/09/2020
Similarity and Dissimilarity Measures

 Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
 Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
 Proximity refers to a similarity or dissimilarity
17/09/2020
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity


between two objects, x and y, with respect to a single, simple
attribute.

17/09/2020
Euclidean Distance

 Euclidean Distance

    dist(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )

 where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

 Standardization is necessary if scales differ.

17/09/2020
Euclidean Distance

   point   x   y
   p1      0   2
   p2      2   0
   p3      3   1
   p4      5   1

          p1      p2      p3      p4
   p1     0       2.828   3.162   5.099
   p2     2.828   0       1.414   3.162
   p3     3.162   1.414   0       2
   p4     5.099   3.162   2       0
Distance Matrix
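A minimal NumPy sketch (not from the slides) that reproduces the distance matrix above for the four points:

    import numpy as np

    # The four points from the slide: p1(0,2), p2(2,0), p3(3,1), p4(5,1)
    points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

    # Pairwise Euclidean distances: sqrt(sum_k (x_k - y_k)^2)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    print(np.round(dist, 3))
    # [[0.    2.828 3.162 5.099]
    #  [2.828 0.    1.414 3.162]
    #  [3.162 1.414 0.    2.   ]
    #  [5.099 3.162 2.    0.   ]]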
17/09/2020
Minkowski Distance

 Minkowski Distance is a generalization of Euclidean Distance:

    dist(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}

 where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

17/09/2020
Minkowski Distance: Examples

 r = 1. City block (Manhattan, taxicab, L 1 norm) distance.


– A common example of this for binary vectors is the
Hamming distance, which is just the number of bits that are
different between two binary vectors

 r = 2. Euclidean distance

 r → ∞. "supremum" (L_max norm, L_∞ norm) distance.


– This is the maximum difference between any component of
the vectors

 Do not confuse r with n, i.e., all these distances are


defined for all numbers of dimensions.

17/09/2020
Minkowski Distance

   point   x   y
   p1      0   2
   p2      2   0
   p3      3   1
   p4      5   1

   L1     p1   p2   p3   p4
   p1     0    4    4    6
   p2     4    0    2    4
   p3     4    2    0    2
   p4     6    4    2    0

   L2     p1      p2      p3      p4
   p1     0       2.828   3.162   5.099
   p2     2.828   0       1.414   3.162
   p3     3.162   1.414   0       2
   p4     5.099   3.162   2       0

   L∞     p1   p2   p3   p4
   p1     0    2    3    5
   p2     2    0    1    3
   p3     3    1    0    2
   p4     5    3    2    0

Distance Matrix
17/09/2020
Mahalanobis Distance

    mahalanobis(x, y) = (x − y)ᵀ Σ⁻¹ (x − y)

 Σ is the covariance matrix

For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.


17/09/2020
Mahalanobis Distance

Covariance Matrix:

    Σ = | 0.3  0.2 |
        | 0.2  0.3 |

Points:  A: (0.5, 0.5)   B: (0, 1)   C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
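A small NumPy check of the two Mahalanobis distances quoted above (a sketch, not part of the slides):

    import numpy as np

    cov = np.array([[0.3, 0.2], [0.2, 0.3]])
    cov_inv = np.linalg.inv(cov)

    A = np.array([0.5, 0.5])
    B = np.array([0.0, 1.0])
    C = np.array([1.5, 1.5])

    def mahalanobis(x, y, cov_inv):
        d = x - y
        return d @ cov_inv @ d   # (x - y)^T Sigma^{-1} (x - y)

    print(mahalanobis(A, B, cov_inv))   # -> 5.0
    print(mahalanobis(A, C, cov_inv))   # -> 4.0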

17/09/2020
Common Properties of a Distance

 Distances, such as the Euclidean distance,


have some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between


points (data objects), x and y.

 A distance that satisfies these properties is a


metric
17/09/2020
Common Properties of a Similarity

 Similarities, also have some well known


properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y.


(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data


objects), x and y.

17/09/2020
Similarity Between Binary Vectors
 Common situation is that objects, x and y, have only
binary attributes

 Compute similarities using the following quantities


f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

 Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of 11 matches / number of non-zero attributes


= (f11) / (f01 + f10 + f11)

17/09/2020
SMC versus Jaccard: Example

x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
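A short sketch (not from the slides) that reproduces the SMC and Jaccard values for these two vectors:

    def smc_and_jaccard(x, y):
        """Simple Matching Coefficient and Jaccard for two binary vectors."""
        f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
        f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
        f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
        f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
        smc = (f11 + f00) / (f11 + f00 + f10 + f01)
        jac = f11 / (f11 + f10 + f01) if (f11 + f10 + f01) > 0 else 0.0
        return smc, jac

    x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
    print(smc_and_jaccard(x, y))   # (0.7, 0.0)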

17/09/2020
Cosine Similarity

 If d1 and d2 are two document vectors, then


cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot
product of vectors, d1 and d2, and || d || is the length of
vector d.
 Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2 ) = 0.3150
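A quick sketch (not from the slides) that verifies the cosine value above:

    import math

    def cosine(d1, d2):
        dot = sum(a * b for a, b in zip(d1, d2))
        len1 = math.sqrt(sum(a * a for a in d1))
        len2 = math.sqrt(sum(b * b for b in d2))
        return dot / (len1 * len2)

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
    print(round(cosine(d1, d2), 4))   # 0.315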

17/09/2020
Extended Jaccard Coefficient (Tanimoto)

 Variation of Jaccard for continuous or count attributes
   – Reduces to Jaccard for binary attributes

    EJ(x, y) = (x · y) / ( ||x||² + ||y||² − x · y )

17/09/2020
Correlation measures the linear relationship between objects:

    corr(x, y) = covariance(x, y) / ( std(x) · std(y) ) = s_xy / (s_x s_y)

17/09/2020
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

17/09/2020
Drawback of Correlation

 x = (-3, -2, -1, 0, 1, 2, 3)


 y = (9, 4, 1, 0, 1, 4, 9)

y i = x i2

 mean(x) = 0, mean(y) = 4
 std(x) = 2.16, std(y) = 3.74

 corr = [ (−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5) ] / ( 6 × 2.16 × 3.74 ) = 0
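A one-line NumPy check (a sketch, not from the slides) that correlation misses this perfect nonlinear relationship:

    import numpy as np

    x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
    y = x ** 2   # perfect, but nonlinear, relationship

    print(np.corrcoef(x, y)[0, 1])   # 0.0 -- correlation does not detect y = x^2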

17/09/2020
Comparison of Proximity Measures

 Domain of application
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
 However, one can talk about various properties that
you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
 The measure must be applicable to the data and
produce results that agree with domain knowledge
17/09/2020
Information Based Measures

 Information theory is a well-developed and


fundamental discipline with broad applications

 Some similarity measures are based on


information theory
– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related
measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute

17/09/2020
Information and Probability

 Information relates to possible outcomes of an event


– transmission of a message, flip of a coin, or measurement
of a piece of data

 The more certain an outcome, the less information


that it contains and vice-versa
– For example, if a coin has two heads, then an outcome of
heads provides no information
– More quantitatively, the information is related the
probability of an outcome
 The smaller the probability of an outcome, the more information it
provides and vice-versa
– Entropy is the commonly used measure
17/09/2020
Entropy

 For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X , H(X), is given by
    H(X) = − Σ_{i=1}^{n} p_i log₂ p_i

 Entropy is between 0 and log2n and is measured in


bits
– Thus, entropy is a measure of how many bits it takes to
represent an observation of X on average
17/09/2020
Entropy Examples

 For a coin with probability p of heads and


probability q = 1 – p of tails
    H = − p log₂ p − q log₂ q
– For p= 0.5, q = 0.5 (fair coin) H = 1
– For p = 1 or q = 1, H = 0

 What is the entropy of a fair four-sided die?

17/09/2020
Entropy for Sample Data: Example

   Hair Color   Count   p      −p log₂ p
   Black        75      0.75   0.3113
   Brown        15      0.15   0.4105
   Blond        5       0.05   0.2161
   Red          0       0.00   0
   Other        5       0.05   0.2161
   Total        100     1.0    1.1540

Maximum entropy is log₂ 5 = 2.3219
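A quick check of the entropy computation above (a sketch, not part of the slides):

    import math

    counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
    m = sum(counts.values())

    # H(X) = -sum_i (m_i/m) log2(m_i/m); zero-count categories contribute 0
    H = -sum((c / m) * math.log2(c / m) for c in counts.values() if c > 0)
    print(round(H, 4))              # 1.154
    print(round(math.log2(5), 4))   # 2.3219  (maximum entropy for 5 categories)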

17/09/2020
Entropy for Sample Data

 Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– And the number of observation in the ith category is mi
– Then, for this sample
    H(X) = − Σ_{i=1}^{n} (m_i / m) log₂ (m_i / m)

 For continuous data, the calculation is harder


17/09/2020
Mutual Information

 Information one variable provides about another

 Formally, I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy of X and Y:

    H(X, Y) = − Σ_i Σ_j p_ij log₂ p_ij

 where p_ij is the probability that the ith value of X and the jth value of Y occur together

 For discrete variables, this is easy to compute

 Maximum mutual information for discrete variables is


log₂(min(nX, nY)), where nX (nY) is the number of values of X (Y)

17/09/2020
Mutual Information Example

   Student Status   Count   p      −p log₂ p
   Undergrad        45      0.45   0.5184
   Grad             55      0.55   0.4744
   Total            100     1.00   0.9928

   Grade   Count   p      −p log₂ p
   A       35      0.35   0.5301
   B       50      0.50   0.5000
   C       15      0.15   0.4105
   Total   100     1.00   1.4406

   Student Status   Grade   Count   p      −p log₂ p
   Undergrad        A       5       0.05   0.2161
   Undergrad        B       30      0.30   0.5211
   Undergrad        C       10      0.10   0.3322
   Grad             A       30      0.30   0.5211
   Grad             B       20      0.20   0.4644
   Grad             C       5       0.05   0.2161
   Total                    100     1.00   2.2710

Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
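A sketch (not from the slides) that reproduces the mutual-information value from the counts above:

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    status = {"Undergrad": 45, "Grad": 55}
    grade = {"A": 35, "B": 50, "C": 15}
    joint = {("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
             ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5}
    n = 100

    H_status = H([c / n for c in status.values()])   # 0.9928
    H_grade = H([c / n for c in grade.values()])     # 1.4406
    H_joint = H([c / n for c in joint.values()])     # 2.2710

    print(round(H_status + H_grade - H_joint, 4))    # 0.1624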

17/09/2020
Maximal Information Coefficient
 Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter
J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel
associations in large data sets." science 334, no. 6062 (2011): 1518-1524.

 Applies mutual information to two continuous


variables
 Consider the possible binnings of the variables into
discrete categories
– nX × nY ≤ N^0.6, where
 nX is the number of values of X
 nY is the number of values of Y
 N is the number of samples (observations, data objects)
 Compute the mutual information
– Normalized by log2(min( nX, nY )
 Take the highest value
17/09/2020
General Approach for Combining Similarities

 Sometimes attributes are of many different types, but an


overall similarity is needed.
1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0,
           or if one of the objects has a missing value for the kth attribute
   δ_k = 1 otherwise
3: Compute

    similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k

17/09/2020
Using Weights to Combine Similarities

 May not want to treat all attributes the same.


– Use non-negative weights ω_k

    similarity(x, y) = Σ_{k=1}^{n} ω_k δ_k s_k(x, y) / Σ_{k=1}^{n} ω_k δ_k

 Can also define a weighted form of distance

17/09/2020
Density

 Measures the degree to which data objects are close to


each other in a specified area
 The notion of density is closely related to that of proximity
 Concept of density is typically used for clustering and
anomaly detection
 Examples:
– Euclidean density
 Euclidean density = number of points per unit volume
– Probability density
 Estimate what the distribution of the data looks like
– Graph-based density
 Connectivity

17/09/2020
Euclidean Density: Grid-based Approach

 Simplest approach is to divide region into a


number of rectangular cells of equal volume and
define density as # of points the cell contains

Grid-based density. Counts for each cell.


17/09/2020
Euclidean Density: Center-Based

 Euclidean density is the number of points within a


specified radius of the point

Illustration of center-based density.


17/09/2020
Data Preprocessing

 Aggregation
 Sampling
 Dimensionality Reduction
 Feature subset selection
 Feature creation
 Discretization and Binarization
 Attribute Transformation

17/09/2020
Aggregation

 Combining two or more attributes (or objects) into


a single attribute (or object)

 Purpose
– Data reduction
 Reduce the number of attributes or objects
– Change of scale
 Cities aggregated into regions, states, countries, etc.
 Days aggregated into weeks, months, or years

– More “stable” data


 Aggregated data tends to have less variability
17/09/2020
Example: Precipitation in Australia

 This example is based on precipitation in


Australia from the period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average
monthly precipitation for 3,030 0.5◦ by 0.5◦ grid cells in
Australia, and
– A histogram for the standard deviation of the average
yearly precipitation for the same locations.
 The average yearly precipitation has less
variability than the average monthly precipitation.
 All precipitation measurements (and their
standard deviations) are in centimeters.
17/09/2020
Example: Precipitation in Australia …

Variation of Precipitation in Australia

Standard Deviation of Average Standard Deviation of


Monthly Precipitation Average Yearly Precipitation
17/09/2020
Sampling
 Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary investigation of
the data and the final data analysis.

 Statisticians often sample because obtaining the


entire set of data of interest is too expensive or
time consuming.

 Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.

17/09/2020
Sampling …

 The key principle for effective sampling is the


following:

– Using a sample will work almost as well as using the


entire data set, if the sample is representative

– A sample is representative if it has approximately the


same properties (of interest) as the original set of data

17/09/2020
Sample Size

8000 points 2000 Points 500 Points

17/09/2020
Types of Sampling
 Simple Random Sampling
– There is an equal probability of selecting any particular
item
– Sampling without replacement
 As each item is selected, it is removed from the
population
– Sampling with replacement
 Objects are not removed from the population as they
are selected for the sample.
 In sampling with replacement, the same object can
be picked up more than once
 Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition

17/09/2020
Sample Size
 What sample size is necessary to get at least one
object from each of 10 equal-sized groups.

17/09/2020
Curse of Dimensionality

 When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies

 Definitions of density and


distance between points,
which are critical for
clustering and outlier
detection, become less
meaningful • Randomly generate 500 points
• Compute difference between max and
min distance between any pair of points
17/09/2020
Dimensionality Reduction

 Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise

 Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques

17/09/2020
Dimensionality Reduction: PCA

 Goal is to find a projection that captures the


largest amount of variation in data
x2

x1
17/09/2020
Dimensionality Reduction: PCA

17/09/2020
Feature Subset Selection

 Another way to reduce dimensionality of data


 Redundant features
– Duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
 Irrelevant features
– Contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
 Many techniques developed, especially for
classification
17/09/2020
Feature Creation

 Create new attributes that can capture the


important information in a data set much more
efficiently than the original attributes

 Three general methodologies:


– Feature extraction
 Example: extracting edges from images
– Feature construction
 Example: dividing mass by volume to get density
– Mapping data to new space
 Example: Fourier and wavelet analysis

17/09/2020
Mapping Data to a New Space

 Fourier and wavelet transform

Frequency

Two Sine Waves + Noise Frequency

17/09/2020
Discretization

 Discretization is the process of converting a


continuous attribute into an ordinal attribute
– A potentially infinite number of values are mapped
into a small number of categories
– Discretization is commonly used in classification
– Many classification algorithms work best if both
the independent and dependent variables have
only a few values
– We give an illustration of the usefulness of
discretization using the Iris data set

17/09/2020
Iris Sample Data Set

 Iris Plant data set.


– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Douglas Fisher
– Three flower types (classes):
 Setosa
 Versicolour
 Virginica
– Four (non-class) attributes
 Sepal width and length
 Petal width and length Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
17/09/2020
Discretization: Iris Example

Petal width low or petal length low implies Setosa.


Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …

 How can we tell what the best discretization is?


– Unsupervised discretization: find breaks in the data values
   Example: histogram of Petal Length (counts vs. petal length)

– Supervised discretization: Use class labels to find


breaks
17/09/2020
Discretization Without Using Class Labels

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.

17/09/2020
Discretization Without Using Class Labels

Equal interval width approach used to obtain 4 values.

17/09/2020
Discretization Without Using Class Labels

Equal frequency approach used to obtain 4 values.

17/09/2020
Discretization Without Using Class Labels

K-means approach to obtain 4 values.

17/09/2020
Binarization

 Binarization maps a continuous or categorical


attribute into one or more binary variables

 Typically used for association analysis

 Often convert a continuous attribute to a


categorical attribute and then convert a
categorical attribute to a set of binary attributes
– Association analysis needs asymmetric binary
attributes
– Examples: eye color and height measured as
{low, medium, high}
17/09/2020
Attribute Transformation

 An attribute transform is a function that maps the


entire set of values of a given attribute to a new
set of replacement values such that each old
value can be identified with one of the new values
– Simple functions: xk, log(x), ex, |x|
– Normalization
 Refers to various techniques to adjust to
differences among attributes in terms of frequency
of occurrence, mean, variance, range
 Take out unwanted, common signal, e.g.,
seasonality
– In statistics, standardization refers to subtracting off
the means and dividing by the standard deviation
17/09/2020
Example: Sample Time Series of Plant Growth
Minneapolis

Net Primary
Production (NPP)
is a measure of
plant growth used
by ecosystem
scientists.

Correlations between time series


Correlations between time series
Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.7591 -0.7581
Atlanta 0.7591 1.0000 -0.5739
Sao Paolo -0.7581 -0.5739 1.0000

17/09/2020
Seasonality Accounts for Much Correlation
Minneapolis
Normalized using
monthly Z Score:
Subtract off monthly
mean and divide by
monthly standard
deviation

Correlations between time series


Correlations between time series
Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.0492 0.0906
Atlanta 0.0492 1.0000 -0.0154
Sao Paolo 0.0906 -0.0154 1.0000
17/09/2020
Data Warehousing and Mining
(CS6312)
From August 12, 2019:TC Slot (Room No. PPA)

10.00-11.00 (Monday)
09.00-10.00 (Tuesday)
08.00-09.00 (Thursday)

Computing Gini Index for a Collection of
Nodes
When a node p is split into k partitions (children)
    GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)

where, ni = number of records at child i,


n = number of records at parent node p.

Choose the attribute that minimizes weighted average


Gini index of the children

Gini index is used in decision tree algorithms such as


CART, SLIQ, SPRINT
Binary Attributes: Computing GINI Index

Splits into two partitions (child nodes)

Effect of weighing partitions:
– Larger and purer partitions are sought for.

Split on B? (Yes → Node N1, No → Node N2)

Parent: C1 = 7, C2 = 5, Gini = 0.486

        N1   N2
  C1    5    2
  C2    1    4

Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444

Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
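A small sketch (not from the slides) that reproduces these Gini numbers:

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    parent = [7, 5]             # C1, C2 at the parent
    n1, n2 = [5, 1], [2, 4]     # class counts in the two children of split B?

    g1, g2 = gini(n1), gini(n2)
    weighted = 6 / 12 * g1 + 6 / 12 * g2

    print(round(gini(parent), 3))              # 0.486
    print(round(g1, 3), round(g2, 3))          # 0.278 0.444
    print(round(weighted, 3))                  # 0.361
    print(round(gini(parent) - weighted, 3))   # gain = 0.125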
4/28/2021 28
Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions

Multi-way split:
             CarType
          Family  Sports  Luxury
   C1       1       8       1
   C2       3       0       7
   Gini   0.163

Two-way split (find best partition of values):
               CarType                          CarType
        {Sports, Luxury}  {Family}       {Sports}  {Family, Luxury}
   C1          9             1               8            2
   C2          7             3               0           10
   Gini      0.468                         0.167

Which of these is the best?

4/28/2021 29
Continuous Attributes: Computing Gini Index

Use binary decisions based on one value

   ID  Home Owner  Marital Status  Annual Income  Defaulted
    1  Yes         Single          125K           No
    2  No          Married         100K           No
    3  No          Single          70K            No
    4  Yes         Married         120K           No
    5  No          Divorced        95K            Yes
    6  No          Married         60K            No
    7  Yes         Divorced        220K           No
    8  No          Single          85K            Yes
    9  No          Married         75K            No
   10  No          Single          90K            Yes

Several choices for the splitting value
– Number of possible splitting values = number of distinct values

Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v

Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

   Annual Income?      ≤ 80   > 80
   Defaulted = Yes       0      3
   Defaulted = No        3      4

4/28/2021 30
Continuous Attributes: Computing Gini Index...

For efficient computation: for each attribute,


– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix
and computing gini index
– Choose the split position that has the least gini index

Cheat             No    No    No    Yes   Yes   Yes   No    No    No    No
Annual Income
(sorted values)   60    70    75    85    90    95    100   120   125   220

Split positions   55    65    72    80    87    92    97    110   122   172   230
                  ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>   ≤|>
Yes               0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No                0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini              0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
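A sketch (not from the slides) of the sorted linear scan; candidate positions are taken as midpoints between consecutive sorted values, which the slide's table rounds:

    income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    n = len(income)

    def gini(yes, no):
        tot = yes + no
        return 0.0 if tot == 0 else 1.0 - (yes / tot) ** 2 - (no / tot) ** 2

    total_yes, total_no = cheat.count("Yes"), cheat.count("No")

    # Candidate positions: below the minimum, between consecutive values, above the maximum
    positions = [income[0] - 5] + [(a + b) / 2 for a, b in zip(income, income[1:])] + [income[-1] + 10]

    best = None
    left_yes = left_no = 0
    for i, v in enumerate(positions):
        # left partition = first i records (Annual Income <= v), right = the remaining records
        right_yes, right_no = total_yes - left_yes, total_no - left_no
        w_gini = ((left_yes + left_no) / n) * gini(left_yes, left_no) \
                 + ((right_yes + right_no) / n) * gini(right_yes, right_no)
        if best is None or w_gini < best[1]:
            best = (v, w_gini)
        if i < n:                          # move one record into the left partition
            if cheat[i] == "Yes":
                left_yes += 1
            else:
                left_no += 1

    print(best)   # (97.5, 0.3) -- matches the minimum Gini of 0.300 in the table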

4/28/2021 31
Measure of Impurity: Entropy

Entropy at a given node t:

    Entropy(t) = − Σ_j p(j | t) log p(j | t)

(NOTE: p(j | t) is the relative frequency of class j at node t.)

 Maximum (log nc) when records are equally distributed


among all classes implying least information
 Minimum (0.0) when all records belong to one class,
implying most information

– Entropy based computations are quite similar to


the GINI index computations
4/28/2021 37
Gain Ratio

Gain Ratio:

    GainRatio_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σ_{i=1}^{k} (n_i / n) log (n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i

             CarType                      CarType                          CarType
       Family Sports Luxury       {Sports,Luxury}  {Family}         {Sports}  {Family,Luxury}
  C1     1      8      1                9             1                 8           2
  C2     3      0      7                7             3                 0          10
  Gini        0.163                        0.468                           0.167
  SplitINFO = 1.52                SplitINFO = 0.72                  SplitINFO = 0.97
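A short sketch (not from the slides) that reproduces the three SplitINFO values from the partition sizes:

    import math

    def split_info(partition_sizes):
        n = sum(partition_sizes)
        return -sum((ni / n) * math.log2(ni / n) for ni in partition_sizes if ni > 0)

    # Partition sizes (C1 + C2) for the three CarType splits above
    print(round(split_info([4, 8, 8]), 2))   # 1.52  multi-way: {Family}, {Sports}, {Luxury}
    print(round(split_info([16, 4]), 2))     # 0.72  {Sports,Luxury} vs {Family}
    print(round(split_info([8, 12]), 2))     # 0.97  {Sports} vs {Family,Luxury}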

4/28/2021 42
Misclassification Error vs Gini Index

Split on A? (Yes → Node N1, No → Node N2)

Parent: C1 = 7, C2 = 3, Gini = 0.42

        N1   N2
  C1    3    4
  C2    0    3
  Gini(Children) = 0.342

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same!
4/28/2021 45
Misclassification Error vs Gini Index

Split on A? (Yes → Node N1, No → Node N2)

Parent: C1 = 7, C2 = 3, Gini = 0.42

        N1   N2                  N1   N2
  C1    3    4             C1    3    4
  C2    0    3             C2    1    2
  Gini = 0.342             Gini = 0.416

Misclassification error for all three cases = 0.3!

4/28/2021 46
Comparison among Impurity Measures

For a 2-class problem:

4/28/2021 47
Decision Tree Based Classification
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
4/28/2021 Introduction to Data Mining, 2nd Edition 50
Outline
Distance Based Classification Algorithms

Distance Based Classification Algorithms

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Nearest Neighbor Classifier(Cover and Hart, 1967)

Let T = {(x_i, y_i)}_{i=1}^{n} be a set of labeled patterns (the training set),
where x_i is a pattern and y_i is its class label. Let x be a
pattern with unknown class label (a test pattern).
NN Rule is :
Let x ′ ∈ T be the pattern nearest to a test pattern x.
l (x) = l (x ′ )
Complexity:
Time: O(|T |)
Space: O(|T |)

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Condensed NNC (Hart, 1968)

INPUT: Training Set T


OUTPUT: A condensed set S.
1 Start with a condensed set S = {x}.

2 For each x ∈ T \ S
1 Classify x using NN considering S as training set.

2 if x is misclassified then S = S ∪ {x}


3 Repeat Step 2 until no change found in Condensed Set
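A minimal sketch of Hart's CNN procedure above, assuming Euclidean distance and starting from the first training pattern:

    import numpy as np

    def condensed_nn(X, y):
        """Keep a subset S that classifies every training pattern correctly with 1-NN."""
        S = [0]                                   # start with one (arbitrary) pattern
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                if i in S:
                    continue
                # 1-NN classification of X[i] using S as the training set
                nearest = min(S, key=lambda j: np.linalg.norm(X[i] - X[j]))
                if y[nearest] != y[i]:            # misclassified -> add to the condensed set
                    S.append(i)
                    changed = True
        return S

    # Tiny illustration with two well-separated classes
    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
    y = np.array([0, 0, 0, 1, 1, 1])
    print(condensed_nn(X, y))   # indices of the condensed set, e.g. [0, 3]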

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Modified CNN(Devi and Murthy, 2003)


1 Start with condensed set S. S contains one pattern from each
class.

2 G=∅
3 For each x ∈ T
1 Classify x using NN considering S as training set.

2 if x is misclassified then G = G ∪ {x}


4 Find a representative pattern from each class in G; Let
representative set is R.

5 S =S ∪R
6 G=∅
7 Repeat Step 2 to Step 6 until there is no misclassification.
Dr. Bidyut Kr. Patra Classification Technique (Contd..)
Outline
Distance Based Classification Algorithms

MCNN is an order-independent algorithm, but it may require many iterations to converge.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

K-Nearest Neighbor Classifier

Let T = {(xi , yi )}ni=1 be a Training set.


Let x be a pattern with unknown class label (test pattern).

Algorithm:
  KNN = ∅
  For each t ∈ T
    1  if |KNN| < K then
         KNN = KNN ∪ {t}
    2  else
         1  Find an x′ ∈ KNN such that dist(x, x′) > dist(x, t) (if one exists)
         2  KNN = KNN − {x′};  KNN = KNN ∪ {t}
The pattern x is assigned to the class to which most of the
patterns in KNN belong.
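A compact sketch of the K-NN rule (it sorts all distances with NumPy rather than maintaining the incremental set described above):

    import numpy as np
    from collections import Counter

    def knn_classify(T_X, T_y, x, K=3):
        """Majority vote among the K training patterns closest to x."""
        dists = np.linalg.norm(T_X - x, axis=1)     # distance from x to every training pattern
        knn_idx = np.argsort(dists)[:K]             # indices of the K nearest neighbours
        votes = Counter(T_y[i] for i in knn_idx)
        return votes.most_common(1)[0][0]           # class with the most votes

    T_X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]], dtype=float)
    T_y = np.array(["A", "A", "A", "B", "B", "B"])
    print(knn_classify(T_X, T_y, np.array([2.0, 2.0]), K=3))   # 'A'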

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

How to find the value of K

r-fold Cross Validation

  Partition the training set into r blocks; let these be T_1, T_2, ..., T_r.
  For i = 1 to r do
    1  Consider T − T_i as the training set and T_i as the validation set.
    2  For a range of K values (say from 1 to m), find the error rates on the validation set.
    3  Let these error rates be e_{i1}, e_{i2}, ..., e_{im}.
  Take ē_j = mean of {e_{1j}, e_{2j}, ..., e_{rj}}, for j = 1 to m.
  The value of K = argmin_j {ē_1, ē_2, ..., ē_m}.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Weighted k-NNC(Dudani, 1976)


k-NNC gives equal importance to the first NN and to the last NN.
Instead, a weight is assigned to each nearest neighbour of a query pattern q.
Let X = {x_1, x_2, ..., x_k} ⊆ T be the set of k nearest neighbours of q, whose
class label is to be determined.
Let D = {d_1, d_2, ..., d_k} be an ordered set, where d_i = ||x_i − q|| and d_i ≤ d_j for i < j.
The weight w_j assigned to the jth nearest neighbour is

    w_j = (d_k − d_j) / (d_k − d_1)   if d_j ≠ d_1
    w_j = 1                           if d_j = d_1

Calculate the weighted sum of the neighbours belonging to each class.
Classify q to the class for which the weighted sum is maximum.
Dr. Bidyut Kr. Patra Classification Technique (Contd..)
Outline
Distance Based Classification Algorithms

Editing Techniques

The larger the training set, the higher the computational cost.

Another class of techniques eliminates (edits) training prototypes
(patterns) that are erroneously labeled, commonly outliers, and at the
same time "cleans" the possible overlapping between
regions of different classes.

Wilson's editing relies on the idea that, if a prototype is
erroneously classified using the k-NN rule, it has to be eliminated
from the training set.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Edited Nearest Neighbor (Willson, 1976)

INPUT: T is a training set


OUTPUT: S is an edited set
1 for each x ∈ T do
1 classify x using k-NN Classifier (break the ties randomly)
2 if x is misclassified then mark x
2 Delete all marked patterns from T; let the reduced training set
be S.
3 Output S

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Outline
Distance Based Classification Algorithms

Repeated ENN (Tomek, 1976)

Apply the ENN method repeatedly until there is no change in the training set T.

Dr. Bidyut Kr. Patra Classification Technique (Contd..)


Confusion Matrix and ROC Plot

 Confusion Matrix

 Receiver Operating Characteristic (ROC)

1
Which model is better?

PREDICTED
A Class=Yes Class=No
ACTUAL Class=Yes 0 10
Class=No 0 990

PREDICTED
B Class=Yes Class=No
ACTUAL Class=Yes 10 0
Class=No 90 900

2
Which model is better?

PREDICTED
A Class=Yes Class=No
ACTUAL Class=Yes 5 5
Class=No 0 990

PREDICTED
B Class=Yes Class=No
ACTUAL Class=Yes 10 0
Class=No 90 900

3
Alternative Measures

PREDICTED CLASS
Class=Yes Class=No

Class=Yes a b
ACTUAL
CLASS Class=No c d

    Precision (p) = a / (a + c)

    Recall (r) = a / (a + b)

    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
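A small helper (a sketch, not from the slides) that computes these measures from the a, b, c, d counts above:

    def prf(a, b, c, d):
        """a = TP, b = FN, c = FP, d = TN (layout of the confusion matrix above)."""
        precision = a / (a + c)
        recall = a / (a + b)
        f = 2 * recall * precision / (recall + precision)
        accuracy = (a + d) / (a + b + c + d)
        return precision, recall, f, accuracy

    # Example: 40 TP, 10 FN, 10 FP, 40 TN (as in a later slide)
    print(prf(40, 10, 10, 40))   # precision 0.8, recall 0.8, F 0.8, accuracy 0.8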
4
Alternative Measures
10
PREDICTED CLASS Precision (p) = = 0.5
10 + 10
10
Class=Yes Class=No Recall (r) = =1
10 + 0
Class=Yes 10 0 2 *1* 0.5
ACTUAL F - measure (F) = = 0.62
CLASS Class=No 10 980 1 + 0.5
990
Accuracy = = 0.99
1000

5
Alternative Measures
10
PREDICTED CLASS Precision (p) = = 0.5
10 + 10
10
Class=Yes Class=No Recall (r) = =1
10 + 0
Class=Yes 10 0 2 *1* 0.5
ACTUAL F - measure (F) = = 0.62
CLASS Class=No 10 980 1 + 0.5
990
Accuracy = = 0.99
1000

PREDICTED CLASS 1
Precision (p) = =1
1+ 0
Class=Yes Class=No
1
Recall (r) = = 0.1
Class=Yes 1 9 1+ 9
ACTUAL 2 * 0.1*1
CLASS Class=No 0 990 F - measure (F) = = 0.18
1 + 0.1
991
Accuracy = = 0.991
1000
6
Alternative Measures

PREDICTED CLASS
Precision (p) = 0.8
Class=Yes Class=No
Recall (r) = 0.8
Class=Yes 40 10 F - measure (F) = 0.8
ACTUAL
CLASS Class=No 10 40 Accuracy = 0.8

7
Alternative Measures

PREDICTED CLASS
Precision (p) = 0.8
Class=Yes Class=No
Recall (r) = 0.8
A Class=Yes 40 10 F - measure (F) = 0.8
ACTUAL
CLASS Class=No 10 40 Accuracy = 0.8

PREDICTED CLASS
B Class=Yes Class=No Precision (p) =~ 0.04
Class=Yes 40 10 Recall (r) = 0.8
ACTUAL F - measure (F) =~ 0.08
CLASS Class=No 1000 4000
Accuracy =~ 0.8

8
Measures of Classification Performance

PREDICTED CLASS
Yes No
ACTUAL
Yes TP FN
CLASS
No FP TN

α is the probability that we reject


the null hypothesis when it is
true. This is a Type I error or a
false positive (FP).

β is the probability that we


accept the null hypothesis when
it is false. This is a Type II error
or a false negative (FN).

9
Alternative Measures

PREDICTED CLASS Precision (p) = 0.8


TPR = Recall (r) = 0.8
Class=Yes Class=No FPR = 0.2
F−measure (F) = 0.8
Class=Yes 40 10 Accuracy = 0.8
ACTUAL
CLASS Class=No 10 40
TPR
=4
FPR

PREDICTED CLASS Precision (p) = 0.038


TPR = Recall (r) = 0.8
Class=Yes Class=No
FPR = 0.2
Class=Yes 40 10 F−measure (F) = 0.07
ACTUAL Accuracy = 0.8
CLASS Class=No 1000 4000
TPR
=4
FPR

10
Alternative Measures

PREDICTED CLASS
Class=Yes Class=No
Precision (p) = 0.5
Class=Yes 10 40
TPR = Recall (r) = 0.2
ACTUAL
Class=No 10 40
FPR = 0.2
CLASS
F − measure = 0.28

PREDICTED CLASS
Precision (p) = 0.5
Class=Yes Class=No
TPR = Recall (r) = 0.5
Class=Yes 25 25
ACTUAL FPR = 0.5
Class=No 25 25
CLASS F − measure = 0.5

PREDICTED CLASS Precision (p) = 0.5


Class=Yes Class=No
TPR = Recall (r) = 0.8
Class=Yes 40 10
ACTUAL FPR = 0.8
Class=No 40 10
CLASS
F − measure = 0.61
11
ROC (Receiver Operating Characteristic)

 A graphical approach for displaying trade-off


between detection rate and false alarm rate
 Developed in 1950s for signal detection theory to
analyze noisy signals
 ROC curve plots TPR against FPR
– Performance of a model represented as a
point in an ROC curve
– Changing the threshold parameter of classifier
changes the location of the point

12
ROC Curve

(TPR,FPR):
 (0,0): declare everything
to be negative class
 (1,1): declare everything
to be positive class
 (1,0): ideal

 Diagonal line:
– Random guessing
– Below diagonal line:
 prediction is opposite
of the true class

13
ROC (Receiver Operating Characteristic)

 To draw ROC curve, classifier must produce


continuous-valued output
– Outputs are used to rank test records, from the most
likely positive class record to the least likely positive
class record

 Many classifiers produce only discrete outputs (i.e.,


predicted class)
– How to get continuous-valued outputs?
 Decision trees, rule-based classifiers, neural networks,
Bayesian classifiers, k-nearest neighbors, SVM

14
ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any points located at x > t is classified as positive

At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
15
Using ROC for Model Comparison

 No model consistently
outperforms the other
 M1 is better for
small FPR
 M2 is better for
large FPR

 Area Under the ROC


curve
 Ideal:
 Area =1
 Random guess:
 Area = 0.5

16
How to Construct an ROC curve

• Use a classifier that produces a continuous-valued score for each instance
  – The more likely it is for the instance to be in the + class, the higher the score
• Sort the instances in decreasing order according to the score
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
  – TPR = TP / (TP + FN)
  – FPR = FP / (FP + TN)

   Instance   Score   True Class
      1       0.95        +
      2       0.93        +
      3       0.87        -
      4       0.85        -
      5       0.85        -
      6       0.85        +
      7       0.76        -
      8       0.53        +
      9       0.43        -
     10       0.25        +

17
How to construct an ROC curve
Class + - + - - - + - + +
Threshold >= 0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00

TP 5 4 4 3 3 3 3 2 2 1 0

FP 5 5 4 4 3 2 1 1 0 0 0

TN 0 0 1 1 2 3 4 4 5 5 5

FN 0 1 1 2 2 2 2 3 3 4 5

TPR 1 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.2 0

FPR 1 1 0.8 0.8 0.6 0.4 0.2 0.2 0 0 0

ROC Curve:
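A small sketch (not from the slides) that reproduces the TPR/FPR values of the table for each unique threshold:

    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    P, N = labels.count('+'), labels.count('-')

    # One threshold per unique score, plus 1.00 above the maximum score
    thresholds = sorted(set(scores) | {1.00}, reverse=True)
    for t in thresholds:
        tp = sum(s >= t and y == '+' for s, y in zip(scores, labels))
        fp = sum(s >= t and y == '-' for s, y in zip(scores, labels))
        print(f"threshold >= {t:.2f}   TPR={tp / P:.1f}   FPR={fp / N:.1f}")
    # Plotting TPR against FPR for these thresholds gives the ROC curve.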

18
Classification Technique (Contd..)
Outline

Bayesian Classifier
Classification Technique (Contd..)
Bayesian Classifier

Bayes Rule

Relationship between the attribute set and the class is


non-deterministic.
A doctor knows that meningitis causes stiff neck 50% of the
time.
1 Prior probability of any patient having meningitis is 1/50, 000
2 Prior probability of any patient having stiff neck is 1/20

Bayesian Classifier or Bayes Rule models probabilistic


relationships between the features and classes.
Classification Technique (Contd..)
Bayesian Classifier

Conditional Probability

Let A and C be two random variable.


A conditional probability is the probability that a random
variable will take a particular value given that value of another
random variable is known.
Conditional probability:

    P(A | C) = P(A, C) / P(C)

    P(C | A) = P(A, C) / P(A)
Classification Technique (Contd..)
Bayesian Classifier

Bayes Theorem, 1763

    P(C | A) = P(A | C) P(C) / P(A)

If a patient has a stiff neck, what is the probability that s/he has meningitis?

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
Classification Technique (Contd..)
Bayesian Classifier

How to use Bayes Rule in Classification

Consider a two-class problem with classes ω_1, ω_2.
Given a pattern X = (a_1, a_2, a_3, ..., a_m), the goal is to predict the class label (ω_1 / ω_2) of the pattern X.

Compute:   P(ω_1 | X) = P(X | ω_1) × P(ω_1) / P(X)
Compute:   P(ω_2 | X) = P(X | ω_2) × P(ω_2) / P(X)

Assign class label ω_1 if P(ω_1 | X) > P(ω_2 | X), otherwise ω_2.
Equivalently, choose the class C which maximizes P(X | C) P(C).
Classification Technique (Contd..)
Bayesian Classifier

Naive Bayes Classifier

Let T be a training set and X = (a_1, a_2, a_3, ..., a_m) be an unknown object.
Suppose there are k classes, C_1, C_2, ..., C_k.
Naive Bayes predicts class C_i if and only if P(C_i | X) > P(C_j | X) for 1 ≤ j ≤ k, j ≠ i.

    P(C_i | X) = P(X | C_i) × P(C_i) / P(X)

How to estimate P(C_i)?

    Prior probability P(C_i) = |C_{i,T}| / |T|,
    where |C_{i,T}| = number of training patterns in class C_i
Classification Technique (Contd..)
Bayesian Classifier

Naive Bayes (Contd..)

How to estimate P(X | C_i) = P(a_1, a_2, a_3, ..., a_m | C_i)?
Assume the features are independent given the class:

    P(a_1, a_2, a_3, ..., a_m | C_i) = P(a_1 | C_i) × P(a_2 | C_i) × ... × P(a_m | C_i) = Π_j P(a_j | C_i)

1  If A_j is categorical, then

    P(a_j | C_i) = (# patterns in C_i with value a_j) / |C_{i,T}|
Classification Technique (Contd..)
Bayesian Classifier

How to Estimate Probabilities from Data?

If A_j is continuous-valued, the attribute is assumed to follow a Gaussian distribution:

    G(x, µ, σ) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )

    P(a_j | C_i) = G(a_j, µ_{C_i}, σ_{C_i})
Classification Technique (Contd..)
Bayesian Classifier

How to Estimate Probabilities from Data?

Table: Training Set (Source: Tan, Kumar and Steinbach, Introduction to Data Mining)

Tid Refund Marital Income Defaulter


Status (Class)
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

    P(a_j | C_i) = (1 / (√(2π) σ)) exp( −(a_j − µ)² / (2σ²) )
Classification Technique (Contd..)
Bayesian Classifier

How to Estimate Probabilities from Data?

P(Income | Class = No):
  1  Sample mean = 110, sample variance = 2975 (standard deviation ≈ 54.54)

    P(Income = 120K | No) = (1 / (√(2π) × 54.54)) exp( −(120 − 110)² / (2 × 2975) ) = 0.0072

X = (Refund = No, Married, Income = 120K). What will be the class label of X?

    P(X | Class = No) = P(Refund = No | Class = No) × P(Married | Class = No) × P(Income = 120 | No)
                      = 4/7 × 4/7 × 0.0072 = 0.0024

    P(X | Class = Yes) = P(Refund = No | Class = Yes) × P(Married | Class = Yes) × P(Income = 120 | Yes) = 0

Since P(X | Class = No) × P(Class = No) > P(X | Class = Yes) × P(Class = Yes), X is assigned to Class = No.
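A sketch of the same Naive Bayes computation on the ten training records above (categorical attributes by frequency counts, Annual Income by a Gaussian):

    import math

    refund  = ["Yes","No","No","Yes","No","No","Yes","No","No","No"]
    marital = ["Single","Married","Single","Married","Divorced","Married",
               "Divorced","Single","Married","Single"]
    income  = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    label   = ["No","No","No","No","Yes","No","No","Yes","No","Yes"]

    def categorical_prob(attr, value, cls):
        idx = [i for i, c in enumerate(label) if c == cls]
        return sum(attr[i] == value for i in idx) / len(idx)

    def gaussian_prob(x, cls):
        vals = [income[i] for i, c in enumerate(label) if c == cls]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)   # sample variance
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    # X = (Refund = No, Married, Income = 120K)
    for cls in ["No", "Yes"]:
        prior = label.count(cls) / len(label)
        likelihood = (categorical_prob(refund, "No", cls)
                      * categorical_prob(marital, "Married", cls)
                      * gaussian_prob(120, cls))
        print(cls, likelihood, likelihood * prior)
    # Class "No": P(X|No) ~ 0.0023 (the slide's 0.0024 uses the rounded 0.0072); class "Yes": 0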
Data Mining

Chapter 4

Artificial Neural Networks

Introduction to Data Mining , 2nd Edition

10/12/2020 1
Artificial Neural Networks (ANN)

 Basic Idea: A complex non-linear function can be


learned as a composition of simple processing units
 ANN is a collection of simple processing units
(nodes) that are connected by directed links (edges)
– Every node receives signals from incoming edges,
performs computations, and transmits signals to
outgoing edges
– Analogous to human brain where nodes are neurons
and signals are electrical impulses
– Weight of an edge determines the strength of
connection between the nodes
– Simplest ANN: Perceptron (single neuron)
10/12/2020 2
Basic Architecture of Perceptron

[Figure: a perceptron. Inputs x_1, x_2, x_3 plus a constant input 1, with weights (w_1, w_2, w_3, b = w_0); the weighted sum is passed through an activation function to produce the output.]

 Learns linear decision boundaries

 Similar to logistic regression (activation function is sign instead of sigmoid)
10/12/2020 3
Perceptron Example

   X1  X2  X3    Y
   1   0   0    -1
   1   0   1     1
   1   1   0     1
   1   1   1     1
   0   0   1    -1
   0   1   0    -1
   0   1   1     1
   0   0   0    -1

[Figure: a black-box model with input nodes X1, X2, X3 and output node Y.]

Output Y is 1 if at least two of the three inputs are equal to 1.

10/12/2020 4
Perceptron Example

   X1  X2  X3    Y
   1   0   0    -1
   1   0   1     1
   1   1   0     1
   1   1   1     1
   0   0   1    -1
   0   1   0    -1
   0   1   1     1
   0   0   0    -1

[Figure: a perceptron with weight 0.3 on each input node and threshold t = 0.4 at the output node.]

    Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4)

    where sign(x) = +1 if x ≥ 0, −1 if x < 0
10/12/2020 5
Perceptron Learning Rule

 Initialize the weights (w0, w1, ..., wd)

 Repeat
   – For each training example (xi, yi)
      Compute the predicted output ŷ_i
      Update the weights:  w_j^{(k+1)} = w_j^{(k)} + λ (y_i − ŷ_i^{(k)}) x_ij

 Until stopping condition is met

 k: iteration number; λ: learning rate

10/12/2020 6
Perceptron Learning Rule

 Weight update formula:  w_j^{(k+1)} = w_j^{(k)} + λ (y_i − ŷ_i^{(k)}) x_ij

 Intuition:
   – Update the weight based on the error e = y − ŷ
   – If y = ŷ, e = 0: no update needed
   – If y > ŷ, e = 2: the weight must be increased so that ŷ will increase
   – If y < ŷ, e = −2: the weight must be decreased so that ŷ will decrease
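A minimal sketch of the perceptron learning rule (assuming the sign activation and the update formula above) on the "at least two inputs equal to 1" data set:

    import numpy as np

    # Truth table: Y = 1 if at least two of X1, X2, X3 are 1, else -1
    X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]], dtype=float)
    y = np.array([-1, 1, 1, 1, -1, -1, 1, -1], dtype=float)

    X = np.hstack([np.ones((len(X), 1)), X])   # prepend a constant input for the bias w0
    w = np.zeros(X.shape[1])
    lam = 0.1                                  # learning rate

    for epoch in range(100):
        for xi, yi in zip(X, y):
            y_hat = 1.0 if xi @ w >= 0 else -1.0
            w = w + lam * (yi - y_hat) * xi    # perceptron weight update
        errors = sum((1.0 if xi @ w >= 0 else -1.0) != yi for xi, yi in zip(X, y))
        if errors == 0:
            break

    print(epoch, w)   # epoch at which all 8 examples are classified correctly, and the learned weights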
10/12/2020 7
Example of Perceptron Learning

  0.1
X1 X2 X3 Y w0 w1 w2 w3 Epoch w0 w1 w2 w3
1 0 0 -1 0 0 0 0 0 0 0 0 0 0
1 0 1 1 1 -0.2 -0.2 0 0 1 -0.2 0 0.2 0.2
2 0 0 0 0.2 2 -0.2 0 0.4 0.2
1 1 0 1
3 0 0 0 0.2
1 1 1 1 3 -0.4 0 0.4 0.2
4 0 0 0 0.2
0 0 1 -1 5 -0.2 0 0 0 4 -0.4 0.2 0.4 0.4
0 1 0 -1 6 -0.2 0 0 0 5 -0.6 0.2 0.4 0.2
0 1 1 1 7 0 0 0.2 0.2 6 -0.6 0.4 0.4 0.2
0 0 0 -1 8 -0.2 0 0.2 0.2
Weight updates over
Weight updates over first epoch all epochs

10/12/2020 8
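A minimal Python sketch of the perceptron learning rule on the three-input example (an illustration assuming the update w ← w + λ(y − ŷ)x with λ = 0.1, zero initial weights, and a constant input of 1 for the bias; the exact weight sequence depends on the order of updates, so it is not claimed to reproduce the table above step by step):

import numpy as np

# Truth table: Y = +1 if at least two of X1, X2, X3 are 1, else -1
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]], dtype=float)
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1], dtype=float)

Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend constant 1 for the bias weight w0
w = np.zeros(Xb.shape[1])                   # (w0, w1, w2, w3) initialised to zero
lam = 0.1                                   # learning rate

for epoch in range(20):
    for xi, yi in zip(Xb, y):
        y_hat = 1.0 if xi @ w >= 0 else -1.0
        w += lam * (yi - y_hat) * xi        # perceptron update rule
    print(epoch, w)

preds = np.where(Xb @ w >= 0, 1, -1)
print("all correct:", np.array_equal(preds, y))

Since the data are linearly separable, the weights stop changing after a few epochs and all eight examples are classified correctly.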
Perceptron Learning

 Since y is a linear
combination of input
variables, decision
boundary is linear

10/12/2020 9
Perceptron Learning

 Since y is a linear
combination of input
variables, decision
boundary is linear

 For nonlinearly separable problems, perceptron


learning algorithm will fail because no linear
hyperplane can separate the data perfectly

10/12/2020 10
Nonlinearly Separable Data

XOR Data

y = x1 ⊕ x2 (XOR)

x1 x2  y
 0  0 -1
 1  0  1
 0  1  1
 1  1 -1

10/12/2020 11
Multi-layer Neural Network

 More than one hidden layer of computing nodes

 Every node in a hidden layer operates on activations from the
preceding layer and transmits activations forward to nodes of the
next layer

 Also referred to as “feedforward neural networks”

Figure: A multi-layer network with an input layer (x1 . . . x5), hidden
layers, and an output layer.

10/12/2020 12
Multi-layer Neural Network

 Multi-layer neural networks with at least one


hidden layer can solve any type of classification
task involving nonlinear decision surfaces
XOR Data

Figure: A multi-layer network for the XOR data: input nodes n1, n2 (for x1, x2),
hidden nodes n3, n4, output node n5, with weights w31, w41, w32, w42, w53, w54.
10/12/2020 13
Why Multiple Hidden Layers?

 Activations at hidden layers can be viewed as features


extracted as functions of inputs
 Every hidden layer represents a level of abstraction
– Complex features are compositions of simpler features

 Number of layers is known as depth of ANN


– Deeper networks express complex hierarchy of features

10/12/2020 14
Multi-Layer Network Architecture


Figure: The activation value at node i of layer l is obtained by applying an
activation function to a linear predictor (a weighted sum of the activations
from the previous layer).

10/12/2020 15
Activation Functions

10/12/2020 16
Learning Multi-layer Neural Network

 Can we apply perceptron learning rule to each


node, including hidden nodes?
– Perceptron learning rule computes error term
e = y - 𝑦 and updates weights accordingly
 Problem: how to determine the true value of y for
hidden nodes?
– Approximate error in hidden nodes by error in
the output nodes
 Problem:
– Not clear how adjustment in the hidden nodes affect overall
error
– No guarantee of convergence to optimal solution

10/12/2020 17
Gradient Descent

 Loss function to measure errors across all training points

Squared loss: E(w) = Σi (yi − ŷi)²

 Gradient descent: update parameters in the direction of
“maximum descent” of the loss function across all points

w ← w − λ ∂E/∂w,   λ: learning rate

 Stochastic gradient descent (SGD): update the weights for every
instance (mini-batch SGD: update over mini-batches of instances)

10/12/2020 18
Computing Gradients
ŷ = aL (the activation at the output layer L)

 Using the chain rule of differentiation (on a single instance), the gradient
of the loss with respect to a weight wij at layer l can be written in terms of
an error term δil at that layer.

 For the sigmoid activation function, the derivative is a(1 − a).

 How can we compute δil for every layer?


10/12/2020 19
Backpropagation Algorithm

 At output layer L:

 At a hidden layer 𝑙 (using chain rule):

– Gradients at layer l can be computed using gradients at layer l + 1


– Start from layer L and “backpropagate” gradients to all previous
layers
 Use gradient descent to update weights at every epoch
 For next epoch, use updated weights to compute loss fn. and its gradient
 Iterate until convergence (loss does not change)
10/12/2020 20
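A compact from-scratch sketch of backpropagation for a network with one hidden layer, sigmoid activations, and squared loss, trained on the XOR data from the earlier slide (the hidden-layer size, learning rate, and number of epochs are illustrative choices, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)

# XOR data (labels mapped from {-1, +1} to {0, 1} for a sigmoid output)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 nodes, one output node
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lam = 0.5  # learning rate

for epoch in range(20000):
    # Forward pass
    a1 = sigmoid(X @ W1 + b1)          # hidden activations
    a2 = sigmoid(a1 @ W2 + b2)         # output activation (y_hat)

    # Backward pass: error term at the output layer, then backpropagate
    d2 = (a2 - y) * a2 * (1 - a2)      # squared loss; sigmoid derivative is a(1 - a)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)   # error terms at the hidden layer

    # Gradient descent updates at every epoch
    W2 -= lam * a1.T @ d2
    b2 -= lam * d2.sum(axis=0, keepdims=True)
    W1 -= lam * X.T @ d1
    b1 -= lam * d1.sum(axis=0, keepdims=True)

print(np.round(a2.ravel(), 2))   # predictions should approach [0, 1, 1, 0]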
Design Issues in ANN

 Number of nodes in input layer


– One input node per binary/continuous attribute
– k or log2 k nodes for each categorical attribute with k
values
 Number of nodes in output layer
– One output for binary class problem
– k or log2 k nodes for k-class problem
 Number of hidden layers and nodes per layer
 Initial weights and biases
 Learning rate, max. number of epochs, mini-batch size for
mini-batch SGD, …

10/12/2020 21
Characteristics of ANN

 Multilayer ANN are universal approximators but could


suffer from overfitting if the network is too large
 Gradient descent may converge to local minimum
 Model building can be very time consuming, but testing
can be very fast
 Can handle redundant and irrelevant attributes because
weights are automatically learnt for all attributes
 Sensitive to noise in training data
 Difficult to handle missing attributes

10/12/2020 22
Deep Learning Trends

 Training deep neural networks (more than 5-10 layers)
has become possible only in recent times with:
– Faster computing resources (GPUs)
– Larger labeled training sets
– Algorithmic improvements in Deep Learning
 Recent Trends:
– Specialized ANN Architectures:
Convolutional Neural Networks (for image data)
Recurrent Neural Networks (for sequence data)

Residual Networks (with skip connections)

– Unsupervised Models: Autoencoders


– Generative Models: Generative Adversarial Networks
10/12/2020 23
Vanishing Gradient Problem

 The sigmoid activation function easily saturates (shows nearly zero
gradient with respect to z) when z is too large or too small
 This leads to small (or zero) gradients of the squared loss with respect to
the weights, especially at hidden layers, resulting in slow (or no) learning

10/12/2020 24
Handling Vanishing Gradient Problem

 Use of Cross-entropy loss function

 Use of Rectified Linear Unit (ReLU) Activations:

10/12/2020 25
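A tiny numeric illustration of the saturation effect (the z values are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
sig_grad = sigmoid(z) * (1 - sigmoid(z))   # nearly 0 for large |z|: the unit saturates
relu_grad = (z > 0).astype(float)          # ReLU'(z) = 1 for z > 0, 0 otherwise
print(np.round(sig_grad, 5))
print(relu_grad)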
Outline
Ensemble of Classifiers
Bagging

Advanced Classification Techniques

August 25, 2017

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Ensemble of Classifiers

Bagging

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Model Ensembles

TWO HEADS ARE BETTER THAN ONE


1 Construct a set of classifiers from adapted versions of the
training data (resampled/reweighted).

2 Combine the predictions of these classifiers in some way, often
by simple averaging or voting.

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Figure: General Idea of Ensemble

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Rationale for Ensemble Method

1 Suppose there are 25 base classifiers.


2 Each classifier has error rate ǫ = 0.35.
3 Assume classifiers are independent.
4 Probability that the ensemble classifier (majority vote, i.e., at least
13 of the 25 classifiers wrong) makes a wrong prediction:

Σ_{i=13}^{25} C(25, i) ǫ^i (1 − ǫ)^(25−i) = 0.06

Advanced Classification Techniques
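The 0.06 figure can be checked with a one-line binomial sum (Python sketch; math.comb requires Python 3.8 or later):

from math import comb

eps, n = 0.35, 25
# The majority vote is wrong when 13 or more of the 25 independent base classifiers err
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 4))   # approximately 0.06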


Outline
Ensemble of Classifiers
Bagging

Bootstrap aggregating (Bagging)

INPUT: Training Set D, a learning algorithm A, T = ensemble


size.
OUTPUT: Ensemble of Classifiers.

for t = 1 to T do
1 Build a bootstrap sample Dt by sampling |D| points with replacement.
2 Run A on Dt to build a classifier Mt
Return {Mt | 1 ≤ t ≤ T }
Comment: The probability that a particular data point is not selected in a
bootstrap sample is (1 − 1/n)^n ≈ e^(−1) ≈ 0.368 for large n.

Advanced Classification Techniques
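A minimal Python sketch of the bagging procedure above, using decision stumps as the base learner A on the toy data of the following example (the stump learner and the random bootstrap samples are illustrative, so the individual rounds will not match the slides exactly):

import numpy as np

rng = np.random.default_rng(1)

# Toy data from the bagging example
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_stump(xs, ys):
    """Pick the threshold/orientation of a one-split stump minimising training error."""
    best = None
    for thr in np.unique(xs):
        for left in (1, -1):
            pred = np.where(xs <= thr, left, -left)
            err = np.mean(pred != ys)
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best[1], best[2]

T = 10
stumps = []
for t in range(T):
    idx = rng.integers(0, len(x), len(x))   # bootstrap sample D_t (with replacement)
    thr, left = fit_stump(x[idx], y[idx])
    stumps.append((thr, left))

votes = sum(np.where(x <= thr, left, -left) for thr, left in stumps)
print(np.sign(votes))                       # majority vote (sign of the vote sum)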


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 5: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1.0 1.0 1.0
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques




Outline
Ensemble of Classifiers
Bagging

Bagging
Bagging Round 6: x ≤ 0.75 : y = −1, y = +1 else.

x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1.0
y 1 -1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 7: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1.0
y 1 -1 -1 -1 -1 1 1 1 1 1

Bagging Round 8: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 9: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1.0 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 10: x ≤ 0.05 : y = −1, y = +1 else.

x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y 1 1 1 1 1 1 1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1

Stump                   Round  x=0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
x ≤ 0.35: +1, −1 else     1      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.65: +1, +1 else     2      1    1    1    1    1    1    1    1    1    1
x ≤ 0.35: +1, −1 else     3      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.3:  +1, −1 else     4      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.35: +1, −1 else     5      1    1    1   -1   -1   -1   -1   -1   -1   -1
x ≤ 0.75: −1, +1 else     6     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.75: −1, +1 else     7     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.75: −1, +1 else     8     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.75: −1, +1 else     9     -1   -1   -1   -1   -1   -1   -1    1    1    1
x ≤ 0.05: −1, +1 else    10      1    1    1    1    1    1    1    1    1    1
SUM                              2    2    2   -6   -6   -6   -6    2    2    2
Class (sign of SUM)              1    1    1   -1   -1   -1   -1    1    1    1
Actual                           1    1    1   -1   -1   -1   -1    1    1    1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Random Forest

Bagging is useful in combination with decision tree models.


Build each tree from a different random subset of features
(subspace sampling).

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Random Forest

INPUT: Training Set D, subspace dimension=d, T = ensemble


size.
OUTPUT: Ensemble of Classifiers.

for t = 1 to T do
1 Build a bootstrap sample Dt by sampling with replacement.
2 select d features randomly and reduce dimensionality of Dt
accordingly.
3 Build a decision tree Mt using Dt .
Return {Mt |1 ≤ t ≤ T }.

Advanced Classification Techniques
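A compact sketch of the Random Forest pseudocode above, using scikit-learn's DecisionTreeClassifier as the base learner (the dataset here is a randomly generated placeholder, and the values of T and d are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder dataset: 200 points, 10 features, binary labels
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

T, d = 25, 4                      # ensemble size and subspace dimension
forest = []                       # list of (feature subset, fitted tree)

for t in range(T):
    idx = rng.integers(0, len(X), len(X))              # bootstrap sample D_t
    feats = rng.choice(X.shape[1], d, replace=False)   # random feature subspace
    tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
    forest.append((feats, tree))

def predict(Xnew):
    votes = np.array([tree.predict(Xnew[:, feats]) for feats, tree in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)      # majority vote

print("training accuracy:", (predict(X) == y).mean())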


Outline
Ensemble of Classifiers
Bagging

Boosting

Boosting assigns a weight to each training example and
adaptively changes the weights at the end of each boosting
round.
How do the weights change?
Half of the total weight is assigned to the misclassified examples
and the other half to the remaining (correctly classified) examples.

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Boosting
Initial weights are uniform and sum to 1.
The total weight currently assigned to the misclassified samples is exactly
the error rate ǫ.
Multiply the weights of misclassified instances by 1/(2ǫ).
Multiply the weights of correctly classified samples by 1/(2(1 − ǫ)).

Table: Confusion Matrix

                         ACTUAL CLASS
                       Class=+   Class=-
PREDICTED  Class=+       24         9
CLASS      Class=-       16        51

ǫ = (9 + 16)/100 = 0.25; weight update factor = 1/(2ǫ) = 2 for misclassified
samples, and 1/(2(1 − ǫ)) = 1/1.5 = 2/3 for correctly classified samples.
Advanced Classification Techniques
Outline
Ensemble of Classifiers
Bagging

Boosting

Confidence/weightage of each boosting model (α):

αt = (1/2) ln( (1 − ǫt) / ǫt )

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

AdaBoost

INPUT: Training Set D; ensemble size T; learning algorithm A
OUTPUT: Weighted ensemble of models
w1i = 1/|D| for all xi ∈ D
for t = 1 to T do
1 Run A on D weighted by wti to produce a model Mt
2 Calculate the weighted error ǫt .
3 if ǫt ≥ 1/2 then T = t − 1; break;
4 αt = (1/2) ln( (1 − ǫt) / ǫt )
5 w(t+1)i = (wti / Zt) × exp(−αt ) for correctly classified examples;
  w(t+1)i = (wti / Zt) × exp(+αt ) for misclassified examples.
  (Zt is a normalization factor so that the weights sum to 1.)
Return M(x) = Σ_{t=1}^{T} αt × Mt (x)

Advanced Classification Techniques
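A minimal AdaBoost sketch following the pseudocode above, with weighted decision stumps on the 1-D toy data of the next example (the stump learner is an implementation choice, and the update is written as exp(−α y ŷ), which combines the two cases of step 5; the rounds are not guaranteed to match the slides exactly):

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_weighted_stump(xs, ys, w):
    """Decision stump minimising the weighted error."""
    best = None
    for thr in np.unique(xs):
        for left in (1, -1):
            pred = np.where(xs <= thr, left, -left)
            err = np.sum(w * (pred != ys))
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best

T = 3
w = np.full(len(x), 1.0 / len(x))        # w_1i = 1/|D|
models, alphas = [], []

for t in range(T):
    eps, thr, left = fit_weighted_stump(x, y, w)
    if eps >= 0.5:
        break
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = np.where(x <= thr, left, -left)
    w = w * np.exp(-alpha * y * pred)    # down-weight correct, up-weight wrong examples
    w = w / w.sum()                      # Z_t normalisation
    models.append((thr, left)); alphas.append(alpha)

score = sum(a * np.where(x <= thr, left, -left) for a, (thr, left) in zip(alphas, models))
print(np.sign(score))                    # weighted vote; should recover the true labels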


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2: x ≤ 0.05 : y = +1, y = +1 else.

x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2: x ≤ 0.05 : y = +1, y = +1 else.

x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3: x ≤ 0.3 : y = +1, y = −1 else.

x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Example

Round Split Point α


1 0.75 1.738
2 0.05 2.7784
3 0.3 4.1195

Round x=1 2 3 4 5 6 7 8 9 10
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Sign 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques
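The final combination in the table above can be verified with a few lines (the stump orientations and α values are read off the earlier slides):

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# (threshold, label for x <= threshold, label otherwise, alpha) for the three rounds
stumps = [(0.75, -1, +1, 1.738),
          (0.05, +1, +1, 2.7784),
          (0.30, +1, -1, 4.1195)]

score = sum(a * np.where(x <= thr, left, right) for thr, left, right, a in stumps)
print(np.round(score, 2))   # about [5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.4 0.4 0.4]
print(np.sign(score))       # matches the Sign row of the table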


Outline
Ensemble of Classifiers
Bagging

THANK YOU

Advanced Classification Techniques


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Introduction to Clustering Methods

Dr. Bidyut Kr. Patra

Assistant Professor, National Institute of Technology Rourkela


Rourkela, Orissa

October 24, 2016

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Introduction to Classification
Introduction to Clustering
Similarity Measures
Category of clustering methods
Partitional Clustering
Soft Clustering
Hierarchical Clustering
Density Based Clustering Method
Clustering Methods for Large Datasets
Major Approaches to Clustering Large Datasets
Hybrid Clustering Method
Data Summarization
BIRCH
Application of Clustering Methods to Image Processing
Conclusions and Research Directions
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Classification
Definition
The task of classification is to assign an object to one of the
predefined categories.

Working Principle

A set of objects along with their categories (the Training Set) is
provided.
A classifier captures the relationship between objects and their
categories (i.e., finds a suitable model).
The model assigns a class label (category) to an unknown object.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Examples of Classification

Predicting tumor cells as non-malignant or malignant.

Classifying e-mails as spam or genuine.
Classifying credit card transactions as legitimate or fraudulent.
Categorizing news stories as finance, weather, entertainment,
sports, etc.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawback

Non-availability of a proper training set: it is difficult to obtain
correctly labeled data.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Introduction to Clustering
Cluster Analysis
Cluster analysis is to discover the natural grouping(s) of a set of
patterns, points, or objects [A. K. Jain. Data clustering: 50 years
beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010].

Operational definition of clustering:


Cluster Analysis
Clustering activity or cluster analysis is to find group(s) of patterns,
called cluster (s) in a dataset in such a way that patterns in a cluster
are more similar to each other than patterns in distinct clusters.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Set Theoretic Definition

Definition
Let D = {x1 , x2 , . . . , xn } be a set of patterns called the dataset, where
each xi is a pattern of N dimensions. A clustering π of D can be
defined as follows.
π = {C1 , C2 , . . . , Ck }, such that
  ∪_{i=1}^{k} Ci = D
  Ci ≠ ∅, i = 1..k
  Ci ∩ Cj = ∅, i ≠ j, i, j = 1..k
  sim(x1 , x2 ) > sim(x1 , y1 ), where x1 , x2 ∈ Ci and y1 ∈ Cj , i ≠ j

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Application Domains:
1 Biology: Clustering has been applied to genomic data to
group functionally similar genes.
2 Information Retrieval: Search results obtained by various
search engines such as Google, Yahoo can be clustered so that
related documents appear together in a cluster.
(Vivisimo search engine (http://vivisimo.com/) groups related
documents.)
3 Market Research: Entities (people, market, organizations) can
be clustered based on common features or characteristics.
4 Geological mapping, Bio-informatics, Climate, Web mining,
Image Processing

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Similarity
Similarity between a pair of patterns x and y in D is a
mapping sim(x, y) : D × D → [0, 1].
The closer the value of sim(·) is to 1, the higher the similarity; it
equals 1 when x = y.

Simple Matching Coefficient (SMC): Let x and y be two N-dimensional
binary vectors, i.e. x, y ∈ {0, 1}^N .

SMC(x, y) = ( Σ_{i=1}^{N} t ) / N,   where t = 1 if xi = yi and t = 0 otherwise,

and xi and yi are the i-th feature values of x and y, respectively.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Similarity(Contd..)
Jaccard Coefficient (J):

J = a11 / (N − a00),                                          (1)

where a11 = Σ_{i=1}^{N} t with t = 1 if xi = yi = 1 and t = 0 otherwise,
and   a00 = Σ_{i=1}^{N} t with t = 1 if xi = yi = 0 and t = 0 otherwise.

Let x = (1, 0, 0, 1, 1) and y = (0, 1, 0, 1, 0).
SMC = 2/5;   Jaccard Coefficient = 1/4

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Similarity(Contd..)
Cosine Similarity: Let x and y be two document vectors. The similarity
between x and y is expressed as the cosine of the angle between them.

cosine(x, y) = (x • y) / (||x|| ||y||)

where • is the dot product and ||·|| is the L2-norm.

Let x = (3, 0, 2, 0, 0, 1) and y = (2, 1, 3, 0, 1, 0) be two document vectors.

cosine(x, y) = (6 + 0 + 6 + 0 + 0 + 0) / ( √(3² + 0² + 2² + 0² + 0² + 1²) × √(2² + 1² + 3² + 0² + 1² + 0²) )
             = 12 / ( √14 × √15 ) ≈ 0.828

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
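A small Python sketch computing SMC, Jaccard, and cosine similarity for the vectors used in the examples above (illustrative helper functions, numpy only):

import numpy as np

def smc(x, y):
    return np.mean(x == y)                       # matching positions / N

def jaccard(x, y):
    a11 = np.sum((x == 1) & (y == 1))
    a00 = np.sum((x == 0) & (y == 0))
    return a11 / (len(x) - a00)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

xb, yb = np.array([1, 0, 0, 1, 1]), np.array([0, 1, 0, 1, 0])
print(smc(xb, yb), jaccard(xb, yb))              # 0.4 and 0.25

xd, yd = np.array([3, 0, 2, 0, 0, 1]), np.array([2, 1, 3, 0, 1, 0])
print(round(cosine(xd, yd), 3))                  # about 0.828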


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Dissimilarity
Many clustering methods use dissimilarity measures to find clusters
instead of similarity measure.

The Euclidean distance between a pair of N-dimensional points (patterns) can
be expressed as follows:

d(x, y) = √( Σ_{i=1}^{N} (xi − yi)² )

A generalization of the Euclidean distance is known as the Minkowski distance:

Lp = ( Σ_{i=1}^{N} |xi − yi|^p )^(1/p)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Metric Space

Definition
M = (D, d) is said to be metric space if d is a metric on D, i.e.,
d : D × D → R≥0 , which satisfies following conditions.
For three patterns x, y , z ∈ D,
Non-negativity: d(x, y ) ≥ 0
Reflexivity: d(x, y ) = 0, if x = y
Symmetry: d(x, y ) = d(y , x)
Triangle inequality: d(x, y ) + d(y , z) ≥ d(x, z)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Classification of Clustering Methods

Partitional Clustering: Partitional clustering creates a single clustering of
a given dataset. Let C1 and C2 be two clusters in a clustering. Then
(i) C1 ⊄ C2 and C2 ⊄ C1, (ii) C1 ∩ C2 = ∅.

Fuzzy and rough clustering approaches violate constraint (ii): C1 ∩ C2 ≠ ∅.

Hierarchical Clustering method: A hierarchical clustering method creates a
sequence of partitional clusterings of a dataset.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DATASET Clustering

Figure: Partitional Clustering

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DATASET

Figure: Hierarchical Clustering

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Partitional Clustering Methods


Basic Sequential Algorithmic Scheme (BSAS): BSAS needs
two user-specified parameters: a distance threshold (τ) and the
maximum number of clusters (k).

i = 1, Ci = {x1 }
For each x ∈ D \ {x1 }:
1 Find the nearest existing cluster Cmin such that
d(x, Cmin ) = min_{j=1..i} d(x, Cj )
2 if d(x, Cmin ) > τ and i < k, then
i = i + 1; Ci = {x}
ELSE Cmin = Cmin ∪ {x}
3 Repeat Step 1 and Step 2 until all patterns are assigned to
clusters.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Advantages:
1 Single dataset scan method.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Advantages:
1 Single dataset scan method.
2 Time Complexity= O(kn)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Advantages:
1 Single dataset scan method.
2 Time Complexity= O(kn)

Disadvantages:
1 Number of clusters is to be provided.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Leaders clustering method (Hartigan, 1975)

INPUT: D, τ
1 L ← {x1 }
2 For each pattern x ∈ D \ {x1 }, if there is a leader l ∈ L such that
||l − x|| ≤ τ, then x is assigned to the cluster represented by l.
If there is no such leader, then x becomes a new leader and is
added to L.
3 Output the leaders set L.
Advantages:
Single-scan method
Time complexity O(mn), where m = |L|

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
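A minimal sketch of the leaders method (an illustration that assigns each point to the nearest leader within τ, which satisfies the "some leader within τ" condition above; the threshold and the toy points are arbitrary):

import numpy as np

def leaders(D, tau):
    """Single-scan leaders clustering: returns leaders and the cluster of each point."""
    L = [D[0]]                      # the first pattern becomes the first leader
    labels = [0]
    for x in D[1:]:
        dists = [np.linalg.norm(x - l) for l in L]
        j = int(np.argmin(dists))
        if dists[j] <= tau:         # follow the nearest leader within tau
            labels.append(j)
        else:                       # otherwise x becomes a new leader
            L.append(x)
            labels.append(len(L) - 1)
    return np.array(L), np.array(labels)

D = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])
leaders_set, labels = leaders(D, tau=1.0)
print(leaders_set)
print(labels)      # e.g. [0 0 1 1 2]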


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Figure: Leaders find semi-spherical clusters.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawbacks
Clustering results are order dependent.
The distance between followers of different clusters (leaders) may
be less than the distance between the corresponding leaders.

Figure: Distance between followers (x, y) of leaders lx and ly is less
than τ

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

The k-means Clustering Method (MacQueen, 1967)

The k-means method optimizes the Sum of Squared Error (SSE),
defined as follows. Let C1 , . . . , Ck be k clusters of the
dataset. Then,

SSE = Σ_{j=1}^{k} Σ_{xi ∈ Cj} ||xi − x̄j||² ,   where x̄j = (1/|Cj|) Σ_{xi ∈ Cj} xi .

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

k-means Clustering(MacQueen,1967)

Select k points as initial centroids.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

k-means Clustering(MacQueen,1967)

Select k points as initial centroids.


repeat
1 Form k clusters by assigning each point to its closest centroid.
2 Recompute the centroid of each cluster.
until Centroids do not change.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}
mC1 = 4.75, mC2 = 19.6

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}
mC1 = 4.75, mC2 = 19.6
C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 30, 25}
mC1 = 7, mC2 = 25

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Example

D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2


initial centroids 2 and 4
C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25} using L2 distance.
mC1 = 2.5, mC2 = 16
Next Iteration:
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}
mC1 = 3.0, mC2 = 18
C1 = {2, 3, 4, 10}, C2 = {12, 20, 30, 11, 25}
mC1 = 4.75, mC2 = 19.6
C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 30, 25}
mC1 = 7, mC2 = 25
C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 30, 25} (no change: the centroids are stable, so the algorithm terminates)
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
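A minimal 1-D k-means sketch on the dataset of the example, starting from the same initial centroids 2 and 4 (ties are broken by numpy's argmin, so intermediate assignments could in principle differ from the trace above):

import numpy as np

D = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
centroids = np.array([2.0, 4.0])            # initial centroids from the slide

for it in range(20):
    # Assign each point to its closest centroid (L2 distance)
    labels = np.argmin(np.abs(D[:, None] - centroids[None, :]), axis=1)
    # Recompute each centroid as its cluster mean
    new_centroids = np.array([D[labels == j].mean() for j in range(len(centroids))])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(sorted(D[labels == 0]), sorted(D[labels == 1]))  # {2,3,4,10,11,12} and {20,25,30}
print(centroids)                                       # [7.0, 25.0]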
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Time and Space Complexity


Time=O(I ∗ k ∗ n) , Space=O(k + n)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawbacks

1 It cannot detect outliers.


2 It can find only convexed shaped clusters.
3 It is applicable to only numeric dataset.
4 With different initial points, it produces different clustering
results.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DATASET CLUSTERING by k−means

Figure: Result produced by k-means clustering method

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Bisecting k-means(Steinbach, 2000)

Algorithm
1 Initialize the list of clusters L = {C1 }, where C1 = D.
2 repeat
1 Remove a cluster from the list of clusters.
2 Bisect the selected cluster using k-means clustering method for
a number of times.
3 Select two clusters from a bisection with lowest total SSE.
4 Add these two clusters to the list of clusters L.
3 until number of clusters in the list L is k.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

K-means as an Optimization Problem

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

K-means as an Optimization Problem

Minimize SSE = Σ_{i=1}^{k} Σ_{x ∈ Ci} (ci − x)² ,   where ck = (1/|Ck|) Σ_{x ∈ Ck} x

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

K-means as an Optimization Problem

Minimize SSE = Σ_{i=1}^{k} Σ_{x ∈ Ci} (ci − x)² ,   where ck = (1/|Ck|) Σ_{x ∈ Ck} x   (the cluster mean)

Minimize SAE = Σ_{i=1}^{k} Σ_{x ∈ Ci} | ci − x | ,  where ck = the median of the objects in the cluster

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Triangle Inequality to Accelerate k-means (Elkan, 2003)

Let x be a point and let b and c be centers. If


d(b, c) ≥ 2d(x, b) then d(x, c) ≥ d(x, b).

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Soft Clustering

In many application domains clusters do not have crisp


(exact) boundary.
Fuzzy Set Theory [Zadeh, 1965], Rough Set Theory
[Pawlak, 1982] and hybridizations of both approaches.

Fuzzy c-Means Method (FCM) [Dunn, 1973]

Rough k-means (Lingras and West (2004))

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of given dataset D.
It can produce inherent nested structures (hierarchical) of
clusters in a data.
Hierarchical Clustering obtained in two ways
1 Divisive (Top-down) approach:

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of given dataset D.
It can produce inherent nested structures (hierarchical) of
clusters in a data.
Hierarchical Clustering obtained in two ways
1 Divisive (Top-down) approach:
Start with one cluster containing all points.
At each step, split a cluster until each cluster contains a point.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of given dataset D.
It can produce inherent nested structures (hierarchical) of
clusters in a data.
Hierarchical Clustering obtained in two ways
1 Divisive (Top-down) approach:
Start with one cluster containing all points.
At each step, split a cluster until each cluster contains a point.
2 Agglomerative (Bottom-up) approach :

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Hierarchical Clustering
Hierarchical clustering methods create a sequence of
clusterings π1 , π2 , π3 , . . . πi . . . πp of a given dataset D.
They can produce the inherent nested (hierarchical) structure of
clusters in a dataset.
A hierarchical clustering can be obtained in two ways:
1 Divisive (Top-down) approach:
Start with one cluster containing all points.
At each step, split a cluster until each cluster contains a single point.
2 Agglomerative (Bottom-up) approach:
Start with the points as individual clusters.
Merge the closest pair of clusters until only one cluster remains.
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

π5 = {{p1, p5, p2, p3, p4}}

π4 = {{p1, p5}, {p2, p3, p4}}


π3 = {{p1}, {p2, p3, p4}, {p5}}
π2 = {{p1}, {p2, p3}, {p4}, {p5}}

p1 p2 p3 p4 p5 π1 = {{p1}, {p2}, {p3}, {p4 }, {p5}}

Figure: Dendrogram produced by a hierarchical clustering method

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Agglomerative (Bottom-up) approach

Compute distance matrix for the dataset.


Let each data point be a cluster
repeat
1 Merge the two closest clusters.
2 Update the distance matrix.
until only a single cluster remains

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Popular Agglomerative Methods

Single-link:   Dis(C1 , C2 ) = min{ ||xi − xj || | xi ∈ C1 , xj ∈ C2 }
Complete-link: Dis(C1 , C2 ) = max{ ||xi − xj || | xi ∈ C1 , xj ∈ C2 }
Average-link:  Dis(C1 , C2 ) = (1 / (|C1| × |C2|)) Σ_{xi ∈ C1} Σ_{xj ∈ C2} ||xi − xj ||

(a) Single-link (b) Complete-link (c) Average-link

Figure: Distance between a pair of clusters in three hierarchical clustering
methods.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
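A short sketch comparing single, complete, and average link using scipy's hierarchical clustering routines (the two-blob dataset is illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose blobs as a toy dataset
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(5, 0.5, size=(10, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # builds the full merge sequence (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)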


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Complexity

Space Complexity : O(n2 )


Time Complexity: O(n2 )

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Single-link Clustering Method

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Dendrogram produced by single-link method

Figure: Dendrogram for single-link

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Distance Updation

Let C = Cx ∪ Cy be the new cluster formed by merging clusters Cx and Cy .
Let Co be another cluster.

Single-link: d(Co , (Cx , Cy )) = min{ d(Co , Cx ), d(Co , Cy ) }

Lance and Williams (1967) generalize the distance updation:

d(Co , (Cx , Cy )) = αi × d(Co , Cx ) + αj × d(Co , Cy )
                   + β × d(Cx , Cy ) + γ × |d(Co , Cx ) − d(Co , Cy )|

where d(·, ·) is a distance function and the values of αi , αj , β and γ (∈ R) depend on the method used.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Table: Lance-Williams parameters for different hierarchical methods

Method          αi                 αj                 β                            γ
Single-link     1/2                1/2                0                            -1/2
Complete-link   1/2                1/2                0                            1/2
Average-link    mCx/(mCx + mCy)    mCy/(mCx + mCy)    0                            0
Centroid        mCx/(mCx + mCy)    mCy/(mCx + mCy)    −(mCx · mCy)/(mCx + mCy)²    0

(mCx and mCy denote the number of points in Cx and Cy , respectively.)

Complete Link:

d(Co , (Cx , Cy )) = (1/2) × d(Co , Cx ) + (1/2) × d(Co , Cy ) + 0 × d(Cx , Cy ) + (1/2) × |d(Co , Cx ) − d(Co , Cy )|
                   = max{ d(Co , Cx ), d(Co , Cy ) }

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Density Based Clustering Method

Density based clustering method views clusters as dense


regions in the feature space which are separated by relatively
less dense regions.
Density based clustering approach optimizes local criteria,
which is based on density distribution of the dataset.
DBSCAN (Density Based Spatial Clustering of Applications
with Noise) [Ester et al. (1996)] is a very popular density
based partitional clustering method.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN (M. Ester et al., SIG-KDD 1996)

DBSCAN is a density based clustering method, which can find
arbitrarily shaped clusters.
It classifies each point into one of three categories:
1 Core Point: If a hyper-sphere of radius ǫ centered at point x contains
more than MinPts data points, then x is called a core point.
2 Border Point: The point x is called a border point if the
hyper-sphere has fewer than MinPts points but there is a nearby
core point CP (||x − CP|| < ǫ).
3 Noisy Point: The hyper-sphere has fewer than MinPts points and
there is no nearby core point.
DBSCAN starts with a core point and expands it by
recursively merging nearby core points.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN(contd..)

Figure: Classification of points into core, border, and noise points (MinPts = 4)

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN(Contd..)

Directly density-reachable: A pattern y is directly
density-reachable from a pattern x if ||x − y || ≤ ǫ and x is a
core point.
Density reachable: A pattern y is density reachable from
another pattern x if there is a sequence of patterns
x1 , x2 , . . . , xn with x1 = x, xn = y such that xi+1 is directly
density reachable from xi , i = 1..(n − 1).
Density connected: A pattern x is density connected to y if
there is a pattern xc such that both x and y are density
reachable from xc .

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Figure: Cluster produced by DBSCAN (MinPts = 3): points x and y are density
connected through xc ; point x is density reachable from xc .

A cluster C ⊂ D holds following properties.


∀x, y ∈ C , x and y are density connected.
If x ∈ C is a core point, all density-reachable points from x
are included in C .
Dr. Bidyut Kr. Patra Introduction to Clustering Methods
Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN

1 Let x be a pattern in the dataset.

1 If x is a core point (CP), then explore its ǫ-neighbors in a
recursive manner and mark the points as “seen”. They form a
cluster.
(At any stage, if any of the neighbors is a border point, there is no
need to explore that point, but it is included in the cluster.)
2 If x is neither a CP nor part of any cluster, then mark x as “seen”
and temporarily mark it as a Noisy Point.
2 Repeat Step 1 until all points are marked as “seen”.
3 If a temporarily marked “Noisy Point” does not belong to any
cluster, then the point is declared a final Noisy Point.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
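A brief sketch using scikit-learn's DBSCAN on two dense blobs plus an isolated point (the values of eps and min_samples are illustrative, not prescriptive):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),    # dense blob 1
               rng.normal(5, 0.3, size=(30, 2)),    # dense blob 2
               [[2.5, 2.5]]])                       # an isolated (noisy) point

model = DBSCAN(eps=1.0, min_samples=4).fit(X)       # eps plays the role of the radius, min_samples of MinPts
print(set(model.labels_))                           # cluster ids; -1 marks noise points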


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Selection of DBSCAN Parameters

Compute the k-dist (distance to the k-th nearest neighbor) for each
data point, for some value of k.

Sort the points in ascending order of k-dist and plot the sorted values.

We expect a sharp change (knee) in the plot; the k-dist value at that
point corresponds to a suitable value of ǫ (with MinPts = k).

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Drawbacks

Figure: Dataset (left) and the clusters obtained by the DBSCAN method
(right): Cluster1, Cluster2, Cluster3, and Noise.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

DBSCAN fails to detect clusters of arbitrary shapes with


highly variable density.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Neighborhood Based Clustering (NBC)

S. Zhou, Y. Zhao, J. Guan, and J. Z. Huang.


A Neighborhood-Based Clustering Algorithm.
Appeared in Pacific-Asia Conference on Knowledge Discovery and
Data Mining, (PAKDD 2005)
Neighborhood Based Clustering (NBC) discovers clusters
based on the neighborhood characteristics of data.
NBC is effective in discovering clusters of arbitrary shape and
different densities.
NBC needs fewer input parameters than the DBSCAN
clustering method.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

NBC

The core concept of NBC is the Neighborhood Density Factor (NDF):

NDF(x) = |R-KNN(x)| / |KNN(x)|
       = (number of reverse K nearest neighbors of x) / (number of K nearest neighbors of x)

Reverse K-Nearest Neighbors set of x (R-KNN): the set of objects
whose KNN contains x.

R-KNN(x) = { p ∈ D | x ∈ KNN(p) }

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

It classifies a point x into any of the three categories:


1 Core Point: If NDF (x) > 1

2 Even Point: If NDF (x) = 1

3 Noisy Point: If NDF (x) < 1

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Definition (Neighborhood-based directly density reachable)


Let x, y be two points in D and NNK (x) be the KNN of x. The
point y is directly reachable from x if
x is a CP or EP and y ∈ NNK (x).

Definition ( Neighborhood-based density reachable )

Let x, y be two points in D and NNK (x) be the KNN of x. The
point y is reachable from x if there is a chain of patterns
p1 = x, p2 , . . . , pn = y such that pi+1 is directly reachable from
pi , for i = 1..(n − 1).

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Definition (Neighborhood-based density connected)


Let x, y be two points in D and NNK (x) be the KNN of x. The
points x, y are neighborhood connected if any of the following
conditions holds.
y is reachable from x, or vice-versa.
x and y are reachable from any other pattern in the dataset.

Definition (Neighborhood-based Cluster)


Let D be a dataset. C ⊆ D is a cluster such that
If p, q ∈ C , then p and q are neighborhood connected.
If p ∈ C and q ∈ D are neighborhood connected, then q also
belongs to cluster C .

Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Neighborhood based Clustering (NBC) method


1 Calculate NDF for all patterns of the dataset.
2 Let x be a pattern in the dataset.
1 If x is a cluster point (CP/EP), then explore its neighbors in a
recursive manner and mark the points as “seen”. They form a
cluster.
(At any stage, if any of the neighbors is not a CP/EP, there is no
need to explore that point, but it is included in the cluster.)
2 If x is not a CP/EP, then mark x as “seen” and temporarily mark it
as a Noisy Point.
3 Repeat Step 2 until all points are marked as “seen”.
4 If a temporarily marked “Noisy Point” does not belong to any
cluster, then the point is declared a final Noisy Point.

Dr. Bidyut Kr. Patra Introduction to Clustering Methods
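A small sketch that computes NDF(x) for every point with brute-force k-NN and reverse k-NN using numpy (k and the toy points are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(15, 2)),
               rng.normal(4, 1.5, size=(15, 2))])   # one dense and one sparse group
k = 5

# Pairwise distances, brute force
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)

# k nearest neighbours of every point (indices)
knn = np.argsort(dist, axis=1)[:, :k]

# Reverse-kNN count: how many points list i among their k nearest neighbours
rknn_count = np.zeros(len(X), dtype=int)
for i in range(len(X)):
    rknn_count[knn[i]] += 1

ndf = rknn_count / k      # NDF(x) = |R-KNN(x)| / |KNN(x)|
print(np.round(ndf, 2))   # >1: core, =1: even, <1: noisy (by the NBC definitions)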


Outline
Introduction to Classification
Introduction to Clustering Similarity Measures
Clustering Methods for Large Datasets Category of clustering methods
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Figure: Clusters obtained by the NBC method (Cluster1-Cluster5 and noise points on a 2-D dataset, both axes ranging from 0 to 100).

Major approaches to clustering large datasets

Many clustering approaches have been proposed to handle large-scale data.
These approaches are mainly classified into the following groups [1]:

1 Sampling-Based Approach

2 Hybrid Clustering

3 Data Summarization

4 Nearest Neighbor Search

[1] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

Hybrid Method

A partitional clustering method is combined with a hierarchical clustering method.

The first hybrid method was proposed by Murthy and Krishna in 1980. It combines k-means clustering with the single-link method.

Many clustering methods (Lin and Chen (2005), Liu et al. (2009), Chaoji et al. (2009)) have been developed along this line.


A typical hybrid scheme (a sketch in code follows):

Divide the dataset into a number of sub-clusters using the k-means clustering method.
Compute the similarity/dissimilarity between each pair of sub-clusters.
Apply an agglomerative method to these sub-clusters to obtain the final clustering.
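A minimal sketch of this scheme with scikit-learn; the number of sub-clusters (50), the use of centroid distances as the sub-cluster dissimilarity, and single-link merging are illustrative assumptions.

from sklearn.cluster import KMeans, AgglomerativeClustering

def hybrid_clustering(X, n_subclusters=50, n_clusters=5):
    # Step 1: partition the data into many small sub-clusters with k-means
    km = KMeans(n_clusters=n_subclusters, n_init=10, random_state=0).fit(X)
    # Steps 2-3: merge the sub-clusters with an agglomerative (single-link) method;
    # here the pairwise centroid distances act as the sub-cluster dissimilarity
    agg = AgglomerativeClustering(n_clusters=n_clusters, linkage="single")
    agg.fit(km.cluster_centers_)
    # each point inherits the final cluster of its k-means sub-cluster
    return agg.labels_[km.labels_]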


Vijaya et al. (2006) used a hybrid clustering method for protein sequence classification:

1 Leaders clustering is applied to the whole dataset to obtain a set of leaders.
2 Single-link/complete-link clustering is applied to the leaders to obtain k clusters.
3 The median of each cluster is selected as the representative of the cluster.


Data Summarization

These approaches create a summary of a large dataset.

The summary is then used intelligently to scale up an expensive clustering method.


BIRCH (Zhang et al., 1996)

BIRCH is designed for clustering large datasets.
It can work with limited main memory.
It is an incremental method.

Two-Phase Method (a usage sketch follows)
1 Create a summary of the given dataset in the form of a CF tree.
2 Apply a conventional hierarchical clustering method to the summary.
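A hedged usage sketch: scikit-learn's Birch follows the same two-phase idea; the threshold (the leaf diameter bound T), branching_factor (B), the toy data and n_clusters are illustrative values.

import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(500, 2))            # toy data
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)     # phase 1 builds the CF tree incrementally,
                                  # phase 2 clusters the CF-tree leaf entries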



CF tree

A CF-tree is a height-balanced tree with branching factor B.

Each internal node of the CF tree has at most B entries of the form [CFi, childi], i = 1, . . . , B, where CFi is the clustering feature (CF) of the sub-cluster represented by childi.

A leaf node has at most Lmax entries of the form [CFi], i = 1, . . . , Lmax.

The diameter of each sub-cluster in a leaf node must be less than a threshold T.


Figure: CF-tree


Clustering Features (CF)

A CF is a triplet that contains summarized information about a sub-cluster.

Let C1 = {X1, X2, . . . , Xk} be a sub-cluster. The CF of C1 is

    CF = (k, LS, ss),

where LS is the linear sum of the patterns in C1, i.e., LS = Σi Xi, and ss is the square sum of the data points, i.e., ss = Σi Xi².

CF values satisfy the additive property: merging two disjoint sub-clusters gives CF3 = CF1 + CF2 (component-wise).
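A small illustration of the CF triplet and its additive property, assuming CF is stored as a (k, LS, ss) tuple; the toy sub-clusters are arbitrary.

import numpy as np

def cf(points):
    P = np.asarray(points, dtype=float)
    return (len(P), P.sum(axis=0), (P ** 2).sum())     # (k, LS, ss)

def cf_add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])     # component-wise addition

A = [[1.0, 2.0], [2.0, 3.0]]
B = [[4.0, 0.0]]
cf3, cf_union = cf_add(cf(A), cf(B)), cf(A + B)
assert cf3[0] == cf_union[0]
assert np.allclose(cf3[1], cf_union[1]) and np.isclose(cf3[2], cf_union[2])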


Application of Clustering Methods to Image Segmentation

Image segmentation is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest.

Common approaches are:
1 region-based,
2 edge-based,
3 cluster-based.

In cluster-based segmentation, the idea is to define a feature vector at every image location (pixel), composed of both functions of the image intensity and functions of the pixel location itself (a sketch follows).
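A minimal sketch of cluster-based segmentation along these lines, assuming a greyscale image stored as a 2-D NumPy array, scikit-learn's k-means, and an arbitrary weight on the spatial coordinates.

import numpy as np
from sklearn.cluster import KMeans

def segment(img, n_segments=4, spatial_weight=0.5):
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # feature vector at each pixel: (intensity, weighted row, weighted column)
    feats = np.stack([img.ravel().astype(float),
                      spatial_weight * ys.ravel(),
                      spatial_weight * xs.ravel()], axis=1)
    labels = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit_predict(feats)
    return labels.reshape(h, w)          # a segment label for every pixel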


Image Segmentation via Clustering

Hoffman and Jain (1987) employ squared-error clustering in a six-dimensional feature space extracted from a range image. The technique was enhanced by Flynn and Jain in 1991.

An enhancement of the k-means algorithm called CLUSTER is used to obtain segment labels for each pixel.
Each pixel in the range image is assigned the segment label of the nearest cluster center.
The segments are then refined.
Dr. Bidyut Kr. Patra Introduction to Clustering Methods


Outline
Introduction to Classification
Introduction to Clustering
Clustering Methods for Large Datasets
Application of Clustering Methods to Image Processing
Conclusions and Research Directions

Image Segmentation via Clustering

Nguyen and Cohen [1993] used fuzzy c-means clustering for segmenting textured images.

The k-means algorithm was applied to segmenting LANDSAT imagery by Solberg et al. [1996].


Sriparna Saha and Sanghamitra Bandyopadhyay [2008] proposed a new symmetry-based clustering method for segmenting satellite images.

Sriparna Saha and Sanghamitra Bandyopadhyay [2007] proposed MRI brain image segmentation using a fuzzy clustering approach.

Automatic MR brain image segmentation using a multiseed-based multiobjective clustering approach was proposed by Saha and Bandyopadhyay in 2011.


Conclusions

1 Many data clustering methods are discussed.

2 Various versions of the k-means clustering method are applied for segmenting images.

3 Density-based clustering methods can be explored for finding disjoint regions with different densities.

4 Rough-set-theory-based clustering methods can be useful for finding overlapping segments in a given image.


References

1 S. Theodoridis and K. Koutroumbas. Pattern Recognition, 3rd ed. Academic Press, Inc., Orlando, 2006.
2 A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999.
3 Z. Pawlak. Rough sets. International Journal of Computer and Information Sciences, 11(5):341–356, 1982.
4 P. A. Vijaya, M. N. Murty, and D. K. Subramanian. Efficient bottom-up hybrid hierarchical clustering techniques for protein sequence classification. Pattern Recognition, 39(12):2344–2355, 2006.


Questions
THANK YOU



Linear SVM

Margin

Assuming linearly separable data, the SVM tries to find the best separating hyperplane.

© P. Viswanath, Indian Institute of Technology-Guwahati, India



Linear SVM

We would like to draw two parallel hyperplanes such that

    w · xi + b ≥ +1   if yi = +1
    w · xi + b ≤ −1   if yi = −1

or, equivalently, for all i,

    yi (w · xi + b) ≥ 1.

The two parallel hyperplanes are

    w · x + b = +1   and   w · x + b = −1.




Linear SVM

The distance between the origin and the hyperplane w · x + b = +1 is |1 − b| / ||w||.
The distance between the origin and the hyperplane w · x + b = −1 is |1 + b| / ||w||.

Then, the margin is 2 / ||w||.

Then the problem is:

    Minimize (objective function)   (1/2) ||w||²
    Subject to the constraints      yi (w · xi + b) ≥ 1   for all i.

Note that the objective is convex and the constraints are linear, so the Lagrangian method can be applied.


Constrained Optimization Problem

Minimize f(w) subject to the constraints gi(w) ≥ 0, i = 1, . . . , n.

The Lagrangian is

    L(w, α) = f(w) − Σi αi gi(w),

where w is called the primal variable and α = (α1, . . . , αn) are the Lagrange multipliers, which are also called the dual variables.

L has to be minimized with respect to the primal variables and maximized with respect to the dual variables.


Constrained Optimization Problem

The K.K.T. (Karush-Kuhn-Tucker) conditions, which are "necessary" at the optimum, are:

1. ∂L/∂w = 0
2. αi ≥ 0 for all i
3. αi gi(w) = 0 for all i

If f is convex and gi is linear for all i, then it turns out that the K.K.T. conditions are "necessary and sufficient" for the optimal w.


Convex Function

A real-valued function f defined on an interval is said to be convex if

    f(λa + (1 − λ)b) ≤ λ f(a) + (1 − λ) f(b)

for all a, b in the interval and 0 ≤ λ ≤ 1.

[Figure: a convex function; the chord between the points a and b lies above the function.]

This definition can be extended to functions on higher-dimensional spaces.
Lagrangian

Minimize (objective function)   (1/2) ||w||²
Subject to the constraints      yi (w · xi + b) − 1 ≥ 0   for all i.

The Lagrangian is

    L(w, b, α) = (1/2) ||w||² − Σi αi [ yi (w · xi + b) − 1 ].

Here f corresponds to (1/2) ||w||² and gi corresponds to yi (w · xi + b) − 1.


K.K.T. Conditions

    ∂L/∂w = 0  ⇒  w = Σi αi yi xi                              (1)

    ∂L/∂b = 0  ⇒  Σi αi yi = 0                                 (2)

    αi ≥ 0                            for i = 1 to n           (3)

    αi [ yi (w · xi + b) − 1 ] = 0    for i = 1 to n           (4)

Solve these equations to get w, b and α.

While it is possible to do this, it is tedious!


Wolfe Dual Formulation

Other, easier and more advantageous, ways to solve the optimization problem do exist, and they can be easily extended to non-linear SVMs.

The idea is to obtain a dual objective W(α) in which w and b are eliminated and which involves only α.

We know that L has to be maximized with respect to the dual variables α.


Wolfe Dual Formulation

The Lagrangian is:

    L(w, b, α) = (1/2) ||w||² − Σi αi [ yi (w · xi + b) − 1 ].

Substituting w = Σi αi yi xi and Σi αi yi = 0 (from the K.K.T. conditions) gives the dual objective

    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj).





Wolfe Dual Formulation

Maximize with respect to α:

    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj)

such that Σi αi yi = 0 and αi ≥ 0 for all i.

We need to find the Lagrange multipliers α only.
The primal variables w and b are eliminated.
There exist various numerical iterative methods to solve this constrained convex quadratic optimization problem.
Sequential minimal optimization (SMO) is one such technique, which is simple and relatively fast.


The Optimization Problem

Let α = (α1, . . . , αn)ᵀ. Then the dual problem can be written as

    Maximize    W(α) = Σi αi − (1/2) αᵀ Q α
    subject to  Σi αi yi = 0 and αi ≥ 0 for all i,

where Q is the n × n matrix with its (i, j)-th entry being yi yj (xi · xj).

If αi ≠ 0 then, from (4), xi is on a hyperplane, i.e., xi is a support vector.
Note: such an xi lies on the hyperplane yi (w · xi + b) = 1.
Similarly, if xi does not lie on a hyperplane, then αi = 0.
That is, for interior points, αi = 0.


The Solution

Once α is known, we can find w = Σi αi yi xi.

The classifier is

    f(x) = sign(w · x + b) = sign( Σi αi yi (xi · x) + b ).

b can be found from (4): for any αj > 0 we have yj (w · xj + b) = 1. Multiplying both sides by yj (and using yj² = 1) we get w · xj + b = yj. So,

    b = yj − Σi αi yi (xi · xj).

(A numerical sketch of solving the dual follows.)
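A hedged numerical sketch of solving this hard-margin dual with SciPy's SLSQP solver; the toy data, the negated objective, and the way b is recovered are illustrative choices, not part of the original slides.

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                                    # minimise -W(alpha)
    return -(a.sum() - 0.5 * a @ Q @ a)

n = len(y)
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                                 # w = sum_i alpha_i y_i x_i
j = int(np.argmax(alpha))                           # a support vector (alpha_j > 0)
b = y[j] - w @ X[j]                                 # from y_j (w . x_j + b) = 1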



Some observations

In the dual problem formulation and in its solution we have only dot products between some of the training patterns.

Once we have the matrix Q, the problem and its solution are independent of the dimensionality of the input space.


Non-linear SVM

We know that every non-linear function in the X-space (input space) can be seen as a linear function in an appropriate Y-space (feature space).

Let the mapping be φ : X → Y.

Once φ is defined, one has to replace, in the problem as well as in the solution, certain products, as explained below.

Whenever we see (xi · xj), replace it by (φ(xi) · φ(xj)).

While it is possible to explicitly define φ, generate the training set in the Y-space, and then obtain the solution,
it is tedious and, amazingly, unnecessary as well.


Kernel Function

For certain mappings φ, the dot product in the Y-space can be obtained as a function in the X-space itself: φ(x) · φ(z) = K(x, z). There is no need to explicitly generate the patterns in the Y-space.

Example: consider a two-dimensional problem with x = (x1, x2) and the mapping φ(x) = (x1², √2 x1 x2, x2²). Let K(x, z) = (x · z)². Then

    φ(x) · φ(z) = x1² z1² + 2 x1 x2 z1 z2 + x2² z2² = (x1 z1 + x2 z2)² = (x · z)².

This kernel trick is one of the reasons for the success of SVMs.
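A quick numerical check, assuming the standard mapping φ(x) = (x1², √2 x1 x2, x2²), that this mapping reproduces the kernel K(x, z) = (x · z)²; the test vectors are arbitrary.

import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)    # both sides equal 1.0 here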


Kernel Function

K(x, z) is called the kernel function.

We say K is a valid kernel iff there exists a φ such that K(x, z) = φ(x) · φ(z) for all x and z.

Mercer's Theorem gives the necessary conditions for a kernel to be valid.

While Mercer's theorem is a mathematically involved one, some of the properties of kernels can be used to verify whether a kernel is valid.


Kernel Function

The following three properties are satisfied by any kernel:

1. K(x, z) = K(z, x) (symmetry).
2. K(x, x) ≥ 0.
3. K(x, z)² ≤ K(x, x) K(z, z) (a Cauchy-Schwarz-type inequality).


Some ways to generate new Kernels

If K1, K2 are kernels, then K1 + K2 is a kernel, and c K1 (c ≥ 0) is a kernel.

If K1, K2 are kernels, then their product is also a valid kernel. That is, K(x, z) = K1(x, z) K2(x, z) is a valid kernel.

For any symmetric, positive semi-definite matrix B, K(x, z) = xᵀ B z is a valid kernel.

Let p be a polynomial with positive coefficients. If K is a kernel, then p(K(x, z)) is a valid kernel.

If K is a kernel, then exp(K(x, z)) (called the exponential kernel) is also a valid kernel.

K(x, z) = exp(−||x − z||² / σ²) (called the Gaussian kernel) is a kernel.


Soft Margin Formulation

Until now, we assumed that the data is linearly (or non-linearly) separable.

The SVM derived so far is sensitive to noise.

The soft margin formulation allows violation of the constraints to some extent. That is, we allow some of the boundary patterns to be misclassified.


Soft Margin Formulation

    w · xi + b ≥ +1 − ξi   if yi = +1
    w · xi + b ≤ −1 + ξi   if yi = −1

or, equivalently, for all i,

    yi (w · xi + b) ≥ 1 − ξi.

The ξi are called slack variables, and ξi ≥ 0 for all i.

Now, the objective to minimize is:

    (1/2) ||w||² + C Σi ξi,

where C is called the penalty parameter and C > 0.



Soft Margin Formulation

The Lagrangian is

    L(w, b, ξ, α, β) = (1/2) ||w||² + C Σi ξi − Σi αi [ yi (w · xi + b) − 1 + ξi ] − Σi βi ξi,

where the αi and βi are Lagrange multipliers.


Soft Margin: KKT Conditions

    ∂L/∂w = 0  ⇒  w = Σi αi yi xi

    ∂L/∂b = 0  ⇒  Σi αi yi = 0

    ∂L/∂ξi = 0  ⇒  C − αi − βi = 0   for all i



Soft Margin: KKT Conditions

    αi ≥ 0,  βi ≥ 0,  ξi ≥ 0   for all i

    αi [ yi (w · xi + b) − 1 + ξi ] = 0   for all i

    βi ξi = 0   for all i




Wolfe Dual Formulation

Maximize with respect to α:

    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj)

such that Σi αi yi = 0 and 0 ≤ αi ≤ C for all i.

A very large C approaches the hard margin.
A very small C gives a very, very soft margin.

(A usage sketch with a soft-margin SVM follows.)
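A hedged usage sketch with scikit-learn's SVC: C is the soft-margin penalty parameter discussed above and 'rbf' is the Gaussian kernel; the toy data and parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X, y)
print(clf.support_vectors_)         # the support vectors found by the solver
print(clf.predict([[1.5, 1.0]]))    # classify a new point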




Training Methods

Any convex quadratic programming technique can be applied.

But with larger training sets, most of the standard techniques can become very slow and memory-hungry. For example, many techniques need to store the kernel matrix, whose size is n × n, where n is the number of training patterns.

These considerations have driven the design of specific algorithms for SVMs that can exploit the sparseness of the solution, the convexity of the optimization problem, and the implicit mapping into feature space.

One such simple and fast method is Sequential Minimal Optimization (SMO).

Next Class ...
SMO algorithm.


Advanced Classification Techniques

August 25, 2017

Outline
Ensemble of Classifiers
Bagging


Model Ensembles

TWO HEADS ARE BETTER THAN ONE

1 Construct a set of classifiers from adapted versions of the training data (resampled/reweighted).

2 Combine the predictions of these classifiers in some way, often by simple averaging or voting.


Figure: General Idea of Ensemble


Rationale for Ensemble Method

1 Suppose there are 25 base classifiers.
2 Each classifier has error rate ε = 0.35.
3 Assume the classifiers are independent.
4 The probability that the ensemble (majority-vote) classifier makes a wrong prediction is

    Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) = 0.06
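A quick check of this number in code (illustrative only): a majority vote of 25 independent classifiers, each with error rate 0.35, is wrong only when 13 or more of them err.

from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i) for i in range(13, n + 1))
print(round(p_wrong, 2))    # 0.06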


Bootstrap aggregating (Bagging)

INPUT: Training set D, a learning algorithm A, T = ensemble size.
OUTPUT: Ensemble of classifiers.

for t = 1 to T do
  1 Build a bootstrap sample Dt by sampling |D| points from D with replacement.
  2 Run A on Dt to build a classifier Mt.
Return {Mt | 1 ≤ t ≤ T}

Comments: the probability that a particular data point is not selected in a bootstrap sample is (1 − 1/n)^n, which approaches 1/e ≈ 0.368 for large n. (A sketch in code follows.)
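A minimal sketch of bagging (not the slides' code); decision stumps as the base learner, +1/-1 labels, and voting by summing predictions are illustrative choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=10, seed=0):
    rng = np.random.default_rng(seed)
    n, models = len(X), []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample of |D| points
        models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = sum(m.predict(X) for m in models)            # labels assumed to be +1 / -1
    return np.sign(votes)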



Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = 1 else.

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1- 1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 5: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1.0 1.0 1.0
y 1 1 1 -1 -1 -1 -1 1 1 1

Advanced Classification Techniques


Outline
Ensemble of Classifiers
Bagging

Bagging Example
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Round 1: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2: x ≤ 0.65 : y = +1, y = +1 else (this stump predicts +1 everywhere).

x 0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0
y 1 1 1 -1 -1 1 1 1 1 1

Bagging Round 3: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 4: x ≤ 0.3 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Round 5: x ≤ 0.35 : y = +1, y = −1 else.

x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1.0 1.0 1.0
y 1 1 1 -1 -1 -1 -1 1 1 1


Bagging Example (continued)
Bagging Round 6: x ≤ 0.75 : y = −1, y = +1 else.

x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1.0
y 1 -1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 7: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1.0
y 1 -1 -1 -1 -1 1 1 1 1 1

Bagging Round 8: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 9: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1.0 1.0
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 10: x ≤ 0.05 : y = −1, y = +1 else.

x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y 1 1 1 1 1 1 1 1 1 1


Bagging: combining the 10 rounds

Training data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Round  Stump                      x=0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
1      x ≤ 0.35: +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
2      x ≤ 0.65: +1, else +1        1    1    1    1    1    1    1    1    1    1
3      x ≤ 0.35: +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
4      x ≤ 0.3:  +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
5      x ≤ 0.35: +1, else -1        1    1    1   -1   -1   -1   -1   -1   -1   -1
6      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
7      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
8      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
9      x ≤ 0.75: -1, else +1       -1   -1   -1   -1   -1   -1   -1    1    1    1
10     x ≤ 0.05: -1, else +1        1    1    1    1    1    1    1    1    1    1
SUM                                 2    2    2   -6   -6   -6   -6    2    2    2
Class (sign of SUM)                 1    1    1   -1   -1   -1   -1    1    1    1
Actual                              1    1    1   -1   -1   -1   -1    1    1    1


Random Forest

Bagging is useful in combination with decision tree models.


Build each tree from a different random subset of features
(subspace sampling).


Random Forest

INPUT: Training set D, subspace dimension d, T = ensemble size.
OUTPUT: Ensemble of classifiers.

for t = 1 to T do
  1 Build a bootstrap sample Dt by sampling with replacement.
  2 Select d features randomly and reduce the dimensionality of Dt accordingly.
  3 Build a decision tree Mt using Dt.
Return {Mt | 1 ≤ t ≤ T}.

(A sketch in code follows.)
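A minimal sketch of this random-subspace variant (one random feature subset per tree); d, T, unpruned trees, and +1/-1 labels are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, d, T=10, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(T):
        rows = rng.integers(0, n, size=n)                # bootstrap sample
        cols = rng.choice(p, size=d, replace=False)      # random feature subspace
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def random_forest_predict(forest, X):
    votes = sum(tree.predict(X[:, cols]) for tree, cols in forest)   # +1 / -1 labels
    return np.sign(votes)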


Boosting

Boosting assigns a weight to each training example and adaptively changes the weights at the end of each boosting round.

How do the weights change?
Half of the total weight is assigned to the misclassified examples and the other half to the remaining (correctly classified) examples.


Boosting

The initial weights are uniform and sum to 1.
The total weight currently assigned to the misclassified samples is exactly the error rate ε.
Multiply the weights of the misclassified instances by 1/(2ε).
Multiply the weights of the correctly classified samples by 1/(2(1 − ε)).

Table: Confusion Matrix

                         ACTUAL CLASS
                         Class=+   Class=-
PREDICTED   Class=+         24        9
CLASS       Class=-         16       51

Here ε = (9 + 16)/100 = 0.25, so the weight update factor is 1/(2ε) = 2 for misclassified samples and 1/(2(1 − ε)) = 1/1.5 = 2/3 for correctly classified samples. (A small check in code follows.)
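A small arithmetic check of this reweighting rule for the confusion matrix above (illustrative only): 25 of the 100 points are misclassified, so ε = 0.25.

import numpy as np

w = np.full(100, 1 / 100)              # uniform initial weights, summing to 1
mis = np.zeros(100, dtype=bool)
mis[:25] = True                        # 9 + 16 = 25 misclassified points
eps = w[mis].sum()                     # 0.25
w[mis] *= 1 / (2 * eps)                # factor 2
w[~mis] *= 1 / (2 * (1 - eps))         # factor 2/3
print(w[mis].sum(), w[~mis].sum())     # 0.5 and 0.5: each group now holds half the weight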

Boosting

Confidence/weight of each boosting model (α):

    αt = (1/2) ln( (1 − εt) / εt )


AdaBoost

INPUT: Training set D; ensemble size T; learning algorithm A.
OUTPUT: Weighted ensemble of models.

w1i = 1/|D| for all xi ∈ D
for t = 1 to T do
  1 Run A on D with weights wti to produce a model Mt.
  2 Calculate the weighted error εt.
  3 if εt ≥ 1/2 then T = t − 1; break.
  4 αt = (1/2) ln( (1 − εt) / εt )
  5 w(t+1)i = (wti / Zt) × exp(−αt) for correctly classified examples;
    w(t+1)i = (wti / Zt) × exp(+αt) for misclassified examples.
Return M(x) = Σ_{t=1}^{T} αt Mt(x)

(A sketch in code follows.)
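A minimal sketch of AdaBoost with decision stumps (not the slides' code); +1/-1 labels and scikit-learn's sample_weight support are assumptions made for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                                # w_1i = 1/|D|
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                           # weighted error
        if eps >= 0.5:                                     # weak learner no better than chance
            break
        eps = max(eps, 1e-10)                              # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)                     # exp(-alpha) if correct, exp(+alpha) if not
        w /= w.sum()                                       # Z_t normalisation
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))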


Example of AdaBoost

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1
Boosting Round 1: x ≤ 0.75 : y = −1, y = +1 else.

x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2: x ≤ 0.05 : y = +1, y = +1 else.

x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3: x ≤ 0.3 : y = +1, y = −1 else.

x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1


Example (combining the three boosting rounds)

Round   Split Point   α
1       0.75          1.738
2       0.05          2.7784
3       0.3           4.1195

Round   x=0.1  0.2   0.3   0.4    0.5    0.6    0.7    0.8     0.9     1.0
1        -1    -1    -1    -1     -1     -1     -1      1       1       1
2         1     1     1     1      1      1      1      1       1       1
3         1     1     1    -1     -1     -1     -1     -1      -1      -1
Sum      5.16  5.16  5.16  -3.08  -3.08  -3.08  -3.08   0.397   0.397   0.397
Sign      1     1     1    -1     -1     -1     -1      1       1       1


THANK YOU
