Data
Book: Introduction to Data Mining, by Tan | Steinbach | Kumar
Saturday [4.15pm-5.15pm]
https://sites.google.com/site/patrabidyutkr/teaching/data-warehousing-and-mining-cs6312
Outline
Types of Data
Data Quality
Data Preprocessing
What is Data?
A collection of data objects and their attributes
– An attribute is a property or characteristic of an object; it is also known as a variable, field, characteristic, dimension, or feature
(Example: a table of records with attributes such as Refund, Marital Status, Taxable Income, and Cheat)
Attribute Values
Measurement of Length
The way you measure an attribute may not match the attribute's properties.
(Figure: objects of length 5, 7, 8, 10, 15 (labelled A, B, C, D, ...) mapped onto two measurement scales; one scale preserves only the ordering property of length, while the other preserves both the ordering and additivity properties of length.)
Types of Attributes
Difference Between Ratio and Interval
Attribute Type | Description | Examples | Operations
Nominal (categorical / qualitative) | Nominal attribute values only distinguish (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as
important
Words present in documents
Items present in customer transactions
Key Messages for Attribute Types
– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are not
present
Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Data
– Sparsity
Only presence counts
– Resolution
Patterns depend on the scale
– Size
Type of analysis may depend on size of data
Record Data
Data Matrix
Document Data
Each document is represented as a vector of term counts over the terms: timeout, season, coach, game, score, play, team, win, ball, lost.

             timeout  season  coach  game  score  play  team  win  ball  lost
Document 1      3       0       5      0      2     6     0     2     0     2
Document 2      0       7       0      2      1     0     0     3     0     0
Document 3      0       1       0      0      1     2     2     0     3     0
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
(Figure: example of graph data, a small graph with labelled nodes and weighted edges)
Ordered Data
Sequences of transactions (Figure: sets of items/events grouped over time; each set of items is an element of the sequence)
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean)
Data Quality
Causes?
Missing Values
Missing Values …
Duplicate Data
– A data set may include data objects that are duplicates, or almost duplicates, of one another
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
Euclidean Distance
dist(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )
where n is the number of dimensions (attributes) and x_k, y_k are the kth attributes of x and y.
Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
(Figure: the four points plotted in the plane)
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
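As a small illustrative check (not part of the slides), the matrix above can be recomputed in Python, assuming NumPy is available:

    import numpy as np

    # Points from the table above: p1..p4 with (x, y) coordinates.
    points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

    # Pairwise Euclidean distances: d(a, b) = sqrt(sum_k (a_k - b_k)^2).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    print(np.round(dist, 3))   # reproduces the matrix, e.g. d(p1, p2) = 2.828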
Minkowski Distance
dist(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}, where r is a parameter and n is the number of dimensions.
Minkowski Distance: Examples
r = 1: city block (Manhattan, L1 norm) distance
r = 2: Euclidean (L2) distance
r → ∞: supremum (L∞ norm) distance, the maximum difference between any attribute of the objects
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1     p2     p3     p4
p1    0      4      4      6
p2    4      0      2      4
p3    4      2      0      2
p4    6      4      2      0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1     p2     p3     p4
p1    0      2      3      5
p2    2      0      1      3
p3    3      1      0      2
p4    5      3      2      0

Distance Matrix
Mahalanobis Distance
mahalanobis(x, y) = (x − y)ᵀ Σ⁻¹ (x − y), where Σ is the covariance matrix of the input data (no square root is taken here).

Covariance matrix:  Σ = [ 0.3  0.2 ; 0.2  0.3 ]
A: (0.5, 0.5)   B: (0, 1)   C: (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
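A minimal sketch (assuming NumPy) that reproduces these two values:

    import numpy as np

    cov = np.array([[0.3, 0.2], [0.2, 0.3]])
    inv = np.linalg.inv(cov)

    def mahal(p, q):
        d = np.array(p) - np.array(q)
        return d @ inv @ d          # squared Mahalanobis distance, as defined above

    print(mahal((0.5, 0.5), (0, 1)))      # 5.0
    print(mahal((0.5, 0.5), (1.5, 1.5)))  # 4.0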
Common Properties of a Distance
Similarity Between Binary Vectors
Common situation is that objects, x and y, have only binary attributes.
Compute similarities using the following quantities:
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f00 = the number of attributes where x is 0 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient: SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
Jaccard Coefficient: J = f11 / (f01 + f10 + f11)
SMC versus Jaccard: Example
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
f01 = 2, f10 = 1, f00 = 7, f11 = 0
SMC = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = 0 / (2 + 1 + 0) = 0
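The same computation as a short Python sketch (the vectors below are the ones from the example):

    x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)

    smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # 0.7
    jaccard = f11 / (f11 + f10 + f01)             # 0.0
    print(smc, jaccard)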
Cosine Similarity
cos(x, y) = (x · y) / (||x|| ||y||), the cosine of the angle between the two vectors.
Extended Jaccard Coefficient (Tanimoto)
Correlation
Correlation measures the linear relationship between objects:
corr(x, y) = covariance(x, y) / (std(x) · std(y))
Visually Evaluating Correlation
(Figure: scatter plots showing the similarity from –1 to 1)
Drawback of Correlation
x = (−3, −2, −1, 0, 1, 2, 3), y_i = x_i², so y = (9, 4, 1, 0, 1, 4, 9)
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
correlation(x, y) = 0, even though y is completely (but non-linearly) determined by x.
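A quick check of this example in Python, assuming NumPy:

    import numpy as np

    x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
    y = x ** 2   # perfectly (but non-linearly) determined by x

    print(x.mean(), y.mean())             # 0.0, 4.0
    print(x.std(ddof=1), y.std(ddof=1))   # 2.16, 3.74
    print(np.corrcoef(x, y)[0, 1])        # 0.0: correlation misses the relationship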
Comparison of Proximity Measures
Domain of application
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
However, one can talk about various properties that
you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
The measure must be applicable to the data and
produce results that agree with domain knowledge
Information Based Measures
Information and Probability
For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X , H(X), is given by
H(X) = − Σ_{i=1}^{n} p_i log₂ p_i
Entropy for Sample Data: Example
Entropy for Sample Data
Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– and the number of observations in the ith category is m_i
– Then, for this sample
H(X) = − Σ_{i=1}^{n} (m_i / m) log₂ (m_i / m)

Joint entropy of X and Y:
H(X, Y) = − Σ_i Σ_j p_ij log₂ p_ij
where p_ij is the probability that the ith value of X and the jth value of Y occur together.
Mutual Information Example
Mutual information: I(X, Y) = H(X) + H(Y) − H(X, Y)
Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
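A minimal sketch of these formulas in Python (the Student Status / Grade counts themselves are not reproduced here):

    import math
    from collections import Counter

    def entropy(values):
        """Sample entropy H(X) = -sum (m_i/m) log2 (m_i/m)."""
        m = len(values)
        return -sum((c / m) * math.log2(c / m) for c in Counter(values).values())

    def mutual_information(xs, ys):
        """I(X, Y) = H(X) + H(Y) - H(X, Y), using the paired values as the joint sample."""
        return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))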
Maximal Information Coefficient
Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524.
Using Weights to Combine Similarities
similarity(x, y) = Σ_{k=1}^{n} ω_k δ_k s_k(x, y) / Σ_{k=1}^{n} ω_k δ_k
Density
Euclidean Density: Grid-based Approach
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc.
Days aggregated into weeks, months, or years
Sampling …
Sample Size
Types of Sampling
Simple Random Sampling
– There is an equal probability of selecting any particular
item
– Sampling without replacement
As each item is selected, it is removed from the
population
– Sampling with replacement
Objects are not removed from the population as they
are selected for the sample.
In sampling with replacement, the same object can
be picked up more than once
Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
Sample Size
What sample size is necessary to get at least one object from each of 10 equal-sized groups?
Curse of Dimensionality
When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
Dimensionality Reduction
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
– Goal: find a projection of the data that captures the largest amount of variation; the principal components are the eigenvectors of the data's covariance matrix.
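A minimal PCA sketch via the covariance eigen-decomposition, assuming NumPy (an illustration, not the slides' own code):

    import numpy as np

    def pca(X, k):
        """Project X (n x d) onto its top-k principal components."""
        Xc = X - X.mean(axis=0)                    # center the data
        cov = np.cov(Xc, rowvar=False)             # d x d covariance matrix
        vals, vecs = np.linalg.eigh(cov)           # eigen-decomposition (ascending order)
        top = vecs[:, np.argsort(vals)[::-1][:k]]  # directions of the k largest eigenvalues
        return Xc @ top                            # n x k projected data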
Feature Subset Selection
Mapping Data to a New Space
Discretization
Iris Sample Data Set
(Figure: histogram of petal length, x-axis 0–8, y-axis counts 0–30)
Data consists of four groups of points and two outliers. Data is one-dimensional, but a random y component is added to reduce overlap.
Discretization Without Using Class Labels
Binarization
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.
Seasonality Accounts for Much Correlation
Minneapolis
Normalized using monthly Z-score: subtract off the monthly mean and divide by the monthly standard deviation.
Computing Gini Index for a Collection of Nodes
When a node p is split into k partitions (children),
GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)
where n_i = number of records at child i and n = number of records at node p.
For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.
Continuous Attributes: Computing Gini Index
Use a binary split on a value, e.g., Annual Income ≤ 80?
For each candidate split value, gather the count matrix and compute its Gini index (see the sketch below):
                ≤ 80   > 80
Defaulted Yes     0      3
Defaulted No      3      4
– Computationally inefficient! Repetition of work.
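A small illustrative Python sketch (helper names are ours, not the slides') that reproduces the Gini value for the count matrix above:

    def gini(counts):
        """Gini impurity of a node given its class counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(children):
        """GINI_split = sum_i (n_i / n) * GINI(i) over the child count lists."""
        n = sum(sum(c) for c in children)
        return sum(sum(c) / n * gini(c) for c in children)

    # Annual Income <= 80: child counts are (Yes, No) pairs from the matrix above.
    print(round(gini_split([[0, 3], [3, 4]]), 3))   # 0.343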
Continuous Attributes: Computing Gini Index...
For efficient computation: sort the attribute values, then linearly scan them, each time updating the count matrix and computing the Gini index, and choose the split position with the smallest Gini.

Split position (count of class No, ≤ / >):  0/7   1/6   2/5   3/4   3/4   3/4   3/4   4/3   5/2   6/1   7/0
Gini:                                       0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO,   where SplitINFO = − Σ_{i=1}^{k} (n_i / n) log₂ (n_i / n)
Misclassification Error vs Gini Index
Parent node: C1 = 7, C2 = 3, Gini = 0.42
Split on A? (Yes → Node N1, No → Node N2)
        N1   N2
C1       3    4
C2       0    3
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves, but the misclassification error remains the same!
Misclassification Error vs Gini Index
Parent node: C1 = 7, C2 = 3, Gini = 0.42
Two candidate splits on A (Yes → N1, No → N2):
          N1   N2                    N1   N2
C1         3    4          C1         3    4
C2         0    3          C2         1    2
Gini = 0.342               Gini = 0.416
Comparison among Impurity Measures
Decision Tree Based Classification
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
Outline
Distance Based Classification Algorithms
2 For each x ∈ T \ S
  1 Classify x using NN considering S as training set.
  2 G = ∅
3 For each x ∈ T
  1 Classify x using NN considering S as training set.
5 S = S ∪ R
6 G = ∅
7 Repeat Step 2 to Step 6 until there is no misclassification.
Many Iterations.
Algorithm (find the K nearest neighbours of a query pattern x in training set T):
KNN = ∅
For each t ∈ T
  1 if |KNN| < K
      KNN = KNN ∪ {t}
  2 else
      1 Find an x′ ∈ KNN such that dis(x, x′) > dis(x, t)
      2 If such an x′ exists, KNN = KNN − {x′}; KNN = KNN ∪ {t}
The pattern x is assigned to the class to which most of the patterns in KNN belong.
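A minimal sketch of this classifier in Python (for simplicity it sorts the whole training set rather than maintaining the running KNN set; the helper name is ours):

    import math
    from collections import Counter

    def knn_classify(x, train, k):
        """train is a list of (pattern, label) pairs; pattern is a tuple of numbers."""
        dist = lambda a, b: math.dist(a, b)                     # Euclidean distance
        neighbours = sorted(train, key=lambda p: dist(x, p[0]))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]                       # majority class among the K nearest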
Editing Techniques
Confusion Matrix
Which model is better?
Model A:             PREDICTED
                     Class=Yes   Class=No
ACTUAL  Class=Yes        0          10
        Class=No         0         990

Model B:             PREDICTED
                     Class=Yes   Class=No
ACTUAL  Class=Yes       10           0
        Class=No        90         900

Which model is better?
Model A:             PREDICTED
                     Class=Yes   Class=No
ACTUAL  Class=Yes        5           5
        Class=No         0         990

Model B:             PREDICTED
                     Class=Yes   Class=No
ACTUAL  Class=Yes       10           0
        Class=No        90         900
Alternative Measures
                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes     a           b
CLASS  Class=No      c           d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
Alternative Measures
                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    10           0
CLASS  Class=No     10         980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) = 0.67
Accuracy = 990 / 1000 = 0.99
Alternative Measures
                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    10           0
CLASS  Class=No     10         980
Precision (p) = 0.5, Recall (r) = 1, F-measure (F) = 0.67, Accuracy = 0.99

                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes     1           9
CLASS  Class=No      0         990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) = 0.18
Accuracy = 991 / 1000 = 0.991
Alternative Measures
                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    40          10
CLASS  Class=No     10          40

Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8
Alternative Measures
Model A:         PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    40          10
CLASS  Class=No     10          40
Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8

Model B:         PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    40          10
CLASS  Class=No   1000        4000
Precision (p) ≈ 0.04, Recall (r) = 0.8, F-measure (F) ≈ 0.08, Accuracy ≈ 0.8
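A short Python sketch (an illustration; the helper name is ours) that reproduces the comparison of models A and B above:

    def measures(tp, fn, fp, tn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f = 2 * p * r / (p + r)
        acc = (tp + tn) / (tp + fn + fp + tn)
        return p, r, f, acc

    print(measures(40, 10, 10, 40))      # model A: (0.8, 0.8, 0.8, 0.8)
    print(measures(40, 10, 1000, 4000))  # model B: precision ~0.04, recall 0.8, F ~0.07, accuracy 0.8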
Measures of Classification Performance
                 PREDICTED CLASS
                   Yes    No
ACTUAL   Yes       TP     FN
CLASS    No        FP     TN

True positive rate (TPR) = TP / (TP + FN);  False positive rate (FPR) = FP / (FP + TN)

Alternative Measures
Alternative Measures
                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    10          40
CLASS  Class=No     10          40
Precision (p) = 0.5, TPR = Recall (r) = 0.2, FPR = 0.2, F-measure = 0.28

                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    25          25
CLASS  Class=No     25          25
Precision (p) = 0.5, TPR = Recall (r) = 0.5, FPR = 0.5, F-measure = 0.5
ROC Curve
(TPR,FPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
– Random guessing
– Below diagonal line:
prediction is opposite
of the true class
ROC (Receiver Operating Characteristic)
ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
At threshold t:
TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
Using ROC for Model Comparison
No model consistently
outperforms the other
M1 is better for
small FPR
M2 is better for
large FPR
How to Construct an ROC curve
How to construct an ROC curve
Class        +     -     +     -     -     -     +     -     +     +
Threshold ≥  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP           5     4     4     3     3     3     3     2     2     1     0
FP           5     5     4     4     3     2     1     1     0     0     0
TN           0     0     1     1     2     3     4     4     5     5     5
FN           0     1     1     2     2     2     2     3     3     4     5
TPR          1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR          1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
ROC Curve: plot TPR (y-axis) against FPR (x-axis) at each threshold.
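A small Python sketch of the same construction (ties at a score, such as the three instances at 0.85, are collapsed into a single point here):

    def roc_points(scores, labels):
        """Sweep the threshold over the scores; predict '+' when score >= threshold."""
        pos = sum(1 for l in labels if l == '+')
        neg = len(labels) - pos
        pts = []
        for t in sorted(set(scores)) + [max(scores) + 1]:
            tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == '+')
            fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == '-')
            pts.append((fp / neg, tp / pos))    # (FPR, TPR)
        return pts

    scores = [0.25, 0.43, 0.53, 0.76, 0.85, 0.85, 0.85, 0.87, 0.93, 0.95]
    labels = ['+', '-', '+', '-', '-', '-', '+', '-', '+', '+']
    print(roc_points(scores, labels))   # reproduces the (FPR, TPR) pairs of the table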
Outline
Bayesian Classifier

Bayes Rule
Conditional probability:
P(C|A) = P(A|C) P(C) / P(A)

Naïve Bayes Classifier
1 If Aj is categorical, then
P(aj | Ci) = (# patterns in Ci with value aj) / |Ci,T|
Bayesian Classifier
If Aj is continuous-valued, the attribute is assumed to follow a Gaussian (normal) distribution:
G(x, µ, σ) = (1 / (√(2π) σ)) · exp( −(x − µ)² / (2σ²) )
P(aj | Ci) = G(aj, µ_Ci, σ_Ci)
where µ_Ci and σ_Ci are the mean and standard deviation of attribute Aj over the training patterns of class Ci.
Table: Training Set (Source: Tan, Kumar and Steinbach, Introduction to Data Mining)
Bayesian Classifier
X = (Refund = No, Married, Income = 120K). What will be the class label of X?
P(Income = 120 | Class = No) = 0.0072 (from the Gaussian fitted to the Class = No incomes)
P(X | Class = No) = P(Refund = No | Class = No) × P(Married | Class = No) × P(Income = 120 | No)
                  = (4/7) × (4/7) × 0.0072 = 0.0024
P(X | Class = Yes) = P(Refund = No | Class = Yes) × P(Married | Class = Yes) × P(Income = 120 | Yes) = 0 (one of the factors is zero)
Since P(X | Class = No) × P(Class = No) > P(X | Class = Yes) × P(Class = Yes), X is classified as Class = No.
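A sketch of the Gaussian likelihood computation in Python; the seven income values below are assumed from the textbook's training table, which did not survive extraction here:

    import math

    def gaussian(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # Annual incomes of the Class = No records (assumed from the Tan et al. example table).
    income_no = [125, 100, 70, 120, 60, 220, 75]
    mu = sum(income_no) / len(income_no)                                 # 110
    var = sum((v - mu) ** 2 for v in income_no) / (len(income_no) - 1)   # 2975
    print(round(gaussian(120, mu, math.sqrt(var)), 4))                   # ~0.0072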
Data Mining
Chapter 4
Artificial Neural Networks (ANN)
Perceptron Example
Input nodes X1, X2, X3 (weights 0.3, 0.3, 0.3) feed the output node Y with threshold t = 0.4, i.e.
Y = sign(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4)

X1  X2  X3 |  Y
 1   0   0 | -1
 1   0   1 |  1
 1   1   0 |  1
 1   1   1 |  1
 0   0   1 | -1
 0   1   0 | -1
 0   1   1 |  1
 0   0   0 | -1
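A tiny Python check (not from the slides) that this perceptron reproduces the Y column above:

    def perceptron(x, w=(0.3, 0.3, 0.3), t=0.4):
        """Output +1 if 0.3*x1 + 0.3*x2 + 0.3*x3 - 0.4 > 0, else -1."""
        s = sum(wi * xi for wi, xi in zip(w, x)) - t
        return 1 if s > 0 else -1

    for x in [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]:
        print(x, perceptron(x))   # reproduces the Y column of the table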
Perceptron Learning Rule
Intuition:
– Update each weight based on the error e = y − ŷ:  w_j ← w_j + λ·(y − ŷ)·x_j
– If y = ŷ, e = 0: no update needed
– If y > ŷ, e = 2: weights must be increased so that ŷ will increase
– If y < ŷ, e = −2: weights must be decreased so that ŷ will decrease
Example of Perceptron Learning (learning rate λ = 0.1; training data as in the perceptron example above)

Weight updates over the first epoch:
Update   w0     w1     w2     w3
0         0      0      0      0
1        -0.2   -0.2    0      0
2         0      0      0      0.2
3         0      0      0      0.2
4         0      0      0      0.2
5        -0.2    0      0      0
6        -0.2    0      0      0
7         0      0      0.2    0.2
8        -0.2    0      0.2    0.2

Weights at the end of each epoch:
Epoch    w0     w1     w2     w3
0         0      0      0      0
1        -0.2    0      0.2    0.2
2        -0.2    0      0.4    0.2
3        -0.4    0      0.4    0.2
4        -0.4    0.2    0.4    0.4
5        -0.6    0.2    0.4    0.2
6        -0.6    0.4    0.4    0.2
Perceptron Learning
Since y is a linear combination of the input variables, the decision boundary is linear.
Nonlinearly Separable Data
XOR Data (y = x1 XOR x2)
x1  x2 |  y
 0   0 | -1
 1   0 |  1
 0   1 |  1
 1   1 | -1
Multi-layer Neural Network
(Figure: a multi-layer network whose input layer x1 … x5 feeds hidden layers and an output node)
More than one hidden layer of computing nodes.
Multi-layer Neural Network
(Figure: a small network; inputs x1, x2 feed nodes n1, n2; weights w31, w41, w32, w42 connect them to nodes n3, n4; weights w53, w54 connect n3, n4 to the output node n5, which produces y.)
Why Multiple Hidden Layers?
Multi-Layer Network Architecture
Activation Functions
Learning Multi-layer Neural Network
Gradient Descent
Update each weight in the direction of the negative gradient of the loss: w ← w − λ · ∂E/∂w
λ: learning rate
Computing Gradients
ŷ = a^L, the activation values at the output layer L.
At output layer L:
Characteristics of ANN
Deep Learning Trends
Handling Vanishing Gradient Problem
Outline
Ensemble of Classifiers
Bagging
Model Ensembles
for t = 1 to T do
  1 Build a bootstrap sample Dt by sampling |D| points with replacement.
  2 Run A on Dt to build a classifier Mt.
Return {Mt | 1 ≤ t ≤ T}
Comments: the probability that a particular data point is not selected in a bootstrap sample is (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368 for large n.
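A minimal Python sketch of this scheme (illustrative only; `learn` stands for the base learner A, e.g. a decision stump):

    import random
    from statistics import mode

    def bagging(train, learn, T):
        """train: list of (x, y) pairs; learn: fits a classifier on a sample; returns T models."""
        n = len(train)
        models = []
        for _ in range(T):
            boot = [random.choice(train) for _ in range(n)]   # sample |D| points with replacement
            models.append(learn(boot))
        return models

    def bagged_predict(models, x):
        return mode(m(x) for m in models)   # majority vote of the T classifiers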
Bagging Example
Original data:
x  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y   1    1    1   -1   -1   -1   -1    1    1    1

Bagging Round 1 (stump learned: x ≤ 0.35 : y = +1, y = −1 else):
x  0.1  0.2  0.2  0.3  0.4  0.4  0.5  0.6  0.9  0.9
y   1    1    1    1   -1   -1   -1   -1    1    1

Bagging Round 2:
x  0.1  0.2  0.3  0.4  0.5  0.8  0.9  1.0  1.0  1.0
y   1    1    1   -1   -1    1    1    1    1    1

Bagging Round 3:
x  0.1  0.2  0.3  0.4  0.4  0.5  0.7  0.7  0.8  0.9
y   1    1    1   -1   -1   -1   -1    1    1    1

Bagging Round 4:
x  0.1  0.1  0.2  0.4  0.4  0.5  0.5  0.7  0.8  0.9
y   1    1    1   -1   -1   -1   -1    1    1    1

Bagging Round 5:
x  0.1  0.1  0.2  0.5  0.6  0.6  0.6  1.0  1.0  1.0
y   1    1    1   -1   -1   -1   -1    1    1    1
Bagging
Bagging Round 6 (stump learned: x ≤ 0.75 : y = −1, y = +1 else):
x  0.2  0.4  0.5  0.6  0.7  0.7  0.7  0.8  0.9  1.0
y   1   -1   -1   -1   -1   -1   -1    1    1    1

Bagging Round 7:
x  0.1  0.4  0.4  0.6  0.7  0.8  0.9  0.9  0.9  1.0
y   1   -1   -1   -1   -1    1    1    1    1    1

Bagging Round 8:
x  0.1  0.2  0.5  0.5  0.5  0.7  0.7  0.8  0.9  1.0
y   1    1   -1   -1   -1   -1   -1    1    1    1

Bagging Round 9:
x  0.1  0.3  0.4  0.4  0.6  0.7  0.7  0.8  1.0  1.0
y   1    1   -1   -1   -1   -1   -1    1    1    1

Bagging Round 10:
x  0.1  0.1  0.1  0.1  0.3  0.3  0.8  0.8  0.9  0.9
y   1    1    1    1    1    1    1    1    1    1
Bagging: combining the T = 10 stumps by voting

Round  Stump          Predictions for x = 0.1 … 1.0
1      ≤ 0.35, +1      1  1  1 -1 -1 -1 -1 -1 -1 -1
2      ≤ 0.65, +1      1  1  1  1  1  1  1  1  1  1
3      ≤ 0.35, +1      1  1  1 -1 -1 -1 -1 -1 -1 -1
4      ≤ 0.3, +1       1  1  1 -1 -1 -1 -1 -1 -1 -1
5      ≤ 0.35, +1      1  1  1 -1 -1 -1 -1 -1 -1 -1
6      ≤ 0.75, −1     -1 -1 -1 -1 -1 -1 -1  1  1  1
7      ≤ 0.75, −1     -1 -1 -1 -1 -1 -1 -1  1  1  1
8      ≤ 0.75, −1     -1 -1 -1 -1 -1 -1 -1  1  1  1
9      ≤ 0.75, −1     -1 -1 -1 -1 -1 -1 -1  1  1  1
10     ≤ 0.05, −1      1  1  1  1  1  1  1  1  1  1
SUM                    2  2  2 -6 -6 -6 -6  2  2  2
Class (sign of SUM)    1  1  1 -1 -1 -1 -1  1  1  1
Actual                 1  1  1 -1 -1 -1 -1  1  1  1
Random Forest
for t = 1 to T do
  1 Build a bootstrap sample Dt by sampling with replacement.
  2 Select d features randomly and reduce the dimensionality of Dt accordingly.
  3 Build a decision tree Mt using Dt.
Return {Mt | 1 ≤ t ≤ T}.
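A rough Python sketch of this loop, assuming scikit-learn's DecisionTreeClassifier as the base learner (here features are drawn once per tree, mirroring the pseudocode; standard random forests re-sample features at every split):

    import random
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier   # assumed base learner

    def random_forest(X, y, T=10, d=2):
        """X: list of feature vectors, y: labels; d features are drawn per tree."""
        n, n_feat = len(X), len(X[0])
        forest = []
        for _ in range(T):
            idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
            feats = random.sample(range(n_feat), d)         # random feature subset
            Xt = [[X[i][f] for f in feats] for i in idx]
            tree = DecisionTreeClassifier().fit(Xt, [y[i] for i in idx])
            forest.append((tree, feats))
        return forest

    def forest_predict(forest, x):
        votes = Counter(tree.predict([[x[f] for f in feats]])[0] for tree, feats in forest)
        return votes.most_common(1)[0][0]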
Boosting
Initial weights are uniform and sum to 1.
The total weight currently assigned to the misclassified samples is exactly the error rate ε.
Multiply the weights of misclassified instances by 1/(2ε).
Multiply the weights of correctly classified samples by 1/(2(1 − ε)).
(After this reweighting, the misclassified and the correctly classified samples each carry total weight 1/2.)
Boosting
AdaBoost
Example of AdaBoost
Boosting Round 1 training sample:
x  0.1  0.4  0.5  0.6  0.6  0.7  0.7  0.7  0.8  1.0
y   1   -1   -1   -1   -1   -1   -1   -1    1    1

Boosting Round 2 training sample:
x  0.1  0.1  0.2  0.2  0.2  0.2  0.3  0.3  0.3  0.3
y   1    1    1    1    1    1    1    1    1    1

Boosting Round 3 training sample:
x  0.2  0.2  0.4  0.4  0.4  0.4  0.5  0.6  0.6  0.7
y   1    1   -1   -1   -1   -1   -1   -1   -1   -1
Example
Round  (predictions for data points 1–10, i.e. x = 0.1 … 1.0)
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Sign 1 1 1 -1 -1 -1 -1 1 1 1
THANK YOU
Introduction to Classification
Introduction to Clustering
Similarity Measures
Category of clustering methods
Partitional Clustering
Soft Clustering
Hierarchical Clustering
Density Based Clustering Method
Clustering Methods for Large Datasets
Major Approaches to Clustering Large Datasets
Hybrid Clustering Method
Data Summarization
BIRCH
Application of Clustering Methods to Image Processing
Conclusions and Research Directions
Classification
Definition
The task of classification is to assign an object into one of the
predefined categories.
Working Principle
Examples of Classification
Drawback
Introduction to Clustering
Cluster Analysis
Cluster analysis is the task of discovering the natural grouping(s) of a set of patterns, points, or objects [Jain, 2010].
[Jain, 2010] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651-666, 2010.
Definition
Let D = {x1 , x2 , . . . , xn } be a set of patterns called dataset, where
each xi is a pattern of N dimensions. A clustering π of D can be
defined as follows.
π = {C1, C2, . . . , Ck}, such that
  ∪_{i=1}^{k} Ci = D
  Ci ≠ ∅, i = 1..k
  Ci ∩ Cj = ∅, i ≠ j, i, j = 1..k
  sim(x1, x2) > sim(x1, y1), for x1, x2 ∈ Ci and y1 ∈ Cj, i ≠ j
Application Domains:
1 Biology: Clustering has been applied to genomic data to
group functionally similar genes.
2 Information Retrieval: Search results obtained by various
search engines such as Google, Yahoo can be clustered so that
related documents appear together in a cluster.
(Vivisimo search engine (http://vivisimo.com/) groups related
documents.)
3 Market Research: Entities (people, market, organizations) can
be clustered based on common features or characteristics.
4 Geological mapping, Bio-informatics, Climate, Web mining,
Image Processing
Similarity
Similarity between a pair of patterns x and y in D is a
mapping, sim(x, y ) : D × D → [0, 1].
The closer the value of sim(·) is to 1, the higher the similarity; it equals 1 when x = y.
Similarity(Contd..)
Jaccard Coefficient (J):
J = a11 / (N − a00),   (1)
where a11 = the number of positions i with xi = yi = 1, a00 = the number of positions i with xi = yi = 0, and N is the length of the binary vectors.
Let x = (1, 0, 0, 1, 1) and y = (0, 1, 0, 1, 0).
SMC = 2/5;  Jaccard Coefficient = 1/4
Similarity(Contd..)
Cosine Similarity: Let x and y be two document vectors. The similarity between x and y is expressed as the cosine of the angle between them:
cosine(x, y) = (x · y) / (||x|| ||y||)
Dissimilarity
Many clustering methods use dissimilarity measures to find clusters
instead of similarity measure.
Metric Space
Definition
M = (D, d) is said to be metric space if d is a metric on D, i.e.,
d : D × D → R≥0 , which satisfies following conditions.
For three patterns x, y , z ∈ D,
Non-negativity: d(x, y ) ≥ 0
Reflexivity: d(x, y ) = 0, if x = y
Symmetry: d(x, y ) = d(y , x)
Triangle inequality: d(x, y ) + d(y , z) ≥ d(x, z)
(i) C1 ⊆ C2 or C2 ⊆ C1, or (ii) C1 ∩ C2 = ∅
(Figure: a dataset and a clustering of the dataset)
Leader clustering algorithm:
i = 1, Ci = {x1}
For each x ∈ D \ {x1}:
  1 Find the nearest existing cluster Cmin such that d(x, Cmin) = min_{j=1..i} d(x, Cj)
  2 If d(x, Cmin) > τ and i < k, then
       i = i + 1; Ci = {x}
    else Cmin = Cmin ∪ {x}
  3 Repeat Step 1 and Step 2 until all patterns are assigned to clusters.
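A minimal Python sketch of this single-scan algorithm (illustrative; the distance to a cluster is taken as the distance to its leader, as in the leader method):

    import math

    def leader_clustering(D, tau, k):
        """D: list of points (tuples); tau: distance threshold; k: maximum number of clusters."""
        leaders = [D[0]]
        clusters = {0: [D[0]]}
        for x in D[1:]:
            dists = [math.dist(x, l) for l in leaders]
            j = min(range(len(leaders)), key=lambda i: dists[i])   # nearest existing cluster
            if dists[j] > tau and len(leaders) < k:
                leaders.append(x)                                  # x becomes a new leader
                clusters[len(leaders) - 1] = [x]
            else:
                clusters[j].append(x)                              # x follows the nearest leader
        return leaders, clusters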
Advantages:
1 Single dataset scan method.
2 Time complexity = O(kn)
Disadvantages:
1 The number of clusters has to be provided.
Drawbacks
Clustering results are order dependent.
The distance between followers of different clusters may be less than the distance between the corresponding leaders.
(Figure: followers x and y of leaders lx and ly lie close to each other even though d(lx, ly) > τ)
k-means Clustering (MacQueen, 1967)
Example (figure sequence showing successive k-means iterations)
Drawbacks
Bisecting k-means Algorithm
1 Initialize the list of clusters L = {C1}, where C1 = D.
2 repeat
  1 Remove a cluster from the list of clusters.
  2 Bisect the selected cluster using the k-means clustering method a number of times.
  3 Select the two clusters from the bisection with the lowest total SSE.
  4 Add these two clusters to the list of clusters L.
3 until the number of clusters in the list L is k.
Minimize SSE = Σ_{i=1}^{k} Σ_{x ∈ Ci} (ci − x)²,  where ck = (1/n) Σ_{x ∈ Ck} x (the mean of cluster Ck, n = |Ck|)

Minimize SAE = Σ_{i=1}^{k} Σ_{x ∈ Ci} |ci − x|,   where ck = the median of the objects in the cluster.
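A compact Python sketch of Lloyd's k-means iteration for the SSE objective above (illustrative only):

    import math
    import random

    def kmeans(D, k, iters=100):
        """D: list of equal-length tuples; returns cluster centres and the final assignment."""
        centres = random.sample(D, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for x in D:                                # assign each point to its nearest centre
                j = min(range(k), key=lambda i: math.dist(x, centres[i]))
                clusters[j].append(x)
            centres = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centres[i]
                       for i, cl in enumerate(clusters)]   # recompute centres as cluster means
        return centres, clusters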
Soft Clustering
Hierarchical Clustering
Hierarchical clustering methods create a sequence of clusterings π1, π2, π3, . . . , πp of a given dataset D.
They can produce the inherent nested (hierarchical) structure of clusters in the data.
A hierarchical clustering can be obtained in two ways:
1 Divisive (top-down) approach:
  Start with one cluster containing all points.
  At each step, split a cluster until each cluster contains a single point.
2 Agglomerative (bottom-up) approach:
  Start with the points as individual clusters.
  Merge the closest pair of clusters until only one cluster remains.
Complexity
Figure: Dendrogram for single-link clustering
Distance Updation
Complete Link:
d(Co, (Cx, Cy)) = (1/2) d(Co, Cx) + (1/2) d(Co, Cy) + 0 · d(Cx, Cy) + (1/2) |d(Co, Cx) − d(Co, Cy)|
               = max{ d(Co, Cx), d(Co, Cy) }
DBSCAN(contd..)
(Figure: core point, border point, and noise point for minpts = 4)
DBSCAN (Contd..)
(Figure: a cluster, a core point xc, a border point y, and noise points, with Minpts = 3)
DBSCAN
Drawbacks
(Figure: left, the dataset "NBC_paper_data.txt"; right, the DBSCAN output showing Cluster1, Cluster2, Cluster3, and Noise)
NBC
R − KNN(x) = {p ∈ D | x ∈ KNN(p)}
Figure: Clusters obtained by the NBC method (Cluster1–Cluster5 and Noise)
2 Hybrid Clustering
3 Data Summarization
Hybrid Method
Data Summarization
Two-Phase Method
1 Create a summary of a given dataset in the form of a CF tree.
CF tree
Figure: CF-tree
Clustering Features(CF)
Conclusions
Questions
THANK YOU
Margin
For a training set {(xi, yi)} with yi ∈ {+1, −1}, a separating linear classifier (w, b) satisfies
  w · xi + b ≥ +1   if yi = +1
  w · xi + b ≤ −1   if yi = −1
for all i, OR, equivalently, yi (w · xi + b) ≥ 1 for all i.
The two parallel hyperplanes are
  w · x + b = +1   and   w · x + b = −1.
The distance between the origin and the hyperplane w · x + b = 0 is |b| / ||w||.
Then, the margin (the distance between the two parallel hyperplanes) is
  margin = 2 / ||w||.
Then the problem is:
Minimize (objective function)
  (1/2) ||w||²
subject to  yi (w · xi + b) ≥ 1  for all i.
Note that the objective is convex and the constraints are linear, so the Lagrangian method can be applied.
Subject to the constraints yi (w · xi + b) − 1 ≥ 0, the Lagrangian is
  L(w, b, α) = (1/2) ||w||² − Σ_i αi [ yi (w · xi + b) − 1 ],   with αi ≥ 0,
where (w, b) are called the primal variables and the αi are the Lagrangian multipliers, which are also called dual variables.
L has to be minimized with respect to the primal variables (w, b). The K.K.T. conditions are:
1. ∂L/∂w = 0  ⇒  w = Σ_i αi yi xi
2. ∂L/∂b = 0  ⇒  Σ_i αi yi = 0
3. αi [ yi (w · xi + b) − 1 ] = 0 for all i, with αi ≥ 0 and yi (w · xi + b) − 1 ≥ 0 for all i.
If the objective is convex and the constraints are linear for all i, then it turns out that the K.K.T. conditions are "necessary and sufficient" for the optimal solution.
A function f is convex if
  f(λ a + (1 − λ) b) ≤ λ f(a) + (1 − λ) f(b)
for all points a, b and all λ ∈ [0, 1].
Substituting w = Σ_i αi yi xi and Σ_i αi yi = 0 back into the Lagrangian eliminates the primal variables. While it is possible to solve the primal problem directly, it is tedious; it is easier to work with the resulting (Wolfe) dual problem:
Maximize with respect to α
  L_D(α) = Σ_i αi − (1/2) Σ_i Σ_j αi αj yi yj (xi · xj)
such that Σ_i αi yi = 0 and αi ≥ 0 for all i.
We need to find the Lagrangian multipliers αi only.
In matrix form, L_D(α) = Σ_i αi − (1/2) αᵀ Q α, where Q is a matrix with its (i, j) entry being yi yj (xi · xj).
A pattern xi with αi > 0 is a support vector.
Note: a support vector xi lies on the hyperplane yi (w · xi + b) = 1.
Similarly, a pattern with αi = 0 does not lie on the hyperplane; that is, αi = 0 for interior points.
The classifier is
  f(x) = sign( w · x + b ) = sign( Σ_i αi yi (xi · x) + b ).
b can be found from (4): for a support vector xs, ys (w · xs + b) = 1. Multiplying with ys on both sides we get
  b = ys − w · xs.
So, the dual problem and the classifier involve the training patterns only through dot products; the solution is independent of the dimensionality of the input space.
(Source of this part: P. Viswanath, Indian Institute of Technology Guwahati.)
Non-linear SVM
We know that every non-linear function in the X-space (input space) can be seen as a linear function in an appropriate Y-space (feature space).
Let the mapping be φ : X → Y.
Once φ is defined, one has to replace each dot product (xi · xj) in the dual problem by (φ(xi) · φ(xj)).
While it is possible to explicitly define φ and generate the training set in the Y-space, and then obtain the solution, this is unnecessary: the dot product in the Y-space can be obtained as a function in the X-space itself. There is no need to explicitly generate the patterns in the Y-space.
Eg: Consider a two-dimensional problem with x = (x1, x2) and y = (y1, y2). Let K(x, y) = (x · y)². Then
  K(x, y) = (x1 y1 + x2 y2)² = φ(x) · φ(y),  with φ(x) = (x1², √2 x1 x2, x2²).
This kernel trick is one of the reasons for the success of SVMs.
We say K is a valid kernel iff there exists a φ such that K(x, y) = φ(x) · φ(y) for all x and y.
Mercer's Theorem gives the necessary conditions for a kernel to be valid.
While Mercer's theorem is a mathematically involved one, some of the properties of kernels can be used to verify whether a kernel is valid.
If K1 and K2 are kernels, then the following are also valid kernels:
1. K(x, y) = K1(x, y) + K2(x, y)
2. K(x, y) = c · K1(x, y), for a constant c > 0
3. K(x, y) = K1(x, y) · K2(x, y)
For any symmetric, positive semi-definite matrix B, K(x, y) = xᵀ B y is a valid kernel.
If K1 is a kernel, then exp(K1(x, y)) is a kernel; in particular, exp(−||x − y||² / (2σ²)) (called the Gaussian or RBF kernel) is a kernel.
Soft Margin
When the training set is not linearly separable, relax the constraints:
  w · xi + b ≥ +1 − ξi   if yi = +1
  w · xi + b ≤ −1 + ξi   if yi = −1
for all i, OR yi (w · xi + b) ≥ 1 − ξi, where the ξi ≥ 0 are called slack variables.
Now, the objective to minimize:
  (1/2) ||w||² + C Σ_i ξi
where C is called the penalty parameter, and C > 0.
The Lagrangian for the soft-margin problem is
  L = (1/2) ||w||² + C Σ_i ξi − Σ_i αi [ yi (w · xi + b) − 1 + ξi ] − Σ_i µi ξi,
with multipliers αi ≥ 0 and µi ≥ 0.
Soft Margin: KKT Conditions
  ∂L/∂w = 0  ⇒  w = Σ_i αi yi xi
  ∂L/∂b = 0  ⇒  Σ_i αi yi = 0
  ∂L/∂ξi = 0  ⇒  C − αi − µi = 0
  αi [ yi (w · xi + b) − 1 + ξi ] = 0  and  µi ξi = 0  for all i.
Wolfe Dual Formulation
Maximize w.r.t. α
  L_D(α) = Σ_i αi − (1/2) Σ_i Σ_j αi αj yi yj (xi · xj)
such that Σ_i αi yi = 0 and 0 ≤ αi ≤ C for all i.
Hard margin: the same dual, but with the constraint αi ≥ 0 only (no upper bound C).
The number of dual variables equals the number of training patterns.
These considerations have driven the design of specific algorithms for SVMs that can exploit the sparseness of the solution, the convexity of the optimization problem, and the implicit mapping into feature space.
One such simple and fast method is Sequential Minimal Optimization (SMO).
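As a closing illustration (not from the slides), scikit-learn's SVC, which solves this soft-margin dual with an SMO-style algorithm, can be used on a small non-linearly-separable dataset; the data and parameters below are placeholders:

    import numpy as np
    from sklearn.svm import SVC   # assumed available; solves the dual problem internally

    # Toy 2-D data that is not linearly separable in the input space (XOR-like).
    X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, y)
    print(clf.support_vectors_)       # the patterns with alpha_i > 0
    print(clf.dual_coef_)             # alpha_i * y_i for the support vectors
    print(clf.predict([[0.9, 0.1]]))  # classify a new point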