naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
This Semester
The course will run in hybrid mode (online and offline); this may change according to
University instructions [conditions apply]
We strongly encourage you to discuss Machine Learning topics with other students
Students are expected to produce their own work in the project and, when using the
work of others, include clear citations.
Failure to properly cite or attribute the work of others will impact your grade for
the course.
Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
2
Course Project
• We encourage you to form a group of 1-2 people [not more than 2]
Health care
Without prior permission, students cannot change their projects; if they do, it will impact
their grade for the course.
Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
4
Course Evaluation
• The remaining 20 points are for the mid-term and final-term theory exams.
• Students having less than 75% attendance will not be allowed to sit in the exam.
Attendance [20 Points] (>= 75%)
Class Participation [20 Points] [class behavior, camera on/off, questions answered, etc.]
Project Based Evaluation
Mid Term Exam [10 Points]: Students must submit their Project Status
Project Title: the title/topic cannot be changed after the midterm submission
Project Abstract : 200 ~ 500 Words
Literature Review:1000 ~ 5000 Words
Methodology: Requirement Analysis, Algorithm, Pseudocode, Flowchart
Final Term Exams [10 Points]: Students must submit the complete project report
Project Implementation: Coding
Project Results: Describe the results in detail [more than 1000 words]
Demonstration: Project Demo
Project Report [Plagiarism must be less than 2% from each reference]
Contents
• Introduction and Basic Concepts of Machine Learning: Supervised and Unsupervised
Learning Setup, Real-life applications, Linear Regression
• Introduction to Linear Algebra, Logistic Regression and its comparison with linear
regression
• Supervised (classification) approaches: KNN, Support Vector Machines
• Supervised (classification) approach: Decision Tree, Naïve Bayes, performance evaluation
• Unsupervised Approaches: K-means, K-medoid
• Unsupervised Approaches: hierarchical clustering algorithms
• Performance evaluation for Clustering algorithms: Cluster Validity Indices
• Dimensionality reduction technique: Principal Component Analysis (PCA)
• Feature Selection Models: Sequential forward and backward, Plus-l Minus-r, bidirectional,
floating selection
• Ensemble Models: Bagging and Boosting
• Multi-label Classification and Reinforcement Learning
• Semi-supervised classification and clustering
• Introduction to Deep Learning
*The instructor reserves the right to modify this schedule based on new information,
extenuating circumstances, or student performance.
Source Material
Text Books:
• R. Duda, P. Hart & D. Stork, Pattern Classification (2nd ed.), Wiley (Required)
• T. Mitchell, Machine Learning, McGraw-Hill (Recommended)
• Christopher M. Bishop: Pattern Recognition and Machine Learning, 2006.
• Shai Shalev-Shwartz and Shai Ben-David: Understanding Machine Learning: From
Theory to Algorithms, 2014
Web:
•http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/
•https://amfarahmand.github.io/csc311/
•https://www.cs.princeton.edu/courses/archive/fall16/cos402/
Slides and assignments will be posted on Google Classroom in a timely manner.
7
What We Talk About When We Talk About "Learning"
Learning general models from data of particular examples.
Example in retail: customer transactions (the data) are used to build a model that is a good
and useful approximation to the data.
8
Artificial Intelligence
9
What is Machine Learning?
The capability of Artificial Intelligence systems to learn by extracting patterns from data without
being explicitly programmed. Instead of writing code, you feed data to a generic algorithm, and it builds its own logic based on that data.
Automating automation
14
Difference b/w Artificial Intelligence And Machine Learning
“ AI is a bigger concept to create intelligent machines that can simulate
human thinking capability and behavior, whereas, machine learning is an
application or subset of AI that allows machines to learn from data without
being programmed explicitly.”
16
Growth of Machine Learning
Powerful Processing
Quicker Processing
Accurate
Inexpensive
18
Implementation Platform for Machine Learning
Python is a popular platform used for research and development of
production systems.
It is a vast language with a number of modules, packages and libraries that
provide multiple ways of achieving a task.
Python and its libraries like NumPy, Pandas, SciPy, Scikit-Learn, Matplotlib
are used in data science and data analysis.
They are also extensively used for creating scalable machine learning
algorithms.
Python implements popular machine learning techniques such as
Classification, Regression, Recommendation, and Clustering.
Python offers ready-made frameworks for performing data mining tasks on
large volumes of data effectively, in less time
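As a quick illustration of these libraries working together, here is a minimal sketch (assuming scikit-learn's bundled Iris data, so no external files are needed) that loads data with pandas, explores it, and fits a simple classifier:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Iris data into a pandas DataFrame for inspection
iris = load_iris(as_frame=True)
print(iris.frame.describe())              # quick summary statistics with pandas

# Split into train/test sets and fit a scikit-learn classifier
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))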
19
Machine Learning?
Machine Learning: the study of algorithms that improve their performance at some task with experience.
Role of Statistics: inference from a sample.
Role of Computer Science: efficient algorithms to represent, evaluate and optimize models. [**We will cover some examples in the next class]
Applying machine learning involves the following steps: defining a problem, selecting algorithms, improving results, and presenting results.
Algorithm types: Association Analysis; Supervised Learning (Classification, Regression/Prediction); Unsupervised Learning; Semi-supervised Learning; Reinforcement Learning.
21
Traditional Machine Learning
22
Machine Learning
23
ML in a Nutshell
Tens of thousands of machine learning algorithms
Representation
Evaluation
Optimization
24
Representation
Decision trees
Instances
Graphical models
Neural networks
Model ensembles
etc………
25
Evaluation
Possible evaluation measures:
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• Etc.
An Example: Let's consider a two-class problem where we have to classify an instance into two categories: Yes or No. Here, 'Actual' represents the original classes/labels provided in the data and 'Predicted' represents the classes predicted by an ML model.
26
Optimization
Combinatorial optimization
Convex optimization
Constrained optimization
Meta-heuristic Approach
27
Features of Machine Learning
Let us look at some of the features of Machine Learning. It focuses on the making of
algorithms that can access data and learn from it automatically.
28
Inductive Learning
29
ML in Practice
Learning is the process of converting experience into expertise or knowledge. In practice this involves understanding the domain and prior knowledge, learning models from the data, consolidating and deploying the discovered knowledge, and looping back to refine each step.
30
Machine Learning Algorithms
Supervised (inductive) learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
31
Machine Learning
Supervised learning Unsupervised learning
Instance-based learning
Bayesian learning
Neural networks
Model ensembles
Learning theory
32
Machine Learning
Applications
Association Analysis
Supervised Learning
Classification
Regression/Prediction
Unsupervised Learning
Reinforcement Learning
33
Machine Learning: Learning Associations
Basket analysis: estimate P(Y | X), the probability that a customer who buys product X also buys product Y.
Example: if P(chips | beer) is high, the store can place the two products close together or promote them jointly.
35
Supervised Learning
A majority of practical machine learning uses supervised learning.
In supervised learning, the system tries to learn from previous examples: given input
variables (x) and output variables (Y), it uses an algorithm to derive the mapping from inputs to outputs.
36
Supervised Learning
When an algorithm learns from example data and associated target responses, it is called
supervised learning. It is similar to learning under the supervision of a teacher: the teacher
provides good examples, and the student then derives general rules from these specific
examples.
Supervised Learning
38
Categories of Supervised learning
Supervised learning problems can be further divided into two parts, namely
Classification:
Regression:
39
Supervised Learning: Classification Problems
"Consists of taking input vectors and deciding which of the N classes they belong to."
The output is discrete (most of the time), i.e. an example belongs to precisely one class,
and the set of classes covers the whole possible output space.
Find 'decision boundaries' that can be used to separate out the different classes.
Given the features that are used as inputs to the classifier, we need to decide where to place the decision boundaries.
Example: credit scoring, i.e. differentiating between low-risk and high-risk customers from their income and
savings.
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
41
Classification Problems
42
Classification: Applications
Aka Pattern recognition
Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
Sensor fusion: Combine multiple modalities; e.g., visual (lip image) and acoustic
for speech
43
Regression Problems
Given the samples:
x         y
0.0000    0.0
0.5236    1.5
1.5708    3.0
2.0944    -2.5981
2.6180    1.5
2.6180    1.5
3.1416    0.0
To Find: y at x = 0.4
46
Supervised Learning: Uses
Example: decision trees are tools that create rules
Knowledge extraction:
Compression:
Outlier detection:
47
Unsupervised Learning
Association Analysis
Example applications
– Customer segmentation in CRM
– Image compression: Color
quantization
– Bioinformatics: Learning motifs
48
Reinforcement Learning
49
Reinforcement Learning
• Topics:
– Policies: what actions should an agent take in a particular
situation
– Utility estimation: how good is a state (used by policy)
• No supervised output but delayed reward
• Credit assignment problem (what was responsible for the outcome)
• Applications:
– Game playing
– Robot in a maze
– Multiple agents, partial observability, ...
50
51
Lecture-2, 3, (4)
Introduction to Machine Learning
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Project Topics
1. Fake News Detection
3. Emojify – Create your own emoji
4. Loan Prediction Project
7. Bitcoin Price Predictor Project
11. Movie Recommendation System Project
16. Color Detection with Python
18. Gender and Age Detection
19. Image Caption Generator Project
21. Edge Detection & Photo Sketching
26. Students can suggest their own
https://lionbridge.ai/datasets/18-websites-to-download-free-datasets-for-
machine-learning-projects/
https://www.kaggle.com/datasets
https://msropendata.com/datasets?domain=COMPUTER%20SCIENCE
https://medium.com/towards-artificial-intelligence/best-datasets-for-machine-
learning-data-science-computer-vision-nlp-ai-c9541058cf4f
3
What is Machine Learning?
The capability of Artificial Intelligence systems to learn by extracting patterns from data without
being explicitly programmed. Instead of writing code, you feed data to a generic algorithm, and it builds its own logic based on that data.
Traditional programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
5
Supervised Learning: The data and the goal
6
An example: data (loan application)
Approved or not
7
An example: the learning task
8
Supervised vs. unsupervised Learning
9
Supervised learning process: two steps
10
What do we mean by learning?
• Given
– a data set D,
– a task T, and
– a performance measure M,
a computer system is said to learn from D to perform task T if, after learning, its performance on T (as measured by M) improves.
An example
• Data: Loan application data
• Task: Predict whether a loan should be approved or not.
• Performance measure: accuracy.
13
Evaluating classification methods
• Predictive accuracy
• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability:
– understandable and insight provided by the model
• Compactness of the model: size of the tree, or the number of
rules.
14
Evaluation methods
• Holdout set: The available data set D is divided into two disjoint subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)
• Important: training set should not be used in testing and the test set
should not be used in learning.
– An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (the examples in the original
data set D are all labeled with classes.)
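A minimal sketch of the holdout method, assuming scikit-learn is available (the example data set and split size are illustrative, not from the slides):

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# D: a labeled data set; here a bundled example data set stands in for the loan data
X, y = load_breast_cancer(return_X_y=True)

# Split D into two disjoint subsets: Dtrain (for learning) and Dtest (for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn only on Dtrain
print("Holdout accuracy:", model.score(X_test, y_test))  # estimate accuracy only on Dtest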
15
Evaluation methods (cont…)
• n-fold cross-validation: The available data is partitioned into n equal-size
disjoint subsets.
• Use each subset as the test set and combine the rest n-1 subsets as the
training set to learn a classifier.
• The procedure is run n times, which gives n accuracies.
• The final estimated accuracy of learning is the average of the n accuracies.
• 10-fold and 5-fold cross-validations are commonly used.
• This method is used when the available data is not large.
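A sketch of n-fold cross-validation with scikit-learn (10 folds here; the data set and classifier are only examples):

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Partition the data into 10 equal-size disjoint folds; each fold is used once as the test set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)
print("Fold accuracies:", scores)
print("Estimated accuracy:", scores.mean())   # the average of the n accuracies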
16
Evaluation methods (cont…)
• Leave-one-out cross-validation: This method is used when
the data set is very small.
• It is a special case of cross-validation
• Each fold of the cross validation has only a single test example
and all the rest of the data is used in training.
• If the original data has m examples, this is m-fold cross-
validation
[Figure: a dataset of n instances, with one instance held out as testing data and the remaining n-1 instances used as training data in each fold]
17
Evaluation methods (cont…)
• Validation set: when a learning method has parameters to tune, part of the training data can be set aside as a validation set. In such cases, the parameter values that give the best accuracy on the validation set are used as the final parameter values.
18
Classification measures
19
Precision and recall measures
20
Precision and recall measures (cont…)
p = TP / (TP + FP)        r = TP / (TP + FN)
22
F1-value (also called F1-score): the harmonic mean of precision and recall, F1 = 2pr / (p + r)
23
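A small sketch computing these measures from a confusion matrix (the example labels and predictions are made up for illustration):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]   # 'Actual' labels (1 = Yes, 0 = No), illustrative only
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # 'Predicted' labels from some classifier

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
p = tp / (tp + fp)                         # precision
r = tp / (tp + fn)                         # recall
print("precision =", p, " recall =", r, " F1 =", 2 * p * r / (p + r))

# The same values via scikit-learn's helpers
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))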
Unsupervised Learning
24
Unsupervised Learning
• Examples:
– Find natural groupings of Xs (X=human languages, stocks, gene
sequences, animal species,…) Prelude to discovery of underlying
properties
– Summarize the news for the past month: cluster first, then report the
centroids.
– Sequence extrapolation: E.g. Predict cancer incidence next
decade; predict rise in antibiotic-resistant bacteria
• Methods
– Clustering (n-link, k-means, GAC,…)
– Taxonomy creation (hierarchical clustering)
– Many more ……
25
Clustering Words with Similar Meanings (Hierarchically )
[Arora-Ge-Liang-M.-Risteski,TACL’17,18]
26
Unsupervised learning
Unsupervised learning is used to detect anomalies and outliers, and to discover structure in unlabeled data.
Association: discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y".
Clustering: discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Unsupervised Learning
29
Supervised vs. Unsupervised
30
Classification vs Clustering
Classification: prediction of an object's category, with predefined classes.
Used for: spam filtering, language detection, a search of similar documents, sentiment analysis, recognition of handwritten characters and numbers, fraud detection.
Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbours, Support Vector Machine.

Clustering: a classification with no predefined classes.
Used for: market segmentation (types of customers, loyalty), merging close points on a map, image compression, analyzing and labeling new data, detecting abnormal behavior.
Popular algorithms: K-means clustering, Mean-Shift, DBSCAN.
31
Semi-Supervised Learning
32
Semi-Supervised Learning
Supervised Learning = learning from labeled data. Dominant paradigm in
Machine Learning.
• E.g., say you want to train an email classifier to distinguish spam from
important messages
• Take sample S of data, labeled according to whether they were/weren’t
spam.
• Train a classifier (like SVM, decision tree, etc) on S. Make sure it’s not
overfitting.
• Use to classify new emails.
33
Basic paradigm has many successes
• recognize speech,
• steer a car,
• classify documents
• classify proteins
• recognizing faces, objects in images
• ...
34
However, for many problems, labeled
data can be rare or expensive.
Need to pay someone to do it, requires special testing,…
36
Semi-Supervised Learning
Can we use unlabeled data to augment a small labeled sample to
improve learning?
But…But…But…
39
Semi-Supervised Learning
Substantial recent work in ML. A number of interesting methods have
been developed.
40
Reinforcement Learning
A computer program will interact with a dynamic environment in which it must achieve a certain goal; it receives rewards or penalties as feedback and learns which decisions to make. A well-known example is the Go-playing algorithm AlphaGo.
42
Reinforcement Learning
Policies:
Applications:
Game playing
Robot in a maze
43
Reinforcement Learning
Stands in the middle ground between supervised and unsupervised learning.
The reinforcement learner has to try out different strategies and see which ones work best.
In essence: trial-and-error learning guided by a delayed reward signal.
44
Reinforcement Learning
45
Reinforcement Learning
46
ML Proof Concept
47
An Example
Consider a problem:
How do we distinguish one species from the other, using features such as length, width, weight, number and shape of fins, tail shape, etc.?
An Example
• Suppose somebody at the fish plant tells us that:
– Sea bass is generally longer than a salmon
• Then our models for the fish:
– Sea bass have some typical length, and this is greater than
that for salmon.
Remember our model: sea bass have some typical length, and this is greater than that for salmon.
An Example
• From the histogram we can see that a single criterion is quite poor.
An Example
• It is obvious that length alone is not a good feature (see the table).
64
What is Linear and Slope???
Remember this: Y=mX+B?
Linear line
65
Linear Regression analysis
Linear regression analysis means “fitting a straight line to data”
Easy to use
Allows prediction
𝑌 is sometimes called the dependent variable and ‘𝑋𝑖’ the independent variables
Model inference
67
Linear regression
When all we have is a single predictor variable
Linear regression: one of the simplest and most commonly used statistical modeling
techniques
Makes strong assumptions about the relationship between the predictor variables (𝑋𝑖 ) and
the response (𝑌)
(a linear relationship, a straight line when plotted)
only valid for continuous outcome variables (not applicable to categorical outcomes such
as success/failure)
68
Linear Regression
Assumption: 𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error
Our task: estimate 𝛽0 and 𝛽1 based on the
available data
Resulting model is ŷ = β̂0 + β̂1 × x
the "hats" on the variables represent the fact that they are estimated from the available data
ŷ is read as "the estimator for y"
β0 and β1 are called the model parameters or coefficients
Objective: minimize the error, the difference
between our observations and the
predictions made by our linear model
minimize the length of the red lines in the figure to the right (called the "residuals")
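A minimal sketch of estimating β0 and β1 by ordinary least squares (the toy data below is made up for illustration):

import numpy as np

# Toy data assumed to follow y = beta0 + beta1*x + error
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates of the coefficients
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x           # fitted values
residuals = y - y_hat                       # the "red lines": observed minus predicted
print(beta0_hat, beta1_hat, np.sum(residuals ** 2))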
Supervised Learning: Housing Price Prediction
Given: a dataset that contains n samples (x^(1), y^(1)), …, (x^(n), y^(n))
e.g. the 15th sample is (x^(15), y^(15))
𝑥 = 800
𝑦=?
70
Logistic Regression for Machine Learning
Logistic regression is another technique borrowed by machine learning from the field of
statistics.
It is the go-to method for binary classification problems (problems with two class values).
Logistic Function
Logistic regression is named for the function used at the core of the method, the logistic
function.
The logistic function, also called the sigmoid function was developed by statisticians to
describe properties of population growth in ecology, rising quickly and maxing out at the
carrying capacity of the environment.
It’s an S-shaped curve that can take any real-valued number and map it into a value
between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform.
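A quick sketch of the logistic (sigmoid) function described above:

import numpy as np

def sigmoid(value):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-value))

print(sigmoid(-5), sigmoid(0), sigmoid(5))   # approaches 0, equals 0.5, approaches 1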
71
Regression Vs. Classification
Regression:
If 𝑦∈ℝ is a continuous variable, e.g., price prediction
Classification:
The label is a discrete variable, e.g., the task of predicting
the types of residence
(size, lot size) → y = house or townhouse?
73
Supervised Learning in Computer Vision
Image Classification
𝑥=raw pixels of the image,
𝑦=the main object
75
ImageNet Large Scale Visual Recognition Challenge, Russakovsky et al., 2015
Supervised Learning in Natural Language Processing
Note: this course only covers the basic and fundamental techniques
of supervised learning (which are not enough for solving hard vision
or NLP problems.)
76
Unsupervised Learning
Dataset contains no labels: 𝑥^(1), … 𝑥^(𝑛)
Goal (vaguely-posed): to find interesting structures in the data
Supervised vs. Unsupervised
77
78
Supervised approach: KNN and
Support Vector Machine
Dr. Naveen Saini
Assistant Professor
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Course Evaluation
Attendance [20 Points]: Online
Class Participation [20 Points] [class behavior, camera on/off, questions answered, etc.]
Mid Term Exam [20 Points]: Students must submit their Project Status
Project Title: the title/topic cannot be changed after the midterm submission
Project Abstract : 200 ~ 500 Words
Literature Review:1000 ~ 5000 Words
Methodology: Requirement Analysis, Algorithm, Pseudocode, Flowchart
Final Term Exams [20 Points]: Students must submit the complete project report
Project Implementation: Coding
Project Results: Describe the results in detail [more than 1000 words]
Demonstration: Project Demo
Project Report [Plagiarism must be less than 2% from each reference]
****Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences*** 2
Course Project
• We encourage you to form a group of 1-2 people [not more than 2]
Health care
Without prior permission, students cannot change their projects; if they do, it will impact
their grade for the course.
Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
3
Project Topics
1. Fake News Detection
3. Emojify – Create your own emoji
4. Loan Prediction Project
7. Bitcoin Price Predictor Project
11. Movie Recommendation System Project
16. Color Detection with Python
18. Gender and Age Detection
19. Image Caption Generator Project
21. Edge Detection & Photo Sketching
26. Students can suggest their own
Project Topics
No. Student Group No Project Title Abstract
What is Machine Learning?
The capability of Artificial Intelligence systems to learn by extracting patterns from data without
being explicitly programmed. Instead of writing code, you feed data to a generic algorithm, and it builds its own logic based on that data.
Instance-based Learning
10
K-Nearest Neighbor Learning: An Example
Here, the object (shown by ?) is unknown.
If K=1, the only neighbor is a cat. Thus, the unknown object => Cat
If K=4, the nearest neighbors contain one chicken and three cats. Thus, the unknown object => Cat (majority vote).
K-Nearest Neighbor Learning
Given a set of categories C={c1,c2,...cm}, also called classes (for e.g. {"male", "female"}).
LS={(o1,co1),(o2,co2),⋯(on,con)}
As it makes no sense to have less labeled items than categories, we can postulate that
n>m and in most cases even n⋙m (n much greater than m.)
Assign weights to the neighbors based on their ‘distance’ from the query
point
Shepard’s method
K-Nearest Neighbor Learning
Remarks
Highly effective inductive inference method for noisy training data and complex target functions.
Very problem-dependent.
Must try them all out (by changing the value of
K and distance measure) and see what works
best.
K-Nearest Neighbor Learning: Distance Metrics
we calculate the distances between the points of the sample and the object to be
classified. To calculate these distances we need a distance function.
Never tune K or the distance metric on the test set: very bad idea. The test set is a proxy for the generalization performance!
Use it only VERY SPARINGLY, at the end.
K-Nearest Neighbor Learning
Cross-validation
cycle through the choice of which fold
is the validation fold, average results.
K-Nearest Neighbor Learning: Deciding parameters
Example of
5-fold cross-validation
for the value of k.
import numpy as np
from sklearn import datasets

# Load the iris samples and their labels (the 'data' and 'labels' arrays assumed below)
iris = datasets.load_iris()
data = iris.data
labels = iris.target

# Use a permutation from np.random to split the data randomly
np.random.seed(42)
indices = np.random.permutation(len(data))
n_training_samples = 12    # number of samples held out at the end for the test set
learn_data = data[indices[:-n_training_samples]]      # learn set
learn_labels = labels[indices[:-n_training_samples]]
test_data = data[indices[-n_training_samples:]]       # test set
test_labels = labels[indices[-n_training_samples:]]

# Euclidean distance between two instances (the distance function used by the call below)
def distance(instance1, instance2):
    return np.linalg.norm(np.subtract(instance1, instance2))

# The function get_neighbors returns a list with the k neighbors closest to test_instance
def get_neighbors(training_set, labels, test_instance, k, distance):
    """
    get_neighbors calculates a list of the k nearest neighbors of an instance 'test_instance'.
    The function returns a list of k 3-tuples. Each 3-tuple consists of (sample, dist, label), where
    sample is the instance training_set[index], dist is the distance between test_instance and that
    instance, and label is its class; distance is a reference to a function used to calculate distances.
    """
    distances = []
    for index in range(len(training_set)):
        dist = distance(test_instance, training_set[index])
        distances.append((training_set[index], dist, labels[index]))
    distances.sort(key=lambda x: x[1])
    neighbors = distances[:k]
    return neighbors

# We will test the function with our iris samples: inspect the neighbors of the held-out data
for i in range(5):
    neighbors = get_neighbors(learn_data, learn_labels, test_data[i], 3, distance=distance)
    print("Index:         ", i)
    print("Testset Data:  ", test_data[i])
    print("Testset Label: ", test_labels[i])
    print("Neighbors:     ", neighbors, "\n")
23
Output of Python
program
24
K-Nearest Neighbor
Advantages: simple to implement, no explicit training phase, naturally handles multi-class problems.
Disadvantages: classification is slow for large training sets, and the result is sensitive to the choice of K, the distance metric, irrelevant features and feature scaling.
26
Classification: Definition
[Figure: a labeled Training Set (Tid, Attrib1, Attrib2, Attrib3, Class) is used to learn a model; the model is then applied to a Test Set with unknown class labels, e.g. Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?).]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
The blue circles in the plot represent females and the green squares represent
males. A few expected insights from the graph are:
Males in our population have a higher average height.
Females in our population have longer scalp hairs.
If we were to see an individual with height 180 cms and hair length 4 cms,
our best guess will be to classify this individual as a male. This is how we do
a classification analysis.
What is SVM?
How do we decide which is the best frontier for this particular problem statement?
The easiest way to interpret the objective function in an SVM is to find the minimum distance of the
frontier from the closest support vector (this can belong to any class).
For instance, the orange frontier is closest to the blue circles.
And the closest blue circle is 2 units away from the frontier.
Once we have these distances for all the frontiers, we simply choose the frontier with the maximum
distance (from the closest support vector).
Out of the three shown frontiers, we see the black frontier is farthest from nearest support vector
(i.e. 15 units).
What is SVM?
What if we do not find a clean frontier which segregates the classes?
Our job was relatively easier finding the SVM in this business case. What if the
distribution looked something like as follows :
Such cases will be covered once we start with the formulation of SVM.
For now, you can visualize that such transformation will result into following
type of SVM.
What is SVM?
Some of you may have selected the hyper-plane B as it has higher margin
compared to A.
But, here is the catch, SVM selects the hyper-plane which classifies the
classes accurately prior to maximizing margin.
Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A.
Support Vector Machine (SVM)
Can we classify two classes (Scenario-4)?: Below, I am unable to
segregate the two classes using a straight line, as one of the stars lies in the
territory of other(circle) class as an outlier.
As I have already mentioned, one star at other end is like an outlier for star
class. The SVM algorithm has a feature to ignore outliers and find the hyper-
plane that has the maximum margin. Hence, we can say, SVM classification
is robust to outliers.
Support Vector Machine (SVM)
Find the hyper-plane to segregate two
classes (Scenario-5): In the scenario
below, we can’t have linear hyper-plane
between the two classes, so how does
SVM classify these two classes? Till now,
we have only looked at the linear hyper-
plane.
plt.subplot(1, 1, 1)
# Predict the class of every point on the mesh grid (xx, yy) and shade the decision regions
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
Support Vector Machine (SVM)
Example: Use SVM rbf kernel
Change the kernel type to rbf in below line and look at the impact.
48
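A self-contained sketch of this experiment (the data set and parameters are illustrative; change the kernel type between 'linear' and 'rbf' in the SVC line and look at the impact):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm

# Two-feature toy problem so the decision regions can be plotted
X, y = datasets.make_moons(noise=0.2, random_state=0)

svc = svm.SVC(kernel='rbf', gamma=2.0, C=1.0).fit(X, y)   # try kernel='linear' vs kernel='rbf'

# Mesh grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.show()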
Sec. 15.1
Support Vector Machine (SVM)
• Distance from an example x to the separator is r = y (wTx + b) / |w|
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between the support vectors of the two classes.

Derivation of r:
The dotted line from x' to x is perpendicular to the decision boundary, so it is parallel to w. The unit vector in that direction is w/|w|, so the segment is rw/|w|, giving x' = x − y r w/|w|. Since x' lies on the boundary, it satisfies wTx' + b = 0, so wT(x − y r w/|w|) + b = 0. Recall that |w| = sqrt(wTw), so wTx − y r |w| + b = 0. Solving for r gives r = y (wTx + b) / |w|.
Sec. 15.1
Linear SVM Mathematically
The linearly separable case
• Assume that all data is at least distance 1 from the hyperplane, then the
following two constraints follow for a training set {(xi ,yi)}
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
• For support vectors, the inequality becomes an equality
• Then, since each example's distance from the hyperplane is r = y (wTx + b) / |w|,
• the margin is ρ = 2 / |w|.
53
Sec. 15.1
Linear Support Vector Machine (SVM)
• Separating hyperplane: wTx + b = 0
• The margin hyperplanes pass through the support vectors: wTxa + b = 1 and wTxb + b = -1
• This implies: wT(xa − xb) = 2
• Recall that |w| = sqrt(wTw); this implies ρ = ||xa − xb||2 = 2 / ||w||2
54
Linear SVMs Mathematically (cont.)
• Datasets that are linearly separable (with some noise) work out great.
• How about … mapping data to a higher-dimensional space (e.g. from x to (x, x2)):
63
Sec. 15.2.3
Non-linear SVMs: Feature spaces
Φ: x → φ(x)
64
Sec. 15.2.3
The “Kernel Trick”
65
What the kernel trick achieves
• These scalar products are the only part of the computation that
depends on the dimensionality of the high-dimensional space.
– So if we had a fast way to do the scalar products we would not
have to pay a price for solving the learning problem in the
high-D space.
K(xa, xb) = φ(xa) · φ(xb)
Letting the kernel do the work, instead of computing φ(xa) and φ(xb) and doing the scalar product in the obvious way.
Sec. 15.2.3
Kernels
• Why use kernels?
– Make non-separable problem separable.
– Map data into better representational space
• Common kernels
– Linear
– Polynomial: K(x, z) = (1 + xTz)^d
• Gives feature conjunctions
– Radial basis function (infinite dimensional space), also called the Gaussian RBF kernel:
K(x, y) = exp(−||x − y||² / (2σ²))
where σ is a parameter that the user must choose.
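A small sketch of these kernel functions in code (σ and d are the user-chosen parameters discussed above; the two sample vectors are arbitrary):

import numpy as np

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (1 + x^T z)^d"""
    return (1.0 + np.dot(x, z)) ** d

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])
print(polynomial_kernel(a, b), rbf_kernel(a, b))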
71
72
Lecture -8
Supervised approach: Decision Tree-
based Classification
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Classification: Definition
[Figure: a labeled Training Set (Tid, Attrib1, Attrib2, Attrib3, Class) is used to learn a model, which is then applied to a Test Set with unknown class labels, e.g. Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?).]
Examples of Classification Task
Splitting Attributes

Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single     70K  No
4   Yes  Married   120K  No
5   No   Divorced   95K  Yes
6   No   Married    60K  No
7   Yes  Divorced  220K  No
8   No   Single     85K  Yes
9   No   Married    75K  No
10  No   Single     90K  Yes

One decision tree that fits this data:
Refund? Yes → NO
Refund? No → MarSt?
    Single, Divorced → TaxInc? (< 80K → NO, > 80K → YES)
    Married → NO

There could be more than one tree that fits the same data!
Decision Tree Classification Task
[Figure: the Training Set is used to induce a Decision Tree model, which is then applied to the Test Set records with unknown class labels, e.g. Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?).]
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the test conditions:
Refund? → No → MarSt? → Married → leaf NO
Assign Cheat to "No".
Decision Tree Classification Task
[Figure: the Training Set is used to induce a Decision Tree model, which is then applied to the Test Set records with unknown class labels.]
Decision Tree Induction
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART (Classification and Regression Tree)
– ID3, C4.5
– SLIQ (Fast scalable algorithm for large
application)
Can handle both numeric and categorical attributes
– SPRINT (scalable parallel classifier for
datamining)
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
General procedure: if Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt; otherwise, use an attribute test condition to split Dt into smaller subsets and recursively apply the procedure to each subset (e.g. splitting on Refund, then on Marital Status: Single/Divorced vs. Married, until each leaf is Cheat or Don't Cheat).
Tree Induction
Greedy strategy
– Split the records based on an attribute test
that optimizes certain criterion
Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Splitting based on nominal attributes: e.g. Size split into {Small, Medium} vs. {Large}
Splitting Based on Continuous Attributes: a binary split (Taxable Income > 80K? Yes / No) or a multi-way split into ranges (< 10K, …, > 80K)
Greedy strategy.
– Split the records based on an attribute test
that optimizes certain criterion.
Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Greedy approach:
– Nodes with homogeneous class distribution are preferred
Need a measure of node impurity. Example: a node with class counts C0: 5, C1: 5 is non-homogeneous (high degree of impurity); a node with C0: 9, C1: 1 is homogeneous (low degree of impurity).
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
How to Find the Best Split
Before splitting, the node has class counts C0: N00 and C1: N01 and impurity M0. Candidate splits on attributes A and B produce child nodes with impurities M1, M2 (for A) and M3, M4 (for B), which combine into weighted impurities M12 and M34.
Gain = M0 − M12 vs. M0 − M34; higher gain (Gini gain) = better split.
Measure of Impurity: GINI
GINI(t) = 1 − Σ_j [p(j | t)]²

For a split into k partitions (child nodes), with n_i records in child i out of n records at the parent:
GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

Example (continuous attribute): counts of the "No" class on the two sides of each candidate split position, and the resulting weighted Gini:
No-counts:  0/7  1/6  2/5  3/4  3/4  3/4  3/4  4/3  5/2  6/1  7/0
Gini:       0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
The split position with the lowest weighted Gini (0.300) is chosen.
Alternative Splitting Criteria based on INFO
Entropy(t) = − Σ_j p(j | t) log2 p(j | t)
GAIN_split = Entropy(parent) − Σ_{i=1}^{k} (n_i / n) · Entropy(i)

Example: computing the Gini index for a binary split on attribute A
Parent node: C1 = 7, C2 = 3, Gini = 1 − 0.7² − 0.3² = 0.42
Split A? Yes → Node N1 (C1 = 3, C2 = 0); No → Node N2 (C1 = 4, C2 = 3)
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves (0.342 < 0.42)!
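A short sketch that reproduces this computation (the class counts are the ones from the example above):

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a node with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [7, 3]                 # C1 = 7, C2 = 3
n1, n2 = [3, 0], [4, 3]         # children after the split on A

n = sum(parent)
gini_children = (sum(n1) / n) * gini(n1) + (sum(n2) / n) * gini(n2)
print("Gini(parent)   =", round(gini(parent), 3))     # 0.42
print("Gini(children) =", round(gini_children, 3))    # about 0.34 -> the split improves purity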
Tree Induction
Greedy strategy
– Split the records based on an attribute test
that optimizes certain criterion
Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Practical Issues of Classification
Costs of Classification
Missing Values
Errors
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise
Lack of data points in the lower half of the diagram makes it difficult to
predict correctly the class labels of that region
- Insufficient number of training records in the region causes the decision
tree to predict the test examples using other training records that are
irrelevant to the classification task
Notes on Overfitting
Confusion matrix (ACTUAL CLASS vs. PREDICTED CLASS):

                      Predicted: Class=Yes       Predicted: Class=No
Actual: Class=Yes     a (TP, true positive)      b (FN, false negative)
Actual: Class=No      c (FP, false positive)     d (TN, true negative)

Metrics for Performance Evaluation…

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
Cost-sensitive variant: Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d), where w1…w4 weight the four cells of the confusion matrix.
Model Evaluation
Any Queries??
naveensaini@wsu.ac.kr
Unsupervised Learning: K-means and
K-medoid
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Course Evaluation
Attendance [20 Points]: Online
Class Participation [20 Points] [class behavior, camera on/off, questions answered, etc.]
Mid Term Exam [20 Points]: Students must submit their Project Status
Project Title: the title/topic cannot be changed after the midterm submission
Project Abstract : 200 ~ 500 Words
Literature Review:1000 ~ 5000 Words
Methodology: Requirement Analysis, Algorithm, Pseudocode, Flowchart
Final Term Exams [20 Points]: Students must submit the complete project report
Project Implementation: Coding
Project Results: Describe the results in detail [more than 1000 words]
Demonstration: Project Demo
Project Report [Plagiarism must be less than 2% from each reference]
****Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences*** 2
Course Project
• We encourage you to form a group of 1-2 people [not more than 2]
Health care
Without prior permission, students cannot change their projects; if they do, it will impact
their grade for the course.
Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
3
Project Topics
1. Fake News Detection
3. Emojify – Create your own emoji
4. Loan Prediction Project
7. Bitcoin Price Predictor Project
11. Movie Recommendation System Project
16. Color Detection with Python
18. Gender and Age Detection
19. Image Caption Generator Project
21. Edge Detection & Photo Sketching
26. Students can suggest their own
Project Topics
No. Student Group No Project Title Abstract
Unsupervised learning
It is the opposite of supervised learning.
7
Categories of Unsupervised learning
Unsupervised learning problems can be further divided into association and clustering problems.
Association: discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y".
Clustering: discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Supervised vs. Unsupervised
9
CLUSTERING
● Similarity measures:
○ Euclidean distance, Cosine similarity
● Main Objective:
○ High compactness
○ Maximize Separation
● Examples:
○ K-means
○ K-medoids
○ Hierarchical
10
Classification vs Clustering
Classification: prediction of an object's category, with predefined classes.
Used for: spam filtering, language detection, a search of similar documents, sentiment analysis, recognition of handwritten characters and numbers, fraud detection.
Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbours, Support Vector Machine.

Clustering: a classification with no predefined classes.
Used for: market segmentation (types of customers, loyalty), merging close points on a map, image compression, analyzing and labeling new data, detecting abnormal behavior.
Popular algorithms: K-means clustering, Mean-Shift, DBSCAN.
11
Classification vs. Clustering
Classification:
Supervised learning:
Learns a method for predicting the
instance class from pre-labeled
(classified) instances
Clustering
Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data
Clustering Algorithms
Jain, A. K., Murty, M. N., and Flynn, P. J., Data Clustering: A Survey.
ACM Computing Surveys, 1999. 31: pp. 264-323.
─ Fake news is being created and spread at a rapid rate due to technology
innovations such as social media.
─ Here, the clustering algorithm works by taking in the content of the fake
news article (the corpus), examining the words used, and then clustering
them.
─ Certain words are found more commonly in sensationalized, click-bait
articles. When you see a high percentage of specific terms in an article, it
gives a higher probability of the material being fake news.
Clustering Application: Document Analysis
• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
–distance measures
–high similarity within a cluster, low across clusters
The Distance Function
[Figure: data samples plotted in a two-dimensional feature space (X, Y)]
K-means example
Step 1: Pick 3 initial cluster centers k1, k2, k3 (randomly).
Step 2: Assign each point to the closest cluster center.
Step 3: Move each cluster center to the mean of its assigned points.
Step 4: Reassign the points that are now closest to a different cluster center. Q: Which points are reassigned? A: three points change clusters.
Step 4b: Re-compute the cluster means.
Step 5: Move the cluster centers to the new cluster means.
[Figures: scatter plots in the (X, Y) plane showing the centers k1, k2, k3 at each step]
K-means example, All steps in a single diagram
32
K-means Algorithm
Basic idea: randomly initialize the k cluster centers, and iterate between the two steps we just saw.
1. Randomly initialize the cluster centers c1, ..., cK
2. Given the cluster centers, determine the points in each cluster: for each point p, find the closest ci and put p into cluster i
3. Given the points in each cluster, solve for ci: set ci to be the mean of the points in cluster i
4. If any ci has changed, repeat from Step 2
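A compact sketch of these steps in NumPy (the toy data, K and iteration cap are illustrative; a production version would also handle empty clusters):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]          # Step 1: random initial centers
    for _ in range(n_iters):
        # Step 2: assign each point to the closest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 3: move each center to the mean of its cluster
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):                   # Step 4: stop when nothing moves
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [5, 5]])
centers, labels = kmeans(X, k=2)
print(centers)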
K-means Algorithm
K-means Algorithm
Squared Error Criterion
Pros and cons of K-Means
Python implementation of K-Means
Download Iris dataset from https://www.kaggle.com/uciml/iris
Python implementation of K-Means
• Visualizing the data using matplotlib :
38
Python implementation of K-Means
39
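The original code from these slides is not reproduced here; a minimal equivalent sketch using scikit-learn's bundled Iris data (instead of the Kaggle CSV) could look like this:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                       # 150 samples, 4 features

# Visualize two of the features with matplotlib
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()

# Run K-means with 3 clusters and plot the resulting assignment and centers
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=km.labels_)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker='x', s=200)
plt.show()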
Sample Output of Implementation
40
A Tutorial on K-means
https://matteucci.faculty.polimi.it/Clusterin
g/tutorial_html/AppletKM.html
Outliers
Advantages:
• Simple, understandable
• Items are automatically assigned to clusters
Disadvantages:
• Must pick the number of clusters beforehand
• All items are forced into a cluster
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
Python implementation of K-Medoid (1/2)
https://scikit-learn-
extra.readthedocs.io/en/stable/auto_examples/cluster/plot_kmedoids_di
gits.html#sphx-glr-auto-examples-cluster-plot-kmedoids-digits-py
52
Python implementation of K-Medoid (2/2)
53
Unsupervised Learning
How to choose a clustering algorithm
A vast collection of algorithms is available. Which one should we choose for our problem?
Every algorithm has limitations and works well only with certain data distributions, and it is hard to know which distribution the data actually follow. The data may not fully follow any "ideal" structure or distribution required by the algorithms.
One way to evaluate a clustering indirectly is to compare it against known classes that partition the data into disjoint subsets D1, D2, …, Dk and compute an agreement score.
59
Hierarchical Clustering Algorithms
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Unsupervised learning
It is the opposite of supervised learning.
2
Categories of Unsupervised learning
Unsupervised learning problems can be further divided into association and clustering problems.
Association: discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y".
Clustering: discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
CLUSTERING
● Similarity measures:
○ Euclidean distance, Cosine similarity
● Main Objective:
○ High compactness
○ Maximize Separation
● Examples:
○ K-means
○ K-medoids
○ Hierarchical
4
Supervised vs. Unsupervised
5
Classification vs Clustering
Classification: prediction of an object's category, with predefined classes.
Used for: spam filtering, language detection, a search of similar documents, sentiment analysis, recognition of handwritten characters and numbers, fraud detection.
Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbours, Support Vector Machine.

Clustering: a classification with no predefined classes.
Used for: market segmentation (types of customers, loyalty), merging close points on a map, image compression, analyzing and labeling new data, detecting abnormal behavior.
Popular algorithms: K-means clustering, Mean-Shift, DBSCAN.
6
Hierarchical Clustering
Algorithms
7
Introduction
8
Introduction
• Illustrative Example: agglomerative and divisive clustering on the data set {a, b, c, d, e}
Agglomerative (bottom-up): {a}, {b}, {c}, {d}, {e} → {a,b}, {c}, {d,e} → {a,b}, {c,d,e} → {a,b,c,d,e}
Divisive (top-down): the same steps in reverse order.
Two things to know: the cluster distance measure and the termination condition.
Cluster Distance Measures
single link
• Single link: smallest distance (min) between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
• Complete link: largest distance (max) between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
• Average link: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
10
Cluster Distance Measures
Example: Given a data set of five objects characterised by a single feature, assume that
there are two clusters: C1: {a, b} and C2: {c, d, e}.

Feature values: a = 1, b = 2, c = 4, d = 5, e = 6

1. Calculate the distance matrix.  2. Calculate the three cluster distances between C1 and C2.

Distance matrix:
    a  b  c  d  e
a   0  1  3  4  5
b   1  0  2  3  4
c   3  2  0  1  2
d   4  3  1  0  1
e   5  4  2  1  0

Single link:   dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2
Complete link: dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5
Average link:  dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
Agglomerative Algorithm
• The agglomerative algorithm is carried out in three steps:
1) Convert object attributes to a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters):
   – Merge the two closest clusters
   – Update the distance matrix
12
Example
• Problem: clustering analysis with agglomerative
algorithm
Compute the Euclidean distance between the objects in the data matrix to obtain the distance matrix (a symmetric matrix with zeros along the diagonal).
13
Example
Apply the agglomerative algorithm with single-link, complete-link and averaging cluster
distance measures to produce three dendrogram trees, respectively.
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
22
Example
Agglomerative Demo
23
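For the demo, a minimal sketch with SciPy (using the five one-dimensional objects a…e = 1, 2, 4, 5, 6 from the example above; the linkage method can be switched between 'single', 'complete' and 'average'):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The five objects a, b, c, d, e characterised by a single feature
X = [[1], [2], [4], [5], [6]]

Z = linkage(X, method='single')            # try 'complete' or 'average' as well
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.show()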
Google Colab Link
https://colab.research.google.com/drive/1XIriFb
6YCmKSvgr7j6f5io0lZ3IpdQUF?usp=sharing
24
Conclusions
25
Any Queries:
naveen@iiitl.ac.in
26
Dr. Naveen Saini
Assistant Professor
naveen@iiitl.ac.in https://sites.google.com/view/nsaini
1
Philosophy of PCA
Introduced by Pearson (1901) and Hotelling (1933)
to describe the variation in a set of multivariate
data (more than two variables) in terms of a set of
uncorrelated variables
C = | v(x1)       c(x1, x2)   ...   c(x1, xp) |
    | c(x1, x2)   v(x2)       ...   c(x2, xp) |
    | ...         ...         ...   ...       |
    | c(x1, xp)   c(x2, xp)   ...   v(xp)     |
Covariance matrix describes relationship between variables
It’s actually the sign of the covariance that matters :
•if positive then : the two variables increase or decrease together (correlated)
•if negative then : One increases when the other decreases (Inversely
correlated)
And so we find that the direction of most variance is given by the eigenvector e1 corresponding to the largest eigenvalue of the matrix C, and so on for the remaining components.
Some points
• Geometrically speaking, principal components represent the
directions of the data that explain a maximal amount of
variance, that is to say, the lines that capture most
information of the data.
For a 2×2 covariance matrix with unit variances and covariance c, the characteristic equation is det(C − λI) = (1 − λ)² − c²; solving it we find the eigenvalues λ1 and λ2.
How many components to keep?
Clearly, the second eigen value is very small compared to the first eigen
value.
So, the second eigen vector can be left out.
Eigen vector corresponding to the greatest eigen value is the principal
component for the given data set.
So. we find the eigen vector corresponding to eigen value λ1.
We use the following equation to find the eigen vector-
MX = λX
where-
• M = Covariance Matrix
• X = Eigen vector
• λ = Eigen value
Substituting the values in the above equation, we get-
Solving these, we get-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)
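A small sketch of the whole procedure in NumPy (toy data; np.linalg.eigh performs the eigen decomposition instead of solving MX = λX by hand):

import numpy as np

# Toy data matrix: rows are samples, columns are the original variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

Xc = X - X.mean(axis=0)                 # center the data
C = np.cov(Xc, rowvar=False)            # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues ascending, columns are eigenvectors
order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("eigenvalues:", eigvals)          # keep the components with the largest eigenvalues
pc1 = Xc @ eigvecs[:, :1]               # project the data onto the first principal component
print(pc1.ravel())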
https://www.slideshare.net/ParthaSarathiKa
r3/principal-component-analysis-75693461
https://builtin.com/data-science/step-step-
explanation-principal-component-analysis
Thank you!!
Any Queries??
naveensaini@wsu.ac.kr
DBSCAN Clustering Algorithms
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Unsupervised learning
It is the opposite of supervised learning.
2
Categories of Unsupervised learning
Unsupervised learning problems can be further divided into association and clustering problems.
Association: discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y".
Clustering: discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
CLUSTERING
● Similarity measures:
○ Euclidean distance, Cosine similarity
● Main Objective:
○ High compactness
○ Maximize Separation
● Examples:
○ K-means
○ K-medoids
○ Hierarchical
4
Supervised vs. Unsupervised
5
Classification vs Clustering
Classification: prediction of an object's category, with predefined classes.
Used for: spam filtering, language detection, a search of similar documents, sentiment analysis, recognition of handwritten characters and numbers, fraud detection.
Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbours, Support Vector Machine.

Clustering: a classification with no predefined classes.
Used for: market segmentation (types of customers, loyalty), merging close points on a map, image compression, analyzing and labeling new data, detecting abnormal behavior.
Popular algorithms: K-means clustering, Mean-Shift, DBSCAN.
6
Density-based Clustering
Algorithms
7
Density-based Approaches
8
DBSCAN: Density Based Spatial Clustering of
Applications with Noise
9
Density-Based Clustering
Basic Idea:
Clusters are dense regions in the data
space, separated by regions of lower
object density
Results of a k-medoid
algorithm for k=4
11
e-Neighborhood
• e-Neighborhood – Objects within a radius of e from
an object.
Ne(p) := {q | d(p, q) <= e}
• “High density” - ε-Neighborhood of an object contains
at least MinPts of objects.
[Figure: the e-neighborhoods of two points p and q]
Density of p is "high" (MinPts = 4); density of q is "low" (MinPts = 4).
Core, Border & Outlier
• Core point: has at least MinPts points within distance e.
• Border point: has fewer than MinPts points within e, but lies in the e-neighborhood of a core point.
• Outlier (noise) point: neither a core point nor a border point.
Example
Minpts = 3
Eps=radius
of the circles
14
Density-Reachability
Directly density-reachable
An object q is directly density-reachable from
object p if p is a core object and q is in p’s e-
neighborhood.
15
Density-reachability
Example (MinPts = 7): p is (indirectly) density-reachable from q through the intermediate core points p2 and p1; q is not density-reachable from p (density-reachability is not symmetric).
Density-Connectivity
Density-Connected
A pair of points p and q are density-connected
if they are commonly density-reachable from a
point o.
Density-connectivity is
symmetric
p q
o
17
Formal Description of Cluster
18
Review of Concepts
DBScan Algorithm
20
DBSCAN: The Algorithm
– Select an unprocessed point p and retrieve all points density-reachable from p with respect to e and MinPts.
– If p is a core point, a cluster is formed; if p is a border or noise point, move on to the next point.
– Continue the process until all of the points have been processed.
21
DBSCAN Algorithm: Example
• Parameter
• e = 2 cm
• MinPts = 3
for each o in D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
22
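A hedged sketch of the same idea with scikit-learn (eps and min_samples correspond to e and MinPts; the two-moons data is only illustrative):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps plays the role of e, min_samples of MinPts
labels = db.labels_                          # cluster ids; -1 marks NOISE points
print("clusters found:", len(set(labels) - {-1}), " noise points:", list(labels).count(-1))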
[Figure (MinPts = 5): starting from a core point P, cluster C1 is grown by repeatedly adding all points in the e-neighborhoods of its core points, e.g. expanding through P1.]
Example
e = 10, MinPts = 4
27
When DBSCAN Works Well
• Resistant to Noise
• Can handle clusters of different shapes and sizes
28
When DBSCAN Does NOT Work Well
(MinPts=4, Eps=9.92).
Original Points
(MinPts=4, Eps=9.75)
29
DBSCAN: Sensitive to Parameters
30
Determining the Parameters e and MinPts
• For each point p, compute its k-distance, e.g. 3-distance(p) = the distance from p to its 3rd nearest neighbor (a point inside a dense region has a small 3-distance; an isolated point q has a large 3-distance(q)).
31
Determining the Parameters e and MinPts
• Plot the sorted 3-distances of all objects and choose e at the first "valley" of the curve.
32
Determining the Parameters e and MinPts
• Problematic example: when the data contain clusters of very different densities (e.g. groups A, B, C; B', D', F, G; and D1, D2, G1, G2, G3), the sorted 3-distance plot shows no single clear valley, so one value of e cannot separate all clusters.
[Figure: sorted 3-distance plot annotated with these groups]
33
Density Based Clustering: Discussion
• Advantages
– Clusters can have arbitrary shape and size
– Number of clusters is determined automatically
– Can separate clusters from surrounding noise
• Disadvantages
– Input parameters may be difficult to determine
– In some situations very sensitive to input
parameter setting
34
35
Cluster Validation
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
What is Cluster Analysis?
• Understanding
• Structuring search results
• Suggesting related pages
• Automatic directory construction/update
• Finding near identical/duplicate pages
• Summarization
• Reduce the size of large data sets
Notion of a Cluster can be Ambiguous
• Partitional Clustering
• A division data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitional Clustering and Hierarchical Clustering (Dendrogram)
[Figure: the same points p1, p2, p3, p4 shown as a flat partition and as a nested dendrogram]
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster
is closer (or more similar) to every other point in the
cluster than to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a
cluster is closer (more similar) to the “center” of a cluster,
than to the center of any other cluster
– The center of a cluster is often a centroid, the average of
all the points in the cluster, or a medoid, the most
“representative” point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
8 contiguous clusters
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, which is separated
by low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and
when noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
2 Overlapping Circles
Types of Clusters: Objective Function
Quality of clustering:
– There is usually a separate “quality” function that measures
the “goodness” of a cluster
– It is hard to define “similar enough” or “good enough”
Answer is typically highly subjective
Requirements and Challenges
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
User may give constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Sec. 16.2
Issues for clustering
[Figure: a scatter plot of the data and the SSE plotted against the number of clusters K (K = 2…30); the knee of the SSE curve suggests a suitable K]
Internal Measures: SSE
Cohesion is measured by the within-cluster sum of squares:
WSS = Σ_i Σ_{x∈Ci} (x − m_i)²
Separation is measured by the between-cluster sum of squares:
BSS = Σ_i |Ci| (m − m_i)²
where |Ci| is the size of cluster i, m_i is its centroid and m is the overall mean.
Internal Measures: Cohesion and Separation
Example: SSE (data points 1, 2, 4, 5 on a line; overall mean m = 3; cluster centroids m1 = 1.5 and m2 = 4.5)
– BSS + WSS = constant

K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10
Internal Measures: Cohesion and Separation
[Figure: graph-based view of cohesion (links within a cluster) and separation (links between clusters)]
Internal Measures: Silhouette Coefficient
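The silhouette details are not reproduced on this slide; as a reminder, for each point the silhouette value is s = (b − a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster. A quick sketch with scikit-learn (Iris data and K range are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Compare clusterings with different K by their average silhouette coefficient
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))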
Feature Selection
naveen@iiitl.ac.in https://sites.google.com/view/nsaini
1
Feature Extraction/Selection
Objective
LECTURE 11: Sequential Feature Selection
• Feature extraction vs. Feature selection
• Search strategy and objective functions
• Objective functions
  – Filters
  – Wrappers
• Sequential search strategies
  – Sequential Forward Selection
  – Sequential Backward Selection
  – Plus-l Minus-r Selection
  – Bidirectional Search
  – Floating Search
Feature selection keeps a subset of the original features:
[x1, x2, …, xN] → [x_i1, x_i2, …, x_iM]
Feature extraction builds new features as a function of all the original ones:
[x1, x2, …, xN] → [y1, y2, …, yM] = f([x1, x2, …, xN])
Feature selection chooses the subset of M features that maximizes an objective function J:
{x_i1, x_i2, …, x_iM} = argmax over {M, i_m} of J({x_i | i = 1…N})
[Figure: filters score candidate feature subsets with an objective function inside the search loop, while wrappers score them by training and evaluating the ML/PR algorithm itself]
– Correlation-based measure: J(YM) = Σ_{i=1}^{M} ρic / Σ_{i=1}^{M} Σ_{j=i+1}^{M} ρij, where ρic is the correlation coefficient between feature 'i' and the class label and ρij is the correlation coefficient
between features 'i' and 'j'
n Non-Linear relation measures
g Correlation is only capable of measuring linear dependence. A more powerful measure is the mutual
information I(Yk;C)
J(YM) = I(YM; C) = H(C) − H(C | YM) = Σ_{c=1}^{C} ∫_{YM} P(YM, ωc) lg [ P(YM, ωc) / (P(YM) P(ωc)) ] dx
g The mutual information between the feature vector and the class label I(YM;C) measures the amount
by which the uncertainty in the class H(C) is decreased by knowledge of the feature vector H(C|YM),
where H(·) is the entropy function
n Note that mutual information requires the computation of the multivariate densities P(YM) and P(YM,ωC), which is
an ill-posed problem for high-dimensional spaces. In practice [Battiti, 1994], mutual information is replaced by a
heuristic like
J(YM) = Σ_{m=1}^{M} I(x_im; C) − β Σ_{m=1}^{M} Σ_{n=m+1}^{M} I(x_im; x_in)
• Wrappers
  – Advantages
    • Accuracy: wrappers generally achieve better recognition rates than filters since they are tuned to the specific interactions between the classifier and the dataset.
    • Ability to generalize: wrappers have a mechanism to avoid overfitting, since they typically use cross-validated measures of predictive accuracy.
  – Disadvantages
    • Slow execution: since the wrapper must train a classifier for each feature subset (or several classifiers if cross-validation is used), the method can become infeasible for computationally intensive learners.
    • Lack of generality: the solution is tied to the bias of the classifier used in the evaluation function; the "optimal" feature subset is specific to the classifier under consideration.
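A minimal wrapper objective sketched with scikit-learn (the dataset and classifier are placeholders, not from the slides): the score of a candidate subset is the cross-validated accuracy of the learning algorithm itself, which is exactly what makes wrappers accurate but slow:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def J_wrapper(subset):
    # Wrapper objective: cross-validated accuracy of a classifier
    # trained only on the candidate feature subset.
    scores = cross_val_score(KNeighborsClassifier(), X[:, list(subset)], y, cv=5)
    return scores.mean()

print(J_wrapper([0, 1]), J_wrapper([2, 3]))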
• where x_k are indicator variables that determine whether the k-th feature has been selected (x_k = 1) or not (x_k = 0)
• Solution: (IV) J(x3, x2, x4, x1) = 13
Sequential Backward Selection (SBS)
• Sequential Backward Selection works in the opposite direction of SFS
  – Starting from the full set, sequentially remove the feature x⁻ that results in the smallest decrease in the value of the objective function J(Y − x⁻)
  – Notice that removal of a feature may actually lead to an increase in the objective function, J(Y_k − x⁻) > J(Y_k). Such functions are said to be non-monotonic (more on this when we cover Branch and Bound).
• Algorithm
  1. Start with the full set Y_0 = X
  2. Remove the worst feature: x⁻ = argmax_{x ∈ Y_k} [J(Y_k − x)]
  3. Update Y_{k+1} = Y_k − x⁻; k = k + 1
  4. Go to 2
• Notes
  – SBS works best when the optimal feature subset has a large number of features, since SBS spends most of its time visiting large subsets
  – The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded
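A compact sketch of the SBS loop above, assuming some objective function J over feature subsets (for instance the J_wrapper sketch shown earlier); this is an illustration, not the slide's code:

def sbs(J, n_features, k_target):
    # Sequential Backward Selection: start from the full set and greedily
    # remove the feature whose removal gives the highest objective J(Y - x).
    Y = list(range(n_features))
    while len(Y) > k_target:
        best_score, best_x = max(
            ((J([f for f in Y if f != x]), x) for x in Y),
            key=lambda t: t[0])
        Y.remove(best_x)
    return Y

# Example (using the wrapper objective J_wrapper sketched earlier):
# print(sbs(J_wrapper, n_features=4, k_target=2))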
Plus-l Minus-r Selection (LRS)
• Algorithm
  1. If L > R, start with the empty set Y_0 = {∅}; otherwise start with the full set Y_0 = X
  2. Repeat L times (add the best feature): x⁺ = argmax_{x ∉ Y_k} [J(Y_k + x)]; Y_{k+1} = Y_k + x⁺; k = k + 1
  3. Repeat R times (remove the worst feature): x⁻ = argmax_{x ∈ Y_k} [J(Y_k − x)]; Y_{k+1} = Y_k − x⁻; k = k + 1
  4. Go to 2
• Notes
  – LRS attempts to compensate for the weaknesses of SFS and SBS with some backtracking capabilities
  – Its main limitation is the lack of a theory to help predict the optimal values of L and R
Bidirectional Search (BDS)
• Algorithm
  1. Start SFS with the empty set Y_F = {∅}
  2. Start SBS with the full set Y_B = X
  3. Select the best feature: x⁺ = argmax_{x ∉ Y_Fk, x ∈ Y_Bk} [J(Y_Fk + x)]; Y_F,k+1 = Y_Fk + x⁺
  4. Remove the worst feature: x⁻ = argmax_{x ∈ Y_Bk, x ∉ Y_F,k+1} [J(Y_Bk − x)]; Y_B,k+1 = Y_Bk − x⁻; k = k + 1
  5. Go to 3
Floating Search: Sequential Floating Forward Selection (SFFS)
• Algorithm
  1. Start with the empty set Y_0 = {∅}
  2. Select the best feature: x⁺ = argmax_{x ∉ Y_k} [J(Y_k + x)]; Y_k = Y_k + x⁺; k = k + 1
  3. Select the worst feature*: x⁻ = argmax_{x ∈ Y_k} [J(Y_k − x)]
  4. If J(Y_k − x⁻) > J(Y_k) then Y_{k+1} = Y_k − x⁻; k = k + 1; go to Step 3
     else go to Step 2
  *Notice that you will need to do some bookkeeping to avoid infinite loops
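In practice, scikit-learn's SequentialFeatureSelector provides plain forward and backward selection; it does not implement the floating variants, which need the extra add/remove bookkeeping described above. A hedged usage sketch (dataset and base classifier are placeholders):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection (SFS): grow the subset one feature at a time.
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("SFS picks:", sfs.get_support(indices=True))

# Backward selection (SBS): start from all features and shrink.
sbs_sel = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=2,
                                    direction="backward", cv=5)
sbs_sel.fit(X, y)
print("SBS picks:", sbs_sel.get_support(indices=True))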
• https://machinelearningmastery.com/feature-selection-with-numerical-
input-data/
• https://www.analyticsvidhya.com/blog/2020/10/a-comprehensive-guide-
to-feature-selection-using-wrapper-methods-in-python/
Thank you!!
Any Queries??
Dr. Naveen Saini
Assistant Professor
Ensemble Methods
• Rationale
• Combining classifiers
• Bagging
• Boosting
– Ada-Boosting
Rationale
• In any application, we can use several
learning algorithms; hyperparameters
affect the final learner
• The No Free Lunch Theorem: no single
learning algorithm always induces the most
accurate learner in every domain
• Try many and choose the one with the best
cross-validation results
Rationale
• On the other hand …
– Each learning model comes with a set of
assumptions and thus a bias
– Learning is an ill-posed problem (finite data):
each model converges to a different solution
and fails under different circumstances
– Why not combine multiple learners
intelligently? This may lead to improved
results
Rationale
• How about combining learners that always make
similar decisions?
– Advantages?
– Disadvantages?
• Complementary?
For example, with 25 independent base classifiers, each with error rate ε = 0.35, a majority vote is wrong only when 13 or more of them err:
  Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^{25−i} ≈ 0.06
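A quick check of this number, assuming 25 independent classifiers each with error rate ε = 0.35 and a simple majority vote:

from math import comb

eps, n = 0.35, 25
# The majority vote is wrong only if 13 or more of the 25 classifiers are wrong.
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                     for i in range(13, n + 1))
print(round(ensemble_error, 3))  # ~0.06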
Works if …
• Rationale
• Combining classifiers
• Bagging
• Boosting
– Ada-Boosting
Combining classifiers
• Examples: classification trees and neural
networks, several neural networks, several
classification trees, etc.
• Average results from different models
• Why?
– Better classification performance than
individual classifiers
– More resilience to noise
• Why not?
– Time consuming
– Overfitting
Why
• Why?
– Better classification performance than individual
classifiers
– More resilience to noise
• Besides avoiding the selection of the worst classifier
under a particular hypothesis, the fusion of multiple
classifiers can improve the performance of the best
individual classifiers
• This is possible if the individual classifiers make
“different” errors
• For linear combiners, Tumer and Ghosh (1996)
showed that averaging the outputs of individual
classifiers with unbiased and uncorrelated errors can
improve on the performance of the best individual
classifier and, for an infinite number of classifiers,
provides the optimal Bayes classifier
Different classifiers
Architecture: classifiers can be combined in serial, parallel, or hybrid arrangements.
Classifiers Fusion
• Fusion is useful only if the combined classifiers are
mutually complementary
• Majority vote fuser: the majority should always be
correct
Complementary classifiers
• Rationale
• Combining classifiers
• Bagging
• Boosting
– Ada-Boosting
Bagging
• Breiman, 1996
Original 1 2 3 4 5 6 7 8
Training set 1 2 7 8 3 7 6 3 1
Training set 2 7 8 5 6 4 2 7 1
Training set 3 3 6 2 7 5 6 2 2
Training set 4 4 5 1 4 6 4 3 8
Bagging
• Sampling (with replacement) according to a uniform
probability distribution
– Each bootstrap sample D has the same size as the original data.
– Some instances could appear several times in the same training set,
while others may be omitted.
Original 1 2 3 4 5 6 7 8
Training set 1 2 7 8 3 7 6 3 1
Training set 2 1 4 5 4 1 5 6 4
Training set 3 7 1 5 8 1 8 1 4
Training set 4 1 1 6 1 1 3 1 5
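A minimal sketch of bagging, illustrating the bootstrap resampling shown in the tables above plus majority voting; the decision-tree base learner and the iris data are placeholders, not part of the slides:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

models = []
for b in range(25):
    # Sample N indices with replacement: some instances repeat, others are left out.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate by majority vote over the 25 bootstrap models.
votes = np.array([m.predict(X) for m in models])          # shape (25, N)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("training accuracy of the bagged ensemble:", (majority == y).mean())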
Ada-Boosting
• Input:
– Training samples S = {(xi, yi)}, i = 1, 2, …, N
– Weak learner h
• Initialization
– Each sample has equal weight wi = 1/N
• For k = 1 … T
– Train weak learner hk according to weighted sample sets
– Compute classification errors
– Update sample weights wi
• Output
– Final model which is a linear combination of hk
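A compact AdaBoost.M1-style sketch of the loop above for a two-class problem, using decision stumps as the weak learner and labels mapped to {−1, +1}; this is a simplified illustration on synthetic data, not the exact slide algorithm:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = np.where(y == 1, 1, -1)                 # labels in {-1, +1}
N, T = len(X), 20
w = np.full(N, 1.0 / N)                     # equal initial sample weights
stumps, alphas = [], []

for k in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()                # weighted classification error
    if err == 0 or err >= 0.5:              # stop if the weak learner is perfect or useless
        break
    alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
    w *= np.exp(-alpha * y * pred)          # up-weight misclassified samples
    w /= w.sum()
    stumps.append(h)
    alphas.append(alpha)

# Final model: sign of the weighted sum of weak learners.
F = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, stumps)))
print("training accuracy:", (F == y).mean())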
Schematic of AdaBoost: reweighted samples are fed to successive weak learners h1(x), h2(x), h3(x), …, and the final prediction is sign[Σ of their weighted outputs].
• Classification
– AdaBoost.M1 (two-class problem)
– AdaBoost.M2 (multiple-class problem)
Bagging vs. Boosting
Training data: 1, 2, 3, 4, 5, 6, 7, 8
Figure: how Ada-Boosting, Arcing, and Bagging resample the training data.
Figure: error reduction for Arcing and Bagging (the box represents the reduction in error).
Noise
• Hurts boosting the most
Conclusions
• Performance depends on data and classifier
• In some cases, ensembles can overcome bias of
component learning algorithm
• Bagging is more consistent than boosting
• Boosting can give much better results on some
data
Thank you!!
Any Queries??
Multi-Label Classification
Dr. Naveen Saini
Assistant Professor
naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Multi-label Classification
∈ {yes, no}
Multi-label Classification
          K = 2          K > 2
L = 1     binary         multi-class
L > 1     multi-label    multi-output†
†
also known as multi-target, multi-dimensional.
2 Applications
3 Background
4 Problem Transformation
5 Algorithm Adaptation
6 Label Dependence
7 Multi-label Evaluation
Labels: wedding, accident, romance, horror, comedy, action, violent, …
i X1 X2 ... X1000 X1001 Y1 Y2 ... Y27 Y28
1 1 0 ... 0 1 0 1 ... 0 0
2 0 1 ... 1 0 1 0 ... 0 0
3 0 0 ... 0 1 0 1 ... 0 0
4 1 1 ... 0 1 1 0 ... 0 1
5 1 1 ... 0 1 0 1 ... 0 1
.. .. .. .. .. .. .. .. .. .. ..
. . . . . . . . . . .
120919 1 1 ... 0 0 0 0 ... 0 1
Labelling E-mails
2 Applications
3 Background
4 Problem Transformation
5 Algorithm Adaptation
6 Label Dependence
7 Multi-label Evaluation
Single-label (binary) classification: given input features x1, x2, x3, x4, x5, a classifier h predicts
  ŷ = h(x) = argmax_{y ∈ {0,1}} p(y | x)   (MAP estimate)
Multi-label Classification
Figure: a single input x connected to multiple binary outputs y1, y2, y3, y4.
2 Applications
3 Background
4 Problem Transformation
5 Algorithm Adaptation
6 Label Dependence
7 Multi-label Evaluation
Chain model over the labels y1, y2, y3, y4:
  p(y | x) ∝ p(x) Π_{j=1}^{L} p(y_j | x, y_1, …, y_{j−1})
and,
  ŷ = argmax_{y ∈ {0,1}^L} p(y | x)
CC Transformation
X Y1 X Y1 Y2 X Y1 Y2 Y3 X Y1 Y3 Y3 Y4
x(1) 0 x(1) 0 1 x(1) 0 1 1 x(1) 0 1 1 0
x(2) 1 x(2) 1 0 x(2) 1 0 0 x(2) 1 0 0 0
x(3) 0 x(3) 0 1 x(3) 0 1 0 x(3) 0 1 0 0
x(4) 1 x(4) 1 0 x(4) 1 0 0 x(4) 1 0 0 1
x(5) 0 x(5) 0 0 x(5) 0 0 0 x(5) 0 0 0 1
Example (greedy inference down the chain, one label at a time):
  1. ŷ1 = h1(x̃) = argmax_{y1} p(y1 | x̃) = 1   (p(y1 = 1 | x̃) = 0.6), so ŷ = h(x̃) = [1, ?, ?]
  2. ŷ2 = h2(x̃, ŷ1) = argmax_{y2} p(y2 | x̃, ŷ1) = 0   (p(y2 = 1 | x̃, ŷ1) = 0.3), so ŷ = h(x̃) = [1, 0, ?]
  3. ŷ3 = h3(x̃, ŷ1, ŷ2) = argmax_{y3} p(y3 | x̃, ŷ1, ŷ2) = 1   (p(y3 = 1 | x̃, ŷ1, ŷ2) = 0.6), so ŷ = h(x̃) = [1, 0, 1]
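scikit-learn's ClassifierChain implements this chain transformation; a hedged usage sketch on synthetic multi-label data (the base learner and the data generator are placeholders):

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=300, n_labels=3, n_classes=4,
                                      random_state=0)

# Link j is a binary classifier for y_j given x and the previously predicted labels.
cc = ClassifierChain(LogisticRegression(max_iter=1000), order=None, random_state=0)
cc.fit(X, Y)
print(cc.predict(X[:5]))   # one 0/1 vector of length 4 per instance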
Example
In the Enron dataset, 44% of labelsets are unique (a single
training example or test instance). In the del.icio.us dataset, 98%
of labelsets are unique.
RAkEL
X Y ∈ 2L
x(1) 0110
x(2) 1000
x(3) 0110
x(4) 1001
x(5) 0001
Ensemble Voting
Each model h_m(x̃) is trained on a random 3-label subset (y123, y124, y134, y234) and votes only for the labels it covers; the per-label vote fractions are then thresholded at 0.5:
  score: ŷ1 = 0.75, ŷ2 = 0.25, ŷ3 = 0.75, ŷ4 = 0
  ŷ = [1, 0, 1, 0]
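The vote-combination step itself is just a threshold on the per-label vote fractions; using the score row above:

import numpy as np

# Per-label vote fractions from the ensemble (values from the example above).
score = np.array([0.75, 0.25, 0.75, 0.0])

# Final multi-label prediction: threshold the averaged votes at 0.5.
y_hat = (score >= 0.5).astype(int)
print(y_hat)   # [1 0 1 0]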
2 Applications
3 Background
4 Problem Transformation
5 Algorithm Adaptation
6 Label Dependence
7 Multi-label Evaluation
Figure: instances in a 2-D feature space (x1, x2), coloured by labelset class c1, …, c6, with a new query point marked “?”.
Multi-label kNN
Assigns the most common labels of the k nearest neighbours
  p(y_j = 1 | x) = (1/k) Σ_{i ∈ N_k} y_j^(i)
Figure: instances in a 2-D feature space (x1, x2), each marked with its label vector (e.g. 000, 001, 010, 011, 101), and a query point “?” whose labels are predicted from its k nearest neighbours.
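A direct sketch of this estimate: average the label vectors of the k nearest neighbours and threshold at 0.5 (the data is synthetic and scikit-learn's NearestNeighbors is used only for the neighbour search; both are assumptions of this sketch):

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import NearestNeighbors

X, Y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=0)
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X)

def predict(x):
    # p(y_j = 1 | x) estimated as the fraction of the k neighbours with y_j = 1.
    _, idx = nn.kneighbors(x.reshape(1, -1))
    p = Y[idx[0]].mean(axis=0)
    return (p >= 0.5).astype(int), p

labels, probs = predict(X[0])
print(labels, probs)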
Figure: a multi-label decision tree that splits on x1 (> 0.3 or ≤ 0.3), then on x3 (> −2.9 or ≤ −2.9) and x2 (= A or = B), with a label vector predicted at each leaf.
2 Applications
3 Background
4 Problem Transformation
5 Algorithm Adaptation
6 Label Dependence
7 Multi-label Evaluation
mountain
foliage
urban
beach
1 0 0 0
1 1 0 0
0 0 0 0
0 1 1 1
  HAMMING LOSS = (1 / NL) Σ_{i=1}^{N} Σ_{j=1}^{L} I[ŷ_j^(i) ≠ y_j^(i)]
               = 0.20
0/1 Loss
Example
y(i) ŷ(i)
x̃(1) [1 0 1 0] [1 0 0 1]
x̃(2) [0 1 0 1] [0 1 0 1]
x̃(3) [1 0 0 1] [1 0 0 1]
x̃(4) [0 1 1 0] [0 1 0 0]
x̃(5) [1 0 0 0] [1 0 0 1]
  0/1 LOSS = (1 / N) Σ_{i=1}^{N} I(ŷ^(i) ≠ y^(i))
           = 0.60
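Both losses are a line of numpy each; the matrices below encode the five example instances shown above:

import numpy as np

Y_true = np.array([[1,0,1,0], [0,1,0,1], [1,0,0,1], [0,1,1,0], [1,0,0,0]])
Y_pred = np.array([[1,0,0,1], [0,1,0,1], [1,0,0,1], [0,1,0,0], [1,0,0,1]])

hamming = (Y_true != Y_pred).mean()                # average per-label error
zero_one = (Y_true != Y_pred).any(axis=1).mean()   # average per-example error
print(hamming, zero_one)                           # 0.2 and 0.6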
Other Metrics
JACCARD INDEX – often called multi-label ACCURACY
RANK LOSS – average fraction of pairs not correctly ordered
ONE ERROR – if top ranked label is not in set of true labels
COVERAGE – average “depth” to cover all true labels
LOG LOSS – i.e., cross entropy
PRECISION – predicted positive labels that are relevant
RECALL – relevant labels which were predicted
PRECISION vs. RECALL curves
F-MEASURE
micro-averaged (‘global’ view)
macro-averaged by label (ordinary averaging of a binary
measure, changes in infrequent labels have a big impact)
macro-averaged by example (one example at a time,
average across examples)
Hamming loss:
  – evaluation is by label, suitable for evaluating ŷ_j = argmax_{y_j ∈ {0,1}} p(y_j | x), i.e., BR
  – favours sparse labelling
  – does not benefit directly from modelling label dependence
0/1 loss:
  – evaluation is by example, suitable for evaluating ŷ = argmax_{y ∈ {0,1}^L} p(y | x), i.e., PCC, LP
  – does not favour sparse labelling
  – benefits from models of label dependence
HAMMING LOSS vs. 0/1 LOSS
y(i) ŷ(i)
x̃(1) [1 0 1 0] [1 0 0 1]
x̃(2) [1 0 0 1] [1 0 0 1]   HAM. LOSS 0.3
x̃(3) [0 1 1 0] [0 1 0 0]   0/1 LOSS 0.6
x̃(4) [1 0 0 0] [1 0 1 1]
x̃(5) [0 1 0 1] [0 1 0 1]
HAMMING LOSS vs. 0/1 LOSS
y(i) ŷ(i)
Optimize 0/1 LOSS …
x̃(1) [1 0 1 0] [0 1 0 1]
x̃(2) [1 0 0 1] [1 0 0 1]   HAM. LOSS 0.4
x̃(3) [0 1 1 0] [0 0 1 0]   0/1 LOSS 0.4
x̃(4) [1 0 0 0] [0 1 1 1]   … HAMMING LOSS goes up
x̃(5) [0 1 0 1] [0 1 0 1]
Overview [26]
Review/Survey of Algorithms [33]
Extensive empirical comparison [14]
Some slides: A, B, C
http://users.ics.aalto.fi/jesse/
Software & Datasets
Mulan (Java)
Meka (Java)
Scikit-Learn (Python) offers some multi-label support
Clus (Java)
LAMDA (Matlab)
Datasets
http://mulan.sourceforge.net/datasets.html
http://meka.sourceforge.net/#datasets
MEKA
http://meka.sourceforge.net
A MEKA Classifier
package weka.classifiers.multilabel;

import weka.core.*;

// NOTE: the class declaration was lost in the slide extract; in MEKA a custom
// classifier extends a multi-label base class in this package (assumption here).
public class ExampleClassifier extends MultilabelClassifier {

  /**
   * BuildClassifier
   */
  public void buildClassifier(Instances D) throws Exception {
    // the first L attributes are the labels
    int L = D.classIndex();
  }

  /**
   * DistributionForInstance - return the distribution p(y[j]|x)
   */
  public double[] distributionForInstance(Instance x) throws Exception {
    int L = x.classIndex();
    // predict 0 for each label
    return new double[L];
  }
}
References
Antonucci Alessandro, Giorgio Corani, Denis Mauá, and Sandra Gabaglio.
An ensemble of Bayesian networks for multilabel classification.
In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI’13, pages
1220–1225. AAAI Press, 2013.
Hanen Borchani.
Multi-dimensional classification using Bayesian networks for stationary and evolving streaming data.
PhD thesis, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de
Madrid, 2013.
Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencı́a, and Klaus Brinker.
Multilabel classification via calibrated label ranking.
Machine Learning, 73(2):133–153, November 2008.
Jason Weston, Olivier Chapelle, André Elisseeff, Bernhard Schölkopf, and Vladimir Vapnik.
Kernel dependency estimation.
In NIPS, pages 897–904, 2003.
Julio H. Zaragoza, Luis Enrique Sucar, Eduardo F. Morales, Concha Bielza, and Pedro Larrañaga.
Bayesian chain classifiers for multidimensional classification.
In 24th International Joint Conference on Artificial Intelligence (IJCAI ’11), pages 2192–2197, 2011.
• http://www.ecmlpkdd2015.org/sites/defa
ult/files/JesseRead.pdf
Any Queries:
naveen@iiitl.ac.in