
Lecture -1

Introduction to Machine Learning

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
This Semester

 Project Based Learning Course

 Run as a hybrid course (online and offline) [most probably, but this may change
according to university instructions; conditions apply]

 We strongly encourage you to discuss Machine Learning topics with other students

 Online Technology: WebEx/Google meet

 Assignments must be submitted on time [extensions may be given in special cases].

 Students are expected to produce their own work in project and, when using the
work of others, include clear citations.

 Failure to properly cite or attribute the work of others will impact your grade for
the course.
Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
2
Course Project
• We encourage you to form a group of 1-2 people [Not more than 2]

• List of potential topics

Information Retrieval
Computer Vision
General Machine Learning
Natural Language
Covid-19
Health care
Multi-modal data fusion
Finance & Commerce
Life Sciences
Physical Sciences
Smart home

 Students can suggest their own idea.

 Without prior permission, students cannot change their projects; if they do, it will impact
their grade for the course.

Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
4
Course Evaluation
• The remaining 20 points are for the mid-term and final-term theory exams.
• Students with less than 75% attendance will not be allowed to sit the exams.
 Attendance [20 Points] (>=75%)

 Four HomeWorks: 5 Points /Assignment [20 points]

 Class Participation [20 Points] [class behavior, camera on or off, questions answered or
not, etc.]
 Project Based Evaluation

 Mid Term Exam [10 Points]: Students must submit their Project Status
 Project Title: the title/topic cannot be changed after the mid-term submission
 Project Abstract : 200 ~ 500 Words
 Literature Review:1000 ~ 5000 Words
 Methodology: Requirement Analysis, Algorithm, Pseudocode, Flowchart
 Final Term Exam [10 Points]: Students must submit the complete Project Report
 Project Implementation: Coding
 Project Results: Describe the results in detail [more than 1000 words]
 Demonstration: Project Demo
 Project Report [Plagiarism must be less than 2% from each reference] 5
Contents
• Introduction and Basic Concepts of Machine Learning: Supervised and Unsupervised
Learning Setup, Real-life applications, Linear Regression
• Introduction to Linear Algebra, Logistic Regression and its comparison with linear
regression
• Supervised (classification) approaches: KNN, Support Vector Machines
• Supervised (classification) approach: Decision Tree, Naïve Bayes, performance evaluation
• Unsupervised Approaches: K-means, K-medoid
• Unsupervised Approaches: hierarchical clustering algorithms
• Performance evaluation for Clustering algorithms: Cluster Validity Indices
• Dimensionality reduction technique: Principal Component Analysis (PCA)
• Feature Selection Models: Sequential forward and backward, Plus-l Minus-r, bidirectional,
floating selection
• Ensemble Models: Bagging and Boosting
• Multi-label Classification and Reinforcement Learning
• Semi-supervised classification and clustering
• Introduction to Deep Learning

*The instructor reserves the right to modify this schedule based on new information,
extenuating circumstances, or student performance. 6
Source Material

Text Books:
• R. Duda, P. Hart & D. Stork, Pattern Classification (2nd ed.), Wiley (Required)
• T. Mitchell, Machine Learning, McGraw-Hill (Recommended)
• Christopher M. Bishop: Pattern Recognition and Machine Learning, 2006.
• Shai Shalev-Shwartz and Shai Ben-David: Understanding Machine Learning: From
Theory to Algorithms, 2014

Web:
•http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/
•https://amfarahmand.github.io/csc311/
•https://www.cs.princeton.edu/courses/archive/fall16/cos402/

Slides and assignments will be posted on Google Classroom in a timely manner.

7
What We Talk About When We Talk About “Learning”

 Learning general models from data of particular examples

 Data is cheap and abundant (data warehouses, data

marts); knowledge is expensive and scarce.

 Example in retail:

 People who bought “Bread” also bought

“Butter” (analyzed by learning from the past

data)

 Build a model that is a good and useful approximation

to the data.

8
Artificial Intelligence

• Artificial intelligence is the simulation of human intelligence


processes by machines, especially computer systems.

• Applications: Specific applications of AI include expert systems, information


retrieval (e.g., web page ranking), speech recognition and machine vision
(e.g., face detection), natural language processing (e.g., text summarization)

9
What is Machine Learning?
 The capability of Artificial Intelligence systems to learn by extracting

patterns from data is known as Machine Learning.

 Machine Learning is the idea of learning from examples and experience, without

being explicitly programmed. Instead of writing code, you feed data to the

generic algorithm, and it builds logic based on the data given.


*A Few Quotes
“A breakthrough in machine learning would be worth ten Microsofts”
(Bill Gates, Chairman, Microsoft)
 “Machine learning is the next Internet”
(Tony Tether, Director, DARPA)
 “Machine learning is the hot new thing”
(John Hennessy, President, Stanford)
 “Web rankings today are mostly a matter of machine learning”
(Prabhakar Raghavan, Dir. Research, Yahoo)
 “Machine learning is going to result in a real revolution”
(Greg Papadopoulos, CTO, Sun) 10
Machine Learning
 Machine learning is programming computers to optimize a performance criterion

using example data or past experience.

 Learning is used when:

 Human expertise does not exist (navigating on Mars)

 Humans are unable to explain their expertise (speech recognition)

 Solution changes in time (routing on a computer network)

 Solution needs to be adapted to particular cases (user biometrics)

 Automating automation

 Getting computers to program themselves

 Writing software is the bottleneck

 Let the data do the work instead!


11
Difference b/w Artificial Intelligence And Machine Learning
“ AI is a bigger concept to create intelligent machines that can simulate
human thinking capability and behavior, whereas, machine learning is an
application or subset of AI that allows machines to learn from data without
being programmed explicitly.”

14
Difference b/w Artificial Intelligence And Machine Learning
“ AI is a bigger concept to create intelligent machines that can simulate
human thinking capability and behavior, whereas, machine learning is an
application or subset of AI that allows machines to learn from data without
being programmed explicitly.”

Artificial Intelligence vs. Machine Learning:

• Artificial intelligence is a technology which enables a machine to simulate human behavior;
machine learning is a subset of AI which allows a machine to automatically learn from past
data without being programmed explicitly.
• The goal of AI is to make a smart computer system that solves complex problems like
humans; the goal of ML is to allow machines to learn from data so that they can give
accurate output.
• In AI, we make intelligent systems to perform any task like a human; in ML, we teach
machines with data to perform a particular task and give an accurate result.
• Machine learning and deep learning are the two main subsets of AI; deep learning is a
main subset of machine learning.
15
Sample Applications
 Web search
 Social networks
 Finance (stock market)
 Debugging
 Computational biology
 E-commerce
 Space exploration
 Robotics
 Information extraction
 [Your favorite area]

16
Growth of Machine Learning

 Machine learning is the preferred approach to


 Speech recognition,
 Computer vision
 Medical outcomes analysis
 Robot control
 Natural language processing

 This trend is accelerating


 Improved machine learning algorithms
 Improved data capture, networking, faster computers
 Software too complex to write by hand
 New sensors / IO devices
 Demand for self-customization to user, environment
 Automated Car 17
Benefits of Machine Learning

 Powerful Processing

 Better Decision Making & Prediction

 Quicker Processing

 Accurate

 Affordable Data Management

 Inexpensive

 Analyzing Complex Big Data

18
Implementation Platform for Machine Learning
 Python is a popular platform used for research and development of
production systems.
 It is a vast language with a number of modules, packages and libraries that
provide multiple ways of achieving a task.
 Python and its libraries like NumPy, Pandas, SciPy, Scikit-Learn, Matplotlib
are used in data science and data analysis.
 They are also extensively used for creating scalable machine learning
algorithms.
 Python implements popular machine learning techniques such as
Classification, Regression, Recommendation, and Clustering.
 Python offers ready-made frameworks for performing data mining tasks on
large volumes of data effectively in less time

19
Machine Learning?
 Machine Learning

 Study of algorithms that

 improve their performance

 at some task

 with experience

 Optimize a performance criterion using example data or past experience.

 Role of Statistics:

 Inference from the samples

 Role of Computer science: [**We will cover some examples in the next class]

 Efficient algorithms to

 Solve the optimization problem

 Representing and evaluating the model for inference 20


Steps Involved in Machine Learning
 A machine learning project involves the following steps:
  Defining a Problem
  Preparing Data
  Implementing and Evaluating Algorithms
  Improving Results
  Presenting Results

Algorithm types:
  Association Analysis
  Supervised Learning
   Classification
   Regression/Prediction
  Unsupervised Learning
  Semi-supervised Learning
  Reinforcement Learning

21
Traditional Machine Learning

22
Machine Learning

23
ML in a Nutshell
 Tens of thousands of machine learning algorithms

 Hundreds new every year

 Every machine learning algorithm has three components:

 Representation

 Evaluation

 Optimization

24
Representation
 Decision trees

 Sets of rules / Logic programs

 Instances

 Graphical models

 Neural networks

 Support vector machines (SVM)

 Model ensembles

etc………

25
Evaluation
 Accuracy
 Precision and recall
 Squared error
 Likelihood
 Posterior probability
 Cost / Utility
 Margin
 Entropy
 K-L divergence
 Etc.

An Example: Let’s consider a two-class problem where we have to classify an instance into
two categories: Yes or No. Here, ‘Actual’ represents the original classes/labels provided in
the data and ‘Predicted’ represents the classes predicted by an ML model.

26
Optimization
 Combinatorial optimization

 E.g.: Greedy search

 Convex optimization

 E.g.: Gradient descent

 Constrained optimization

 E.g.: Linear programming

 Meta-heuristic Approach

 E.g.: Evolutionary Algorithms

27
Features of Machine Learning
 Let us look at some of the features of Machine Learning.

 Machine Learning is computing-intensive and generally requires a large

amount of training data in case of supervised learning.

 It involves repetitive training to improve the learning and decision

making of algorithms.

 As more data gets added, Machine Learning training can be automated

for learning new data patterns and adapting its algorithm.

28
Inductive Learning

 Given examples of a function (X, F(X))

 Predict function F(X) for new examples X

 Discrete F(X): Classification

 Continuous F(X): Regression

 F(X) = Probability(X): Probability estimation

29
ML in Practice
 Learning is the process of converting experience into expertise or knowledge.

 Learning can be broadly classified into three categories, as mentioned below, based on the
nature of the learning data and the interaction between the learner and the environment.

A typical ML workflow in practice:
 Understanding the domain, prior knowledge, and goals
 Data integration, selection, cleaning, pre-processing, etc.
 Learning models
 Interpreting results
 Consolidating and deploying discovered knowledge
 Loop

30
Machine Learning Algorithms
 Supervised (inductive) learning

 Training data includes desired outputs

 Unsupervised learning

 Training data does not include desired outputs

 Semi-supervised learning

 Training data includes a few desired outputs

 Reinforcement learning

 Rewards from sequence of actions

31
Machine Learning
 Supervised learning
  Decision tree induction
  Rule induction
  Instance-based learning
  Bayesian learning
  Neural networks
  Support vector machines
  Model ensembles
  Learning theory

 Unsupervised learning
  Clustering
  Dimensionality reduction

32
Machine Learning
 Applications

 Association Analysis

 Supervised Learning

 Classification

 Regression/Prediction

 Unsupervised Learning

 Reinforcement Learning

33
Machine Learning: Learning Associations
 Basket analysis:

P (Y | X ) probability that somebody who buys X also buys Y where

X and Y are products/services.

 Example:

 P ( chips | beer ) = 0.7


Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
34
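A minimal sketch (not from the slides) of estimating such a conditional probability directly from the transactions listed above; item names follow the table and the function name is illustrative.

# Estimate P(Y | X) from the market-basket transactions above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def conditional_probability(x, y, baskets):
    """Fraction of baskets containing x that also contain y, i.e. an estimate of P(y | x)."""
    with_x = [b for b in baskets if x in b]
    return sum(y in b for b in with_x) / len(with_x) if with_x else 0.0

print(conditional_probability("Bread", "Milk", transactions))   # 3/4 = 0.75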
Supervised Learning: An Example

35
Supervised Learning
 A majority of practical machine learning uses supervised learning.

 In supervised learning, the system tries to learn from the previous examples

that are given.

 (On the other hand, in unsupervised learning, the system attempts to

find the patterns directly from the example given.)

 Speaking mathematically, supervised learning is where you have both input

variables (x) and output variables(Y) and can use an algorithm to derive the

mapping function from the input to the output.

 The mapping function is expressed as Y = f(X).

36
Supervised Learning
 When an algorithm learns from example data and associated target

responses that can consist of numeric values or string labels, such

as classes or tags, in order to later predict the correct response

when posed with new examples comes under the category of

Supervised learning.

 This approach is indeed similar to human learning under the

supervision of a teacher.

 The teacher provides good examples for the student to memorize,

and the student then derives general rules from these specific
37
examples.
Supervised Learning

38
Categories of Supervised learning
 Supervised learning problems can be further divided into two parts, namely

classification, and regression.

 Classification:

 A classification problem is when the output variable is a category or a

group, such as “black” or “white” or “spam” and “no spam”.

 Regression:

 A regression problem is when the output variable is a real value, such

as “Rupees” or “height.” Example: House price prediction

39
Supervised Learning: Classification Problems
“Consists of taking input vectors and deciding which of the N classes they

belong to, based on training from exemplars of each class.“

 Is discrete (most of the time). i.e. an example belongs to precisely one class,

and the set of classes covers the whole possible output space.

 How it's done:

 Find 'decision boundaries' that can be used to separate out the different

classes.

 Given the features that are used as inputs to the classifier, we need to

identify some values of those features that will enable us to decide

which class the current input belongs to


40
Supervised Machine Learning: Classification
 Example:

 Credit scoring

 Differentiating between low-risk and high-risk customers from their income and

savings

Discriminant:
IF income > θ1 AND savings >
θ2, THEN low-risk ELSE high-
risk

41
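A minimal sketch of the discriminant rule above; the thresholds theta1 and theta2 are illustrative values, not numbers from the slides.

def credit_risk(income, savings, theta1=50_000, theta2=20_000):
    # IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"

print(credit_risk(income=80_000, savings=30_000))   # low-risk
print(credit_risk(income=80_000, savings=5_000))    # high-risk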
Classification Problems

42
Classification: Applications
 Aka Pattern recognition

 Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style

 Character recognition: Different handwriting styles.

 Speech recognition: Temporal dependency.

 Use of a dictionary or the syntax of the language.

 Sensor fusion: Combine multiple modalities; eg, visual (lip image) and acoustic

for speech

 Medical diagnosis: From symptoms to illnesses

 Web Advertising: Predict if a user clicks on an ad on the Internet.

43
Regression Problems
x y
0 0
0.5236 1.5
1.5708 3.0
2.0944 -2.5981
2.6180 1.5
2.6180 1.5
3.1416 0

To Find: y at x=0.4

46
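The slides do not prescribe a method for estimating y at x = 0.4; one sketch, assuming either piecewise-linear interpolation or a low-degree least-squares fit over the table above is acceptable:

import numpy as np

x = np.array([0, 0.5236, 1.5708, 2.0944, 2.6180, 2.6180, 3.1416])
y = np.array([0, 1.5, 3.0, -2.5981, 1.5, 1.5, 0])

# Piecewise-linear interpolation between the two neighbouring samples.
print(np.interp(0.4, x, y))              # ~1.146

# Alternatively, fit a cubic polynomial by least squares and evaluate it at 0.4.
coeffs = np.polyfit(x, y, deg=3)
print(np.polyval(coeffs, 0.4))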
Supervised Learning: Uses
Example: decision trees are tools that create rules

 Prediction of future cases:

 Use the rule to predict the output for future inputs

 Knowledge extraction:

 The rule is easy to understand

 Compression:

 The rule is simpler than the data it explains

 Outlier detection:

 Exceptions that are not covered by the rule, e.g., fraud

47
Unsupervised Learning

 Learning “what normally happens”

 Uses no annotated data

 Clustering: Grouping similar instances

 Other applications: Summarization,

Association Analysis

 Example applications
– Customer segmentation in CRM
– Image compression: Color
quantization
– Bioinformatics: Learning motifs

48
Reinforcement Learning

49
Reinforcement Learning

• Topics:
– Policies: what actions should an agent take in a particular
situation
– Utility estimation: how good is a state (used by policy)
• No supervised output but delayed reward
• Credit assignment problem (what was responsible for the outcome)
• Applications:
– Game playing
– Robot in a maze
– Multiple agents, partial observability, ...

50
51
Lecture-2, 3, (4)
Introduction to Machine Learning

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Project Topics
1. Fake News Detection
2. Email Classification
3. Emojify – Create your own emoji
4. Loan Prediction Project
5. Housing Prices Prediction Project
6. Music Genre Classification Project
7. Bitcoin Price Predictor Project
8. Uber Data Analysis Project
9. Speech Emotion Recognition Project
10. Catching Illegal Fishing Project
11. Movie Recommendation System Project
12. Handwritten Digits Recognition Project
13. Road Lane Line Detection & Traffic Signs Recognition Project
14. Next word predictor Project
16. Color Detection with Python
17. Sentiment Analysis
18. Gender and Age Detection
19. Image Caption Generator Project in Python
20. Traffic Signs Recognition
21. Edge Detection & Photo Sketching
22. Object Detection
23. Image Segmentation
24. Hand Gesture Recognition
26. Students can suggest their own project
2
Project Topics
 Download Data Set

 https://lionbridge.ai/datasets/18-websites-to-download-free-datasets-for-

machine-learning-projects/

 https://www.kaggle.com/datasets

 https://msropendata.com/datasets?domain=COMPUTER%20SCIENCE

 https://medium.com/towards-artificial-intelligence/best-datasets-for-machine-

learning-data-science-computer-vision-nlp-ai-c9541058cf4f

3
What is Machine Learning?
 The capability of Artificial Intelligence systems to learn by extracting

patterns from data is known as Machine Learning.

 Machine Learning is the idea of learning from examples and experience, without

being explicitly programmed. Instead of writing code, you feed data to the

generic algorithm, and it builds logic based on the data given.


*A Few Quotes
“A breakthrough in machine learning would be worth ten Microsofts”
(Bill Gates, Chairman, Microsoft)
 “Machine learning is the next Internet”
(Tony Tether, Director, DARPA)
 “Machine learning is the hot new thing”
(John Hennessy, President, Stanford)
 “Web rankings today are mostly a matter of machine learning”
(Prabhakar Raghavan, Dir. Research, Yahoo)
 “Machine learning is going to result in a real revolution”
(Greg Papadopoulos, CTO, Sun) 4
Why “Learn”?
Traditional Programming:
  Data + Program → Computer → Output

Machine Learning:
  Data + Output → Computer → Program

5
Supervised Learning: The data and the goal

• Data: A set of data records (also called examples, instances or


cases) described by
– k attributes: A1, A2, … Ak.
– a class: Each example is labelled with a pre-defined class.

• Goal: To learn a classification model from the data that can be


used to predict the classes of new (future, or test)
cases/instances.

6
An example: data (loan application)
Approved or not

7
An example: the learning task

• Learn a classification model from the data


• Use the model to classify future loan applications into
– Yes (approved) and
– No (not approved)
• What is the class for following case/instance?

8
Supervised vs. unsupervised Learning

• Supervised learning: classification is seen as supervised


learning from examples.
– Supervision: The data (observations, measurements, etc.)
are labeled with pre-defined classes. It is as if a
“teacher” gives the classes (supervision).
– Test data are classified into these classes too.

• Unsupervised learning (clustering)


– Class labels of the data are unknown
– Given a set of data, the task is to establish the existence of
classes or clusters in the data

9
Supervised learning process: two steps

Learning (training): Learn a model using the training data


Testing: Test the model using unseen test data to assess the model accuracy

Accuracy = Number of correct classifications / Total number of test cases

10
What do we mean by learning?

• Given
– a data set D,
– a task T, and
– a performance measure M,

a computer system is said to learn from D to perform the task


T if after learning the system’s performance on T improves as
measured by M.

• In other words, the learned model helps the system to


perform T better as compared to no learning.

11
An example
• Data: Loan application data
• Task: Predict whether a loan should be approved or not.
• Performance measure: accuracy.

No learning: classify all future applications (test data) to the majority


class (i.e., Yes):
Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.
Fundamental assumption of learning

Assumption: The distribution of training examples is identical to


the distribution of test examples (including future unseen
examples).

• In practice, this assumption is often violated to a certain


degree.
• Strong violations will clearly result in poor classification
accuracy.

• To achieve good accuracy on the test data, training examples


must be sufficiently representative of the test data.

13
Evaluating classification methods

• Predictive accuracy

• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability:
– understandable and insight provided by the model
• Compactness of the model: size of the tree, or the number of
rules.

14
Evaluation methods

• Holdout set: The available data set D is divided into two disjoint subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)

• Important: training set should not be used in testing and the test set
should not be used in learning.
– An unseen test set provides an unbiased estimate of accuracy.

• The test set is also called the holdout set. (the examples in the original
data set D are all labeled with classes.)

• This method is mainly used when the data set D is large.

15
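A minimal holdout-evaluation sketch with scikit-learn; the dataset, classifier, and 70/30 split are illustrative choices, not prescribed by the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Disjoint training set Dtrain and test (holdout) set Dtest; Dtest is never used for learning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))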
Evaluation methods (cont…)
• n-fold cross-validation: The available data is partitioned into n equal-size
disjoint subsets.
• Use each subset as the test set and combine the rest n-1 subsets as the
training set to learn a classifier.
• The procedure is run n times, which gives n accuracies.
• The final estimated accuracy of learning is the average of the n accuracies.
• 10-fold and 5-fold cross-validations are commonly used.
• This method is used when the available data is not large.

16
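A minimal n-fold cross-validation sketch with scikit-learn (n = 5 here); the dataset and classifier are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # one accuracy per fold
print(scores.mean())   # final estimated accuracy = average of the n accuracies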
Evaluation methods (cont…)
• Leave-one-out cross-validation: This method is used when
the data set is very small.
• It is a special case of cross-validation
• Each fold of the cross validation has only a single test example
and all the rest of the data is used in training.
• If the original data has m examples, this is m-fold cross-
validation

Figure: a dataset of n instances; in each round, one instance is used as testing data and the rest as training data.

17
Evaluation methods (cont…)

• Validation set: the available data is divided into three subsets,


– a training set,
– a validation set and
– a test set.

• A validation set is used frequently for estimating parameters in learning


algorithms.

• In such cases, the values that give the best accuracy on the validation set
are used as the final parameter values.

• Cross-validation can be used for parameter estimating as well.

18
Classification measures

• Accuracy is only one measure (error = 1-accuracy). But, accuracy is not


suitable in some applications.

• In text mining, we may only be interested in the documents of a particular


topic, which are only a small portion of a big document collection.

• In classification involving skewed or highly imbalanced data, e.g., network


intrusion and financial fraud detections, we are interested only in the
minority class.
– High accuracy does not mean any intrusion is detected.
– E.g., 1% intrusion. Achieve 99% accuracy by doing nothing.
• The class of interest is commonly called the positive class, and the rest
negative classes.

19
Precision and recall measures

• Used in information retrieval and text classification.


• We use a confusion matrix to introduce them.

20
Precision and recall measures (cont…)

p = TP / (TP + FP)        r = TP / (TP + FN)

Precision p is the number of correctly classified positive examples


divided by the total number of examples that are classified as
positive.

Recall r is the number of correctly classified positive examples


divided by the total number of actual positive examples in the
test set. 21
An example

• This confusion matrix gives


– precision p = 100% and
– recall r = 1%
because we only classified one positive example correctly and
no negative examples wrongly.
• Note: precision and recall only measure classification on the
positive class.

22
F1-value (also called F1-score)

• It is hard to compare two classifiers using two measures. F1 score


combines precision and recall into one measure

• F1 is the harmonic mean of precision and recall: F1 = 2pr / (p + r).

• The harmonic mean of two numbers tends to be closer to the smaller of the two.
• For the F1-value to be large, both p and r must be large.

23
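A minimal sketch computing precision, recall, and F1 from TP/FP/FN counts; the counts below mirror the confusion-matrix example above (one true positive, no false positives, many false negatives) and are otherwise illustrative:

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1(tp=1, fp=0, fn=99))   # p = 1.0, r = 0.01, f1 ≈ 0.02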
Unsupervised Learning

Definition of Unsupervised Learning:

Learning useful structure without labeled classes,


optimization criterion, feedback signal, or any other
information beyond the raw data

In unsupervised learning, the algorithms are left to


themselves to discover interesting structures in
the data.

24
Unsupervised Learning

• Examples:
– Find natural groupings of Xs (X = human languages, stocks, gene sequences, animal
species, …) → prelude to discovery of underlying properties
– Summarize the news for the past month → cluster first, then report centroids.
– Sequence extrapolation: E.g. Predict cancer incidence next
decade; predict rise in antibiotic-resistant bacteria

• Methods
– Clustering (n-link, k-means, GAC,…)
– Taxonomy creation (hierarchical clustering)
– Many more ……

25
Clustering Words with Similar Meanings (Hierarchically )

[Arora-Ge-Liang-M.-Risteski,TACL’17,18]

26
Unsupervised learning
 Unsupervised learning is used to detect anomalies, outliers, such as

fraud or defective equipment, or to group customers with similar

behaviours for a sales campaign.

 It is the opposite of supervised learning.

 There is no labelled data here.

 When learning data contains only some indications without any

description or labels, it is up to the coder or to the algorithm to find the

structure of the underlying data, to discover hidden patterns, or to

determine how to describe the data.

 This kind of learning data is called unlabeled data. 27


Categories of Unsupervised learning
 Unsupervised learning problems can be further divided into association

and clustering problems.

 Association:

 An association rule learning problem is where you want to discover

rules that describe large portions of your data, such as “people that buy

X also tend to buy Y” (e.g., purchasing butter with bread/jam)

 Clustering:

 A clustering problem is where you want to discover the inherent

groupings in the data, such as grouping customers by purchasing

behavior. 28
Unsupervised Learning

29
Supervised vs. Unsupervised

30
Classification vs Clustering
 Classification – prediction of an object’s category, with predefined classes.
  Used for:
   Spam filtering
   Language detection
   A search of similar documents
   Sentiment analysis
   Recognition of handwritten characters and numbers
   Fraud detection
  Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest
Neighbours, Support Vector Machine

 Clustering – a classification with no predefined classes.
  Used for:
   Market segmentation (types of customers, loyalty)
   Merging close points on a map
   Image compression
   Analyzing and labelling new data
   Detecting abnormal behavior
  Popular algorithms: K-means clustering, Mean-Shift, DBSCAN

31
Semi-Supervised Learning

Supervised Learning = learning from labeled data. Dominant paradigm in


Machine Learning.
• E.g, say you want to train an email classifier to distinguish spam from
important messages

32
Semi-Supervised Learning
Supervised Learning = learning from labeled data. Dominant paradigm in
Machine Learning.
• E.g, say you want to train an email classifier to distinguish spam from
important messages
• Take sample S of data, labeled according to whether they were/weren’t
spam.
• Train a classifier (like SVM, decision tree, etc) on S. Make sure it’s not
overfitting.
• Use to classify new emails.

33
Basic paradigm has many successes

• recognize speech,
• steer a car,
• classify documents
• classify proteins
• recognizing faces, objects in images
• ...

34
However, for many problems, labeled
data can be rare or expensive.
Need to pay someone to do it, requires special testing,…

Unlabeled data is much cheaper.

Examples: speech, images, medical outcomes, customer modeling, protein sequences, web pages.
35
However, for many problems, labeled
data can be rare or expensive.
Need to pay someone to do it, requires special testing,…

Unlabeled data is much cheaper.


[From Jerry Zhu]

36
However, for many problems, labeled
data can be rare or expensive.
Need to pay someone to do it, requires special testing,…

Unlabeled data is much cheaper.

Can we make use of cheap


unlabeled data?

37
Semi-Supervised Learning
Can we use unlabeled data to augment a small labeled sample to
improve learning?

But unlabeled data is


missing the most
important info!!
But maybe still has useful
regularities that we can
use.

But…But…But…
39
Semi-Supervised Learning
Substantial recent work in ML. A number of interesting methods have
been developed.

• Several diverse methods for taking advantage of unlabeled data.


• General framework to understand when unlabeled data can help,
and make sense of what’s going on.

40
Reinforcement Learning
 A computer program will interact with a dynamic environment in

which it must perform a particular goal (such as playing a game with

an opponent or driving a car).

 The program is provided feedback in terms of rewards and

punishments as it navigates its problem space.

 Using this algorithm, the machine is trained to make specific

decisions.

 It works this way:

 The machine is exposed to an environment where it continuously

trains itself using trial and error method. 41


Reinforcement Learning
 Here learning data gives feedback so that the system adjusts to

dynamic conditions in order to achieve a certain objective.

 The system evaluates its performance based on the feedback

responses and reacts accordingly.

 The best-known instances include self-driving cars and the Go-playing

algorithm AlphaGo.

42
Reinforcement Learning
 Policies:

 What actions should an agent take in a particular situation

 Utility estimation: how good is a state (used by policy)

 No supervised output but delayed reward

 Credit assignment problem (what was responsible for the outcome)

 Applications:

 Game playing

 Robot in a maze

 Multiple agents, partial observability, ...

43
Reinforcement Learning
 Stands in the middle ground between supervised and unsupervised

learning.

 The algorithm is provided information about whether or not the

answer is correct but not how to improve it

 The reinforcement learner has to try out different strategies and see

which works best

 In essence:

 The algorithm searches over the state space of possible inputs

and outputs in order to maximize a reward

44
Reinforcement Learning

45
Reinforcement Learning

46
ML Proof Concept

47
An Example

Consider a problem:
How to distinguish one species from the other?
(length, width, weight, number and shape of
fins, tail shape, etc.)
An Example
• Suppose somebody at the fish plant tells us that:
– Sea bass is generally longer than a salmon
• Then our models for the fish:
– Sea bass have some typical length, and this is greater than
that for salmon.

• Then length becomes a feature,


• We might attempt to classify the fish by seeing whether or
not the length of a fish exceeds some critical value (threshold
value).
An Example
• How to decide the critical value (threshold value)?

– We could obtain some training samples of different types


of fish,
– make length measurements,
– Inspect the results.
An Example
• Measurement results on the training sample
related to two species.
An Example
• Can we reliably separate sea bass from salmon
by using length as a feature?

Remember our
model:
–Sea bass have
some typical
length, and
this is greater
than that for
salmon.
An Example
• From the histogram we can see that a single criterion
is quite poor.
An Example
• It is obvious that length is not a good feature.

• What can we do to separate sea bass from


salmon?
An Example

• What can we do to separate sea bass


from salmon?
• Try another feature:
– average lightness of the fish scales.
An Example
• Can we reliably separate sea bass from salmon
by using lightness as a feature?
An Example
• Lightness is better than length as a feature, but
again there are some problems.
An Example

• Suppose we also know that:


– Sea bass are typically wider than salmon.

• We can use more than one feature for our


decision:
– Lightness (x1) and width (x2)
An Example
• Each fish is now a point in two dimension.
– Lightness (x1) and width (x2)
Cost of error
• Cost of different errors must be considered when making
decisions,
• We try to make a decision rule so as to minimize such a cost,
• This is the central task of decision theory.

• For example, if the fish packing company knows that:


– Customers who buy salmon will object if they see sea bass
in their cans.
– Customers who buy sea bass will not be unhappy if they
occasionally see some expensive salmon in their cans.
Decision boundaries
• We can perform better if we use more
complex decision boundaries.
Decision boundaries
• There is a trade-off between the complexity of the decision rules
and their performance on unknown samples.

• Generalization: The ability of the classifier to produce


correct results on novel patterns.

• Simplify the decision boundary!


The design cycle
Supervised Learning: Linear Regression & Gradient
Descent
 Notation:

 m : Number of training examples

 x : Input variables (Features)

 y: Output variables (Targets)

 (x,y): Training Example (Represents 1 row on the table)

 (x^(i), y^(i)): ith training example (Represents the ith row on

the table)

 n : Number of features (Dimensionality of the input)

64
What is Linear and Slope???
Remember this: Y=mX+B?

Linear line

A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

65
Linear Regression analysis
 Linear regression analysis means “fitting a straight line to data”

 Also called linear modelling

 It’s a widely used technique to help model and understand real-world


phenomena

 Easy to use

 Easy to understand intuitively

 Allows prediction

 A regression problem is composed of

 An outcome or response variable ‘𝑌’

 A number of risk factors or predictor variables ‘𝑋𝑖’ that affect ‘𝑌’

 Also called explanatory variables, or features in the machine learning


community

 A question about ‘𝑌’, such as How to predict ‘𝑌’ under different


conditions?

 𝑌 is sometimes called the dependent variable and ‘𝑋𝑖’ the independent variables

 Not the same meaning as statistical independence

 Experimental setting where the ‘𝑋𝑖’ variables can be modified and


changes in ‘𝑌’ can be observed
66
Regression analysis: objectives
Prediction

We want to estimate ‘𝑌’ at some


specific values of ‘𝑋i’ (feature
values)

Model inference

We want to learn about the


relationship between ‘𝑌’ and ‘𝑋𝑖’ ,
such as the combination of predictor
variables which has the most effect
on ‘Y’

67
Linear regression
 When all we have is a single predictor variable
 Linear regression: one of the simplest and most commonly used statistical modeling
techniques
 Makes strong assumptions about the relationship between the predictor variables (𝑋𝑖 ) and
the response (𝑌)
 (a linear relationship, a straight line when plotted)
 only valid for continuous outcome variables (not applicable to category outcomes such
as success/failure)

68
Linear Regression
 Assumption: 𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error
 Our task: estimate 𝛽0 and 𝛽1 based on the
available data
 Resulting model is ŷ = β̂0 + β̂1 × x
 the “hats” on the variables represent
the fact that they are estimated from
the available data
 ̂𝑦 is read as “the estimator for 𝑦”
 𝛽0 and 𝛽1 are called the model parameters
or coefficients
 Objective: minimize the error, the difference
between our observations and the
predictions made by our linear model
 minimize the length of the red lines in
the figure to the right (called the
“residuals”) 69
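A minimal sketch of estimating β0 and β1 by ordinary least squares on synthetic, illustrative data; it mirrors the objective above of minimizing the squared residuals:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)   # true beta0 = 2, beta1 = 3, plus error

# Closed-form least-squares estimates of the coefficients.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)

y_hat = beta0_hat + beta1_hat * x   # the fitted model ŷ
residuals = y - y_hat               # the "red lines" (residuals) referred to above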
Supervised Learning: Housing Price Prediction
 Given: a dataset that contains n samples
{(x^(1), y^(1)), …, (x^(n), y^(n))}

 Task: if a residence has 𝑥 square feet, predict its price?

15th sample
(x^(15), y^(15))

𝑥 = 800
𝑦=?
70
Logistic Regression for Machine Learning
 Logistic regression is another technique borrowed by machine learning from the field of
statistics.
 It is the go-to method for binary classification problems (problems with two class values).
 Logistic Function
 Logistic regression is named for the function used at the core of the method, the logistic
function.
 The logistic function, also called the sigmoid function was developed by statisticians to
describe properties of population growth in ecology, rising quickly and maxing out at the
carrying capacity of the environment.
 It’s an S-shaped curve that can take any real-valued number and map it into a value
between 0 and 1, but never exactly at those limits.
 1 / (1 + e^-value)
 Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform.

71
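A minimal sketch of the logistic (sigmoid) function defined above:

import numpy as np

def sigmoid(value):
    """Map any real value into (0, 1): 1 / (1 + e^-value)."""
    return 1.0 / (1.0 + np.exp(-value))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ≈ [0.0067, 0.5, 0.9933]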
Regression Vs. Classification
 Regression:
 If 𝑦∈ℝ is a continuous variable, e.g., price prediction
 Classification:
 The label is a discrete variable, e.g., the task of predicting
the types of residence
(size, lot size) → house or townhouse?

𝑦=
House or
Townhouse?

73
Supervised Learning in Computer Vision
 Image Classification
 𝑥=raw pixels of the image,
 𝑦=the main object

ImageNet Large Scale Visual Recognition Challenge. Russakovskyet al.’2015


74
Supervised Learning in Computer Vision
 Object localization and detection
 𝑥=raw pixels of the image, 𝑦=the bounding boxes

75
ImageNet Large Scale Visual Recognition Challenge. Russakovskyet al.’2015
Supervised Learning in Natural Language Processsing

Note: this course only covers the basic and fundamental techniques
of supervised learning (which are not enough for solving hard vision
or NLP problems.)

76
Unsupervised Learning
 Dataset contains no labels: 𝑥^(1), … 𝑥^(𝑛)
 Goal (vaguely-posed): to find interesting structures in the data

Supervised vs. Unsupervised

77
78
Supervised approach: KNN and
Support Vector Machine
Dr. Naveen Saini
Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Course Evaluation
 Attendance [20 Points]: Online

 Four HomeWorks: 5 Points /Assignment [20 points]

 Class Participation [20 Points] [class behavior, camera on or off, questions answered or
not, etc.]

 Project Based Evaluation

 Mid Term Exam [20 Points]: Students must submit their Project Status
 Project Title: the title/topic cannot be changed after the mid-term submission
 Project Abstract : 200 ~ 500 Words
 Literature Review:1000 ~ 5000 Words
 Methodology: Requirement Analysis, Algorithm, Pseudocode, Flowchart
 Final Term Exam [20 Points]: Students must submit the complete Project Report
 Project Implementation: Coding
 Project Results: Describe the results in detail [more than 1000 words]
 Demonstration: Project Demo
 Project Report [Plagiarism must be less than 2% from each reference]

****Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences*** 2
Course Project
• We encourage you to form a group of 1-2 people [Not more than 2]

• List of potential topics

Information Retrieval
Computer Vision
General Machine Learning
Natural Language
Covid-19
Health care
Multi-modal data fusion
Finance & Commerce
Life Sciences
Physical Sciences
Smart home

 Students can suggest their own idea.

 Without prior permission, students cannot change their projects; if they do, it will impact
their grade for the course.

Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
3
Project Topics
1. Fake News Detection
2. Email Classification
3. Emojify – Create your own emoji
4. Loan Prediction Project
5. Housing Prices Prediction Project
6. Music Genre Classification Project
7. Bitcoin Price Predictor Project
8. Uber Data Analysis Project
9. Speech Emotion Recognition Project
10. Catching Illegal Fishing Project
11. Movie Recommendation System Project
12. Handwritten Digits Recognition Project
13. Road Lane Line Detection & Traffic Signs Recognition Project
14. Next word predictor Project
16. Color Detection with Python
17. Sentiment Analysis
18. Gender and Age Detection
19. Image Caption Generator Project in Python
20. Traffic Signs Recognition
21. Edge Detection & Photo Sketching
22. Object Detection
23. Image Segmentation
24. Hand Gesture Recognition
26. Students can suggest their own project
4
Project Topics
No. | Student Group No. | Project Title | Abstract
(Blank sign-up table, rows 1–33, to be filled in by student groups.)
What is Machine Learning?
 The capability of Artificial Intelligence systems to learn by extracting

patterns from data is known as Machine Learning.

 Machine Learning is the idea of learning from examples and experience, without

being explicitly programmed. Instead of writing code, you feed data to the

generic algorithm, and it builds logic based on the data given.


*A Few Quotes
“A breakthrough in machine learning would be worth ten Microsofts”
(Bill Gates, Chairman, Microsoft)
 “Machine learning is the next Internet”
(Tony Tether, Director, DARPA)
 “Machine learning is the hot new thing”
(John Hennessy, President, Stanford)
 “Web rankings today are mostly a matter of machine learning”
(Prabhakar Raghavan, Dir. Research, Yahoo)
 “Machine learning is going to result in a real revolution”
(Greg Papadopoulos, CTO, Sun) 7
Different Learning Methods
 Eager Learning

 a learning method in which the system tries to construct a general,

input-independent target function during training of the system

 Explicit description of target function on the whole training set

 Example: Support vector machine, decision tree, etc.

 Instance-based Learning

 Learning=storing all training instances

 Classification=assigning target function to a new instance

 Referred to as “Lazy” learning (generalization of the training data is

delayed until a query is made to the system)


K-Nearest Neighbor Learning
 K-NN is a typical approach of Instance-based Learning

It's very similar to a Desktop!!
K-Nearest Neighbor Learning

10
K-Nearest Neighbor Learning: An Example
 Here, the object (shown by ?) is unknown.

 If K=1, the only neighbor is a cat. Thus, the unknown object => Cat

 If K=4, the nearest neighbors contain one chicken and three cats. Thus, the

unknown object => Cat

11
K-Nearest Neighbor Learning
 Given a set of categories C={c1,c2,...cm}, also called classes (for e.g. {"male", "female"}).

There is also a learnset LS consisting of labelled instances:

LS={(o1,co1),(o2,co2),⋯(on,con)}

 As it makes no sense to have fewer labelled items than categories, we can postulate that

n>m and in most cases even n⋙m (n much greater than m.)

• The task of classification consists in assigning a category or class c to an arbitrary


instance/object o.

• For this, we have to differentiate between two cases:


• Case 1: The instance o is an element of LS, i.e. there is a tuple (o,c) ∈ LS
In this case, we will use the class c as the classification result.

Case2: We assume now that o is not in LS, or to be precise:


∀c ∈ C, (o,c) ∉ LS
• o is compared with all the instances of LS. A distance metric d is used for the
comparisons.
• We determine the k (user defined and constant) closest neighbors of o, i.e. the items
with the smallest distances. 12
K-Nearest Neighbor Learning

 Distance-Weighted Nearest Neighbor Algorithm

 Assign weights to the neighbors based on their ‘distance’ from the query

point

 Weight ‘may’ be inverse square of the distances

 All training points may influence a particular instance

 Shepard’s method
K-Nearest Neighbor Learning

 Remarks

 Highly effective inductive inference method for noisy training data and

complex target functions

 Target function for a whole space may be described as a combination of less

complex local approximations

 Learning is very simple

 Classification is time consuming


K-Nearest Neighbor Learning

What is the best distance to use?


What is the best value of k to use?

i.e. how do we set the hyperparameters?


K-Nearest Neighbor Learning

What is the best distance to use?


What is the best value of k to use?

i.e. how do we set the hyperparameters?

 Very problem-dependent.
 Must try them all out (by changing the value of
K and distance measure) and see what works
best.
K-Nearest Neighbor Learning: Distance Metrics
we calculate the distances between the points of the sample and the object to be
classified. To calculate these distances we need a distance function.

• Euclidean Distance: distance between two objects x and y

• Manhattan Distance: defined as sum of the absolute values of the differences


between the coordinates of x and y:

• Minkowski Distance: generalizes the Euclidean and the Manhattan distances in
one distance metric. If we set the parameter p in its formula to 1 we get the
Manhattan distance, and using the value 2 gives us the Euclidean distance
(see the sketch below):
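A minimal sketch of the three distance functions, assuming numeric feature vectors of equal length:

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def minkowski(x, y, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

print(euclidean([3, 5], [1, 1]), manhattan([3, 5], [1, 1]), minkowski([3, 5], [1, 1], 2))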
K-Nearest Neighbor Learning

Trying out what hyperparameters work best on test set: ???

Very bad idea. The test set is a proxy for the generalization performance!
Use only VERY SPARINGLY, at the end.
K-Nearest Neighbor Learning

Figure: 5-fold cross-validation (other numbers of folds are possible); one fold is held out as
validation data and used to tune hyperparameters.
K-Nearest Neighbor Learning

Cross-validation
cycle through the choice of which fold
is the validation fold, average results.
K-Nearest Neighbor Learning: Deciding parameters

Figure: example of 5-fold cross-validation over the value of k. Each point is a single outcome;
the line goes through the mean and the bars indicate the standard deviation.

(Seems that k ~= 7 works best


for this data)

NOTE: The value of K should be odd.


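A minimal sketch of tuning k by 5-fold cross-validation with scikit-learn; the dataset and the candidate values of k are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 9, 11]:   # odd values of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())     # pick the k with the best mean validation accuracy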
Python program for K-Nearest Neighbor Learning(1/2)
import numpy as np
from sklearn import datasets   # sklearn already ships some datasets to work with

iris = datasets.load_iris()    # 50 samples from each of three species of Iris:
data = iris.data               # Iris setosa, Iris virginica and Iris versicolor
labels = iris.target

for i in [0, 79, 99, 121]:
    print(f"index: {i:3}, features: {data[i]}, label: {labels[i]}")

np.random.seed(42)
indices = np.random.permutation(len(data))   # random permutation to split the data randomly

n_training_samples = 12        # number of samples held out for the test set
learn_data = data[indices[:-n_training_samples]]        # learnset
learn_labels = labels[indices[:-n_training_samples]]

test_data = data[indices[-n_training_samples:]]         # test set
test_labels = labels[indices[-n_training_samples:]]

print("The first samples of our learn set:")
print(f"{'index':7s}{'data':20s}{'label':3s}")
for i in range(5):
    print(f"{i:4d} {learn_data[i]} {learn_labels[i]:3}")

print("The first samples of our test set:")
print(f"{'index':7s}{'data':20s}{'label':3s}")
for i in range(5):
    print(f"{i:4d} {test_data[i]} {test_labels[i]:3}")
Python program for K-Nearest Neighbor Learning(2/2)
# The following function calculates the Euclidean distance between two
# instances using np.linalg.norm:
def distance(instance1, instance2):
    """Calculates the Euclidean distance between two instances."""
    return np.linalg.norm(np.subtract(instance1, instance2))

# Testing the above function
print(distance([3, 5], [1, 1]))
print(distance(learn_data[3], learn_data[44]))

# get_neighbors returns a list with the k neighbors closest to test_instance:
def get_neighbors(training_set, labels, test_instance, k, distance):
    """
    Calculates the k nearest neighbors of 'test_instance'.
    Returns a list of k 3-tuples (instance, dist, label), where instance is an element of
    training_set, dist is its distance to test_instance, and label is its class label.
    'distance' is a reference to the function used to calculate the distances.
    """
    distances = []
    for index in range(len(training_set)):
        dist = distance(test_instance, training_set[index])
        distances.append((training_set[index], dist, labels[index]))
    distances.sort(key=lambda x: x[1])
    neighbors = distances[:k]
    return neighbors

# Test the function on the iris test samples to inspect their nearest neighbors:
for i in range(5):
    neighbors = get_neighbors(learn_data, learn_labels, test_data[i], 3, distance=distance)
    print("Index: ", i, '\n', "Testset Data: ", test_data[i], '\n',
          "Testset Label: ", test_labels[i], '\n', "Neighbors: ", neighbors, '\n')
23
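The program above only lists the nearest neighbours; a minimal majority-vote step (not part of the original slides; vote is a hypothetical helper) could be added as follows:

from collections import Counter

def vote(neighbors):
    """Majority vote over the labels of the k nearest neighbours."""
    class_counter = Counter(label for _, _, label in neighbors)
    return class_counter.most_common(1)[0][0]

for i in range(5):
    neighbors = get_neighbors(learn_data, learn_labels, test_data[i], 3, distance=distance)
    print("predicted:", vote(neighbors), "actual:", test_labels[i])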
Output of Python
program

24
K-Nearest Neighbor

Advantage

• The algorithm is simple and easy to implement.

• There’s no need to build a model, tune several parameters, or


make additional assumptions.

Disadvantage

• The algorithm gets significantly slower as the number of


examples and/or predictors/independent variables increase.
Support Vector Machine
(A Supervised ML Algorithm)

26
Classification: Definition

• Given a collection of records (training set )


– Each record contains a set of attributes, one of the
attributes is the class
• Find a model for class attribute as a function
of the values of other attributes
• Goal: previously unseen records should be
assigned a class as accurately as possible
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into training
and test sets, with training set used to build the model
and test set used to validate it
Illustrating Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Figure: the training set is fed to a learning algorithm (induction) to learn a model; the model is
then applied to the test set (deduction).
Examples of Classification Task
• Predicting tumor cells as benign or malignant

• Classifying credit card transactions


as legitimate or fraudulent

• Classifying secondary structures of protein


as alpha-helix, beta-sheet, or random
coil

• Categorizing news stories as finance,


weather, entertainment, sports, etc
Classification Techniques

• Support Vector Machines


• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
Support Vector Machine
• A supervised Machine Learning
algorithm often used for classification
(also for regression challenges)

• In the SVM algorithm, we plot each data


item as a point in n-dimensional space
(where n is the number of features, with
the value of each feature being the
value of a particular coordinate.)

• Then, we perform classification by


finding the hyper-plane that
differentiates the two classes very well.

• Used for wide variety of applications: text classification, loan prediction,


weather prediction, etc.
What is SVM?
 What is a classification analysis?
 Let’s consider an example to understand these concepts.
 We have a population composed of 50%-50% Males and Females.
 Using a sample of this population, you want to create some set of rules
which will guide us to the gender class for the rest of the population.
 Using this algorithm, we intend to build a robot which can identify
whether a person is a Male or a Female.
 This is a sample problem of classification analysis.
 Using some set of rules, we will try to classify the population into two
possible segments.
 For simplicity, let’s assume that the two differentiating factors identified
are : Height of the individual and Hair Length.
 Following is a scatter plot of the sample.
What is SVM?

 The blue circles in the plot represent females and the green squares represent
males. A few expected insights from the graph are:
 Males in our population have a higher average height.
 Females in our population have longer scalp hairs.
 If we were to see an individual with height 180 cms and hair length 4 cms,
our best guess will be to classify this individual as a male. This is how we do
a classification analysis.
What is SVM?

 Support Vectors are simply the co-ordinates of individual observation.


 For instance, (45,150) is a support vector which corresponds to a female.
 Support Vector Machine is a frontier which best segregates the Male from
the Females.
 In this case, the two classes are well separated from each other, hence it is
easier to find a SVM.
 How to find the Support Vector Machine for case in hand?
 There are many possible frontier which can classify the problem in
hand. Following are the three possible frontiers.
How to find the Support Vector Machine for case
in hand?

 How do we decide which is the best frontier for this particular problem statement?
 The easiest way to interpret the objective function in an SVM is to find the minimum distance of the
frontier from the closest support vector (this can belong to any class).
 For instance, orange frontier is closest to blue circles.
 And the closest blue circle is 2 units away from the frontier.
 Once we have these distances for all the frontiers, we simply choose the frontier with the maximum
distance (from the closest support vector).
 Out of the three shown frontiers, we see the black frontier is farthest from nearest support vector
(i.e. 15 units).
What is SVM?
 What if we do not find a clean frontier which segregates the classes?
 Our job was relatively easier finding the SVM in this business case. What if the
distribution looked something like as follows :

 In such cases, we do not see a straight-line frontier directly in current plane


which can serve as the SVM.

 In such cases, we need to map these vector to a higher dimension plane so


that they get segregated from each other.

 Such cases will be covered once we start with the formulation of SVM.

 For now, you can visualize that such transformation will result into following
type of SVM.
What is SVM?

 Each of the green squares in the original distribution is mapped onto a transformed scale, and the transformed scale clearly segregates the classes.
How does it work?
 We got accustomed to the process of segregating the two classes with a hyper-plane.
 Now the burning question is “How can we identify the right hyper-plane?”. Don’t worry, it’s not as hard as
you think!
 Let’s understand:
 Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C).
 Now, identify the right hyper-plane to classify star and circle.
 You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane which
segregates the two classes better”.
 In this scenario, hyper-plane “B” has excellently performed this job.
Support Vector Machine (SVM)
 Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and C)
and all are segregating the classes well. Now, How can we identify the right hyper-
plane?
 Here, maximizing the distances between nearest data point (either class) and hyper-plane
will help us to decide the right hyper-plane. This distance is called as Margin. Let’s look at
the below snapshot:
 Above, you can see that the margin for hyper-plane C is high compared to both A and B. Hence, we name the right hyper-plane as C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, there is a high chance of misclassification.
Support Vector Machine (SVM)
 Identify the right hyper-plane (Scenario-3):Hint: Use the rules as
discussed in previous section to identify the right hyper-plane

 Some of you may have selected the hyper-plane B as it has higher margin
compared to A.
 But, here is the catch, SVM selects the hyper-plane which classifies the
classes accurately prior to maximizing margin.
 Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A.
Support Vector Machine (SVM)
 Can we classify two classes (Scenario-4)?: Below, I am unable to
segregate the two classes using a straight line, as one of the stars lies in the
territory of other(circle) class as an outlier.

 As I have already mentioned, one star at other end is like an outlier for star
class. The SVM algorithm has a feature to ignore outliers and find the hyper-
plane that has the maximum margin. Hence, we can say, SVM classification
is robust to outliers.
Support Vector Machine (SVM)
 Find the hyper-plane to segregate two
classes (Scenario-5): In the scenario
below, we can’t have linear hyper-plane
between the two classes, so how does
SVM classify these two classes? Till now,
we have only looked at the linear hyper-
plane.

 SVM can solve this problem easily! It solves it by introducing an additional feature. Here, we add a new feature z = x^2 + y^2. Now, let's plot the data points on the x and z axes (see the sketch below):
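A minimal sketch (not from the slides; the synthetic data and variable names are assumptions) of how the extra feature z = x^2 + y^2 turns a circular class boundary into one a linear SVM can handle:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# inner disc = stars (class 0), outer ring = circles (class 1)
radius = np.r_[rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)]
angle = rng.uniform(0, 2 * np.pi, 200)
x, y_coord = radius * np.cos(angle), radius * np.sin(angle)
labels = np.r_[np.zeros(100), np.ones(100)]

z = x ** 2 + y_coord ** 2            # the new feature from the slide
XZ = np.c_[x, z]                     # learn/plot on the x and z axes
clf = SVC(kernel='linear').fit(XZ, labels)
print(clf.score(XZ, labels))         # ~1.0: a linear hyper-plane now separates the classes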
Support Vector Machine (SVM)
 In above plot, points to consider are:
 All values for z would be positive always because z is the squared sum of both x and y
 In the original plot, red circles appear close to the origin of x and y axes, leading to lower value of z and star
relatively away from the origin result to higher value of z.
 In the SVM classifier, it is easy to have a linear hyper-plane between these two classes.
 But another burning question arises: do we need to add this feature manually to obtain a hyper-plane? No, the SVM algorithm has a technique called the kernel trick.
 The SVM kernel is a function that takes low dimensional input space and transforms it to a higher dimensional
space, i.e., it converts not separable problem to separable problem. It is mostly useful in non-linear separation
problem.
 Simply put, it does some extremely complex data transformations, then finds out the process to separate the
data based on the labels or outputs you’ve defined.
 When we look at the hyper-plane in original input space it looks like a circle:
Support Vector Machine (SVM)
Example: a linear SVM kernel

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features
y = iris.target

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100  # mesh step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# predict over the mesh and plot the decision regions
plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()
Support Vector Machine (SVM)
 Example: use the SVM rbf kernel
 Change the kernel type to rbf in the line below and look at the impact (note that gamma must be positive or 'auto'/'scale'; gamma=0 is not a valid value):

svc = svm.SVC(kernel='rbf', C=1.0, gamma='auto').fit(X, y)


Support Vector Machine (SVM)
 Pros:
 It works really well with a clear margin of separation
 It is effective in high dimensional spaces.
 It is effective in cases where the number of dimensions is greater than the
number of samples.
 It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
 Cons:
 It doesn’t perform well when we have large data set because the required training
time is higher
 It also doesn’t perform very well, when the data set has more noise, i.e., target
classes are overlapping
 SVM doesn’t directly provide probability estimates, these are calculated using an
expensive five-fold cross-validation. It is included in the related SVC method of
Python scikit-learn library.
Support Vector Machines

• The line that maximizes the minimum margin is a good bet.

• This maximum-margin separator is determined by a subset of the datapoints.
  – Datapoints in this subset are called "support vectors".
  – It will be useful computationally if only a small fraction of the datapoints are support vectors, because we use the support vectors to decide which side of the separator a test case is on.

(Figure: the support vectors are indicated by the circles around them.)
Ch. 15
Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0.

• Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
  – E.g., perceptron

• Support Vector Machine (SVM) finds an optimal* solution.
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

48
Sec. 15.1
Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.

• A.k.a. large margin classifiers

• The decision function is fully specified by a subset of training samples, the support vectors.

• Solving SVMs is a quadratic programming problem

• Seen by many as the most successful current text classification method*

  *but other discriminative methods often perform very similarly

(Figure: the support vectors sit on the margin; the maximum-margin hyperplane is preferred over one with a narrower margin.)

50
Sec. 15.1
Maximum Margin: Formalization

• w: decision hyperplane normal vector


• xi: data point i
• yi: class of data point i (+1 or -1)

• Classifier is: f(xi) = sign(wTxi + b)


• Functional margin of xi is: yi (wTxi + b)
– But note that we can increase this margin simply by scaling w, b….
• Functional margin of dataset is twice the minimum
functional margin for any point
– The factor of 2 comes from measuring the whole width of the
margin
51
Sec. 15.1
Geometric Margin

• Distance from an example to the separator is  r = y (wᵀx + b) / ‖w‖
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between the support vectors of the classes.

Derivation of finding r:
The dotted line x′ − x is perpendicular to the decision boundary, so it is parallel to w.
The unit vector is w/‖w‖, so the line is rw/‖w‖.
x′ = x − y r w/‖w‖.
x′ satisfies wᵀx′ + b = 0.
So wᵀ(x − y r w/‖w‖) + b = 0.
Recall that ‖w‖ = sqrt(wᵀw).
So wᵀx − y r ‖w‖ + b = 0.
Solving for r gives:  r = y (wᵀx + b) / ‖w‖
52
Sec. 15.1
Linear SVM Mathematically
The linearly separable case

• Assume that all data is at least distance 1 from the hyperplane, then the
following two constraints follow for a training set {(xi ,yi)}

wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
• For support vectors, the inequality becomes an equality
• Then, since each example's distance from the hyperplane is  r = y (wᵀx + b) / ‖w‖

• The margin is:  ρ = 2 / ‖w‖

53
Sec. 15.1
Linear Support Vector Machine (SVM)

• For the two support vectors xa and xb on either side of the separator wᵀx + b = 0:
  wᵀxa + b = 1
  wᵀxb + b = −1

• Extra scale constraint:  min_{i=1,…,n} |wᵀxi + b| = 1

• This implies:  wᵀ(xa − xb) = 2
  Recall that ‖w‖ = sqrt(wᵀw).
  This implies  ρ = ‖xa − xb‖₂ = 2 / ‖w‖₂

54
Linear SVMs Mathematically (cont.)

• Then we can formulate the quadratic optimization problem:

  Find w and b such that ρ = 2/‖w‖ is maximized, and for all {(xi, yi)}:
    wᵀxi + b ≥ 1 if yi = 1;   wᵀxi + b ≤ −1 if yi = −1

• A better formulation (min ‖w‖ = max 1/‖w‖):

  Find w and b such that Φ(w) = ½ wᵀw is minimized, and for all {(xi, yi)}:
    yi (wᵀxi + b) ≥ 1


55
Sec. 15.2.3
Non-linear SVMs

• Datasets that are linearly separable (with some noise) work out great:
  (figure: 1-D data on the x axis, separable by a single threshold at 0)

• But what are we going to do if the dataset is just too hard?
  (figure: 1-D data where one class surrounds the other, not separable by a threshold)

• How about … mapping the data to a higher-dimensional space:
  (figure: the same data mapped to (x, x²) becomes linearly separable)
63
Sec. 15.2.3
Non-linear SVMs: Feature spaces

• General idea: the original feature space can


always be mapped to some higher-
dimensional feature space where the training
set is separable:

Φ: x → φ(x)

64
Sec. 15.2.3
The “Kernel Trick”

• The linear classifier relies on an inner product between vectors K(xi,xj)=xiTxj


• If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)

• A kernel function is some function that corresponds to an inner product in


some expanded feature space.
• Example:
  2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
  Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
  K(xi, xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
            = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
            = φ(xi)ᵀφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]

65
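A quick numeric check (not part of the slides) that the polynomial kernel above really equals the inner product of the expanded feature vectors:

import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print((1 + xi @ xj) ** 2)   # kernel evaluated in the original 2-D space -> 4.0
print(phi(xi) @ phi(xj))    # inner product in the 6-D feature space    -> 4.0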
What the kernel trick achieves

• All of the computations that we need to do to find the maximum-


margin separator can be expressed in terms of scalar products
between pairs of datapoints (in the high-dimensional feature
space).

• These scalar products are the only part of the computation that
depends on the dimensionality of the high-dimensional space.
– So if we had a fast way to do the scalar products we would not
have to pay a price for solving the learning problem in the
high-D space.

• The kernel trick is just a magic way of doing scalar products a


whole lot faster than is usually possible.
– It relies on choosing a way of mapping to the high-dimensional
feature space that allows fast scalar products.
The kernel trick

• For many mappings from a low-D space to a high-D space, there is a simple operation on two vectors in the low-D space that can be used to compute the scalar product of their two images in the high-D space.

  K(xa, xb) = φ(xa) · φ(xb)

  (Left-hand side: letting the kernel do the work on xa, xb in the low-D space.
   Right-hand side: doing the scalar product in the obvious way after mapping xa, xb to φ(xa), φ(xb) in the high-D space.)
Sec. 15.2.3
Kernels
• Why use kernels?
– Make non-separable problem separable.
– Map data into better representational space
• Common kernels
– Linear
– Polynomial K(x,z) = (1+xTz)d
• Gives feature conjunctions
– Radial basis function (infinite dimensional space)

• Haven’t been very useful in text classification


68
Some commonly used kernels

Polynomial: K (x, y )  (x.y  1) p

Gaussian Parameters
|| x  y|| 2 / 2 2
K (x, y )  e
that the user
radial basis must choose
function

Neural net: K (x, y )  tanh ( k x.y   )

For the neural network kernel, there is one “hidden unit”


per support vector, so the process of fitting the maximum
margin hyperplane decides how many hidden units to use.
Also, it may violate Mercer’s condition.
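For concreteness, minimal Python versions of the three kernels above (the parameter names p, sigma, kappa, delta follow the formulas; the defaults are arbitrary assumptions):

import numpy as np

def polynomial_kernel(x, y, p=2):
    return (np.dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

def neural_net_kernel(x, y, kappa=1.0, delta=0.0):
    # may violate Mercer's condition, as noted above
    return np.tanh(kappa * np.dot(x, y) - delta)

print(polynomial_kernel([1, 2], [3, -1]))   # 4.0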
Performance of SVM

• Support Vector Machines work very well in practice.


– The user must choose the kernel function and its parameters,
but the rest is automatic.
– The test performance is very good.
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• SVM’s are very good if you have no idea about what structure to
impose on the task.
• The kernel trick can also be used to do PCA in a much higher-
dimensional space, thus giving a non-linear version of PCA in the
original space.
References
https://www.python-course.eu/k_nearest_neighbor_classifier.php

71
72
Lecture -8
Supervised approach: Decision Tree-
based Classification

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Classification: Definition

 Given a collection of records (training set )


– Each record contains a set of attributes, one of the
attributes is the class
 Find a model for class attribute as a function
of the values of other attributes
 Goal: previously unseen records should be
assigned a class as accurately as possible
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it
Illustrating Classification Task

(Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set for deduction.)

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task

 Predicting tumor cells as benign or malignant

 Classifying credit card transactions as legitimate or fraudulent

 Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

 Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques

 Support Vector Machines


 Decision Tree based Methods
 Rule-based Methods
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (the splitting attributes label the internal nodes)
Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → TaxInc?
         │                     ├─ < 80K → NO
         │                     └─ > 80K → YES
         └─ Married → NO


Another Example of Decision Tree

Using the same training data:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No  → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

(Figure: a tree induction algorithm performs induction on the training set to learn a decision tree model; the model is then applied to the test set for deduction.)

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the test record at each node:
1. Refund = No, so take the "No" branch to the MarSt node.
2. Marital Status = Married, so take the "Married" branch, which leads to the leaf labelled NO.
3. Assign Cheat = "No" to this test record.
Decision Tree Classification Task

(Figure, repeated from before the walkthrough: the tree induction algorithm performs induction on the ten-record training set to learn the decision tree model, which is then applied to the five-record test set for deduction.)
Decision Tree Induction

 Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART (Classification and Regression Tree)
– ID3, C4.5
– SLIQ (Fast scalable algorithm for large
application)
Can handle both numeric and categorical attributes
– SPRINT (scalable parallel classifier for
datamining)
General Structure of Hunt’s Algorithm
 Let Dt be the set of training records that reach a node t

 General Procedure:
   – If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
   – If Dt is an empty set, then t is a leaf node labeled by the default class, yd
   – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(The procedure is illustrated on the ten-record Refund / Marital Status / Taxable Income / Cheat training set shown earlier.)
Hunt’s Algorithm
(Worked example on the training data above.)

Step 1: start with a single node containing all records; predict the default class "Don't Cheat".
Step 2: split on Refund — Yes → Don't Cheat; No → ? (still a mixture of classes).
Step 3: split the Refund = No branch on Marital Status — Married → Don't Cheat; Single, Divorced → ? (still mixed).
Step 4: split the remaining branch on Taxable Income — < 80K → Don't Cheat; >= 80K → Cheat.
Tree Induction

 Greedy strategy
– Split the records based on an attribute test
that optimizes certain criterion

 Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?

– Determine when to stop splitting


How to Specify Attribute Test Condition?

 Depends on attribute types


– Nominal
– Ordinal
– Continuous

 Depends on number of ways to split


– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes
The values of a Nominal attribute are names of things, some kind of symbols. Also referred as categorical
attributes and there is no order (rank, position) among values of the nominal attribute.

 Multi-way split: use as many partitions as there are distinct values, e.g.
   CarType → {Family}, {Sports}, {Luxury}

 Binary split: divides the values into two subsets; we need to find the optimal partitioning, e.g.
   CarType → {Sports, Luxury} vs {Family}   OR   {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes
The Ordinal Attributes contains values that have a meaningful sequence or ranking(order) between them

 Multi-way split: use as many partitions as there are distinct values, e.g.
   Size → {Small}, {Medium}, {Large}

 Binary split: divides the values into two subsets that respect the order; we need to find the optimal partitioning, e.g.
   Size → {Small, Medium} vs {Large}
Splitting Based on Continuous Attributes

(i) Binary split: a threshold test such as "Taxable Income > 80K?" (Yes / No)

(ii) Multi-way split: discretize into ranges, e.g. Taxable Income → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Tree Induction

 Greedy strategy.
– Split the records based on an attribute test
that optimizes certain criterion.

 Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?

– Determine when to stop splitting


How to determine the Best Split??
Before splitting: 10 records of class 0, 10 records of class 1.

Three candidate test conditions:
 Own Car?      Yes: C0 = 6, C1 = 4        No: C0 = 4, C1 = 6
 Car Type?     Family: C0 = 1, C1 = 3     Sports: C0 = 8, C1 = 0     Luxury: C0 = 1, C1 = 7
 Student ID?   c1 … c20: each partition contains a single record (C0 = 1, C1 = 0 or C0 = 0, C1 = 1)

Which test condition is the best?
How to determine the Best Split

 Greedy approach:
– Nodes with homogeneous class distribution
are preferred
 Need a measure of node impurity:

C0: 5 C0: 9
C1: 5 C1: 1

Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
Measures of Node Impurity

 Gini Index

 Entropy

 Misclassification error
How to Find the Best Split
Before splitting: a node with class counts C0 = N00, C1 = N01 and impurity M0.

Split on A?: children N1 (C0 = N10, C1 = N11) and N2 (C0 = N20, C1 = N21), with impurities M1 and M2; their weighted impurity is M12.
Split on B?: children N3 (C0 = N30, C1 = N31) and N4 (C0 = N40, C1 = N41), with impurities M3 and M4; their weighted impurity is M34.

Gain of A = M0 − M12, gain of B = M0 − M34: the split with the higher gain (the larger impurity reduction) is the better split.
Measure of Impurity: GINI

 Gini Index for a given node t:

   GINI(t) = 1 − Σ_j [p(j | t)]²

 (NOTE: p(j | t) is the relative frequency of class j at node t.)

 – Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
 – Minimum (0.0) when all records belong to one class, implying most interesting information

   C1 = 0, C2 = 6: Gini = 0.000    C1 = 1, C2 = 5: Gini = 0.278
   C1 = 2, C2 = 4: Gini = 0.444    C1 = 3, C2 = 3: Gini = 0.500
Examples for computing GINI

GINI (t )  1   [ p( j | t )]2
j

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Gini = 1 – (1/6)2 – (5/6)2 = 0.278

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Gini = 1 – (2/6)2 – (4/6)2 = 0.444
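A small helper (not from the slides) that reproduces the Gini values above from the class counts:

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0
print(gini([1, 5]))   # ≈ 0.278
print(gini([2, 4]))   # ≈ 0.444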
Splitting Based on GINI

 When a node p is split into k partitions (children), the quality of the split is computed as

   GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

 where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index
 Splits into two partitions
 Effect of weighing partitions: larger and purer partitions are sought

Parent: C1 = 6, C2 = 6, Gini = 0.500

Split on B?: Node N1 (C1 = 5, C2 = 2) and Node N2 (C1 = 1, C2 = 4)

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371 (an improvement over the parent's 0.500)
Categorical Attributes: Computing Gini Index

 For each distinct value, gather counts for each class in the dataset
 Use the count matrix to make decisions

Multi-way split:
  CarType    Family   Sports   Luxury
  C1         1        2        1
  C2         4        1        1
  Gini = 0.393

Two-way split (find the best partition of values):
  CarType    {Sports, Luxury}   {Family}            CarType    {Family, Luxury}   {Sports}
  C1         3                  1                   C1         2                  2
  C2         2                  4                   C2         5                  1
  Gini = 0.400                                      Gini = 0.419
Continuous Attributes: Computing Gini Index

Tid Refund Marital Taxable


 Use Binary Decisions based on one Status Income Cheat
value
1 Yes Single 125K No
 Several Choices for the splitting value 2 No Married 100K No
– Number of possible splitting values 3 No Single 70K No
= Number of distinct values 4 Yes Married 120K No

 Each splitting value has a count matrix 5 No Divorced 95K Yes

associated with it 6 No Married 60K No


7 Yes Divorced 220K No
– Class counts in each of the
partitions, A < v and A  v 8 No Single 85K Yes
9 No Married 75K No
 Simple method to choose best v 10 No Single 90K Yes
– For each v, scan the database to
10

gather count matrix and compute Taxable


its Gini index Income
> 80K?
– Computationally Inefficient!
Repetition of work Yes No
Continuous Attributes: Computing Gini Index...

 For efficient computation: for each attribute,
   – Sort the attribute on its values
   – Linearly scan these values, each time updating the count matrix and computing the Gini index
   – Choose the split position that has the least Gini index

Cheat            No    No    No    Yes   Yes   Yes   No    No    No    No
Sorted values    60    70    75    85    90    95    100   120   125   220
Split positions  55    65    72    80    87    92    97    110   122   172   230
Yes (<=, >)      0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
No  (<=, >)      0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is Taxable Income <= 97 (Gini = 0.300).
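A compact sketch (not from the slides) of this sorted-scan search over the Taxable Income attribute; it uses midpoints between consecutive values as split positions, so it reports 97.5 rather than the slide's 97, with the same Gini of 0.300:

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # already sorted
cheat  = ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No']

best = None
for i in range(len(income) - 1):
    v = (income[i] + income[i + 1]) / 2          # candidate split position
    left  = cheat[:i + 1]                        # records with income <= v
    right = cheat[i + 1:]                        # records with income > v
    w = (len(left) * gini([left.count('Yes'), left.count('No')]) +
         len(right) * gini([right.count('Yes'), right.count('No')])) / len(cheat)
    if best is None or w < best[1]:
        best = (v, w)

print(best)   # (97.5, 0.3)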
Alternative Splitting Criteria based on INFO

 Entropy at a given node t:


Entropy(t )   p( j | t ) log p( j | t )
j

(NOTE: p( j | t) is the relative frequency of class j at node t).


– Measures homogeneity of a node
 Maximum (log nc) when records are equally distributed
among all classes implying least information
 Minimum (0.0) when all records belong to one class,
implying most information
– Entropy based computations are similar to the
GINI index computations
Examples for computing Entropy

Entropy(t )   p( j | t ) log p( j | t )
j 2

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Entropy = – (1/6) log2 (1/6) – (5/6) log2 (1/6) = 0.65

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
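A small helper (not from the slides) that reproduces the entropy values above:

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ≈ 0.65
print(entropy([2, 4]))   # ≈ 0.92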
Splitting Based on INFO...
 Information Gain:

 Entropy ( p )    Entropy (i ) 
n
k
GAIN i

 n 
split i 1

Parent Node, p is split into k partitions;


ni is number of records in partition i
– Measures Reduction in Entropy achieved because of
the split. Choose the split that achieves most reduction
(maximize GAIN== minimize Entropy at the child)
– Used in ID3 and C4.5
– Disadvantage: Tends to prefer splits that result in large
number of partitions, each being small but pure (i.e.
having many distinct attribute values)
Splitting Criteria based on Classification Error

 Classification error at a node t:

   Error(t) = 1 − max_i P(i | t)

 Measures misclassification error made by a node


 Maximum (1 - 1/nc) when records are equally distributed
among all classes, implying least interesting information
 Minimum (0.0) when all records belong to one class, implying
most interesting information
Examples for Computing Error

Error (t )  1  max P(i | t )


i

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 – max (0, 1) = 1 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
Comparison among Splitting Criteria

For a 2-class problem: Misclassification Error vs Gini

Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?: Node N1 (C1 = 3, C2 = 0) and Node N2 (C1 = 4, C2 = 3)

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves (0.342 < 0.42), whereas the misclassification error is unchanged by this split: error(parent) = 3/10 = 0.3 and the weighted error of the children is 3/10 × 0 + 7/10 × (3/7) = 0.3.
Tree Induction

 Greedy strategy
– Split the records based on an attribute test
that optimizes certain criterion

 Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?

– Determine when to stop splitting


Stopping Criteria for Tree Induction

 Stop expanding a node when all the records


belong to the same class

 Stop expanding a node when all the records have


similar attribute values

 Early termination (to be discussed later)


Decision Tree Based Classification

 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Practical Issues of Classification

 Underfitting and Overfitting

 Costs of Classification

 Missing Values
Errors

 Training errors (resubstitution error): # misclassifications


in training records

 Generalization error: expected error of the model on previously unseen records

 Good model: must have low training error as well as low


generalization error

Model that fits training data well can have a poorer


generalization error than a model with a higher training
error
Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise

Decision boundary is distorted by noise point


Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to
predict correctly the class labels of that region
- Insufficient number of training records in the region causes the decision
tree to predict the test examples using other training records that are
irrelevant to the classification task
Notes on Overfitting

 Overfitting results in the decision trees that are more


complex than necessary

 Training error no longer provides a good estimate of how


well the tree will perform on previously unseen records

 Needs new ways for estimating errors


Model Evaluation

 Metrics for Performance Evaluation


– How to evaluate the performance of a model?

 Methods for Performance Evaluation


– How to obtain reliable estimates?
Metrics for Performance Evaluation

 Focus on the predictive capability of a model


– Rather than how fast it takes to classify or build
models, scalability, etc.
 Confusion Matrix:

                           PREDICTED CLASS
                           Class=Yes    Class=No
  ACTUAL      Class=Yes    a (TP)       b (FN)
  CLASS       Class=No     c (FP)       d (TN)

  a: TP (true positive)    b: FN (false negative)
  c: FP (false positive)   d: TN (true negative)
Metrics for Performance Evaluation…

                           PREDICTED CLASS
                           Class=Yes    Class=No
  ACTUAL      Class=Yes    a (TP)       b (FN)
  CLASS       Class=No     c (FP)       d (TN)

 Most widely-used metric:

   Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy

 Consider a 2-class problem


– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

 If model predicts everything to be class 0,


accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does
not detect any class 1 example
Cost Matrix

                           PREDICTED CLASS
  C(i|j)                   Class=Yes     Class=No
  ACTUAL      Class=Yes    C(Yes|Yes)    C(No|Yes)
  CLASS       Class=No     C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i


Computing Cost of Classification

Cost matrix:
                     PREDICTED CLASS
  C(i|j)             +        −
  ACTUAL     +       −1       100
  CLASS      −       1        0

Model M1:                                Model M2:
             PREDICTED CLASS                          PREDICTED CLASS
             +        −                               +        −
  ACTUAL +   150      40                   ACTUAL +   250      45
  CLASS  −   60       250                  CLASS  −   5        200

  Accuracy = 80%, Cost = 3910              Accuracy = 90%, Cost = 4255
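A quick check (not from the slides) of the accuracy and cost figures for M1 and M2:

import numpy as np

cost = np.array([[-1, 100],    # actual +: C(+|+), C(-|+)
                 [ 1,   0]])   # actual -: C(+|-), C(-|-)

M1 = np.array([[150,  40],     # rows: actual +/-, columns: predicted +/-
               [ 60, 250]])
M2 = np.array([[250,  45],
               [  5, 200]])

for name, M in (("M1", M1), ("M2", M2)):
    accuracy = np.trace(M) / M.sum()
    total_cost = (M * cost).sum()
    print(name, f"accuracy={accuracy:.0%}", f"cost={total_cost}")
# M1 accuracy=80% cost=3910
# M2 accuracy=90% cost=4255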
Cost vs Accuracy

Count matrix:
                           PREDICTED CLASS
                           Class=Yes    Class=No
  ACTUAL      Class=Yes    a            b
  CLASS       Class=No     c            d

N = a + b + c + d,   Accuracy = (a + d) / N

Accuracy is proportional to cost if
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N − a − d)
     = q N − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]
Cost-Sensitive Measures
   Precision (p) = a / (a + c)

   Recall (r) = a / (a + b)

   F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

 Precision is biased towards C(Yes|Yes) & C(Yes|No)
 Recall is biased towards C(Yes|Yes) & C(No|Yes)
 F-measure is biased towards all except C(No|No)

   Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
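A minimal sketch (not from the slides) computing these measures from confusion-matrix counts a (TP), b (FN), c (FP), d (TN):

def metrics(a, b, c, d):
    accuracy  = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall    = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure

# model M1 from the cost example: a=150, b=40, c=60, d=250
print(metrics(150, 40, 60, 250))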
Model Evaluation

 Metrics for Performance Evaluation


– How to evaluate the performance of a model?

 Methods for Performance Evaluation


– How to obtain reliable estimates?

 Methods for Model Comparison


– How to compare the relative performance
among competing models?
Methods for Performance Evaluation

 How to obtain a reliable estimate of


performance?

 Performance of a model may depend on other


factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets
Learning Curve

 Learning curve shows


how accuracy changes
with varying sample size
 Requires a sampling
schedule for creating
learning curve:
 Arithmetic sampling
(Langley, et al)
 Geometric sampling
(Provost et al)

Effect of small sample size:


- Bias in the estimate
- Variance of estimate
Methods of Estimation
 Holdout
– Reserve 2/3 for training and 1/3 for testing
 Random subsampling
– Repeated holdout
 Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
 Stratified sampling
– oversampling vs undersampling
 Bootstrap [Will be covered in one of the coming lecture]
– Sampling with replacement
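A brief sketch (not from the slides) of the holdout and k-fold estimates with scikit-learn; the decision-tree estimator and the Iris data are just placeholders:

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Holdout: reserve 2/3 for training and 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print(DecisionTreeClassifier().fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: train on k-1 partitions, test on the remaining one
print(cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean())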
Thank you!!

Any Queries??
naveensaini@wsu.ac.kr
Unsupervised Learning: K-means and
K-medoid

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Course Evaluation
 Attendance [20 Points]: Online

 Four HomeWorks: 5 Points /Assignment [20 points]

 Class Participation [20 Points] [Class Behavior, Camera Opened/ not, Not Answered
question, and etc.]

 Project Based Evaluation

 Mid Term Exam [20 Points]: Students must submit their Project Status
 Project Title: After Midterm Submission Not changed title/topic
 Project Abstract : 200 ~ 500 Words
 Literature Review:1000 ~ 5000 Words
 Methodology: Requirement Analysis, Algorithm, Pseudocode, Flowchart
 Final Term Exams [20 Points] Students must submit to Complete Project Report
 Project Implementation: Coding
 Project Results: Describe the result in details [ more than1000 words]
 Demonstration: Project Demo
 Project Report [Plagiarism must be less than 2% from each reference]

****Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences*** 2
Course Project
• We encourage you to form a group of 1-2 people [Not more then 2]

• List of potential topics

Information Retrieval Multi-modal data fusion

Computer Vision Finance & Commerce

General Machine Learning Life Sciences

Natural Language Physical Sciences

Covid-19 Smart home

Health care

 Students can suggest their own Idea.

 Without prior permission students can not change their projects, If they do, It will impact
their grade for the course.

Blatant plagiarism will result in a 0% grade for the project and may entail larger consequences
3
Project Topics
1. Fake News Detection 16. Color Detection with Python

2. Email Classification 17. Sentiment Analysis

3. Emojify – Create your own emoji 18. Gender and Age Detection
4. Loan Prediction Project 19. Image Caption Generator Project in

5. Housing Prices Prediction Project Python

6. Music Genre Classification Project 20. Traffic Signs Recognition

7. Bitcoin Price Predictor Project 21. Edge Detection & Photo Sketching

8. Uber Data Analysis Project 22. Object Detection

9. Speech Emotion Recognition Project 23. Image Segmentation

10. Catching Illegal Fishing Project 24. Hand Gesture Recognition

11. Movie Recommendation System Project 26. Students can suggest their own

12. Handwritten Digits Recognition Project project


13. Road Lane Line Detection & Traffic Signs
Recognition Project
4
14. Next word predictor Project
Project Topics
No. Student Group No. Project Title Abstract
1. NaKyung Lee Price Negotiator Ecommerce *****Not submitted??*****
G1
2. Hyunwook Kim Chatbot System
3.

4.

5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
Project Topics
No. Student Group No Project Title Abstract
19
20
21

22

23
24
25
26
27
28
29
30
31
32
33
Unsupervised learning
 It is the opposite of supervised learning.

 There is no labelled data here.

 When learning data contains only some indications without any

description or labels, it is up to the coder or to the algorithm to find the

structure of the underlying data, to discover hidden patterns, or to

determine how to describe the data.

 Unsupervised learning is used to detect anomalies, outliers, such as

fraud or defective equipment, or to group customers with similar

behaviours for a sales campaign.

7
Categories of Unsupervised learning
 Unsupervised learning problems can be further divided into association

and clustering problems.

 Association:

 An association rule learning problem is where you want to discover

rules that describe large portions of your data, such as “people that buy

X also tend to buy Y” (e.g., purchasing butter with bread/jam)

 Clustering:

 A clustering problem is where you want to discover the inherent

groupings in the data, such as grouping customers by purchasing

behavior. 8
Supervised vs. Unsupervised

9
CLUSTERING

● Grouping of similar elements into various groups in an unsupervised way

● Similarity measures:
○ Euclidean distance, Cosine similarity

● Main Objective:
○ High compactness
○ Maximize Separation

● Examples:
○ K-means
○ K-medoids
○ Hierarchical

10
Classification vs Clustering
 Classification – an object's category  Clustering is a classification with no
prediction, and predefined classes.
 Used for:  Used for:
 Spam filtering  For market segmentation (types of
 Language detection customers, loyalty)
 A search of similar documents  To merge close points on a map
 Sentiment analysis  For image compression
 Recognition of handwritten characters  To analyze and label new data
and numbers  To detect abnormal behavior
 Fraud detection  Popular algorithms: K-
 Popular algorithms: Naive Bayes, Decision means_clustering, Mean-Shift, DBSCAN
Tree, Logistic Regression, K-Nearest
Neighbours, Support Vector Machine

11
Classification vs. Clustering
Classification:
Supervised learning:
Learns a method for predicting the
instance class from pre-labeled
(classified) instances
Clustering

Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data
Clustering Algorithms

─ Clustering has been a popular area of research


─ Several methods and techniques have been
developed to determine natural grouping among
the objects
─ Some well-known references

Jain, A. K., Murty, M. N., and Flynn, P. J., Data Clustering: A Survey.
ACM Computing Surveys, 1999. 31: pp. 264-323.

Jain, A. K. and Dubes, R. C., Algorithms for Clustering Data. 1988,


Englewood Cliffs, NJ: Prentice Hall. 013022278X
Clustering Application: Search Result Clustering

─ searching something particular at Google, these results are a mixture of


the similar matches of your original query. Basically, this is the result of
clustering.
─ it makes groups of similar objects in a single cluster and renders to you, i.e
provides results of searched data in terms with most closely related objects
that are clustered across the data to be searched.
─ The better the clustering algorithm deployed, the better the chances of achieving the required outcomes at the top of the results.
Clustering Application: Recommendation Engines

─ providing automated personalized suggestions about products, services


and information
─ E.g., It is broadly used in Amazon, Flipkart to recommend product and
Youtube to suggest songs of the same genre as of user interest.
─ Here, each cluster will be assigned to specific preferences on the basis of
customers’ choices who belong to the cluster
Clustering Application: Identifying Fake News

─ Fake news is being created and spread at a rapid rate due to technology
innovations such as social media.
─ Here, clustering algorithm works is by taking in the content of the fake
news article, the corpus, examining the words used and then clustering
them.
─ Certain words are found more commonly in sensationalized, click-bait
articles. When you see a high percentage of specific terms in an article, it
gives a higher probability of the material being fake news.
Clustering Application: Document Analysis

─ Task: you want to be able to organize the documents quickly and


efficiently.
─ To be able to complete this ask you need to: understand the theme of the
text, compare it with other documents and group it using any clustering
algorithm.
Types of Clustering Algorithms
Clustering
├─ Hierarchical Methods
│   ├─ Agglomerative Algorithms
│   └─ Divisive Algorithms
├─ Partitioning Methods
│   ├─ Relocation Algorithms
│   ├─ Probabilistic Clustering
│   ├─ K-medoids Methods
│   ├─ K-means Methods
│   └─ Density-Based Algorithms
│       ├─ Density-Based Connectivity Clustering
│       └─ Density Functions Clustering
├─ Grid-Based Methods
├─ Clustering Algorithms Used in Machine Learning
│   ├─ Gradient Descent and Artificial Neural Networks
│   └─ Evolutionary Methods
└─ Algorithms For High Dimensional Data
    ├─ Subspace Clustering
    ├─ Projection Techniques
    └─ Co-Clustering Techniques
Clustering Evaluation

• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
–distance measures
–high similarity within a cluster, low across clusters
The Distance Function

• Simplest case: one numeric attribute A


– Distance(X,Y) = A(X) – A(Y)

• Several numeric attributes:


– Distance(X,Y) = Euclidean distance between X,Y

• Are all attributes equally important?


– Weighting the attributes might be necessary
Simple Clustering: K-means

Works with numeric data only


1) Pick a number (K) of cluster centers (at
random)
2) Assign every item to its nearest cluster
center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its
assigned items
4) Repeat steps 2,3 until convergence (change
in cluster assignments less than a threshold)
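A compact NumPy sketch (not from the slides) of the four steps above; empty clusters are not handled, so it is only illustrative:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]        # 1) pick K centers at random
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # 2) assign to nearest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                 # 4) stop at convergence
            break
        centers = new_centers                                 # 3) move centers to the means
    return centers, labels

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])       # two synthetic blobs
centers, labels = kmeans(X, k=2)
print(centers)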
K-means example

X
Data Samples
K-means example, step 1

k1
Y
Pick 3 k2
initial
cluster
centers
(randomly)
k3

X
K-means example, step 2

k1
Y

k2
Assign
each point
to the closest
cluster
center k3

X
K-means example, step 3

k1 k1
Y

Move k2
each cluster
center k3
k2
to the mean
of each cluster k3

X
K-means example, step 4

Reassign k1
points Y
closest to a
different new
cluster center
k3
Q: Which k2
points are
reassigned?

X
K-means example, step 4 …

k1
Y
A: three
points with
animation k3
k2

X
K-means example, step 4b

k1
Y
re-compute
cluster
means k3
k2

X
K-means example, step 5

k1
Y

k2
move cluster
centers to k3
cluster means

X
K-means example, All steps in a single diagram

32
K-means Algorithm
Basic idea: randomly initialize the k cluster centers, and iterate between
the two steps we just saw.
 Randomly initialize the cluster centers, c1 , ..., cK
 Given cluster centers, determine points in each cluster
 For each point p, find the closest ci . Put p into cluster i
 Given points in each cluster, solve for ci
 Set ci to be the mean of points in cluster i
 If ci have changed, repeat Step 2
K-means Algorithm
K-means Algorithm
Squared Error Criterion
Pros and cons of K-Means
Python implementation of K-Means
Download Iris dataset from https://www.kaggle.com/uciml/iris
Python implementation of K-Means
• Visualizing the data using matplotlib :

38
Python implementation of K-Means

39
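The implementation shown on the slides is a screenshot; a minimal scikit-learn sketch on the Iris data (assumed to match its intent) is:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data[:, :2]               # sepal length and width
km = KMeans(n_clusters=3, random_state=0).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c='red', marker='x')                   # cluster centers
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()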
Sample Output of Implementation

40
A Tutorial on K-means

https://matteucci.faculty.polimi.it/Clusterin
g/tutorial_html/AppletKM.html
Outliers

• An outlier is a data point that is noticeably different from the rest.


• They represent errors in measurement, bad data collection, or simply show
variables not considered when collecting the data.
• Wikipedia defines it as ‘an observation point that is distant from other
observations.’
• Outliers threaten to skew your results and render inaccurate insights, so it is important to know how to find and handle outliers in machine learning and to understand their impact on models.
K-means variations

• K-medoids – instead of mean, use medians of


each cluster
–Mean of 1, 3, 5, 7, 9 is 5
–Mean of 1, 3, 5, 7, 1009 is 205
–Median of 1, 3, 5, 7, 1009 is 5
–Median advantage: not affected by extreme values
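A one-line check (not from the slides) of the numbers quoted above:

import numpy as np
print(np.mean([1, 3, 5, 7, 9]),        # 5.0
      np.mean([1, 3, 5, 7, 1009]),     # 205.0
      np.median([1, 3, 5, 7, 1009]))   # 5.0 -> the median ignores the extreme value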
k-Medoids
The k-Medoids Algorithm
Evaluating Cost of Swapping Medoids
Evaluating Cost of Swapping Medoids
Four Cases
K-means clustering summary

Advantages:
• Simple, understandable
• Items are automatically assigned to clusters

Disadvantages:
• Must pick the number of clusters beforehand
• All items are forced into a cluster
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data
Python implementation of K-Medoid (1/2)

KMedoids Demo — scikit-learn-extra 0.2.0 documentation 51


A demo comparing K-means and K-medoids

https://scikit-learn-
extra.readthedocs.io/en/stable/auto_examples/cluster/plot_kmedoids_di
gits.html#sphx-glr-auto-examples-cluster-plot-kmedoids-digits-py
52
Python implementation of K-Medoid (2/2)

53
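The slide content is a screenshot; a minimal sketch using the scikit-learn-extra package referenced above (installable with pip install scikit-learn-extra) would look like:

from sklearn import datasets
from sklearn_extra.cluster import KMedoids

X = datasets.load_iris().data
kmed = KMedoids(n_clusters=3, random_state=0).fit(X)
print(kmed.cluster_centers_)    # the medoids are actual data points
print(kmed.labels_[:10])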
Unsupervised Learning
 How to choose a clustering algorithm

 A vast collection of algorithms are available. Which one to choose for our

problem ?

 Choosing the “best” algorithm is a challenge.

 Every algorithm has limitations and works well with certain data

distributions.

 It is very hard, if not impossible, to know what distribution the application

data follow. The data may not fully follow any “ideal” structure or

distribution required by the algorithms.

 One also needs to decide how to standardize the data, to choose a

suitable distance function and to select other parameter values.


Unsupervised Learning

 Cluster evaluation: ground truth

 We use some labeled data (for classification)

 Assumption: Each class is a cluster.

 After clustering, a confusion matrix is constructed. From the matrix, we

compute various measurements, entropy, purity, precision, recall and F-

score.

 Let the classes in the data D be C = (c1 , c2 , …, ck ). The

clustering method produces k clusters, which divides D into k

disjoint subsets, D1 , D2 , …, Dk .

Copyright © reserved by Madhusudan Singh, PhD


Unsupervised Learning
Evaluation measures: Entropy

Copyright © reserved by Madhusudan Singh, PhD


Unsupervised Learning
Evaluation measures: purity

More evaluation measures will be discussed in the coming lecture.


Unsupervised Learning
 Indirect evaluation
 In some applications, clustering is not the primary task, but used to help
perform another task.
 We can use the performance on the primary task to compare clustering
methods.
 For instance, in an application, the primary task is to provide
recommendations on book purchasing to online shoppers.
 If we can cluster books according to their features, we might be able to
provide better recommendations.
 We can evaluate different clustering algorithms based on how well they
help with the recommendation task.
 Here, we assume that the recommendation can be reliably evaluated.
Any Queries:
naveen@iiitl.ac.in

59
Hierarchical Clustering Algorithms

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Unsupervised learning
 It is the opposite of supervised learning.

 There is no labelled data here.

 When learning data contains only some indications without any

description or labels, it is up to the coder or to the algorithm to find the

structure of the underlying data, to discover hidden patterns, or to

determine how to describe the data.

 Unsupervised learning is used to detect anomalies, outliers, such as

fraud or defective equipment, or to group customers with similar

behaviours for a sales campaign.

2
Categories of Unsupervised learning
 Unsupervised learning problems can be further divided into association

and clustering problems.

 Association:

 An association rule learning problem is where you want to discover

rules that describe large portions of your data, such as “people that buy

X also tend to buy Y” (e.g., purchasing butter with bread/jam)

 Clustering:

 A clustering problem is where you want to discover the inherent

groupings in the data, such as grouping customers by purchasing

behavior. 3
CLUSTERING

● Grouping of similar elements into various groups in an unsupervised way

● Similarity measures:
○ Euclidean distance, Cosine similarity

● Main Objective:
○ High compactness
○ Maximize Separation

● Examples:
○ K-means
○ K-medoids
○ Hierarchical

4
Supervised vs. Unsupervised

5
Classification vs Clustering
 Classification – an object's category  Clustering is a classification with no
prediction, and predefined classes.
 Used for:  Used for:
 Spam filtering  For market segmentation (types of
 Language detection customers, loyalty)
 A search of similar documents  To merge close points on a map
 Sentiment analysis  For image compression
 Recognition of handwritten characters  To analyze and label new data
and numbers  To detect abnormal behavior
 Fraud detection  Popular algorithms: K-
 Popular algorithms: Naive Bayes, Decision means_clustering, Mean-Shift, DBSCAN
Tree, Logistic Regression, K-Nearest
Neighbours, Support Vector Machine

6
Hierarchical Clustering
Algorithms

7
Introduction

• Hierarchical Clustering Approach


– A typical clustering analysis approach via partitioning data set sequentially
– Construct nested partitions layer by layer via grouping objects into a tree of clusters
(without the need to know the number of clusters in advance)
– Uses distance matrix as clustering criteria

• Agglomerative vs. Divisive


– Two sequential clustering strategies for constructing a tree of clusters
– Agglomerative: a bottom-up strategy
• Initially each data object is in its own (atomic) cluster
• Then merge these atomic clusters into larger and larger clusters
– Divisive: a top-down strategy
• Initially all objects are in one single cluster
• Then the cluster is subdivided into smaller and smaller clusters

8
Introduction

• Illustrative Example
Agglomerative and divisive clustering on the data set
{a, b, c, d ,e }
Step 0 Step 1 Step 2 Step 3 Step 4
Agglomerative

a
ab
b Two things to know:
abcde  Cluster distance
c  Termination condition
cde
d
de
e
Divisive
Step 4 Step 3 Step 2 Step 1 Step 0
9
Cluster Distance Measures

single link
• Single link: smallest distance (min)
between an element in one cluster
and an element in the other, i.e., d(Ci,
Cj) = min{d(xip, xjq)}

• Complete link: largest distance complete link


(max)
between an element in one cluster
and an element in the other, i.e., d(Ci,
Cj) = max{d(xip, xjq)}

• Average: avg distance between average


elements in one cluster and elements
in the other, i.e.,
d(Ci, Cj) = avg{d(xip, xjq)}

10
Cluster Distance Measures

Example: Given a data set of five objects characterised by a single feature, assume that
there are two clusters: C1: {a, b} and C2: {c, d, e}.
          a   b   c   d   e
Feature   1   2   4   5   6

1. Distance matrix:
      a   b   c   d   e
  a   0   1   3   4   5
  b   1   0   2   3   4
  c   3   2   0   1   2
  d   4   3   1   0   1
  e   5   4   2   1   0

2. Cluster distances between C1 = {a, b} and C2 = {c, d, e}:
  Single link:    dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2
  Complete link:  dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5
  Average link:   dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5

11
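A quick check (not from the slides) of the three cluster distances computed above:

points = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}   # the single-feature objects
C1, C2 = ['a', 'b'], ['c', 'd', 'e']

d = [abs(points[i] - points[j]) for i in C1 for j in C2]
print(min(d), max(d), sum(d) / len(d))   # 2 5 3.5 -> single, complete, average link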
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in
three steps:
1) Convert object attributes to
distance matrix
2) Set each object as a cluster
(thus if we have N objects, we
will have N clusters at the
beginning)
3) Repeat until number of cluster
is one (or known # of clusters)
 Merge two closest clusters
 Update distance matrix

12
Example
• Problem: clustering analysis with agglomerative
algorithm

data matrix

Euclidean distance

distance matrix
(Symmetric metric along the diagonal)
13
Example

• Merge two closest clusters (iteration 1)

14
Example

• Update distance matrix (iteration 1)

15
Example

• Merge two closest clusters (iteration 2)

16
Example

• Update distance matrix (iteration 2)

17
Example

• Merge two closest clusters/update distance


matrix (iteration 3)

18
Example

• Merge two closest clusters/update distance


matrix (iteration 4)

19
Example

• Final result (meeting termination condition)

20
Example

• Dendrogram tree representation


1. In the beginning we have 6
clusters: A, B, C, D, E and F
6 2. We merge clusters D and F into
cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B
into (A, B) at distance 0.71
lifetime

4. We merge clusters E and (D, F)


5 into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C
4 into (((D, F), E), C) at distance 1.41
3 6. We merge clusters (((D, F), E), C)
2 and (A, B) into ((((D, F), E), C), (A, B))
at distance 2.50
7. The last cluster contain all the objects,
thus conclude the computation
object

21
Example

Given a data set of five objects characterised by a single feature:


a b C d e
Feature 1 2 4 5 6

Apply the agglomerative algorithm with single-link, complete-link and averaging cluster
distance measures to produce three dendrogram trees, respectively.
a b c d e

a 0 1 3 4 5

b 1 0 2 3 4

c 3 2 0 1 2

d 4 3 1 0 1

e 5 4 2 1 0

22
Example

Agglomerative Demo

23
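A minimal SciPy sketch (not the demo itself) that produces the three dendrograms for the objects a–e:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1], [2], [4], [5], [6]])       # features of a, b, c, d, e
names = ['a', 'b', 'c', 'd', 'e']

for i, method in enumerate(['single', 'complete', 'average'], start=1):
    plt.subplot(1, 3, i)
    dendrogram(linkage(X, method=method), labels=names)
    plt.title(method + ' link')
plt.show()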
Google Colab Link

https://colab.research.google.com/drive/1XIriFb
6YCmKSvgr7j6f5io0lZ3IpdQUF?usp=sharing

24
Conclusions

• Hierarchical algorithm is a sequential clustering algorithm


– With distance matrix to construct a tree of clusters (dendrogram)
– Hierarchical representation without the need of knowing # of clusters (can
set termination condition with known # of clusters)

• Major weakness of agglomerative clustering methods


– Can never undo what was done previously
– Sensitive to cluster distance measures and noise/outliers
– Less efficient: O (n2 ), where n is the number of total objects

• There are several variants to overcome its weaknesses


– BIRCH: uses clustering feature tree and incrementally adjusts the quality of sub-clusters, which
scales well for a large data set
– ROCK: clustering categorical data via neighbour and link analysis, which is insensitive to noise
and outliers
– CHAMELEON: hierarchical clustering using dynamic modeling, which integrates hierarchical
method with other clustering methods

25
Any Queries:
naveen@iiitl.ac.in

26

Principal Component Analysis

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Philosophy of PCA
 Introduced by Pearson (1901) and Hotelling (1933)
to describe the variation in a set of multivariate
data (more than two variables) in terms of a set of
uncorrelated variables

 We typically have a data matrix of n observations


on p correlated variables x1,x2,…xp

 PCA looks for a transformation of the xi into p new


variables yi that are uncorrelated
Philosophy of PCA
 It’s a dimensionality-reduction method that is
often used to reduce the dimensionality of large
data sets, by transforming a large set of variables
into a smaller one that still contains most of the
information in the large set.

 Reducing the number of variables of a data set


naturally comes at the expense of accuracy, but
the trick in dimensionality reduction is to trade a
little accuracy for simplicity.
The data matrix
case   ht (x1)   wt (x2)   age (x3)   sbp (x4)   heart rate (x5)
1      175       1225      25         117        56
2      156       1050      31         122        63
…      …         …         …          …          …
n      202       1350      58         154        67
What is variance??
Variance: The variance is the average of the squared differences from
the mean. Standard deviation is the square root of the variation.

What is Variance? | Definition, Examples & Formulas (scribbr.com)


Reduce dimension

 The simplest way is to keep one variable and discard all others: not reasonable!
 Weighting all variables equally: not reasonable (unless they have the same variance)
 Weighted average based on some criterion.
 Which criterion?
Let us write it first

 Looking for a transformation of the data matrix X (n x p) such that

Y = aᵀX = a1·X1 + a2·X2 + ... + ap·Xp

where a = (a1, a2, ..., ap)ᵀ is a column vector of weights with

a1² + a2² + ... + ap² = 1


One good criterion

 Maximize the variance of the projection of the observations on the Y variables
 Find a so that

Var(aᵀX) = aᵀ Var(X) a is maximal

 The matrix C = Var(X) is the covariance matrix of the Xi variables
Some points
 If there are large differences between the ranges
of initial variables, those variables with larger
ranges will dominate over those with small
ranges.

 For example, a variable that ranges between 0


and 100 will dominate over a variable that ranges
between 0 and 1, which will lead to biased
results. So, transforming the data to comparable
scales can prevent this problem.
Let us see it on a figure
(Figure: two candidate projection directions, labelled "Good" and "Better")
Covariance matrix

        | v(x1)      c(x1,x2)   ...   c(x1,xp) |
C  =    | c(x1,x2)   v(x2)      ...   c(x2,xp) |
        | ...        ...        ...   ...      |
        | c(x1,xp)   c(x2,xp)   ...   v(xp)    |
Covariance matrix describes relationship between variables
It’s actually the sign of the covariance that matters :
•if positive then : the two variables increase or decrease together (correlated)
•if negative then : One increases when the other decreases (Inversely
correlated)
And so.. We find that
 The direction a along which the variance is largest is given by the eigenvector a1 corresponding to the largest eigenvalue of the matrix C

 The second direction, orthogonal (uncorrelated) to the first, is the one with the second highest variance; it is the eigenvector corresponding to the second largest eigenvalue

 And so on …
Some points
• Geometrically speaking, principal components represent the
directions of the data that explain a maximal amount of
variance, that is to say, the lines that capture most
information of the data.

• The relationship between variance and information here, is


that, the larger the variance carried by a line, the larger the
dispersion of the data points along it, and the larger the
dispersion along a line, the more the information it has.

• To put all this simply, just think of principal components as


new axes that provide the best angle to see and evaluate the
data, so that the differences between the observations are
better visible.
So PCA gives

 New variables Yi that are linear combinations of the original variables (xi):
 Yi = ai1·x1 + ai2·x2 + … + aip·xp ;  i = 1..p
 The new variables Yi are derived in
decreasing order of importance;
 they are called ‘principal components’
Calculating eigenvalues and eigenvectors
 The eigenvalues λi are found by solving the equation det(C − λI) = 0
 Eigenvectors are the columns of the matrix A such that C = A D Aᵀ, where

        | λ1   0    ...   0  |
  D =   | 0    λ2   ...   0  |
        | ...                |
        | 0    0    ...   λp |
An example

 Let us take two variables with covariance c > 0

  C = | 1  c |        C − λI = | 1−λ    c  |
      | c  1 |                 |  c    1−λ |

  det(C − λI) = (1 − λ)² − c² = 0

 Solving this we find λ1 = 1 + c and λ2 = 1 − c < λ1
and eigenvectors

 Any eigenvector A satisfies the condition CA = λA

  A = | a1 |     CA = | 1  c | | a1 |  =  | a1 + c·a2 |  =  λ | a1 |
      | a2 |          | c  1 | | a2 |     | c·a1 + a2 |       | a2 |

 Solving, we find A1 = (1/√2)(1, 1)ᵀ for λ1 = 1 + c and A2 = (1/√2)(1, −1)ᵀ for λ2 = 1 − c
How many components to keep?

 Enough PCs to have a cumulative


variance explained by the PCs that is
>50-70%
 Kaiser criterion: keep PCs with
eigenvalues >1
 Scree plot: represents the ability of
PCs to explain the variation in data
PCA Algorithm

The steps involved in PCA Algorithm are as follows-

Step-01: Get data.


Step-02: Compute the mean vector (µ).
Step-03: Subtract mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigen vectors and eigen values of
the covariance matrix.
Step-06: Choosing components and forming a feature
vector.
Step-07: Deriving the new data set.
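A minimal NumPy sketch of these seven steps (my own illustration; the function and variable names are not from the slides). np.cov is called with bias=True so that it divides by n, matching the worked example that follows:

import numpy as np

def pca(X, n_components):
    # Step 1-2: data matrix (one row per sample) and mean vector
    mu = X.mean(axis=0)
    # Step 3: subtract the mean
    Xc = X - mu
    # Step 4: covariance matrix
    C = np.cov(Xc, rowvar=False, bias=True)
    # Step 5: eigenvalues and eigenvectors (eigh, since C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 6: keep the eigenvectors with the largest eigenvalues (the feature vector)
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]
    # Step 7: derive the new data set by projecting onto the chosen components
    return Xc @ W, eigvals[order], W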
Numerical Example

Consider the two dimensional patterns


(2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).
Compute the principal component using
PCA Algorithm.
The given feature vectors are-
• x1 = (2, 1)
• x2 = (3, 5)
• x3 = (4, 3)
• x4 = (5, 6)
• x5 = (6, 7)
• x6 = (7, 8)
Calculate the mean vector (µ).
Mean vector (µ) = ((2 + 3 + 4 + 5 + 6 + 7) /
6, (1 + 5 + 3 + 6 + 7 + 8) / 6)
= (4.5, 5)
Subtract mean vector (µ) from the given feature
vectors.
• x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
• x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
• x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
• x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
• x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
• x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Feature vectors (xi) after subtracting mean vector (µ)
are-
Calculate the covariance matrix.
Covariance matrix = (m1 + m2 + m3 + m4 +
m5 + m6) / 6
• Calculate the eigen values and eigen
vectors of the covariance matrix.
• λ is an eigen value for a matrix M if it is
a solution of the characteristic equation
|M – λI| = 0.
So, we have-
From here,
(2.92 – λ)(5.67 – λ) – (3.67 x 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ2 – 13.47 = 0
λ2 – 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38


Thus, two eigen values are λ1 = 8.22 and λ2 = 0.38.

Clearly, the second eigen value is very small compared to the first eigen
value.
So, the second eigen vector can be left out.
Eigen vector corresponding to the greatest eigen value is the principal
component for the given data set.
So. we find the eigen vector corresponding to eigen value λ1.
We use the following equation to find the eigen vector-
MX = λX
where-
• M = Covariance Matrix
• X = Eigen vector
• λ = Eigen value
Substituting the values in the above equation, we get-
Solving these, we get-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2

On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)

From (1) and (2), X1 = 0.69X2


From (2), the eigen vector is-
Thus, principal component for the given
data set is-
The feature vector (2,1) gets transformed to =
Transpose of Eigen vector x (Feature Vector –
Mean Vector)
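The worked example can be checked numerically; a small verification sketch of mine (not from the slides):

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
mu = X.mean(axis=0)                       # (4.5, 5.0)
C = np.cov(X, rowvar=False, bias=True)    # approx [[2.92, 3.67], [3.67, 5.67]]
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
print(eigvals)                            # approx [0.38, 8.22]
pc1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
print(pc1 @ (X[0] - mu))                  # projection of (2, 1) onto the principal component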
In machine learning,
•Using both these dimensions conveys similar information.
•Also, they introduce a lot of noise in the system.
•So, it is better to use just one dimension.
Using dimension reduction techniques-
•We convert the dimensions of data from 2 dimensions (x1 and x2) to 1 dimension (z1).
•It makes the data relatively easier to explain.
Benefits of Dimension Reduction

Dimension reduction offers several benefits such as-


• It compresses the data and thus reduces the storage
space requirements.
• It reduces the time required for computation since less
dimensions require less computation.
• It eliminates the redundant features.
• It improves the model performance.
Disadvantages

Some of the disadvantages of dimensionality reduction


are as follows:

1. While doing dimensionality reduction, we lose some of the information, which can possibly affect the performance of subsequent training algorithms.
2. It can be computationally intensive.
3. Transformed features are often hard to interpret.
4. It makes the independent variables less interpretable.
Question: Excercise

A data matrix X is given by [ [ -3, -1, 1 ,3 ],


[ -3, -1, 1, 3 ] ]

What will be the eigen values??


Acknowledgement

 https://www.slideshare.net/ParthaSarathiKa
r3/principal-component-analysis-75693461
 https://builtin.com/data-science/step-step-
explanation-principal-component-analysis
Thank you!!

Any Queries??
naveensaini@wsu.ac.kr
DBSCAN Clustering Algorithms

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Unsupervised learning
 It is the opposite of supervised learning.

 There is no labelled data here.

 When the learning data contains only some indications without any

description or labels, it is up to the coder or to the algorithm to find the

structure of the underlying data, to discover hidden patterns, or to

determine how to describe the data.

 Unsupervised learning is used to detect anomalies, outliers, such as

fraud or defective equipment, or to group customers with similar

behaviours for a sales campaign.

2
Categories of Unsupervised learning
 Unsupervised learning problems can be further divided into association

and clustering problems.

 Association:

 An association rule learning problem is where you want to discover

rules that describe large portions of your data, such as “people that buy

X also tend to buy Y” (e.g., purchasing butter with bread/jam)

 Clustering:

 A clustering problem is where you want to discover the inherent

groupings in the data, such as grouping customers by purchasing

behavior. 3
CLUSTERING

● Grouping of similar elements into various groups in an unsupervised way

● Similarity measures:
○ Euclidean distance, Cosine similarity

● Main Objective:
○ High compactness
○ Maximize Separation

● Examples:
○ K-means
○ K-medoids
○ Hierarchical

4
Supervised vs. Unsupervised

5
Classification vs Clustering
 Classification – prediction of an object's category, with predefined classes.
 Used for:
 Spam filtering
 Language detection
 A search of similar documents
 Sentiment analysis
 Recognition of handwritten characters and numbers
 Fraud detection
 Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbours, Support Vector Machine

 Clustering is a classification with no predefined classes.
 Used for:
 Market segmentation (types of customers, loyalty)
 To merge close points on a map
 For image compression
 To analyze and label new data
 To detect abnormal behavior
 Popular algorithms: K-means clustering, Mean-Shift, DBSCAN

6
Density-based Clustering
Algorithms

7
Density-based Approaches

• Why Density-Based Clustering methods?


• Discover clusters of arbitrary shape.
• Clusters – Dense regions of objects separated by
regions of low density

– DBSCAN: the first Density Based Spatial


Clustering

8
DBSCAN: Density Based Spatial Clustering of
Applications with Noise

• Proposed by Ester, Kriegel, Sander, and Xu


(KDD96)
• Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points.
• Discovers clusters of arbitrary shape in spatial
databases with noise

9
Density-Based Clustering

Basic Idea:
Clusters are dense regions in the data
space, separated by regions of lower
object density

• Why Density-Based Clustering?

Results of a k-medoid
algorithm for k=4

Different density-based approaches exist (see Textbook & Papers)


Here we discuss the ideas underlying the DBSCAN algorithm
10
Density Based Clustering: Basic Concept

• Intuition for the formalization of the basic idea


– For any point in a cluster, the local point density around
that point has to exceed some threshold

• Local point density at a point p defined by two parameters


– e : radius for the neighborhood of point p:
Ne(p) := {q in data set D | dist(p, q) ≤ e}
– MinPts: minimum number of points in the given neighbourhood N(p)

11
e-Neighborhood
• e-Neighborhood – Objects within a radius of e from
an object.
Ne(p) := {q | d(p, q) ≤ e}
• “High density” - ε-Neighborhood of an object contains
at least MinPts of objects.

ε-Neighborhood of p
ε ε ε-Neighborhood of q
q p
Density of p is “high” (MinPts = 4)
Density of q is “low” (MinPts = 4)
Core, Border & Outlier

Given e and MinPts, categorize the objects into three exclusive groups: core, border and outlier (noise).

A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster.

A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.

A noise point is any point that is not a core point nor a border point.

(Figure: core, border and outlier points for e = 1 unit, MinPts = 5)

13
Example

• M, P, O, and R are core objects since each is


in an Eps neighborhood containing at least
3 points

Minpts = 3
Eps=radius
of the circles

14
Density-Reachability

 Directly density-reachable
An object q is directly density-reachable from
object p if p is a core object and q is in p’s e-
neighborhood.

 q is directly density-reachable from p


ε ε  p is not directly density- reachable
q p from q?
 Density-reachability is asymmetric.
MinPts = 4

15
Density-reachability

• Density-Reachable (directly and indirectly):


– A point p is directly density-reachable from p2;
– p2 is directly density-reachable from p1;
– p1 is directly density-reachable from q;
– pp2p1q form a chain.

p
 p is (indirectly) density-reachable
p2 from q
p1  q is not density- reachable from p?
q
MinPts = 7

16
Density-Connectivity

Density-reachable is not symmetric


 not good enough to describe clusters

Density-Connected
A pair of points p and q are density-connected
if they are commonly density-reachable from a
point o.
 Density-connectivity is
symmetric
p q

o
17
Formal Description of Cluster

• Given a data set D, parameter e and


threshold MinPts.
• A cluster C is a subset of objects satisfying
two criteria:
– Connected: ∀ p, q ∈ C: p and q are density-connected.
– Maximal: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C. (avoid redundancy)
p is a core object.

18
Review of Concepts

Is an object o in a cluster or an outlier?
 Is o a core object?
 Is o density-reachable by some core object?

Are objects p and q in the same cluster?
 Are p and q density-connected?
 Are p and q density-reachable by some object o?

Directly density-reachable vs. indirectly density-reachable through a chain
19
DBSCAN Algorithm

Input: The data set D


Parameter: e, MinPts
For each object p in D
if p is a core object and not processed then
C = retrieve all objects density-reachable from p
mark all objects in C as processed
report C as a cluster
else mark p as outlier
end if
End For

DBScan Algorithm

20
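In practice the algorithm above is available off the shelf; a minimal scikit-learn sketch (my own illustration; the data and the parameter values are placeholders, not taken from the slides):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)    # any 2-D data set

db = DBSCAN(eps=0.1, min_samples=5).fit(X)   # eps = e, min_samples = MinPts
labels = db.labels_                          # cluster ids; -1 marks noise/outliers

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int(np.sum(labels == -1)), "noise points")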
DBSCAN: The Algorithm

– Arbitrary select a point p

– Retrieve all points density-reachable from p wrt Eps and MinPts.

– If p is a core point, a cluster is formed.

– If p is a border point, no points are density-reachable from p and


DBSCAN visits the next point of the database.

– Continue the process until all of the points have been processed.

21
DBSCAN Algorithm: Example

• Parameter
• e = 2 cm
• MinPts = 3

for each o  D do
if o is not yet classified then
if o is a core-object then
collect all objects density-reachable from o
and assign them to a new cluster.
else
assign o to NOISE
22
MinPts = 5

(Figure: expanding cluster C1 from point p and then from core object p1, using e-neighborhoods)

1. Check the e-neighborhood of p;
2. If p has less than MinPts neighbors then mark p as outlier and continue with the next object
3. Otherwise mark p as processed and put all the neighbors in cluster C

Then:
1. Check the unprocessed objects in C
2. If no core object, return C
3. Otherwise, randomly pick up one core object p1, mark p1 as processed, and put all unprocessed neighbors of p1 in cluster C
25
(Figure: successive expansion of cluster C1 as more e-neighborhoods are absorbed)
26
Example

Original Points Point types: core,


border and outliers

e = 10, MinPts = 4
27
When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise
• Can handle clusters of different shapes and sizes

28
When DBSCAN Does NOT Work Well

(MinPts=4, Eps=9.92).

Original Points

• Cannot handle Varying


densities
• sensitive to parameters

(MinPts=4, Eps=9.75)
29
DBSCAN: Sensitive to Parameters

30
Determining the Parameters e and MinPts

• Cluster: Point density higher than specified by e and MinPts


• Idea: use the point density of the least dense cluster in the data
set as parameters – but how to determine this?
• Heuristic: look at the distances to the k-nearest neighbors

(Figure: 3-distance(p) and 3-distance(q) illustrated for two points p and q)

• Function k-distance(p): distance from p to its k-th nearest neighbor
• k-distance plot: k-distances of all objects, sorted in decreasing
order

31
Determining the Parameters e and MinPts

• Example k-distance plot

3-distance
first „valley“

Objects

• Heuristic method: “border object”

– Fix a value for MinPts (default: 2·d − 1, where d is the data dimensionality)
– User selects a “border object” o from the MinPts-distance plot;
e is set to MinPts-distance(o)

32
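A sketch of the k-nearest-neighbor distance heuristic (my own illustration, assuming scikit-learn and matplotlib are available):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k):
    # distance from each point to its k-th nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbour
    dist, _ = nn.kneighbors(X)
    kdist = np.sort(dist[:, -1])[::-1]                # sorted in decreasing order
    plt.plot(kdist)
    plt.xlabel("objects")
    plt.ylabel(f"{k}-distance")
    plt.show()                                        # choose e near the first "valley"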
Determining the Parameters e and MinPts

• Problematic example
(Figure: a problematic data set with clusters A, B, C, D, E, F, G at different densities and the corresponding 3-distance plot; no single e value suits all clusters)
33
Density Based Clustering: Discussion

• Advantages
– Clusters can have arbitrary shape and size
– Number of clusters is determined automatically
– Can separate clusters from surrounding noise

• Disadvantages
– Input parameters may be difficult to determine
– In some situations very sensitive to input
parameter setting

34
35
Cluster Validation

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
What is Cluster Analysis?

 Finding groups of objects such that the objects in a


group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
Intra-cluster distances are minimized; inter-cluster distances are maximized
Applications of Cluster Analysis

• Understanding
• Structuring search results
• Suggesting related pages
• Automatic directory construction/update
• Finding near identical/duplicate pages

• Summarization
• Reduce the size of large data sets
Notion of a Cluster can be Ambiguous

How many clusters?
(Figure: the same points grouped as two, four, or six clusters)


Types of Clusterings

• A clustering is a set of clusters

• Important distinction between hierarchical


and partitional sets of clusters

• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitional Clustering

Original Points A Partitional Clustering


Hierarchical Clustering

(Figure: nested clusters over points p1–p4 and the corresponding dendrogram)
Types of Clusters

 Well-separated clusters

 Center-based clusters

 Contiguous clusters

 Density-based clusters

 Property or Conceptual

 Described by an Objective Function


Types of Clusters: Well-Separated

 Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster
is closer (or more similar) to every other point in the
cluster than to any point not in the cluster.

3 well-separated clusters
Types of Clusters: Center-Based

 Center-based
– A cluster is a set of objects such that an object in a
cluster is closer (more similar) to the “center” of a cluster,
than to the center of any other cluster
– The center of a cluster is often a centroid, the average of
all the points in the cluster, or a medoid, the most
“representative” point of a cluster

4 center-based clusters
Types of Clusters: Contiguity-Based

 Contiguous Cluster (Nearest neighbor or


Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

8 contiguous clusters
Types of Clusters: Density-Based

 Density-based
– A cluster is a dense region of points, which is separated
by low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and
when noise and outliers are present.

6 density-based clusters
Types of Clusters: Conceptual Clusters

 Shared Property or Conceptual Clusters


– Finds clusters that share some common property or
represent a particular concept.
.

2 Overlapping Circles
Types of Clusters: Objective Function

 Clusters Defined by an Objective Function


– Finds clusters that minimize or maximize an objective
function.
– Enumerate all possible ways of dividing the points into
clusters and evaluate the `goodness' of each potential set of
clusters by using the given objective function
– Can have global or local objectives.
 Hierarchical clustering algorithms typically have local objectives
 Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit
the data to a parameterized model.
 Parameters for the model are determined from the data.
 Mixture models assume that the data is a ‘mixture' of a number of
statistical distributions.
Inter/Intra Cluster Distances

Intra-cluster distance:
 (Sum/Min/Max/Avg) the (absolute/squared) distance between
 - all pairs of points in the cluster, OR
 - the centroid and all points in the cluster, OR
 - the “medoid” and all points in the cluster

Inter-cluster distance:
 Sum the (squared) distance between all pairs of clusters, where the distance between two clusters is defined as:
 - the distance between their centroids/medoids (spherical clusters)
 - the distance between the closest pair of points belonging to the clusters (chain-shaped clusters)
How hard is clustering?

 One idea is to consider all possible


clusterings, and pick the one that has best
inter and intra cluster distance properties
 Suppose we are given n points, and would like to cluster them into k clusters
– How many possible clusterings? Approximately kⁿ/k!
• Too hard to do it by brute force or optimally
• Solution: Iterative optimization algorithms
– Start with a clustering, iteratively improve it (e.g. K-means)
Quality: What Is Good Clustering?

A good clustering method will produce high quality


clusters
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between
clusters
Quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– its ability to discover some or all of the hidden
patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric
– Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
– Definitions of distance functions are usually rather different
for interval-scaled, boolean, categorical, ordinal, ratio, and
vector variables
– Weights should be associated with different variables based
on applications and data semantics

Quality of clustering:
– There is usually a separate “quality” function that measures
the “goodness” of a cluster
– It is hard to define “similar enough” or “good enough”
 Answer is typically highly subjective
Requirements and Challenges
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixture of
these
 Constraint-based clustering
 User may give constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
Sec. 16.2
Issues for clustering

• Representation for clustering


• Document representation
• Vector space? Normalization?
• Centroids aren’t length normalized
• Need a notion of similarity/distance

• How many clusters?


• Fixed a priori?
• Completely data driven?
• Avoid “trivial” clusters - too large or small
• If a cluster's too large, then for navigation purposes you've
wasted an extra user click without whittling down the set
of documents much.
Notion of similarity/distance

Ideal: semantic similarity.


Practical: term-statistical similarity
– We will use cosine similarity.
– Docs as vectors.
– For many algorithms, easier to think in terms of a
distance (rather than similarity) between docs.
– We will mostly speak of Euclidean distance
But real implementations use cosine similarity
Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e.,


distinguishing whether non-random structure actually exists in
the data.
2. Comparing the results of a cluster analysis to externally known
results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data
without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses
to determine which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to


evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity

Numerical measures that are applied to judge various


aspects of cluster validity, are classified into the following
three types.
– External Index: Used to measure the extent to which cluster
labels match externally supplied class labels.
Entropy

– Internal Index: Used to measure the goodness of a clustering


structure without respect to external information.
Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
Often an external or internal index is used for this function, e.g., SSE or
entropy

Sometimes these are referred to as criteria instead of


indices
– However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
External Measures
• The correct or ground-truth clustering is known a priori.

• Given a clustering partition C and ground truth


partitioning T, we redefine TP, TN, FP, FN in the
context of clustering.

• Given the number of pairs N


N=TP+FP+FN+TN
External Measures …
• True Positives (TP): Xi and Xj are a true positive pair if they belong
to the same partition in T, and they are also in the same cluster in
C. TP is defined as the number of true positive pairs.
• False Negatives (FN): Xi and Xj are a false negative pair if they
belong to the same partition in T, but they do not belong to the
same cluster in C. FN is defined as the number of false negative
pairs.
• False Positives (FP): Xi and Xj are a false positive pair if they do not belong to the same partition in T, but belong to the same cluster in C. FP is the number of false positive pairs.
• True Negatives (TN): Xi and Xj are a true negative pair if they do not belong to the same partition in T, nor to the same cluster in C. TN is the number of true negative pairs.
Jaccard Coefficient
•Measures the fraction of true positive point pairs, but
after ignoring the true negatives as,
Jaccard = TP/ (TP+FP+FN)

•For a perfect clustering C, the coefficient is one, that


is, there are no false positives nor false negatives.

•Note that the Jaccard coefficient is asymmetric in that


it ignores the true negatives
Rand Statistic
• Measures the fraction of true positives and true
negatives over all pairs as

Rand = (TP + TN)/ N


• The Rand statistic measures the fraction of point
pairs where both the clustering C and the ground
truth T agree.

• A perfect clustering has a value of 1 for the statistic.


• The adjusted rand index is the extension of the rand
statistic corrected for chance.
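A small sketch of mine (not from the slides) that counts the TP/FN/FP/TN pairs defined above and reports the Jaccard coefficient and the Rand statistic:

from itertools import combinations

def pair_counts(truth, clustering):
    # truth and clustering: one partition/cluster id per object
    TP = FP = FN = TN = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_c = clustering[i] == clustering[j]
        if same_t and same_c:
            TP += 1
        elif same_t and not same_c:
            FN += 1
        elif not same_t and same_c:
            FP += 1
        else:
            TN += 1
    return TP, FP, FN, TN

TP, FP, FN, TN = pair_counts([0, 0, 1, 1, 1], [0, 0, 0, 1, 1])
N = TP + FP + FN + TN
print("Jaccard =", TP / (TP + FP + FN), " Rand =", (TP + TN) / N)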
External Measures of Cluster Validity: Entropy and Purity
Internal Measures: SSE
 Clusters in more complicated figures aren’t well
separated
 Internal Index: Used to measure the goodness of a
clustering structure without respect to external information
– SSE
 SSE is good for comparing two clusterings or two
clusters (average SSE).
 Can also be used to estimate the number of clusters
(Figure: a sample data set and its SSE vs. K curve; the "knee" of the curve suggests the number of clusters)
Internal Measures: SSE

 SSE curve for a more complicated data set


SSE of clusters found using K-means
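A sketch of this idea with scikit-learn (my own illustration; KMeans exposes the SSE of a clustering as inertia_, and the toy data set is an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)   # toy data, 5 true clusters

ks = range(2, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")     # look for the "knee" of the curve
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()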


Internal Measures: Cohesion and Separation

 Cluster Cohesion (compactness): Measures how


closely related are objects in a cluster
– Example: SSE

 Cluster Separation (separation): Measure how distinct


or well-separated a cluster is from other clusters
 Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (SSE):

  WSS = Σ_i Σ_{x ∈ Ci} (x − mi)²

– Separation is measured by the between-cluster sum of squares:

  BSS = Σ_i |Ci| · (m − mi)²

– where |Ci| is the size of cluster i, mi its centroid, and m the overall mean
Internal Measures: Cohesion and Separation

 Example: SSE
– BSS + WSS = constant

Data points 1, 2, 4, 5 on a line; overall mean m = 3; cluster means m1 = 1.5 and m2 = 4.5

K=1 cluster:
  WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
  BSS = 4 × (3 − 3)² = 0
  Total = 10 + 0 = 10

K=2 clusters ({1, 2} and {4, 5}):
  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
  BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
  Total = 1 + 9 = 10
Internal Measures: Cohesion and Separation

 A proximity graph based approach can also be used for


cohesion and separation.
– Cluster cohesion is the sum of the weight of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the
cluster and nodes outside the cluster.

cohesion separation
Internal Measures: Silhouette Coefficient

 Silhouette Coefficient combines ideas of both cohesion and


separation, but for individual points, as well as clusters and
clusterings
 For an individual point, i
– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to points in another cluster)
– The silhouette coefficient for a point is then given by

s = 1 − a/b if a < b (or s = b/a − 1 if a ≥ b, not the usual case)

– Typically between 0 and 1.
– The closer to 1 the better.
– The closer to 1 the better.

 Can calculate the Average Silhouette width for a cluster or


a clustering
Silhouette coefficient
Dunn’s Index:
Davies–Bouldin index:
Xie-Beni index:
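Of these indices, the silhouette coefficient and the Davies–Bouldin index are available directly in scikit-learn (Dunn and Xie-Beni are not built in); a sketch of mine using toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette (higher is better):", silhouette_score(X, labels))
print("Davies-Bouldin (lower is better):", davies_bouldin_score(X, labels))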
Final Comment on Cluster Validity

• “The validation of clustering structures is the most


difficult and frustrating part of cluster analysis.

• Without a strong effort in this direction, cluster


analysis will remain a black art accessible only to
those true believers who have experience and great
courage.”

• Reference [Book]: Algorithms for Clustering Data,


Jain and Dubes
50

Feature Selection

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini
1
Feature Extraction/Selection
Objective
LECTURE 11: Sequential Feature Selection
g Feature extraction vs. Feature selection
g Search strategy and objective functions
g Objective functions
n Filters
n Wrappers
g Sequential search strategies
n Sequential Forward Selection
n Sequential Backward Selection
n Plus-l Minus-r Selection
n Bidirectional Search
n Floating Search

Introduction to Pattern Analysis 1


Ricardo Gutierrez-Osuna
Texas A&M University
Feature extraction vs. Feature selection
g As we discussed in Lecture 9, there are two general approaches for performing
dimensionality reduction
n Feature extraction: Transforming the existing features into a lower dimensional space
n Feature selection: Selecting a subset of the existing features without a transformation

  feature selection:   (x1, x2, …, xN)ᵀ → (x_i1, x_i2, …, x_iM)ᵀ
  feature extraction:  (x1, x2, …, xN)ᵀ → (y1, y2, …, yM)ᵀ = f((x1, x2, …, xN)ᵀ)

g Feature extraction was covered in lectures 9 and 10


n We derived the “optimal” linear features for two objective functions
g Signal representation: PCA (Principal Component Analysis)
g Signal classification: LDA (Linear Discriminant Analysis)
g Feature selection, also called Feature Subset Selection (FSS) in the literature,
will be the subject of the last two lectures
n Although FSS can be thought of as a special case of feature extraction (think of a sparse
projection matrix with a few ones), in practice it is a quite different problem
g FSS looks at the issue of dimensionality reduction from a different perspective
g FSS has a unique set of methodologies

Introduction to Pattern Analysis 2


Ricardo Gutierrez-Osuna
Texas A&M University
Feature Subset Selection
g Definition
n Given a feature set X={xi | i=1…N}, find a subset YM ={xi1, xi2, …, xiM}, with M<N, that optimizes an objective function J(Y), ideally the probability of correct classification

  (x1, x2, …, xN)ᵀ →(feature selection)→ (x_i1, x_i2, …, x_iM)ᵀ

  {x_i1, x_i2, …, x_iM} = argmax_{M, i_m} J({x_i | i = 1…N})

g Why Feature Subset Selection?


n Why not use the more general feature extraction methods, and simply project a high-
dimensional feature vector onto a low-dimensional space?
g Feature Subset Selection is necessary in a number of situations
n Features may be expensive to obtain
g You evaluate a large number of features (sensors) in the test bed and select only a few for the final
implementation
n You may want to extract meaningful rules from your classifier
g When you transform or project, the measurement units of your features (length, weight, etc.) are lost
n Features may not be numeric
g A typical situation in the machine learning domain
g In addition, fewer features means fewer parameters for pattern recognition
n Improved the generalization capabilities
n Reduced complexity and run-time

Introduction to Pattern Analysis 3


Ricardo Gutierrez-Osuna
Texas A&M University
Search strategy and objective function
g Feature Subset Selection requires
n A search strategy to select candidate subsets
n An objective function to evaluate these candidates
g Search Strategy
n Exhaustive evaluation of feature subsets involves (N choose M) combinations for a fixed value of M, and 2^N combinations if M must be optimized as well
g This number of combinations is unfeasible, even for moderate values of M and N, so a search procedure must be used in practice
g For example, exhaustive evaluation of 10 out of 20 features involves 184,756 feature subsets; exhaustive evaluation of 10 out of 100 involves more than 10^13 feature subsets [Devijver and Kittler, 1982]
n A search strategy is therefore needed to direct the FSS process as it explores the space of all possible combinations of features
g Objective Function
n The objective function evaluates candidate subsets and returns a measure of their “goodness”, a feedback signal used by the search strategy to select new candidates
(Diagram: training data and the complete feature set feed Feature Subset Selection, where a Search loop proposes a feature subset and the Objective function returns its information content; the final feature subset is then passed to the PR algorithm)

Introduction to Pattern Analysis 4


Ricardo Gutierrez-Osuna
Texas A&M University
Objective function
g Objective functions are divided in two groups
n Filters: The objective function evaluates feature subsets by their information content, typically
interclass distance, statistical dependence or information-theoretic measures
n Wrappers: The objective function is a pattern classifier, which evaluates feature subsets by
their predictive accuracy (recognition rate on test data) by statistical resampling or cross-
validation

(Diagrams: in Filter FSS the Search proposes a feature subset and an Objective function scores its information content; in Wrapper FSS the Search proposes a feature subset and a PR algorithm scores its predictive accuracy; in both cases the final feature subset is then used by the ML/PR algorithm)

Introduction to Pattern Analysis 5


Ricardo Gutierrez-Osuna
Texas A&M University
Filter types
g Distance or separability measures
n These methods use distance metrics to measure class separability, such as
g Distance between classes: Euclidean, Mahalanobis, etc.
g Determinant of SW-1SB (LDA eigenvalues)
g Correlation and information-theoretic measures
n These methods are based on the rationale that good feature subsets contain features highly
correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other
n Linear relation measures
g Linear relationship between variables can be measured using the correlation coefficient:

  J(YM) = [ Σ_{i=1..M} ρ_ic ] / [ Σ_{i=1..M} Σ_{j=i+1..M} ρ_ij ]

n where ρic is the correlation coefficient between feature ‘i’ and the class label and ρij is the correlation coefficient between features ‘i’ and ‘j’
n Non-Linear relation measures
g Correlation is only capable of measuring linear dependence. A more powerful measure is the mutual information I(YM; C):

  J(YM) = I(YM; C) = H(C) − H(C | YM) = Σ_{c=1..C} ∫_{YM} P(YM, ωc) lg [ P(YM, ωc) / (P(YM) P(ωc)) ] dx

g The mutual information between the feature vector and the class label I(YM;C) measures the amount by which the uncertainty in the class H(C) is decreased by knowledge of the feature vector H(C|YM), where H(·) is the entropy function
n Note that mutual information requires the computation of the multivariate densities P(YM) and P(YM,ωC), which is an ill-posed problem for high-dimensional spaces. In practice [Battiti, 1994], mutual information is replaced by a heuristic like

  J(YM) = Σ_{m=1..M} I(x_im; C) − β Σ_{m=1..M} Σ_{n=m+1..M} I(x_im; x_in)

Introduction to Pattern Analysis 6


Ricardo Gutierrez-Osuna
Texas A&M University
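A minimal filter-style sketch with scikit-learn (my own illustration; the iris data and the choice of mutual information as the score are assumptions, not from the lecture): each feature is scored by its mutual information with the class and the top M are kept. Note that this scores features individually, a limitation discussed on the naïve-selection slide further below.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)

print("feature scores:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))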
Filters vs. Wrappers
g Filters
n Advantages
g Fast execution: Filters generally involve a non-iterative computation on the dataset, which can
execute much faster than a classifier training session
g Generality: Since filters evaluate the intrinsic properties of the data, rather than their interactions
with a particular classifier, their results exhibit more generality: the solution will be “good” for a larger
family of classifiers
n Disadvantages
g Tendency to select large subsets: Since the filter objective functions are generally monotonic, the
filter tends to select the full feature set as the optimal solution. This forces the user to select an
arbitrary cutoff on the number of features to be selected

g Wrappers
n Advantages
g Accuracy: wrappers generally achieve better recognition rates than filters since they are tuned to the
specific interactions between the classifier and the dataset
g Ability to generalize: wrappers have a mechanism to avoid overfitting, since they typically use
cross-validation measures of predictive accuracy
n Disadvantages
g Slow execution: since the wrapper must train a classifier for each feature subset (or several
classifiers if cross-validation is used), the method can become unfeasible for computationally
intensive methods
g Lack of generality: the solution lacks generality since it is tied to the bias of the classifier used in the
evaluation function. The “optimal” feature subset will be specific to the classifier under consideration

Introduction to Pattern Analysis 7


Ricardo Gutierrez-Osuna
Texas A&M University
Search strategies
g There is a large number of search strategies, which can be grouped in three
categories
n Exponential algorithms (Lecture 12)
g These algorithms evaluate a number of subsets that grows exponentially with the dimensionality of
the search space
g The most representative algorithms under this class are
n Exhaustive Search (already discussed)
n Branch and Bound
n Approximate Monotonicity with Branch and Bound
n Beam Search
n Sequential algorithms (Lecture 11)
g These algorithms add or remove features sequentially, but have a tendency to become trapped in
local minima
g Representative examples of sequential search include
n Sequential Forward Selection
n Sequential Backward Selection
n Plus-l Minus-r Selection
n Bidirectional Search
n Sequential Floating Selection
n Randomized algorithms (Lecture 12)
g These algorithms incorporating randomness into their search procedure to escape local minima
g Representative examples are
n Random Generation plus Sequential Selection
n Simulated Annealing
n Genetic Algorithms

Introduction to Pattern Analysis 8


Ricardo Gutierrez-Osuna
Texas A&M University
Naïve sequential feature selection
g One may be tempted to evaluate each individual feature separately and select
those M features with the highest scores
n Unfortunately, this strategy will VERY RARELY work since it does not account for feature dependence
g An example will help illustrate the poor performance that can be expected from this naïve approach
n The figures show a 4-dimensional pattern recognition problem with 5 classes (ω1–ω5). Features are shown in pairs of 2D scatter plots (x1 vs. x2 and x3 vs. x4)
n The objective is to select the best subset of 2 features using the naïve sequential feature selection procedure
n Any reasonable objective function will rank features according to this sequence: J(x1)>J(x2)≈J(x3)>J(x4)
g x1 is, without a doubt, the best feature. It clearly separates ω1, ω2, ω3 and {ω4, ω5}
g x2 and x3 have similar performance, separating classes in three groups
g x4 is the worst feature since it can only separate ω4 from ω5, the rest of the classes having a heavy overlap
n The optimal feature subset turns out to be {x1, x4}, because x4 provides the only information that x1 needs: discrimination between classes ω4 and ω5
n However, if we were to choose features according to the individual scores J(xk), we would certainly pick x1 and either x2 or x3, leaving classes ω4 and ω5 non-separable
g This naïve strategy fails because it cannot consider features with complementary information

Introduction to Pattern Analysis 9


Ricardo Gutierrez-Osuna
Texas A&M University
Sequential Forward Selection (SFS)
g Sequential Forward Selection is the simplest greedy search algorithm
n Starting from the empty set, sequentially add the feature x+ that results in the highest objective
function J(Yk+x+) when combined with the features Yk that have already been selected
Empty feature set
g Algorithm
1. Start with the empty set Y0 = {∅}
2. Select the next best feature: x+ = argmax_{x ∉ Yk} J(Yk + x)
3. Update Yk+1 = Yk + x+; k = k + 1
4. Go to 2
g Notes
n SFS performs best when the optimal subset has a small number of
features
g When the search is near the empty set, a large number of states can be
potentially evaluated
g Towards the full set, the region examined by SFS is narrower since most of the
features have already been selected
n The search space is drawn like an ellipse to emphasize the fact that there are fewer states towards the full or empty sets
g As an example, the state space for 4 features is shown below. Notice that the number of states is larger in the middle of the search tree

  0000
  1000 0100 0010 0001
  1100 1010 1001 0110 0101 0011
  1110 1101 1011 0111
  1111

g The main disadvantage of SFS is that it is unable to remove features that become obsolete after the addition of other features

Introduction to Pattern Analysis 10


Ricardo Gutierrez-Osuna
Texas A&M University
SFS example
g Assuming the objective function J(X) below, perform a Sequential
Forward Selection to completion
J(X) = −2·x1·x2 + 3·x1 + 5·x2 − 2·x1·x2·x3 + 7·x3 + 4·x4 − 2·x1·x2·x3·x4

g where xk are indicator variables that determine if the k-th feature has been selected
(xk=1) or not (xk=0)
g Solution

(I) J(x1)=3 J(x2)=5 J(x3)=7 J(x4)=4

(II) J(x3x1)=10 J(x3x2)=12 J(x3x4)=11

(III) J(x3x2x1)=11 J(x3x2x4)=16

(IV) J(x3x2x4x1)=13
Introduction to Pattern Analysis 11
Ricardo Gutierrez-Osuna
Texas A&M University
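The same greedy search can be scripted; a sketch of mine (not from the slides) that reproduces steps (I)-(IV) for the objective function above:

def J(S):
    # indicator variables x1..x4 derived from the selected subset S
    x1, x2, x3, x4 = (1 if k in S else 0 for k in (1, 2, 3, 4))
    return (-2*x1*x2 + 3*x1 + 5*x2 - 2*x1*x2*x3 + 7*x3 + 4*x4
            - 2*x1*x2*x3*x4)

selected = set()
while len(selected) < 4:
    # add the feature that maximises J when combined with the current subset
    best = max({1, 2, 3, 4} - selected, key=lambda f: J(selected | {f}))
    selected |= {best}
    print(sorted(selected), "J =", J(selected))   # x3; then x3,x2; then x3,x2,x4; then all four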
Sequential Backward Selection (SBS)
g Sequential Backward Selection works in the opposite direction of SFS
n Starting from the full set, sequentially remove the feature x− that results in the smallest decrease
in the value of the objective function J(Y-x−)
g Notice that removal of a feature may actually lead to an increase in the objective function J(Yk-x−)>
J(Yk). Such functions are said to be non-monotonic (more on this when we cover Branch and Bound)
g Algorithm
Empty feature set
1. Start with the full set Y0 = X
2. Remove the worst feature: x− = argmax_{x ∈ Yk} J(Yk − x)
3. Update Yk+1 = Yk − x−; k = k + 1
4. Go to 2

g Notes
n SBS works best when the optimal feature subset has a large
number of features, since SBS spends most of its time visiting large
subsets
n The main limitation of SBS is its inability to reevaluate the
usefulness of a feature after it has been discarded

Full feature set

Introduction to Pattern Analysis 12


Ricardo Gutierrez-Osuna
Texas A&M University
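Both SFS and SBS are implemented as wrappers in scikit-learn (version 0.24 or later) as SequentialFeatureSelector; a sketch of mine, where the iris data and the KNN estimator are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

for direction in ("forward", "backward"):          # SFS vs. SBS
    sel = SequentialFeatureSelector(knn, n_features_to_select=2,
                                    direction=direction, cv=5).fit(X, y)
    print(direction, "->", sel.get_support(indices=True))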
Plus-L Minus-R Selection (LRS)
g Plus-L Minus-R is a generalization of SFS and SBS
n If L>R, LRS starts from the empty set and repeatedly adds ‘L’ features and removes ‘R’ features
n If L<R, LRS starts from the full set and repeatedly removes ‘R’ features followed by ‘L’ feature
additions
g Algorithm
1. If L > R then
     start with the empty set Y = {∅}
   else
     start with the full set Y = X
     go to step 3
2. Repeat L times:
     x+ = argmax_{x ∉ Yk} J(Yk + x)
     Yk+1 = Yk + x+; k = k + 1
3. Repeat R times:
     x− = argmax_{x ∈ Yk} J(Yk − x)
     Yk+1 = Yk − x−; k = k + 1
4. Go to 2
g Notes
n LRS attempts to compensate for the weaknesses of SFS and SBS with some backtracking
capabilities
n Its main limitation is the lack of a theory to help predict the optimal values of L and R

Introduction to Pattern Analysis 13


Ricardo Gutierrez-Osuna
Texas A&M University
Bidirectional Search (BDS)
g Bidirectional Search is a parallel implementation of SFS and SBS
n SFS is performed from the empty set
n SBS is performed from the full set
n To guarantee that SFS and SBS converge to the same solution, we must ensure that
g Features already selected by SFS are not removed by SBS
g Features already removed by SBS are not selected by SFS
g For example, before SFS attempts to add a new feature, it checks if it has been removed by SBS
and, if it has, attempts to add the second best feature, and so on. SBS operates in a similar fashion.
g Algorithm
1. Start SFS with the empty set YF0 = {∅}
2. Start SBS with the full set YB0 = X
3. Select the best feature:
     x+ = argmax_{x ∉ YFk, x ∈ YBk} J(YFk + x)
     YFk+1 = YFk + x+
4. Remove the worst feature:
     x− = argmax_{x ∈ YBk, x ∉ YFk+1} J(YBk − x)
     YBk+1 = YBk − x−; k = k + 1
5. Go to 3

Introduction to Pattern Analysis 14


Ricardo Gutierrez-Osuna
Texas A&M University
Sequential Floating Selection (SFFS and SFBS)
g Sequential Floating Selection methods are an extension to the LRS algorithms
with flexible backtracking capabilities
n Rather than fixing the values of ‘L’ and ‘R’, these floating methods allow those values to be
determined from the data:
g The dimensionality of the subset during the search can be though to be “floating” up and down
g There are two floating methods
n Sequential Floating Forward Selection (SFFS) starts from the empty set
g After each forward step, SFFS performs backward steps as long as the objective function increases
n Sequential Floating Backward Selection (SFBS) starts from the full set
g After each backward step, SFBS performs forward steps as long as the objective function increases
g SFFS Algorithm (SFBS is analogous)
1. Start with the empty set Y = {∅}
2. Select the best feature:
     x+ = argmax_{x ∉ Yk} J(Yk + x)
     Yk = Yk + x+; k = k + 1
3. Select the worst feature*:
     x− = argmax_{x ∈ Yk} J(Yk − x)
4. If J(Yk − x−) > J(Yk) then
     Yk+1 = Yk − x−; k = k + 1
     go to Step 3
   else
     go to Step 2

*Notice that you’ll need to do some book-keeping to avoid infinite loops
Full feature set

Introduction to Pattern Analysis 15


Ricardo Gutierrez-Osuna
Texas A&M University
References for Practical Knowledge

• https://machinelearningmastery.com/feature-selection-with-numerical-
input-data/

• https://www.analyticsvidhya.com/blog/2020/10/a-comprehensive-guide-
to-feature-selection-using-wrapper-methods-in-python/
Thank you!!

Any Queries??

Ensemble Methods

Dr. Naveen Saini


Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh
1
naveen@iiitl.ac.in https://sites.google.com/view/nsaini
Ensemble Methods

• Rationale
• Combining classifiers
• Bagging
• Boosting
– Ada-Boosting
Rationale
• In any application, we can use several
learning algorithms; hyperparameters
affect the final learner
• The No Free Lunch Theorem: no single learning algorithm in any domain always induces the most accurate learner
• Try many and choose the one with the best
cross-validation results
Rationale
• On the other hand …
– Each learning model comes with a set of
assumption and thus bias
– Learning is an ill-posed problem (finite data):
each model converges to a different solution
and fails under different circumstances
– Why do not we combine multiple learners
intelligently, which may lead to improved
results?
Rationale
• How about combining learners that always make
similar decisions?
– Advantages?
– Disadvantages?

• Complementary?

• To build ensemble: Your suggestions?


Rationale
• Why it works?
• Suppose there are 25 base classifiers
– Each classifier has error rate,  = 0.35
– If the base classifiers are identical, then the ensemble
will misclassify the same examples predicted incorrectly
by the base classifiers.
– Assume classifiers are independent, i.e., their errors are
uncorrelated. Then the ensemble makes a wrong prediction
only if more than half of the base classifiers predict
incorrectly.
– Probability that the ensemble classifier makes a wrong
prediction:

  Σ_{i=13..25} C(25, i) · ε^i · (1 − ε)^(25−i) ≈ 0.06
Works if …

• The base classifiers should be independent.


• The base classifiers should do better than
a classifier that performs random guessing.
(error < 0.5)
• In practice, it is hard to have base
classifiers perfectly independent.
Nevertheless, improvements have been
observed in ensemble methods when they
are slightly correlated.
Rationale
• One important note is that:
– When we generate multiple base-learners, we
want them to be reasonably accurate but do
not require them to be very accurate
individually, so they are not, and need not be,
optimized separately for best accuracy. The
base learners are not chosen for their
accuracy, but for their simplicity.
Ensemble Methods

• Rationale
• Combining classifiers
• Bagging
• Boosting
– Ada-Boosting
Combining classifiers
• Examples: classification trees and neural
networks, several neural networks, several
classification trees, etc.
• Average results from different models
• Why?
– Better classification performance than
individual classifiers
– More resilience to noise
• Why not?
– Time consuming
– Overfitting
Why
• Why?
– Better classification performance than individual
classifiers
– More resilience to noise
• Beside avoiding the selection of the worse classifier
under particular hypothesis, fusion of multiple
classifiers can improve the performance of the best
individual classifiers
• This is possible if individual classifiers make
“different” errors
• For linear combiners, Tumer and Ghosh (1996) showed that averaging the outputs of individual classifiers with unbiased and uncorrelated errors can improve the performance of the best individual classifier and, for an infinite number of classifiers, provides the optimal Bayes classifier
Different classifier Architecture
• serial
• parallel
• hybrid
Architecture
(Figures: serial, parallel, and hybrid classifier combination architectures)
Classifiers Fusion
• Fusion is useful only if the combined classifiers are
mutually complementary
• Majority vote fuser: the majority should always be correct
Complementary classifiers

• Several approaches have been proposed to


construct ensembles made up of
complementary classifiers. Among the others:
– Using problem and designer knowledge
– Injecting randomness
– Varying the classifier type, architecture, or parameters
– Manipulating training data
– Manipulating features
If you are interested …
• L. Xu, A. Krzyzak, C. Y. Suen, “Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition”, IEEE Transactions on Systems, Man, and Cybernetics, 22(3), 1992, pp. 418-435.
• J. Kittler, M. Hatef, R. Duin and J. Matas, “On Combining Classifiers”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), March 1998, pp. 226-239.
• D. Tax, M. van Breukelen, R. Duin, J. Kittler, “Combining Multiple Classifiers by Averaging or by Multiplying?”, Pattern Recognition, 33(2000), pp. 1475-1485.
• L. I. Kuncheva, “A Theoretical Study on Six Classifier Fusion Strategies”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 2002, pp. 281-286.
Alternatively …

• Instead of designing multiple


classifiers with the same dataset,
we can manipulate the training set:
multiple training sets are created
by resampling the original data
according to some distribution. E.g.,
bagging and boosting
Ensemble Methods

• Rationale
• Combining classifiers
• Bagging
• Boosting
– Ada-Boosting
Bagging
• Breiman, 1996

• Create classifiers using training sets that


are bootstrapped (drawn with
replacement)

• Average results for each case


Bagging Example

Original 1 2 3 4 5 6 7 8

Training set 1 2 7 8 3 7 6 3 1

Training set 2 7 8 5 6 4 2 7 1

Training set 3 3 6 2 7 5 6 2 2

Training set 4 4 5 1 4 6 4 3 8
Bagging
• Sampling (with replacement) according to a uniform
probability distribution
– Each bootstrap sample D has the same size as the original data.
– Some instances could appear several times in the same training set,
while others may be omitted.

• Build classifier on each bootstrap sample D


• D will contain approximately 63% of the original data.
Bagging
• Bagging improves generalization performance by reducing
variance of the base classifiers. The performance of
bagging depends on the stability of the base classifier.

– If a base classifier is unstable, bagging helps to reduce the


errors associated with random fluctuations in the training
data.
– If a base classifier is stable, bagging may not be able to
improve, rather it could degrade the performance.

• Bagging is less susceptible to model overfitting when applied


to noisy data.
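A minimal bagging sketch with scikit-learn (my own illustration; the breast-cancer data set and the tree base classifier are arbitrary choices, not from the slides):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 50 trees, each trained on a bootstrap sample drawn with replacement
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
print("CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())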
Boosting

• Sequential production of classifiers


• Each classifier is dependent on the
previous one, and focuses on the previous
one’s errors
• Examples that are incorrectly predicted in
previous classifiers are chosen more often
or weighted more heavily
Ada-Boosting
• Freund and Schapire, 1997
• Ideas
– Complex hypotheses tend to overfitting
– Simple hypotheses may not explain data
well
– Combine many simple hypotheses into a
complex one
– Ways to design simple ones, and
combination issues
Ada-Boosting
• Two approaches

– Select examples according to error in previous


classifier (more representatives of
misclassified cases are selected) – more
common

– Weigh errors of the misclassified cases higher


(all cases are incorporated, but weights are
different) – does not work for some algorithms
Boosting Example

Original 1 2 3 4 5 6 7 8

Training set 1 2 7 8 3 7 6 3 1

Training set 2 1 4 5 4 1 5 6 4

Training set 3 7 1 5 8 1 8 1 4

Training set 4 1 1 6 1 1 3 1 5
Ada-Boosting
• Input:
– Training samples S = {(xi, yi)}, i = 1, 2, …, N
– Weak learner h
• Initialization
– Each sample has equal weight wi = 1/N
• For k = 1 … T
– Train weak learner hk according to weighted sample sets
– Compute classification errors
– Update sample weights wi
• Output
– Final model which is a linear combination of hk
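A minimal AdaBoost sketch with scikit-learn (my own illustration; decision stumps as the weak learner and the breast-cancer data are arbitrary choices, not from the slides):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),   # weak learner h
                         n_estimators=50, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())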
Schematic of AdaBoost

(Figure: training samples are used to fit h1(x); the samples are then reweighted to fit h2(x), h3(x), ..., hT(x); the final prediction is Sign[sum of the ht(x)])

AdaBoost

• It penalizes models that have poor accuracy


• If any intermediate rounds produce error rate
higher than 50%, the weights are reverted back to
1/n and the resampling procedure is repeated

• Because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be quite susceptible to overfitting.
AdaBoost

• Classification
– AdaBoost.M1 (two-class problem)
– AdaBoost.M2 (multiple-class problem)
Bagging vs. Boosting
Training Data
1, 2, 3, 4, 5, 6, 7, 8

Bagging training set Boosting training set


Set 1: 2, 7, 8, 3, 7, 6, 3, 1 Set 1: 2, 7, 8, 3, 7, 6, 3, 1
Set 2: 7, 8, 5, 6, 4, 2, 7, 1 Set 2: 1, 4, 5, 4, 1, 5, 6, 4
Set 3: 3, 6, 2, 7, 5, 6, 2, 2 Set 3: 7, 1, 5, 8, 1, 8, 1, 4
Set 4: 4, 5, 1, 4, 6, 4, 3, 8 Set 4: 1, 1, 6, 1, 1, 3, 1, 5
stan simple bag arc ada stan bag arc ada

breast-cancer-w 3.4 3.5 3.4 3.8 4 5 3.7 3.5 3.5


credit-a 14.8 13.7 13.8 15.8 15.7 14.9 13.4 14 13.7
credit-g 27.9 24.7 24.2 25.2 25.3 29.6 25.2 25.9 26.7
diabetes 23.9 23 22.8 24.4 23.3 27.8 24.4 26 25.7
glass 38.6 35.2 33.1 32 31.1 31.3 25.8 25.5 23.3
heart-cleveland 18.6 17.4 17 20.7 21.1 24.3 19.5 21.5 20.8
hepatitis 20.1 19.5 17.8 19 19.7 21.2 17.3 16.9 17.2
house-votes-84 4.9 4.8 4.1 5.1 5.3 3.6 3.6 5 4.8
hypo 6.4 6.2 6.2 6.2 6.2 0.5 0.4 0.4 0.4
ionosphere 9.7 7.5 9.2 7.6 8.3 8.1 6.4 6 6.1
iris 4.3 3.9 4 3.7 3.9 5.2 4.9 5.1 5.6
kr-vs-kp 2.3 0.8 0.8 0.4 0.3 0.6 0.6 0.3 0.4
labor 6.1 3.2 4.2 3.2 3.2 16.5 13.7 13 11.6
letter 18 12.8 10.5 5.7 4.6 14 7 4.1 3.9
promoters-936 5.3 4.8 4 4.5 4.6 12.8 10.6 6.8 6.4
ribosome-bind 9.3 8.5 8.4 8.1 8.2 11.2 10.2 9.3 9.6
satellite 13 10.9 10.6 9.9 10 13.8 9.9 8.6 8.4
segmentation 6.6 5.3 5.4 3.5 3.3 3.7 3 1.7 1.5
sick 5.9 5.7 5.7 4.7 4.5 1.3 1.2 1.1 1
sonar 16.6 15.9 16.8 12.9 13 29.7 25.3 21.5 21.7
soybean 9.2 6.7 6.9 6.7 6.3 8 7.9 7.2 6.7
splice 4.7 4 3.9 4 4.2 5.9 5.4 5.1 5.3
vehicle 24.9 21.2 20.7 19.1 19.7 29.4 27.1 22.5 22.9

Columns: 1. single NN; 2. simple NN ensemble; 3. bagging of NNs; 4. arcing of NNs; 5. AdaBoost of NNs;
6. single decision tree; 7. bagging of decision trees; 8. arcing of decision trees; 9. AdaBoost of decision trees.
Neural Networks

(Figure: reduction in error for Ada-Boosting, arcing, and bagging of neural networks, shown as a percentage of the original error rate together with its standard deviation; the white bar represents one standard deviation.)
Decision Trees

(Figure: the corresponding reduction-in-error plot for decision trees.)

Composite Error Rates

(Figures comparing composite error rates: Neural Networks, bagging vs. simple ensembles; Ada-Boost, neural networks vs. decision trees, with the box representing the reduction in error for NN and DT; and the same comparison for arcing and bagging.)
Noise
• Hurts boosting the most, since noisy (mislabelled) examples keep being up-weighted
Conclusions
• Performance depends on data and classifier
• In some cases, ensembles can overcome the bias of the
component learning algorithm
• Bagging is more consistent than boosting
• Boosting can give much better results on some
data
Thank you!!

Any Queries??
Multi-Label Classification
Dr. Naveen Saini
Assistant Professor

Department of Computer Science


Indian Institute of Information Technology Lucknow
Uttar Pradesh

naveen@iiitl.ac.in https://sites.google.com/view/nsaini1
Multi-label Classification

Binary classification: Is this a picture of the sea?

∈ {yes, no}
Multi-label Classification

Multi-class classification: What is this a picture of?

∈ {sea, sunset, trees, people, mountain, urban}


Multi-label Classification

Multi-label classification: Which labels are relevant to this


picture?

⊆ {sea, sunset, trees, people, mountain, urban}

i.e., multiple labels per instance instead of a single label!


Multi-label Classification

        K = 2          K > 2
L = 1   binary         multi-class
L > 1   multi-label    multi-output†

† also known as multi-target, multi-dimensional.

Table: for L target variables (labels), each taking one of K values.

Multi-output can be cast to multi-label, just as multi-class can be cast to binary.
Tagging / keyword assignment: the set of labels is not predefined.
Increasing Interest

year in text in title


1996-2000 23 1
2001-2005 188 18
2006-2010 1470 164
2011-2015 4550 485
Table: Academic articles containing the phrase ‘multi-label
classification’ (Google Scholar)
Single-label vs. Multi-label
Table: Single-label Y ∈ {0, 1}
X1 X2 X3 X4 X5 Y
1 0.1 3 1 0 0
0 0.9 1 0 1 1
0 0.0 1 1 0 0
1 0.8 2 0 1 1
1 0.0 2 0 1 0
0 0.0 3 1 1 ?

Table: Multi-label Y ⊆ {λ1 , . . . , λL }


X1 X2 X3 X4 X5 Y
1 0.1 3 1 0 {λ2 , λ3 }
0 0.9 1 0 1 {λ1 }
0 0.0 1 1 0 {λ2 }
1 0.8 2 0 1 {λ1 , λ4 }
1 0.0 2 0 1 {λ4 }
0 0.0 3 1 1 ?
Single-label vs. Multi-label
Table: Single-label Y ∈ {0, 1}
X1 X2 X3 X4 X5 Y
1 0.1 3 1 0 0
0 0.9 1 0 1 1
0 0.0 1 1 0 0
1 0.8 2 0 1 1
1 0.0 2 0 1 0
0 0.0 3 1 1 ?

Table: Multi-label [Y1 , . . . , YL ] ∈ 2^L


X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4
1 0.1 3 1 0 0 1 1 0
0 0.9 1 0 1 1 0 0 0
0 0.0 1 1 0 0 1 0 0
1 0.8 2 0 1 1 0 0 1
1 0.0 2 0 1 0 0 0 1
0 0.0 3 1 1 ? ? ? ?
Outline
1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources


Text Categorization
For example, the news . . .

Novo Banco: Portugal bank sell-off hits snag

Portugal’s central bank has missed its deadline to sell


Novo Banco, a bank created after the collapse of the
country’s second-biggest lender.

Reuters collection, newswire stories into 103 topic codes


Text Categorization

For example, the IMDb dataset: Textual movie plot summaries


associated with genres (labels).
(Figure: plot-summary words such as 'abandoned', 'wedding', 'accident', 'violent' linked to genre labels such as romance, horror, comedy, action.)
i X1 X2 ... X1000 X1001 Y1 Y2 ... Y27 Y28
1 1 0 ... 0 1 0 1 ... 0 0
2 0 1 ... 1 0 1 0 ... 0 0
3 0 0 ... 0 1 0 1 ... 0 0
4 1 1 ... 0 1 1 0 ... 0 1
5 1 1 ... 0 1 0 1 ... 0 1
.. .. .. .. .. .. .. .. .. .. ..
. . . . . . . . . . .
120919 1 1 ... 0 0 0 0 ... 0 1
Labelling E-mails

For example, the Enron e-mails multi-labelled to 53


categories by the UC Berkeley Enron Email Analysis
Project
Company Business, Strategy, etc.
Purely Personal
Empty Message
Forwarded email(s)
...
company image – current
...
Jokes, humor (related to business)
...
Emotional tone: worry / anxiety
Emotional tone: sarcasm
...
Emotional tone: shame
Company Business, Strategy, etc.
Labelling Images

Images are labelled to indicate


multiple concepts
multiple objects
multiple people
e.g., Scene data with concept labels
⊆ {beach, sunset, foliage, field, mountain, urban}
Applications: Audio
Labelling music/tracks with genres / voices, concepts, etc.

e.g., Music dataset, audio tracks labelled with different moods,


among: {
amazed-surprised,
happy-pleased,
relaxing-calm,
quiet-still,
sad-lonely,
angry-aggressive
}
Outline
1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources


Single-label Classification

(Diagram: features x1, x2, x3, x4, x5 feed into a single output y.)

ŷ = h(x) = argmax_{y ∈ {0,1}} p(y | x)        (MAP estimate by classifier h)
Multi-label Classification
(Diagram: the input x feeds into four outputs y1, y2, y3, y4.)

ŷj = hj(x) = argmax_{yj ∈ {0,1}} p(yj | x),      for each index j = 1, . . . , L

and then,

ŷ = h(x) = [ŷ1 , . . . , ŷ4]
         = [ argmax_{y1 ∈ {0,1}} p(y1 | x), · · · , argmax_{y4 ∈ {0,1}} p(y4 | x) ]
         = [ f1(x), · · · , f4(x) ] = f(W⊤x)

This is the Binary Relevance method (BR).


Outline
1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources


BR Transformation
1 Transform dataset . . .
X Y1 Y2 Y3 Y4
x(1) 0 1 1 0
x(2) 1 0 0 0
x(3) 0 1 0 0
x(4) 1 0 0 1
x(5) 0 0 0 1
. . . into L separate binary problems (one for each label)
X Y1 X Y2 X Y3 X Y4
x(1) 0 x(1) 1 x(1) 1 x(1) 0
x(2) 1 x(2) 0 x(2) 0 x(2) 0
x(3) 0 x(3) 1 x(3) 0 x(3) 0
x(4) 1 x(4) 0 x(4) 0 x(4) 1
x(5) 0 x(5) 0 x(5) 0 x(5) 1
2 and train with any off-the-shelf binary base classifier.
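A minimal Binary Relevance sketch in Python (one possible implementation, assuming scikit-learn is available), reusing the toy feature and label tables shown earlier; the class name BinaryRelevance and the choice of logistic regression as base classifier are our own.

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class BinaryRelevance:
    def __init__(self, base=None):
        self.base = base if base is not None else LogisticRegression(max_iter=1000)

    def fit(self, X, Y):
        # Y is an (N, L) matrix of 0/1 labels; train one binary classifier per column
        self.models_ = [clone(self.base).fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models_])

# Toy data from the earlier tables (5 features, L = 4 labels)
X = np.array([[1, 0.1, 3, 1, 0],
              [0, 0.9, 1, 0, 1],
              [0, 0.0, 1, 1, 0],
              [1, 0.8, 2, 0, 1],
              [1, 0.0, 2, 0, 1]])
Y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])
print(BinaryRelevance().fit(X, Y).predict(np.array([[0, 0.0, 3, 1, 1]])))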
Classifier Chains
Modelling label dependence,

(Diagram: a chain over the labels y1 → y2 → y3 → y4, each also conditioned on x.)

p(y | x) ∝ p(x) ∏_{j=1}^{L} p(yj | x, y1 , . . . , yj−1)

and,

ŷ = argmax_{y ∈ {0,1}^L} p(y | x)
CC Transformation

Similar to BR: make L binary problems, but include previous


predictions as feature attributes,

X Y1 | X Y1 Y2 | X Y1 Y2 Y3 | X Y1 Y2 Y3 Y4
x(1) 0 x(1) 0 1 x(1) 0 1 1 x(1) 0 1 1 0
x(2) 1 x(2) 1 0 x(2) 1 0 0 x(2) 1 0 0 0
x(3) 0 x(3) 0 1 x(3) 0 1 0 x(3) 0 1 0 0
x(4) 1 x(4) 1 0 x(4) 1 0 0 x(4) 1 0 0 1
x(5) 0 x(5) 0 0 x(5) 0 0 0 x(5) 0 0 0 1

and, again, apply any classifier (not necessarily a probabilistic


one)!
Greedy CC
x

y1 y2 y3 y4

L classifiers for L labels. For test instance x̃, classify [22],


1 ŷ1 = h1 (x̃)
2 ŷ2 = h2 (x̃, ŷ1 )
3 ŷ3 = h3 (x̃, ŷ1 , ŷ2 )
4 ŷ4 = h4 (x̃, ŷ1 , ŷ2 , ŷ3 )
and return
ŷ = [ŷ1 , . . . , ŷL ]
Example

(Greedy chain inference for a test instance x̃ over labels y1, y2, y3; the original slides draw this as a tree whose branches are annotated with probabilities such as 0.6/0.4 and 0.7/0.3.)

1  ŷ1 = h1(x̃) = argmax_{y1} p(y1 | x̃) = 1
2  ŷ2 = h2(x̃, ŷ1) = . . . = 0
3  ŷ3 = h3(x̃, ŷ1, ŷ2) = . . . = 1

ŷ = h(x̃) = [1, 0, 1]

Improves over BR; similar build time (if L < D);
able to use any off-the-shelf classifier for hj; parallelizable.
But errors may be propagated down the chain.
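A greedy classifier-chain sketch under the same assumptions as the Binary Relevance sketch earlier (scikit-learn base learners; the class below is our own simplification, although scikit-learn also provides sklearn.multioutput.ClassifierChain with the same idea).

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class GreedyClassifierChain:
    def __init__(self, base=None):
        self.base = base if base is not None else LogisticRegression(max_iter=1000)

    def fit(self, X, Y):
        self.models_ = []
        Z = X
        for j in range(Y.shape[1]):
            self.models_.append(clone(self.base).fit(Z, Y[:, j]))
            Z = np.column_stack([Z, Y[:, j]])     # training uses the true previous labels
        return self

    def predict(self, X):
        preds, Z = [], X
        for m in self.models_:
            yj = m.predict(Z)
            preds.append(yj)
            Z = np.column_stack([Z, yj])          # prediction feeds earlier predictions forward,
        return np.column_stack(preds)             # so early mistakes can propagate down the chain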
Label Powerset Method (LP)
1 Transform dataset . . .
X Y1 Y2 Y3 Y4
x(1) 0 1 1 0
x(2) 1 0 0 0
x(3) 0 1 1 0
x(4) 1 0 0 1
x(5) 0 0 0 1
. . . into a multi-class problem, taking 2^L possible values:
X Y ∈ 2^L
x(1) 0110
x(2) 1000
x(3) 0110
x(4) 1001
x(5) 0001
2 . . . and train any off-the-shelf multi-class classifier.
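A minimal Label Powerset sketch (an illustrative implementation, assuming scikit-learn): each distinct row of Y is encoded as one multi-class "labelset" value, a single multi-class classifier is trained, and predictions are decoded back into 0/1 vectors.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class LabelPowerset:
    def __init__(self, base=None):
        self.base = base if base is not None else DecisionTreeClassifier(random_state=0)

    def fit(self, X, Y):
        keys = ["".join(map(str, row)) for row in Y]           # e.g. [0, 1, 1, 0] -> "0110"
        self.classes_ = sorted(set(keys))                      # distinct labelsets seen in training
        y_multiclass = np.array([self.classes_.index(k) for k in keys])
        self.base.fit(X, y_multiclass)
        return self

    def predict(self, X):
        decoded = [list(map(int, self.classes_[c])) for c in self.base.predict(X)]
        return np.array(decoded)

By construction this sketch can only ever predict labelsets that were seen during training, which is exactly the overfitting issue discussed next.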
Issues with LP

complexity: there is no greedy label-by-label option


imbalance: few examples per class label
overfitting: how can it predict a label combination not seen in training?

Example
In the Enron dataset, 44% of labelsets are unique (i.e., occur in a single
training example or test instance). In the del.icio.us dataset, 98%
are unique.
RAkEL
X Y ∈ 2^L
x(1) 0110
x(2) 1000
x(3) 0110
x(4) 1001
x(5) 0001

Ensembles of RAndom k-labEL subsets (RAkEL) [27]


Do LP on M subsets ⊂ {1, . . . , L} of size k
X Y123 ∈ 2^k X Y124 ∈ 2^k X Y234 ∈ 2^k
x(1) 011 x(1) 010 x(1) 110
x(2) 100 x(2) 100 x(2) 000
x(3) 011 x(3) 010 x(3) 110
x(4) 100 x(4) 101 x(4) 001
x(5) 000 x(5) 001 x(5) 001
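A rough RAkEL-style sketch, reusing the LabelPowerset class from the previous sketch: M random label subsets of size k, one LP model per subset, and per-label voting over the subsets that include each label. The function names and the 0.5 voting threshold are our own simplifications of the method in [27].

import numpy as np

def rakel_fit(X, Y, M=3, k=3, seed=0):
    rng = np.random.default_rng(seed)
    L = Y.shape[1]
    subsets = [sorted(rng.choice(L, size=k, replace=False)) for _ in range(M)]
    models = [LabelPowerset().fit(X, Y[:, s]) for s in subsets]   # LP on each k-label subset
    return subsets, models

def rakel_predict(X, subsets, models, L):
    votes = np.zeros((X.shape[0], L))
    counts = np.zeros(L)
    for s, m in zip(subsets, models):
        votes[:, s] += m.predict(X)       # add this subset's 0/1 predictions for its labels
        counts[s] += 1                    # how many subsets vote on each label
    return (votes / np.maximum(counts, 1) > 0.5).astype(int)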
Ensemble-based Voting
Most problem-transformation methods are ensemble-based,
e.g., ECC, EPS, RAkEL.

Ensemble Voting
ŷ1 ŷ2 ŷ3 ŷ4
h1 (x̃) 1 1 1 x
h2 (x̃) 0 1 0
h3 (x̃) 1 0 0 y123 y124 y134 y234
h4 (x̃) 1 0 0
score 0.75 0.25 0.75 0 y1 y2 y3 y4
ŷ 1 0 1 0

more predictive power (ensemble effect)


LP can predict novel label combinations
Outline
1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources


Algorithm Adaptation

1 Take your favourite (most suitable) classifier


2 Modify it for multi-label classification

Advantage: a single model, usually very scalable


Disadvantage: predictive performance depends on the
problem domain
k Nearest Neighbours (kNN)
Assign to x̃ the majority class of the k ‘nearest neighbours’
ŷ = argmax_y Σ_{i ∈ Nk} I[ y^(i) = y ]

where Nk contains the training pairs with x(i) closest to x̃.

(Figure: a 2-D scatter plot of training points from classes c1, ..., c6 over features x1 and x2, with a query point '?' to be classified.)
Multi-label kNN
Assigns the most common labels of the k nearest neighbours
p(yj = 1 | x) = (1/k) Σ_{i ∈ Nk} yj^(i)

ŷj = argmax_{yj ∈ {0,1}} p(yj | x),   i.e. predict yj = 1 iff p(yj = 1 | x) > 0.5

(Figure: a 2-D scatter plot over features x1 and x2 in which each training point is tagged with a label set such as 000, 001, 010, 011, 101, plus a query point '?'.)

For example, [32]. Related to ensemble voting.
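A simplified sketch of the rule above (the fraction of positive neighbours per label, thresholded at 0.5); the full ML-kNN algorithm of [32] additionally applies a Bayesian (MAP) correction that is omitted here.

import numpy as np

def multilabel_knn_predict(X_train, Y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    neighbours = np.argsort(dists)[:k]                  # indices of the k nearest neighbours (N_k)
    p = Y_train[neighbours].mean(axis=0)                # p(y_j = 1 | x) estimated per label
    return (p > 0.5).astype(int), p                     # predicted label vector and the probabilities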


Decision Trees

(Figure: a decision tree that splits on x1 (>0.3 vs ≤0.3), then on x3 (>−2.9 vs ≤−2.9), then on x2 (=A vs =B), with label assignments (e.g. y, ¬y) stored at the leaves.)
construct like C4.5 (multi-label entropy [3])


multiple labels at the leaves
predictive clustering trees [12] are highly competitive in
a random forest / ensemble
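As a hedged aside, scikit-learn's tree-based classifiers accept a two-dimensional label matrix directly (multi-output classification), which is one simple way to obtain multiple labels at the leaves; it is related to, but not the same as, the predictive clustering trees of [12]. The toy data below repeats the earlier tables.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 0.1, 3, 1, 0],
              [0, 0.9, 1, 0, 1],
              [0, 0.0, 1, 1, 0],
              [1, 0.8, 2, 0, 1],
              [1, 0.0, 2, 0, 1]])
Y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, Y)
print(forest.predict(np.array([[0, 0.0, 3, 1, 1]])))   # one 0/1 vector of length 4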
Outline
1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources


Multi-label Evaluation
In single-label classification, we simply compare the true label y with the
predicted label ŷ [or p(y | x̃)]. What about in multi-label
classification?
Example
If the true label vector is y = [1, 0, 0, 0], then ŷ = ?

(Labels on the slide: mountain, foliage, urban, beach.)

candidate ŷ: [1 0 0 0]
candidate ŷ: [1 1 0 0]
candidate ŷ: [0 0 0 0]
candidate ŷ: [0 1 1 1]

compare bit-wise? too lenient?


compare vector-wise? too strict?
Hamming Loss
Example
y(i) ŷ(i)
x̃(1) [1 0 1 0] [1 0 0 1]
x̃(2) [0 1 0 1] [0 1 0 1]
x̃(3) [1 0 0 1] [1 0 0 1]
x̃(4) [0 1 1 0] [0 1 0 0]
x̃(5) [1 0 0 0] [1 0 0 1]

Hamming loss = (1 / (N·L)) Σ_{i=1}^{N} Σ_{j=1}^{L} I[ ŷj^(i) ≠ yj^(i) ]
             = 0.20
0/1 Loss
Example
y(i) ŷ(i)
x̃(1) [1 0 1 0] [1 0 0 1]
x̃(2) [0 1 0 1] [0 1 0 1]
x̃(3) [1 0 0 1] [1 0 0 1]
x̃(4) [0 1 1 0] [0 1 0 0]
x̃(5) [1 0 0 0] [1 0 0 1]

0/1 loss = (1/N) Σ_{i=1}^{N} I[ ŷ^(i) ≠ y^(i) ]
         = 0.60
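A small Python check of the two worked examples above, assuming the true and predicted label matrices are exactly as printed in the tables.

import numpy as np

Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])
Y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

hamming = (Y_true != Y_pred).mean()               # average over all N*L label decisions
zero_one = (Y_true != Y_pred).any(axis=1).mean()  # average over examples (whole vector must match)
print(hamming, zero_one)                          # 0.2 and 0.6, matching the slides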
Other Metrics
JACCARD INDEX – often called multi-label ACCURACY
RANK LOSS – average fraction of pairs not correctly ordered
ONE ERROR – if top ranked label is not in set of true labels
COVERAGE – average “depth” to cover all true labels
LOG LOSS – i.e., cross entropy
PRECISION – predicted positive labels that are relevant
RECALL – relevant labels which were predicted
PRECISION vs. RECALL curves
F-MEASURE
micro-averaged (‘global’ view)
macro-averaged by label (ordinary averaging of a binary
measure, changes in infrequent labels have a big impact)
macro-averaged by example (one example at a time,
average across examples)

For general evaluation, use multiple and contrasting


evaluation measures!
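Several of these measures are available in scikit-learn for multi-label (binary indicator) inputs; a hedged usage sketch on the same toy example as above:

import numpy as np
from sklearn.metrics import hamming_loss, zero_one_loss, jaccard_score, f1_score

Y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0]])
Y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 1]])

print("Hamming loss      :", hamming_loss(Y_true, Y_pred))                       # 0.20
print("0/1 (subset) loss :", zero_one_loss(Y_true, Y_pred))                      # 0.60
print("Jaccard (samples) :", jaccard_score(Y_true, Y_pred, average='samples'))   # multi-label accuracy
print("micro-avg F1      :", f1_score(Y_true, Y_pred, average='micro'))
print("macro-avg F1      :", f1_score(Y_true, Y_pred, average='macro'))          # may warn for labels never predicted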
Hamming Loss vs. 0/1 Loss

Hamming loss
  a bit-wise (per-label) measure, suitable for evaluating
      ŷj = argmax_{yj ∈ {0,1}} p(yj | x),   i.e., BR
  favours sparse labelling
  does not benefit directly from modelling label dependence

0/1 loss
  a vector-wise (per-example) measure, suitable for evaluating
      ŷ = argmax_{y ∈ {0,1}^L} p(y | x),   i.e., PCC, LP
  does not favour sparse labelling
  benefits from models of label dependence
Hamming Loss vs. 0/1 Loss

Example: 0/1 loss vs. Hamming loss

        y(i)        ŷ(i)
x̃(1)   [1 0 1 0]   [1 0 0 1]
x̃(2)   [1 0 0 1]   [1 0 0 1]      Hamming loss 0.3
x̃(3)   [0 1 1 0]   [0 1 0 0]      0/1 loss     0.6
x̃(4)   [1 0 0 0]   [1 0 1 1]
x̃(5)   [0 1 0 1]   [0 1 0 1]
Hamming Loss vs. 0/1 Loss

Example: 0/1 loss vs. Hamming loss

        y(i)        ŷ(i)          Optimize Hamming loss . . .
x̃(1)   [1 0 1 0]   [1 0 1 1]
x̃(2)   [1 0 0 1]   [1 1 0 1]      Hamming loss 0.2
x̃(3)   [0 1 1 0]   [0 1 1 0]      0/1 loss     0.8
x̃(4)   [1 0 0 0]   [1 0 1 0]
x̃(5)   [0 1 0 1]   [0 1 0 1]      . . . 0/1 loss goes up
Hamming Loss vs. 0/1 Loss

Example: 0/1 loss vs. Hamming loss

        y(i)        ŷ(i)          Optimize 0/1 loss . . .
x̃(1)   [1 0 1 0]   [0 1 0 1]
x̃(2)   [1 0 0 1]   [1 0 0 1]      Hamming loss 0.4
x̃(3)   [0 1 1 0]   [0 0 1 0]      0/1 loss     0.4
x̃(4)   [1 0 0 0]   [0 1 1 1]      . . . Hamming loss goes up
x̃(5)   [0 1 0 1]   [0 1 0 1]
Hamming Loss vs. 0/1 Loss

Example: 0/1 loss vs. Hamming loss

        y(i)        ŷ(i)
x̃(1)   [1 0 1 0]   [0 1 0 1]
x̃(2)   [1 0 0 1]   [1 0 0 1]
x̃(3)   [0 1 1 0]   [0 0 1 0]
x̃(4)   [1 0 0 0]   [0 1 1 1]
x̃(5)   [0 1 0 1]   [0 1 0 1]

Usually cannot minimize both at the same time . . .


. . . unless: labels are independent of each other! [5]
Resources

Overview [26]
Review/Survey of Algorithms [33]
Extensive empirical comparison [14]
Some slides: A, B, C
http://users.ics.aalto.fi/jesse/
Software & Datasets

Mulan (Java)
Meka (Java)
Scikit-Learn (Python) offers some multi-label support
Clus (Java)
LAMDA (Matlab)
Datasets
http://mulan.sourceforge.net/datasets.html
http://meka.sourceforge.net/#datasets
MEKA

A WEKA-based framework for multi-label classification


and evaluation
support for data-stream, semi-supervised classification

http://meka.sourceforge.net
A MEKA Classifier
package weka.classifiers.multilabel;
import weka.core.*;

public class DumbClassifier extends MultilabelClassifier {

    /**
     * BuildClassifier
     */
    public void buildClassifier(Instances D) throws Exception {
        // the first L attributes are the labels
        int L = D.classIndex();
    }

    /**
     * DistributionForInstance - return the distribution p(y[j] | x)
     */
    public double[] distributionForInstance(Instance x) throws Exception {
        int L = x.classIndex();
        // predict 0 for each label
        return new double[L];
    }
}
References
Antonucci Alessandro, Giorgio Corani, Denis Mauá, and Sandra Gabaglio.
An ensemble of Bayesian networks for multilabel classification.
In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI’13, pages
1220–1225. AAAI Press, 2013.

Hanen Borchani.
Multi-dimensional classification using Bayesian networks for stationary and evolving streaming data.
PhD thesis, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de
Madrid, 2013.

Amanda Clare and Ross D. King.


Knowledge discovery in multi-label phenotype data.
Lecture Notes in Computer Science, 2168, 2001.

Krzysztof Dembczyński, Weiwei Cheng, and Eyke Hüllermeier.


Bayes optimal multilabel classification via probabilistic classifier chains.
In ICML ’10: 27th International Conference on Machine Learning, pages 279–286, Haifa, Israel, June 2010.
Omnipress.

Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier.


On label dependence and loss minimization in multi-label classification.
Mach. Learn., 88(1-2):5–45, July 2012.

Chun-Sung Ferng and Hsuan-Tien Lin.


Multi-label classification with error-correcting codes.
In Proceedings of the 3rd Asian Conference on Machine Learning, ACML 2011, Taoyuan, Taiwan, November
13-15, 2011, pages 281–295, 2011.

Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker.
Multilabel classification via calibrated label ranking.
Machine Learning, 73(2):133–153, November 2008.

Nadia Ghamrawi and Andrew McCallum.


Collective multi-label classification.
In CIKM ’05: 14th ACM international Conference on Information and Knowledge Management, pages
195–200, New York, NY, USA, 2005. ACM Press.

Shantanu Godbole and Sunita Sarawagi.


Discriminative methods for multi-labeled classification.
In PAKDD ’04: Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22–30.
Springer, 2004.

Yuhong Guo and Suicheng Gu.


Multi-label classification using conditional dependency networks.
In IJCAI ’11: 24th International Conference on Artificial Intelligence, pages 1300–1305. IJCAI/AAAI, 2011.

Daniel Hsu, Sham M. Kakade, John Langford, and Tong Zhang.


Multi-label prediction via compressed sensing.
In NIPS ’09: Neural Information Processing Systems 2009, 2009.

Dragi Kocev, Celine Vens, Jan Struyf, and Sašo Džeroski.


Tree ensembles for predicting structured outputs.
Pattern Recognition, 46(3):817–833, March 2013.

Abhishek Kumar, Shankar Vembu, Aditya Krishna Menon, and Charles Elkan.


Beam search algorithms for multilabel learning.
Machine Learning, 92(1):65–89, 2013.

Gjorgji Madjarov, Dragi Kocev, Dejan Gjorgjevikj, and Sašo Džeroski.


An extensive experimental comparison of methods for multi-label learning.
Pattern Recognition, 45(9):3084–3104, September 2012.

Andrew Kachites McCallum.


Multi-label text classification with a mixture model trained by EM.
In AAAI 99 Workshop on Text Learning, 1999.

Antti Puurula, Jesse Read, and Albert Bifet.


Kaggle LSHTC4 winning solution.
Technical report, Kaggle LSHTC4 Winning Solution, 2014.

Piyush Rai and Hal Daume.


Multi-label prediction via sparse infinite CCA.
In NIPS 2009: Advances in Neural Information Processing Systems 22, pages 1518–1526. 2009.

Jesse Read and Jaakko Hollmén.


A deep interpretation of classifier chains.
In Advances in Intelligent Data Analysis XIII - 13th International Symposium, IDA 2014, pages 251–262,
October 2014.
Jesse Read and Jaakko Hollmén.
Multi-label classification using labels as hidden nodes.
ArXiv.org, stats.ML(1503.09022v1), 2015.

Jesse Read, Luca Martino, and David Luengo.


Efficient monte carlo methods for multi-dimensional learning with classifier chains.
Pattern Recognition, 47(3):1535–1546, 2014.

Jesse Read, Bernhard Pfahringer, and Geoff Holmes.


Multi-label classification using ensembles of pruned sets.
In ICDM 2008: Eighth IEEE International Conference on Data Mining, pages 995–1000. IEEE, 2008.

Jesse Read, Bernhard Pfahringer, Geoffrey Holmes, and Eibe Frank.


Classifier chains for multi-label classification.
Machine Learning, 85(3):333–359, 2011.

Jesse Read, Antti Puurula, and Albert Bifet.


Multi-label classification with meta labels.
In ICDM’14: IEEE International Conference on Data Mining (ICDM 2014), pages 941–946. IEEE, December
2014.
Robert E. Schapire and Yoram Singer.
Boostexter: A boosting-based system for text categorization.
Machine Learning, 39(2/3):135–168, 2000.

F. A. Thabtah, P. Cowling, and Yonghong Peng.


MMAC: A new multi-class, multi-label associative classification approach.
In ICDM ’04: Fourth IEEE International Conference on Data Mining, pages 217–224, 2004.
Grigorios Tsoumakas and Ioannis Katakis.
Multi label classification: An overview.
International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas.


Random k-labelsets for multi-label classification.
IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, 2011.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas.


Effective and efficient multilabel classification in domains with large number of labels.
In ECML/PKDD Workshop on Mining Multidimensional Data, 2008.

Jason Weston, Olivier Chapelle, André Elisseeff, Bernhard Schölkopf, and Vladimir Vapnik.
Kernel dependency estimation.
In NIPS, pages 897–904, 2003.

Julio H. Zaragoza, Luis Enrique Sucar, Eduardo F. Morales, Concha Bielza, and Pedro Larrañaga.
Bayesian chain classifiers for multidimensional classification.
In 24th International Joint Conference on Artificial Intelligence (IJCAI ’11), pages 2192–2197, 2011.

Min-Ling Zhang and Kun Zhang.


Multi-label learning by exploiting label dependency.
In KDD ’10: 16th ACM SIGKDD International conference on Knowledge Discovery and Data mining, pages
999–1008. ACM, 2010.

Min-Ling Zhang and Zhi-Hua Zhou.


ML-KNN: A lazy learning approach to multi-label learning.
Pattern Recognition, 40(7):2038–2048, 2007.

Min-Ling Zhang and Zhi-Hua Zhou.


A review on multi-label learning algorithms.
IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints):1, 2013.
References

• http://www.ecmlpkdd2015.org/sites/default/files/JesseRead.pdf
Any Queries:
naveen@iiitl.ac.in
