

Welcome to:

Introduction to Classification & Classification Algorithms

Unit objectives

After completing this unit, you should be able to:

• Understand the concepts of classification

• Gain knowledge on applications of classification algorithms

• Gain an insight into the classification techniques such as k-nearest neighbor algorithm,
decision trees, Naive Bayesian classifier

• Gain an insight into the ensemble methods namely bagging and boosting

• Gain knowledge on advanced classification methods namely support vector machines,


artificial neural networks

• Understand the metrics for evaluation of classification methods


Preamble: Machine learning

Classify faces and expressions

Introduction

Source: Dengxin Dai and Luc Van Gool, "Unsupervised High-level Feature Learning by Ensemble Projection for Semi-supervised Image Classification and Image Clustering."
ML classifier
• Topic 1 words:
– Baseball, owners, sports, selig, ball, bill, Indians, isringhausen, mets, minors, players, specter,
stadium, power, send, new, bud, comes, compassion, game, headaches, lite, nfl, powerful,
strawberry, urges, home, ambassadors, building, calendar, commish, costs, day, dolan, drive, hits,
league, little, match, payments, pitch, play, player, red, stadiums, umpire, wife, youth, field, leads.

• Topic 2 words:
– Merger, business, bank, buy, announces, new, acquisition, finance, companies, com, company,
disclosure, emm, news, us, acquire, chemical, inc, results, shares, takeover, corporation, european,
financial, investment, market, quarter, two, acquires, bancorp, bids, communications, first, mln,
purchase, record, stake, west, sale, bid, bn, brief, briefs, capital, control, europe, inculab.
Classification and general approach

[Figure: unlabeled points, marked "?", plotted in a two-dimensional feature space with axes x1 and x2; the task is to assign each point to a class.]
Classification algorithms

• Each algorithm has its own properties with respect to bias, speed, accuracy and transparency:
– Fisher linear discriminant.
– KNN.
– Decision tree.
– Neural networks.
– SVM.
– Naïve Bayes.
– AdaBoost, etc.
Instance based learning

It's very similar to a desktop!
K-Nearest Neighbor (KNN)

• Most basic instance-based method.

• Data are represented in a vector space: $f : \mathbb{R}^d \to V$.

• Supervised learning.

• Feature Space:
– All instances correspond to points in an n-dimensional Euclidean space.
– Classification is delayed till a new instance arrives.
– Classification done by comparing feature vectors of the different points.
– Target function may be discrete or real-valued.
(1) (1) (2) (2) (n ) (n )


 x , f ( x ) , x f ( x ) ,..., x , f ( x ) 
x1
   d

 (x
x
 
 2 xy   y i )2
x   ..   d i
 ..  i1


 x d
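As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of this distance-based majority vote; the function name knn_predict and the toy data are hypothetical:

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new instance to every stored example
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]          # majority vote over the neighbours

# Usage: four labelled points in a 2-d feature space, one query point
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([4.0, 4.0])))  # -> 1

Because classification is delayed until a query arrives, all the work happens at prediction time; nothing is "trained" in advance.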
Decision trees

• Goal: Categorization.

• Given an event, predict its category. Examples:


– Who won a given ball game?
– How should we file a given email?
– What word sense was intended for a given occurrence of a word?

• Event = list of features. Examples:


– Ball game: Which players were on offense?
– Email: Who sent the email?
– Disambiguation: What was the preceding word?
Attribute selection measure: Information gain

• Select the attribute with the highest information gain.

• Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

• Expected information (entropy) needed to classify a tuple in D:


$\text{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

• Information needed (after using A to split D into v partitions) to classify D:

$\text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \text{Info}(D_j)$

• Information gained by branching on attribute A (computed in the sketch below): $\text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D)$
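A short standard-library sketch of these formulas (an illustration, not from the original deck; all names and the toy records are hypothetical):

import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)
    base = entropy([r[target] for r in rows])
    n = len(rows)
    split_info = 0.0
    for v in set(r[attr] for r in rows):
        part = [r[target] for r in rows if r[attr] == v]
        split_info += len(part) / n * entropy(part)
    return base - split_info

# Usage on three toy records
rows = [{"Weather": "Sunny", "Decision": "Cinema"},
        {"Weather": "Sunny", "Decision": "Tennis"},
        {"Weather": "Rainy", "Decision": "Cinema"}]
print(info_gain(rows, "Weather", "Decision"))  # ~0.252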


ID3 algorithm

[Figure: first ID3 split, on Outlook. The full sample [D1, D2, ..., D14] with class counts [9+, 5-] splits into
Sunny:    S_sunny = [D1, D2, D8, D9, D11]  with [2+, 3-]  (impure: split further)
Overcast: [D3, D7, D12, D13]               with [4+, 0-]  (pure: label Yes)
Rain:     [D4, D5, D6, D10, D14]           with [3+, 2-]  (impure: split further)]

Gain(S_sunny, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
Gain(S_sunny, Temp.)    = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
Gain(S_sunny, Wind)     = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019
Decision tree: Weekend example

• Step-1: Let T be the set of training instances.


Weekend (Example)   Weather   Parents   Money   Decision (Category)
W1                  Sunny     Yes       Rich    Cinema
W2                  Sunny     No        Rich    Tennis
W3                  Windy     Yes       Rich    Cinema
W4                  Rainy     Yes       Poor    Cinema
W5                  Rainy     No        Rich    Stay in
W6                  Rainy     Yes       Poor    Cinema
W7                  Windy     No        Poor    Cinema
W8                  Windy     No        Rich    Shopping
W9                  Windy     Yes       Rich    Cinema
W10                 Sunny     No        Rich    Tennis
Converting a tree to rules

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes

R1: If (Outlook = Sunny) ∧ (Humidity = High) Then PlayTennis = No
R2: If (Outlook = Sunny) ∧ (Humidity = Normal) Then PlayTennis = Yes
R3: If (Outlook = Overcast) Then PlayTennis = Yes
R4: If (Outlook = Rain) ∧ (Wind = Strong) Then PlayTennis = No
R5: If (Outlook = Rain) ∧ (Wind = Weak) Then PlayTennis = Yes
Bayesian algorithms

• These methods apply Bayes’ theorem for problems such as classification and regression.

• Bayes' theorem considers the prior probability and the evidence when predicting the posterior probability.

• Widely used versions of Bayesian approaches are (a usage sketch follows the list):


– Naive Bayes.
– Gaussian Naive Bayes.
– Multinomial Naive Bayes.
– Averaged One-Dependence Estimators (AODE).
– Bayesian Belief Network (BBN).
– Bayesian Network (BN).
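As a usage sketch of the Gaussian variant listed above (assuming scikit-learn is available; the toy data is made up):

from sklearn.naive_bayes import GaussianNB
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()          # assumes features are Gaussian within each class
model.fit(X, y)               # estimates class priors and per-class mean/variance
print(model.predict([[1.2, 1.9]]))        # predicted class label
print(model.predict_proba([[1.2, 1.9]]))  # posterior probabilities via Bayes' theorem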
Ensemble

[Figure: the same data is fed to model 1, model 2, ..., model k, whose outputs are combined into a single ensemble model.]

• Combine multiple models into one!
• Applications: Classification, clustering, collaborative filtering, anomaly detection, ...


Stories of success

• Million-dollar prize:
– Improve the baseline movie recommendation approach of Netflix by 10% in accuracy.
– The top submissions all combined several teams and algorithms in an ensemble.

• Data mining competitions


– Classification problems.
– Winning teams employ an ensemble of classifiers.
Why ensembles work
• Intuition
– Combining diverse, independent opinions in human decision-making as a protective mechanism (e.g.
stock portfolio).

• Uncorrelated error reduction


– Suppose we have 5 completely independent classifiers for majority voting.
– If the accuracy is 70% for each, the majority vote (at least 3 of the 5 correct) is right with probability
  $\binom{5}{3}(0.7)^3(0.3)^2 + \binom{5}{4}(0.7)^4(0.3) + (0.7)^5 \approx 0.837$
  i.e. 83.7% majority-vote accuracy.
– With 101 such classifiers: 99.9% majority-vote accuracy.
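The arithmetic above can be checked directly; a small standard-library sketch (the function name is hypothetical):

from math import comb

def majority_vote_accuracy(n, p):
    # Probability that a majority of n independent classifiers,
    # each correct with probability p, gives the right answer (n odd)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(5, 0.7))    # ~0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.99998, the 99.9% figure on the slide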
Ensemble of classifiers
[Figure: during training, labeled data is used to train classifier 1, classifier 2, ..., classifier k, and the combination of their outputs is itself learned from the labeled data to form the ensemble model; at test time, the ensemble model produces the final predictions on unlabeled data.]

Algorithms: Boosting, Stacked Generalization, Rule Ensemble, Bayesian Model Averaging.


Bagging

• Bootstrap
– Sampling with replacement.
– Each bootstrap sample contains around 63.2% of the original records.

• Bootstrap aggregation (sketched below)
– Train a classifier on each bootstrap sample.
– Use majority voting to determine the class label predicted by the ensemble classifier.
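A minimal sketch of both steps, assuming scikit-learn decision trees as the base classifier and integer class labels (function names are hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample n records with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])  # one row of predictions per model
    # Majority vote down each column (assumes integer class labels)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)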
Boosting

• The basic idea:


– Take a weak learning algorithm.
• Only requirement: Should be slightly better than random.
– Turn it into an awesome one by making it focus on difficult cases.

• Most boosting algorithms follow these steps (see the sketch after this list):

– Train a weak model on some training data.
– Compute the error of the model on each training example.
– Give higher importance to examples on which the model made mistakes.
– Re-train the model using the "importance-weighted" training examples.
– Go back to step 2.
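These steps correspond to an AdaBoost-style re-weighting loop; a compressed sketch, assuming scikit-learn stumps as the weak learner and labels in {-1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                              # importance weights, initially uniform
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)      # weak learner: a decision stump
        stump.fit(X, y, sample_weight=w)                 # train on weighted examples
        pred = stump.predict(X)
        err = w[pred != y].sum()                         # weighted error on each example
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # this model's say in the vote
        w *= np.exp(-alpha * y * pred)                   # up-weight the mistakes
        w /= w.sum()                                     # re-normalise, then repeat
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))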
Random forests

• Algorithm (a library-based usage sketch follows):
– Choose T, the number of trees to grow.
– Choose m < M (where M is the total number of features), the number of features used to calculate the best split at each node (typically around 20% of M).
– For each tree:
• Build its training set by sampling N times with replacement from the original training set (N is the number of training examples).
• At each node, randomly choose m features and calculate the best split.
• Grow the tree fully; do not prune.
– Use majority voting among all the trees.
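In practice the algorithm above is rarely hand-rolled; a usage sketch with scikit-learn (the dataset and parameter values are illustrative assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = T trees; max_features=0.2 tries m = 20% of the M features per split
forest = RandomForestClassifier(n_estimators=100, max_features=0.2, random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))  # accuracy of the majority vote on held-out data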
Neural networks
• Neural network learning methods provide a robust approach to approximating real-valued,
discrete-valued, and vector-valued target functions.

• For certain types of problems, such as learning to interpret complex real-world sensor data,
artificial neural networks are among the most effective learning methods currently known.

• For example, the back-propagation algorithm described in this module has proven
surprisingly successful in many practical problems such as learning to recognize handwritten
characters, learning to recognize spoken words and learning to recognize faces.

• The study of Artificial Neural Networks (ANNs) has been inspired in part by the observation
that biological learning systems are built of very complex webs of interconnected neurons.

• In rough analogy, artificial neural networks are built out of a densely interconnected set of
simple units, where each unit takes a number of real-valued inputs (possibly the outputs of
other units) and produces a single real-valued output (which may become the input to many
other units).
Activation functions

• Sigmoid: Takes a real-valued input and squashes it to the range [0, 1].

• Tanh: Takes a real-valued input and squashes it to the range [-1, 1].

• ReLU: Stands for rectified linear unit. It takes a real-valued input and thresholds it at zero (replaces negative values with zero).
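All three are one-liners in NumPy (a sketch, not from the original deck):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # thresholds at zero

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))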
Feedforward neural network

• The simplest type of artificial neural network.


• Contains multiple neurons (nodes) arranged in layers.
• Nodes from adjacent layers have connections or edges between them.
• All these connections have associated weights.
• A feedforward neural network consists of three types of layers: input, hidden and output layers.
Multilayer perceptron
• Input layer:
– The Input layer has three nodes.
– The Bias node has a value of 1.
– The other two nodes take X1 and X2 as external inputs.

• Hidden layer:
– It has three nodes, with the bias node having an output of 1.
– The output of each of the other two hidden nodes depends on the outputs from the input layer (1, X1, X2) as well as the weights associated with the connections (edges): each hidden node applies the activation function f to the weighted sum of its inputs.
– Both hidden nodes are computed in the same way, and their outputs are then fed to the nodes in the output layer.

• Output layer:
– It has two nodes which take inputs from the hidden layer and perform similar computations as shown
for the highlighted hidden node.
– The values calculated (Y1 and Y2) as a result of these computations are outputs of the MLP.
– Given a set of features X = (x1, x2, … xn) and a target y, the MLP can learn the relationship between
the features and the target for classification.
Backpropagation (Backprop) algorithm

• Initially all the edge weights are randomly assigned.

• For every input in the training dataset, the ANN is activated and its output is observed.

• This output is compared with the desired output that we already know, and the error is
“propagated” back to the previous layer.

• This error is noted and the weights are adjusted (updated) accordingly.

• This process is repeated until the output error is below a predetermined threshold.

• Once the above algorithm terminates, the MLP is considered to have "learned" and is ready to predict when a new data point (new input) arrives.

• This ANN is said to have learned from several examples (labeled data) and from its mistakes
(error propagation).
Recurrent or feedback architecture

• Among the main feedback networks are the Hopfield network and the perceptron with feedback between neurons from distinct layers; the learning algorithms used in their training processes are based, respectively, on energy function minimization and the generalized delta rule, as will be investigated in the next chapters.

• Thus, using the feedback process, the networks with this architecture produce current
outputs also taking into consideration the previous output values.
Perceptron rule

• One way to learn an acceptable weight vector is to begin with random weights, then
iteratively apply the perceptron to each training example, modifying the perceptron weights
whenever it misclassifies an example.

• This process is repeated, iterating through the training examples as many times as needed
until the perceptron classifies all training examples correctly.

• Weights are modified at each step according to the perceptron training rule, which revises
the weight wi associated with input xi according to the rule.
– $w_i \leftarrow w_i + \Delta w_i$ where $\Delta w_i = \eta(t - o)x_i$

• Here t is the target output for the current training example, o is the output generated by the
perceptron, and η is a positive constant called the learning rate.

• The role of the learning rate is to moderate the degree to which weights are changed at each
step.

• It is usually set to some small value (e.g., 0.1) and is sometimes made to decay as the
number of weight-tuning iterations increases.
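A direct NumPy sketch of this training rule (the data and function name are illustrative, not from the original deck):

import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=50):
    X = np.hstack([np.ones((len(X), 1)), X])    # prepend x0 = 1 so w0 acts as the bias
    w = np.random.default_rng(0).normal(scale=0.01, size=X.shape[1])  # small random weights
    for _ in range(epochs):                     # iterate over the training examples
        for x, t in zip(X, y):
            o = 1 if w @ x > 0 else 0           # thresholded perceptron output
            w += eta * (t - o) * x              # delta_w = eta*(t - o)*x; zero when o == t
    return w

# Usage: learn the (linearly separable) AND function
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])
print(perceptron_train(X, y))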
Gradient-descent (training examples, η)

• Each training example is a pair of the form <x, t > , where x is the vector of input values, and
t is the target output value. η is the learning rate.

• Initialize each wi to some small random value

• Until the termination condition is met, Do


– Initialize each Δwi to zero.
– For each < x, t > in training_examples, Do
• Input the instance x to the unit and compute the output o
• For each linear unit weight wi, Do Δwi = Δwi + η(t − o)xi

• For each linear unit weight wi, Do wi = wi + Δwi
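A direct Python translation of the pseudocode above (a sketch; a fixed iteration count stands in for the unspecified termination condition):

import numpy as np

def gradient_descent(examples, eta=0.05, iterations=200):
    # examples: list of (x, t) pairs; trains a linear unit o = w . x
    d = len(examples[0][0])
    w = np.random.default_rng(0).uniform(-0.05, 0.05, size=d)  # small random weights
    for _ in range(iterations):                 # "until the termination condition is met"
        delta_w = np.zeros(d)                   # initialise each delta_w_i to zero
        for x, t in examples:
            o = w @ x                           # compute the output o
            delta_w += eta * (t - o) * x        # delta_w_i = delta_w_i + eta*(t - o)*x_i
        w += delta_w                            # w_i = w_i + delta_w_i
    return w

# Usage: recover a noiseless linear target t = 2 + 3*u
data = [(np.array([1.0, u]), 2.0 + 3.0 * u) for u in np.linspace(-1, 1, 20)]
print(gradient_descent(data))  # approaches [2. 3.]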


Multilayer networks and the backpropagation algorithm

• Single perceptrons can only express linear decision surfaces.

• In contrast, the kind of multilayer networks learned by the backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.
Support vector machine

• A supervised classifier that separates the classes with a maximum-margin hyperplane; kernel functions extend it to nonlinear decision boundaries (a usage sketch follows).
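A usage sketch (scikit-learn and the iris dataset are assumptions, not part of the original deck):

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The RBF kernel gives a nonlinear boundary; C trades margin width against training errors
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # classification accuracy on held-out data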
Classification model evaluation and selection

• Classification accuracy: Percentage of correct predictions.

• Confusion matrix: Table that describes the performance of a classification model in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).

• Recall: TP / (TP + FN).

• Precision: TP / (TP + FP).

• F-measure: 2 × (Precision × Recall) / (Precision + Recall), computed along with the others in the sketch below.
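The sketch below derives all four metrics from the confusion-matrix counts (the example counts are made up):

def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # fraction of correct predictions
    precision = tp / (tp + fp)                   # how many predicted positives are real
    recall = tp / (tp + fn)                      # how many real positives were found
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Example: 40 TP, 10 FP, 45 TN, 5 FN
print(classification_metrics(40, 10, 45, 5))
# -> (0.85, 0.8, 0.888..., 0.842...)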
ROC curves

• The ROC curve is a performance measurement for classification problems at various threshold settings.

• ROC is a probability curve, and AUC represents the degree or measure of separability.

• AUC tells how capable the model is of distinguishing between classes: the higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.

• The ROC curve plots TPR against FPR, with TPR on the y-axis and FPR on the x-axis.

• True Positive Rate: TPR = TP / (TP + FN).

• False Positive Rate: FPR = FP / (FP + TN).


Cost Benefit Analysis (CBA)

• Cost Benefit Analysis (CBA), sometimes called Benefit Costs Analysis (BCA), is a systematic
approach to estimating the strengths and weaknesses of alternatives.

• A CBA may be used to compare potential (or completed) courses of action, or to estimate (or evaluate) the value of a decision, project, or policy against its cost.

• CBA is often used by organizations to appraise the desirability of a given policy.

• The value of a cost–benefit analysis depends on the accuracy of the individual cost and
benefit estimates.

• In health economics, CBA may be an inadequate measure because willingness-to-pay methods of determining the value of human life can be influenced by income level. Variants such as cost–utility analysis, which uses QALYs and DALYs to analyze the effects of health policies, may be more suitable.
Checkpoint (1 of 2)

Multiple choice questions:

1. Decision trees are appropriate for the problems where:


a) Attributes are both numeric and nominal
b) Target function takes on a discrete number of values
c) Data may have errors
d) All of the mentioned

2. Which of the following SVM models can be used to detect outliers?
a) nu-SVM classification
b) nu-SVM regression
c) one-classification
d) None of the above

3. Suppose your model is overfitting. Which of the following is NOT a valid way to try and
reduce the overfitting?
a) Increase the amount of training data
b) Improve the optimization algorithm being used for error minimization
c) Decrease the model complexity
d) Reduce the noise in the training data
Checkpoint solutions (1 of 2)

Multiple choice questions:

1. Decision trees are appropriate for the problems where:
a) Attributes are both numeric and nominal
b) Target function takes on a discrete number of values
c) Data may have errors
d) All of the mentioned (correct answer)

2. Which of the following SVM models can be used to detect outliers?
a) nu-SVM classification
b) nu-SVM regression
c) one-classification (correct answer)
d) None of the above

3. Suppose your model is overfitting. Which of the following is NOT a valid way to try and
reduce the overfitting?
a) Increase the amount of training data
b) Improve the optimization algorithm being used for error minimization (correct answer)
c) Decrease the model complexity
d) Reduce the noise in the training data
Checkpoint (2 of 2)

Fill in the blanks:

1. _________ learning is used when you build component classifiers that are more accurate
and independent from each other.
2. __________are supervised learning algorithms used for classification and regression
analysis.
3. SVMs got into the mainstream because of their exceptional performance in _____ recognition.
4. ________ indicates the class is correctly recognized (small number of FN).

True or False:

1. Designing and developing algorithms according to the behaviors based on empirical data
are known as machine learning. True/False
2. Sequence learning is a method of teaching and learning in a logical manner. True/False
3. Instance based learning algorithms are also referred to as lazy learning algorithms as they
delay the induction or generalization process until classification is performed. True/False
Checkpoint solutions (2 of 2)

Fill in the blanks:

1. Ensemble learning is used when you build component classifiers that are more accurate
and independent from each other.
2. Support vector machines are supervised learning algorithms used for classification and
regression analysis.
3. SVMs got into the mainstream because of their exceptional performance in handwritten digit
recognition.
4. High recall indicates the class is correctly recognized (small number of FN).

True or False:

1. Designing and developing algorithms according to the behaviors based on empirical data
are known as Machine Learning. True
2. Sequence learning is a method of teaching and learning in a logical manner. True
3. Instance based learning algorithms are also referred to as lazy learning algorithms as they
delay the induction or generalization process until classification is performed. True
Question bank

Two mark questions:

1. Define machine learning. Give an example.


2. Define curse of dimensionality. How do you address this issue in ML?
3. Define entropy. Illustrate.
4. Define ensemble of classifiers.

Four mark questions:

1. Discuss the problems in classifying data.


2. Discuss in brief k-nearest neighbor technique.
3. Explain bagging with an example.
4. Discuss the different types of activation functions used in neural networks.

Eight mark questions:

1. Explain the general classification model.


2. Explain the procedure of constructing decision tree with example.
Unit summary

Having completed this unit, you should be able to:

• Understand the concepts of classification

• Gain knowledge on applications of classification algorithms

• Gain an insight into the classification techniques such as k-Nearest neighbor algorithm,
decision trees, Naive Bayesian classifier

• Gain an insight into the ensemble methods namely bagging and boosting

• Gain knowledge on advanced classification methods namely support vector machines,


artificial neural networks

• Understand the metrics for evaluation of classification methods
