
CHAPTER 6: CLASSIFICATION AND REGRESSION (MACHINE LEARNING MODELS)
DR AZLIN AHMAD
CONTENT
1. Regression problems: the outcome is an integer or real number, either positive or negative.
2. Classification problems: the output is a class (categorical), produced by classifiers such as:
• Naïve Bayes
• Random Forest
• Logistic Regression
• Decision Tree
• Support Vector Machine
• Neural Networks
• K-Nearest Neighbours
WHAT IS CLASSIFICATION?
• Classification:
• a technique for determining which class the dependent variable (target/class) belongs to, based on one or more independent variables.
• used for predicting discrete responses.
WHAT IS REGRESSION?
• Regression models are used to predict a continuous value.
• Predicting the price of a house given features of the house, such as its size, is one of the common examples of regression.
CLASSIFIERS: NAÏVE BAYES
• Based on Bayes' theorem, with an independence assumption between the predictors:
• it assumes that the presence of a feature in a class is unrelated to any other feature.
• Even if these features depend on each other, or upon the existence of the other features, all of these properties are treated as contributing independently. Hence the name Naive Bayes.
• Based on Naive Bayes, Gaussian Naive Bayes is used for classification when the data is assumed to follow a Gaussian (normal) distribution.
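As an illustration, a minimal Gaussian Naive Bayes sketch using scikit-learn; the tiny training set below is invented purely for this example:

    # Gaussian Naive Bayes sketch with scikit-learn; the data is invented.
    from sklearn.naive_bayes import GaussianNB

    # Two numeric features per sample, two classes (0 and 1)
    X_train = [[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]]
    y_train = [0, 0, 1, 1]

    model = GaussianNB()
    model.fit(X_train, y_train)  # estimates a mean and variance per feature, per class

    print(model.predict([[1.1, 2.0]]))        # predicted class label -> [0]
    print(model.predict_proba([[1.1, 2.0]]))  # posterior probability for each class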
GAUSSIAN NAIVE BAYES
Bayes' theorem: P(class|data) = P(data|class) × P(class) / P(data), where:
• P(class|data) is the posterior probability of the class (target) given the predictor (attribute): the probability of a data point belonging to either class, given the data point. This is the value we are looking to calculate.
• P(class) is the prior probability of the class.
• P(data|class) is the likelihood, which is the probability of the predictor given the class.
• P(data) is the prior probability of the predictor, or marginal likelihood.
STEPS
1. Calculate the Prior Probability
Prior probabilities are the initial guesses. This guess can be any probability we want, but a common guess is estimated from the training data:
P(class) = Number of data points in the class / Total no. of observations
P(yellow) = 7/17, P(green) = 10/17 (7 yellow and 10 green points out of 17 observations)
2. Calculate the Marginal Likelihood
P(data) = Number of data points similar to the observation / Total no. of observations
P(?) = 4/17
This value is the same when checking the probability for either class.
3. Calculate the Likelihood
P(data|class) = Number of observations similar to the query in the class / Total no. of points in the class
P(?|yellow) = 1/7, P(?|green) = 3/10
4. Calculate the Posterior Probability for each Class
P(yellow|?) = (1/7 × 7/17) / (4/17) = 0.25
P(green|?) = (3/10 × 10/17) / (4/17) = 0.75
5. Classification
The point is assigned to the class with the higher posterior probability: with 75% probability, it belongs to class green.
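The same arithmetic as the steps above, written as a short Python sketch; the counts (7 yellow points, 10 green points, and 4 similar neighbours of the query, of which 1 is yellow and 3 are green) are taken from the example:

    # Naive Bayes posterior for the worked example above.
    total = 17                                           # all observations
    prior = {"yellow": 7 / total, "green": 10 / total}   # P(class)
    marginal = 4 / total                                 # P(data): 4 similar points
    likelihood = {"yellow": 1 / 7, "green": 3 / 10}      # P(data|class)

    # P(class|data) = P(data|class) * P(class) / P(data)
    posterior = {c: likelihood[c] * prior[c] / marginal for c in prior}
    print(posterior)   # {'yellow': 0.25, 'green': 0.75} -> classify as green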
CLASSIFIER: DECISION TREE
In general, a Decision Tree makes a statement, and then makes a decision based on whether that statement is true or false.
• A decision tree builds classification or regression models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
• The final result is a tree with decision nodes and leaf nodes. It follows the Iterative Dichotomiser 3 (ID3) algorithm for determining the splits.
If a Decision Tree predicts numeric values, it is called a Regression Tree; if it predicts classes, it is called a Classification Tree.
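A minimal scikit-learn sketch of fitting and inspecting a small tree. Note that sklearn implements an optimized CART rather than ID3 itself; criterion="entropy" gives the ID3-style entropy-based splits described next, and the four-point dataset is invented for illustration:

    # Decision tree sketch with scikit-learn; the entropy criterion mirrors ID3-style splits.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented toy data: [feature1, feature2] -> class
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 0, 0, 1]

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
    tree.fit(X, y)

    print(export_text(tree))       # text view of the decision nodes and leaf nodes
    print(tree.predict([[1, 1]]))  # -> [1]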
HOW?
A Decision Tree uses Entropy and Information Gain to construct the tree, for example to identify its root node (a code sketch of both measures follows this list).
• Entropy
• Entropy is the degree or amount of uncertainty in the randomness of the elements; in other words, it is a measure of impurity.
• Entropy measures the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one.
• Information Gain
• It measures the relative change in entropy with respect to the independent attribute, and tries to estimate the information contained in each attribute.
• Information Gain ranks the attributes for splitting at a given node in the tree. The ranking is based on the highest information gain in each split.
• The disadvantage of a Decision Tree model is overfitting, as it tries to fit the model by going deeper into the training set, thereby reducing test accuracy.
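A small sketch of both measures, assuming the usual definitions (entropy H(S) = -Σ p·log2(p), and gain = entropy of the parent minus the weighted entropy of the children); the label counts are invented:

    # Entropy and information gain for a candidate split, computed from class labels.
    from collections import Counter
    from math import log2

    def entropy(labels):
        # H(S) = -sum(p * log2(p)); 0 for a pure sample, 1 for a 50/50 binary split
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, children):
        # Drop in entropy after splitting `parent` into the `children` subsets
        n = len(parent)
        weighted = sum(len(ch) / n * entropy(ch) for ch in children)
        return entropy(parent) - weighted

    parent = ["yes"] * 9 + ["no"] * 5
    split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
    print(entropy(parent))                  # ~0.940
    print(information_gain(parent, split))  # ~0.048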
CLASSIFIER: RANDOM FOREST
• An ensemble algorithm based on bagging (bootstrap aggregation).
• The general idea: select a combination of learning models that improves the overall result.
• Random forests prevent overfitting by creating trees on random subsets of the data;
• the forest takes the average of all the predictions, which cancels out the biases.
• Random Forest adds additional randomness to the model while growing the trees:
• instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features.
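A minimal scikit-learn sketch; the synthetic dataset and the parameter values are chosen only for illustration:

    # Random forest sketch: many trees grown on bootstrap samples, predictions averaged.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=100,    # number of bootstrapped trees
        max_features="sqrt", # random subset of features tried at each split
    )
    forest.fit(X, y)
    print(forest.predict(X[:3]))  # class labels for the first three samples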
CLASSIFIER: LOGISTIC REGRESSION
• Similar to linear regression, but used when the dependent variable is not a number but something else (like a Yes/No response).
• It is called regression, but it performs classification: based on the regression, it classifies the dependent variable into one of the classes.
• Logistic regression is used to predict an output which is binary, as stated above.
• For example, if a credit card company is going to build a model to decide whether to issue a credit card to a customer, it will model whether the customer is going to "Default" or "Not Default" on this credit card.
STEPS:
1. First, linear regression is performed on the relationship between the variables to get the model. The threshold for the classification line is assumed to be at 0.5.
2. The logistic function is applied to the regression output to get the probabilities of it belonging to either class.
3. It gives the log of the probability of the event occurring to the log of the probability of it not occurring (the log odds). In the end, it classifies the variable based on the higher probability of either class.
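A minimal sketch of the idea, assuming the standard logistic (sigmoid) function 1 / (1 + e^(-z)); the intercept and slope below are hypothetical, standing in for the coefficients fitted in step 1:

    # Logistic regression sketch: a linear model pushed through the sigmoid,
    # then thresholded at 0.5. The coefficients are invented for illustration.
    from math import exp

    def sigmoid(z):
        return 1.0 / (1.0 + exp(-z))

    b0, b1 = -4.0, 1.5           # hypothetical fitted intercept and slope
    x = 3.2                      # one input value

    p = sigmoid(b0 + b1 * x)     # probability of the positive class ("Default")
    label = "Default" if p >= 0.5 else "Not Default"
    print(p, label)              # p ~ 0.69 -> "Default"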
CLASSIFIER: SUPPORT VECTOR MACHINE
• Support Vector Machines are used for both regression and classification.
• They are based on the concept of decision planes that define decision boundaries.
• A decision plane (hyperplane) is one that separates a set of objects having different class memberships.
• An SVM performs classification by finding the hyperplane that maximizes the margin between the two classes, with the help of the support vectors.
• The learning of the hyperplane in an SVM is done by transforming the problem using some linear algebra, i.e. using a kernel. The simplest case is a linear kernel, which assumes linear separability between the classes.
• For higher-dimensional data, other kernels are used, as the points cannot be classified easily. They are specified in the next section.
WHAT IS KERNEL SVM?
• Kernel SVM takes a kernel function in the SVM algorithm and transforms the data into the required form, mapping it onto a higher dimension in which it is separable.
• Types of kernel function:
• Linear SVM is the one we discussed earlier.
• For the Polynomial kernel, the degree of the polynomial should be specified. It allows for curved lines in the input space.
• The RBF kernel is used for non-linearly separable variables. Squared Euclidean distance is used as the distance metric. Using a typical value of the parameter can lead to overfitting our data. It is the default kernel in sklearn.
• The Sigmoid kernel, similar to logistic regression, is used for binary classification.
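A short scikit-learn sketch comparing the four kernel types on a synthetic, non-linearly separable dataset (the dataset choice and parameters are illustrative only):

    # SVM kernels compared on a toy "two moons" dataset, which a linear
    # kernel cannot separate but an RBF kernel can.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

    for kernel in ("linear", "poly", "rbf", "sigmoid"):
        clf = SVC(kernel=kernel, degree=3)   # degree only matters for "poly"
        clf.fit(X, y)
        print(kernel, clf.score(X, y))       # training accuracy; rbf is typically highest here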
CLASSIFIER: NEURAL NETWORKS / ARTIFICIAL NEURAL NETWORKS (ANN)
• An Artificial Neural Network is a set of connected input/output units where each connection has a weight associated with it. It was started by psychologists and neurobiologists to develop and test computational analogs of neurons.
• During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.
• Artificial Neural Networks have performed impressively in most real-world applications. They have a high tolerance to noisy data and are able to classify untrained patterns.
• Usually, Artificial Neural Networks perform better with continuous-valued inputs and outputs.
[Figure: a feed-forward network with an input layer, a hidden layer, and an output layer]
HOW?
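A minimal sketch, assuming scikit-learn's MLPClassifier as a stand-in for the feed-forward network in the figure: one hidden layer, with the weights adjusted during fitting to predict the correct class labels. The dataset and layer size are invented:

    # A small feed-forward neural network (one hidden layer) with scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)

    # input layer -> 8-unit hidden layer -> output layer
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    net.fit(X, y)        # learning phase: weights adjusted to reduce prediction error
    print(net.score(X, y))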
K-NEAREST NEIGHBOR
• The KNN algorithm assumes that similar things exist in close proximity.
• In other words, similar things are near to each other.
• Most of the time, similar data points lie close to each other.
• KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood: calculating the distance between points on a graph.
HOW?
1. Load the data.
2. Initialize K to your chosen number of neighbours.
3. For each example in the data:
• calculate the distance between the query example and the current example from the data;
• add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.
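A from-scratch sketch of the steps above for the classification case; the six labelled points and the query are invented:

    # KNN classification implementing the steps above.
    from collections import Counter
    from math import dist  # Euclidean distance (Python 3.8+)

    def knn_classify(data, labels, query, k):
        # steps 3-4: distance from the query to every example, sorted ascending with indices
        ordered = sorted((dist(point, query), i) for i, point in enumerate(data))
        # steps 5-6: labels of the first K entries
        k_labels = [labels[i] for _, i in ordered[:k]]
        # step 8: classification returns the mode of the K labels
        return Counter(k_labels).most_common(1)[0][0]

    data = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
    labels = ["green", "green", "green", "yellow", "yellow", "yellow"]
    print(knn_classify(data, labels, query=(2, 2), k=3))  # -> "green"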


ADVANTAGES & DISADVANTAGES
• Advantages
• The algorithm is simple and easy to implement.
• There is no need to build a model, tune several parameters, or make additional assumptions.
• The algorithm is versatile: it can be used for classification, regression, and search.
• Disadvantages
• The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
 References:
 https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
 https://towardsdatascience.com/supervised-machine-learning-classification-5e685fe18a6d
 https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
 https://medium.com/@Mandysidana/machine-learning-types-of-classification-9497bd4f2e14
