
Supervised Learning 1

Adane L. Mamuye

June 2020
Outline

• Data quality problems


• Data preprocessing
• Regression
Data Quality

• Data have quality if they satisfy the requirements of the


intended use.
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
Data Quality Problems

• Missing values
• Duplication
• Inconsistent data
• Outliers

• Preprocessing is one of the most critical steps in a data


analytics process.
Why Is Data Preprocessing Important?

• Less data: machine learning methods can learn faster


• Higher accuracy: machine learning methods can generalize
better
• Simpler results: the outputs of machine learning methods are easier
to understand
Data Preprocessing Major Tasks
Data Cleaning

• Data cleaning:
– Filling in missing values

– Smoothing noisy data

– Identifying or removing outliers


Data Integration
• Blending data from multiple sources into a coherent data store.

• Issues to be considered during integration:


– Entity identification

– Redundancy and correlation analysis

• Some redundancies can be detected by correlation analysis


Data Reduction

• Most machine learning and data mining techniques may not


be effective for high-dimensional data.
• Dimensionality reduction is the process of taking data in a
high dimensional space and mapping it into a new space
whose dimensionality is much smaller.
• Reasons:
– High dimensionality imposes computational challenges
– It can lead to poor generalization ability of the learning
algorithm (e.g., k-NN)
• Dimensionality reduction also helps in finding meaningful structure in the data
Data Reduction
• Data reduction strategies include:
– Dimensionality reduction:
• Wavelet transform, Principal component analysis, Attribute
subset selection.

– Numerosity reduction
• Parametric (regression and log-linear model) and
nonparametric (histograms, clustering, sampling and data
cube aggregation.)

– Data compression
• Lossless and lossy data compression techniques
Data Transformation and Discretization

• Data are transformed or consolidated into forms appropriate for


mining
• Involves the following:
– Smoothing: remove noise from the data – binning, regression,
and clustering
– Attribute construction: new attributes are constructed from the
given set of attributes
– Aggregation: summary or aggregation operations are applied to
the data
Dimensionality Reduction in Detail

• Reduction is performed by applying a linear transformation to


the original data
• Principal Component Analysis (PCA)
– Let {x1, …, xm} be m vectors in R^d; we want to reduce the
dimensionality of these vectors.
– A matrix W ∈ R^(n,d), where n < d, induces a mapping x → Wx,
where Wx ∈ R^n is the lower-dimensional representation of x.
– A second matrix U ∈ R^(d,n) can be used to (approximately) recover
each original vector x from its compressed version.
Dimensionality Reduction in Detail

• PCA is a method that brings together:


– A measure of how each variable is associated with one
another. (Covariance matrix.)

– The directions in which our data are dispersed.


(Eigenvectors.)

– The relative importance of these different directions.


(Eigenvalues.)
Dimensionality Reduction - Example
– Compute the mean of every dimension of the whole dataset.

– Compute the covariance matrix of the whole dataset.
  (Worked example with Math, English and Art test scores; the covariance
  matrix is shown on the slide.)

– Art test scores have more variability than English test scores.

– The covariance between Math and English and between Math and Art is
  positive, but the covariance between Art and English is unpredictable.

Source: Towards Data Science
Dimensionality Reduction - Example
– Compute the eigenvectors (vectors whose direction remains unchanged when a
  linear transformation is applied to them) and the corresponding eigenvalues.
– Let A be a square matrix, ν a vector and λ a scalar that satisfies Aν = λν;
  then λ is called an eigenvalue associated with the eigenvector ν of A.
– The eigenvalues of A are the roots of the characteristic equation
  det(A − λI) = 0.

Dimensionality Reduction - Example
• Calculate the eigenvectors corresponding to the above eigenvalues:
  – Compute the eigenvectors by Gaussian elimination, solving
    (A − λI)x = 0, where x is the eigenvector associated with eigenvalue λ.
• Sort the eigenvectors by decreasing eigenvalues.
• Keep the eigenvectors corresponding to the two largest eigenvalues.
• Transform the samples onto the new subspace (see the NumPy sketch below).
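A minimal sketch of these PCA steps with NumPy; the function name pca_project, the random data X and the choice of 2 components are illustrative assumptions, not values from the slides.

    import numpy as np

    def pca_project(X, n_components):
        # 1. Compute the mean of every dimension and center the data
        X_centered = X - X.mean(axis=0)
        # 2. Compute the covariance matrix of the whole dataset
        cov = np.cov(X_centered, rowvar=False)
        # 3. Eigenvalues/eigenvectors of the (symmetric) covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)
        # 4. Sort eigenvectors by decreasing eigenvalue, keep the top ones
        order = np.argsort(eigvals)[::-1]
        W = eigvecs[:, order[:n_components]]
        # 5. Transform the samples onto the new subspace
        return X_centered @ W

    X = np.random.rand(100, 5)   # 100 samples, 5 original dimensions (assumed)
    Z = pca_project(X, 2)        # Z has shape (100, 2)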


When should I use PCA?

• Do you want to reduce the number of variables, but aren’t


able to identify variables to completely remove from
consideration?

• Do you want to ensure your variables are independent of one


another?

• Are you comfortable making your independent variables less


interpretable?
• If you answered “yes” to all three questions, then PCA is a
good method to use.
Supervised Learning
Supervised Learning

• Each type of task is characterized by the kinds of data they


require and the kinds of output they generate
Classification Tasks

• Given:
– A set of classes
– Instances (examples) of
each class
• Described as a set of
features or attributes
and their values
• Generate: A method (aka
model) that when given a
new instance it will
determine its class
Supervised Learning
• Classification
– Output type: discrete (binary/multi-classes)
– Trying to find: a boundary
– Evaluation: accuracy
• Regression
– Output type: continuous
– Trying to find: best fit line
– Evaluation: sum of squared errors

Source: SlideShare
Classification Techniques

What classification techniques do you know?


Linear Regression

• The term regression comes from the fact that we fit a linear model to the
feature space.
• When the outcome and all the attributes are numeric, linear
regression is a natural technique to consider.
• The idea is to express the class as a linear combination of the
attributes, with predetermined weights:

    x = w0 + w1·a1 + w2·a2 + … + wk·ak

Where
• x – the class
• a1 to ak – the attribute values
• w0 to wk – the weights, calculated from the
training data
Linear Regression

• The predicted value (not the actual value) of the first instance's
class can be written as:

    w0 + w1·a1^(1) + w2·a2^(1) + … + wk·ak^(1)

where aj^(1) denotes the value of attribute j for the first instance.

• Linear regression is an excellent, simple method for numeric
prediction.
Linear Regression

• Best-fitting straight line will be found, where “best” is


interpreted as the least mean-squared difference.

• Linear regression measures the goodness of fit using the


squared error.
• Linear models serve well as building blocks for more complex
learning methods.
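A least-squares sketch with NumPy; the toy attribute matrix A and class values x are assumptions for illustration, not data from the slides.

    import numpy as np

    A = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])  # attribute values a1, a2
    x = np.array([3.1, 3.9, 6.2, 9.8])                              # numeric class values

    A1 = np.hstack([np.ones((A.shape[0], 1)), A])    # prepend a column of 1s for w0
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)       # weights minimizing squared error

    pred_first = A1[0] @ w                           # predicted class of the first instance
    print(w, pred_first)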
Logistic Regression

• One type of classification algorithm


• Logistic regression builds a linear model based on a
transformed target variable.

Source: Towards Data Science
Logistic Regression

• It is closely related to linear regression, but is used when the target
variable is categorical in nature.
• The outcome or target variable is dichotomous (two possible
classes) in nature.
• Logistic Regression predicts the probability of occurrence of a
binary event utilizing a logit function.
Logistic Regression
• From linear to logistic regression: apply the sigmoid function to the
linear combination β0 + β1·x1 + β2·x2 + … + βn·xn.

• Sigmoid function: also called the logistic function, it gives an 'S'-shaped
curve that can take any real-valued number and map it into a value
between 0 and 1.

Sigmoid function of the weighted sum:

    f(x) = 1 / (1 + e^(−y))
    f(x) = 1 / (1 + e^(−(β0 + β1·x1 + … + βn·xn)))

If the curve goes to positive infinity, y


predicted will become 1, and if the curve goes
to negative infinity, y predicted will become 0
Logistic Regression

• Decision Boundary

– Based upon this threshold, the obtained estimated


probability is classified into classes.
– Say, if predicted value ≥ 0.5, then classify email as spam else
as not spam.
Example: If the output is 0.75, we can say in terms of
probability as: There is a 75 percent chance that email will be
spam.
• Decision boundary can be linear or non-linear. Polynomial order
can be increased to get complex decision boundary.
Logistic Regression

• The cost function (the log loss) can be minimized by using Gradient
Descent.
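A hedged sketch of gradient descent on the logistic (cross-entropy) cost with NumPy; the toy data, learning rate and iteration count are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0.5], [1.5], [2.5], [3.5]])     # one feature
    y = np.array([0, 0, 1, 1])                     # binary labels
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column

    beta = np.zeros(Xb.shape[1])
    lr = 0.1
    for _ in range(5000):
        p = sigmoid(Xb @ beta)            # predicted probabilities
        grad = Xb.T @ (p - y) / len(y)    # gradient of the average log loss
        beta -= lr * grad                 # descent step

    print(beta, sigmoid(Xb @ beta))       # threshold the probabilities at 0.5 to classify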
Types of Logistic Regression
1. Binary Logistic Regression
– The categorical response has only two possible
outcomes. Example: Spam or Not
– One of the simplest and most commonly used Machine
Learning algorithms for two-class classification.
2. Multinomial Logistic Regression
– Three or more categories without ordering. Example:
Predicting which food is preferred more (Veg, Non-Veg,
Vegan)
3. Ordinal Logistic Regression
– Three or more categories with ordering. Example: Movie
rating from 1 to 5
Logistic Regression -- Advantage
• Advantages:
- Makes no assumptions about distributions of classes in
feature space
- Easily extended to multiple classes (multinomial regression)
- Natural probabilistic view of class predictions
- Quick to train
- Very fast at classifying unknown records
- Good accuracy for many simple data sets
- Resistant to overfitting
- Can interpret model coefficients as indicators of feature
importance
• What are the disadvantages of LR?
Question

• Discuss the difference between linear


regression and logistic regression
Supervised Learning 2

Adane L. Mamuye

June 2020
Outline

• Decision Tree classification algorithm


• Classification algorithms performance evaluation
• K-Nearest Neighbours
Classification Tasks

Data: A set of data records (also called examples, instances or cases)
described by:
– k attributes: A1, A2, …, Ak
– a class: each example is labelled with a pre-defined class
Goal: To learn a classification model from the data that can be
used to predict the classes of new (future, or test) cases/instances.

Learning (training): Learn a model using the training data


Testing: Test the model using unseen test data to assess the model accuracy
Decision Tree

• A decision tree is a predictor, h : X Y, that predicts the label


associated with an instance X by traveling from a root node of
a tree to a leaf.
• A method for approximating discrete classification functions
by means of a tree-based representation

• Tree leaf ↔ contains a specific


label
•Tree branch ↔ possible
attribute value for the instance in
question
DT Learning Algorithm

• Tree is constructed in a top-down recursive manner– greedy


search - through the space of possible solutions
• A general Decision Tree learning algorithm:
1. Perform a statistical test of each attribute (mostly categorical, though
continuous attribute values can also be handled) to determine how
well it classifies the training examples when considered alone;

2. Select the attribute that performs best and use it as the root of the
tree; order the attributes from highest information gain to lowest
information gain;
3. To decide the descendant node down each branch of the root
(parent node), sort the training examples according to the value
related to the current branch and repeat the process described in
steps 1 and 2 (a recursive process).
What is Good Attribute

• A good attribute splits the data so that
each successor node is as pure as possible.

• In other words:
– We want a measure that prefers attributes that have a high
degree of ”order”
• Maximum order: all examples are of the same class
• Minimum order: all classes are equally likely
• Needs a measure of impurity
Measures of Node Impurity

• Information Gain
– Determine how informative an attribute is
– Attributes are assumed to be categorical

• Gini Index
– Attributes are assumed to be continuous
– Assume there exist several possible split values for each
attribute
Information Gain

• The encoding information that would be gained by branching


on A.
Gain(A) = Info(D) - InfoA(D)
• Gain(A) tells us how much would be gained by branching on
A.

• The attribute A with the highest information gain is chosen as


the splitting attribute at node N.
Information Gain

Entropy, in information theory, specifies the minimum number of
bits needed to encode the class label of an instance:

    Info(D) = − Σi pi · log2(pi)

where pi is the proportion of tuples in D belonging to class Ci.
Information Gain

• Suppose we were to partition the tuples in D on some


attribute A having v distinct values, {a1, a2, …, av}
• Attribute A can be used to split D into v partitions or subsets,
{D1,D2,…, Dv}, where Dj contains those tuples in D that have
outcome aj of A.
• These partitions would correspond to the branches grown
from node N.
• The expected information required to classify a tuple from D
based on the partitioning by A:

    InfoA(D) = Σ(j=1..v) (|Dj| / |D|) · Info(Dj)
Information Gain

• Assume an attribute A splits the set S into subsets {S1, S2, …, Sv}.

• To compute the Information Gain, we have to compute:
– The entropy of the original set S
– The weighted average entropy of the subsets S1, …, Sv

• The encoding information that would be gained by branching
on A is the difference between the two (see the sketch below).

E(S) = 0 if S contains only positive or only negative examples
E(S) = 1 if S contains equal amounts of positive and negative examples
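A small sketch of these quantities in Python; the helper names entropy and information_gain and the toy label lists are assumptions for illustration.

    from collections import Counter
    from math import log2

    def entropy(labels):
        # E(S) = - sum of p_i * log2(p_i) over the classes in S
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        # Gain(A) = E(S) - weighted average entropy of the subsets S1..Sv
        n = len(labels)
        remainder = 0.0
        for v in set(attribute_values):
            subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    # an all-positive set has entropy 0; a 50/50 set has entropy 1
    print(entropy(["yes", "yes"]), entropy(["yes", "no"]))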
Play Tennis Example
Tree Construction-1
Tree Construction-2
Tree Construction-3
Tree Construction-4

Note: No root-to-leaf path should contain the


same discrete attribute twice
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the


same class OR when all the records have similar attribute
values.
Tree pruning

• Pruning reduces the size of decision trees by removing parts


of the tree that do not provide power to classify instances

• Prepruning: stop tree construction early, before the tree is fully grown

• Postpruning: prune after the tree is fully grown, to get a simpler tree
(find and prune unnecessary subtrees)

• Comparing prepruning and postpruning:
– Prepruning is faster, but postpruning leads to more
accurate trees.
Overfitting
• The goal of a good machine learning model is to generalize
well from the problem domain.

• Overfitting and underfitting are the two biggest causes for


poor performance of machine learning algorithms.

• If a decision tree is fully grown, it may lose some


generalization capability.
– This is a phenomenon known as overfitting.
• Overfitting refers to a model that models the training data
too well.
• This happens when the model learns the detail and noise in the training
data, so accuracy on new data is compromised.
Issues in decision trees

• Overfitting negatively impacts performance.

• A decision tree is said to overfit the training data if


– It results in poor accuracy to classify test samples

– It has too many branches that reflect anomalies


Causes of Overfitting
• Presence of Noise
– Mislabeled instances may contradict the class labels of
other similar records.

• Lack of Representative Instances


– Lack of representative instances in the training data can
prevent refinement of the learning algorithm.

• The Multiple Comparison Procedure


– Failure to compensate for algorithms that explore a large
number of alternatives can result in spurious fitting.
Avoiding Overfitting

• Ways to avoid overfitting:


1. Stop the training process before the learner reaches the point
where it perfectly classifies the training data.

2. Apply backtracking in the search for the optimal hypothesis. In


the case of Decision Tree Learning, backtracking process is
referred to as ‘post-pruning of the overfitted tree’.
Underfitting

• Underfitting refers to a model that can neither model the


training data nor generalize to new data.

• Underfitting: when model is too simple, both training and test


errors are large

• Easy to detect given a good performance metric

• Remedy: try out an alternative machine learning algorithm


Evaluation Metrics
Nearest Neighbours

• Based on learning by analogy– by comparing a given test tuple


with training tuples that are similar to it.

• Training tuple represents a point in an n-dimensional space–


described by n attributes.

• When an unknown tuple is given, the classifier searches the pattern space
for the k training tuples (the k nearest neighbors) that are closest to
the unknown tuple.
K-Nearest Neighbours

• Euclidean distance: the most commonly used measure

• Determine the class from the nearest-neighbor list: take the
majority vote of class labels among the k nearest neighbors
• Optionally, weigh each vote according to distance
K-Nearest Neighbours

• Example:

  Name     Acid Durability (AD)   Strength   Class
  Type-1   7                      7          Bad
  Type-2   7                      4          Bad
  Type-3   3                      1          Good
  Type-4   1                      4          Good

• Test data: AD = 3, S = 7, Class = ? Check for k = 1, 2, 3
  (worked through in the sketch below)
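A plain-Python sketch of the example above: Euclidean distance from the test point (AD = 3, Strength = 7) to each training tuple, then a majority vote over the k nearest. Note that for k = 2 the vote is a tie and is broken here by whichever class appears first.

    from collections import Counter
    from math import dist

    train = [("Type-1", (7, 7), "Bad"),
             ("Type-2", (7, 4), "Bad"),
             ("Type-3", (3, 1), "Good"),
             ("Type-4", (1, 4), "Good")]
    test = (3, 7)

    # sort training tuples by Euclidean distance to the test point
    ranked = sorted(train, key=lambda t: dist(t[1], test))
    for k in (1, 2, 3):
        votes = Counter(label for _, _, label in ranked[:k])
        print(k, votes.most_common(1)[0][0])   # majority class among the k nearest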


K-Nearest Neighbours

• Choosing k for K-NN is just one of the many model selection


problems we face in machine learning

– If k is too small, sensitive to noise points


– If k is too large, neighborhood may include points from other
classes

• Normalize the values of each attribute before computing


closeness– min-max
K-Nearest Neighbours

• Advantages
– Conceptually simple, easy to understand and explain
– Very flexible decision boundaries
– Not much learning at all
• Disadvantages
– It can be hard to find a good distance measure
– Irrelevant features and noise can be very detrimental
– Typically can not handle more than a few dozen attributes
– Computational cost: requires a lot of computation and memory
Thank You
Supervised Learning 3

Adane L. Mamuye

June 2020
Outline

• Artificial Neural Network


• Bayes learning
ANN

• Neural networks are nonlinear models inspired by the


structure of neural networks in the brain.
• We use ANN:
– They are extremely powerful computational devices

– Massive parallelism makes them very efficient

– They are particularly fault tolerant


How ANN Works

– Receives inputs

– Combines them in some way

– Performs a generally nonlinear operation on the result

– Outputs the final result


Artificial Neural Networks: the dimensions
How ANN Works

• The three basic components of the (artificial) neuron are:


– synapses or connecting links

– adder that sums

– Activation function.
How ANN works
Activation functions
Perceptron

Perceptron = a neuron whose input is the dot product of W
and X and which uses a step function as its transfer function
Perceptron: Example 1 -AND
Perceptron: Example 2 -OR
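A minimal sketch of the AND and OR examples with a step-function perceptron; the particular weights and biases below are assumed values that realize the two gates, not necessarily those shown on the slides.

    import numpy as np

    def perceptron(x, w, b):
        # output 1 if the weighted sum w.x + b reaches 0, else 0 (step function)
        return int(np.dot(w, x) + b >= 0)

    for x1 in (0, 1):
        for x2 in (0, 1):
            and_out = perceptron([x1, x2], w=[1, 1], b=-1.5)  # fires only when both inputs are 1
            or_out  = perceptron([x1, x2], w=[1, 1], b=-0.5)  # fires when at least one input is 1
            print(x1, x2, and_out, or_out)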
NNs: Architecture
Classification Back Propagation

• Back Propagation learns by iteratively processing a set of


training data (samples).

• For each sample, weights are modified to minimize the error


between network’s classification and actual classification.
Steps in Back Propagation Algorithm

• STEP ONE: initialize the weights and biases.

• The weights in the network are initialized to small random numbers.
• Each unit has a BIAS associated with it.
• The biases are similarly initialized to small random numbers.

• STEP TWO: feed the training sample.


Steps in Back Propagation Algorithm

• STEP THREE: propagate the inputs forward; we compute the


net input and output of each unit in the hidden and output
layers.

• STEP FOUR: back propagate the error.

• STEP FIVE: update weights and biases to reflect the


propagated errors.

• STEP SIX: terminating conditions.


Example

• Given: a feed-forward network with three input units (1, 2, 3), two hidden
units (4, 5) and one output unit (6), connected by weights w14, w15, w24,
w25, w34, w35, w46 and w56 (the network diagram is shown on the slide).

Initial input, weight and bias values:

• x1 = 1, x2 = 0, x3 = 1,
  w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2,
  w46 = -0.3, w56 = -0.2,
  θ4 = -0.4, θ5 = 0.2, θ6 = 0.1, learning rate l = 0.9
Solution
• The net input of a neuron is obtained by taking a weighted sum of the
  outputs of all the neurons connected to it, plus its bias:

    Ij = Σi wij·Oi + θj

    I4 = w14·x1 + w24·x2 + w34·x3 + θ4
       = 1·0.2 + 0·0.4 + 1·(-0.5) + (-0.4)
       = -0.7

• Given the net input Ij, the output is computed as Oj = 1 / (1 + e^(−Ij)):

    O4 = 1 / (1 + e^(0.7)) ≈ 0.332

• Accordingly, the net inputs and outputs of the remaining units are:

    I5 = -0.3 + 0 + 0.2 + 0.2 = 0.1                     →  O5 = 1 / (1 + e^(−0.1)) ≈ 0.525
    I6 = (-0.3)·0.332 + (-0.2)·0.525 + 0.1 ≈ -0.105     →  O6 = 1 / (1 + e^(0.105)) ≈ 0.474
Solution
• Calculate the error at each node of the output layer:

    Errj = Oj·(1 − Oj)·(Tj − Oj)

  where Oj is the actual output and Tj is the known target value.

    Err6 = (0.474)(1 − 0.474)(1 − 0.474) = 0.1311

• Error rate of a hidden-layer unit j:

    Errj = Oj·(1 − Oj)·Σk Errk·wjk

  where wjk is the connection weight from unit j to unit k and Errk is the
  error rate of unit k.

    Err5 = (0.525)(1 − 0.525)(0.1311)(−0.2) = −0.0065
    Err4 = (0.332)(1 − 0.332)(0.1311)(−0.3) = −0.0087
Solution
• Calculation of weight and bias updates:
– Weights and biases are updated to reflect the propagated error:

    Δwij = (l)·Errj·Oi        wij = wij + Δwij

    w46 = -0.3 + (0.9)(0.1311)(0.332) = -0.261

– Biases are updated by:

    Δθj = (l)·Errj            θj = θj + Δθj

    θ6 = 0.1 + (0.9)(0.1311) = 0.218
• Forward propagation starts again
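A NumPy sketch that reproduces the forward pass, error terms and updates of the worked example above (same initial values; the target label T = 1 is taken from the error computation).

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 0.0, 1.0])
    W_hidden = np.array([[0.2, -0.3],     # w14, w15
                         [0.4,  0.1],     # w24, w25
                         [-0.5, 0.2]])    # w34, w35
    w_out = np.array([-0.3, -0.2])        # w46, w56
    theta = np.array([-0.4, 0.2, 0.1])    # biases of units 4, 5, 6
    lr, target = 0.9, 1.0

    # forward pass
    I_hidden = x @ W_hidden + theta[:2]          # I4 = -0.7, I5 = 0.1
    O_hidden = sigmoid(I_hidden)                 # O4 ~ 0.332, O5 ~ 0.525
    O6 = sigmoid(O_hidden @ w_out + theta[2])    # ~ 0.474

    # backpropagate the error
    err6 = O6 * (1 - O6) * (target - O6)                      # ~ 0.1311
    err_hidden = O_hidden * (1 - O_hidden) * (err6 * w_out)   # Err4, Err5

    # update weights and biases
    w_out += lr * err6 * O_hidden                # w46 ~ -0.261
    W_hidden += lr * np.outer(x, err_hidden)
    theta += lr * np.append(err_hidden, err6)    # theta6 ~ 0.218
    print(O6, err6, w_out, theta)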


ANN Application

• Real world applications:


– Financial modelling:- predicting the stock market
– Time series prediction:- climate, weather, seizures
– Computer games:- intelligent agents, chess, backgammon
– Robotics:- autonomous adaptable robots
– Pattern recognition:- speech recognition, seismic activity, sonar
signals
– Data analysis:- data compression, data mining
– Bioinformatics:- DNA sequencing, alignment
Weakness of ANN

• The complex internal structure shows black-box behavior: it is
very hard to get an idea of the meaning of the internal
computations.
• Another feature of neural networks is their random behavior.
– The training process contains random elements. When this is
repeated, the same input set may yield very different
networks.
– Sometimes they differ in performance, one showing good
behavior, while others behave badly.
• Neural Network needs long time for training.
Bayes Learning
• Use a probability framework for fitting a predictive model to a
training dataset.
• Has two roles
– Provides learning algorithms
• Naïve Bayes learning
• Bayes Belief Network learning
– Provides conceptual framework
• Provides “gold standard” to evaluate other learning
algorithms
Probability Theory

• Conditional (posterior) probabilities:


– Formalize the process of accumulating evidence and
updating probabilities based on new evidence
– Specify the belief in one proposition (event, conclusion,
diagnosis, etc.) conditioned on another proposition (evidence,
feature, symptom, etc.)
• P(A|B) is the conditional probability of A given evidence B:

    P(A|B) = P(A ∧ B) / P(B)
Probability Theory

• Conditional probabilities behave like standard probabilities:


– Within the range [0, 1]
– Sum to 1

• Can have P( conjunction of events |B)


– P(A ᴧ B ᴧ C | E) is the probability that the sentence “A ᴧ B ᴧ C” is true
conditioned on the evidence E being true.
Rules of Probability Theory

• Negation: probability event A being false:


P(¬A|B) = 1- P(A|B)
• Sum rule: probability of a disjunction of two events A and B:
P(A v B) = P(A) + P(B) – P(AᴧB)
• Product Rule: probability of a conjunction of two events A and B
P(A ᴧ B) = P(A|B) x P(B)
= P(B|A) x P(A)
• Chain rule: generalization of the product rule for any number of
events.
P(A ᴧ B ᴧ C) = P(A|B ᴧ C) x P(B|C) x P(C)
Rules of Probability Theory

• Conditional chain rule: variant of the chain rule for conditional


probabilities
P(A ᴧ B|C) = P(A|B ᴧ C) x P(B|C)
• Total Probability: summing out over mutually exclusive and exhaustive
events B1, …, Bn:

    P(A) = Σi P(A|Bi)·P(Bi)
Bayes Classification Method

• The goal is to learn a model and use this model to predict.


– Learn a probability distribution

– Use the distribution to make a decision

• A learner tries to find the most probable hypothesis h from a
set of hypotheses H, given the observed data.
• Bayesian classifiers (statistical classifiers) -- predict class
membership probabilities -- based on Bayes’ theorem
Bayes Theorem

• Let X be a data tuple.

• Let H be some hypothesis such as that the data tuple X belongs


to a specified class C.

• We want to determine P(H|X)– posterior probability of H


conditioned on X

• We are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X. Bayes' theorem gives:

    P(H|X) = P(X|H)·P(H) / P(X)
Example of Bayes Theorem

• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is1/50,000
– Prior probability of any patient having stiff neck is1/20

• If a patient has stiff neck, what is the probability he/she has


meningitis?
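Working the numbers through Bayes' theorem, with M = meningitis and S = stiff neck (using the figures above):

    P(M|S) = P(S|M)·P(M) / P(S)
           = (0.5 × 1/50,000) / (1/20)
           = 0.0002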
Naïve Bayes Classification

• Simple Bayesian classifier known as the Naïve Bayesian


– Comparable in performance with decision tree and neural
network
• How it works
1. Each tuple is represented by an n-dimensional attribute vector
   X = {x1, x2, …, xn}.
2. Suppose there are m classes {C1, C2, …, Cm}.
   The classifier predicts that X belongs to the class having the highest
   posterior probability conditioned on X:

    P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
Naïve Bayes Classification
3. For data with many attributes, class-conditional independence is assumed:

    P(X|Ci) = ∏(k=1..n) P(xk|Ci)

   What if an attribute value is continuous? Use a Gaussian distribution with
   mean µ and standard deviation σ.

4. Select the class with the highest conditional probability, i.e. Ci such that

    P(X|Ci)·P(Ci) > P(X|Cj)·P(Cj)   for all j ≠ i


Estimating probability from Data
Example Naïve Bayes
Example Naïve Bayes
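A minimal sketch with scikit-learn's Gaussian Naive Bayes; the toy data below is an assumption for illustration (the slide's example data is not reproduced here). Each P(xk|Ci) is modeled with a Gaussian, and the class with the highest P(X|Ci)·P(Ci) is returned.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.5], [3.2, 3.7]])
    y = np.array([0, 0, 1, 1])

    model = GaussianNB().fit(X, y)
    print(model.predict([[1.1, 2.0]]))         # predicted class
    print(model.predict_proba([[1.1, 2.0]]))   # posterior probabilities P(Ci|X)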
Naïve Bayes Summary

• Robust to isolated noise points

• Handle missing values by ignoring the instance during


probability estimate calculations

• Independence assumption may not hold for some attributes.

– Use other techniques such as Bayesian Belief Networks


(BBN)
naïve Bayes vs Bayes Nets

• Note that the naïve Bayes classifier is simply a special instance


of a Bayes Net.

• Naïve Bayes simplifies computation by making the assumption
of class-conditional independence.

• Bayesian belief networks allow the representation of


dependencies among subsets of attributes.
Bayesian Belief Network

• Belief Measure:

– In general, a person's belief in a statement a will depend on
some body of knowledge K. We write this as P(a|K).

– P(a|K) represents a belief measure.


BBN

• A belief network is defined by two components—a directed


acyclic graph and a set of conditional probability tables

•The CPT for a variable Y specifies


the conditional distribution
P(Y|Parents(Y))
BBN Example

• Bayesian Networks implicitly encode the Markov assumption:
each variable is conditionally independent of its non-descendants
given its parents. The joint probability becomes:

    P(x1, …, xn) = ∏i P(xi | Parents(xi))
Training Bayesian Belief Networks

• Several algorithms exist for learning the network topology


from the training data given observable variables.

• If the network topology is known and the variables are


observable, then training the network is straightforward.

• When the network topology is given and some of the


variables are hidden, there are various techniques.
Training Bayesian Belief Networks
– Gradient descent
• Let D be a training set of data tuples, X1,X2, . . . , X|D|.

• Training the belief network means that we must learn the


values of the CPT entries.

• The CPT entries are considered as the weights of the network,
analogous to the hidden weights of a NN.

• A gradient descent strategy performs greedy hill-climbing--


at each iteration, the weights are updated and will
eventually converge to a local optimum solution
Application

Thank you
Supervised Learning 5

Adane L. Mamuye

June 2020
Introduction SVM

• Basic idea of support vector machines:


– Optimal hyperplane for linearly separable patterns
• A hyperplane is a linear decision surface that splits the space
into two parts

– For nonlinearly separable data-- transformations of original


data to map into new space – the Kernel function

Thank you--- Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†,


Constantin F. Aliferis*---- New York University
Introduction SVM

• Important because of:


– Robust to very large number of variables and small samples

– Can learn both simple and highly complex classification models

– Employ sophisticated mathematical principles to avoid


overfitting

– Can be used for both classification and regression tasks


Main ideas of SVMs

• Find a linear decision surface (“hyperplane”) that can separate


patient classes and has the largest distance (i.e., largest “gap”
or “margin”) between border-line patients (i.e., “support
vectors”);
Main ideas of SVMs

• If linear decision surface does not exist, the data is mapped


into a much higher dimensional space (feature space)
• The feature space is constructed via a fancy mathematical
projection called the kernel trick.
Support Vectors
• Support vectors are the data points that lie closest to the
decision surface
– Most difficult to classify

– Critical elements of the training set

• They would change the position of the dividing hyperplane if removed.


Support vectors

Support Vectors: touch the boundary of the margin


Support Vector Machine

• SVMs maximize the margin around


the separating hyperplane.

• The decision function is fully


specified by a subset of training
samples, the support vectors.
• 2-Ds, it’s a line.
• 3-Ds, it’s a plane.
• In more dimensions, call it a
hyperplane.
Input and Outputs in SVM

• Input: set of (input, output) training pair samples; call the


input sample features x1, x2…xn, and the output result y.

• Output: set of weights w (or wi), one for each feature, whose
linear combination predicts the value of y. (just like neural
nets)
SVM - Mathematical Concepts

• Represent samples geometrically.

Purpose of vector representation
• Representing each sample/patient as a vector allows us to
geometrically represent the decision surface that separates
two groups of samples/patients.

• In order to define the decision surface, we need to introduce


some basic math elements.
Basic Operations on Vectors

• Multiplication by a scalar: when you multiply a vector by a


scalar, you “stretch” it in the same or opposite direction
depending on the direction of the vector.
Basic Operations on Vectors

• Addition
Basic Operations on Vectors

• Subtraction
Basic Operations on Vectors

• Euclidean length (L2-norm)

L2-norm is a typical way to measure length of a vector


Basic Operations on Vectors

• Dot product
Basic Operations on Vectors

The response variable y is just a dot product of the feature
vector x and the weight vector w.
Equation of Hyperplane
Hyperplane Example

• Which one is better, B1 or B2? How do you define "better"?
We can define the better, or optimal, hyperplane by choosing the one
with the maximal (largest) margin from the vectors to the plane.
Hyperplane Example

• Optimal classification occurs when a hyperplane provides
maximal distance to the nearest training data points,
or support vectors. Intuitively, this makes sense: if the
points are well separated, the classification between two
groups is much clearer.
SVM - the widest street approach

- Linearly Separable Case -

A separating hyperplane can be written as

    W · X + b = 0

where:
- W = {w1, w2, …, wd} is the weight vector;
- b is called the bias.
Linear and non-linear separable data

Which one is easy to separate?


Kernel trick

• How to efficiently compute the hyperplane that separates two


classes with the largest street width?

• How to separate linear inseparable?-- optimization problem

• Kernel trick: for linearly inseparable cases in SVM, kernel trick is


a commonly used technique.
Kernel trick

– Is a nonlinear transformation of samples from the original


space to a feature space with higher or even infinite
dimension so as to make the problems linearly separable.

– Nonlinear mapping function map data 𝑋 in original (or


primal) space into a higher (ever infinite) dimension space
𝐹
Kernel trick

• The separating hyperplane function is extended by replacing dot products
in the original space with kernel evaluations K(xi, xj) in the feature space.

• The separating hyperplane in the feature space is W · φ(X) + b = 0,
where φ is the nonlinear mapping function above (see the sketch below).
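A hedged sketch with scikit-learn; the toy datasets below are assumptions for illustration: a linear SVM on separable data, and an RBF-kernel SVM for data that is not linearly separable in the original space.

    import numpy as np
    from sklearn.svm import SVC

    # linearly separable case: maximum-margin hyperplane
    X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])
    linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
    print(linear_svm.support_vectors_)        # the border-line points

    # nonlinearly separable case: kernel trick via an RBF kernel
    X2 = np.array([[0, 0], [0.1, -0.1], [1, 1], [-1, 1], [1, -1], [-1, -1]])
    y2 = np.array([0, 0, 1, 1, 1, 1])
    rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X2, y2)
    print(rbf_svm.predict([[0.05, 0.05], [0.9, -0.9]]))   # a centre point and a corner point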


Strong points of SVM-based learning
methods

• Empirically achieve excellent results in high-dimensional data


with very few samples
• Internal capacity control to avoid overfitting
• Can learn both simple linear and very complex nonlinear
functions by using “kernel trick”
• Robust to outliers and noise
• Do not require direct access to data, work only with dot-
products of data-points.
Weak points of SVM-based learning
methods

• Interpretation is less straightforward than classical statistics


• Lack of parametric statistical significance tests
• Has several key parameters like C, kernel function, and
Gamma that all need to be set correctly
Genetic Algorithm
• Inspired by Charles Darwin’s theory of natural evolution

• Genetic algorithms are a type of optimization algorithm


– Used to find the optimal solution(s) to a given computational problem.

• Genetic algorithms represent one branch of the field of study


called evolutionary computation

• Like in evolution, many of a genetic algorithm's processes are random;
however, the technique allows one to set the level of randomization and
the level of control.
Genetic Algorithm

• An individual is characterized by a set of parameters


(variables) known as Genes. Genes are joined into a string to
form a Chromosome (solution).

• The genes of an individual are represented using a string (of binary
values), in terms of an alphabet.

• The genes are encoded in a chromosome.
Genetic Algorithm

• The basic components common to almost all genetic


algorithms are:
– A fitness function for optimization
• Tests and quantifies how `fit' each potential solution is
• One of the most pivotal parts of the algorithm
– A population of chromosomes
• Refers to a numerical value or values that represent a candidate
solution to the problem that the genetic algorithm is trying to
solve
• Each candidate solution is encoded as an array of parameter
values
Genetic Algorithm

• Selection Methods: can be broadly classified into two classes


as follows.
– Fitness Proportionate Selection: includes methods such as roulette-
wheel selection and stochastic universal selection
– Ordinal Selection: includes methods such as tournament selection and
truncation selection
Genetic Algorithm

– Crossover: produces the next generation of chromosomes.
  Offspring are created by exchanging the genes of the parents among
  themselves until the crossover point is reached.
Basic Genetic algorithm operators
Genetic Algorithm

- Random mutation of chromosomes in the new generation

• Mutation occurs to maintain diversity within the population and to
prevent premature convergence (a short sketch of these operators follows
below).
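A minimal sketch of the components above (fitness function, population of bit-string chromosomes, roulette-wheel selection, crossover, mutation); the target problem, maximizing the number of 1-bits, and all parameter values are assumptions chosen to keep the example short.

    import random

    CHROM_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 50, 0.01

    def fitness(chrom):                 # how "fit" a candidate solution is
        return sum(chrom)

    def roulette(pop):                  # fitness-proportionate (roulette-wheel) selection
        return random.choices(pop, weights=[fitness(c) + 1e-9 for c in pop], k=1)[0]

    def crossover(p1, p2):              # exchange genes up to a random crossover point
        point = random.randrange(1, CHROM_LEN)
        return p1[:point] + p2[point:]

    def mutate(chrom):                  # flip bits to maintain diversity
        return [1 - g if random.random() < MUT_RATE else g for g in chrom]

    population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population = [mutate(crossover(roulette(population), roulette(population)))
                      for _ in range(POP_SIZE)]

    print(max(fitness(c) for c in population))   # best fitness found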
Application of Genetic algorithm
Benefits

• Concept is easy to understand

• Modular, separate from application

• Supports multi-objective optimization

• Always an answer; answer gets better with time.

• Easy to exploit previous or alternate solutions

• Flexible building blocks for hybrid applications.


Ensemble Method

• A set of classifiers whose individual decisions are combined in


some way (typically by weighted or unweighted voting) to
classify new examples.

• The most active areas of research in supervised learning has


been to study methods for constructing good ensembles of
classifiers.

• Ensembles are much more accurate than the individual
classifiers that make them up.

Many thanks to Gavin Brown , Ensemble Learning, University of


Manchester
Ensemble Method
• Example: Imagine we have an ensemble of three classifiers
{h1, h2, h3} and consider a new case x.

– If the three classifiers are identical (not diverse), then when
h1(x) is wrong, h2(x) and h3(x) will also be wrong.

– If the errors made by the classifiers are uncorrelated, then when
h1(x) is wrong, h2(x) and h3(x) may be correct, and the majority vote
will classify x correctly.

• Decisions can be combined by many methods, including


averaging, voting, and probabilistic methods.
Ensemble Method

• Simplest approach:
1. Generate multiple classification models
2. Each votes on test instance
3. Take majority as classification
Ensemble Method

• Differ in training strategy and method combination


– Bagging: parallel training with different training sets

– Boosting: sequential training, iteratively re-weighting training


examples so current classifier focuses on hard examples

– Mixture of experts: parallel training with objective encouraging


division of labor
Bagging

• Each member of the ensemble is constructed from a different


training dataset.
– Predictions combined either by uniform averaging or voting over
class labels
– Dataset is generated by sampling from the total N data
examples, choosing N items uniformly at random with
replacement.
– Each sample is known as a bootstrap sample
Bagging
– Bagging works best with unstable models (produce differing
generalization behavior with small changes to the training data)
such as decision tree, neural network
– Tends not to work well with very simple models
– Despite its apparent simplicity, Bagging is still not fully
understood.

• Bagging can hurt a stable algorithm: a Bayes-optimal algorithm may
leave out some training examples in every bootstrap sample (a short
sketch follows below).
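A hedged sketch of bagging with scikit-learn: each ensemble member is a decision tree trained on a bootstrap sample, and predictions are combined by voting. The synthetic dataset and the number of estimators are assumptions for illustration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # BaggingClassifier uses a decision tree as its default base estimator;
    # bootstrap sampling (with replacement) is also the default.
    bagged = BaggingClassifier(n_estimators=25, random_state=0)
    bagged.fit(X_tr, y_tr)
    print(bagged.score(X_te, y_te))   # accuracy of the majority-vote ensemble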
Boosting algorithms
• Boosting algorithms also works by manipulating training set, but
classifiers trained sequentially
• Constructing each ensemble member with some measurement
ensuring that it is substantially different from the other members.
– Alters the distribution of training examples to make more accurate
predictions where previous predictors have made errors.

– Adaboost is the most well known of the Boosting family of algorithms


Boosting

• Probably one of the most influential ideas in machine learning in


the last decade.
• It is a way of converting a “weak” learning model (one that behaves
slightly better than chance) into a “strong” learning model (one that
behaves arbitrarily close to perfect).

• Strong theoretical result, but also lead to a very powerful and


practical algorithm which is used all the time in real world
machine learning.
Rich Zemel, ML & DM-
Ensemble method
Weighting

• How to weight each training case for classifier m


Adaboost
• Trains models sequentially, with a new model trained at each
round.

• At each round, misclassified examples are identified and then
fed back into the start of the next round.

• The idea is that subsequent models should be able to
compensate for errors made by earlier models.

• The key difference from bagging is that at each round Bagging uses a
uniform distribution over training examples, while Adaboost adapts a
non-uniform distribution.
Mixture of Experts

• Widely investigated paradigm for creating a combination of


models.
• Commonly implemented with a neural network as the base
model, or some other model capable of estimating
probabilities.
• Reading assignment --- read the maths of
Bagging, Boosting and mixtures of experts
Thank You
