
Supervised Learning 1

Adane L. Mamuye

June 2020
Outline

• Data quality problems


• Data preprocessing
• Regression
Data Quality

• Data have quality if they satisfy the requirements of the


intended use.
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
Data Quality Problems

• Missing values
• Duplication
• Inconsistent data
• Outliers

• Preprocessing is one of the most critical steps in a data


analytics process.
Why Is Data Preprocessing Important?

• Less data: machine learning methods can learn faster


• Higher accuracy: machine learning methods can generalize
better
• Simpler results: the outputs of machine learning methods are easier
to understand
Data Preprocessing Major Tasks
Data Cleaning

• Data cleaning:
– Filling in missing values

– Smoothing noisy data

– Identifying or removing outliers


Data Integration
• Blending data from multiple sources into a coherent data store.

• Issues to be considered during integration:


– Entity identification

– Redundancy and correlation analysis

• Some redundancies can be detected by correlation analysis


Data Reduction

• Most machine learning and data mining techniques may not


be effective for high-dimensional data.
• Dimensionality reduction is the process of taking data in a
high dimensional space and mapping it into a new space
whose dimensionality is much smaller.
• Reasons:
– High dimensionality imposes computational challenges
– It can lead to poor generalization ability of the learning
algorithm (e.g., k-NN)
• Dimensionality reduction also helps in finding meaningful structure in the data
Data Reduction
• Data reduction strategies include:
– Dimensionality reduction:
• Wavelet transform, Principal component analysis, Attribute
subset selection.

– Numerosity reduction
• Parametric (regression and log-linear model) and
nonparametric (histograms, clustering, sampling and data
cube aggregation.)

– Data compression
• Lossless and lossy data compression techniques
Data Transformation and Discretization

• Data are transformed or consolidated into forms appropriate for


mining
• Involves the following:
– Smoothing: remove noise from the data – binning, regression,
and clustering
– Attribute construction: new attributes are constructed from the
given set of attributes
– Aggregation: summary or aggregation operations are applied to
the data
Dimensionality Reduction in Detail

• Reduction is performed by applying a linear transformation to


the original data
• Principal Component Analysis (PCA)
– Let {x1, …, xm} be m vectors in R^d; we want to reduce the
dimensionality of these vectors.
– A matrix W ∈ R^(n,d), where n < d, induces a mapping x → Wx,
where Wx ∈ R^n is the lower-dimensional representation of x.
– A second matrix U ∈ R^(d,n) can be used to (approximately) recover
each original vector x from its compressed version.
Dimensionality Reduction in Detail

• PCA is a method that brings together:


– A measure of how each variable is associated with one
another. (Covariance matrix.)

– The directions in which our data are dispersed.


(Eigenvectors.)

– The relative importance of these different directions.


(Eigenvalues.)
Dimensionality Reduction - Example
– Compute the mean of every dimension of the whole dataset.

– Compute the covariance matrix of the whole dataset.
  (Worked example with Math, English and Art test scores; the covariance
  matrix is shown on the slide.)

– Art test scores have more variability than English test scores.

– The covariance between Math and English and between Math and Art is
  positive, but the covariance between Art and English is unpredictable.

Source: Towards Data Science
Dimensionality Reduction - Example
– Compute the eigenvectors (vectors whose direction remains unchanged when a
  linear transformation is applied to them) and the corresponding eigenvalues.
– Let A be a square matrix, ν a vector and λ a scalar that satisfies Aν = λν;
  then λ is called an eigenvalue associated with the eigenvector ν of A.
– The eigenvalues of A are the roots of the characteristic equation
  det(A − λI) = 0.

Dimensionality Reduction - Example
• Calculate the eigenvectors corresponding to the above eigenvalues:
  – Compute the eigenvectors by Gaussian elimination, solving
    (A − λI)x = 0, where x is the eigenvector associated with eigenvalue λ.
• Sort the eigenvectors by decreasing eigenvalues.
• Keep the eigenvectors corresponding to the two largest eigenvalues.
• Transform the samples onto the new subspace (see the NumPy sketch below).
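A minimal sketch of these PCA steps with NumPy; the function name pca_project, the random data X and the choice of 2 components are illustrative assumptions, not values from the slides.

    import numpy as np

    def pca_project(X, n_components):
        # 1. Compute the mean of every dimension and center the data
        X_centered = X - X.mean(axis=0)
        # 2. Compute the covariance matrix of the whole dataset
        cov = np.cov(X_centered, rowvar=False)
        # 3. Eigenvalues/eigenvectors of the (symmetric) covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)
        # 4. Sort eigenvectors by decreasing eigenvalue, keep the top ones
        order = np.argsort(eigvals)[::-1]
        W = eigvecs[:, order[:n_components]]
        # 5. Transform the samples onto the new subspace
        return X_centered @ W

    X = np.random.rand(100, 5)   # 100 samples, 5 original dimensions (assumed)
    Z = pca_project(X, 2)        # Z has shape (100, 2)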


When should I use PCA?

• Do you want to reduce the number of variables, but aren’t


able to identify variables to completely remove from
consideration?

• Do you want to ensure your variables are independent of one


another?

• Are you comfortable making your independent variables less


interpretable?
• If you answered “yes” to all three questions, then PCA is a
good method to use.
Supervised Learning
Supervised Learning

• Each type of task is characterized by the kinds of data they


require and the kinds of output they generate
Classification Tasks

• Given:
– A set of classes
– Instances (examples) of
each class
• Described as a set of
features or attributes
and their values
• Generate: A method (aka
model) that when given a
new instance it will
determine its class
Supervised Learning
• Classification
– Output type: discrete (binary/multi-classes)
– Trying to find: a boundary
– Evaluation: accuracy
• Regression
– Output type: continuous
– Trying to find: best fit line
– Evaluation: sum of squared errors

Source: SlideShare
Classification Techniques

What classification techniques do you know?


Linear Regression

• The term regression comes from the fact that we fit a linear model to the
feature space.
• When the outcome and all the attributes are numeric, linear
regression is a natural technique to consider.
• The idea is to express the class as a linear combination of the
attributes, with predetermined weights:

    x = w0 + w1·a1 + w2·a2 + … + wk·ak

Where
• x – the class
• a1 to ak – the attribute values
• w0 to wk – the weights, calculated from the
training data
Linear Regression

• The predicted value (not the actual value) of the first instance's
class can be written as:

    w0 + w1·a1^(1) + w2·a2^(1) + … + wk·ak^(1)

where aj^(1) denotes the value of attribute j for the first instance.

• Linear regression is an excellent, simple method for numeric
prediction.
Linear Regression

• Best-fitting straight line will be found, where “best” is


interpreted as the least mean-squared difference.

• Linear regression measures the goodness of fit using the


squared error.
• Linear models serve well as building blocks for more complex
learning methods.
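A least-squares sketch with NumPy; the toy attribute matrix A and class values x are assumptions for illustration, not data from the slides.

    import numpy as np

    A = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])  # attribute values a1, a2
    x = np.array([3.1, 3.9, 6.2, 9.8])                              # numeric class values

    A1 = np.hstack([np.ones((A.shape[0], 1)), A])    # prepend a column of 1s for w0
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)       # weights minimizing squared error

    pred_first = A1[0] @ w                           # predicted class of the first instance
    print(w, pred_first)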
Logistic Regression

• One type of classification algorithm


• Logistic regression builds a linear model based on a
transformed target variable.

Source: Towards Data Science
Logistic Regression

• It is closely related to linear regression, but is used when the target
variable is categorical in nature.
• The outcome or target variable is dichotomous (two possible
classes) in nature.
• Logistic Regression predicts the probability of occurrence of a
binary event utilizing a logit function.
Logistic Regression
• From linear to logistic regression: apply the sigmoid function to the
linear combination β0 + β1·x1 + β2·x2 + … + βn·xn.

• Sigmoid function: also called the logistic function, it gives an 'S'-shaped
curve that can take any real-valued number and map it into a value
between 0 and 1.

Sigmoid function of the weighted sum:

    f(x) = 1 / (1 + e^(−y))
    f(x) = 1 / (1 + e^(−(β0 + β1·x1 + … + βn·xn)))

If the curve goes to positive infinity, y


predicted will become 1, and if the curve goes
to negative infinity, y predicted will become 0
Logistic Regression

• Decision Boundary

– Based upon this threshold, the obtained estimated


probability is classified into classes.
– Say, if predicted value ≥ 0.5, then classify email as spam else
as not spam.
Example: If the output is 0.75, we can say in terms of
probability as: There is a 75 percent chance that email will be
spam.
• Decision boundary can be linear or non-linear. Polynomial order
can be increased to get complex decision boundary.
Logistic Regression

• The cost function (the log loss) can be minimized by using Gradient
Descent.
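A hedged sketch of gradient descent on the logistic (cross-entropy) cost with NumPy; the toy data, learning rate and iteration count are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0.5], [1.5], [2.5], [3.5]])     # one feature
    y = np.array([0, 0, 1, 1])                     # binary labels
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column

    beta = np.zeros(Xb.shape[1])
    lr = 0.1
    for _ in range(5000):
        p = sigmoid(Xb @ beta)            # predicted probabilities
        grad = Xb.T @ (p - y) / len(y)    # gradient of the average log loss
        beta -= lr * grad                 # descent step

    print(beta, sigmoid(Xb @ beta))       # threshold the probabilities at 0.5 to classify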
Types of Logistic Regression
1. Binary Logistic Regression
– The categorical response has only two possible
outcomes. Example: Spam or Not
– One of the simplest and most commonly used Machine
Learning algorithms for two-class classification.
2. Multinomial Logistic Regression
– Three or more categories without ordering. Example:
Predicting which food is preferred more (Veg, Non-Veg,
Vegan)
3. Ordinal Logistic Regression
– Three or more categories with ordering. Example: Movie
rating from 1 to 5
Logistic Regression -- Advantage
• Advantages:
- Makes no assumptions about distributions of classes in
feature space
- Easily extended to multiple classes (multinomial regression)
- Natural probabilistic view of class predictions
- Quick to train
- Very fast at classifying unknown records
- Good accuracy for many simple data sets
- Resistant to overfitting
- Can interpret model coefficients as indicators of feature
importance
• What are the disadvantages of LR?
Question

• Discuss the difference between linear


regression and logistic regression
Supervised Learning 2

Adane L. Mamuye

June 2020
Outline

• Decision Tree classification algorithm


• Classification algorithms performance evaluation
• K-Nearest Neighbours
Classification Tasks

Data: A set of data records (also called examples, instances or cases)
described by:
– k attributes: A1, A2, …, Ak
– a class: each example is labelled with a pre-defined class
Goal: To learn a classification model from the data that can be
used to predict the classes of new (future, or test) cases/instances.

Learning (training): Learn a model using the training data


Testing: Test the model using unseen test data to assess the model accuracy
Decision Tree

• A decision tree is a predictor, h : X Y, that predicts the label


associated with an instance X by traveling from a root node of
a tree to a leaf.
• A method for approximating discrete classification functions
by means of a tree-based representation

• Tree leaf ↔ contains a specific


label
•Tree branch ↔ possible
attribute value for the instance in
question
DT Learning Algorithm

• Tree is constructed in a top-down recursive manner– greedy


search - through the space of possible solutions
• A general Decision Tree learning algorithm:
1. Perform a statistical test of each attribute (mostly categorical, though
continuous attribute values can also be handled) to determine how
well it classifies the training examples when considered alone;

2. Select the attribute that performs best and use it as the root of the
tree; order the attributes from highest information gain to lowest
information gain;
3. To decide the descendant node down each branch of the root
(parent node), sort the training examples according to the value
related to the current branch and repeat the process described in
steps 1 and 2 (a recursive process).
What is Good Attribute

• A good attribute splits the data so that
each successor node is as pure as possible.

• In other words:
– We want a measure that prefers attributes that have a high
degree of ”order”
• Maximum order: all examples are of the same class
• Minimum order: all classes are equally likely
• Needs a measure of impurity
Measures of Node Impurity

• Information Gain
– Determine how informative an attribute is
– Attributes are assumed to be categorical

• Gini Index
– Attributes are assumed to be continuous
– Assume there exist several possible split values for each
attribute
Information Gain

• The encoding information that would be gained by branching


on A.
Gain(A) = Info(D) - InfoA(D)
• Gain(A) tells us how much would be gained by branching on
A.

• The attribute A with the highest information gain is chosen as


the splitting attribute at node N.
Information Gain

Entropy, in information theory, specifies the minimum number of
bits needed to encode the class label of an instance:

    Info(D) = − Σi pi · log2(pi)

where pi is the proportion of tuples in D belonging to class Ci.
Information Gain

• Suppose we were to partition the tuples in D on some


attribute A having v distinct values, {a1, a2, …, av}
• Attribute A can be used to split D into v partitions or subsets,
{D1,D2,…, Dv}, where Dj contains those tuples in D that have
outcome aj of A.
• These partitions would correspond to the branches grown
from node N.
• The expected information required to classify a tuple from D
based on the partitioning by A:

    InfoA(D) = Σ(j=1..v) (|Dj| / |D|) · Info(Dj)
Information Gain

• Assume an attribute A splits the set S into subsets {S1, S2, …, Sv}.

• To compute the Information Gain, we have to compute:
– The entropy of the original set S
– The weighted average entropy of the subsets S1, …, Sv

• The encoding information that would be gained by branching
on A is the difference between the two (see the sketch below).

E(S) = 0 if S contains only positive or only negative examples
E(S) = 1 if S contains equal amounts of positive and negative examples
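A small sketch of these quantities in Python; the helper names entropy and information_gain and the toy label lists are assumptions for illustration.

    from collections import Counter
    from math import log2

    def entropy(labels):
        # E(S) = - sum of p_i * log2(p_i) over the classes in S
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        # Gain(A) = E(S) - weighted average entropy of the subsets S1..Sv
        n = len(labels)
        remainder = 0.0
        for v in set(attribute_values):
            subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    # an all-positive set has entropy 0; a 50/50 set has entropy 1
    print(entropy(["yes", "yes"]), entropy(["yes", "no"]))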
Play Tennis Example
Tree Construction-1
Tree Construction-2
Tree Construction-3
Tree Construction-4

Note: No root-to-leaf path should contain the


same discrete attribute twice
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the


same class OR when all the records have similar attribute
values.
Tree pruning

• Pruning reduces the size of decision trees by removing parts


of the tree that do not provide power to classify instances

• Prepruning: stop tree construction early, before the tree is fully grown

• Postpruning: prune after the tree is fully grown, to get a simpler tree
(find and prune unnecessary subtrees)

• Comparing prepruning and postpruning:
– Prepruning is faster, but postpruning leads to more
accurate trees.
Overfitting
• The goal of a good machine learning model is to generalize
well from the problem domain.

• Overfitting and underfitting are the two biggest causes for


poor performance of machine learning algorithms.

• If a decision tree is fully grown, it may lose some


generalization capability.
– This is a phenomenon known as overfitting.
• Overfitting refers to a model that models the training data
too well.
• This happens when the model learns the detail and noise in the training
data, so accuracy on new data is compromised.
Issues in decision trees

• Overfitting negatively impacts performance.

• A decision tree is said to overfit the training data if


– It results in poor accuracy to classify test samples

– It has too many branches that reflect anomalies


Causes of Overfitting
• Presence of Noise
– Mislabeled instances may contradict the class labels of
other similar records.

• Lack of Representative Instances


– Lack of representative instances in the training data can
prevent refinement of the learning algorithm.

• The Multiple Comparison Procedure


– Failure to compensate for algorithms that explore a large
number of alternatives can result in spurious fitting.
Avoiding Overfitting

• Ways to avoid overfitting:


1. Stop the training process before the learner reaches the point
where it perfectly classifies the training data.

2. Apply backtracking in the search for the optimal hypothesis. In


the case of Decision Tree Learning, backtracking process is
referred to as ‘post-pruning of the overfitted tree’.
Underfitting

• Underfitting refers to a model that can neither model the


training data nor generalize to new data.

• Underfitting: when model is too simple, both training and test


errors are large

• Easy to detect given a good performance metric

• Remedy: try out an alternative machine learning algorithm


Evaluation Metrics
Nearest Neighbours

• Based on learning by analogy– by comparing a given test tuple


with training tuples that are similar to it.

• Training tuple represents a point in an n-dimensional space–


described by n attributes.

• When an unknown tuple is given, the classifier searches the pattern space
for the k training tuples (the k nearest neighbors) that are closest to
the unknown tuple.
K-Nearest Neighbours

• Euclidean distance: the most commonly used measure

• Determine the class from the nearest-neighbor list: take the
majority vote of class labels among the k nearest neighbors
• Optionally, weigh each vote according to distance
K-Nearest Neighbours

• Example:

  Name     Acid Durability (AD)   Strength   Class
  Type-1   7                      7          Bad
  Type-2   7                      4          Bad
  Type-3   3                      1          Good
  Type-4   1                      4          Good

• Test data: AD = 3, S = 7, Class = ? Check for k = 1, 2, 3
  (worked through in the sketch below)
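A plain-Python sketch of the example above: Euclidean distance from the test point (AD = 3, Strength = 7) to each training tuple, then a majority vote over the k nearest. Note that for k = 2 the vote is a tie and is broken here by whichever class appears first.

    from collections import Counter
    from math import dist

    train = [("Type-1", (7, 7), "Bad"),
             ("Type-2", (7, 4), "Bad"),
             ("Type-3", (3, 1), "Good"),
             ("Type-4", (1, 4), "Good")]
    test = (3, 7)

    # sort training tuples by Euclidean distance to the test point
    ranked = sorted(train, key=lambda t: dist(t[1], test))
    for k in (1, 2, 3):
        votes = Counter(label for _, _, label in ranked[:k])
        print(k, votes.most_common(1)[0][0])   # majority class among the k nearest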


K-Nearest Neighbours

• Choosing k for K-NN is just one of the many model selection


problems we face in machine learning

– If k is too small, sensitive to noise points


– If k is too large, neighborhood may include points from other
classes

• Normalize the values of each attribute before computing


closeness– min-max
K-Nearest Neighbours

• Advantages
– Conceptually simple, easy to understand and explain
– Very flexible decision boundaries
– Not much learning at all
• Disadvantages
– It can be hard to find a good distance measure
– Irrelevant features and noise can be very detrimental
– Typically can not handle more than a few dozen attributes
– Computational cost: requires a lot of computation and memory
Thank You
Supervised Learning 3

Adane L. Mamuye

June 2020
Outline

• Artificial Neural Network


• Bayes learning
ANN

• Neural networks are nonlinear models inspired by the


structure of neural networks in the brain.
• We use ANN:
– They are extremely powerful computational devices

– Massive parallelism makes them very efficient

– They are particularly fault tolerant


How ANN Works

– Receives inputs

– Combines them in some way

– Performs a generally nonlinear operation on the result

– Outputs the final result


Artificial Neural Networks: the dimensions
How ANN Works

• The three basic components of the (artificial) neuron are:


– synapses or connecting links

– adder that sums

– Activation function.
How ANN works
Activation functions
Perceptron

Perceptron = a neuron whose input is the dot product of W
and X and which uses a step function as its transfer function
Perceptron: Example 1 -AND
Perceptron: Example 2 -OR
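A minimal sketch of the AND and OR examples with a step-function perceptron; the particular weights and biases below are assumed values that realize the two gates, not necessarily those shown on the slides.

    import numpy as np

    def perceptron(x, w, b):
        # output 1 if the weighted sum w.x + b reaches 0, else 0 (step function)
        return int(np.dot(w, x) + b >= 0)

    for x1 in (0, 1):
        for x2 in (0, 1):
            and_out = perceptron([x1, x2], w=[1, 1], b=-1.5)  # fires only when both inputs are 1
            or_out  = perceptron([x1, x2], w=[1, 1], b=-0.5)  # fires when at least one input is 1
            print(x1, x2, and_out, or_out)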
NNs: Architecture
Classification Back Propagation

• Back Propagation learns by iteratively processing a set of


training data (samples).

• For each sample, weights are modified to minimize the error


between network’s classification and actual classification.
Steps in Back Propagation Algorithm

• STEP ONE: initialize the weights and biases.

• The weights in the network are initialized to small random numbers.
• Each unit has a BIAS associated with it.
• The biases are similarly initialized to small random numbers.

• STEP TWO: feed the training sample.


Steps in Back Propagation Algorithm

• STEP THREE: propagate the inputs forward; we compute the


net input and output of each unit in the hidden and output
layers.

• STEP FOUR: back propagate the error.

• STEP FIVE: update weights and biases to reflect the


propagated errors.

• STEP SIX: terminating conditions.


Example

• Given: a feed-forward network with three input units (1, 2, 3), two hidden
units (4, 5) and one output unit (6), connected by weights w14, w15, w24,
w25, w34, w35, w46 and w56 (the network diagram is shown on the slide).

Initial input, weight and bias values:

• x1 = 1, x2 = 0, x3 = 1,
  w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2,
  w46 = -0.3, w56 = -0.2,
  θ4 = -0.4, θ5 = 0.2, θ6 = 0.1, learning rate l = 0.9
Solution
• The net input of a neuron is obtained by taking a weighted sum of the
  outputs of all the neurons connected to it, plus its bias:

    Ij = Σi wij·Oi + θj

    I4 = w14·x1 + w24·x2 + w34·x3 + θ4
       = 1·0.2 + 0·0.4 + 1·(-0.5) + (-0.4)
       = -0.7

• Given the net input Ij, the output is computed as Oj = 1 / (1 + e^(−Ij)):

    O4 = 1 / (1 + e^(0.7)) ≈ 0.332

• Accordingly, the net inputs and outputs of the remaining units are:

    I5 = -0.3 + 0 + 0.2 + 0.2 = 0.1                     →  O5 = 1 / (1 + e^(−0.1)) ≈ 0.525
    I6 = (-0.3)·0.332 + (-0.2)·0.525 + 0.1 ≈ -0.105     →  O6 = 1 / (1 + e^(0.105)) ≈ 0.474
Solution
• Calculate the error at each node of the output layer:

    Errj = Oj·(1 − Oj)·(Tj − Oj)

  where Oj is the actual output and Tj is the known target value.

    Err6 = (0.474)(1 − 0.474)(1 − 0.474) = 0.1311

• Error rate of a hidden-layer unit j:

    Errj = Oj·(1 − Oj)·Σk Errk·wjk

  where wjk is the connection weight from unit j to unit k and Errk is the
  error rate of unit k.

    Err5 = (0.525)(1 − 0.525)(0.1311)(−0.2) = −0.0065
    Err4 = (0.332)(1 − 0.332)(0.1311)(−0.3) = −0.0087
Solution
• Calculation of weight and bias updates:
– Weights and biases are updated to reflect the propagated error:

    Δwij = (l)·Errj·Oi        wij = wij + Δwij

    w46 = -0.3 + (0.9)(0.1311)(0.332) = -0.261

– Biases are updated by:

    Δθj = (l)·Errj            θj = θj + Δθj

    θ6 = 0.1 + (0.9)(0.1311) = 0.218
• Forward propagation starts again
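A NumPy sketch that reproduces the forward pass, error terms and updates of the worked example above (same initial values; the target label T = 1 is taken from the error computation).

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 0.0, 1.0])
    W_hidden = np.array([[0.2, -0.3],     # w14, w15
                         [0.4,  0.1],     # w24, w25
                         [-0.5, 0.2]])    # w34, w35
    w_out = np.array([-0.3, -0.2])        # w46, w56
    theta = np.array([-0.4, 0.2, 0.1])    # biases of units 4, 5, 6
    lr, target = 0.9, 1.0

    # forward pass
    I_hidden = x @ W_hidden + theta[:2]          # I4 = -0.7, I5 = 0.1
    O_hidden = sigmoid(I_hidden)                 # O4 ~ 0.332, O5 ~ 0.525
    O6 = sigmoid(O_hidden @ w_out + theta[2])    # ~ 0.474

    # backpropagate the error
    err6 = O6 * (1 - O6) * (target - O6)                      # ~ 0.1311
    err_hidden = O_hidden * (1 - O_hidden) * (err6 * w_out)   # Err4, Err5

    # update weights and biases
    w_out += lr * err6 * O_hidden                # w46 ~ -0.261
    W_hidden += lr * np.outer(x, err_hidden)
    theta += lr * np.append(err_hidden, err6)    # theta6 ~ 0.218
    print(O6, err6, w_out, theta)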


ANN Application

• Real world applications:


– Financial modelling:- predicting the stock market
– Time series prediction:- climate, weather, seizures
– Computer games:- intelligent agents, chess, backgammon
– Robotics:- autonomous adaptable robots
– Pattern recognition:- speech recognition, seismic activity, sonar
signals
– Data analysis:- data compression, data mining
– Bioinformatics:- DNA sequencing, alignment
Weakness of ANN

• The complex internal structure shows black-box behavior: it is
very hard to get an idea of the meaning of the internal
computations.
• Another feature of neural networks is their random behavior.
– The training process contains random elements. When this is
repeated, the same input set may yield very different
networks.
– Sometimes they differ in performance, one showing good
behavior, while others behave badly.
• Neural Network needs long time for training.
Bayes Learning
• Use a probability framework for fitting a predictive model to a
training dataset.
• Has two roles
– Provides learning algorithms
• Naïve Bayes learning
• Bayes Belief Network learning
– Provides conceptual framework
• Provides “gold standard” to evaluate other learning
algorithms
Probability Theory

• Conditional (posterior) probabilities:


– Formalize the process of accumulating evidence and
updating probabilities based on new evidence
– Specify the belief in one proposition (event, conclusion,
diagnosis, etc.) conditioned on another proposition (evidence,
feature, symptom, etc.)
• P(A|B) is the conditional probability of A given evidence B:

    P(A|B) = P(A ∧ B) / P(B)
Probability Theory

• Conditional probabilities behave like standard probabilities:


– Within the range [0, 1]
– Sum to 1

• Can have P( conjunction of events |B)


– P(A ᴧ B ᴧ C | E) is the probability that the sentence “A ᴧ B ᴧ C” is true
conditioned on the evidence E being true.
Rules of Probability Theory

• Negation: probability event A being false:


P(¬A|B) = 1- P(A|B)
• Sum rule: probability of a disjunction of two events A and B:
P(A v B) = P(A) + P(B) – P(AᴧB)
• Product Rule: probability of a conjunction of two events A and B
P(A ᴧ B) = P(A|B) x P(B)
= P(B|A) x P(A)
• Chain rule: generalization of the product rule for any number of
events.
P(A ᴧ B ᴧ C) = P(A|B ᴧ C) x P(B|C) x P(C)
Rules of Probability Theory

• Conditional chain rule: variant of the chain rule for conditional


probabilities
P(A ᴧ B|C) = P(A|B ᴧ C) x P(B|C)
• Total Probability: summing out over mutually exclusive and exhaustive
events B1, …, Bn:

    P(A) = Σi P(A|Bi)·P(Bi)
Bayes Classification Method

• The goal is to learn a model and use this model to predict.


– Learn a probability distribution

– Use the distribution to make a decision

• A learner tries to find the most probable hypothesis h from a
set of hypotheses H, given the observed data.
• Bayesian classifiers (statistical classifiers) -- predict class
membership probabilities -- based on Bayes’ theorem
Bayes Theorem

• Let X be a data tuple.

• Let H be some hypothesis such as that the data tuple X belongs


to a specified class C.

• We want to determine P(H|X)– posterior probability of H


conditioned on X

• We are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X. Bayes' theorem gives:

    P(H|X) = P(X|H)·P(H) / P(X)
Example of Bayes Theorem

• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is1/50,000
– Prior probability of any patient having stiff neck is1/20

• If a patient has stiff neck, what is the probability he/she has


meningitis?
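Working the numbers through Bayes' theorem, with M = meningitis and S = stiff neck (using the figures above):

    P(M|S) = P(S|M)·P(M) / P(S)
           = (0.5 × 1/50,000) / (1/20)
           = 0.0002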
Naïve Bayes Classification

• Simple Bayesian classifier known as the Naïve Bayesian


– Comparable in performance with decision tree and neural
network
• How it works
1. Each tuple is represented by an n-dimensional attribute vector
   X = {x1, x2, …, xn}.
2. Suppose there are m classes {C1, C2, …, Cm}.
   The classifier predicts that X belongs to the class having the highest
   posterior probability conditioned on X:

    P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
Naïve Bayes Classification
3. For data with many attributes, class-conditional independence is assumed:

    P(X|Ci) = ∏(k=1..n) P(xk|Ci)

   What if an attribute value is continuous? Use a Gaussian distribution with
   mean µ and standard deviation σ.

4. Select the class with the highest conditional probability, i.e. Ci such that

    P(X|Ci)·P(Ci) > P(X|Cj)·P(Cj)   for all j ≠ i


Estimating probability from Data
Example Naïve Bayes
Example Naïve Bayes
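A minimal sketch with scikit-learn's Gaussian Naive Bayes; the toy data below is an assumption for illustration (the slide's example data is not reproduced here). Each P(xk|Ci) is modeled with a Gaussian, and the class with the highest P(X|Ci)·P(Ci) is returned.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.5], [3.2, 3.7]])
    y = np.array([0, 0, 1, 1])

    model = GaussianNB().fit(X, y)
    print(model.predict([[1.1, 2.0]]))         # predicted class
    print(model.predict_proba([[1.1, 2.0]]))   # posterior probabilities P(Ci|X)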
Naïve Bayes Summary

• Robust to isolated noise points

• Handle missing values by ignoring the instance during


probability estimate calculations

• Independence assumption may not hold for some attributes.

– Use other techniques such as Bayesian Belief Networks


(BBN)
naïve Bayes vs Bayes Nets

• Note that the naïve Bayes classifier is simply a special instance


of a Bayes Net.

• Naïve Bayes simplifies computation by making the assumption
of class-conditional independence.

• Bayesian belief networks allow the representation of


dependencies among subsets of attributes.
Bayesian Belief Network

• Belief Measure:

– In general, a person's belief in a statement a will depend on
some body of knowledge K. We write this as P(a|K).

– P(a|K) represents a belief measure.


BBN

• A belief network is defined by two components—a directed


acyclic graph and a set of conditional probability tables

•The CPT for a variable Y specifies


the conditional distribution
P(Y|Parents(Y))
BBN Example

• Bayesian Networks implicitly encode the Markov assumption:
each variable is conditionally independent of its non-descendants
given its parents. The joint probability becomes:

    P(x1, …, xn) = ∏i P(xi | Parents(xi))
Training Bayesian Belief Networks

• Several algorithms exist for learning the network topology


from the training data given observable variables.

• If the network topology is known and the variables are


observable, then training the network is straightforward.

• When the network topology is given and some of the


variables are hidden, there are various techniques.
Training Bayesian Belief Networks
– Gradient descent
• Let D be a training set of data tuples, X1,X2, . . . , X|D|.

• Training the belief network means that we must learn the


values of the CPT entries.

• The CPT entries are considered as the weights of the network,
analogous to the hidden weights of a NN.

• A gradient descent strategy performs greedy hill-climbing--


at each iteration, the weights are updated and will
eventually converge to a local optimum solution
Application

Thank you
Supervised Learning 5

Adane L. Mamuye

June 2020
Introduction SVM

• Basic idea of support vector machines:


– Optimal hyperplane for linearly separable patterns
• A hyperplane is a linear decision surface that splits the space
into two parts

– For nonlinearly separable data-- transformations of original


data to map into new space – the Kernel function

Thank you--- Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†,


Constantin F. Aliferis*---- New York University
Introduction SVM

• Important because of:


– Robust to very large number of variables and small samples

– Can learn both simple and highly complex classification models

– Employ sophisticated mathematical principles to avoid


overfitting

– Can be used for both classification and regression tasks


Main ideas of SVMs

• Find a linear decision surface (“hyperplane”) that can separate


patient classes and has the largest distance (i.e., largest “gap”
or “margin”) between border-line patients (i.e., “support
vectors”);
Main ideas of SVMs

• If linear decision surface does not exist, the data is mapped


into a much higher dimensional space (feature space)
• The feature space is constructed via a fancy mathematical
projection called the kernel trick.
Support Vectors
• Support vectors are the data points that lie closest to the
decision surface
– Most difficult to classify

– Critical elements of the training set

• They would change the position of the dividing hyperplane if removed.


Support vectors

Support Vectors: touch the boundary of the margin


Support Vector Machine

• SVMs maximize the margin around


the separating hyperplane.

• The decision function is fully


specified by a subset of training
samples, the support vectors.
• 2-Ds, it’s a line.
• 3-Ds, it’s a plane.
• In more dimensions, call it a
hyperplane.
Input and Outputs in SVM

• Input: set of (input, output) training pair samples; call the


input sample features x1, x2…xn, and the output result y.

• Output: set of weights w (or wi), one for each feature, whose
linear combination predicts the value of y. (just like neural
nets)
SVM - Mathematical Concepts

• Represent samples geometrically.

Purpose of vector representation
• Representing each sample/patient as a vector allows us to
geometrically represent the decision surface that separates
two groups of samples/patients.

• In order to define the decision surface, we need to introduce


some basic math elements.
Basic Operations on Vectors

• Multiplication by a scalar: when you multiply a vector by a


scalar, you “stretch” it in the same or opposite direction
depending on the direction of the vector.
Basic Operations on Vectors

• Addition
Basic Operations on Vectors

• Subtraction
Basic Operations on Vectors

• Euclidean length (L2-norm)

L2-norm is a typical way to measure length of a vector


Basic Operations on Vectors

• Dot product
Basic Operations on Vectors

The response variable y is just a dot product of the feature
vector x and the weight vector w.
Equation of Hyperplane
Hyperplane Example

• Which one is better, B1 or B2? How do you define "better"?
We can define the better, or optimal, hyperplane by choosing the one
with the maximal (largest) margin from the vectors to the plane.
Hyperplane Example

• Optimal classification occurs when a hyperplane provides
maximal distance to the nearest training data points,
or support vectors. Intuitively, this makes sense: if the
points are well separated, the classification between two
groups is much clearer.
SVM - the widest street approach

- Linearly Separable Case -

A separating hyperplane can be written as

    W · X + b = 0

where:
- W = {w1, w2, …, wd} is the weight vector;
- b is called the bias.
Linear and non-linear separable data

Which one is easy to separate?


Kernel trick

• How to efficiently compute the hyperplane that separates two


classes with the largest street width?

• How to separate linear inseparable?-- optimization problem

• Kernel trick: for linearly inseparable cases in SVM, kernel trick is


a commonly used technique.
Kernel trick

– Is a nonlinear transformation of samples from the original


space to a feature space with higher or even infinite
dimension so as to make the problems linearly separable.

– Nonlinear mapping function map data 𝑋 in original (or


primal) space into a higher (ever infinite) dimension space
𝐹
Kernel trick

• The separating hyperplane function is extended by replacing dot products
in the original space with kernel evaluations K(xi, xj) in the feature space.

• The separating hyperplane in the feature space is W · φ(X) + b = 0,
where φ is the nonlinear mapping function above (see the sketch below).
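A hedged sketch with scikit-learn; the toy datasets below are assumptions for illustration: a linear SVM on separable data, and an RBF-kernel SVM for data that is not linearly separable in the original space.

    import numpy as np
    from sklearn.svm import SVC

    # linearly separable case: maximum-margin hyperplane
    X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])
    linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
    print(linear_svm.support_vectors_)        # the border-line points

    # nonlinearly separable case: kernel trick via an RBF kernel
    X2 = np.array([[0, 0], [0.1, -0.1], [1, 1], [-1, 1], [1, -1], [-1, -1]])
    y2 = np.array([0, 0, 1, 1, 1, 1])
    rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X2, y2)
    print(rbf_svm.predict([[0.05, 0.05], [0.9, -0.9]]))   # a centre point and a corner point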


Strong points of SVM-based learning
methods

• Empirically achieve excellent results in high-dimensional data


with very few samples
• Internal capacity control to avoid overfitting
• Can learn both simple linear and very complex nonlinear
functions by using “kernel trick”
• Robust to outliers and noise
• Do not require direct access to data, work only with dot-
products of data-points.
Weak points of SVM-based learning
methods

• Interpretation is less straightforward than classical statistics


• Lack of parametric statistical significance tests
• Has several key parameters like C, kernel function, and
Gamma that all need to be set correctly
Genetic Algorithm
• Inspired by Charles Darwin’s theory of natural evolution

• Genetic algorithms are a type of optimization algorithm


– Used to find the optimal solution(s) to a given computational problem.

• Genetic algorithms represent one branch of the field of study


called evolutionary computation

• Like in evolution, many of a genetic algorithm's processes are random;
however, the technique allows one to set the level of randomization and
the level of control.
Genetic Algorithm

• An individual is characterized by a set of parameters


(variables) known as Genes. Genes are joined into a string to
form a Chromosome (solution).

• The genes of an individual are represented using a string (of binary
values), in terms of an alphabet.

• The genes are encoded in a chromosome.
Genetic Algorithm

• The basic components common to almost all genetic


algorithms are:
– A fitness function for optimization
• Tests and quantifies how `fit' each potential solution is
• One of the most pivotal parts of the algorithm
– A population of chromosomes
• Refers to a numerical value or values that represent a candidate
solution to the problem that the genetic algorithm is trying to
solve
• Each candidate solution is encoded as an array of parameter
values
Genetic Algorithm

• Selection Methods: can be broadly classified into two classes


as follows.
– Fitness Proportionate Selection: includes methods such as roulette-
wheel selection and stochastic universal selection
– Ordinal Selection: includes methods such as tournament selection and
truncation selection
Genetic Algorithm

– Crossover: produces the next generation of chromosomes.
  Offspring are created by exchanging the genes of the parents among
  themselves until the crossover point is reached.
Basic Genetic algorithm operators
Genetic Algorithm

- Random mutation of chromosomes in the new generation

• Mutation occurs to maintain diversity within the population and to
prevent premature convergence (a short sketch of these operators follows
below).
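A minimal sketch of the components above (fitness function, population of bit-string chromosomes, roulette-wheel selection, crossover, mutation); the target problem, maximizing the number of 1-bits, and all parameter values are assumptions chosen to keep the example short.

    import random

    CHROM_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 50, 0.01

    def fitness(chrom):                 # how "fit" a candidate solution is
        return sum(chrom)

    def roulette(pop):                  # fitness-proportionate (roulette-wheel) selection
        return random.choices(pop, weights=[fitness(c) + 1e-9 for c in pop], k=1)[0]

    def crossover(p1, p2):              # exchange genes up to a random crossover point
        point = random.randrange(1, CHROM_LEN)
        return p1[:point] + p2[point:]

    def mutate(chrom):                  # flip bits to maintain diversity
        return [1 - g if random.random() < MUT_RATE else g for g in chrom]

    population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population = [mutate(crossover(roulette(population), roulette(population)))
                      for _ in range(POP_SIZE)]

    print(max(fitness(c) for c in population))   # best fitness found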
Application of Genetic algorithm
Benefits

• Concept is easy to understand

• Modular, separate from application

• Supports multi-objective optimization

• Always an answer; answer gets better with time.

• Easy to exploit previous or alternate solutions

• Flexible building blocks for hybrid applications.


Ensemble Method

• A set of classifiers whose individual decisions are combined in


some way (typically by weighted or unweighted voting) to
classify new examples.

• The most active areas of research in supervised learning has


been to study methods for constructing good ensembles of
classifiers.

• Ensembles are much more accurate than the individual
classifiers that make them up.

Many thanks to Gavin Brown , Ensemble Learning, University of


Manchester
Ensemble Method
• Example: Imagine we have an ensemble of three classifiers
{h1, h2, h3} and consider a new case x.

– If the three classifiers are identical (not diverse), then when
h1(x) is wrong, h2(x) and h3(x) will also be wrong.

– If the errors made by the classifiers are uncorrelated, then when
h1(x) is wrong, h2(x) and h3(x) may be correct, and the majority vote
will classify x correctly.

• Decisions can be combined by many methods, including


averaging, voting, and probabilistic methods.
Ensemble Method

• Simplest approach:
1. Generate multiple classification models
2. Each votes on test instance
3. Take majority as classification
Ensemble Method

• Differ in training strategy and method combination


– Bagging: parallel training with different training sets

– Boosting: sequential training, iteratively re-weighting training


examples so current classifier focuses on hard examples

– Mixture of experts: parallel training with objective encouraging


division of labor
Bagging

• Each member of the ensemble is constructed from a different


training dataset.
– Predictions combined either by uniform averaging or voting over
class labels
– Dataset is generated by sampling from the total N data
examples, choosing N items uniformly at random with
replacement.
– Each sample is known as a bootstrap sample
Bagging
– Bagging works best with unstable models (produce differing
generalization behavior with small changes to the training data)
such as decision tree, neural network
– Tends not to work well with very simple models
– Despite its apparent simplicity, Bagging is still not fully
understood.

• Bagging can hurt a stable algorithm: a Bayes-optimal algorithm may
leave out some training examples in every bootstrap sample (a short
sketch follows below).
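A hedged sketch of bagging with scikit-learn: each ensemble member is a decision tree trained on a bootstrap sample, and predictions are combined by voting. The synthetic dataset and the number of estimators are assumptions for illustration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # BaggingClassifier uses a decision tree as its default base estimator;
    # bootstrap sampling (with replacement) is also the default.
    bagged = BaggingClassifier(n_estimators=25, random_state=0)
    bagged.fit(X_tr, y_tr)
    print(bagged.score(X_te, y_te))   # accuracy of the majority-vote ensemble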
Boosting algorithms
• Boosting algorithms also works by manipulating training set, but
classifiers trained sequentially
• Constructing each ensemble member with some measurement
ensuring that it is substantially different from the other members.
– Alters the distribution of training examples to make more accurate
predictions where previous predictors have made errors.

– Adaboost is the most well known of the Boosting family of algorithms


Boosting

• Probably one of the most influential ideas in machine learning in


the last decade.
• It is a way of converting a “weak” learning model (one that behaves
slightly better than chance) into a “strong” learning model (one that
behaves arbitrarily close to perfect).

• Strong theoretical result, but also lead to a very powerful and


practical algorithm which is used all the time in real world
machine learning.
Rich Zemel, ML & DM-
Ensemble method
Weighting

• How to weight each training case for classifier m


Adaboost
• Trains models sequentially, with a new model trained at each
round.

• At each round, misclassified examples are identified and then
fed back into the start of the next round.

• The idea is that subsequent models should be able to
compensate for errors made by earlier models.

• The key difference from bagging is that at each round Bagging uses a
uniform distribution over training examples, while Adaboost adapts a
non-uniform distribution.
Mixture of Experts

• Widely investigated paradigm for creating a combination of


models.
• Commonly implemented with a neural network as the base
model, or some other model capable of estimating
probabilities.
• Reading assignment --- read the maths of
Bagging, Boosting and mixtures of experts
Thank You
