
MACHINE LEARNING (ML-4)

Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura


AGENDA
Concept Learning
Hypotheses Representation
Find-S Algorithm

NEERAJ.GUPTA@GLA.AC.IN 2
LEARNING
Humans learn about their surroundings through five senses: eye, ear, nose, tongue, and skin.

LEARNING
1. Rote Learning (memorization): Memorizing things without
knowing the concept or logic behind them.
2. Passive Learning (instructions): Learning from a
teacher/expert.
3. Analogy (experience): Learning new things from our past
experience.
4. Inductive Learning (experience): On the basis of past
experience, formulating a generalized concept.
5. Deductive Learning: Deriving new facts from past facts.

CONCEPT LEARNING
Inducing general functions from specific training examples is a
main issue of machine learning.

Tom Mitchell defines concept learning as:


“Problem of searching through a predefined
space of potential hypotheses for the
hypothesis that best fits the training examples”
DEFINITION OF CONCEPT LEARNING
Task: learning a category description (concept) from a set of
positive and negative training examples.
Concept may be an event, an object …
Target function: a boolean function c: X → {0, 1}
Experience: a set of training instances D:{x, c(x)}
A search problem for best fitting hypothesis in a hypotheses space.

CONCEPT LEARNING
A Formal Definition for Concept Learning:
Inferring a Boolean-valued function from training examples of
its input and output.
• An example for concept-learning is the learning of bird-concept
from the given examples of birds (positive examples) and non-
birds (negative examples).
• We are trying to learn the definition of a concept from given
examples.
SPORT EXAMPLE
Concept to be learned:
Days in which Aldo can enjoy water sport
Attributes:
Sky: Sunny, Cloudy, Rainy Wind: Strong, Weak
AirTemp: Warm, Cold Water: Warm, Cool
Humidity: Normal, High Forecast: Same, Change
Instances in the training set:
(out of the 96 possible):

HYPOTHESES REPRESENTATION
h is a set of constraints on attributes:
• a specific value: e.g. Water = Warm
• any value allowed: e.g. Water = ?
• no value allowed: e.g. Water = Ø
Example hypothesis:
Sky AirTemp Humidity Wind Water Forecast
Sunny, ?, ?, Strong, ?, Same
Corresponding to boolean function:
Sunny(Sky) ∧ Strong(Wind) ∧ Same(Forecast)
H, hypotheses space, all possible h

HYPOTHESIS SATISFACTION
An instance x satisfies a hypothesis h iff all the constraints expressed by h are satisfied by the attribute values in x.
Example 1:
x1: Sunny, Warm, Normal, Strong, Warm, Same
h1: Sunny, ?, ?, Strong, ?, Same Satisfies?
Yes
Example 2:
x2: Sunny, Warm, Normal, Strong, Warm, Same
h2: Sunny, ?, ?, Ø, ?, Same Satisfies?
No
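These checks can be coded directly. A minimal sketch (function and variable names are ours): a hypothesis is a tuple of constraints, with "?" meaning any value allowed and "Ø" meaning no value allowed.

```python
# An instance x satisfies hypothesis h iff every constraint is "?" (any
# value) or equals the corresponding attribute value in x. The empty
# constraint "Ø" matches no value, so it rejects every instance.

def satisfies(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

x1 = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
h1 = ("Sunny", "?", "?", "Strong", "?", "Same")
h2 = ("Sunny", "?", "?", "Ø", "?", "Same")

print(satisfies(h1, x1))  # True
print(satisfies(h2, x1))  # False: the Ø constraint rejects every value
```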

FORMAL TASK DESCRIPTION
Given:
• X: all possible days, as described by the attributes
• A set of hypotheses H, each a conjunction of constraints on the attributes, representing a function h: X → {0, 1}
  [h(x) = 1 if x satisfies h; h(x) = 0 if x does not satisfy h]
• A target concept c: X → {0, 1}, where
  c(x) = 1 iff EnjoySport = Yes;
  c(x) = 0 iff EnjoySport = No;
• A training set of possible instances D: {⟨x, c(x)⟩}
Goal: find a hypothesis h in H such that
h(x) = c(x) for all x in D
Hopefully h will be able to predict outside D…
THE INDUCTIVE LEARNING ASSUMPTION
We can at best guarantee that the output hypothesis fits the target
concept over the training data
Assumption: a hypothesis that approximates the training data well will also approximate the target function over unobserved examples
i.e. given a significant training set, the output hypothesis is able to
make predictions

CONCEPT LEARNING AS SEARCH
Concept learning is a task of searching a hypothesis space
The representation chosen for hypotheses determines the search space
In the example we have:
3 × 2⁵ = 96 possible instances (6 attributes)
1 + 4 × 3⁵ = 973 semantically distinct hypotheses,
considering that all the hypotheses containing some Ø are semantically equivalent (they classify every instance as negative)
Structuring the search space may help in searching more efficiently

GENERAL TO SPECIFIC ORDERING
Consider: h1 = Sunny, ?, ?, Strong, ?, ?
h2 = Sunny, ?, ?, ?, ?, ?
Any instance classified positive by h1 will also be classified
positive by h2
h2 is more general than h1
Definition: hj ≥g hk iff (∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
≥g: more general or equal; >g: strictly more general
Most general hypothesis: ⟨?, ?, ?, ?, ?, ?⟩
Most specific hypothesis: ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩
GENERAL TO SPECIFIC ORDERING: INDUCED STRUCTURE

FIND-S: FINDING THE MOST SPECIFIC HYPOTHESIS
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
   for each attribute constraint a in h:
      if the constraint a is satisfied by x, do nothing;
      else replace a in h by the next more general constraint that is satisfied by x (move towards a more general hypothesis)
3. Output hypothesis h
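A minimal Python sketch of these steps (function and variable names are ours), run on the standard EnjoySport training set from Mitchell's textbook, with attributes in the order Sky, AirTemp, Humidity, Wind, Water, Forecast:

```python
# Find-S: start from the most specific hypothesis and generalize it just
# enough to cover each positive example; negative examples are ignored.

def find_s(examples):
    h = ["Ø"] * len(examples[0][0])           # most specific hypothesis
    for x, label in examples:
        if label != "Yes":                    # skip negative examples
            continue
        for i, (c, v) in enumerate(zip(h, x)):
            if c == "Ø":
                h[i] = v                      # first positive: copy the value
            elif c != v:
                h[i] = "?"                    # mismatch: generalize
    return h

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(D))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```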
FIND-S IN ACTION

EXAMPLE
To illustrate this algorithm, assume the learner is given the sequence of training
examples from the EnjoySport task

PROPERTIES OF FIND-S
Find-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples
The final hypothesis will also be consistent with the negative examples
Problems:
• There can be more than one “most specific hypothesis”
• We cannot say whether the learner converged to the correct target
• Why choose the most specific?
• If the training examples are inconsistent, the algorithm can be misled: no tolerance to noise
• Negative examples are not considered

QUESTION
Consider the following data set having the data about which particular seeds are
poisonous.

EXAMPLE
First, we take the most specific hypothesis.
Hence, our hypothesis would be:
h = {ϕ, ϕ, ϕ, ϕ}

Instance 1 :
The data in example 1 is { GREEN, HARD, NO, WRINKLED }. We see that our initial hypothesis is
more specific and we have to generalize it for this example. Hence, the hypothesis becomes :

h = { GREEN, HARD, NO, WRINKLED }


Instance 2 :
Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.

h = { GREEN, HARD, NO, WRINKLED }

EXAMPLE
Instance 3 :

Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.

h = { GREEN, HARD, NO, WRINKLED }


Instance 4 :
The data present in example 4 is { ORANGE, HARD, NO, WRINKLED }. We compare every attribute with the current hypothesis, and wherever there is a mismatch we replace that attribute with the general case (“?”). After doing so, the hypothesis becomes:
h = { ?, HARD, NO, WRINKLED }
EXAMPLE
Instance 5 :
The data present in example 5 is { GREEN, SOFT, YES, SMOOTH }. We compare every attribute with the current hypothesis, and wherever there is a mismatch we replace that attribute with the general case (“?”). After doing so, the hypothesis becomes:
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis are general, examples 6 and 7 would leave the hypothesis unchanged.
h = { ?, ?, ?, ? }
Hence, for the given data the final hypothesis would be:
Final hypothesis: h = { ?, ?, ?, ? }

THANKS

MACHINE LEARNING (ML-5)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura

AGENDA
Linear regression with one variable

QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 WHAT?
QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 110.25
HOW DO YOU FIND THAT?
You find the relation between X and Y:
Y = X·X = X²
Y = f(X)
Which one is the dependent variable? Answer: Y
So what is X? X = the independent variable
QUIZ
X Y
2 3
3 5
5 9
9 10
7 6.5
11 11.8
10.5 WHAT?
QUIZ
X Y
2 3
3 5
5 9
9 10
7 6.5
11 11.8
10.5 Is it difficult to find out
the relation ?
GRAPH IS SOLUTION?
(Scatter plot of the (X, Y) pairs on axes X: 0–12, Y: 0–14.)
APPROXIMATION
(Scatter plot of the points (2, 3), (3, 5), (5, 9), (7, 6.5), (9, 10), (11, 11.8), with a straight line approximating them.)
FIND THE EQUATION OF LINE?
Two points are given: (3, 5) and (9, 10).
Find the equation of the line.

First right answer = 1 chocolate (within 90 secs)

What will be the slope (m) and y-intercept (c)?

Y = m·X + c

FIND THE EQUATION OF LINE?
Two points are given: (3, 5) and (9, 10).
Y = m·X + c
Y = 0.83X + 2.5
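A quick check of that answer: the slope follows from the two given points, and the intercept from substituting one of them into Y = m·X + c.

```python
# Slope and intercept of the line through (3, 5) and (9, 10):
m = (10 - 5) / (9 - 3)   # slope = Δy / Δx = 5/6 ≈ 0.83
c = 5 - m * 3            # intercept from y = m·x + c at the point (3, 5)
print(round(m, 2), round(c, 2))  # 0.83 2.5
```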

DEFINITION
Finding the relation between a dependent variable and an independent variable is called Linear Regression.
OR
Finding the best-fit line between a dependent variable and an independent variable is called Linear Regression.

DEFINITION
Finding the relation between a dependent variable and an independent variable is called Linear Regression.

Now X=10.5 , X=2 , 13, 7


Y= what?
Put the values in equation Y=0.83X+2.5

What are you doing here? (USES OF LINEAR REGRESSION)

DEFINITION
Finding the relation between a dependent variable and an independent variable is called Linear Regression.
Now X=10.5 , X=12 , 13, 2.5
Y= what?
What are you doing here?
(USES OF LINEAR REGRESSION)
FORECASTING
PREDICTION

THE ERROR (RESIDUALS)
Error = given (actual) data − predicted data
e = Y(actual) − Y(predicted)
“Question”
“How can we find the best-fit line?”
“Answer”
If Y(actual) = Y(predicted), i.e. e = 0
Or
Minimise the error
HOW TO FIND BEST FIT LINE
Derivation of the linear regression equations (for a single variable):
given a set of n points (xᵢ, yᵢ) on a scatterplot,
find the best-fit line ŷ = a + b·x
such that the sum of squared errors in Y, ∑(yᵢ − ŷᵢ)², is minimized.

SSE method or least-squares method:
(Sum of Squared Errors method)
Find a (the y-intercept) and b (the slope of the line).
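Minimizing the SSE gives the standard closed-form solution b = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)² and a = ȳ − b·x̄. A sketch of that result (function names are ours), run on the second quiz's data:

```python
# Closed-form least-squares fit for y = a + b·x (a = intercept, b = slope).

def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

xs = [2, 3, 5, 9, 7, 11]
ys = [3, 5, 9, 10, 6.5, 11.8]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))  # close to the eyeballed line 0.83X + 2.5
```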

Linear regression with one variable
MODEL REPRESENTATION *

SRC : * Andrew NG
Housing Prices (Portland, OR)
(Scatter plot: Price in 1000s of dollars, 0–500, against Size in feet², 0–3000.)

Supervised Learning: given the “right answer” for each example in the data.
Regression Problem: predict real-valued output.

Training set of housing prices (Portland, OR):
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
Training Set → Learning Algorithm → h
Size of house → h → Estimated house price
How do we represent h?

Linear regression with one variable.
Univariate linear regression.
Linear regression with one variable
COST FUNCTION *

SRC : * Andrew NG
Training Set
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

Hypothesis: hθ(x) = θ₀ + θ₁x
θᵢ's: Parameters
How to choose the θᵢ's?
(Three plots of hθ(x) = θ₀ + θ₁x for different choices of θ₀ and θ₁.)

Idea: Choose θ₀, θ₁ so that hθ(x) is close to y for our training examples (x, y).
Linear regression with one variable
COST FUNCTION
INTUITION I *

SRC : * Andrew NG
Simplified
Hypothesis: hθ(x) = θ₁x
Parameters: θ₁
Cost Function: J(θ₁) = (1/2m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Goal: minimize J(θ₁)
(For fixed θ₁, hθ(x) is a function of x; J(θ₁) is a function of the parameter θ₁. Three slides plot hθ over the data for different values of θ₁, next to the corresponding points on the bowl-shaped J(θ₁) curve.)
Linear regression with one variable
COST FUNCTION
INTUITION II *

SRC : * Andrew NG
Hypothesis: hθ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost Function: J(θ₀, θ₁) = (1/2m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Goal: minimize J(θ₀, θ₁)
(For fixed θ₀, θ₁, hθ(x) is a function of x; J(θ₀, θ₁) is a function of the parameters. The slides show the housing-price scatter plot, Price ($) in 1000's against Size in feet², with candidate lines hθ(x), alongside the bowl-shaped surface of J(θ₀, θ₁) and its contour plots.)
Linear regression with one variable

GRADIENT DESCENT *

SRC : * Andrew NG
Have some function J(θ₀, θ₁)
Want to minimize J(θ₀, θ₁) over θ₀, θ₁

Outline:
• Start with some θ₀, θ₁
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum
J()




NEERAJ.GUPTA@GLA.AC.IN 48
J()




NEERAJ.GUPTA@GLA.AC.IN 49
Gradient descent algorithm
Repeat until convergence: θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ (for j = 0 and j = 1)

Correct: simultaneous update
temp0 := θ₀ − α · ∂J(θ₀, θ₁)/∂θ₀
temp1 := θ₁ − α · ∂J(θ₀, θ₁)/∂θ₁
θ₀ := temp0; θ₁ := temp1
Incorrect: updating θ₀ first and then using the new θ₀ to compute the update for θ₁
Linear regression with one variable
GRADIENT DESCENT
INTUITION*

SRC : * Andrew NG
Gradient descent algorithm

If α is too small, gradient descent
can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Linear regression with one variable
GRADIENT DESCENT FOR
LINEAR REGRESSION*

SRC : * Andrew NG
Gradient descent algorithm, applied to the linear regression model:
repeat { θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ } with hθ(x) = θ₀ + θ₁x and J(θ₀, θ₁) = (1/2m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Gradient descent algorithm
Repeat until convergence, updating θ₀ and θ₁ simultaneously:
θ₀ := θ₀ − α · (1/m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
θ₁ := θ₁ − α · (1/m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
J()




NEERAJ.GUPTA@GLA.AC.IN 59
J()




NEERAJ.GUPTA@GLA.AC.IN 60
NEERAJ.GUPTA@GLA.AC.IN 61
(A sequence of slides: for the current fixed θ₀, θ₁, the left panel shows the line hθ(x) over the data and the right panel the contour plot of J(θ₀, θ₁); each gradient-descent step moves the parameters closer to the minimum and the fitted line closer to the data.)
“Batch” Gradient Descent
“Batch”: each step of gradient descent uses all the training examples.
THANKS

MACHINE LEARNING (ML-6)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura

AGENDA
Linear regression with multiple variables
Gradient descent for multiple variables
Gradient descent in practice I: Feature Scaling

SRC : * Andrew NG
Multiple features (variables).

Size (feet2) Price ($1000)

2104 460
1416 232
1534 315
852 178
… …

Multiple features (variables).

Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …
Notation:
n = number of features
x⁽ⁱ⁾ = input (features) of the iᵗʰ training example
xⱼ⁽ⁱ⁾ = value of feature j in the iᵗʰ training example
Hypothesis:
Previously: hθ(x) = θ₀ + θ₁x
Now: hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ

For convenience of notation, define x₀ = 1; then hθ(x) = θᵀx.

Multivariate linear regression.
Linear Regression with multiple variables

GRADIENT DESCENT FOR MULTIPLE VARIABLES

SRC : * Andrew NG
Hypothesis: hθ(x) = θᵀx = θ₀x₀ + θ₁x₁ + ⋯ + θₙxₙ
Parameters: θ₀, θ₁, …, θₙ
Cost function: J(θ) = (1/2m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Gradient descent:
Repeat: θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ
(simultaneously update for every j = 0, …, n)
New algorithm (n ≥ 1):
Repeat: θⱼ := θⱼ − α · (1/m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾
(simultaneously update θⱼ for j = 0, …, n)

Previously (n = 1):
Repeat:
θ₀ := θ₀ − α · (1/m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
θ₁ := θ₁ − α · (1/m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
(simultaneously update θ₀, θ₁)
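The multivariate rule can be sketched the same way (a hedged toy implementation; names and data are ours). Prepending x₀ = 1 lets the intercept follow the same update as the other parameters:

```python
# Batch gradient descent with the update
# θ_j := θ_j − α·(1/m)·Σ (h(x) − y)·x_j, for all j simultaneously.

def gd_multi(X, y, alpha=0.1, iters=3000):
    m = len(X)
    X = [[1.0] + list(row) for row in X]          # define x0 = 1
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        errors = [sum(t * xj for t, xj in zip(theta, x)) - yi
                  for x, yi in zip(X, y)]
        theta = [t - alpha * sum(e * x[j] for e, x in zip(errors, X)) / m
                 for j, t in enumerate(theta)]    # simultaneous update
    return theta

# Two features; targets follow y = 1 + 2·x1 + 3·x2 exactly:
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
print([round(t, 2) for t in gd_multi(X, y)])  # ≈ [1.0, 2.0, 3.0]
```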

Linear Regression with multiple variables

GRADIENT DESCENT IN PRACTICE I: FEATURE SCALING

SRC : * Andrew NG
Feature Scaling
Idea: Make sure features are on a similar scale, e.g.
x₁ = size (0–2000 feet²)
x₂ = number of bedrooms (1–5)
(With unscaled features the contours of J are elongated and gradient descent zig-zags; after scaling they are more circular and it converges faster.)
Feature Scaling
Get every feature into approximately a −1 ≤ xᵢ ≤ 1 range.
Mean normalization
Replace xᵢ with xᵢ − μᵢ to make features have approximately zero mean (do not apply to x₀ = 1).
E.g. xᵢ := (xᵢ − μᵢ) / sᵢ, where sᵢ is the range (max − min) or the standard deviation of feature i.
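A minimal sketch of mean normalization (names are ours), using the range as the scale and the sizes from the training-set table:

```python
# Mean normalization: x := (x − μ) / s, with s = max − min here.

def mean_normalize(values):
    mu = sum(values) / len(values)
    s = max(values) - min(values)          # range as the scale
    return [(v - mu) / s for v in values]

sizes = [2104, 1416, 1534, 852]            # feet²
scaled = mean_normalize(sizes)
print([round(v, 3) for v in scaled])       # values lie in [-1, 1], mean ≈ 0
```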

Linear Regression with multiple variables

GRADIENT DESCENT IN PRACTICE II: LEARNING RATE

SRC : * Andrew NG
Gradient descent: θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ
- “Debugging”: how to make sure gradient descent is working correctly.
- How to choose the learning rate α.
Making sure gradient descent is working correctly.
(Plot J(θ) against the number of iterations, e.g. 0–400: the curve should decrease on every iteration and flatten out as gradient descent converges.)
Example automatic convergence test:
declare convergence if J(θ) decreases by less than some small threshold ε in one iteration.
Making sure gradient descent is working correctly.
(If the plot of J(θ) against the number of iterations is increasing, or repeatedly goes down and up, gradient descent is not working: use a smaller α.)
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; it may not converge.

To choose α, try a range of values spaced by roughly constant factors (e.g. 0.001, 0.01, 0.1, 1, …).
Linear Regression with multiple variables

FEATURES AND POLYNOMIAL REGRESSION

Housing prices prediction

Polynomial regression
(Price (y) against Size (x): the data can be fitted with polynomial hypotheses such as a quadratic or cubic in x, by treating x² and x³ as additional features.)
Choice of features
(Price (y) against Size (x): instead of polynomial terms, one can choose other features of the size, such as its square root, to get a curve that keeps rising.)
Linear Regression with multiple variables

NORMAL EQUATION

Gradient Descent: an iterative method.

Normal equation: a method to solve for θ analytically.
Intuition: if θ is 1-dimensional, J(θ) is a quadratic function of θ; set the derivative dJ(θ)/dθ = 0 and solve for θ.
In the general case, set each partial derivative ∂J(θ)/∂θⱼ = 0 (for every j) and solve for θ₀, θ₁, …, θₙ.
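For a single feature the normal equation θ = (XᵀX)⁻¹Xᵀy can be written out by hand, since XᵀX is only 2×2. A sketch (names are ours):

```python
# Normal equation for one feature, with X = [[1, x1], [1, x2], ...]:
# XᵀX = [[m, Σx], [Σx, Σx²]] and Xᵀy = [Σy, Σxy]; invert the 2×2 directly.

def normal_equation(xs, ys):
    m = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1

# Points on y = 2x + 1 are recovered exactly, with no iteration and no α:
print(normal_equation([0, 1, 2], [1, 3, 5]))  # (1.0, 2.0)
```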


Examples (m = 4):

x₀ | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 | 852 | 2 | 1 | 36 | 178

θ = (XᵀX)⁻¹Xᵀy, where X stacks the feature rows (with x₀ = 1) and y is the vector of prices.
Examples (m = 5, with one more training example):

x₀ | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 | 852 | 2 | 1 | 36 | 178
1 | 3000 | 4 | 1 | 38 | 540
THANKS

MACHINE LEARNING (ML-7)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Logistic regression (Classification)

SRC : * Andrew NG
Classification

Email: Spam / Not Spam?


Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign ?

0: “Negative Class” (e.g., benign tumor)


1: “Positive Class” (e.g., malignant tumor)
(Plot: Malignant? ((No) 0, (Yes) 1) against Tumor Size, with a line fitted to the data and a threshold on its output.)

Threshold classifier output hθ(x) at 0.5:
if hθ(x) ≥ 0.5, predict “y = 1”
if hθ(x) < 0.5, predict “y = 0”

Classification: y = 0 or 1, but linear regression's hθ(x) can be > 1 or < 0.

Logistic Regression: 0 ≤ hθ(x) ≤ 1
Logistic Regression
HYPOTHESIS
REPRESENTATION
Logistic Regression Model
Want 0 ≤ hθ(x) ≤ 1
hθ(x) = g(θᵀx), with the sigmoid (logistic) function g(z) = 1 / (1 + e⁻ᶻ)
(g(z) rises from 0 to 1, crossing 0.5 at z = 0.)
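A minimal sketch of the sigmoid (names are ours):

```python
# g(z) = 1 / (1 + e^(−z)) squashes any real z into (0, 1), so
# hθ(x) = g(θᵀx) can be read as a probability.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))                             # 0.5, the decision threshold
print(sigmoid(6) > 0.99, sigmoid(-6) < 0.01)  # saturates quickly either way
```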
Interpretation of Hypothesis Output
hθ(x) = estimated probability that y = 1 on input x

Example: if hθ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant.

hθ(x) = P(y = 1 | x; θ): “probability that y = 1, given x, parameterized by θ”
Logistic Regression
DECISION BOUNDARY
Logistic regression: hθ(x) = g(θᵀx), with g(z) the sigmoid.
Suppose we predict “y = 1” if hθ(x) ≥ 0.5, i.e. θᵀx ≥ 0,
and predict “y = 0” if hθ(x) < 0.5, i.e. θᵀx < 0.
Decision Boundary
(Plot in the (x₁, x₂) plane: the line θᵀx = 0 separates the region where we predict “y = 1”, where θᵀx ≥ 0, from the region where we predict “y = 0”.)
Non-linear decision boundaries
(With higher-order features such as x₁² and x₂², the boundary θᵀx = 0 can be a circle or a more complex curve in the (x₁, x₂) plane.)
Logistic Regression

COST FUNCTION
Training set: m examples {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}, with y ∈ {0, 1} and hθ(x) = 1 / (1 + e^(−θᵀx))

How to choose parameters θ?


Cost function
Linear regression's squared-error cost, with the sigmoid hθ(x) plugged in, is “non-convex” (many local optima); logistic regression instead uses a cost that is “convex”.
Logistic regression cost function
Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and −log(1 − hθ(x)) if y = 0

If y = 1: the cost is 0 at hθ(x) = 1 and grows without bound as hθ(x) → 0.
If y = 0: the cost is 0 at hθ(x) = 0 and grows without bound as hθ(x) → 1.
Logistic Regression
SIMPLIFIED COST FUNCTION AND
GRADIENT DESCENT
Logistic regression cost function (simplified):
J(θ) = −(1/m) ∑ᵢ [ y⁽ⁱ⁾ log hθ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ]

To fit parameters θ: minimize J(θ).
To make a prediction given a new x: output hθ(x) = 1 / (1 + e^(−θᵀx)).
Gradient Descent
Want min over θ of J(θ):
Repeat: θⱼ := θⱼ − α · (1/m) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾
(simultaneously update all θⱼ)

The algorithm looks identical to linear regression! (But hθ(x) is now the sigmoid of θᵀx.)
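A toy sketch of this for one feature (names and data are ours): relative to the linear-regression code, only the hypothesis changes, becoming the sigmoid of θ₀ + θ₁x.

```python
# Batch gradient descent for logistic regression with one feature.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_gd(xs, ys, alpha=0.1, iters=5000):
    m = len(xs)
    t0 = t1 = 0.0
    for _ in range(iters):
        errors = [sigmoid(t0 + t1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / m
        g1 = sum(e * x for e, x in zip(errors, xs)) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1   # simultaneous update
    return t0, t1

# Toy 1-D data, separable around x ≈ 3:
xs = [0, 1, 2, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
t0, t1 = logistic_gd(xs, ys)
preds = [1 if sigmoid(t0 + t1 * x) >= 0.5 else 0 for x in xs]
print(preds)  # [0, 0, 0, 1, 1, 1]
```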


Logistic Regression
MULTI-CLASS CLASSIFICATION:
ONE-VS-ALL
Multiclass classification

Email foldering/tagging: Work, Friends, Family, Hobby

Medical diagnosis: Not ill, Cold, Flu

Weather: Sunny, Cloudy, Rain, Snow


Binary classification: a single (x₁, x₂) plot with two classes.
Multi-class classification: an (x₁, x₂) plot with three or more classes.
One-vs-all (one-vs-rest):
Turn the 3-class problem into three binary problems, training one classifier per class:
Class 1 vs. rest: hθ⁽¹⁾(x)
Class 2 vs. rest: hθ⁽²⁾(x)
Class 3 vs. rest: hθ⁽³⁾(x)
One-vs-all
Train a logistic regression classifier hθ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes hθ⁽ⁱ⁾(x).
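A sketch of the prediction step (the per-class scorers here are hard-coded toy functions standing in for trained logistic regression classifiers; all names and values are ours):

```python
# One-vs-all prediction: evaluate every per-class classifier on x and
# return the class whose classifier reports the highest probability.

def one_vs_all_predict(classifiers, x):
    # classifiers: {class_label: function x -> estimated P(y = label | x)}
    return max(classifiers, key=lambda c: classifiers[c](x))

classifiers = {
    "cold": lambda x: 0.8 if x < 10 else 0.1,        # hypothetical scorers
    "mild": lambda x: 0.7 if 10 <= x < 25 else 0.2,
    "hot":  lambda x: 0.9 if x >= 25 else 0.05,
}
print(one_vs_all_predict(classifiers, 30))  # hot
print(one_vs_all_predict(classifiers, 15))  # mild
```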
THANKS

MACHINE LEARNING (ML-8)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Machine Learning: Training, Testing, Evaluation

SRC : * Andrew NG
EVALUATING THE HYPOTHESIS
A hypothesis can fit the training set well and yet fail to generalize to new examples not in the training set.
TRAINING/TESTING PROCEDURE FOR LINEAR REGRESSION
Learn parameter θ from the training data (minimizing the training error J(θ)).
Compute the test set error:
J_test(θ) = (1/2m_test) ∑ᵢ (hθ(x_test⁽ⁱ⁾) − y_test⁽ⁱ⁾)²
TRAINING/TESTING PROCEDURE FOR LOGISTIC REGRESSION
Learn parameter θ from the training data (minimizing the training error J(θ)).
Compute the test set error J_test(θ) (the log loss over the test set).
Misclassification error (0/1 misclassification error):
err(hθ(x), y) = 1 if the thresholded prediction disagrees with y, else 0
Test error = (1/m_test) ∑ᵢ err(hθ(x_test⁽ⁱ⁾), y_test⁽ⁱ⁾)
PYTHON IMPLEMENTATION
(Three slides of code screenshots walking through the train/test procedure.)
EVALUATION METRICS*
Training objective (cost function) is only a proxy for real world objectives.
Metrics are useful and important for evaluation.
Metrics help capture a business goal into a quantitative target.
Helps organize ML team effort towards that target. Generally in the form of
improving that metric on the dev set.

SRC : *Yining Chen Slides


EVALUATION METRICS
Useful to quantify the “gap” between:
Desired performance and baseline (estimate effort initially).
Desired performance and current performance.
Measure progress over time.

Useful for lower level tasks and debugging (e.g. diagnosing bias vs variance).

BINARY CLASSIFICATION
x is input
y is binary output (0/1)
Model is ŷ= h(x)
Two types of models
Models that output a categorical class directly (K-nearest neighbors, Decision tree)
Models that output a real-valued score (SVM, Logistic Regression)
The score could be a margin (SVM) or a probability
Need to pick a threshold
We focus on this type (the other type can be interpreted as an instance of it)

SCORE BASED MODELS

THRESHOLD → CLASSIFIER → POINT METRICS

POINT METRICS: CONFUSION MATRIX

POINT METRICS: TRUE POSITIVES

POINT METRICS: TRUE NEGATIVES

POINT METRICS: FALSE POSITIVES

POINT METRICS: FALSE NEGATIVES

FP AND FN ALSO CALLED TYPE-1 AND TYPE-2 ERRORS

POINT METRICS: ACCURACY

POINT METRICS: PRECISION

POINT METRICS: POSITIVE RECALL (SENSITIVITY)

POINT METRICS: NEGATIVE RECALL (SPECIFICITY)

POINT METRICS: F1-SCORE
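The point metrics above can be computed from the confusion-matrix counts in a few lines (a sketch; names and the toy labels are ours):

```python
# Accuracy = (TP+TN)/all, Precision = TP/(TP+FP),
# Recall (sensitivity) = TP/(TP+FN), F1 = harmonic mean of the last two.

def point_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # 3 TP, 1 FN, 1 FP, 5 TN
print(point_metrics(y_true, y_pred))      # (0.8, 0.75, 0.75, 0.75)
```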
POINT METRICS: CHANGING THRESHOLD

SUMMARY METRICS: PRC (RECALL VS. PRECISION)
ROC (RECEIVER OPERATING CHARACTERISTICS)
•The ROC curve is a performance measurement for classification problems at various threshold settings.
•ROC is a probability curve, and AUC represents the degree or measure of separability.
•It tells how capable the model is of distinguishing between classes.
•The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
•By analogy: the higher the AUC, the better the model is at distinguishing between patients with the disease and those without.

ROC CURVE
•The ROC curve is plotted as TPR against FPR, with TPR on the y-axis and FPR on the x-axis.

DEFINING TERMS USED IN AUC AND ROC CURVE
TPR (True Positive Rate) / Recall / Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
FPR (False Positive Rate) = FP / (FP + TN) = 1 − Specificity
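A sketch of computing one ROC point (TPR, FPR) at a given threshold over real-valued scores (names and data are ours); sweeping the threshold from high to low traces the full curve:

```python
# TPR = TP/(TP+FN) and FPR = FP/(FP+TN) for predictions at one threshold.

def tpr_fpr(y_true, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

y = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(tpr_fpr(y, scores, 0.5))  # (2/3, 1/3): one ROC point
print(tpr_fpr(y, scores, 0.0))  # (1.0, 1.0): everything predicted positive
```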
HOW TO SPECULATE THE PERFORMANCE OF THE MODEL?
•An excellent model has AUC near 1, which means it has a good measure of separability.
•A poor model has AUC near 0, which means it has the worst measure of separability. In fact, it is reciprocating the result: predicting 0s as 1s and 1s as 0s.
•When AUC is 0.5, the model has no class-separation capacity whatsoever.
INTERPRETATION OF ROC CURVE
As we know, ROC is a curve of probability, so let's plot the distributions of those probabilities:
the red distribution curve is for the positive class (patients with the disease) and the green distribution curve is for the negative class (patients without the disease).
COMPARING ROC CURVES
RELATION BETWEEN SENSITIVITY, SPECIFICITY,
FPR AND THRESHOLD
•Sensitivity and Specificity are inversely proportional to each other: when we increase Sensitivity, Specificity decreases, and vice versa.
•When we decrease the threshold, we get more positive values, which increases the sensitivity and decreases the specificity.
•Similarly, when we increase the threshold, we get more negative values, giving higher specificity and lower sensitivity.

Threshold down: Sensitivity⬆, Specificity⬇; threshold up: Sensitivity⬇, Specificity⬆
EXAMPLE: ROC

THANKS

MACHINE LEARNING (ML-9)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura

AGENDA
Bias Vs Variance

(Slides of figures illustrating bias vs. variance.)
TRADE-OFF (BIAS VS VARIANCE)

THANKS

MACHINE LEARNING (ML-10)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
K Nearest Neighbor

K-NEAREST-NEIGHBORS ALGORITHM
K nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (distance function)

KNN has been used in statistical estimation and pattern recognition since the 1970s.
K-NEAREST-NEIGHBORS ALGORITHM
A case is classified by a majority vote of its neighbors, the case being assigned to the class most common among its K nearest neighbors as measured by a distance function.

If K = 1, the case is simply assigned to the class of its nearest neighbor.
DISTANCE FUNCTION MEASUREMENTS
HAMMING DISTANCE
For categorical variables, the Hamming distance can be used.
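A minimal sketch (names are ours): the Hamming distance counts the positions at which two instances disagree.

```python
def hamming(a, b):
    # Number of attribute positions where the two instances differ.
    return sum(1 for x, y in zip(a, b) if x != y)

print(hamming(("Sunny", "Warm", "High"), ("Sunny", "Cold", "Normal")))  # 2
```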
K-NEAREST-NEIGHBORS
WHAT IS THE MOST POSSIBLE LABEL FOR c?
(Scatter plot: points labelled a and o, plus an unlabelled query point c.)
Solution: look for the K nearest neighbors of c and take the majority label as c's label.
Let's suppose k = 3:
the 3 nearest points to c are a, a and o.
Therefore, the most likely label for c is a.
PSEUDO CODE OF KNN
1. Load the data
2. Initialise the value of k
3. For getting the predicted class, iterate from 1 to the total number of training data points:
   a. Calculate the distance between the test data and each row of the training data. Here we use Euclidean distance, since it is the most popular method.
   b. Sort the calculated distances in ascending order by distance value
   c. Take the top k rows from the sorted array
   d. Find the most frequent class of these rows
   e. Return the predicted class
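The pseudo code above can be sketched as a toy implementation for numeric features (names and data are ours):

```python
# KNN with Euclidean distance and a majority vote over the k nearest
# training points.
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of ((features...), label) pairs
    dists = sorted((math.dist(x, query), label) for x, label in train)
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]   # most frequent class

train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((6, 6), "o"), ((7, 7), "o"), ((6, 7), "o")]
print(knn_predict(train, (2, 2), k=3))  # a
print(knn_predict(train, (6, 5), k=3))  # o
```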

REMARKS
CHOOSING THE MOST SUITABLE K
NORMALIZATION
K-NEAREST NEIGHBOR CLASSIFICATION (KNN)

Unlike the previous learning methods, kNN does not build a model from the training data.
To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d.
Count the number n of training instances in P that belong to class cj.
Estimate Pr(cj | d) as n/k.
No training is needed. Classification time is linear in the training set size for each test case.
DISCUSSIONS
kNN can deal with complex and arbitrary decision boundaries.
Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong, in many cases as accurate as that of more elaborate methods.
kNN is slow at classification time.
kNN does not produce an understandable model.
EXERCISE

NEERAJ.GUPTA@GLA.AC.IN 21
EXERCISE

NEERAJ.GUPTA@GLA.AC.IN 22
EXERCISE
Suppose you are given the following data, where x and y are the two input variables
and Class is the dependent variable.

Q1. Suppose you want to predict the class of the new data point x = 1, y = 1 using
Euclidean distance in 3-NN. To which class does this data point belong?
NEERAJ.GUPTA@GLA.AC.IN 23
EXERCISE
Q2. In the previous question, suppose you now want to use 7-NN instead of 3-NN. To
which class will x = 1, y = 1 belong?

Q3. In the previous question, suppose you now want to use 5-NN instead of 3-NN. To
which class will x = 1, y = 1 belong?

NEERAJ.GUPTA@GLA.AC.IN 24
THANKS

NEERAJ.GUPTA@GLA.AC.IN 25
MACHINE LEARNING (ML-11)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Naïve Bayes Classifier

NEERAJ.GUPTA@GLA.AC.IN 2
WHAT IS NAIVE BAYES ALGORITHM?
• It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors.

• A Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.

• Naive Bayes model is easy to build and particularly useful for very large data sets.

• Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.

NEERAJ.GUPTA@GLA.AC.IN 3
PREREQUISITES FOR BAYES’ THEOREM
What is an Experiment?
“An experiment is a planned operation carried out under controlled conditions.”
Tossing a coin, rolling a die, and drawing a card out of a well-shuffled pack of cards are all
examples of experiments.

NEERAJ.GUPTA@GLA.AC.IN 4
SAMPLE SPACE
The result of an experiment is called an outcome. The set of all possible outcomes of
an experiment is called the sample space.
For example, if our experiment is throwing a die and recording its outcome, the sample
space will be:
S1 = {1, 2, 3, 4, 5, 6}
What will be the sample space when we're tossing a coin?
S2 = {H, T}

NEERAJ.GUPTA@GLA.AC.IN 5
EVENT
An event is a set of outcomes (i.e. a subset of the sample space) of an experiment.
Let's get back to the experiment of rolling a die and define events E and F as:
E = An even number is obtained = {2, 4, 6}
F = A number greater than 3 is obtained = {4, 5, 6}
The probability of these events:

P(E) = Number of favorable outcomes / Total number of possible outcomes = 3/6 = 0.5
P(F) = 3/6 = 0.5
NEERAJ.GUPTA@GLA.AC.IN 6
RANDOM VARIABLE
A Random Variable is exactly what it sounds like – a variable taking on random values
with each value having some probability (which can be zero).
It is a real-valued function defined on the sample space of an experiment:

NEERAJ.GUPTA@GLA.AC.IN 7
RANDOM VARIABLE
Let’s take a simple example (refer to the above image as we go along). Define a
random variable X on the sample space of the experiment of tossing a coin. It takes a
value +1 if “Heads” is obtained and -1 if “Tails” is obtained. Then, X takes on values
+1 and -1 with equal probability of 1/2.

Consider that Y is the observed temperature (in Celsius) of a given place on a given
day. Then Y is a continuous random variable defined on the sample space
S = [0, 100] (taking the Celsius scale here to run from zero degrees Celsius to 100
degrees Celsius).

NEERAJ.GUPTA@GLA.AC.IN 8
EXHAUSTIVE EVENTS
A set of events is said to be exhaustive if at least one of the events must occur at
any time. Thus, two events A and B are said to be exhaustive if A ∪ B = S, the
sample space.

For example, let’s say that A is the event that a card drawn out of a pack is red and
B is the event that the card drawn is black. Here, A and B are exhaustive because
the sample space S = {red, black}. Pretty straightforward stuff, right?

NEERAJ.GUPTA@GLA.AC.IN 9
INDEPENDENT EVENTS
If the occurrence of one event does not have any effect on the occurrence of
another, then the two events are said to be independent. Mathematically, two events
A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A)*P(B)

For example, if A is obtaining a 5 on throwing a die and B is drawing a king of
hearts from a well-shuffled pack of cards, then A and B are independent just by
their definition. It's usually not as easy to identify independent events, hence we use
the formula mentioned above.

NEERAJ.GUPTA@GLA.AC.IN 10
CONDITIONAL PROBABILITY
Consider that we’re drawing a card from a given deck.
What is the probability that it is a black card?
That’s easy – 1/2, right?
However, what if we know it was a black card – then what would be the probability
that it was a king?
This is where the concept of conditional probability comes into play.
Conditional probability is defined as the probability of an event A, given that
another event B has already occurred (i.e. A conditional B). This is represented by
P(A|B) and we can define it as:
P(A|B) = P(A ∩ B) / P(B)
NEERAJ.GUPTA@GLA.AC.IN 11
CONDITIONAL PROBABILITY
Let event A represent picking a king, and event B, picking a black card. Then, we find
P(A|B) using the above formula:

P(A ∩ B) = P(Obtaining a black card which is a King) = 2/52


P(B) = P(Picking a black card) = 1/2
Thus, P(A|B) = 4/52 = 1/13. Try this out on an example of your choice.

NEERAJ.GUPTA@GLA.AC.IN 12
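The card computation above can be verified with exact arithmetic (a quick sketch using Python's standard fractions module):

```python
from fractions import Fraction

# P(A ∩ B): drawing a black king (2 of the 52 cards)
p_a_and_b = Fraction(2, 52)
# P(B): drawing a black card
p_b = Fraction(1, 2)

# Conditional probability P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 1/13, i.e. 4/52
```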
WHAT IS BAYES’ THEOREM?

NEERAJ.GUPTA@GLA.AC.IN 13
WHAT IS BAYES’ THEOREM?
“Have you ever seen the popular TV show ‘Sherlock’
(or any crime thriller show)? Think about it – our
beliefs about the culprit change throughout the
episode. We process new evidence and refine our
hypothesis at each step.

This is Bayes’ Theorem in real life!”


NEERAJ.GUPTA@GLA.AC.IN 14
BAYES’ THEOREM
Now, let’s understand this mathematically. Consider that A and B are any two
events from a sample space S where P(B) ≠ 0. Using our understanding of
conditional probability, we have:

P(A|B) = P(A ∩ B) / P(B)


Similarly, P(B|A) = P(A ∩ B) / P(A)
It follows that P(A ∩ B) = P(A|B) * P(B) = P(B|A) * P(A)
Thus, P(A|B) = P(B|A)*P(A) / P(B)

Here, P(A) and P(B) are probabilities of observing A and B independently of
each other. P(B|A) and P(A|B) are conditional probabilities.
BAYES’ THEOREM
P(A) is called Prior probability and P(B) is called Evidence.
P(B|A) is called Likelihood and P(A|B) is called Posterior probability.

posterior = likelihood * prior / evidence

NEERAJ.GUPTA@GLA.AC.IN 16
AN ILLUSTRATION OF BAYES’ THEOREM
Let’s solve a problem using Bayes’ Theorem. This will help you understand and
visualize where you can apply it.
There are 3 boxes labeled A, B, and C:
Box A contains 2 red and 3 black balls
Box B contains 3 red and 1 black ball
And box C contains 1 red ball and 4 black balls

The three boxes are identical and have an equal probability of getting picked.
Consider that a red ball is chosen. Then what is the probability that this red ball was
picked out of box A?
NEERAJ.GUPTA@GLA.AC.IN 17
CONTD…
We have prior probabilities P(A) = P(B) = P(C) = 1/3, since all boxes have an equal
probability of getting picked.
Let E be the event that a red ball is drawn. Then:

P(E|A) = Number of red balls in box A / Total number of balls in box A = 2/5
Similarly, P(E|B) = 3/4 and P(E|C) = 1/5

Then the evidence P(E) = P(E|A)*P(A) + P(E|B)*P(B) + P(E|C)*P(C)
= (2/5)*(1/3) + (3/4)*(1/3) + (1/5)*(1/3) = 0.45
Therefore, P(A|E) = P(E|A)*P(A) / P(E) = (2/5)*(1/3) / 0.45 ≈ 0.296
NEERAJ.GUPTA@GLA.AC.IN 18
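The box calculation can be checked end to end in a few lines (a sketch using exact fractions; the box contents are the ones given above):

```python
from fractions import Fraction

# Priors: the three boxes are equally likely to be picked
priors = {'A': Fraction(1, 3), 'B': Fraction(1, 3), 'C': Fraction(1, 3)}
# Likelihoods P(red | box), from the ball counts in each box
likelihoods = {'A': Fraction(2, 5), 'B': Fraction(3, 4), 'C': Fraction(1, 5)}

# Evidence P(red) = sum over boxes of P(red | box) * P(box)
evidence = sum(likelihoods[b] * priors[b] for b in priors)
# Posterior P(A | red) by Bayes' theorem
posterior_a = likelihoods['A'] * priors['A'] / evidence
print(float(evidence), float(posterior_a))  # 0.45 and ~0.296
```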
APPLICATIONS OF BAYES’ THEOREM
The three main applications of Bayes’ Theorem:
Naive Bayes’ Classifiers
Discriminant Functions and Decision Surfaces
Bayesian Parameter Estimation

NEERAJ.GUPTA@GLA.AC.IN 19
NAIVE BAYES’ CLASSIFIERS
Naive Bayes’ Classifiers are a set of probabilistic classifiers based on
the Bayes’ Theorem. The underlying assumption of these classifiers is
that all the features used for classification are independent of each
other.
That’s where the name ‘naive’ comes in since it is rare that we obtain a
set of totally independent features.
The way these classifiers work is exactly how we solved in the
illustration, just with a lot more features assumed to be independent of
each other.

NEERAJ.GUPTA@GLA.AC.IN 20
NAIVE BAYES’ CLASSIFIERS
Here, we need to find the probability P(Y|X) where X is an n-dimensional random
variable whose component random variables X_1, X_2, …., X_n are independent of
each other:

Finally, the Y for which P(Y|X) is maximum is our predicted class.


NEERAJ.GUPTA@GLA.AC.IN 21
WORKING OF NAÏVE BAYES' CLASSIFIER
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular
day according to the weather conditions. To solve this problem, we follow the steps
below:
Convert the given dataset into frequency tables.
Generate Likelihood table by finding the probabilities of given features.
Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?

NEERAJ.GUPTA@GLA.AC.IN 22
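The three steps can be sketched in code. The weather records below are hypothetical (the slide's actual table is not reproduced here), but the mechanics follow the steps above: frequency tables, then likelihoods, then a posterior comparison:

```python
from collections import Counter

# Hypothetical training records: (Outlook, Play)
data = [('Sunny', 'No'), ('Sunny', 'No'), ('Sunny', 'Yes'),
        ('Overcast', 'Yes'), ('Rainy', 'Yes'), ('Rainy', 'No'),
        ('Overcast', 'Yes'), ('Sunny', 'Yes'), ('Rainy', 'Yes'),
        ('Sunny', 'Yes')]

def posterior(outlook, label):
    """Score P(label | outlook) ∝ P(outlook | label) * P(label).

    Step 1: frequency tables; step 2: likelihood; step 3: Bayes theorem.
    The common evidence P(outlook) cancels when comparing labels.
    """
    labels = Counter(play for _, play in data)    # frequency of Play
    joint = Counter(data)                         # frequency of (Outlook, Play)
    prior = labels[label] / len(data)
    likelihood = joint[(outlook, label)] / labels[label]
    return likelihood * prior

scores = {lbl: posterior('Sunny', lbl) for lbl in ('Yes', 'No')}
print(max(scores, key=scores.get))  # the predicted class for a sunny day
```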
ADVANTAGES & DISADVANTAGES
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for binary as well as multi-class classification.
• It performs well in multi-class predictions compared to many other algorithms.
• It is a popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.

NEERAJ.GUPTA@GLA.AC.IN 25
APPLICATIONS OF NAÏVE BAYES CLASSIFIER:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.

NEERAJ.GUPTA@GLA.AC.IN 26
TYPES OF NAÏVE BAYES MODEL
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.

NEERAJ.GUPTA@GLA.AC.IN 27
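Concretely, the Gaussian model scores a continuous feature value with the normal density; a minimal sketch (the mean and standard deviation below are hypothetical per-class statistics):

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density N(x; mean, std), used by Gaussian Naive Bayes as
    the likelihood P(x | class) for a continuous feature."""
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Hypothetical per-class statistics of a continuous feature (e.g. height in cm)
print(gaussian_pdf(170.0, mean=170.0, std=10.0))  # at the mean: the densest point
print(gaussian_pdf(200.0, mean=170.0, std=10.0))  # far in the tail: much smaller
```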
TYPES OF NAÏVE BAYES MODEL
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular
document belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.

NEERAJ.GUPTA@GLA.AC.IN 28
TYPES OF NAÏVE BAYES MODEL
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean variables, such as whether a particular word is present or
not in a document. This model is also well known for document classification tasks.
BernoulliNB implements the naive Bayes training and classification algorithms for data that is
distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features
but each one is assumed to be a binary-valued (Bernoulli, boolean) variable.

NEERAJ.GUPTA@GLA.AC.IN 29
THANKS

NEERAJ.GUPTA@GLA.AC.IN 30
MACHINE LEARNING (ML-12)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Decision Tree

NEERAJ.GUPTA@GLA.AC.IN 2
DECISION TREE

A decision tree is a graphical representation


of all the possible solutions to a decision
based on certain conditions.

NEERAJ.GUPTA@GLA.AC.IN 3
DECISION TREE

NEERAJ.GUPTA@GLA.AC.IN 4
ILLUSTRATING CLASSIFICATION TASK
Training set (Tid, Attrib1, Attrib2, Attrib3, Class):
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes

Learn a model from the training set, then apply it to the test set, whose Class labels are unknown:
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?

NEERAJ.GUPTA@GLA.AC.IN 5
EXAMPLES OF CLASSIFICATION TASK
Predicting tumor cells as benign or malignant

Classifying credit card transactions


as legitimate or fraudulent

Classifying secondary structures of protein


as alpha-helix, beta-sheet, or random
coil

Categorizing news stories as finance,


weather, entertainment, sports, etc

NEERAJ.GUPTA@GLA.AC.IN 6
EXAMPLE OF A DECISION TREE
Training Data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

Model (decision tree); the splitting attributes are Refund, MarSt, and TaxInc:
Refund? Yes -> NO; No -> MarSt
MarSt? Married -> NO; Single/Divorced -> TaxInc
TaxInc? < 80K -> NO; >= 80K -> YES
NEERAJ.GUPTA@GLA.AC.IN 7
ANOTHER EXAMPLE OF DECISION TREE
The same training data also fits a different tree:
MarSt? Married -> NO; Single/Divorced -> Refund
Refund? Yes -> NO; No -> TaxInc
TaxInc? < 80K -> NO; >= 80K -> YES

There could be more than one tree that fits the same data!
NEERAJ.GUPTA@GLA.AC.IN 8
DECISION TREE CLASSIFICATION TASK
As in the general classification task above: learn a decision tree from the labeled
training records (Tids 1-10), then apply the tree to assign a class to each unlabeled
test record (Tids 11-15).
NEERAJ.GUPTA@GLA.AC.IN 9
APPLY MODEL TO TEST DATA
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the test record at each node:
1. Refund? The record has Refund = No, so take the "No" branch to the MarSt node.
2. MarSt? The record has Marital Status = Married, so take the "Married" branch, which leads to the leaf NO.

Assign Cheat to "No".

NEERAJ.GUPTA@GLA.AC.IN 15
DECISION TREE INDUCTION
Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT

NEERAJ.GUPTA@GLA.AC.IN 17
TREE INDUCTION
Greedy strategy.
 Split the records based on an attribute test that optimizes certain criterion.

Issues
 Determine how to split the records
 How to specify the attribute test condition?
 How to determine the best split?
 Determine when to stop splitting

NEERAJ.GUPTA@GLA.AC.IN 18
HOW TO SPECIFY TEST CONDITION?
Depends on attribute types
 Nominal
 Ordinal
 Continuous

Depends on number of ways to split


 2-way split
 Multi-way split

NEERAJ.GUPTA@GLA.AC.IN 20
SPLITTING BASED ON NOMINAL ATTRIBUTES
Multi-way split: Use as many partitions as distinct values.

CarType
Family Luxury
Sports

Binary split: Divides values into two subsets; need to find the optimal partitioning.

CarType: {Sports, Luxury} vs {Family}  OR  {Family, Luxury} vs {Sports}

NEERAJ.GUPTA@GLA.AC.IN 21
SPLITTING BASED ON ORDINAL ATTRIBUTES
Multi-way split: Use as many partitions as distinct values.

Size
Small Large
Medium

Binary split: Divides values into two subsets; need to find the optimal partitioning.

Size: {Small, Medium} vs {Large}  OR  {Small} vs {Medium, Large}

What about {Small, Large} vs {Medium}? Such a split does not preserve the order of
an ordinal attribute.

NEERAJ.GUPTA@GLA.AC.IN 22
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

Different ways of handling


 Discretization to form an ordinal categorical attribute
 Static – discretize once at the beginning
 Dynamic – ranges can be found by equal interval bucketing, equal
frequency bucketing (percentiles), or clustering.

 Binary Decision: (A < v) or (A ≥ v)
 consider all possible splits and find the best cut
 can be more compute intensive

NEERAJ.GUPTA@GLA.AC.IN 23
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

NEERAJ.GUPTA@GLA.AC.IN 24
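The "consider all possible splits and find the best cut" step can be sketched as follows, using the Gini index (introduced later in this lecture) as the criterion. The income values and labels mirror the running Cheat example, but this snippet is only an illustration, not the lecture's reference implementation:

```python
def gini(labels):
    """Gini impurity 1 - sum_j p_j^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_split(values, labels):
    """Try every midpoint between consecutive sorted values as the cut v
    for the test (A < v) vs (A >= v); return (best_v, weighted_gini)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float('inf'))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut possible between identical values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (v, score)
    return best

# Taxable-income values with Cheat labels, as in the running example
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No']
print(best_binary_split(incomes, cheat))  # the cut at 97.5 gives the lowest Gini
```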
HOW TO DETERMINE THE BEST SPLIT
Before Splitting: 10 records of class 0,
10 records of class 1

Which test condition is the best?

NEERAJ.GUPTA@GLA.AC.IN 26
HOW TO DETERMINE THE BEST SPLIT
Greedy approach:
 Nodes with a homogeneous class distribution are preferred
Need a measure of node impurity:
 Non-homogeneous node: high degree of impurity
 Homogeneous node: low degree of impurity

NEERAJ.GUPTA@GLA.AC.IN 27
MEASURES OF NODE IMPURITY
Entropy

Gini Index

Misclassification error

NEERAJ.GUPTA@GLA.AC.IN 28
ENTROPY
Entropy for a given node t:

Entropy(t) = - Σj p(j|t) log2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

 Maximum (log2 nc) when records are equally distributed among all classes, implying least information
 Minimum (0.0) when all records belong to one class, implying most information

https://en.wikipedia.org/wiki/Entropy_(information_theory)
NEERAJ.GUPTA@GLA.AC.IN 32
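A direct transcription of the entropy measure (a minimal sketch; the argument holds the per-class record counts at a node):

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log(0) taken as 0.

    `counts` holds the number of records of each class at node t.
    """
    n = sum(counts)
    probs = (c / n for c in counts if c > 0)
    return sum(-p * math.log2(p) for p in probs)

print(entropy([5, 5]))   # evenly split two-class node: maximum impurity, 1.0
print(entropy([10, 0]))  # pure node: 0.0
print(round(entropy([1, 5]), 3))
```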
HOW TO FIND THE BEST SPLIT
Before splitting, the parent node holds N00 records of class C0 and N01 records of class C1, with impurity M0.
Splitting on attribute A yields nodes N1 and N2, with impurities M1 and M2 and weighted impurity M12.
Splitting on attribute B yields nodes N3 and N4, with impurities M3 and M4 and weighted impurity M34.
Compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain.

NEERAJ.GUPTA@GLA.AC.IN 33
MEASURE OF IMPURITY: GINI
Gini Index for a given node t:

GINI(t) = 1 - Σj [p(j|t)]²

(NOTE: p(j|t) is the relative frequency of class j at node t.)

 Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least
interesting information
 Minimum (0.0) when all records belong to one class, implying most interesting information

C1=0, C2=6: Gini = 0.000
C1=1, C2=5: Gini = 0.278
C1=2, C2=4: Gini = 0.444
C1=3, C2=3: Gini = 0.500
NEERAJ.GUPTA@GLA.AC.IN 34
EXAMPLES FOR COMPUTING GINI
GINI(t) = 1 - Σj [p(j|t)]²

C1=0, C2=6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

C1=1, C2=5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)² - (5/6)² = 0.278

C1=2, C2=4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)² - (4/6)² = 0.444
NEERAJ.GUPTA@GLA.AC.IN 35
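The same computations in code (a minimal sketch of the Gini formula above):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j [p(j|t)]^2 for the class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The three nodes computed above
print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
```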
THANKS

NEERAJ.GUPTA@GLA.AC.IN 69
MACHINE LEARNING (ML-13)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Data preprocessing

NEERAJ.GUPTA@GLA.AC.IN 2
What is Data?
Data is a collection of numbers assigned as values to quantitative
variables and/or characters assigned as values to qualitative
variables; equivalently, a collection of records and their attributes.

An attribute is a characteristic of an object
 Examples: colour, temperature, etc.
 An attribute is also known as a variable, feature,
characteristic, field, etc.

 A collection of attributes describes an object
 An object is also known as a record, point, case, sample,
entity, or instance
Types of Attributes
 Nominal
Used to assign individual cases to categories
Example: eye colour, ID number, Zip code, etc

 Ordinal
Used to rank-order cases
Example: rankings (e.g. a movie on a scale of 1-10), height (tall, medium, short), grades

 Interval
 Example: Calendar dates, longitude, latitude
 Ratio
Same as interval variables, but with a "true zero"
Example: time, length, population, age

4
Properties of Attribute values
 The type of an attribute depends on which of the following properties it
possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /

Nominal: distinctness
Ordinal: distinctness, order

Interval: distinctness, order, addition

Ratio: all 4 properties
5
Discrete and Continuous Attributes
 Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of

documents
Often represented as integer variables.

Note: Binary attributes are special cases of discrete attributes

 Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.

Practically, real values can only be measured and represented using a finite number of digits.

Continuous attributes are typically represented as floating-point variables

6
Type of data sets
 Record Data
Data Matrix
Transaction data

 Graph Data
World wide web
Molecular structure

 Ordered
Spatial data
Temporal data

Sequential data

Genetic sequence data

7
Record Data
 Data that consists of a collection of records, each of which consists of fixed set
of attributes

8
Data Matrix
 If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multidimensional space, where each
dimension represents a distinct attribute

 Such data set can be represented by an m by n matrix, where there are m rows,
one for each object, and n columns, one for each attribute

9
Data Matrix Example for Documents
 Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the

document.

10
Transaction Data
 A typical type of record data, then
 Each record (transaction) involves a set of items

11
Graph data
Example: Facebook graph and HTML links

12
Ordered data
 Genetic sequence data

13
Data Quality
What kinds of data quality problems are there?
How can we detect problems with the data?

What can we do about these problems?

Examples of data quality problems:

Missing values
Noise and outliers

Duplicate data

14
Data Quality: Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values


Eliminate Data Objects
Estimate Missing Values

Ignore the Missing Value During Analysis

Replace with all possible values (weighted by their probabilities)

15
Data Quality: Noise
 Noise refers to modification of original values
 Examples: distortion of a person’s voice when talking on

16
Data Quality: Outliers
 Outliers are data objects with characteristics that are considerably different
than most of the other data objects in the data set

17
Data Quality: Duplicate Data
 Data set may include data objects that are duplicates, or almost duplicates of one
another
 Major issue when merging data from heterogenous sources

 Examples:
 Same person with multiple email addresses

 Data cleaning
 Process of dealing with duplicate data issues

18
Data Preprocessing
Imputation
Outlier management

One hot encoding

Feature selection

Filter and Wrapper approach

19
Imputation (filling in) of missing data
 Imputation is performed using a number of different algorithms, which can be
subdivided into single and multiple imputation methods.

 Single imputation methods


 a missing value is imputed by a single value

 Multiple-imputation methods
 several likelihood-ordered choices for imputing the missing value are computed and one
"best" value is selected.

20
Imputation Contd…
 Single imputation
Mean imputation
Hot deck imputation

 Multiple imputation

21
Single imputation Contd…

Mean imputation
Mean imputation, also called unconditional mean
imputation, is a widely used imputation method.
Mean imputation assumes that the mean of a
variable is the best estimate for any
case that has missing information on this variable.
For a continuous variable, each missing value is
imputed with the mean of the known values of the
same variable.
For a categorical variable, missing values are
imputed with the mode of the observed values of
the same variable.
22
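A minimal sketch of mean/mode imputation (the column values are hypothetical; None marks a missing entry):

```python
from statistics import mean
from collections import Counter

def impute(values):
    """Replace None with the mean (numeric column) or the mode
    (categorical column) of the observed values, as described above."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)                        # continuous: use the mean
    else:
        fill = Counter(observed).most_common(1)[0][0]  # categorical: use the mode
    return [fill if v is None else v for v in values]

print(impute([4.0, None, 6.0, 2.0]))        # the mean 4.0 fills the gap
print(impute(['red', 'blue', None, 'red']))  # the mode 'red' fills the gap
```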
 Advantages
fast,
simple,
easy to implement, and
no cases are excluded

 Limitations
underestimation of the population variance,
thus a small standard error and
the possibility of Type I error.

23
Single imputation Contd…

Hot deck imputation
Hot-deck imputation is a procedure where the
imputed values come from other cases in the
same data set:
for each object that contains missing values,
the most similar object is found, and the
missing values are imputed from that object.

24
 Advantages
preserves the population distribution
it is better than mean imputation

 Limitations
 distort correlations and covariances

25
Other type of Single imputation
Regression imputation
Cold-deck imputation

Expectation Maximisation (EM)

Sequential imputation

Last observation carried forward

Worst case and Best case imputation

26
Multiple imputation
The idea of Multiple Imputation is to replace each missing value with multiple
acceptable values that represent a distribution of possibilities.
This results in a number of complete datasets (usually 3-10):

27
Outlier management
 Outlier: A data object that deviates significantly from the normal objects as if
it were generated by a different mechanism
 Ex.: Unusual credit card purchase
 Outliers are different from the noise data
Noise is random error or variance in a measured variable

Noise should be removed before outlier detection

28
Types of Outliers
 Three kinds:
Global,

Contextual

Collective

 Global outlier (or point anomaly)


An object Og is a global outlier if it significantly deviates from the rest of the data set

Ex. Intrusion detection in computer networks

Issue: Find an appropriate measurement of deviation

29
Types of Outliers Contd…
 Contextual outlier (or conditional outlier)
An object Oc is a contextual outlier if it deviates significantly with respect to a selected context
Ex. Is 48 °C in Mathura an outlier? (it depends on whether it is summer or winter)
Attributes of the data objects should be divided into two groups to detect Oc:
Contextual attributes: define the context, e.g., time and location

Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature, pressure, humidity

Issue: How to define or formulate a meaningful context?

30
Types of Outliers Contd…
 Collective Outliers
 A subset of data objects collectively deviate significantly from the
whole data set, even if the individual data objects may not be
outliers
 Applications: E.g., intrusion detection:
 When a number of computers keep sending denial-of-service
packages to each other
 Detection of collective outliers
Consider not only the behavior of individual objects, but also that
of groups of objects
Need to have background knowledge on the relationship
among data objects, such as a distance or similarity measure on
objects.
Outlier Detection
 Two ways to categorize outlier detection methods:

 Based on whether user-labeled examples of outliers can be obtained:


Supervised,

Unsupervised, and

Semi-supervised methods

 Based on assumptions about normal data and outliers :


Statistical,

proximity-based, and

clustering-based methods

32
Statistical Methods
 Statistical methods (also known as model-based methods) assume that the normal data follow
some statistical model
The data not following the model are outliers.
 Methods are divided into two categories: parametric vs. non-parametric
 Parametric method
Assumes that the normal data is generated by a parametric distribution with parameter θ

The probability density function of the parametric distribution f(x, θ) gives the probability

that object x is generated by the distribution

 Non-parametric method
Not assume an a-priori statistical model and determine the model from the input data

Not completely parameter free but consider the number and nature of the parameters are

flexible and not fixed in advance


Examples: histogram
33
Parametric Methods I: Univariate Outliers Based on
Normal Distribution

 Often assume that data are generated from a normal distribution, learn the parameters from the
input data, and identify the points with low probability as outliers

 Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
 Use the maximum likelihood method to estimate μ and σ

 For the above data with n = 10, we have

Consider the value 24


28.61 – 3*1.51 = 24.08

So, 24 is an outlier 34
Statistical Methods – Box Plot
 Values less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR are outliers
 Consider the following dataset:
 10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9,
15.1, 15.9, 16.4
Here,
Q2 (median) = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 - Q1 = 14.9 - 14.4 = 0.5
Outliers are any points:
below Q1 - 1.5*IQR = 14.4 - 0.75 = 13.65, or
above Q3 + 1.5*IQR = 14.9 + 0.75 = 15.65
So the outliers are 10.2, 15.9, and 16.4.
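The box-plot rule can be sketched in a few lines. Quartile conventions vary between tools; Python's statistics.quantiles with its default "exclusive" method happens to reproduce Q1 = 14.4 and Q3 = 14.9 for this dataset:

```python
from statistics import quantiles

def iqr_outliers(data):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the box-plot rule)."""
    q1, _, q3 = quantiles(sorted(data), n=4)  # quartiles of the data
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]
print(iqr_outliers(data))  # [10.2, 15.9, 16.4]
```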
Non-Parametric Methods: Detection Using Histogram

 The model of normal data is learned from the input data
without any a priori structure.
 Often makes fewer assumptions about the data, and thus can
be applicable in more scenarios
 Outlier detection using histogram:
 Figure shows the histogram of purchase amounts in transactions
 A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an amount
higher than $5,000
 Problem: Hard to choose an appropriate bin size for histogram
 Too small bin size → normal objects in empty/rare bins, false positive
 Too big bin size → outliers in some frequent bins, false negative

36
Proximity-Based Methods

 An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set
 Two types of proximity-based outlier detection methods
 Distance-based outlier detection: An object o is an outlier if its neighborhood does not have
enough other points
 Density-based outlier detection: An object o is an outlier if its density is relatively much
lower than that of its neighbors
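A minimal sketch of distance-based detection; the radius `r` and the required neighbor fraction are this sketch's assumptions, not values from the slide:

```python
import math

# Four tightly grouped points and one far-away point (toy data)
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (5.0, 5.0)]
r = 1.0         # neighborhood radius (assumed)
min_frac = 0.5  # required fraction of points within the radius (assumed)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# A point is an outlier if its neighborhood does not hold enough other points
outliers = []
for p in points:
    neighbors = sum(1 for q in points if q is not p and dist(p, q) <= r)
    if neighbors / (len(points) - 1) < min_frac:
        outliers.append(p)
print(outliers)  # -> [(5.0, 5.0)]
```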

37
Clustering-Based Outlier Detection
 An object is an outlier if
  it does not belong to any cluster,
  there is a large distance between the object and its closest cluster, or
  it belongs to a small or sparse cluster


38
Categorical Encoding
Typically, any structured dataset includes multiple columns – a combination of
numerical as well as categorical variables.
A machine can only understand numbers; it cannot understand text.
That’s essentially the case with machine learning algorithms too.
That’s primarily the reason we need to convert categorical columns to numerical
columns so that a machine learning algorithm understands it. This process is
called categorical encoding.
Three approaches for categorical encoding:

Drop Categorical Variables


Label Encoding

One-Hot Encoding

39
Label Encoding
Label encoding assigns each unique value to a different integer.
This approach assumes an ordering of the categories:

 Eg: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3)

This assumption makes sense in this example, because there is an indisputable ranking to the
categories. Not all categorical variables have a clear ordering in the values, but we refer to
those that do as ordinal variables.
For tree-based models (like decision trees and random forests), you can expect label

encoding to work well with ordinal variables.
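A minimal sketch of label encoding on the slide's ordinal categories (the sample column below is hypothetical):

```python
# Ordinal categories with a known order (the slide's example)
order = ["Never", "Rarely", "Most days", "Every day"]
encoding = {cat: i for i, cat in enumerate(order)}

# Hypothetical column of raw categorical values
column = ["Rarely", "Never", "Every day", "Most days", "Rarely"]
encoded = [encoding[c] for c in column]
print(encoded)  # -> [1, 0, 3, 2, 1]
```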


40
One-Hot Encoding
 One-hot encoding creates new columns indicating the presence (or absence) of each
possible value in the original data.
 Eg: In the original dataset, "Color" is a categorical variable with three categories: "Red",
"Yellow", and "Green".
The corresponding one-hot encoding contains one column for each possible value,
and one row for each row in the original dataset.
Wherever the original value was "Red", we put a 1 in the "Red" column; if the
original value was "Yellow", we put a 1 in the "Yellow" column, and so on.
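The same idea in a few lines, using the slide's "Color" example (the sample rows are hypothetical):

```python
categories = ["Red", "Yellow", "Green"]     # the slide's "Color" categories
column = ["Red", "Yellow", "Red", "Green"]  # hypothetical original rows

# One column per category, one row per original row; 1 marks the match
one_hot = [[1 if value == cat else 0 for cat in categories]
           for value in column]
print(one_hot)  # -> [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```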

41
One-Hot Encoding Contd…
In contrast to label encoding, one-hot encoding does not assume an ordering of
the categories.
This approach tends to work particularly well if there is no clear ordering in the
categorical data (e.g., "Red" is neither more nor less than "Yellow").
We refer to categorical variables without an intrinsic ranking as nominal
variables.
One-hot encoding generally does not perform well if the categorical variable
takes on a large number of values (i.e., you generally won't use it for variables
taking more than 15 different values).

42
Feature selection
Features contain information about the target
Naïve view:

 More features
=> More information
=> More discrimination power
 In practice:
Many reasons why this is not the case!

43
Feature selection Contd…
Curse of dimensionality

 The number of training examples is fixed

=> the classifier’s performance usually degrades for a large number of
features!

44
Feature selection Contd…
 Data may contain many irrelevant and redundant variables (features) and often
comparably few (limited) training examples

Feature selection
 A procedure in machine learning to find a subset of features that produces a
‘better’ model for a given dataset
Avoid overfitting and achieve better generalization ability
Reduce the storage requirement and training time

Interpretability

45
Feature Selection
 Given a set of N features, the goal of feature selection is to select a subset of K
features (K << N) in order to minimize the classification error.

Feature selection Feature extraction

46
Feature Selection vs Feature Extraction
 Feature Selection
New features represent a subset of the original features.
When classifying novel patterns, only a small number of features need to be computed (i.e.,

faster classification).

 Feature Extraction
Projection to M < N dimensions
New features are combinations (linear for PCA/LDA) of the original features (difficult to

interpret).
When classifying novel patterns, all features need to be computed.

47
Feature Selection: Main Steps
 Feature selection is an optimization
problem.

 Step 1: Search the space of possible


feature subsets.

 Step 2: Pick the subset that is


optimal or near-optimal with respect
to some objective function.

48
Feature Selection: Main Steps (cont’d)
Search methods
Exhaustive
Heuristic

Randomized

Evaluation methods
 Filter (Unsupervised)
Look at input only
Select the subset that has the most information

 Wrapper (Supervised)
Train using selected subset
Estimate error on validation dataset
49
Search Methods
 Assuming n features, an exhaustive search would require examining all C(n, d)
possible subsets of size d.

 The number of subsets grows combinatorially, making exhaustive search
impractical.
e.g., exhaustive evaluation of 10 out of 20 features involves 184,756 feature subsets.
e.g., exhaustive evaluation of 10 out of 100 features involves more than 10^13 feature subsets.

 In practice, heuristics are used to speed-up search but they cannot guarantee
optimality.

50
Evaluation Methods

 Filter
Evaluation is independent of the classification algorithm.
The objective function measures the “goodness” of a feature subset by its
information content, e.g.:
interclass distance
statistical dependence
information-theoretic measures (e.g., mutual information).

51
Evaluation Methods (cont’d)

 Wrapper
Evaluation uses criteria related to the classification algorithm.
The objective function measures the “goodness” of a feature subset by the
predictive accuracy of the classifier, e.g.:
 recognition accuracy on a “validation” data set.

52
Filter vs Wrapper Methods (cont’d)
 Filter Methods
 Advantages
 Much faster than wrapper methods since the objective function has lower computational
requirements.
The optimum feature set might work well with various classifiers as it is not tied to a

specific classifier.
 Disadvantages
Achieve lower recognition accuracy compared to wrapper methods.
Have a tendency to select more features compared to wrapper methods.

53
Filter vs Wrapper Methods
 Wrapper Methods
 Advantages
 Achieve higher recognition accuracy compared to filter methods since they use the classifier
itself in choosing the optimum set of features.
 Disadvantages
Much slower compared to filter methods since the classifier must be trained and tested for
each candidate feature subset.
The optimum feature subset might not work well for other classifiers.

54
Forward selection
(heuristic search)

Start with empty feature set


Try each remaining feature

Estimate the classification/ regression error for adding each specific feature

Select the feature that gives maximum improvement

Stop when there is no significant improvement

FS performs best when the optimal subset is small.
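The steps above can be sketched as a greedy loop; `error` stands in for whatever validation error the model produces, and the toy error function below is purely illustrative:

```python
def forward_selection(features, error):
    """Greedy forward selection. `error(subset)` is assumed to return a
    validation error for a candidate feature subset (hypothetical helper)."""
    selected = []
    best_err = error(selected)
    while True:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        # Try adding each remaining feature; keep the one that helps most
        trials = {f: error(selected + [f]) for f in remaining}
        best_f = min(trials, key=trials.get)
        if trials[best_f] >= best_err:
            break  # no significant improvement -> stop
        selected.append(best_f)
        best_err = trials[best_f]
    return selected

# Toy error: features "a" and "c" are informative, each extra feature costs 0.05
def toy_error(subset):
    return 1.0 - 0.4 * ("a" in subset) - 0.3 * ("c" in subset) + 0.05 * len(subset)

print(forward_selection(["a", "b", "c"], toy_error))  # -> ['a', 'c']
```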

55
Backward selection (BS)
(heuristic search)

Start with full feature set


Try removing each remaining feature

Drop the feature with the smallest impact on the error

BS performs best when the optimal subset is large.

56
THANKS

NEERAJ.GUPTA@GLA.AC.IN 57
MACHINE LEARNING (ML-14)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oMachine Learning
oTypes of Machine Learning
oUnsupervised Learning : Clustering
oUnsupervised Example
oApplications of Clustering
oK-Means Algorithm
oK-Means Examples

NEERAJ.GUPTA@GLA.AC.IN 2
WHEN DO WE USE MACHINE LEARNING?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

Learning isn’t always useful:
• There is no need to “learn” to calculate payroll

SRC: Based on slide by E. Alpaydin


TYPES OF LEARNING
•Supervised learning
•Given: training data + desired outputs (labels)
•Unsupervised learning
•Given: training data (without desired outputs)
•Semi-supervised learning
•Given: training data + a few desired outputs
•Reinforcement learning
•Rewards from sequence of actions
SUPERVISED LEARNING: REGRESSION
SUPERVISED LEARNING: CLASSIFICATION
UNSUPERVISED LEARNING

NEERAJ.GUPTA@GLA.AC.IN 7
SUPERVISED VS UNSUPERVISED LEARNING
Supervised learning: discover patterns in the data that relate data attributes
with a target (class) attribute.
 Patterns are then utilized to predict the values of the target attribute in future data instances.

Unsupervised learning: The data have no target attribute.


 To explore the data to find some intrinsic structures in them.

NEERAJ.GUPTA@GLA.AC.IN 8
UNSUPERVISED EXAMPLE
A bank wants to give credit card offers to its customers. Currently, they look at the details of
each customer and based on this information, decide which offer should be given to which
customer.

Clustering is the process of dividing the entire data


into groups (also known as clusters) based on the
patterns in the data.
NEERAJ.GUPTA@GLA.AC.IN 9
APPLICATIONS OF CLUSTERING IN REAL-WORLD

NEERAJ.GUPTA@GLA.AC.IN 10
APPLICATIONS OF CLUSTERING IN REAL-WORLD

Image Segmentation

Clustering is used to collect similar pixels in the same group.


NEERAJ.GUPTA@GLA.AC.IN 11
APPLICATIONS OF CLUSTERING IN REAL-WORLD

Recommendation Engines

NEERAJ.GUPTA@GLA.AC.IN 12
K-MEANS CLUSTERING

The main objective of the K-Means


algorithm is to minimize the sum
of distances between the points
and their respective cluster
centroid.

NEERAJ.GUPTA@GLA.AC.IN 13
K-MEANS CLUSTERING

NEERAJ.GUPTA@GLA.AC.IN 14
K-MEANS CLUSTERING

NEERAJ.GUPTA@GLA.AC.IN 15
K-MEANS CLUSTERING

NEERAJ.GUPTA@GLA.AC.IN 16
K-MEANS CLUSTERING

NEERAJ.GUPTA@GLA.AC.IN 17
K-MEANS CLUSTERING

NEERAJ.GUPTA@GLA.AC.IN 18
K-MEANS ALGORITHM

NEERAJ.GUPTA@GLA.AC.IN 19
K-MEANS ALGORITHM

NEERAJ.GUPTA@GLA.AC.IN 20
EXAMPLE
Divide the given sample data in two (2) clusters using K-Means algorithm using
Euclidean Distance.
Sno. Height(H) Weight(W)
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77
7 180 71
8 180 70
9 183 84
10 180 88
11 180 67
12 177 76
NEERAJ.GUPTA@GLA.AC.IN 21
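A pure-Python sketch of Lloyd's algorithm on this data; the slide does not specify an initialization, so taking the first two points as initial centroids is this sketch's assumption:

```python
# The slide's height/weight data as (H, W) pairs
data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

def dist2(p, q):  # squared Euclidean distance
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

centroids = [data[0], data[1]]  # initialization assumed, not from the slide
for _ in range(100):
    # Assignment step: each point joins its nearest centroid
    clusters = [[], []]
    for p in data:
        k = 0 if dist2(p, centroids[0]) <= dist2(p, centroids[1]) else 1
        clusters[k].append(p)
    # Update step: move each centroid to the mean of its cluster
    new_centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) for c in clusters]
    if new_centroids == centroids:
        break  # assignments stable -> converged
    centroids = new_centroids

print([len(c) for c in clusters])  # -> [10, 2]
```

With this initialization the two shortest/lightest students, (170, 56) and (168, 60), form one cluster and the remaining ten points form the other.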
K-MEANS FOR NON-SEPARATED CLUSTERS

NEERAJ.GUPTA@GLA.AC.IN 22
EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 23
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 24
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 25
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 26
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 27
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 28
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 29
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 30
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 31
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 32
SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 33
VISUALIZATION OF SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 34
VISUALIZATION OF SOLUTION

NEERAJ.GUPTA@GLA.AC.IN 35
HOW TO CHOOSE K: ELBOW METHOD
It is based on the sum of squared distances (SSE) between data points and their
assigned clusters’ centroids.
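A small sketch of the SSE computation behind the elbow curve (toy points, not the slide's data); running it for increasing k and plotting the values yields the elbow plot:

```python
# SSE for one clustering: sum of squared distances from each 2-D point
# to its cluster centroid
def sse(clusters):
    total = 0.0
    for cl in clusters:
        cx = sum(p[0] for p in cl) / len(cl)
        cy = sum(p[1] for p in cl) / len(cl)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in cl)
    return total

pts = [(0, 0), (0, 2), (4, 0), (4, 2)]
sse_k1 = sse([pts])               # all points in one cluster (k = 1)
sse_k2 = sse([pts[:2], pts[2:]])  # split into two clusters (k = 2)
print(sse_k1, sse_k2)  # -> 20.0 4.0  (SSE always drops as k grows)
```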

NEERAJ.GUPTA@GLA.AC.IN 36
DRAWBACKS
The k-means algorithm is good at capturing the structure of the data if the clusters
have a spherical-like shape.
It always tries to construct a nice spherical shape around the centroid.
For clusters that have complicated geometric shapes, k-means does a poor job of
clustering the data.

NEERAJ.GUPTA@GLA.AC.IN 37
CLUSTERING QUALITY
Ideal clustering is characterized by minimal intra-cluster distance and maximal
inter-cluster distance.
There are majorly two types of measures to assess the clustering performance.
(i) Extrinsic Measures, which require ground truth labels.
 Rand index
(ii) Intrinsic Measures, which do not require ground truth labels.
 Silhouette Coefficient

NEERAJ.GUPTA@GLA.AC.IN 38
BASIC CLUSTERING METHODS
1. Partitioning methods
 k-means

2. Hierarchical methods
 Agglomerative (bottom-up) or divisive (top-down)

3. Density-based methods
4. Grid-based methods

NEERAJ.GUPTA@GLA.AC.IN 39
OVERVIEW OF CLUSTERING METHODS

NEERAJ.GUPTA@GLA.AC.IN 40
HIERARCHICAL METHODS
A hierarchical clustering method works by grouping data objects into a hierarchy or
“tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data summarization
and visualization.
Agglomerative methods start with individual objects as clusters, which are iteratively
merged to form larger clusters.
Divisive methods initially let all the given objects form one cluster, which they
iteratively split into smaller clusters.
Hierarchical clustering methods can encounter difficulties regarding the selection of
merge or split points.

NEERAJ.GUPTA@GLA.AC.IN 41
AGGLOMERATIVE VERSUS DIVISIVE HIERARCHICAL
CLUSTERING
A hierarchical clustering method can be either agglomerative or divisive, depending
on whether the hierarchical decomposition is formed in a bottom-up (merging) or
top-down (splitting) fashion.

AGNES (AGglomerative NESting)

DIANA (DIvisive ANAlysis)

data objects {a, b, c, d, e}. NEERAJ.GUPTA@GLA.AC.IN 42


AGGLOMERATIVE VERSUS DIVISIVE HIERARCHICAL
CLUSTERING
A tree structure
called a
dendrogram is
commonly used
to represent the
process of
hierarchical
clustering.

Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.

NEERAJ.GUPTA@GLA.AC.IN 43
AGGLOMERATIVE HIERARCHICAL CLUSTERING
We assign each point to an individual cluster in this technique.
Suppose there are 4 data points. We will assign each of these points to a cluster and
hence will have 4 clusters in the beginning:

NEERAJ.GUPTA@GLA.AC.IN 44
AGGLOMERATIVE HIERARCHICAL CLUSTERING
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster
is left:

We are merging (or adding) the clusters at each step.


Hence, this type of clustering is also known as additive hierarchical clustering.

NEERAJ.GUPTA@GLA.AC.IN 45
DIVISIVE HIERARCHICAL CLUSTERING
Divisive hierarchical clustering works in the opposite way.
Instead of starting with n clusters (in case of n observations), we start with a single
cluster and assign all the points to that cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong
to the same cluster at the beginning:

NEERAJ.GUPTA@GLA.AC.IN 46
DIVISIVE HIERARCHICAL CLUSTERING
Now, at each iteration, we split the farthest point in the cluster and repeat this
process until each cluster only contains a single point:

We are splitting (or dividing) the clusters at each step, hence the name divisive
hierarchical clustering.

NEERAJ.GUPTA@GLA.AC.IN 47
STEPS TO PERFORM HIERARCHICAL CLUSTERING
We merge the most similar points or clusters in hierarchical clustering – we know this. Now the
question is –

How do we decide which points are similar and which are not?

Distance-based algorithms
NEERAJ.GUPTA@GLA.AC.IN 48
STEPS TO PERFORM HIERARCHICAL CLUSTERING
In hierarchical clustering, we have a concept called a proximity matrix. This stores
the distances between each point.
Suppose a teacher wants to divide her students into different groups. She has the
marks scored by each student in an assignment and based on these marks, she wants
to segment them into groups. There’s no fixed target here as to how many groups to
have. Since the teacher does not know what type of students should be assigned to
which group, it cannot be solved as a supervised learning problem. So, we will try to
apply hierarchical clustering here and segment the students into different groups.
Let’s take a sample of 5 students:

NEERAJ.GUPTA@GLA.AC.IN 49
CREATING A PROXIMITY MATRIX
Let’s make the 5 x 5 proximity matrix for our example:

NEERAJ.GUPTA@GLA.AC.IN 50
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 1: First, we assign all the points to an individual cluster

Step 2: Next, we will look at the smallest distance in the proximity matrix and merge
the points with the smallest distance. We then update the proximity matrix:

NEERAJ.GUPTA@GLA.AC.IN 51
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Here, the smallest distance is 3 and hence we will merge point 1 and 2.

NEERAJ.GUPTA@GLA.AC.IN 52
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Let’s look at the updated clusters and accordingly update the proximity matrix:

NEERAJ.GUPTA@GLA.AC.IN 53
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Now, we will again calculate the proximity matrix for these clusters:

NEERAJ.GUPTA@GLA.AC.IN 54
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 3: We will repeat step 2 until only a single cluster is left.

We started with 5 clusters and finally have a single cluster. This is how
agglomerative hierarchical clustering works.
NEERAJ.GUPTA@GLA.AC.IN 55
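The procedure above can be sketched with single-linkage merging. The slide's marks table is an image, so the five marks below are hypothetical, chosen so that the first merge distance is 3, as on the slide:

```python
# Single-linkage agglomerative clustering on one-dimensional marks
# (hypothetical values; the slide's table is not reproduced here)
marks = {"s1": 10, "s2": 7, "s3": 28, "s4": 20, "s5": 35}

clusters = [[name] for name in marks]  # step 1: every point is its own cluster
merges = []
while len(clusters) > 1:
    # Step 2: find the pair of clusters at the smallest single-link distance
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(marks[a] - marks[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merges.append((clusters[i], clusters[j], d))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges[0])  # -> (['s1'], ['s2'], 3): the closest pair is merged first
```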
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
A dendrogram is a tree-like diagram that records the sequences of merges or
splits.
Let’s see how a dendrogram looks like:

NEERAJ.GUPTA@GLA.AC.IN 56
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
Whenever two clusters are merged, we will join them in this dendrogram and the
height of the join will be the distance between these points.

NEERAJ.GUPTA@GLA.AC.IN 57
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?

NEERAJ.GUPTA@GLA.AC.IN 58
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
The longer the vertical lines in the dendrogram, the greater the distance between
those clusters. Now, we can set a threshold distance and draw a horizontal line
(generally, we try to set the threshold in such a way that it cuts the tallest vertical
line).

The number of clusters will


be the number of vertical
lines which are being
intersected by the line
drawn using the threshold.

NEERAJ.GUPTA@GLA.AC.IN 59
THANKS

NEERAJ.GUPTA@GLA.AC.IN 60
MACHINE LEARNING (ML-15)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oSupport Vector Machine
oTypes of SVM
oHyperplane
oSupport Vectors
oLinear SVM Mathematically
oExamples
oPros and Cons

NEERAJ.GUPTA@GLA.AC.IN 2
INTRODUCTION
•Support Vector Machine abbreviated as SVM

•It can be used for both regression and classification tasks.

•It is most widely used for classification problems.

NEERAJ.GUPTA@GLA.AC.IN 3
INTRODUCTION
•SVM algorithm can be used for Face detection, image
classification, text categorization, etc.

NEERAJ.GUPTA@GLA.AC.IN 4
WHAT IS SUPPORT VECTOR MACHINE?
•A support vector machine is a machine learning model that is able to
generalize between two different classes when a set of labelled data is
provided to the algorithm in the training set.

•The objective of the support vector machine algorithm is to find a hyperplane
in an N-dimensional space (N = the number of features) that distinctly
classifies the data points.

NEERAJ.GUPTA@GLA.AC.IN 5
TYPES OF SVM
1. Linear SVM:
•Linear SVM is used for linearly separable data.
•It means if a dataset can be classified into two classes by using a single
straight line, then such data is termed as linearly separable data.
•The classifier used is called the Linear SVM classifier.

NEERAJ.GUPTA@GLA.AC.IN 6
TYPES OF SVM
2. Non-linear SVM:
•Non-Linear SVM is used for non-linearly separated data.
•It means if a dataset cannot be classified by using a straight line, then such
data is termed as non-linear data.
•The classifier used is called the Non-linear SVM classifier.

NEERAJ.GUPTA@GLA.AC.IN 7
SUPPORT VECTOR MACHINE

NEERAJ.GUPTA@GLA.AC.IN 8
SUPPORT VECTOR MACHINE

NEERAJ.GUPTA@GLA.AC.IN 9
SUPPORT VECTOR MACHINE
•To separate the two classes of data points, there are many
possible hyperplanes that could be chosen.

•The objective is to find a plane that has the maximum margin, i.e
the maximum distance between data points of both classes.

•Maximizing the margin distance provides some reinforcement so


that future data points can be classified with more confidence.
NEERAJ.GUPTA@GLA.AC.IN 10
HYPERPLANES
•Hyperplanes are decision boundaries that help classify the data points.
•Data points falling on either side of the hyperplane can be attributed to different
classes.
•The dimension of the hyperplane depends upon the number of features.
•If the number of input features is 2, then the hyperplane is just a line.
•If the number of input features is 3, then the hyperplane becomes a two-
dimensional plane.
•It becomes difficult to imagine when the number of features exceeds 3.

NEERAJ.GUPTA@GLA.AC.IN 11
HYPERPLANES

NEERAJ.GUPTA@GLA.AC.IN 12
HYPERPLANES
•Identify the right hyper-plane (Scenario-1):

NEERAJ.GUPTA@GLA.AC.IN 13
HYPERPLANES
Identify the right hyper-plane (Scenario-2):

NEERAJ.GUPTA@GLA.AC.IN 14
HYPERPLANES
Identify the right hyper-plane (Scenario-3):

NEERAJ.GUPTA@GLA.AC.IN 15
HYPERPLANES
Can we classify two classes (Scenario-4)?

NEERAJ.GUPTA@GLA.AC.IN 16
HYPERPLANES
Find the hyper-plane to segregate two classes (Scenario-5):

NEERAJ.GUPTA@GLA.AC.IN 17
HYPERPLANES
Find the hyper-plane to segregate two classes (Scenario-5): In the scenario below, we
can’t have a linear hyper-plane between the two classes, so how does SVM classify
these two classes?

NEERAJ.GUPTA@GLA.AC.IN 18
HYPERPLANES
The SVM algorithm has a technique called the kernel trick. The SVM kernel is
a function that takes a low-dimensional input space and transforms it into a higher-
dimensional space, i.e., it converts a non-separable problem into a separable
problem.

NEERAJ.GUPTA@GLA.AC.IN 19
SUPPORT VECTORS
•Support vectors are data points that are closer to the hyperplane and influence
the position and orientation of the hyperplane.

•Using these support vectors, we maximize the margin of the classifier.

•Deleting the support vectors will change the position of the hyperplane.

•These are the points that help us build our SVM.

NEERAJ.GUPTA@GLA.AC.IN 20
SUPPORT VECTORS

NEERAJ.GUPTA@GLA.AC.IN 21
Sec. 15.1

MAXIMUM MARGIN: FORMALIZATION


w: decision hyperplane normal vector
xi: data point i
yi: class of data point i (+1 or -1) NB: Not 1/0
Classifier is: f(xi) = sign(wTxi + b)
Functional margin of xi is: yi (wTxi + b)
 But note that we can increase this margin simply by scaling w, b….

Functional margin of dataset is twice the minimum functional


margin for any point
 The factor of 2 comes from measuring the whole width of the
margin

22
Sec. 15.1

GEOMETRIC MARGIN
wT x  b
Distance from example to the separator is ry
w
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between support vectors of
classes.
x ρ

r
x′

w
23
Sec. 15.1

LINEAR SVM MATHEMATICALLY


Assume that all data is at least distance 1 from the hyperplane; then the following two
constraints follow for a training set {(xi , yi)}:

wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1

For support vectors, the inequality becomes an equality.

Then, since each example’s distance from the hyperplane is r = y (wTxi + b) / ‖w‖,

the margin is: ρ = 2 / ‖w‖

24
Sec. 15.1

LINEAR SUPPORT VECTOR MACHINE (SVM)

ρ wTxa + b = 1
Hyperplane
wT x + b = 0 wTxb + b = -1

Extra scale constraint:


mini=1,…,n |wTxi + b| = 1

This implies:
wT x + b = 0
wT(xa–xb) = 2
ρ = ||xa–xb||2 = 2/||w||2
25
Sec. 15.1

LINEAR SVMS MATHEMATICALLY (CONT.)


Then we can formulate the quadratic optimization problem:

Find w and b such that
ρ = 2/‖w‖ is maximized; and for all {(xi , yi)}:
wTxi + b ≥ 1 if yi = 1; wTxi + b ≤ −1 if yi = −1
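A small numeric sketch of the functional and geometric margins, on hypothetical toy points and a hand-picked hyperplane (not the slide's example):

```python
import math

# Toy linearly separable data: ((x1, x2), y) with y in {+1, -1}
data = [((3, 3), 1), ((4, 4), 1), ((1, 1), -1), ((0, 0), -1)]
w, b = (1.0, 1.0), -4.0  # candidate hyperplane w.x + b = 0 (assumed)

def f(x):
    return w[0] * x[0] + w[1] * x[1] + b

# Functional margin of each point: y_i * (w.x_i + b)
func_margins = [y * f(x) for x, y in data]

# Geometric margin of the dataset: smallest functional margin / ||w||
geo_margin = min(func_margins) / math.hypot(*w)
print(min(func_margins), geo_margin)  # functional margin 2, geometric sqrt(2)
```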

26
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 27
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 28
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 29
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 30
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 31
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 32
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 33
EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 34
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 35
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 36
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 37
SVM EXAMPLE

NEERAJ.GUPTA@GLA.AC.IN 38
PROS AND CONS ASSOCIATED WITH SVM
Pros:
•It works really well with a clear margin of separation
•It is effective in high dimensional spaces.
•It is effective in cases where the number of dimensions is greater than the
number of samples.
•It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.

NEERAJ.GUPTA@GLA.AC.IN 39
PROS AND CONS ASSOCIATED WITH SVM
Cons:
•It doesn’t perform well when we have a large data set, because the required training
time is higher.
•It doesn’t perform very well when the data set has more noise, i.e., the target classes
are overlapping.

NEERAJ.GUPTA@GLA.AC.IN 40
THANKS

NEERAJ.GUPTA@GLA.AC.IN 41
MACHINE LEARNING (ML-16)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oPrincipal Component Analysis (PCA)

NEERAJ.GUPTA@GLA.AC.IN 2
MOTIVATION
Clustering
One way to summarize a complex real-valued data point with a
single categorical variable
Dimensionality reduction
Another way to simplify complex high-dimensional data
Summarize data with a lower dimensional real valued vector
MOTIVATION
Clustering
 One way to summarize a complex real-valued data point with a single
categorical variable
Dimensionality reduction
 Another way to simplify complex high-dimensional data
 Summarize data with a lower dimensional real valued vector
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information
NEED FOR PCA

High-dimensional data is extremely complex to process due to inconsistencies in the features, which
increase the computation time.
NEERAJ.GUPTA@GLA.AC.IN 5
NEED FOR PCA

NEERAJ.GUPTA@GLA.AC.IN 6
Data Compression

Reduce data from 2D to 1D
(Figure: the same length measured in inches and in cm)
Andrew Ng
Data Compression
Reduce data from 3D to 2D

Andrew Ng
Principal Component Analysis (PCA) problem formulation

Reduce from 2-dimension to 1-dimension: Find a direction (a vector u(1))
onto which to project the data so as to minimize the projection error.
Reduce from n-dimension to k-dimension: Find k vectors u(1), …, u(k)
onto which to project the data, so as to minimize the projection error.

Andrew Ng
STEP BY STEP PCA

NEERAJ.GUPTA@GLA.AC.IN 11
NEERAJ.GUPTA@GLA.AC.IN 12
COVARIANCE
Variance and Covariance:
 Measure of the “spread” of a set of points around their center of mass(mean)
Variance:
 Measure of the deviation from the mean for points in one dimension
Covariance:
 Measure of how much each of the dimensions vary from the mean with respect to
each other

• Covariance is measured between two dimensions
• Covariance sees if there is a relation between two dimensions
• Covariance of a dimension with itself is the variance

Positive: both dimensions increase or decrease together. Negative: while one increases, the other decreases.
COVARIANCE
Used to find relationships between dimensions in high dimensional data sets

The Sample mean


EIGENVECTOR AND EIGENVALUE
Ax = λx
A: square matrix
x: eigenvector or characteristic vector
λ: eigenvalue or characteristic value
• The zero vector cannot be an eigenvector
• The value zero can be an eigenvalue
EIGENVECTOR AND EIGENVALUE
Ax = λx
A: square matrix
x: eigenvector or characteristic vector
λ: eigenvalue or characteristic value

Example
EIGENVECTOR AND EIGENVALUE
Ax = λx
Ax – λx = 0
(A – λI)x = 0

If we define a new matrix B = A – λI, then Bx = 0.

If B has an inverse, then x = B⁻¹0 = 0. BUT an eigenvector cannot be zero!

So x will be an eigenvector of A if and only if B does not have an inverse, or
equivalently det(B) = 0:
det(A – λI) = 0
EIGENVECTOR AND EIGENVALUE
Example 1: Find the eigenvalues of 2  12
A 
 1  5 
  2 12
I  A   (  2)(  5)  12
1   5
 2  3  2  (  1)(  2)
two eigenvalues: 1,  2
Note: The roots of the characteristic equation can be repeated. That is, λ1 = λ2 =…= λk. If
that happens, the eigenvalue is said to be of multiplicity k.
Example 2: Find the eigenvalues of 2 1 0
A  0 2 0
0 0 2
  2 1 0
I  A  0 2 0  (   2) 3  0
0 0 2
λ = 2 is an eigenvector of multiplicity 3.
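Both examples can be checked numerically (assuming NumPy is available):

```python
import numpy as np

# Example 1: eigenvalues of [[2, -12], [1, -5]] should be -1 and -2
A1 = np.array([[2.0, -12.0], [1.0, -5.0]])
vals1 = sorted(round(float(v), 6) for v in np.linalg.eigvals(A1).real)
print(vals1)  # -> [-2.0, -1.0]

# Example 2: a triangular matrix, eigenvalue 2 with multiplicity 3
A2 = np.array([[2.0, 1.0, 0.0],
               [0.0, 2.0, 0.0],
               [0.0, 0.0, 2.0]])
vals2 = np.linalg.eigvals(A2).real
print(vals2)  # eigenvalue 2 repeated three times
```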
PRINCIPAL COMPONENT ANALYSIS
Input:

Set of basis vectors:

Summarize a D dimensional vector X with K dimensional


feature vector h(x)
PRINCIPAL COMPONENT ANALYSIS

Basis vectors are orthonormal

New data representation h(x)


PRINCIPAL COMPONENT ANALYSIS

New data representation h(x)

Empirical mean of the data


PRACTICE PROBLEMS BASED ON PCA
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }.
Compute the principal component using PCA Algorithm.
OR
Consider the two dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).
Compute the principal component using PCA Algorithm.
OR
Compute the principal component of following data-
CLASS 1
X=2,3,4
Y=1,5,3
CLASS 2
X=5,6,7
Y=6,7,8
NEERAJ.GUPTA@GLA.AC.IN 24
PRACTICE PROBLEMS BASED ON PCA
PCA Algorithm-
The steps involved in PCA Algorithm are as follows-
Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the Eigen vectors and Eigen values of the covariance matrix.
Step-06: Choosing components and forming a feature vector.
Step-07: Deriving the new data set.

NEERAJ.GUPTA@GLA.AC.IN 25
PRACTICE PROBLEMS BASED ON PCA
Step-01:

Get data.
The given feature vectors are-
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)

NEERAJ.GUPTA@GLA.AC.IN 26
PRACTICE PROBLEMS BASED ON PCA
Step-02:
Calculate the mean vector (µ).

Mean vector (µ)


= ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6)
= (4.5, 5)
Thus,

NEERAJ.GUPTA@GLA.AC.IN 27
PRACTICE PROBLEMS BASED ON PCA
Step-03:
Subtract mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Feature vectors (xi) after subtracting mean vector (µ) are-

NEERAJ.GUPTA@GLA.AC.IN 28
PRACTICE PROBLEMS BASED ON PCA
Step-04:
Calculate the covariance matrix.
Covariance matrix is given by-

NEERAJ.GUPTA@GLA.AC.IN 29
PRACTICE PROBLEMS BASED ON PCA

NEERAJ.GUPTA@GLA.AC.IN 30
PRACTICE PROBLEMS BASED ON PCA
Now,
Covariance matrix
= (m1 + m2 + m3 + m4 + m5 + m6) / 6

On adding the above matrices and dividing by 6, we get-

NEERAJ.GUPTA@GLA.AC.IN 31
PRACTICE PROBLEMS BASED ON PCA
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI|
= 0.
So, we have-

NEERAJ.GUPTA@GLA.AC.IN 32
PRACTICE PROBLEMS BASED ON PCA
From here,
(2.92 – λ)(5.67 – λ) – (3.67 × 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38


Thus, two eigen values are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigen value is very small compared to the first eigen value.
So, the second eigen vector can be left out.
Eigen vector corresponding to the greatest eigen value is the principal component for the given
data set.
So. we find the eigen vector corresponding to eigen value λ1. NEERAJ.GUPTA@GLA.AC.IN 33
PRACTICE PROBLEMS BASED ON PCA
We use the following equation to find the eigen vector-
MX = λX
Where, M = Covariance Matrix, X = Eigen vector, λ = Eigen value
Substituting the values in the above equation, we get-

NEERAJ.GUPTA@GLA.AC.IN 34
PRACTICE PROBLEMS BASED ON PCA
Solving these, we get-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)
From (1) and (2), X1 = 0.69X2
From (2), the eigen vector is-

NEERAJ.GUPTA@GLA.AC.IN 35
PRACTICE PROBLEMS BASED ON PCA
Thus, principal component for the given data set is-

NEERAJ.GUPTA@GLA.AC.IN 36
PRACTICE PROBLEMS BASED ON PCA
Lastly, we project the data points onto the new subspace as-

NEERAJ.GUPTA@GLA.AC.IN 37
PRACTICE PROBLEMS BASED ON PCA
Use PCA Algorithm to transform the pattern (2, 1) onto the eigen vector in the
previous question.
The given feature vector is (2, 1).

NEERAJ.GUPTA@GLA.AC.IN 38
PRACTICE PROBLEMS BASED ON PCA
The feature vector gets transformed to
= Transpose of Eigen vector x (Feature Vector – Mean Vector)
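The whole worked example, including the final transformation of (2, 1), can be reproduced numerically (assuming NumPy; small differences from the slides come from their rounding to two decimals):

```python
import numpy as np

# The six patterns from the practice problem
X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)                 # mean vector (4.5, 5)
D = X - mu
C = D.T @ D / len(X)                # covariance with the 1/n convention above

vals, vecs = np.linalg.eigh(C)      # eigh: eigen-decomposition for symmetric C
pc = vecs[:, np.argmax(vals)]       # eigenvector of the largest eigenvalue
if pc[0] < 0:
    pc = -pc                        # fix the sign convention

# Transform the pattern (2, 1): pc . (x - mu)
projection = pc @ (np.array([2.0, 1.0]) - mu)
print(vals.max(), pc, projection)   # largest eigenvalue ~8.21, X1/X2 ~0.69
```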

NEERAJ.GUPTA@GLA.AC.IN 39
SIFT feature visualization

• The top three principal components of SIFT descriptors from a set of images are computed
• Map these principal components to the principal components of the RGB space
• pixels with similar colors share similar structures
Application: Image compression

ORIGINAL IMAGE

• Divide the original 372x492 image into patches:


• Each patch is an instance that contains 12x12 pixels on a grid
• View each as a 144-D vector
PCA COMPRESSION: 144D → 60D
PCA COMPRESSION: 144D → 16D
16 MOST IMPORTANT EIGENVECTORS
PCA COMPRESSION: 144D → 6D
6 MOST IMPORTANT EIGENVECTORS

PCA COMPRESSION: 144D → 3D
3 most important eigenvectors
PCA COMPRESSION: 144D → 1D
60 most important eigenvectors

Looks like the discrete cosine bases of JPG!...


2D DISCRETE COSINE BASIS

http://en.wikipedia.org/wiki/Discrete_cosine_transform
DIMENSIONALITY REDUCTION
PCA (Principal Component Analysis):
 Find the projection that maximizes the variance

ICA (Independent Component Analysis):
 Very similar to PCA, except that it assumes non-Gaussian features

Multidimensional Scaling:
 Find the projection that best preserves inter-point distances

LDA (Linear Discriminant Analysis):
 Maximize the component axes for class separation



THANKS

MACHINE LEARNING (ML-17)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oEnsemble Methods

INTRODUCTION
Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating
an improved model M*
INTRODUCTION
Two most popular ensemble methods are bagging and boosting.
Bagging: Training a bunch of individual models in a parallel way. Each model
is trained by a random subset of the data.
 averaging the prediction over a collection of classifiers

Boosting: Training a bunch of individual models in a sequential way. Each


individual model learns from mistakes made by the previous model.
 weighted vote with a collection of classifiers
BAGGING VS BOOSTING
BAGGING: BOOTSTRAP AGGREGATION
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di

Classification: classify an unknown sample X


 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the most votes to X
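The training and voting steps above can be sketched in plain Python. This is a minimal illustration, not the full algorithm: the base learner (a one-dimensional decision stump) and the toy data are choices made here, not from the slides.

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a 1-D decision stump: pick the threshold t that best
    separates the two classes on this bootstrap sample."""
    best = None
    for t, _ in data:
        # predict class 1 when x >= t, class 0 otherwise
        acc = sum((x >= t) == bool(y) for x, y in data)
        if best is None or acc > best[0]:
            best = (acc, t)
    return best[1]

def bagging(data, k, seed=0):
    """Train k stumps, each on a bootstrap sample Di of d tuples
    drawn with replacement from D."""
    rng = random.Random(seed)
    d = len(data)
    return [train_stump([rng.choice(data) for _ in range(d)]) for _ in range(k)]

def predict(models, x):
    # each classifier Mi votes; the bagged classifier M* takes the majority
    votes = Counter(int(x >= t) for t in models)
    return votes.most_common(1)[0][0]

# toy 1-D data: class 0 below 5, class 1 above
data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
models = bagging(data, k=11)
print(predict(models, 2), predict(models, 10))   # majority vote: 0 1
```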
BAGGING: BOOTSTRAP AGGREGATION
Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
Accuracy
 Often significantly better than a single classifier derived from D
 For noisy data: not considerably worse, more robust
 Proven improved accuracy in prediction
BOOSTING
Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
How does boosting work?
 Weights are assigned to each training tuple
 A series of k classifiers is iteratively learned
 After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were
misclassified by Mi
 The final M* combines the votes of each individual classifier, where the weight of
each classifier's vote is a function of its accuracy
Boosting algorithm can be extended for numeric prediction
Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data
RANDOM FOREST
Random forest is an ensemble model using bagging as the ensemble method and
decision tree as the individual model.

Problem with Bagging


 Works by reducing the variance
 Trees can be correlated, since the presence of strongly indicative features would
lead to similar splits in each tree
RANDOM FOREST
RANDOM FOREST (BREIMAN 2001)
Random Forest:
 Each classifier in the ensemble is a decision tree classifier and is generated using a
random selection of attributes at each node to determine the split
 During classification, each tree votes and the most popular class is returned
Two Methods to construct Random Forest:
 Forest-RI (random input selection): Randomly select, at each node, F attributes as
candidates for the split at the node.
 Forest-RC (random linear combinations): Creates new attributes (or features) that are
a linear combination of the existing attributes (reduces the correlation between
individual classifiers)
Comparable in accuracy to Adaboost, but more robust to errors and outliers
Insensitive to the number of attributes selected for consideration at each split, and
faster than bagging or boosting
ADABOOST (FREUND AND SCHAPIRE, 1997)
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
Initially, all the weights (wi) of tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i,
 Tuples from D are sampled (with replacement) to form a training set Di of the
same size
 Each tuple’s chance of being selected is based on its weight
 A classification model Mi is derived from Di
 Its error rate is calculated using Di as a test set
 If a tuple is misclassified, its weight is increased; otherwise it is decreased
Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise).
Classifier Mi's error rate is the sum of the weights of the misclassified tuples:

error(Mi) = Σj wj · err(Xj)

The weight of classifier Mi's vote (αMi) is

log( (1 − error(Mi)) / error(Mi) )
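The formulas above can be illustrated for one round. The tuple weights and misclassification indicators below are made-up toy values, and the re-weighting rule shown (scale down correct tuples by error/(1 − error), then normalize) is one common way to realize "increase misclassified weights"; the slides do not fix a specific rule.

```python
import math

# Toy illustration of one AdaBoost round. All numbers are made up.
w   = [0.2, 0.2, 0.2, 0.2, 0.2]   # tuple weights, initially 1/d
err = [0,   1,   0,   0,   0  ]   # err(Xj) = 1 if Mi misclassified tuple j

# error(Mi) = sum_j w_j * err(Xj)
error = sum(wj * ej for wj, ej in zip(w, err))
print(error)                       # 0.2

# weight of classifier Mi's vote: log((1 - error) / error)
alpha = math.log((1 - error) / error)
print(round(alpha, 3))             # log(4) = 1.386

# one common re-weighting: scale down correctly classified tuples,
# then normalize, so misclassified tuples gain weight
scaled = [wj * (error / (1 - error)) if ej == 0 else wj
          for wj, ej in zip(w, err)]
total = sum(scaled)
new_w = [s / total for s in scaled]
print([round(x, 2) for x in new_w])  # the misclassified tuple now holds weight 0.5
```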
REFERENCES
Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and
techniques. Elsevier, 2011.

T. M. Mitchell, Machine Learning. McGraw-Hill Science, 1997.


THANKS

MACHINE LEARNING (ML-18)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
History of Artificial Neural Networks
What is an Artificial Neural Networks?
How it works?
Learning

HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
History of the ANNs stems from the 1940s, the decade of the first electronic
computer.
However, the first important step took place in 1957 when Rosenblatt introduced the
first concrete neural model, the perceptron. Rosenblatt also took part in constructing
the first successful neurocomputer, the Mark I Perceptron. After this, the
development of ANNs has proceeded as described in Figure.

HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Rosenblatt's original perceptron model contained only one layer. From this, a multi-
layered model was derived in 1960. At first, the use of the multi-layer perceptron
(MLP) was complicated by the lack of an appropriate learning algorithm.
In 1974, Werbos came to introduce a so-called backpropagation algorithm for the
three-layered perceptron network.

HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
The application area of the MLP networks remained rather limited until the
breakthrough in 1986, when a general backpropagation algorithm for a
multi-layered perceptron was introduced by Rumelhart and McClelland.
In 1982, Hopfield brought out his idea of a neural network. Unlike the neurons in
MLP, the Hopfield network consists of only one layer whose neurons are fully
connected with each other.

HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Since then, new versions of the Hopfield network have been developed. The
Boltzmann machine has been influenced by both the Hopfield network and
the MLP.

HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1988, Radial Basis Function (RBF) networks were first introduced by
Broomhead & Lowe. Although the basic idea of RBF was developed 30 years
ago under the name method of potential function, the work by Broomhead &
Lowe opened a new frontier in the neural network community.

HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1982, A totally unique kind of network model is the Self-Organizing Map
(SOM) introduced by Kohonen. SOM is a certain kind of topological map
which organizes itself based on the input patterns that it is trained with. The
SOM originated from the LVQ (Learning Vector Quantization) network the
underlying idea of which was also Kohonen's in 1972.

HISTORY OF THE ARTIFICIAL NEURAL NETWORKS

Since then, research on artificial neural networks


has remained active, leading to many new network
types, as well as hybrid algorithms and hardware
for neural information processing.

ARTIFICIAL NEURAL NETWORK
A set of major aspects of a parallel distributed model
include:
 a set of processing units (cells).
 a state of activation for every unit, which is equivalent to the output of the
unit.
 connections between the units. Generally each connection is defined by a
weight.
 a propagation rule, which determines the effective input of a unit from its
external inputs.

ARTIFICIAL NEURAL NETWORK
an activation function, which determines the new level of
activation based on the effective input and the current activation.
an external input for each unit.
a method for information gathering (the learning rule).
an environment within which the system must operate, providing
input signals and, if necessary, error signals.

COMPUTERS VS. NEURAL NETWORKS
“Standard” Computers vs. Neural Networks:
 one CPU vs. highly parallel processing
 fast processing units vs. slow processing units
 reliable units vs. unreliable units
 static infrastructure vs. dynamic infrastructure

WHY ARTIFICIAL NEURAL NETWORKS?
There are two basic reasons why we are interested in building artificial neural
networks (ANNs):
• Technical viewpoint: Some problems such as character recognition or the
prediction of future states of a system require massively parallel and adaptive
processing.
• Biological viewpoint: ANNs can be used to replicate and simulate
components of the human (or animal) brain, thereby giving us insight into
natural information processing.

ARTIFICIAL NEURAL NETWORK
•The “building blocks” of neural networks are
the neurons.
• In technical systems, we also refer to them as units or nodes.

•Basically, each neuron


receives input from many other neurons.
changes its internal state (activation) based on the current input.
sends one output signal to many other neurons, possibly including
its input neurons (recurrent network).
ARTIFICIAL NEURAL NETWORK
•Information is transmitted as a series of electric impulses, so-called spikes.

•The frequency and phase of these spikes encodes the information.

•In biological systems, one neuron can be connected to as many as 10,000


other neurons.

•Usually, a neuron receives its information from other neurons in a confined


area, its so-called receptive field.

HOW DO ANNS WORK?
An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to simulate
the information processing capabilities of its biological exemplar.

ANNs are typically composed of a great number of interconnected


artificial neurons. The artificial neurons are simplified models of their
biological counterparts.

ANN is a technique for solving problems by constructing software that


works like our brains.

HOW DO OUR BRAINS WORK?
The brain is a massively parallel information processing
system.
Our brains are a huge network of processing elements. A
typical brain contains a network of 10 billion neurons.

HOW DO OUR BRAINS WORK?
 A processing element

Dendrites: Input
Cell body: Processor
Synapse: Link
Axon: Output
How do our brains work?
 A processing element

A neuron is connected to other neurons through about 10,000


synapses

How do our brains work?
 A processing element

A neuron receives input from other neurons. Inputs are combined.

How do our brains work?
 A processing element

Once input exceeds a critical level, the neuron discharges a spike ‐


an electrical pulse that travels from the body, down the axon, to
the next neuron(s)
How do our brains work?
 A processing element

The axon endings almost touch the dendrites or cell body of the
next neuron.

How do our brains work?
 A processing element

Transmission of an electrical signal from one neuron to the next is


effected by neurotransmitters.

How do our brains work?
 A processing element

Neurotransmitters are chemicals which are released from the first neuron
and which bind to the second.
How do our brains work?
 A processing element

This link is called a synapse. The strength of the signal that


reaches the next neuron depends on factors such as the amount of
neurotransmitter available.
How do ANNs work?

An artificial neuron is an imitation of a human neuron


How do ANNs work?
• Now, let us have a look at the model of an artificial neuron.

How do ANNs work?
Input: x1, x2, …, xm
Processing: ∑ = x1 + x2 + … + xm = y
Output: y

How do ANNs work?
Not all inputs are equal
Input: x1, x2, …, xm
Weights: w1, w2, …, wm
Processing: ∑ = x1w1 + x2w2 + … + xmwm = y
Output: y
How do ANNs work?
The signal is not passed down to the
next neuron verbatim
Input: x1, x2, …, xm
Weights: w1, w2, …, wm
Processing: ∑
Transfer function (activation function): f(vk)
Output: y
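The neuron model described above (weighted sum of inputs, passed through a transfer function) can be written directly in code. The inputs, weights, and the choice of a sigmoid as the transfer function here are illustrative.

```python
import math

# A single artificial neuron: weighted sum of inputs, passed through a
# transfer (activation) function -- here a sigmoid.
def neuron(x, w, f=lambda v: 1.0 / (1.0 + math.exp(-v))):
    v = sum(xi * wi for xi, wi in zip(x, w))   # v = x1*w1 + ... + xm*wm
    return f(v)

y = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1])   # v = 0.5 - 0.5 + 0.3 = 0.3
print(round(y, 3))   # sigmoid(0.3) = 0.574
```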
The output is a function of the input that is
affected by the weights and the transfer
functions

Activation Functions:

Forward propagation: Vectorized implementation

Neural Network learning its own features

Other Network Architecture

Simple example: AND

Simple example: OR

Artificial Neural Networks
 An ANN can:
1. compute any computable function, by the
appropriate selection of the network
topology and weights values.

2. learn from experience!

 Specifically, by trial‐and‐error

Learning by trial‐and‐error
Continuous process of:
Trial:
Processing an input to produce an output (In terms of ANN: Compute
the output function of a given input)
Evaluate:
Evaluating this output by comparing the actual output with
the expected output.
Adjust:
Adjust the weights.

How it works?
 Set initial values of the weights randomly.
 Input: truth table of the XOR
 Do
 Read input (e.g. 0, and 0)
 Compute an output (e.g. 0.60543)
 Compare it to the expected output. (Diff= 0.60543)
 Modify the weights accordingly.
 Loop until a condition is met
 Condition: certain number of iterations
 Condition: error threshold
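The loop above can be sketched as a small program: a 2-4-1 sigmoid network trained on the XOR truth table. The layer sizes, learning rate, seed, and iteration cap are arbitrary choices for illustration, not values from the slides.

```python
import math
import random

random.seed(1)

def sig(v):
    return 1.0 / (1.0 + math.exp(-v))

n_hid = 4
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(n_hid)]  # 2 inputs + bias
W2 = [random.uniform(-1, 1) for _ in range(n_hid + 1)]                  # hidden + bias
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]             # XOR truth table
lr = 0.5

def forward(x):
    h = [sig(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W1]
    y = sig(sum(W2[i] * h[i] for i in range(n_hid)) + W2[n_hid])
    return h, y

first_err = None
for _ in range(5000):                        # loop until a condition is met
    err = 0.0
    for x, t in data:
        h, y = forward(x)                    # read input, compute an output
        err += (y - t) ** 2                  # compare it to the expected output
        d_out = (y - t) * y * (1 - y)        # modify the weights accordingly
        for i in range(n_hid):
            d_hid = d_out * W2[i] * h[i] * (1 - h[i])
            W1[i][0] -= lr * d_hid * x[0]
            W1[i][1] -= lr * d_hid * x[1]
            W1[i][2] -= lr * d_hid
            W2[i] -= lr * d_out * h[i]
        W2[n_hid] -= lr * d_out
    if first_err is None:
        first_err = err
    if err < 0.01:                           # condition: error threshold
        break

print(round(first_err, 3), round(err, 3))    # the error shrinks as training proceeds
```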

Design Issues
 Initial weights (small random values ∈[‐1,1])
 Transfer function (How the inputs and the weights are
combined to produce output?)
 Error estimation
 Weights adjusting
 Number of neurons
 Data representation
 Size of training set

Transfer Functions
 Linear: The output is proportional to the total weighted
input.

 Threshold: The output is set at one of two values, depending


on whether the total weighted input is greater than or less
than some threshold value.

 Non‐linear: The output varies continuously but not linearly


as the input changes.

Error Estimation
 The root mean square error (RMSE) is a frequently-used
measure of the differences between values predicted by a
model or an estimator and the values actually observed from
the thing being modeled or estimated
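The definition above is a one-liner in code; the predicted and observed values below are toy numbers.

```python
import math

# RMSE: square root of the mean squared difference between predicted
# and observed values (toy numbers).
predicted = [0.9, 0.2, 0.8, 0.4]
observed  = [1.0, 0.0, 1.0, 0.0]

rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                 / len(observed))
print(round(rmse, 3))   # 0.25
```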

Weights Adjusting
 After each iteration, weights should be adjusted to
minimize the error.
– All possible weights
– Back propagation

Back Propagation
 Back-propagation is an example of supervised learning: it is used at
each layer to minimize the error between the layer’s response and
the actual data
 The error at each hidden layer is an average of the evaluated error
 Hidden layer networks are trained this way

ANN EXAMPLE

BACKPROPAGATION STEP BY STEP

• Backpropagation is a commonly used technique
for training neural networks.

BACKPROPAGATION STEP BY STEP

• Neural network training is about finding weights that minimize prediction error. We
usually start our training with a set of randomly generated weights.
• Then, backpropagation is used to update the weights in an attempt to correctly map
arbitrary inputs to outputs.

BACKPROPAGATION STEP BY STEP

The initial weights are as follows: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14 and w6 = 0.15

BACKPROPAGATION STEP BY STEP
Dataset
Our dataset has one sample with two inputs and one output.

Our single sample is as follows: inputs = [2, 3] and output = [1].

BACKPROPAGATION STEP BY STEP
Forward Pass
We will use given weights and inputs to predict the output. Inputs are multiplied by weights;
the results are then passed forward to next layer.

BACKPROPAGATION STEP BY STEP
Calculating Error
Now, it’s time to find out how our network performed by calculating the difference between the
actual output and predicted one. It’s clear that our network output, or prediction, is not even
close to the actual output. We can calculate the difference or the error as follows.

BACKPROPAGATION STEP BY STEP
Reducing Error
• The main goal of the training is to reduce the error or the difference
between prediction and actual output. Since actual output is constant,
“not changing”, the only way to reduce the error is to
change prediction value.
• how to change prediction value?

• By decomposing prediction into its basic elements we can find


that weights are the variable elements affecting prediction value. In
other words, in order to change prediction value, we need to
change weights values.
BACKPROPAGATION STEP BY STEP

How to change/update the weights so that the
error is reduced?
Backpropagation!
BACKPROPAGATION STEP BY STEP
• Backpropagation, short for “backward propagation of errors”, is a
mechanism used to update the weights using gradient descent.

• It calculates the gradient of the error function with respect to the neural
network’s weights. The calculation proceeds backwards through the
network.

Gradient descent is an iterative optimization algorithm for finding the


minimum of a function; in our case we want to minimize the error function.
To find a local minimum of a function using gradient descent, one takes
steps proportional to the negative of the gradient of the function at the
current point.
BACKPROPAGATION STEP BY STEP

• For example, to update w6, we take the current w6 and


subtract the partial derivative of error function with
respect to w6.

• Optionally, we multiply the derivative of


the error function by a selected number to make sure
that the new updated weight is minimizing the error
function; this number is called learning rate.
BACKPROPAGATION STEP BY STEP

BACKPROPAGATION STEP BY STEP
So to update w6 we can apply the following formula

Similarly, we can derive the update formula for w5 and any other weights existing between the output and
the hidden layer.

BACKPROPAGATION STEP BY STEP
However, when moving backward to update w1, w2, w3 and w4 existing between input and hidden
layer, the partial derivative for the error function with respect to w1, for example, will be as following.

BACKPROPAGATION STEP BY STEP
We can find the update formula for the remaining weights w2, w3 and w4 in the same way.
In summary, the update formulas for all weights will be as following:

BACKPROPAGATION STEP BY STEP
We can rewrite the update formulas in matrices as following

BACKPROPAGATION STEP BY STEP
Backward Pass
Using derived formulas we can find the new weights.

Learning rate is a hyperparameter, which means that we need to choose its value manually.

BACKPROPAGATION STEP BY STEP
Now, using the new weights we will repeat the forward pass.

We can notice that the prediction 0.26 is a little bit closer to the actual
output than the previously predicted one, 0.191. We can repeat the same
process of backward and forward passes until the error is close or equal
to zero.
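The walkthrough above can be reproduced in a few lines of numpy. Two assumptions, since the slides do not state them explicitly: the network is linear (no activation function) and the loss is 0.5·(prediction − target)², with a learning rate of 0.05; these choices reproduce the 0.191 → 0.26 numbers.

```python
import numpy as np

x = np.array([2.0, 3.0])          # inputs
t = 1.0                           # target output
W1 = np.array([[0.11, 0.21],      # w1, w2 -> h1
               [0.12, 0.08]])     # w3, w4 -> h2
W2 = np.array([0.14, 0.15])       # w5, w6 -> output
lr = 0.05                         # learning rate (assumption)

h = W1 @ x                        # hidden layer: [0.85, 0.48]
y = W2 @ h                        # prediction
print(round(y, 3))                # 0.191

delta = y - t                     # dE/dy for E = 0.5*(y - t)^2
grad_W2 = delta * h               # dE/dw5, dE/dw6
grad_W1 = np.outer(delta * W2, x) # dE/dw1..w4 (uses the old W2)
W2 = W2 - lr * grad_W2
W1 = W1 - lr * grad_W1

y2 = W2 @ (W1 @ x)                # second forward pass with updated weights
print(round(y2, 2))               # 0.26 -- closer to the target
```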

BACKPROPAGATION STEP BY STEP

https://hmkcode.com/netflow/

Applications Areas
 Function approximation
 including time series prediction and modeling.
 Classification
 including patterns and sequences recognition, novelty
detection and sequential decision making.
 (radar systems, face identification, handwritten text recognition)

 Data processing
including filtering, clustering, blind source separation and
compression.
 (data mining, e-mail spam filtering)

Advantages / Disadvantages
 Advantages
 Adapt to unknown situations
 Powerful, it can model complex functions.
 Ease of use, learns by example, and very little user
domain‐specific expertise needed
 Disadvantages
 Forgets
 Not exact
 Large complexity of the network structure

Conclusion
 Artificial Neural Networks are an imitation of the biological
neural networks, but much simpler ones.
 The computing would have a lot to gain from neural networks.
Their ability to learn by example makes them very flexible and
powerful; furthermore, there is no need to devise an algorithm in
order to perform a specific task.

Conclusion
 Neural networks also contribute to areas of research such as
neurology and psychology. They are regularly used to model
parts of living organisms and to investigate the internal
mechanisms of the brain.
 Many factors affect the performance of ANNs, such as the
transfer functions, size of training sample, network topology,
weights adjusting algorithm, …

Q. How does each neuron work in ANNS?
What is back propagation?
 A neuron: receives input from many other neurons;
 changes its internal state (activation) based on the
current input;
 sends one output signal to many other neurons, possibly
including its input neurons (making the ANN a recurrent network).

 Back-propagation is a type of supervised learning, used at each


layer to minimize the error between the layer’s response and the
actual data.

THANKS

MACHINE LEARNING (ML-19)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
Gradient Descent
Stochastic Gradient Descent
Gradient Descent Vs Stochastic Gradient Descent

OPTIMIZATION
 Optimization is always the ultimate goal whether you are dealing with a real life
problem or building a software product.
 Optimization basically means getting the optimal output for your problem.

Broad applications of Optimization


 There are various kinds of optimization techniques which are applied across various
domains such as
 Mechanics – For eg: In deciding the surface of aerospace design
 Economics – For eg: Cost minimization
 Physics – For eg: Optimization time in quantum computing
 Optimization has many more advanced applications like deciding optimal route for
transportation, shelf-space optimization, etc.
OPTIMIZATION
 Many popular machine learning algorithms depend upon
optimization techniques, such as
 linear regression,
 k-nearest neighbors,
 neural networks, etc.
 The applications of optimization are limitless, and it is a
widely researched topic in both academia and
industry.
WHAT IS GRADIENT DESCENT?
To explain Gradient Descent I’ll use the classic mountaineering example.

GRADIENT DESCENT ALGORITHM AND ITS VARIANTS
 Gradient Descent is an optimization algorithm used for minimizing the cost
function in various machine learning algorithms. It is basically used for
updating the parameters of the learning model.

Types of gradient Descent:


1.Batch Gradient Descent
2.Stochastic Gradient Descent
3.Mini Batch gradient descent
BATCH GRADIENT DESCENT
Batch Gradient Descent:
• This is a type of gradient descent which processes all the training examples
for each iteration of gradient descent.
• If the number of training examples is large, then batch gradient descent is
computationally very expensive.
• If the number of training examples is large, then batch gradient descent is
not preferred.
• Instead, we prefer to use stochastic gradient descent or mini-batch gradient
descent.

BATCH GRADIENT DESCENT

STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent:
• This is a type of gradient descent which processes 1 training example per
iteration.
• Hence, the parameters are being updated even after one iteration in which
only a single example has been processed.
• Hence this is quite a bit faster than batch gradient descent.
• But again, when the number of training examples is large, it still processes
only one example at a time, which can be additional overhead for the
system, as the number of iterations will be quite large.

STOCHASTIC GRADIENT DESCENT

MINI BATCH GRADIENT DESCENT:
Mini Batch gradient descent:
• This is a type of gradient descent which works faster than both batch
gradient descent and stochastic gradient descent.
• Here b examples where b<m are processed per iteration.
• So even if the number of training examples is large, it is processed in
batches of b training examples in one go.
• Thus, it works for larger training sets, and with a smaller number
of iterations.
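The three variants described above differ only in how many examples each update uses, so one routine can cover them all. This is an illustrative sketch for least-squares linear regression; the toy data, learning rate, and epoch count are choices made here.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=300, seed=0):
    """batch_size = m -> batch GD; 1 -> stochastic GD; 1 < b < m -> mini-batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                 # shuffle each epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            # gradient of mean squared error on this (mini-)batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= lr * grad
    return theta

# toy data: y = 2*x + 1 (second column of X is the intercept term)
X = np.array([[0, 1], [1, 1], [2, 1], [3, 1]], dtype=float)
y = np.array([1.0, 3.0, 5.0, 7.0])

print(gradient_descent(X, y, batch_size=4))   # batch GD       -> ~[2, 1]
print(gradient_descent(X, y, batch_size=1))   # stochastic GD  -> ~[2, 1]
print(gradient_descent(X, y, batch_size=2))   # mini-batch GD  -> ~[2, 1]
```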

MINI BATCH GRADIENT DESCENT:

THANKS

MACHINE LEARNING (ML-20)
AN INTRODUCTION TO: DEEP LEARNING
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oAn introduction to: Deep Learning

aka or related to
Deep Neural Networks
Deep Structural Learning
Deep Belief Networks
etc.

Machine Learning Basics
Machine learning is a field of computer science that gives computers the ability to
learn without being explicitly programmed

Training: labeled data → machine learning algorithm → learned model
Prediction: new data → learned model → prediction

Methods that can learn from and make predictions on data


Types of Learning
Supervised: Learning with a labeled training set
Example: email classification with already labeled emails

Unsupervised: Discover patterns in unlabeled data


Example: cluster similar documents based on text

Reinforcement learning: learn to act based on feedback/reward


Example: learn to play Go, reward: win or lose

[Figure: examples of ML task types – classification, regression, clustering, anomaly detection, sequence labeling]
http://mbjoseph.github.io/2013/11/27/measure.html

ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations
and input features
ML becomes just optimizing weights to best make a final prediction
What is Deep Learning (DL) ?
A machine learning subfield concerned with learning representations of data;
exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a
hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it and respond in
useful ways.

https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
Why is DL useful?
o Manually designed features are often over-specified, incomplete and take a long
time to design and validate
o Learned Features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable framework for
representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data

In ~2010 DL started outperforming other


ML techniques
first in speech and vision, then NLP
State of the art in …

Several big improvements in recent years in NLP:
 Machine Translation
 Sentiment Analysis
 Dialogue Agents
 Question Answering
 Text Classification
Leverage different levels of representation:
o words & characters
o syntax & semantics
Neural Network Intro
Weights

Each layer computes a = f(Wx + b): a weighted sum of the inputs plus a bias,
passed through an activation function f.

Activation functions

How do we train?

4 + 2 = 6 neurons (not counting inputs)


[3 x 4] + [4 x 2] = 20 weights
4 + 2 = 6 biases
26 learnable parameters

Demo
Training
1. Sample labeled data (a batch)
2. Forward it through the network, get predictions
3. Back-propagate the errors
4. Update the network weights

Optimize (min. or max.) the objective/cost function J(θ)


Generate error signal that measures difference between
predictions and target values

Use error signal to change the weights and get more


accurate predictions
Subtracting a fraction of the gradient moves you towards
the (local) minimum of the cost function
https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-algorithm-4106a6702d39
Gradient Descent
objective/cost function J(θ) https://playground.tensorflow.org/

θj = θj − α ∂J(θ)/∂θj (update each element of θ)

θ = θ − α ∇θ J(θ) (matrix notation for all parameters)

α = learning rate

Recursively apply chain rule through each node


One forward pass
Text (input) representation
TFIDF
Word embeddings
….

[Figure: a word-vector input passed through the network and a softmax layer, producing scores for sentiment classes: very positive, positive, negative, very negative]
Activation functions
Non-linearities needed to learn complex (non-linear) representations of data, otherwise
the NN would be just a linear function

http://cs231n.github.io/assets/nn1/layer_sizes.jpeg

More layers and neurons can approximate more complex functions

Full list: https://en.wikipedia.org/wiki/Activation_function


Activation: Sigmoid
Takes a real-valued number and
“squashes” it into the range between 0 and 1:
σ(x) = 1 / (1 + e^(−x)), σ: ℝ → (0, 1)

http://adilmoujahid.com/images/activation.png

+ Nice interpretation as the firing rate of a neuron


• 0 = not firing at all
• 1 = fully firing

- Sigmoid neurons saturate and kill gradients, thus NN will barely learn
• when the neuron’s activations are 0 or 1 (saturated)
• gradient at these regions almost zero
• almost no signal will flow to its weights
• if initial weights are too large then most neurons would saturate
Activation: Tanh
Takes a real-valued number and
“squashes” it into the range between -1 and 1:
tanh: ℝ → (−1, 1)

http://adilmoujahid.com/images/activation.png

- Like sigmoid, tanh neurons saturate


- Unlike sigmoid, output is zero-centered
- Tanh is a scaled sigmoid: tanh =2 2 −1
Activation: ReLU
Takes a real-valued number and thresholds it at zero: f(x) = max(0, x)

http://adilmoujahid.com/images/activation.png

Most deep networks use ReLU nowadays
• Trains much faster
  • accelerates the convergence of SGD
  • due to its linear, non-saturating form
• Less expensive operations
  • compared to sigmoid/tanh (exponentials etc.)
  • implemented by simply thresholding a matrix at zero
• More expressive
• Mitigates the vanishing-gradient problem
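The three activations above, and the saturation behaviour the slides warn about, can be sketched directly. These are the standard textbook definitions; the function names are mine.

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x)), squashes into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    """Derivative sigma(x) * (1 - sigma(x)); vanishes when the unit saturates."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    """f(x) = max(0, x): just thresholding, no exponentials."""
    return np.maximum(0.0, x)

# tanh is a scaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
x = 0.7
assert abs(np.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12

# Saturation: far from zero, the sigmoid gradient is nearly zero,
# so almost no error signal flows back to the weights.
tiny_gradient = d_sigmoid(10.0)
```

Evaluating `d_sigmoid` at a large input shows why saturated sigmoid units barely learn, while `relu` keeps a constant gradient of 1 for all positive inputs.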
Overfitting
http://wiki.bethanycrane.com/overfitting-of-data

Learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data)

https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
Dropout
• Randomly drop units (along with their connections)
during training
• Each unit retained with fixed probability p,
independent of other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural
networks from overfitting." Journal of machine learning research (2014)

L2 = weight decay
• Regularization term that penalizes big weights, added to the objective: J̃(θ) = J(θ) + λ‖w‖²
• The weight decay value λ determines how dominant regularization is during gradient computation
• Big weight decay coefficient → big penalty for big weights

Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
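The first two regularizers above can be sketched in a few lines. Note one assumption: this uses the common "inverted dropout" variant, which scales kept units by 1/p at training time so nothing changes at test time, whereas the Srivastava et al. paper scales activations at test time instead. All names and values are illustrative.

```python
import numpy as np

def dropout(activations, p_keep, rng):
    """Inverted dropout: drop each unit independently, keep with prob p_keep,
    and scale survivors by 1/p_keep so the expected activation is unchanged."""
    mask = (rng.random(activations.shape) < p_keep) / p_keep
    return activations * mask

def l2_penalty(w, lam):
    """L2 weight decay term lambda * ||w||^2, added to the objective."""
    return lam * np.sum(w ** 2)

rng = np.random.default_rng(0)
a = np.ones(1000)
dropped = dropout(a, p_keep=0.8, rng=rng)
# At test time dropout is simply disabled; thanks to the 1/p scaling,
# the expected activation seen by the next layer stays the same.
penalty = l2_penalty(np.array([1.0, 2.0]), lam=0.5)
```

Roughly 20% of `dropped` is zero and the rest is 1/0.8 = 1.25, so its mean stays near 1.0; `p_keep` is the hyper-parameter p to be tuned.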
Loss functions and output
Classification:
  Training examples: R^n × {class_1, ..., class_n} (one-hot encoding)
  Output layer: soft-max [maps R^n to a probability distribution]
  Cost (loss) function: cross-entropy
    J = −(1/N) Σ_i [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Regression:
  Training examples: R^n × R^m
  Output layer: linear (identity) f(x) = x, or sigmoid
  Cost (loss) function: Mean Squared Error
    J = (1/N) Σ_i (y^(i) − ŷ^(i))²
  or Mean Absolute Error
    J = (1/N) Σ_i |y^(i) − ŷ^(i)|
List of loss functions
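The classification column above (soft-max output plus cross-entropy loss) can be sketched as follows; this uses the multi-class form with a one-hot target, and all values and names are illustrative.

```python
import numpy as np

def softmax(z):
    """Map a score vector in R^n to a probability distribution."""
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, one_hot):
    """Multi-class cross-entropy: -sum_k y_k * log(p_k) for a one-hot target."""
    return -np.sum(one_hot * np.log(probs))

scores = np.array([2.0, 1.0, 0.1])   # raw scores from the output layer
probs = softmax(scores)               # sums to 1, keeps the ordering of scores
target = np.array([1.0, 0.0, 0.0])    # one-hot encoding of the true class
loss = cross_entropy(probs, target)
```

The loss is low when the probability mass sits on the true class and grows without bound as that probability approaches zero, which is the error signal the network trains on.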
Convolutional Neural Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards

Example: for “this takes too long”, compute vectors for:
this takes, takes too, too long, this takes too, takes too long, this takes too long

[Figure: convolution of an input matrix with a 3x3 filter]
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Convolutional Neural Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
max pool with 2x2 filters and stride 2
https://shafeentejani.github.io/assets/images/pooling.gif
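The two building blocks above can be written as plain loops. This is a naive sketch: a 'valid' 2-D convolution (strictly speaking cross-correlation, as in most deep-learning libraries) and a 2x2 max pool with stride 2; the inputs are made up.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image ('valid' mode: no padding)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2, stride=2):
    """Keep the maximum of each size x size window, moving by `stride`."""
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

feat = conv2d_valid(np.ones((3, 3)), np.ones((2, 2)))  # each window sums to 4
img = np.arange(16.0).reshape(4, 4)
pooled = max_pool(img)  # 2x2 filters, stride 2, halves each dimension
```

Real frameworks vectorize this, but the loops show exactly what the filter and the pooling window do.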
CNN for text classification

Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment
Classification." SemEval@ NAACL-HLT. 2015.
CNN with multiple filters

Kim, Y. “Convolutional Neural Networks for Sentence Classification”, EMNLP (2014)
sliding over 3, 4 or 5 words at a time
Recurrent Neural Networks (RNNs)
Main RNN idea for text:
Condition on all previous words
Use same set of weights at all time steps:
h_t = σ(W^(hh) h_{t−1} + W^(hx) x_t)

https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg

• Stack them up, Lego fun!
• Vanishing gradient problem

https://discuss.pytorch.org/uploads/default/original/1X/6415da0424dd66f2f5b134709b92baa59e604c55.jpg
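A single recurrence step, with the same weight matrices reused at every time step, can be sketched like this; a sigmoid cell and all sizes and random values are illustrative assumptions.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_hx):
    """One vanilla RNN step: h_t = sigmoid(W_hh @ h_{t-1} + W_hx @ x_t)."""
    return 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))

rng = np.random.default_rng(42)
hidden, inp = 4, 3
W_hh = rng.normal(size=(hidden, hidden)) * 0.1  # recurrent weights
W_hx = rng.normal(size=(hidden, inp)) * 0.1     # input weights

# The SAME W_hh and W_hx are applied at every step (weight sharing),
# so h conditions on all previous inputs in the sequence.
h = np.zeros(hidden)
for x_t in [rng.normal(size=inp) for _ in range(5)]:
    h = rnn_step(h, x_t, W_hh, W_hx)
```

After the loop, `h` summarizes the whole 5-step sequence through one shared set of weights.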
Bidirectional RNNs
Main idea: incorporate both left and right context
Output may depend not only on the previous elements in the sequence, but also on future elements.

h_t^(f) = σ(W_f^(hh) h_{t−1}^(f) + W_f^(hx) x_t)    (forward, left-to-right)
h_t^(b) = σ(W_b^(hh) h_{t+1}^(b) + W_b^(hx) x_t)    (backward, right-to-left)
y_t = g([h_t^(f) ; h_t^(b)])

past and future around a single token

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

two RNNs stacked on top of each other

output is computed based on the hidden states of both RNNs: [h_t^(f) ; h_t^(b)]
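A bidirectional pass can be sketched as two independent RNNs, one run left-to-right and one right-to-left, whose hidden states are concatenated per token; tanh cells, all sizes, and all names here are assumptions for illustration.

```python
import numpy as np

def rnn_pass(xs, W_hh, W_hx):
    """Run a simple tanh RNN over a sequence, return the hidden state per step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x)
        states.append(h)
    return states

def birnn(xs, fwd_weights, bwd_weights):
    """Two RNNs with separate weights; output per token is [h_fwd_t ; h_bwd_t]."""
    fwd = rnn_pass(xs, *fwd_weights)               # left-to-right
    bwd = rnn_pass(xs[::-1], *bwd_weights)[::-1]   # right-to-left, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
H, D, T = 3, 2, 4                                   # hidden size, input size, steps
fwd_w = (rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, D)) * 0.1)
bwd_w = (rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, D)) * 0.1)
xs = [rng.normal(size=D) for _ in range(T)]
ys = birnn(xs, fwd_w, bwd_w)
```

Each output vector has length 2H because it stacks the forward state (past context) and the backward state (future context) around that token.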
THANKS

