
US-TCS-504

Learning
• There are three types of learning:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning
• In supervised learning, an AI system is presented with labelled data.
• This means that each example is already tagged with the correct answer.
• It can be compared to learning which takes
place in the presence of a supervisor or a
teacher.
Supervised learning
Types of supervised learning
Regression
• Regression is a supervised learning task where the output has a continuous value.
• It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
Classification
• Classification means grouping the output into a class.
• If the algorithm tries to label input into two
distinct classes, it is called binary
classification.
• Selecting between more than two classes is
referred to as multiclass classification.
Classification
Unsupervised learning
• Unsupervised learning is a type of machine learning in which models are trained on an unlabelled dataset and are allowed to act on that data without any supervision.
Types of Unsupervised learning
Clustering
• Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.
Association
• An association rule learning problem is one where you want to discover rules that describe large portions of your data, such as "people who buy X also tend to buy Y".
Reinforcement Learning
• A reinforcement learning algorithm, or agent,
learns by interacting with its environment.
• The agent receives rewards by performing
correctly and penalties for performing incorrectly.
• The agent learns without intervention from a
human by maximizing its reward and minimizing
its penalty.
• It is closely related to dynamic programming and trains algorithms using a system of reward and punishment.
Models
• There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows:

• Classification
• Prediction
Prediction
• Prediction models predict continuous-valued functions.

• The prediction technique can be used in sales, for example, to predict future profit or loss.
Two Steps

Classification
• Classification models predict categorical class labels.

• A classification model may be built to categorize credit card transactions as either legitimate or fraudulent.
Binary Classification
• Binary classification refers to predicting one of two
classes.
• y=f(x), where y = categorical output
Multi class Classification
• If a classification problem has more than two outcomes, it is called multi-class classification.
Example: classification of types of crops
Two steps
1. Developing the Classifier or model creation
2. Applying classifier for classification
Step 1 (Developing)
• This step is the learning step or the learning phase.

• In this step, the classification algorithm builds the classifier.

• The classifier is built from the training set, which is made up of database tuples and their associated class labels.

• Each tuple in the training set belongs to a predefined category or class; these tuples are also referred to as samples, objects, or data points.
Step 1

Step 2
• In this step, the classifier is used for
classification.
• Here the test data is used to estimate the
accuracy of classification rules.
• The classification rules can be applied to the
new data tuples if the accuracy is considered
acceptable.
Step 2

Types Of Learners In Classification
• Lazy Learners – Lazy learners simply store the training data and wait until test data appears. E.g. k-nearest neighbour.

• Eager Learners – Eager learners construct a classification model from the given training data before receiving data for prediction.
• Due to this, they take a lot of time in training and less time for a prediction. E.g. decision trees, Naive Bayes, artificial neural networks.
Process

Applications
• Sentiment Analysis
• Document Classification
• Image Classification
Statement of the Problem
Training Set: T = {t1, …, tn}, a set of n examples.
Each example ti:
• is characterized by m features (ti(A1), …, ti(Am))
• belongs to one of k classes (Ci : 1 ≤ i ≤ k)
GOAL
• From the training data, find a model that describes the classes accurately and concisely using the data's features.
• The model will then be used to assign class labels to unknown (previously unseen) records.
KNN Classification

• The K-NN algorithm assumes similarity between the new case and the available cases, and puts the new case into the category that is most similar to the available categories.
• KNN is a non-parametric and lazy learning
algorithm.
KNN Classification
• K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the classification computation only when a new point has to be classified.
KNN
• Suppose P1 is the point whose label needs to be predicted.
• First, find the k points closest to P1, then classify P1 by a majority vote of its k neighbours.
• Each neighbour votes for its class, and the class with the most votes is taken as the prediction.
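• As a rough illustration of this majority-vote procedure, the sketch below implements a minimal k-NN classifier in plain Python; the function names, example points and the value of k are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    # distance of the query point to every stored training point
    distances = [(euclidean(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # keep the k nearest neighbours
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # each neighbour votes for its class; the majority wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# e.g. knn_predict([(1, 0), (2, 1), (8, 9)], ["A", "A", "B"], (1.5, 0.5), k=3) -> "A"
```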
KNN

Optimal value of K ??

Optimal value of K

KNN example
• Here male is denoted with the numeric value 0 and female with 1.
• Find in which class Angelina will lie, given k = 3, age = 5 and gender = 1 (female).

NAME     AGE  GENDER  CLASS OF SPORTS
Ajay      32    0     Football
Mark      40    0     Neither
Sara      16    1     Cricket
Zaira     34    1     Cricket
Sachin    55    0     Neither
Rahul     40    0     Cricket
Pooja     20    1     Neither
Smith     15    0     Cricket
Laxmi     55    1     Football
Michael   15    0     Football
Finding distance

KNN
• To find the distance between two points:
d = √((age2 - age1)² + (gender2 - gender1)²)
• Distance between Angelina and Ajay:
d = √((5 - 32)² + (1 - 0)²)
d = √(729 + 1) = √730 ≈ 27.02
The distance calculated
Ajay 27.02
Mark 35.01
Sara 11.00
Zaira 29.00
Sachin 50.01
Rahul 35.01
Pooja 15.00
Smith 10.05
Laxmi 50.00
Michael 10.05
KNN Classification
• As the value of k = 3, the 3 closest points are:

Smith 10.05 Cricket
Michael 10.05 Football
Sara 11.00 Cricket

• So, according to the KNN algorithm, Angelina will be in the class of people who like cricket.
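• The whole example (distances plus the majority vote) can be reproduced with the short script below; the ages, genders and sports come from the table above, while the variable names and layout are chosen here for illustration.

```python
import math
from collections import Counter

# (name, age, gender, sport) from the example table; male = 0, female = 1
people = [
    ("Ajay", 32, 0, "Football"),  ("Mark", 40, 0, "Neither"),
    ("Sara", 16, 1, "Cricket"),   ("Zaira", 34, 1, "Cricket"),
    ("Sachin", 55, 0, "Neither"), ("Rahul", 40, 0, "Cricket"),
    ("Pooja", 20, 1, "Neither"),  ("Smith", 15, 0, "Cricket"),
    ("Laxmi", 55, 1, "Football"), ("Michael", 15, 0, "Football"),
]

angelina = (5, 1)  # age = 5, gender = 1 (female)

# Euclidean distance from Angelina to every person
dists = [(math.sqrt((age - angelina[0]) ** 2 + (g - angelina[1]) ** 2), sport, name)
         for name, age, g, sport in people]

nearest = sorted(dists)[:3]            # the three nearest: Smith, Michael (≈10.05) and Sara (11.0)
votes = Counter(sport for _, sport, _ in nearest)
print(nearest)
print(votes.most_common(1)[0][0])      # -> "Cricket"
```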
Decision tree
• Decision tree builds classification models in the form
of a tree structure.
• It breaks down a dataset into smaller and smaller
subsets
• The final result is a tree with decision nodes and leaf
nodes.
Steps
• It begins with the original set S as the root
node.
• On each iteration of the algorithm, it iterates through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
Steps
• It then selects the attribute which has the smallest entropy or largest information gain.
• The set S is then split by the selected attribute to produce subsets of the data.
• The algorithm continues to recurse on each subset, considering only attributes never selected before.
ID3 (Iterative Dichotomiser 3)
• ID3 follows the rule that a branch with an entropy of zero is a leaf node, while a branch with entropy greater than zero needs further splitting.
Entropy
• Entropy: Entropy is a measure of the randomness
in the information being processed.
• Information Gain: Information gain is the decrease in entropy achieved when the node is split on an attribute.
• An attribute should have the highest information
gain to be selected for splitting.
• Based on the computed values of Entropy and
Information Gain, we choose the best attribute at
any particular step.
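• As a sketch of how these two quantities can be computed for a candidate split (the function and variable names below are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions in labels
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, attribute_values):
    # IG = entropy(parent) - weighted entropy of the children produced by the split
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted_child_entropy = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - weighted_child_entropy
```

• For instance, a parent node with three "Yes" and two "No" labels gives entropy(["Yes"] * 3 + ["No"] * 2) ≈ 0.971, matching the value used on the following slides.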
Example
Decision tree
• Mathematically, entropy for one attribute is represented as:
Entropy(S) = - Σ pi log2(pi)

• where S → current state, and
• pi → probability of an event (class i) in S
Entropy
Information Gain
Information gain of all Attributes
Highest gain attribute
• To determine which of Temperature, Humidity or Wind has the highest information gain, first calculate the parent entropy E(S):

• E(S) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
• Now calculate the information gain of each attribute:
IG(S, Temperature) = 0.971 - 0.4 = 0.571
IG(S, Humidity) = 0.971
IG(S, Windy) = 0.020
• For Humidity (from the above table), play occurs whenever humidity is normal and does not occur when it is high, so Humidity gives the largest information gain and is chosen for the split.
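• These values can be checked with a few lines of Python; the humidity split counts (3 "normal" rows that play, 2 "high" rows that do not) are inferred from the statement above that humidity separates the classes perfectly.

```python
import math

def H(p, n):
    # entropy of a node with p positive and n negative examples
    probs = [c / (p + n) for c in (p, n) if c > 0]
    return -sum(q * math.log2(q) for q in probs)

E_S = H(3, 2)                                            # parent: 3 play, 2 no-play
ig_humidity = E_S - (3/5) * H(3, 0) - (2/5) * H(0, 2)    # both children are pure
print(round(E_S, 3), round(ig_humidity, 3))              # 0.971 0.971
```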
Underfitting
• A statistical model or a machine learning
algorithm is said to have underfitting when it
cannot capture the underlying trend of the
data.
• Underfitting destroys the accuracy of the
machine learning model.
Overfitting
• A statistical model is said to be overfitted when it learns the training data too closely.
• When this happens, the model starts learning from the noise and inaccurate data entries in our data set, and it then performs poorly on new data.
Curve Fitting
Pruning
• Pruning is a technique used to reduce the number of nodes in the tree that have to be evaluated.
• Pruning works by removing branches whose nodes would give worse results than the ones previously examined.
Cross-Validation
• Cross-Validation (CV) is a technique used to test a model's ability to predict unseen data, i.e. data not used to train the model.
• Holdout cross-validation
• k-fold cross-validation
• Leave-one-out cross-validation (LOOCV)
• Random subsampling
Holdout cross-validation
• The simplest approach is to randomly split the available data into a training set and a test set.
• The learning algorithm produces a hypothesis h from the training set, and the accuracy of h is then evaluated on the test set.
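• A minimal sketch of such a holdout split in plain Python; the 80/20 ratio and the function name are illustrative assumptions.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    # shuffle a copy of the data, then slice off a test set of the requested size
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train set, test set)

train, test = holdout_split(list(range(10)))
```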
k- fold cross validation
• Split the data into k equal subsets.
• We then perform k rounds of learning.
• On each round, 1/k of the data is held out as a test set and the remaining examples are used as training data.
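• A sketch of the k-fold procedure described above; train_and_evaluate stands for whatever learner and accuracy measure are being used and is assumed here for illustration.

```python
def k_fold_indices(n_samples, k):
    # split the indices 0..n_samples-1 into k (nearly) equal folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def k_fold_cv(data, k, train_and_evaluate):
    # on each round, one fold is held out as the test set, the rest is training data
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_evaluate([data[j] for j in train_idx],
                                         [data[j] for j in test_idx]))
    return sum(scores) / k
```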
Leave-one-out cross-validation LOOCV
• Leave-one-out cross-validation (LOOCV) is k-fold cross-validation with k equal to N, the number of data points in the set.
• It is an exhaustive cross-validation technique.
Random subsampling
• Random subsampling validation, also referred to as Monte Carlo cross-validation, splits the dataset randomly into training and validation sets.
• In this technique, on each iteration a set of data points is randomly chosen from the dataset to form the test set, and the remaining data forms the training set.
Random subsampling
• The error rate of the model is the average of
the error rate of each iteration.
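• A sketch of random subsampling (Monte Carlo) cross-validation that averages the error rate over several random splits; the evaluate callback and the number of iterations are illustrative assumptions.

```python
import random

def monte_carlo_cv(data, evaluate, test_fraction=0.2, n_iterations=10, seed=0):
    # repeat: draw a random test set, train on the rest, record the error rate
    rng = random.Random(seed)
    errors = []
    for _ in range(n_iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        errors.append(evaluate(train, test))   # evaluate returns an error rate
    # the model's error rate is the average over all iterations
    return sum(errors) / n_iterations
```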
Evaluation Metrics
• Evaluation metrics help in determining how well the model has been trained.
1. Confusion matrix
2. Accuracy
3. Precision
4. Recall
5. ROC
Confusion matrix
• The confusion matrix visualizes the accuracy
of a classifier by comparing the actual and
predicted classes.
Evaluation metrics
• Recall or Sensitivity or TPR (True Positive Rate): TP / (TP + FN)

• Specificity or TNR (True Negative Rate): TN / (TN + FP)

• Precision: TP / (TP + FP)
Evaluation metrics
• Accuracy: (TP + TN) / (P + N)
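• These formulas (together with F1 from the next slide) translate directly into code; TP, FP, TN and FN are the four counts read off the confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    # metrics derived from the confusion matrix counts
    recall      = tp / (tp + fn)            # sensitivity / true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision   = tp / (tp + fp)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    f1          = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}
```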
F1, log loss
• ROC is a probability curve and AUC represents
the degree or measure of separability.
Bayes Theorem
• Bayes' theorem states that the probability of an event A given B equals the probability of B and A happening together divided by the probability of B:
P(A|B) = P(A ∩ B) / P(B)
Naïve Bayes
• Naive Bayes is a probabilistic machine learning
algorithm that can be used in a wide variety of
classification tasks.
• Typical applications include filtering spam,
sentiment prediction etc.
Naïve Bayes
• It is a classification technique based on Bayes’
Theorem with an assumption of
independence among predictors.
• Naive Bayes classifier assumes that the
presence of a particular feature in a class is
unrelated to the presence of any other
feature.
Naïve Bayes
• Given multiple features x1, x2, x3, …, xn and output y, the posterior probability can then be written as:
P(y | x1, …, xn) = [P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y)] / P(x1, …, xn)
Naïve Bayes
• The Naive Bayes classifier combines this
model with a decision rule. One common rule
is to pick the hypothesis that’s most probable.
• This is known as the maximum a
posteriori or MAP decision rule.
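• A minimal sketch of the MAP decision rule for Naive Bayes with categorical features: each class is scored by its prior multiplied by the likelihood of every observed feature value, and the highest-scoring class is returned. The dictionary layout used here is an assumption for illustration.

```python
def naive_bayes_predict(priors, likelihoods, sample):
    # priors:      {class: P(class)}
    # likelihoods: {class: {feature: {value: P(value | class)}}}
    # sample:      {feature: value}
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in sample.items():
            score *= likelihoods[cls][feature].get(value, 0.0)
        scores[cls] = score
    # MAP rule: pick the class with the highest posterior score
    return max(scores, key=scores.get), scores
```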
Dataset
In the learning phase, we will compute the likelihood tables.
Calculate P(Yes) and P(No)
Naïve Bayes
• If we have a new set of weather conditions, we can classify whether we will play football or not based on the likelihoods from the learning phase.
New Sample
• Outlook=Sunny,
• Temperature=Mild,
• Humidity=High and
• Wind=Weak

• Based on this, can we predict whether we will play or not?
Probability to Play
• P(play = Yes | x) =
2/9 * 4/9 * 3/9 * 6/9 * 9/14 = 0.0141
Probability to No Play
• P(play = No | x) =
3/5 * 2/5 * 4/5 * 2/5 * 5/14 = 0.0274
• Normalize:
• P(Yes | x) = 0.0141 / (0.0141 + 0.0274) = 0.34
• P(No | x) = 0.0274 / (0.0141 + 0.0274) = 0.66
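• The arithmetic of this example can be verified in a few lines:

```python
p_yes = (2/9) * (4/9) * (3/9) * (6/9) * (9/14)   # ≈ 0.0141
p_no  = (3/5) * (2/5) * (4/5) * (2/5) * (5/14)   # ≈ 0.0274
total = p_yes + p_no
print(round(p_yes / total, 2), round(p_no / total, 2))   # 0.34 0.66
```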
Naïve Bayes
• Since P(No|x) is greater than P(Yes|x) we
classify the new instance as No.
Rule based Classifier
• Rule-based classifier makes use of a set of IF-THEN
rules for classification.
IF condition THEN conclusion
1. The IF part of the rule is called rule antecedent or
precondition.
2. The THEN part of the rule is called rule consequent.
3. The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
4. The consequent part consists of class prediction.
Example
• R1: (Give Birth = no) & (Can Fly = yes) → Birds
• R2: (Give Birth = no) & (Live in Water = yes) → Fishes
• R3: (Give Birth = yes) & (Blood Type = warm) → Mammals
• R4: (Give Birth = no) & (Can Fly = no) → Reptiles
• R5: (Live in Water = sometimes) → Amphibians
Example
• Applying rules R1–R5 above:
• Rule R1 covers a hawk => Bird
• Rule R3 covers the grizzly bear => Mammal
• A lemur triggers rule R3, so it is classified => Mammal
• A turtle triggers both R4 and R5
• A dogfish shark matches none of the rules
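• Rules R1–R5 can be written as simple antecedent/consequent pairs and matched against a record; a record may trigger several rules (the turtle) or none (the dogfish shark). The attribute names and dictionary layout below are illustrative assumptions.

```python
# each rule: (antecedent as a dict of required attribute values, consequent class)
rules = [
    ({"gives_birth": "no",  "can_fly": "yes"},        "Birds"),       # R1
    ({"gives_birth": "no",  "lives_in_water": "yes"}, "Fishes"),      # R2
    ({"gives_birth": "yes", "blood_type": "warm"},    "Mammals"),     # R3
    ({"gives_birth": "no",  "can_fly": "no"},         "Reptiles"),    # R4
    ({"lives_in_water": "sometimes"},                 "Amphibians"),  # R5
]

def matching_rules(record):
    # a rule fires when every attribute test in its antecedent is satisfied
    return [label for antecedent, label in rules
            if all(record.get(attr) == val for attr, val in antecedent.items())]

hawk  = {"gives_birth": "no", "can_fly": "yes", "lives_in_water": "no", "blood_type": "warm"}
lemur = {"gives_birth": "yes", "can_fly": "no", "lives_in_water": "no", "blood_type": "warm"}
print(matching_rules(hawk))    # ['Birds']
print(matching_rules(lemur))   # ['Mammals']
```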
Assessment of Rule
• A rule can be assessed based on two factors: coverage and accuracy.

• na = number of records covered by rule R
• nc = number of records correctly classified by rule R
• n = total number of records
Coverage of a rule
• Coverage of a rule: Fraction of records that
satisfy the rule’s antecedent describes rule
coverage.
Coverage (R) = na / n
Accuracy of a rule
• Accuracy of a rule: Fraction of records that
meet the antecedent and consequent value
defines rule accuracy.
Accuracy (R) = nc / na
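• A short sketch of these two measures, where a rule is represented by an antecedent predicate and a predicted class; the record format used here is an illustrative assumption.

```python
def rule_coverage_and_accuracy(records, antecedent, predicted_class, label_key="class"):
    # records covered by the rule: those satisfying its antecedent
    covered = [r for r in records if antecedent(r)]
    n_a = len(covered)
    # records correctly classified: covered records whose actual class matches the consequent
    n_c = sum(1 for r in covered if r[label_key] == predicted_class)
    coverage = n_a / len(records)          # Coverage(R) = na / n
    accuracy = n_c / n_a if n_a else 0.0   # Accuracy(R) = nc / na
    return coverage, accuracy
```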
Building Classification Rules

Direct Method
• Extract rules directly from data
• RIPPER, Holte’s 1R (OneR)
Indirect Method
• Extract rules from other classification models (e.g. decision trees).

Indirect Method
Convert from the decision tree:
