Chapter – 2
4. d(p, q) = 0 only if p = q (the second part of the positivity property).
• Note that (Entropy / Information Gain) and (Gini Index) are two
separate methods for deciding the best split. Also, remember that for a
two-class problem the maximum value of Entropy is 1, while for the Gini
Index it is 0.5 (see the sketch below).
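A minimal sketch (plain Python, no libraries needed) of both impurity measures; the class counts are made-up examples, used only to confirm the two-class maxima of 1 and 0.5:

```python
import math

def entropy(class_counts):
    """Entropy of a node: -sum(p_i * log2(p_i)) over classes with p_i > 0."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def gini(class_counts):
    """Gini index of a node: 1 - sum(p_i^2)."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# A 50/50 binary node maximizes both impurity measures:
print(entropy([5, 5]))   # 1.0 (max entropy for two classes)
print(gini([5, 5]))      # 0.5 (max Gini for two classes)
print(entropy([10, 0]))  # 0.0 (a pure node has zero impurity)
```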
Chapter – 4
• If the rule set is not exhaustive, then a default rule, r_d: () --> y_d, must
be added to cover the remaining cases. A default rule has an empty
antecedent and is triggered when all other rules have failed. y_d is
known as the default class and is typically assigned to the majority
class of training records not covered by the existing rules.
• If the rule set is not mutually exclusive, then a record can be
covered by several rules, some of which may predict conflicting
classes. There are two ways to overcome this problem (see the sketch
after this list):
- Ordered Rules: In this approach, the rules in a rule set are
ordered in decreasing order of their priority, which can be
defined in many ways (e.g., based on accuracy, coverage, total
description length, or the order in which the rules are
generated). An ordered rule set is also known as a decision list.
When a test record is presented, it is classified by the highest-
ranked rule that covers it. This avoids the problem of having
conflicting classes predicted by multiple classification rules.
- Unordered Rules: In this approach, a test record may trigger
several rules; the consequent of each triggered rule is counted as
a vote for its class, and the record is assigned to the class that
receives the most votes.
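A minimal sketch of an ordered rule set (decision list) with a default rule; the predicates, class labels, and rule order below are hypothetical, invented only to illustrate first-match classification:

```python
# Each rule is a (predicate, predicted_class) pair, listed in decreasing
# priority; the attributes and thresholds are made-up examples.
rules = [
    (lambda r: r["income"] > 100_000, "approve"),
    (lambda r: r["debt"] > 50_000, "reject"),
]
default_class = "reject"  # y_d: majority class of the uncovered training records

def classify(record, rules, default_class):
    """Return the class of the highest-ranked rule covering the record;
    fall back to the default rule r_d: () --> y_d when no rule fires."""
    for predicate, label in rules:
        if predicate(record):
            return label
    return default_class

print(classify({"income": 120_000, "debt": 10_000}, rules, default_class))  # approve
print(classify({"income": 30_000, "debt": 10_000}, rules, default_class))   # reject (default rule)
```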
• KNN predicts the class of a test record by calculating the distance
between the test record and all the training points, then assigning the
majority class among the k nearest of those points (see the sketch below).
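A minimal KNN sketch, assuming a small in-memory training set, Euclidean distance, and a simple majority vote (other distance measures and weighting schemes are possible):

```python
import math
from collections import Counter

def knn_predict(test_point, train_points, train_labels, k=3):
    """Classify test_point by majority vote among its k nearest training points."""
    distances = [
        (math.dist(test_point, p), label)  # Euclidean distance to each training point
        for p, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda d: d[0])     # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
y = ["a", "a", "b", "b"]
print(knn_predict((1.1, 0.9), X, y, k=3))  # 'a' (two of the three nearest are 'a')
```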
Important Stuff
• Cluster Proximity.
• Progressive Sampling (PS) starts with a small sample of the
full dataset and uses progressively larger samples until the model
accuracy no longer increases substantially (see the sketch below).
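A sketch of the PS loop under stated assumptions: a geometric sampling schedule and a caller-supplied train_and_score callable (a placeholder, not a fixed API) that fits a model on a sample and returns its held-out accuracy:

```python
def progressive_sampling(X, y, train_and_score, start=100, growth=2.0, tol=0.005):
    """Grow the training sample until accuracy stops improving substantially.
    train_and_score(X_sample, y_sample) must return the accuracy of a model
    fit on that sample; start, growth, and tol are illustrative defaults."""
    size = min(start, len(X))
    prev_acc = None
    while True:
        acc = train_and_score(X[:size], y[:size])
        if prev_acc is not None and acc - prev_acc < tol:
            break                              # no substantial gain: stop early
        prev_acc = acc
        if size == len(X):
            break                              # full dataset reached
        size = min(len(X), int(size * growth))  # geometric schedule
    return size, acc
```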
• Given two sets A and B, A - B is the set of elements of A that are not
in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then
A - B = {1} and B - A = {}, the empty set. We can define the distance d
between two sets A and B as d(A, B) = size(A - B), where size is a
function returning the number of elements in a set. This distance
measure, which is an integer value greater than or equal to 0, does
not satisfy the second part of the positivity property (d = 0 only if
A = B) or the symmetry property; the triangle inequality does hold,
since A - C is contained in (A - B) ∪ (B - C). However, these properties
can be made to hold if the dissimilarity measure is modified as follows:
d(A, B) = size(A - B) + size(B - A), i.e., the size of the symmetric
difference (see the sketch below).
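A short sketch checking both definitions on the sets from the example above:

```python
def d_asym(A, B):
    """size(A - B): fails symmetry and 'd = 0 only if A == B'."""
    return len(A - B)

def d_sym(A, B):
    """size(A - B) + size(B - A): the symmetric difference, a proper metric."""
    return len(A - B) + len(B - A)

A, B = {1, 2, 3, 4}, {2, 3, 4}
print(d_asym(A, B), d_asym(B, A))  # 1 0 -> asymmetric, and d(B, A) = 0 although A != B
print(d_sym(A, B), d_sym(B, A))    # 1 1 -> symmetric, and 0 only when the sets are equal
```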
• When a model performs very well on training data but has poor
performance on test data (new data), it is known as overfitting. In
this case, the machine learning model learns the details and noise in
the training data such that it negatively affects the performance of
the model on test data. Overfitting can happen due to low bias and
high variance.
• When a model has not learned the patterns in the training data well
and is unable to generalize to new data, it is known as
underfitting. An underfit model performs poorly even on the
training data and will produce unreliable predictions. Underfitting
occurs due to high bias and low variance. A short demo contrasting
both failure modes follows.
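A minimal demo, assuming scikit-learn is available: an unconstrained decision tree memorizes noisy training data (high training accuracy, lower test accuracy), while a depth-1 stump underfits (poor on both):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y), so memorizing it cannot generalize.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, depth in [("underfit (depth=1)", 1), ("overfit (no depth limit)", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(name, "train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))
```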
• Noisy data can appear as normal data, so noise objects are not
always outliers.