
LECTURE 6

MACHINE LEARNING
DECISION TREE
A decision tree is a type of Supervised Machine Learning in which each internal node represents
a test on a feature (e.g. whether a coin flip comes up heads or tails), each leaf node represents
a class label (the decision taken after computing all features), and branches represent conjunctions
of features that lead to those class labels. The paths from root to leaf represent classification
rules.
• Inductive bias is a preference for smaller trees over larger ones
• They classify instances by sorting them down the tree from the root to some leaf node
• Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute
Decision trees are constructed via an algorithmic approach that identifies ways to split a data
set based on different conditions. Decision Trees are a supervised learning method used for
both classification and regression tasks.
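To make this concrete, here is a minimal sketch of training and inspecting a small decision tree classifier with scikit-learn; the library choice and the toy data are assumptions for illustration, not part of the lecture.

```python
# Minimal decision tree sketch (scikit-learn and the toy data are illustrative assumptions).
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy encoded weather data: [outlook (0=sunny, 1=overcast, 2=rain), windy (0/1)]
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 0, 1, 1, 1, 0]  # class label: 0 = don't play, 1 = play

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2)
clf.fit(X, y)

print(export_text(clf, feature_names=["outlook", "windy"]))  # the learned rules
print(clf.predict([[1, 0]]))                                 # classify a new instance
```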
Decision trees – appropriate problems
Decision tree learning is generally best suited to problems with the following characteristics:

• Instances are represented by attribute-value pairs.
• The target function has discrete output values.
• Disjunctive descriptions may be required.
Examples:

• Learning to classify medical patients by their disease
• Equipment malfunctions by their cause
• Loan applicants by their likelihood of defaulting on payments
Decision trees – the algorithm
If the dataset consists of N attributes, then deciding which attribute to place at the root or at
different levels of the tree as internal nodes is a complicated step. Simply selecting an attribute
at random to be the root does not solve the problem.
To solve this attribute selection problem, researchers devised selection criteria such as:
• Entropy
• Information gain
These criteria assign a value to every attribute. The values are sorted, and attributes are
placed in the tree in that order, i.e. the attribute with the highest value (in the case of
information gain) is placed at the root.
Information gain measures how well a given attribute separates the training examples
according to their target classification.

ENTROPY.
Entropy is a measure of the randomness in the information being processed. The higher the
entropy, the harder it is to draw any conclusions from that information. Flipping a fair coin is an
example of an action that provides random information. Entropy measures the homogeneity
of a set of examples.
(General Equation) In general, Entropy(S) ≡ Σ_{i=1..c} −p_i log₂ p_i, where p_i is the proportion of S
belonging to class i.
(Formula) Given a collection S, containing positive and negative examples of some target
concept, the entropy of S relative to this boolean classification is
Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖
where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S.

Example
• Suppose S is a collection of 14 examples of some boolean concept, with 9 positive and 5 negative examples, written [9+, 5−].
• The entropy of S relative to this boolean classification is
Entropy([9+, 5−]) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940

• Entropy is 0 if all members of S belong to the same class, either positive or negative.
• Entropy is 1 when the collection contains an equal number of positive and negative examples (maximum randomness).
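The entropy computation above can be sketched in a few lines of Python; the function name is an illustrative assumption, and the first call reproduces the 0.940 value for the [9+, 5−] collection.

```python
import math

def entropy(counts):
    """Entropy of a collection, given the number of examples in each class."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))    # ~0.940 for the [9+, 5-] collection
print(entropy([14, 0]))   # 0.0 when all members belong to the same class
print(entropy([7, 7]))    # 1.0 for an equal number of positive and negative examples
```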
INFORMATION GAIN.
Information gain is the measure used by the ID3 algorithm to identify the best attribute at any
particular node of the tree. Constructing a decision tree is all about finding the attribute that
returns the highest information gain and the smallest entropy.
Information gain identifies how much impurity is removed from a set of examples if a particular
attribute is chosen at a particular node.
• The aim is to reduce that impurity so that, for a particular attribute value, all the examples are either positive or negative.
Information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
where Values(A) is the set of all possible values for attribute A and S_v is the subset of S for which attribute A has value v.

• Gain ( S , A ) is therefore the expected reduction in entropy caused by knowing the value
of attribute A

• Values(Wind) = {Weak, Strong}
• S = [9+, 5−]
• S_Weak = [6+, 2−]
• S_Strong = [3+, 3−]
• Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) · Entropy(S_v)
  = Entropy(S) − (8/14) · Entropy(S_Weak) − (6/14) · Entropy(S_Strong)
  = 0.940 − (8/14) · 0.811 − (6/14) · 1.00
  = 0.048
• Not much reduction in entropy is obtained by using Wind as an attribute.
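The worked example can be checked with a short Python sketch (the helper names are assumptions, and the entropy function is repeated here so the snippet is self-contained); it reproduces Gain(S, Wind) = 0.048 from the class counts given above.

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subset_counts):
    """Expected reduction in entropy from splitting the collection into the given subsets."""
    total = sum(parent_counts)
    remainder = sum((sum(s) / total) * entropy(s) for s in subset_counts)
    return entropy(parent_counts) - remainder

# S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
print(information_gain([9, 5], [[6, 2], [3, 3]]))  # ~0.048
```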
Given the training data, what should be the root node of the decision tree?

• Gain(S, Outlook) = 0.246 (Outlook provides the best prediction of the target concept)
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029

LECTURE 5
MACHINE LEARNING
What is Multivariate Regression?
Multivariate Regression is a supervised machine learning algorithm involving multiple data
variables for analysis. Multivariate regression is an extension of simple linear regression. It is
used when we want to predict the value of a variable based on the value of two or more
different variables. The variable we want to predict is called the Dependent Variable, while
those used to calculate the dependent variable are termed as Independent Variables.
Mathematical equation
The simple regression linear model represents a straight-line meaning y is a function of x. When
we have an extra dimension (z), the straight line becomes a plane. Here, the plane is the
function that expresses y as a function of x and z. The linear regression equation can now be
expressed as: y = m1·x + m2·z + c

• y is the dependent variable, which needs to be predicted.
• x is the first independent variable (the first input).
• z is the second independent variable (the second input).
• m1 is the slope of x.
• m2 is the slope of z.
• c is the intercept, a constant that gives the value of y when x and z are 0.
The general mathematical function/hypothesis of a multivariate regression is of the form:
y = β0 + β1·x1 + β2·x2 + … + βn·xn
where n represents the number of independent variables, β0 ~ βn represent the coefficients and x1 ~ xn are the independent variables.
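A minimal sketch of fitting such a hypothesis with NumPy's least-squares solver follows; the toy data and variable names are assumptions for illustration.

```python
import numpy as np

# Toy data: y depends on two independent variables (the columns of X); values are made up.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([8.1, 7.0, 15.9, 14.8, 20.2])

# Prepend a column of ones so the intercept beta_0 is learned along with the slopes.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares estimate of [beta_0, beta_1, beta_2].
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)

# Predict y for a new input using y = beta_0 + beta_1*x1 + beta_2*x2.
print(np.array([1.0, 2.5, 3.5]) @ beta)
```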

What is Cost Function?


The cost function assigns a cost to samples when the model's predictions differ from the
observed data; a smaller mean squared error implies better performance. In simple words, it is
a function that assigns a cost to instances where the model deviates from the observed data. In
this case, our cost is the sum of squared errors. The cost function for multiple linear regression
is given by the mean squared error:
J(β0, β1, …, βn) = (1/m) Σ_{i=1..m} (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
where m is the number of training examples, y⁽ⁱ⁾ is the observed value of example i and ŷ⁽ⁱ⁾ = β0 + β1·x1⁽ⁱ⁾ + … + βn·xn⁽ⁱ⁾ is the model's prediction.
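A short Python sketch of this cost follows; the 1/m scaling matches the mean squared error mentioned above, and the toy data is an assumption for illustration.

```python
import numpy as np

def mse_cost(X, y, beta):
    """Mean squared error cost for multiple linear regression.
    X is expected to include a leading column of ones so beta[0] is the intercept."""
    errors = X @ beta - y          # prediction error for every training example
    return np.mean(errors ** 2)    # sum of squared errors divided by m

# Tiny made-up example: intercept 1, slopes 2 and 3, three training examples.
X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 4.0]])
y = np.array([9.0, 8.0, 19.0])
print(mse_cost(X, y, np.array([1.0, 2.0, 3.0])))  # 0.0 for a perfect fit
```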
LECTURE 4
MACHINE LEARNING
What is Regression Analysis?
Regression analysis is the process of estimating the relationship between a dependent variable
and independent variables.
Using regression, you fit a function to the available data and try to predict the outcome for
future data points. This fitted function serves two purposes:
• You can estimate missing data within your data range (interpolation).
• You can estimate future data outside your data range (extrapolation).
Interpolation Versus Extrapolation.
In regression tasks, we use data to generalize a function that maps a set of input variables X to
output variables y. Using this function mapping, a y value can be predicted for any combination
of input variables. This process is referred to as interpolation when the input variables lie in
between the training data, whereas if the point of estimation lies outside this region it is
referred to as extrapolation.
Let's consider the example of college graduates.
• Let's assume we have access to somewhat sparse data where we know the number of
college graduates every 4 years, as shown in the scatter plot below.

We want to estimate the number of college graduates for all the missing years in between. We
can do this by fitting a line to the limited available data points. This process is called
interpolation.
• Let’s assume we have access to limited data from the year 2001 to the year 2012, and
we want to predict the number of college graduates from the year 2013 to 2018.

It can be seen that the number of college graduates with master’s degrees increases almost
linearly with the year.
Hence, it makes sense to fit a line to the dataset. Using the 12 available points to fit a line and then
testing the prediction of this line on the 6 future points, it can be seen that the prediction is very close.
This process is called extrapolation.
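The interpolation and extrapolation steps can be sketched with NumPy as follows; the graduate counts are made-up numbers, since the original plot data is not reproduced in these notes.

```python
import numpy as np

# Made-up yearly graduate counts (in millions) for 2001-2012.
years = np.arange(2001, 2013)
grads = 1.9 + 0.05 * (years - 2001)

# Fit a straight line to the 12 available points.
slope, intercept = np.polyfit(years, grads, deg=1)
line = np.poly1d([slope, intercept])

print(line(2005.5))                  # interpolation: estimate inside the data range
print(line(np.arange(2013, 2019)))   # extrapolation: predict the 6 future years
```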

Linear Regression
In simple words, Linear Regression is a supervised Machine Learning model in which the
model finds the best-fit linear line between the independent and dependent variables, i.e. it finds
the linear relationship between the dependent and independent variables.
Linear Regression is of two types:
1. Simple Linear Regression
Simple Linear Regression is where only one independent variable is present and the model has
to find its linear relationship with the dependent variable.
2. Multiple Linear Regression (see Lecture 5).
VARIABLE ROLES

LECTURE 3
MACHINE LEARNING
Concept learning
Learning generalized hypotheses from specific positive and negative examples
Concept Learning can be seen as a problem of searching through a predefined space of
potential hypotheses for the hypothesis that best fits the training examples.
– Search takes place in the hypothesis space

A Concept Learning Task – Enjoy Sport Training Examples

• A set of example days, each described by six attributes.
• The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes.

EnjoySport – Hypothesis Representation


Each hypothesis consists of a conjunction of constraints on the instance attributes.
Each hypothesis will be a vector of six constraints, specifying the values of the six attributes –
(Sky, AirTemp, Humidity, Wind, Water, and Forecast).

Each attribute constraint will be one of:
– ? – indicating any value is acceptable for the attribute (don’t care)
– a single value – specifying a single required value (e.g. Warm) (specific)
– 0 – indicating no value is acceptable for the attribute (no value)
If an instance (example) x satisfies all the constraints of hypothesis h, then h classifies x as a
positive example h(x)=1

Example: <?, Cold, High, ?, ?, ?>
EnjoySport is true on Cold days with High humidity, no matter whether the Sky is Sunny or Rainy, the Wind
is Strong or Weak, the Water is Warm or Cool, and the Forecast is Same or Change.
• Most general hypothesis – Every day is a positive example <?, ?, ?, ?, ?, ?>
• Most specific hypothesis – No day is a positive example <0, 0, 0, 0, 0, 0>
• The EnjoySport concept learning task requires learning the set of days for which
EnjoySport = yes.
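The h(x) test described above can be sketched in Python; the helper name satisfies is an assumption for illustration.

```python
def satisfies(h, x):
    """Return 1 if instance x meets every constraint of hypothesis h, else 0.
    '?' accepts any value; '0' (the empty constraint) accepts no value."""
    for constraint, value in zip(h, x):
        if constraint == "0" or (constraint != "?" and constraint != value):
            return 0
    return 1

x = ("Sunny", "Cold", "High", "Strong", "Cool", "Change")
print(satisfies(("?", "Cold", "High", "?", "?", "?"), x))   # 1: classified positive
print(satisfies(("?", "?", "?", "?", "?", "?"), x))         # 1: most general hypothesis
print(satisfies(("0", "0", "0", "0", "0", "0"), x))         # 0: most specific hypothesis
```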

Concept Learning As Search


• The concept learning task boils down to searching for a particular hypothesis or a set of
hypotheses in the search space of candidate hypotheses
– The goal is to find the hypothesis that best fits the training data
Sky has 3 possible values, and the other 5 attributes have 2 possible values each.
• There are 96 (= 3·2·2·2·2·2) distinct instances in X.
• There are 5120 (= 5·4·4·4·4·4) syntactically distinct hypotheses in H.
– Two more values for each attribute: ? and 0.
Every hypothesis containing one or more 0 symbols represents the empty set of instances; that
is, it classifies every instance as negative.
• There are 973 (= 1 + 4·3·3·3·3·3) semantically distinct hypotheses in H.
– Only one more value (?) for each attribute, plus one hypothesis representing the empty set of instances.
Although EnjoySport has small, finite hypothesis space, most learning tasks have much larger
(even infinite) hypothesis spaces.
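The three counts can be verified with a few lines of arithmetic (a sketch of the reasoning above).

```python
# Distinct instances: Sky has 3 values, the other five attributes have 2 values each.
print(3 * 2 * 2 * 2 * 2 * 2)        # 96

# Syntactically distinct hypotheses: every attribute also allows '?' and '0'.
print(5 * 4 * 4 * 4 * 4 * 4)        # 5120

# Semantically distinct hypotheses: one empty-set hypothesis plus those using only '?' extra.
print(1 + 4 * 3 * 3 * 3 * 3 * 3)    # 973
```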

General-to-Specific Ordering of Hypotheses


Many algorithms for concept learning organize the search through the hypothesis space by
relying on a general-to-specific ordering of hypotheses.
Consider two hypotheses:
• h1 = <Sunny, ?, ?, Strong, ?, ?>
• h2 = <Sunny, ?, ?, ?, ?, ?>
Now consider the sets of instances that are classified positive by h1 and by h2.
– Because h2 imposes fewer constraints on the instance, it classifies more instances as positive.
– In fact, any instance classified positive by h1 will also be classified positive by h2.
– Therefore, we say that h2 is more general than h1.

FIND-S Algorithm
The FIND-S algorithm starts from the most specific hypothesis and generalizes it by considering only
positive examples.
FIND-S algorithm ignores negative examples.
– As long as the hypothesis space contains a hypothesis that describes the true target
concept, and the training data contains no errors, ignoring negative examples does not cause
any problem.
FIND-S algorithm finds the most specific hypothesis within H that is consistent with the positive
training examples.
We can start the search with the most specific hypothesis and then keep generalizing it till:
– all the positive examples satisfy it
– and no negative example is covered by it
Training data

h0 = <Ø, Ø, Ø, Ø, Ø, Ø>

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +   →   h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
x2 = <Sunny, Warm, High, Strong, Warm, Same>, +     →   h2 = <Sunny, Warm, ?, Strong, Warm, Same>
x3 = <Rainy, Cold, High, Strong, Warm, Change>, −   →   h3 = <Sunny, Warm, ?, Strong, Warm, Same>
x4 = <Sunny, Warm, High, Strong, Cool, Change>, +   →   h4 = <Sunny, Warm, ?, Strong, ?, ?>

It ignores every negative example

– Assumptions:
o The hypothesis space H contains a hypothesis that describes the true target concept.
o The current hypothesis is the most specific hypothesis in H consistent with the observed positive examples.
– Algorithm:
o Initialize h to the most specific hypothesis in H.
o For each positive training instance x:
 For each attribute constraint ai in h:
– If the constraint ai is satisfied by x (the attribute value in x equals the value in h), then do nothing.
– Else replace ai in h by the next more general constraint that is satisfied by x.
o Output hypothesis h.
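A minimal Python sketch of FIND-S on the four EnjoySport examples listed above follows; the function and variable names are assumptions for illustration.

```python
def find_s(examples):
    """FIND-S: generalize the most specific hypothesis using only the positive examples."""
    n = len(examples[0][0])
    h = ["0"] * n                      # start with the most specific hypothesis <0, ..., 0>
    for x, label in examples:
        if label != "+":               # negative examples are ignored
            continue
        for i, value in enumerate(x):
            if h[i] == "0":
                h[i] = value           # first positive example: copy its attribute values
            elif h[i] != value:
                h[i] = "?"             # conflicting value: generalize the constraint to '?'
    return h

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "+"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "+"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "-"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "+"),
]
print(find_s(training))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```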
Candidate-Elimination Algorithm
• FIND-S outputs one hypothesis from H that is consistent with the training examples, but this is just
one of many hypotheses from H that might fit the training data equally well.
The key idea in the Candidate-Elimination algorithm is to output a description of the set of all
hypotheses consistent with the training examples.

Consistent Hypothesis

An example x is said to satisfy hypothesis h when h(x) = 1


However, whether an example is consistent with h depends on the target concept: h is consistent
with the example <x, c(x)> if and only if h(x) = c(x).

Version Space
The version space is the collection of all consistent hypotheses.
• The set of all hypotheses consistent with the training data
– Learned by the Candidate-Elimination algorithm
• Definition: The version space, denoted VS_{H,D}, with respect to hypothesis space H and
training examples D, is the subset of hypotheses from H consistent with the training
examples in D:

VS_{H,D} ≡ { h ∈ H | Consistent(h, D) }
Compact Representation of Version Spaces
A version space can be represented with its general and specific boundary sets.
The Candidate-Elimination algorithm represents the version space by storing only its most
general members G and its most specific members S.
Example Version Space

A version space with its general and specific boundary sets.


The version space includes all six hypotheses shown here, but can be represented
more simply by S and G.
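The six hypotheses can be verified by brute force: enumerate the semantically distinct hypotheses (excluding those containing the empty constraint, which cannot be consistent with a positive example) and keep the ones consistent with the four training examples listed earlier. The third Sky value, Cloudy, and the helper names are assumptions following the standard EnjoySport task.

```python
from itertools import product

# Attribute values of the EnjoySport task, each extended with the '?' constraint.
values = [
    ["Sunny", "Cloudy", "Rainy", "?"],   # Sky
    ["Warm", "Cold", "?"],               # AirTemp
    ["Normal", "High", "?"],             # Humidity
    ["Strong", "Weak", "?"],             # Wind
    ["Warm", "Cool", "?"],               # Water
    ["Same", "Change", "?"],             # Forecast
]

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]

def classify(h, x):
    """h(x) = 1 if every constraint of h is '?' or equals the corresponding value of x."""
    return int(all(c == "?" or c == v for c, v in zip(h, x)))

# Version space: all hypotheses h with h(x) = c(x) for every training example.
version_space = [h for h in product(*values)
                 if all(classify(h, x) == c for x, c in training)]
for h in version_space:
    print(h)
print(len(version_space))  # 6
```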
Candidate-Elimination Algorithm
The Candidate-Elimination algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples.
It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing
the G boundary set to contain the most general hypothesis in H
G0 <- <?, ?, ?, ?, ?, ?>
and initializing the S boundary set to contain the most specific hypothesis
S0 <- <0, 0, 0, 0, 0, 0>
These two boundary sets delimit the entire hypothesis space, because every other hypothesis
in H is both more general than S0 and more specific than G0.
As training examples are observed, the algorithm removes from the version space every hypothesis
that is inconsistent with them, so the version space always contains exactly the hypotheses consistent with the training data seen so far.
How Candidate Elimination Algorithm Works
Initialize G to the set of maximally general hypotheses in H
• Initialize S to the set of maximally specific hypotheses in H
• For each training example d, do
– If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
– Remove s from S
– Add to S all minimal generalizations h of s such that
» h is consistent with d, and some member of G is more general
than h
– Remove from S any hypothesis that is more general than another
hypothesis in S
– If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
– Remove g from G
– Add to G all minimal specializations h of g such that
» h is consistent with d, and some member of S is more specific
than h
– Remove from G any hypothesis that is less general than another
hypothesis in G

Trace of the boundary sets on the EnjoySport training examples:

S0: <Ø, Ø, Ø, Ø, Ø, Ø>
G0: <?, ?, ?, ?, ?, ?>

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +
S1: <Sunny, Warm, Normal, Strong, Warm, Same>
G1: <?, ?, ?, ?, ?, ?>

x2 = <Sunny, Warm, High, Strong, Warm, Same>, +
S2: <Sunny, Warm, ?, Strong, Warm, Same>
G2: <?, ?, ?, ?, ?, ?>

x3 = <Rainy, Cold, High, Strong, Warm, Change>, −
S3: <Sunny, Warm, ?, Strong, Warm, Same>
G3: <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>

x4 = <Sunny, Warm, High, Strong, Cool, Change>, +
S4: <Sunny, Warm, ?, Strong, ?, ?>
G4: <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>
