
Learning

• Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
• Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down.
• Learning modifies the agent's decision mechanisms to improve performance.
Types of Learning
• Supervised learning:
  • A correct answer is given for each example; the answer can be a numeric variable, a categorical variable, etc.
  • Both inputs and outputs are given.
  • The outputs are typically provided by a friendly teacher.
• Unsupervised learning:
  • Correct answers are not given – just examples.
  • The agent can learn relationships among its percepts, and how they trend with time.
• Reinforcement learning:
  • Occasional rewards.
  • The agent receives some evaluation of its actions (such as a fine for stealing bananas), but is not told the correct action (such as how to buy bananas).
Decision Tree
• A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no "decision".
• Decision tree induction is one of the simplest and yet most successful forms of machine learning. We first describe the representation (the hypothesis space) and then show how to learn a good hypothesis.
• Each node tests the value of an input attribute.
• Branches from the node correspond to the possible values of the attribute.
• Leaf nodes supply the values to be returned if that leaf is reached.
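As a minimal sketch (not the textbook's algorithm), this representation can be encoded with nested dictionaries; the tree fragment below is hypothetical and only illustrates how a node tests an attribute, branches on its values, and returns a leaf value.

# Minimal sketch of the decision-tree representation described above.
# A non-leaf node is a dict {"attribute": name, "branches": {value: subtree}};
# a leaf is simply the value to return (e.g. "Yes"/"No").

def classify(tree, example):
    """Follow the branches matching the example's attribute values until a leaf is reached."""
    while isinstance(tree, dict):
        value = example[tree["attribute"]]
        tree = tree["branches"][value]
    return tree

# Hypothetical fragment of a restaurant tree, only for illustration.
tree = {
    "attribute": "Patrons",
    "branches": {
        "None": "No",
        "Some": "Yes",
        "Full": {
            "attribute": "Hungry",
            "branches": {"Yes": "Yes", "No": "No"},
        },
    },
}

print(classify(tree, {"Patrons": "Full", "Hungry": "No"}))  # -> "No"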
Simple Examples
• Decision: whether to wait for a table at a restaurant.
1. Alternate: whether there is a suitable alternative restaurant.
2. Bar: whether the restaurant has a bar for waiting customers.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (None, Some, Full).
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (Indian, Chinese, Thai, Fast food).
10. WaitEstimate: 0–10, 10–30, 30–60, or >60 minutes.
Representation
• To draw a decision tree from a dataset of some attributes:
  ● Each node corresponds to a splitting attribute.
  ● Each arc is a possible value of that attribute.
  ● The splitting attribute is selected to be the most informative among the attributes.
  ● Entropy is used to measure how informative a node is.
  ● The algorithm uses the criterion of information gain to determine the goodness of a split.
  ● The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split on all distinct values of that attribute.
Classification Methods
Gini Index
• Used in CART, SLIQ and SPRINT to determine the best split.
• Tree growing process:
  • Find each predictor's best split.
  • Find the node's best split: among the best splits found in step 1, choose the one that maximizes the splitting criterion.
  • Split the node using its best split found in step 2 if the stopping rules are not satisfied.
• Finding the best split: Gini index
• Gini index for a given node t:

  Gini(t) = 1 − Σj [p(j | t)]²

  where p(j | t) is the relative frequency of class j at node t.
  ● Maximum (1 − 1/n): records equally distributed over n classes
  ● Minimum 0: all records in one class
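As a quick sketch of the definition just given (the counts below are hypothetical per-class record counts at a node), the function computes the Gini index and shows the stated maximum and minimum:

def gini(counts):
    """Gini index of a node from per-class record counts: 1 - sum_j p(j|t)^2."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([5, 0]))   # 0.0 -> minimum: all records in one class
print(gini([5, 5]))   # 0.5 -> maximum 1 - 1/n for n = 2 classes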
Example
Revision Of Classification
 Decision Tree
 Naïve Bayes
Inductive Learning (Learning From Examples)
Different kinds of learning…
• Supervised learning: someone gives us examples with the right answer (output/label) for those examples; we have to predict the right answer for unseen examples.
• Unsupervised learning: we see examples only but get no feedback (no labels/output); we need to find patterns in the data.
• Semi-supervised learning: a small amount of labeled data plus a large amount of unlabeled data.
• Reinforcement learning: we take actions and get rewards; we have to learn how to get high rewards.
Inductive Machine Learning Process…
Preparation of Training Data → Learning Scheme OR Learning Algorithm → Trained Model → Testing and Evaluation (Test Data) → Deploy Trained & Tested Model in Domain
Machine Learning Process
[Figure: the machine learning process, centered on the Learning Scheme OR Learning Algorithm]
Types of Supervised Machine Learning
• Classification
• Association
• Regression / Numeric Prediction
Also known as predictive learning.
Classification Algorithms
There are many algorithms for classification. Some of the best known are the following:
 Decision trees
 Naive Bayes classifier
 Linear Regression
 Logistic regression
 Neural networks
 Perceptron
 Multi Layer Neural Networks
 Support vector machines (SVM)
 Quadratic classifiers
 Instance Based Learning
 k-nearest neighbor
Decision Tree Learning
Classification training data may be used to create a decision tree (model), which consists of:
Nodes:
 Represent tests of attributes/features of the training data
Edges:
 Correspond to the outcomes of a test and connect to the next node or leaf
Leaves:
 Predict the class (y)
Nominal and numeric attributes
 Nominal:
  number of children usually equal to the number of values
  attribute won't get tested more than once
 Other possibility: division into two subsets
 Numeric:
  test whether the value is greater or less than a constant
  attribute may get tested several times
 Other possibility: three-way split (or multi-way split)
  Integer: less than, equal to, greater than
  Real: below, within, above
Classification using Decision Tree
Example (Restaurant)
Decision Tree of Restaurant Example

Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type    Est    WillWait
Yes  No   Yes  Yes  Full  $      No    Yes  French  30–60  ?
Criterion for attribute selection
There is no way to efficiently search through the roughly 2^(2^n) possible trees (for n Boolean attributes).
We want to get the smallest tree.
Heuristic: choose the attribute that produces the "purest" nodes.
Strategy: choose the attribute that gives the greatest information gain.
Purity check of an attribute
[Figure: example attribute splits, one pure and one impure]
Measuring Purity
The question is: how do we measure purity?
Two methods are used:
 Entropy-based information gain
 Gini index
Information gain is used in ID3 and its successors, e.g. C4.5/C5.0 and Weka's J48.
The Gini index is used in the CART algorithm implemented in Python's scikit-learn.
Decision Trees
There are four cases to consider while creating a decision tree:
 If the remaining examples are all positive (or all negative), then answer Yes or No.
 If there are some positive and some negative examples, then choose the best attribute to split them.
 No examples left: if there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the plurality classification.
 Noise in data (no features left): if there are no attributes left, but both positive and negative examples, it means that these examples have exactly the same description but different classifications. This can happen because there is an error or noise in the data; because the domain is nondeterministic; or because we can't observe an attribute that would distinguish the examples. The best we can do is return the plurality classification of the remaining examples.
Example
(Weather Data)
Which attribute to select?
Computing information
Measure information in bits

Formula to calculate entropy (logs to base 2, so the result is in bits):

entropy(p1, p2, ..., pn) = −p1 log(p1) − p2 log(p2) − ... − pn log(pn)
Constructing a DT
1. Calculate the entropy of each branch (i.e., of each value of the feature).
2. Compute the average entropy of the node: the weighted average of the branch entropies.
3. Compute the information gain of the node: the total information minus the average entropy of the node.
4. Select the node (attribute) with the highest information gain.
5. Expand the branches with mixed classes and repeat steps 1–4.
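A minimal Python sketch of steps 1–4, assuming two classes (yes/no) and using the Outlook counts from the worked example that follows:

from math import log2

def entropy(probs):
    """Entropy in bits: -sum p_i * log2(p_i), with 0 log 0 taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def average_entropy(branches):
    """Step 2: weighted average of the branch entropies (branches = per-value class counts)."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy([c / sum(b) for c in b]) for b in branches)

# Outlook branches as [yes, no] counts: Sunny, Overcast, Rainy
outlook = [[2, 3], [4, 0], [3, 2]]
total_information = entropy([9 / 14, 5 / 14])          # information before splitting
print(average_entropy(outlook))                        # ~0.693 bits (step 2)
print(total_information - average_entropy(outlook))    # ~0.247 bits information gain (steps 3-4)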
Example: attribute Outlook
 Outlook = Sunny:
  info([2,3]) = entropy(2/5, 3/5) = −(2/5) log(2/5) − (3/5) log(3/5) = 0.971 bits
 Outlook = Overcast:
  info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
  (Note: 0 log(0) is normally undefined; it is taken to be 0 here.)
 Outlook = Rainy:
  info([3,2]) = entropy(3/5, 2/5) = −(3/5) log(3/5) − (2/5) log(2/5) = 0.971 bits
 Expected information for the attribute:
  info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Which attribute to select?
Computing information gain
• Information gain = information before splitting − information after splitting
• Gain(Outlook) = info([9,5]) − info([2,3], [4,0], [3,2])
               = [−(9/14) log(9/14) − (5/14) log(5/14)] − 0.693
               = 0.940 − 0.693
               = 0.247 bits
• Information gain for the attributes of the weather data:
  Gain(Outlook) = 0.247 bits
  Gain(Temperature) = 0.029 bits
  Gain(Humidity) = 0.152 bits
  Gain(Windy) = 0.048 bits
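Using the per-value [yes, no] counts from the frequency table later in these notes, a short self-contained sketch reproduces all four gains:

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def gain(branches):
    """Information gain = class entropy before the split - weighted entropy after it."""
    total = sum(sum(b) for b in branches)
    before = entropy([sum(col) / total for col in zip(*branches)])
    after = sum(sum(b) / total * entropy([c / sum(b) for c in b]) for b in branches)
    return before - after

counts = {                      # [yes, no] counts per attribute value (weather data)
    "Outlook":     [[2, 3], [4, 0], [3, 2]],
    "Temperature": [[2, 2], [4, 2], [3, 1]],
    "Humidity":    [[3, 4], [6, 1]],
    "Windy":       [[6, 2], [3, 3]],
}
for name, branches in counts.items():
    print(name, round(gain(branches), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048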
Which attribute to select? Outlook, since it has the largest information gain (0.247 bits); it becomes the root of the tree.
Continuing to split (the Outlook = Sunny branch)
 gain(Temperature) = 0.571 bits
 gain(Humidity) = 0.971 bits
 gain(Windy) = 0.020 bits
Final decision tree
 Note: not all leaves need to be pure; sometimes identical instances have different classes.
 Splitting stops when the data can't be split any further.
Activity: Restaurant
• Construct the decision tree on a page and submit it.
• Clearly calculate the information gain of every feature.
• Submit by this Sunday.
CART Algorithm
 There are many alternative measures to information gain.
 The most popular alternative is the Gini index, used e.g. in CART (Classification And Regression Trees).
 It uses the average Gini index (instead of average entropy / information).
 The average Gini index is minimized (rather than maximizing a gain).

Impurity measure for each feature value v:
  Gini(v) = 1 − Σj [p(j | v)]²

Average Gini for an attribute A:
  Gini(A) = Σv (nv / n) × Gini(v)
CART Algorithm
• Gini(outlook = sunny) = 1 − (2/5)² − (3/5)² = 0.48
• Gini(outlook = overcast) = 1 − (4/4)² − (0/4)² = 0
• Gini(outlook = rainy) = 1 − (3/5)² − (2/5)² = 0.48
Average Gini for the attribute:
Gini(outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 ≈ 0.342
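The same numbers can be reproduced with a short sketch (per-value [yes, no] counts as in the frequency table):

def gini(counts):
    """Gini index from per-class record counts: 1 - sum_j p(j)^2."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

outlook = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}   # [yes, no] counts
n = sum(sum(c) for c in outlook.values())                          # 14 instances in total
print({k: round(gini(c), 2) for k, c in outlook.items()})          # sunny 0.48, overcast 0.0, rainy 0.48
print(sum(sum(c) / n * gini(c) for c in outlook.values()))         # 0.3428..., the slide's ~0.342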
Which attribute to select?
Activity: Weather
• Construct the decision tree on a page and submit it.
• Clearly calculate the Gini measure of every feature.
• Submit by this Sunday.
Properties of a Good Measure
We expect an information measure to have the following properties:
1. When the number of either yes's or no's is zero, the information is zero.
2. When the numbers of yes's and no's are equal, the information reaches a maximum.
3. The information should obey the multistage property.
Information gain based on entropy has all of the above properties.
ISSUES With Training Data
 Missing data: not all attribute values are known; we need a method to handle them during both training and testing.
 Multivalued attributes: when an attribute has many values, information gain gives an inappropriate indication of the attribute's usefulness (e.g. an attribute such as ExactTime or ID). One solution is to use the gain ratio, as in C4.5.
 Continuous and integer-valued input attributes: continuous or integer-valued attributes such as Height and Weight have an infinite set of possible values. Use a SPLIT POINT: for example, at a given node in the tree it might be best to test Weight > 160; start by sorting the values (see the sketch after this list). Splitting is the most expensive part of real-world decision tree learning applications.
 Continuous-valued output attributes: if we are trying to predict a numerical output value, such as the price of an apartment, then we need a regression tree rather than a classification tree.
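A hedged sketch of the split-point idea for a numeric attribute (the Weight values below are hypothetical): sort the values, then consider a threshold between each pair of consecutive distinct values.

def candidate_split_points(values):
    """Sort the numeric values and return midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

weights = [120, 155, 160, 162, 180]        # hypothetical Weight values reaching a node
print(candidate_split_points(weights))     # [137.5, 157.5, 161.0, 171.0]
# Each candidate t defines a binary test such as Weight > t (compare the Weight > 160 test above).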
DT Discussion
 A decision-tree learning system for real-world applications must be able to
handle all of these problems.
 Handling continuous-valued variables is especially important, because both physical and financial processes provide numerical data. Several commercial packages have been built that meet these criteria.
 In many areas of industry and commerce, decision trees are usually the first
method tried when a classification method is to be extracted from a data set.
 One important property of decision trees is that it is possible for a human to
understand the reason for the output of the learning algorithm. (Indeed, this is
a legal requirement for financial decisions that are subject to anti-
discrimination laws.)
 This is a property not shared by some other representations, such as neural
networks.
Loading Data
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

df = pandas.read_csv("weatherdata_converted.csv")
print(df)
print(df.info())
Splitting data
features = ['Outlook', 'Temperature', 'Humidity', 'Windy']
X = df[features]
y = df['Play']
print(X)
print(y)
print(df.groupby('Play').size())
Decision Tree (Entropy)
dtree1 = DecisionTreeClassifier(criterion="entropy")
# Performing training
dtree1.fit(X, y)
tree.plot_tree(dtree1, feature_names=features)
Decision Tree (Gini Index)
# Classifier and Prediction
dtree = DecisionTreeClassifier(criterion="gini")
dtree = dtree.fit(X, y)
# tree.plot_tree(dtree, feature_names=features)
print(X.iloc[1, :])
Prediction
print(dtree.predict(X.iloc[1:3, :]))
print(y.iloc[1:3])
x_predict = dtree.predict(X)
print(x_predict)
print("Accuracy : ", metrics.accuracy_score(x_predict, y) * 100)
Statistical modeling (Naïve Bayes)
Two assumptions: attributes are
 equally important
 statistically independent (given the class value),
  i.e., knowing the value of one attribute says nothing about the value of another (if the class is known).
The independence assumption is never correct!
But … this scheme works well in practice.
Bayes's rule
Probability of event H given evidence E:

  Pr(H | E) = Pr(E | H) Pr(H) / Pr(E)

A priori probability of H: Pr(H)
 Probability of the event before the evidence is seen
A posteriori probability of H: Pr(H | E)
 Probability of the event after the evidence is seen
Naïve Bayes for classification
 Classification learning: what’s the probability
of the class given an instance?
• Evidence E = instance’s non-class attribute
values
• Event H = class value of instance
 Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent given the class:

  Pr(H | E) = Pr(E1 | H) × Pr(E2 | H) × … × Pr(En | H) × Pr(H) / Pr(E)
Weather data example
Evidence E (a new instance):
  Outlook  Temp.  Humidity  Windy  Play
  Sunny    Cool   High      True   ?

Probability of class "yes":
  P(yes | E) = P(Outlook = Sunny | yes) × P(Temperature = Cool | yes) × P(Humidity = High | yes) × P(Windy = True | yes) × P(yes) / P(E)
             = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
Weather data (training set)
Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Prior Probabilities for weather data (Training)
Probabilities for weather data
Counts (yes, no):
  Outlook: Sunny (2, 3), Overcast (4, 0), Rainy (3, 2)
  Temperature: Hot (2, 2), Mild (4, 2), Cool (3, 1)
  Humidity: High (3, 4), Normal (6, 1)
  Windy: False (6, 2), True (3, 3)
  Play: (9, 5)
Relative frequencies (yes, no):
  Outlook: Sunny (2/9, 3/5), Overcast (4/9, 0/5), Rainy (3/9, 2/5)
  Temperature: Hot (2/9, 2/5), Mild (4/9, 2/5), Cool (3/9, 1/5)
  Humidity: High (3/9, 4/5), Normal (6/9, 1/5)
  Windy: False (6/9, 2/5), True (3/9, 3/5)
  Play: (9/14, 5/14)

• A new day:
  Outlook  Temp.  Humidity  Windy  Play
  Sunny    Cool   High      True   ?

Likelihood of "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Normalizing the likelihoods so that the probabilities sum to 1:
  P("yes") = 0.0053 / (0.0053 + 0.0206) ≈ 20.5%
  P("no")  = 0.0206 / (0.0053 + 0.0206) ≈ 79.5%
So the prediction is NO.
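A small sketch that reproduces these numbers from the conditional probabilities in the frequency table above:

# Conditional probabilities read off the weather frequency table above.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # Sunny, Cool, High, True and the prior, class "yes"
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # Sunny, Cool, High, True and the prior, class "no"
print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
print(round(p_yes / (p_yes + p_no), 3))          # 0.205 -> P("yes" | E) ~ 20.5%
print(round(p_no / (p_yes + p_no), 3))           # 0.795 -> P("no" | E) ~ 79.5%, so predict "no"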
How Naïve Bayes deals with…
 the zero-frequency problem
 missing values
 numeric values
The “zero-frequency problem”
 What if an attribute value doesn’t occur with every class
value?
(e.g. “Humidity = high” for class “yes”)
 Probability will be zero!
 A posteriori probability will also be zero!
(No matter how likely the other values are!)
 Remedy: add 1 to the count for every attribute value-
class combination (Laplace estimator)
 Result: probabilities will never be zero!
(also: stabilizes probability estimates)
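A one-line sketch of the Laplace estimator described above (add 1 to each count for an attribute with k possible values):

def laplace(count, class_total, k):
    """Smoothed estimate (count + 1) / (class_total + k) for an attribute with k values."""
    return (count + 1) / (class_total + k)

# A value never seen with class "yes" (0 out of 9) for a three-valued attribute:
print(laplace(0, 9, 3))   # 0.0833... instead of 0
print(laplace(2, 9, 3))   # 0.25 instead of 2/9 ~ 0.22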
Missing values
 Training: instance is not included in frequency
count for attribute value-class combination
 Classification: attribute will be omitted from
calculation
 Example:

Outlook Temp. Humidity Windy Play


? Cool High True ?
Numeric attributes
Outlook Temperature Humidity Windy Play
Sunny 85 85 FALSE no
Sunny 80 90 TRUE no
Overcast 83 86 FALSE yes
Rainy 70 96 FALSE yes
Rainy 68 80 FALSE yes
Rainy 65 70 TRUE no
Overcast 64 65 TRUE yes
Sunny 72 95 FALSE no
Sunny 69 70 FALSE yes
Rainy 75 80 FALSE yes
Sunny 75 70 TRUE yes
Overcast 72 90 TRUE yes
Overcast 81 75 FALSE yes
Rainy 71 91 TRUE no
Numeric attributes
• Usual assumption: attributes have a normal (Gaussian) probability distribution, given the class.
• The probability density function for the normal distribution is defined by two parameters:
  • the sample mean μ
  • the standard deviation σ
• The density function f(x) is then:

  f(x) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))
Statistics for weather data
Outlook (yes, no): Sunny (2, 3), Overcast (4, 0), Rainy (3, 2)  →  Sunny (2/9, 3/5), Overcast (4/9, 0/5), Rainy (3/9, 2/5)
Temperature | yes: 64, 68, 69, 70, 72, 75, 75, 81, 83  (μ = 73, σ = 6.2)
Temperature | no:  65, 71, 72, 80, 85                  (μ = 75, σ = 7.9)
Humidity | yes: 65, 70, 70, 75, 80, 80, 86, 90, 96     (μ = 79, σ = 10.2)
Humidity | no:  70, 85, 90, 91, 95                     (μ = 86, σ = 9.7)
Windy (yes, no): False (6, 2), True (3, 3)  →  False (6/9, 2/5), True (3/9, 3/5)
Play: 9 yes, 5 no  →  (9/14, 5/14)

• Example density value:
  f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · exp(−(66 − 73)² / (2 · 6.2²)) ≈ 0.0340
Classifying a new day
• A new day:
  Outlook  Temp.  Humidity  Windy  Play
  Sunny    66     90        True   ?

Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of "no"  = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
P("yes") = 0.000036 / (0.000036 + 0.000108) = 25%
P("no")  = 0.000108 / (0.000036 + 0.000108) = 75%
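As a sketch, the two "yes" density values can be reproduced by estimating μ and σ from the numeric columns of the weather table above; the final probabilities then follow by normalizing the two likelihoods on this slide:

from math import sqrt, pi, exp

def gaussian(x, mu, sigma):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def mean_std(values):
    mu = sum(values) / len(values)
    return mu, sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))   # sample std

temp_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]   # temperatures on the nine "yes" days
hum_yes  = [86, 96, 80, 65, 70, 80, 70, 90, 75]   # humidities on the nine "yes" days

print(round(gaussian(66, *mean_std(temp_yes)), 4))   # 0.034  -> f(temperature = 66 | yes)
print(round(gaussian(90, *mean_std(hum_yes)), 4))    # 0.0221 -> f(humidity = 90 | yes)

like_yes, like_no = 0.000036, 0.000108               # the two likelihoods computed above
print(like_yes / (like_yes + like_no))               # 0.25 -> P("yes")
print(like_no / (like_yes + like_no))                # 0.75 -> P("no")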

• Missing values during training are not included in the calculation of the mean and standard deviation.
Naïve Bayes: discussion
 Naïve Bayes works surprisingly well (even if
independence assumption is clearly violated)
 Why? Because classification doesn’t require
accurate probability estimates as long as
maximum probability is assigned to correct class
 However: adding too many redundant attributes
will cause problems (e.g. identical attributes)
 Note also: many numeric attributes are not normally distributed (a possible remedy: kernel density estimators).
Multinomial naïve Bayes
• A version of naïve Bayes used for document classification with the bag-of-words model
• n1, n2, ..., nk: number of times word i occurs in the document
• P1, P2, ..., Pk: probability of obtaining word i when sampling from documents in class H
• Probability of observing a particular document E given the word probabilities of class H (based on the multinomial distribution):

  P(E | H) = N! × Π_{i=1..k} (Pi^ni / ni!),   where N = n1 + n2 + ... + nk

• Note that this expression ignores the probability of generating a document of the right length
• This probability is assumed to be constant for all classes
Multinomial naïve Bayes
• Suppose the dictionary has two words, yellow and blue
• Suppose P(yellow | H) = 75% and P(blue | H) = 25%
• Suppose E is the document "blue yellow blue" (yellow once, blue twice)
• Probability of observing the document:

  P({blue yellow blue} | H) = 3! × (0.75¹ / 1!) × (0.25² / 2!) = 9/64 ≈ 0.14

• Suppose there is another class H' that has P(yellow | H') = 10% and P(blue | H') = 90%:

  P({blue yellow blue} | H') = 3! × (0.1¹ / 1!) × (0.9² / 2!) = 243/1000 ≈ 0.24
• Need to take prior probability of class into account to make the final
classification using Bayes’ rule
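A short sketch of the multinomial computation (the class priors would still have to be multiplied in, as noted above):

from math import factorial, prod   # math.prod requires Python 3.8+

def multinomial_prob(word_counts, word_probs):
    """P(E | H) = N! * prod_i (P_i^n_i / n_i!) for word counts n_i and class word probabilities P_i."""
    n = sum(word_counts)
    return factorial(n) * prod(p ** c / factorial(c) for c, p in zip(word_counts, word_probs))

counts = [1, 2]                                   # "blue yellow blue": yellow once, blue twice
print(multinomial_prob(counts, [0.75, 0.25]))     # 0.140625 = 9/64 for class H
print(multinomial_prob(counts, [0.10, 0.90]))     # 0.243 = 243/1000 for class H'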
Categorical Naïve Bayes
import pandas
from sklearn.naive_bayes import CategoricalNB
from sklearn import metrics

df = pandas.read_csv("weatherdata_converted.csv")
print(df)
features = ['Outlook', 'Temperature', 'Humidity', 'Windy']
print(df.info())
df['Outlook'] = df['Outlook'].astype('category')
df['Temperature'] = df['Temperature'].astype('category')
df['Humidity'] = df['Humidity'].astype('category')
df['Windy'] = df['Windy'].astype('category')
X = df[features]
y = df['Play']
Categorical Naïve Bayes
print(df.info())
print(df.describe())
print(df.groupby('Play').size())
clf = CategoricalNB()
clf.fit(X, y)
y_pred = clf.predict(X)
print(y_pred)
print("Accuracy : ", metrics.accuracy_score(y_pred, y) * 100)
Gaussian Naïve Bayes (Iris flower data set)
The dataset contains a set of 150 records under five attributes/features: petal length, petal width, sepal length, sepal width and species (the class).
Gaussian Naïve Bayes (Iris flower data set)
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)
nb = GaussianNB()
nb.fit(X, y).predict(X)
y_pred = nb.predict(X)
print(y_pred)
print("Accuracy : ", metrics.accuracy_score(y_pred, y) * 100)