                    Predicted negative   Predicted positive
Actual classes:
    negative                TN                   FP
    positive                FN                   TP

Accuracy: (TN+TP)/(TN+TP+FN+FP)
Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
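With hypothetical confusion-matrix counts, the three measures can be computed directly from these formulas:

```python
# Hypothetical counts for a binary classifier (not from a real data set):
TN, FP, FN, TP = 50, 10, 5, 35

accuracy = (TN + TP) / (TN + TP + FN + FP)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                   # fraction of positive predictions that are correct
recall = TP / (TP + FN)                      # fraction of actual positives that are found

print(accuracy)   # 0.85
print(precision)
print(recall)     # 0.875
```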
Scikit-learn
Scikit-learn is a Python module merging classic machine learning algorithms with the
world of scientific Python packages (NumPy, SciPy, matplotlib).
The digits dataset is a dictionary-like object, containing the actual data and some
metadata.
from sklearn import datasets
digits = datasets.load_digits()
print(digits.data)
[[ 0. 0. 5. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 10. 0. 0.]
[ 0. 0. 0. ..., 16. 9. 0.]
...,
[ 0. 0. 1. ..., 6. 0. 0.]
[ 0. 0. 2. ..., 12. 0. 0.]
[ 0. 0. 10. ..., 12. 1. 0.]]
digits.data contains the features, i.e. images of handwritten digits, which can be
used for classification.
print(digits.data.shape, digits.target.shape)
(1797, 64) (1797,)
digits.target contains the labels, i.e. the digits from 0 to 9, for the images of digits.data.
digits.data is a 2D array with the shape (number of samples, number of features). In our case,
each sample is originally an image of shape (8, 8), stored as a flattened row of 64 features:
print(digits.target[0], digits.data[0])
print(digits.images[0])
0 [ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.
    0.  3. 15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.
    0.  5.  8.  0.  0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.
    0.  2. 14.  5. 10. 12.  0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
[[ 0. 0. 5. 13. 9. 1. 0. 0.]
[ 0. 0. 13. 15. 10. 15. 5. 0.]
[ 0. 3. 15. 2. 0. 11. 8. 0.]
[ 0. 4. 12. 0. 0. 8. 8. 0.]
[ 0. 5. 8. 0. 0. 9. 8. 0.]
[ 0. 4. 11. 0. 1. 12. 7. 0.]
[ 0. 2. 14. 5. 10. 12. 0. 0.]
[ 0. 0. 6. 13. 10. 0. 0. 0.]]
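The correspondence between the two representations can be checked directly: each flat 64-element row of digits.data is the reshaped 8x8 image from digits.images:

```python
from sklearn import datasets
import numpy as np

digits = datasets.load_digits()

# every row of digits.data is the flattened version of the matching 8x8 image
assert np.array_equal(digits.data[0], digits.images[0].reshape(64))
print(digits.data.shape, digits.images.shape)   # (1797, 64) (1797, 8, 8)
```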
A Naive Bayes Classifier Example
We will use a file called 'person_data.txt'. It contains 100 random person records, male and
female, with heights, weights and gender labels.
import numpy as np

genders = ["male", "female"]
persons = []
with open("data/person_data.txt") as fh:
    for line in fh:
        persons.append(line.strip().split())

firstnames = {}
heights = {}
for gender in genders:
    firstnames[gender] = [x[0] for x in persons if x[4] == gender]
    heights[gender] = [x[2] for x in persons if x[4] == gender]
    heights[gender] = np.array(heights[gender], int)
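The exact contents of 'person_data.txt' are not shown here. The code above assumes whitespace-separated columns: first name, last name, height (cm), weight (kg) and gender. A few hypothetical sample lines, parsed the same way, illustrate the assumed format:

```python
# Hypothetical sample lines in the assumed format of person_data.txt:
# first name, last name, height (cm), weight (kg), gender
sample = """Randolph Hudson 183 82 male
Laura Drake 163 55 female
Carla Morgan 159 52 female"""

persons = [line.split() for line in sample.splitlines()]
sample_heights = {}
for gender in ["male", "female"]:
    sample_heights[gender] = [int(x[2]) for x in persons if x[4] == gender]
print(sample_heights)   # {'male': [183], 'female': [163, 159]}
```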
Warning: There might be some confusion between a Python class and a Naive Bayes class.
We try to avoid it by saying explicitly what is meant, whenever possible!
The Feature class needs a label, e.g. "heights" or "firstnames". If the feature values are
numerical we may want to "bin" them to reduce the number of possible feature values. The
heights from our persons have a huge range and we have only 50 measured values for our
Naive Bayes classes "male" and "female". We will bin them into ranges "130 to 134", "135
to 139", "140 to 144" and so on by setting bin_width to 5. There is no way of binning the
first names, so bin_width will be set to None.
The method frequency returns the number of occurrences for a certain feature value or a
binned range.
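The Feature class itself is not listed here. A minimal sketch consistent with the behaviour described above (lower-edge binning, a freq_dict, a freq_sum and a frequency method) could look like this; the original implementation may differ in detail:

```python
from collections import Counter

class Feature:
    """Minimal sketch of a feature holder for a Naive Bayes class."""

    def __init__(self, data, name=None, bin_width=None):
        self.name = name
        self.bin_width = bin_width
        if bin_width:
            # map every numerical value to the lower edge of its bin
            data = [bin_width * (int(v) // bin_width) for v in data]
        self.freq_dict = dict(Counter(data))
        self.freq_sum = sum(self.freq_dict.values())

    def frequency(self, value):
        """Number of occurrences of a feature value or its binned range."""
        if self.bin_width:
            value = self.bin_width * (int(value) // self.bin_width)
        return self.freq_dict.get(value, 0)

heights_male = Feature([181, 163, 157, 164, 183], name="male", bin_width=5)
print(heights_male.freq_dict)        # {180: 2, 160: 2, 155: 1}
print(heights_male.frequency(162))   # 2: both 163 and 164 fall into the 160-164 bin
```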
We will now create two Feature instances for the height values of the person data set:
one containing the heights for the Naive Bayes class "male" and one containing the heights
for the class "female":
fts = {}
for gender in genders:
    fts[gender] = Feature(heights[gender], name=gender, bin_width=5)
    print(gender, fts[gender].freq_dict)
male {160: 5, 195: 2, 180: 5, 165: 4, 200: 3, 185: 8, 170: 6, 155: 1, 190:
8, 175: 7}
female {160: 8, 130: 1, 165: 11, 135: 1, 170: 7, 140: 0, 175: 2, 145: 3,
180: 4, 150: 5, 185: 0, 155: 7}
We printed out the frequencies of our bins, but these values are a lot easier to compare
when depicted in a bar chart.
class NBclass:

    def __init__(self, name, *features):
        self.features = features
        self.name = name

    def probability_value_given_feature(self,
                                        feature_value,
                                        feature):
        """
        Returns the probability p for a feature_value of the
        given feature to occur; corresponds to P(d_i | p_j),
        where d_i is a feature value of the feature i.
        """
        if feature.freq_sum == 0:
            return 0
        else:
            return feature.frequency(feature_value) / feature.freq_sum
In the following code, we will create NBclass instances with one feature, i.e. the height
feature. We will use the Feature instances from fts, which we created previously:
cls = {}
for gender in genders:
    cls[gender] = NBclass(gender, fts[gender])
The final step for creating a simple Naive Bayes classifier consists in writing a class
'Classifier', which will use our classes 'NBclass' and 'Feature'.
class Classifier:

    def __init__(self, *nbclasses):
        self.nbclasses = nbclasses

    def prob(self, *d, best_only=True):
        probability_list = []
        for nbclass in self.nbclasses:
            ftrs = nbclass.features
            prob = 1
            for i in range(len(ftrs)):
                prob *= nbclass.probability_value_given_feature(d[i],
                                                                ftrs[i])
            probability_list.append((prob, nbclass.name))
        prob_sum = sum(p[0] for p in probability_list)
        if prob_sum == 0:
            # no learned data for this input: fall back to a uniform distribution
            probability_list = [(1 / len(self.nbclasses), name)
                                for _, name in probability_list]
        else:
            probability_list = [(p / prob_sum, name)
                                for p, name in probability_list]
        return max(probability_list) if best_only else probability_list
We will create a classifier with one feature class 'height'. We check it with values between
130 and 220 cm.
c = Classifier(cls["male"], cls["female"])

for i in range(130, 220, 5):
    print(i, c.prob(i, best_only=False))
130 [(0.0, 'male'), (1.0, 'female')]
135 [(0.0, 'male'), (1.0, 'female')]
140 [(0.5, 'male'), (0.5, 'female')]
145 [(0.0, 'male'), (1.0, 'female')]
150 [(0.0, 'male'), (1.0, 'female')]
155 [(0.125, 'male'), (0.875, 'female')]
160 [(0.38461538461538469, 'male'), (0.61538461538461542, 'female')]
165 [(0.26666666666666666, 'male'), (0.73333333333333328, 'female')]
170 [(0.46153846153846162, 'male'), (0.53846153846153855, 'female')]
175 [(0.77777777777777779, 'male'), (0.22222222222222224, 'female')]
180 [(0.55555555555555558, 'male'), (0.44444444444444448, 'female')]
185 [(1.0, 'male'), (0.0, 'female')]
190 [(1.0, 'male'), (0.0, 'female')]
195 [(1.0, 'male'), (0.0, 'female')]
200 [(1.0, 'male'), (0.0, 'female')]
205 [(0.5, 'male'), (0.5, 'female')]
210 [(0.5, 'male'), (0.5, 'female')]
215 [(0.5, 'male'), (0.5, 'female')]
There are no persons, neither male nor female, in our learning set with a body height
between 140 and 144 cm. This is why our classifier cannot base its result on learned
data and therefore comes back with a fifty-fifty result. The same happens for heights
of 205 cm and above, which do not occur in the learning set either.
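The other rows can be checked against the freq_dicts printed earlier. For the 155 cm bin, for example, the male class contains 1 height and the female class 7, and since both classes hold the same total number of measurements, normalizing the two raw counts reproduces the printed result:

```python
# counts for the 155-159 bin, taken from the freq_dicts printed above
male_count, female_count = 1, 7
total = male_count + female_count

p_male = male_count / total
p_female = female_count / total
print(p_male, p_female)   # 0.125 0.875, as in the classifier output
```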
fts = {}
cls = {}
for gender in genders:
    fts_names = Feature(firstnames[gender], name=gender)
    cls[gender] = NBclass(gender, fts_names)

c = Classifier(cls["male"], cls["female"])

testnames = ['Edgar', 'Benjamin', 'Fred', 'Albert', 'Laura',
             'Maria', 'Paula', 'Sharon', 'Jessie']
for name in testnames:
    print(name, c.prob(name))
Edgar (0.5, 'male')
Benjamin (1.0, 'male')
Fred (1.0, 'male')
Albert (1.0, 'male')
Laura (1.0, 'female')
Maria (1.0, 'female')
Paula (1.0, 'female')
Sharon (1.0, 'female')
Jessie (0.6666666666666667, 'female')
The name "Jessie" is ambiguous: there are about 66 boys per 100 girls with this name.
From the previous classification results we can see that the probability of the name
"Jessie" being "female" is about two-thirds, as calculated from our person data set.
Jessie Washington is only 159 cm tall. If we have a look at the results of our Classifier,
trained with heights, we see that the likelihood for a person 159 cm tall of being "female" is
0.875. So what about an unknown person called "Jessie" and being 159 cm tall? Is this
person female or male?
To answer this question, we will train a Naive Bayes classifier with two feature classes, i.e.
heights and first names:
cls = {}
for gender in genders:
    fts_heights = Feature(heights[gender], name="heights", bin_width=5)
    fts_names = Feature(firstnames[gender], name="names")
    cls[gender] = NBclass(gender, fts_names, fts_heights)

c = Classifier(cls["male"], cls["female"])

for d in [("Maria", 140), ("Anthony", 200), ("Anthony", 153),
          ("Jessie", 188), ("Jessie", 159), ("Jessie", 160)]:
    print(d, c.prob(*d, best_only=False))
Let's set out on a journey by train to create another very simple Naive Bayes classifier. Let
us assume we are in the city of Hamburg and want to travel to Munich. We will have to
change trains in Frankfurt am Main. We know from previous train journeys that our train
from Hamburg might be delayed, in which case we will not catch our connecting train in
Frankfurt. The probability that we will not be in time for our connecting train depends on
how large the delay is. The connecting train will not wait for more than five minutes.
Sometimes, however, the other train is delayed as well.
The following lists 'in_time' (the train from Hamburg arrived in time to catch the connecting
train to Munich) and 'too_late' (the connecting train was missed) contain data collected over
some weeks. The first component of each tuple gives the minutes the train was late
and the second component the number of times this occurred.
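As a sketch with made-up counts (not the actual observations), such lists could look like this, and the probability of missing the connection for a given delay is then simply a relative frequency:

```python
# Hypothetical counts: each tuple is
# (minutes the train from Hamburg was late, number of times this occurred)
in_time  = [(0, 22), (1, 19), (2, 17), (3, 18), (4, 16), (5, 10), (6, 2)]
too_late = [(6, 3), (7, 2), (8, 2), (9, 1)]

caught = dict(in_time)    # delay -> times the connection was caught
missed = dict(too_late)   # delay -> times the connection was missed

def prob_miss_given_delay(minutes):
    """Relative frequency of missing the connecting train for a given
    delay; returns None if this delay was never observed."""
    c = caught.get(minutes, 0)
    m = missed.get(minutes, 0)
    if c + m == 0:
        return None
    return m / (c + m)

print(prob_miss_given_delay(4))   # 0.0  (always caught with a 4-minute delay)
print(prob_miss_given_delay(6))   # 0.6  (the connecting train was sometimes late too)
print(prob_miss_given_delay(8))   # 1.0
```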