
Unit 4

Algorithms: Basic methods


Sec 1. Classification
CLASSIFICATION

• Inferring Rudimentary Rules
• Statistical Modeling
• Divide-and-Conquer: Constructing Decision Trees
• Covering algorithms: Constructing rules
• Mining association rules
• Linear models
• Instance-based learning
• Clustering: Euclidean distance, Manhattan distance, nearest neighbour, farthest neighbour, k-NN
• Multi-instance learning
Classification
• Also called supervised learning, inductive learning, or learning by examples.
• Given: a set of data records described by a set of attributes A = {A1, A2, …, An}. The data set also has a special target attribute C, which is called the class attribute.
• The objective of the classification task is
– to learn a function that relates the values of the attributes in A to the classes in C.
• The function can be used to predict the class values/labels of future data, so classification is also called prediction.
• The function is also called a classification model, a predictive model, or simply a classifier.
1. Inferring Rudimentary Rules
• The idea is this:
– We make rules that test a single attribute and branch accordingly.
• Each branch corresponds to a different value of the attribute.
• Each branch is assigned the class that occurs most often for that value in the training data.
• Count the errors that occur on the training data, that is, the number of instances that do not have the majority class of their branch.
• Choose the attribute whose rules make the fewest errors.
• This method is called 1R, for 1-rule.
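The 1R procedure can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation. It assumes the standard 14-instance nominal weather data set, which the slides refer to but do not reproduce, and the names `WEATHER`, `ATTRIBUTES` and `one_r` are chosen here for illustration.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Assumption: the standard nominal weather data (last column is the class "play").
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def one_r(rows, attributes):
    """Return (attribute, rule, errors) for the attribute with the fewest training errors."""
    best = None
    for idx, name in enumerate(attributes):
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[idx]][row[-1]] += 1
        # Each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not have the majority class of their branch.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        # Strict "<" keeps the first attribute when two attributes tie.
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best

attribute, rule, errors = one_r(WEATHER, ATTRIBUTES)
print(attribute, rule, f"{errors}/{len(WEATHER)}")
```

On this data outlook and humidity make the same number of errors; the sketch simply keeps the first one it encounters.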


1R: example
1R: dealing with missing values and
numeric attributes
• Missing values
– are treated as just another possible value of the attribute
• Numeric values
– are converted to nominal ones using discretization techniques
• Highly branching attributes do not perform well on test examples.
– Example:
• an ID attribute that pinpoints instances uniquely
• each partition contains just one instance, so the error rate on the training data is zero
– This phenomenon is known as overfitting.
• A rule that is overfitted to the training set does not work well on the test set.
– For 1R, overfitting is likely to occur whenever an attribute has a large number of possible values.
• Solution?
– Apply a constraint, e.g. each partition must contain at least three instances of the majority class.
– Whenever adjacent partitions have the same majority class, they can be merged without affecting the meaning of the rule set.
Example
• Weather data: temperature attribute with numeric data

• Discretization steps:
– Place breakpoints wherever the class changes
– Require a minimum number of majority-class instances in each partition, e.g. 3
– Merge adjacent partitions that have the same majority class
• Leading to a rule:
2. Statistical modeling
• 1R uses a single attribute as the basis of its decision.
• How about using all attributes as the basis of the decision?
– All attributes contribute to the decision
• Attributes are assumed to be equally important
• Attributes are assumed to be independent of one another, given the class
• A simple method based on probability: Bayesian classification
Bayesian Classification
Basic Concepts (1)
• Naïve Bayes uses probabilistic prediction to classify data.
• Illustration: the data are grouped into 2 classes, RED and GREEN.
Bayesian Classification
Basic Concepts (2) – prior probability
• There are twice as many GREEN items as RED items, so it is reasonable to expect a new item to be classified as GREEN, because that class has the higher probability. This is the prior probability.
• Prior probability of GREEN = number of GREEN items / total number of items
• Prior probability of RED = number of RED items / total number of items
• Prior probability of GREEN = 40/60
• Prior probability of RED = 20/60
Bayesian Classification
Basic Concepts (3) – likelihood probability
• The closer an item is to the group of a particular class, the more likely it is to be classified into that class. This is the likelihood probability.
• Likelihood of the item being classified as GREEN = number of GREEN items in the neighbourhood / total number of GREEN items
• Likelihood of the item being classified as RED = number of RED items in the neighbourhood / total number of RED items
• Likelihood of the item being classified as GREEN = 1/40
• Likelihood of the item being classified as RED = 3/20
Bayesian Classification
Basic Concepts (4) – posterior probability
• The Bayesian method combines both sources of information into what is called the posterior probability, which determines the final classification.
• Posterior prob of being classified as GREEN = 4/6 x 1/40 = 1/60
• Posterior prob of being classified as RED = 2/6 x 3/20 = 1/20
• Conclusion:
The new item is classified as RED because RED has the larger posterior probability.
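The arithmetic of the RED/GREEN illustration can be checked with exact fractions. This is a small sketch that uses only the counts quoted on the slides; the variable names are chosen here for illustration, and the products are left unnormalized, as on the slides.

```python
from fractions import Fraction

# Counts from the illustration: 60 items in total, 40 GREEN and 20 RED;
# 1 GREEN item and 3 RED items lie in the neighbourhood of the new item.
totals = {"GREEN": 40, "RED": 20}
nearby = {"GREEN": 1, "RED": 3}
n_items = sum(totals.values())

posterior = {}
for label in totals:
    prior = Fraction(totals[label], n_items)             # e.g. 40/60 for GREEN
    likelihood = Fraction(nearby[label], totals[label])  # e.g. 1/40 for GREEN
    posterior[label] = prior * likelihood                # unnormalized posterior

# The class with the largest posterior wins.
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```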
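Implementation of Bayesian classification concept
• Prior probability,
– Prior prob of yes = 9/14
– Prior prob of no = 5/14
• Likelihood probability,
– Likelihood prob of yes = 2/9 x 3/9 x 3/9 x 3/9 = 0.0082
– Likelihood prob of no = 3/5 x 1/5 x 4/5 x 3/5 = 0.0576
Implementation of Bayesian classification concept
• Posterior probability
– proportional to likelihood x prior
• Substituting numbers from previous slide
– Posterior prob of yes ∝ 0.0082 x 9/14 = 0.0053
– Posterior prob of no ∝ 0.0576 x 5/14 = 0.0206
– Normalized: yes = 0.0053 / (0.0053 + 0.0206) = 20.5%, no = 79.5%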

Mathematical Formulation
• Naïve Bayes classifier: find the largest posterior probability P(H|X) for the data to be classified.
• Given training data D with attributes X = (X1, X2, …, Xn) and m classes C1, C2, …, Cm, then:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Because P(X) is constant for all classes, it is enough to maximize:
P(X|Ci) P(Ci)
Mathematical Formulation
• Naïve Bayes assumes that the attributes are independent of one another.
• Final mathematical formulation: classify X into the class Ci that maximizes
P(X|Ci) P(Ci)
with
P(X|Ci) = P(x1|Ci) x P(x2|Ci) x … x P(xn|Ci)
Example
• A dataset for determining whether a person will buy a computer, based on the attributes age, income, student status, and credit rating.
• Classify the data X with:
– age <= 30
– income = medium
– student = yes
– credit_rating = fair
– buys_computer = ?

age       income   student   credit_rating   buys_computer
<=30      High     No        Fair            No
<=30      High     No        Excellent       No
31 … 40   High     No        Fair            Yes
>40       Medium   No        Fair            Yes
>40       Low      Yes       Fair            Yes
>40       Low      Yes       Excellent       No
31 … 40   Low      Yes       Excellent       Yes
<=30      Medium   No        Fair            No
<=30      Low      Yes       Fair            Yes
>40       Medium   Yes       Fair            Yes
<=30      Medium   Yes       Excellent       Yes
31 … 40   Medium   No        Excellent       Yes
31 … 40   High     Yes       Fair            Yes
>40       Medium   No        Excellent       No
Example Implementation
• Let:
C1: buys_computer = yes
C2: buys_computer = no
• Compute P(Ci), the prior probability:
• P(buys_computer = yes) = 9/14 = 0.643
• P(buys_computer = no) = 5/14 = 0.357

Likelihood P(xk|Ci) for each attribute value of X:
•P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
•P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
•P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
•P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
•P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
•P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
•P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
•P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• The likelihood probability is therefore:
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
• Compute the posterior probability, which is proportional to P(X|Ci)*P(Ci):
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
It can be concluded that X is classified into class “C1: buys_computer = yes” because its posterior probability is larger.
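The worked example can be reproduced with a short script. This is a sketch under simple assumptions: probabilities are plain relative frequencies without any smoothing, the rows are copied from the table with values lower-cased and the age range written as "31...40", and the names `DATA` and `naive_bayes_scores` are illustrative only.

```python
from collections import Counter

# Columns: age, income, student, credit_rating, buys_computer (class).
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(rows, query):
    """Return P(X|Ci) * P(Ci) for every class Ci, using raw frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(Ci)
        for idx, value in enumerate(query):
            matches = sum(1 for r in rows if r[-1] == label and r[idx] == value)
            score *= matches / n_label  # likelihood P(x_k | Ci)
        scores[label] = score
    return scores

# X: age <= 30, income = medium, student = yes, credit_rating = fair
query = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(DATA, query)
prediction = max(scores, key=scores.get)
print({label: round(s, 3) for label, s in scores.items()}, prediction)
```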
Avoiding Zero Probability
• Classification or prediction with Naïve Bayes requires every conditional probability to be non-zero, because a single zero factor makes the whole product zero.
Example: a dataset with 1000 instances in which one attribute, income, has 3 values (low, medium, and high) with the following counts:
– Income = low: 0
– Income = medium: 980
– Income = high: 20
With the Laplacian correction, 1 is added to each count, so the probabilities become:
– Prob(income = low) = 1/1003
– Prob(income = medium) = 981/1003
– Prob(income = high) = 21/1003
Avoiding Zero Probability
• The Laplacian correction works well in practice. However, instead of adding 1 to each count, we could choose a small constant μ, so that the counts
– Income = low: 0
– Income = medium: 980
– Income = high: 20
become:
– Prob(income = low) = (0 + μ/3) / (1000 + μ)
– Prob(income = medium) = (980 + μ/3) / (1000 + μ)
– Prob(income = high) = (20 + μ/3) / (1000 + μ)
• The value of μ effectively provides a weight that determines how influential the a priori values are.
Avoiding Zero Probability
• Finally, there is no particular reason for dividing μ into three equal parts in the numerators. Instead we could use:
– Prob(income = low) = (0 + μ p1) / (1000 + μ)
– Prob(income = medium) = (980 + μ p2) / (1000 + μ)
– Prob(income = high) = (20 + μ p3) / (1000 + μ)
– where p1, p2, and p3 sum to 1.
• Effectively, these three numbers are a priori probabilities of the values of the income attribute being low, medium, and high, respectively.
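The three corrections discussed in this part can be written as one small function. This is an illustrative sketch: `smoothed_probabilities` is a name chosen here, and the unequal priors in the second call are hypothetical values picked only to show the mechanics. Setting μ = 3 with equal priors reproduces the Laplacian correction (1 added to each of the three counts).

```python
from fractions import Fraction

def smoothed_probabilities(counts, mu, priors=None):
    """Return (count_i + mu * p_i) / (total + mu) for every attribute value."""
    if priors is None:
        # Equal a priori weights: mu is divided into equal parts.
        priors = {value: Fraction(1, len(counts)) for value in counts}
    total = sum(counts.values())
    return {value: (counts[value] + mu * priors[value]) / (total + mu)
            for value in counts}

income_counts = {"low": 0, "medium": 980, "high": 20}

# Laplacian correction: mu = 3 spread equally adds 1 to each count.
laplace = smoothed_probabilities(income_counts, mu=3)

# Hypothetical unequal priors p1, p2, p3 (they must sum to 1).
priors = {"low": Fraction(1, 2), "medium": Fraction(1, 4), "high": Fraction(1, 4)}
weighted = smoothed_probabilities(income_counts, mu=4, priors=priors)

print(laplace)
print(weighted)
```

In both cases the smoothed values still sum to 1, and no value is left with probability zero.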
Dealing with..
• Missing values
– No problem at all: the calculation simply omits the attribute whose value is missing
• Handling categorical attributes
• Handling continuous-valued attributes
Example
• Thus,
– Posterior prob of yes = 2/9 x 0.0340 x 0.0221 x 3/9 x 9/14 = 0.000036
– Posterior prob of no = 3/5 x 0.0279 x 0.0381 x 3/5 x 5/14 = 0.000137
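The decimal factors in the example are probability densities, not frequencies. A common way to obtain them, assumed here because the slide shows only the resulting numbers, is to model each numeric attribute with a normal (Gaussian) density per class. The mean of 73 and standard deviation of 6.2 for temperature given yes are an assumption taken from the usual textbook weather data; the remaining densities are used exactly as quoted on the slide.

```python
import math

def gaussian_density(x, mean, std):
    """Normal probability density f(x) for the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Assumption: temperature | yes has mean 73 and standard deviation 6.2,
# and the new day has temperature = 66.
f_temp_yes = gaussian_density(66, mean=73, std=6.2)

# Products as written on the slide, using its quoted density values.
score_yes = (2 / 9) * 0.0340 * 0.0221 * (3 / 9) * (9 / 14)
score_no = (3 / 5) * 0.0279 * 0.0381 * (3 / 5) * (5 / 14)

# Normalizing turns the two scores into probabilities that sum to 1.
p_yes = score_yes / (score_yes + score_no)
print(round(f_temp_yes, 4), score_yes, score_no, round(p_yes, 3))
```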
Solution for example 15
Strengths and Weaknesses of Naïve Bayes

Strengths:
• Easy to implement
• Gives good results in most cases

Weaknesses:
• It must assume that the attributes are unrelated to one another, whereas in practice the attributes are sometimes dependent. This problem is addressed by an extension of Naïve Bayes, namely Bayesian Belief Networks.
However, dependencies may exist..
• Bayesian (belief) network
– Graphical model of causal relationships
– A trained Bayesian network can be used for classification.
– Two components:
• A directed acyclic graph
• A set of conditional probability tables (CPTs); each variable has one CPT.
• Each node represents a random variable, which may correspond to
– Attributes of D
– Hidden variables believed to form a relationship
• The CPT for a variable Y specifies the conditional distribution P(Y|Parents(Y)).
Training Bayesian Network
• Network topology (layout of nodes and arcs)
– Given observable variables, several algorithms exist
for learning the network topology.
– Human expert in the field of analysis may help in
network design.

• IF the topology is known and all variables are observable THEN
– Training: computing the CPT entries
• IF the topology is known and some variables are hidden THEN
– Gradient descent strategy: the Adaptive Probabilistic Network algorithm
3. Divide and conquer
Constructing Decision Tree
• Constructing a decision tree can be expressed
recursively.
– First, select an attribute to place at the root node,
and make one branch for each possible value.
• This splits up the example set into subsets, one for every
value of the attribute.
– Now the process can be repeated recursively for
each branch,
• using only those instances that actually reach the branch.
– If at any time all instances at a node have the same
classification, stop developing that part of the tree.
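The recursive construction can be sketched as follows. This is a minimal sketch rather than a full learner: it assumes the standard 14-instance nominal weather data (not reproduced in the slides), picks the split by information gain as one possible purity criterion, and does no pruning. The names `build_tree`, `info`, `gain` and `split` are chosen here for illustration.

```python
import math
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Assumption: the standard nominal weather data (last column is the class "play").
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def info(rows):
    """Entropy in bits of the class labels in rows."""
    counts = Counter(row[-1] for row in rows)
    return sum(-c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def split(rows, idx):
    """Group rows by the value of attribute idx: one subset per branch."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[idx], []).append(row)
    return subsets

def gain(rows, idx):
    """Information gained by splitting rows on attribute idx."""
    remainder = sum(len(s) / len(rows) * info(s) for s in split(rows, idx).values())
    return info(rows) - remainder

def build_tree(rows, attribute_indices):
    classes = {row[-1] for row in rows}
    if len(classes) == 1:
        return classes.pop()  # all instances agree: stop developing this part
    if not attribute_indices:
        return Counter(row[-1] for row in rows).most_common(1)[0][0]
    best = max(attribute_indices, key=lambda i: gain(rows, i))
    remaining = [i for i in attribute_indices if i != best]
    # One branch per value, built only from the instances that reach it.
    return {ATTRIBUTES[best]: {value: build_tree(subset, remaining)
                               for value, subset in split(rows, best).items()}}

tree = build_tree(WEATHER, list(range(len(ATTRIBUTES))))
print(tree)
```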
Which attribute to split on?
• Because we seek small trees, we would like the recursion to stop as soon as possible.
• We could choose the attribute that produces the purest daughter nodes.
– The measure of purity is called information (the units are called bits).
– Information is associated with each node of the tree:
• it represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node.
• The numbers of yes and no classes at the leaf nodes are [2, 3], [4, 0], and [3, 2], respectively,
• the information values of these nodes are:
info([2, 3]) = 0.971 bits
info([4, 0]) = 0.0 bits
info([3, 2]) = 0.971 bits
• the average information value of these, weighted by the number of instances that go down each branch, is
info([2, 3], [4, 0], [3, 2]) = (5/14) x 0.971 + (4/14) x 0 + (5/14) x 0.971 = 0.693 bits
• The root comprises nine yes and five no instances, corresponding to an information value of
info([9, 5]) = 0.940 bits
• Thus, Fig. 4.2(a) is responsible for an information gain
of
gain(outlook) = info([9, 5]) − info([2, 3], [4, 0], [3, 2]) = 0.940 − 0.693
= 0.247 bits
• which can be interpreted as the informational value of creating a
branch on the outlook attribute.
• Calculate the information gain for each attribute and split on the
one that gains the most information.
o gain(outlook) = 0.247 bits
o gain(temperature) = 0.029 bits
o gain(humidity) = 0.152 bits
o gain(windy) = 0.048 bits
• Therefore, we select outlook as the splitting attribute at the root
of the tree.
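The gains listed above can be recomputed from class counts alone. This is a sketch: the counts for outlook and the root come from the text, while the [yes, no] counts per value for temperature, humidity and windy are an assumption based on the standard weather data, since the slides give only the resulting gains.

```python
import math

def info(counts):
    """Information value (entropy) in bits of a list of class counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def info_split(partitions):
    """Average information of the daughter nodes, weighted by their size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

root = [9, 5]  # nine yes, five no
splits = {
    "outlook": [[2, 3], [4, 0], [3, 2]],      # from the text
    "temperature": [[2, 2], [4, 2], [3, 1]],  # assumed: hot, mild, cool
    "humidity": [[3, 4], [6, 1]],             # assumed: high, normal
    "windy": [[6, 2], [3, 3]],                # assumed: false, true
}

gains = {name: info(root) - info_split(parts) for name, parts in splits.items()}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"gain({name}) = {value:.3f} bits")
```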
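• The process is then repeated within each branch. At the branch considered here, humidity gives the highest gain, so we select humidity as the splitting attribute at this point. The resulting nodes are pure, so there is no need to split them any further and this branch is finished.
• The decision tree for weather data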
How to calculate information?
• The best splitting attribute is the one whose partitions come closest to being pure.
– The splitting attribute is chosen with an impurity function
– The most popular impurity functions used for decision tree learning are
• information gain and
• information gain ratio
– The C4.5 algorithm uses both information gain and information gain ratio
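Information Gain (1)
• The information gain measure is based on the entropy (information value) function from information theory:
entropy(D) = − Σ_{j=1..|C|} Pr(cj) log2 Pr(cj)
– where Pr(cj) is the proportion of class cj in the data set D, and the Pr(cj) sum to 1.
Information Gain (2)
Information Gain (3)
• We see the trend: as the data becomes purer, the entropy value becomes smaller.
• Thus, entropy measures the amount of impurity.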
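The trend is easy to observe numerically. The sketch below evaluates the entropy of a two-class data set for a few class proportions; the proportions are hypothetical and chosen only to show the effect of increasing purity.

```python
import math

def entropy(probabilities):
    """Entropy in bits; terms with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical proportions of the positive class, from mixed to pure.
proportions = [0.5, 0.2, 0.0]
values = [entropy([p, 1 - p]) for p in proportions]
for p, value in zip(proportions, values):
    print(f"positive = {p:.0%}, entropy = {value:.3f}")
```

An evenly mixed set has the maximum entropy of 1 bit, and a pure set has entropy 0.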
Information Gain (4)
• Then, we want to know which attribute can reduce the impurity most if it is used to partition D.
• To find out, every attribute is evaluated. Let the number of possible values of the attribute Ai be v. If we use Ai to partition the data D, we divide D into v disjoint subsets D1, D2, …, Dv. The entropy after the partition is
entropy_Ai(D) = Σ_{j=1..v} (|Dj| / |D|) x entropy(Dj)
Information Gain (4)
• The information gain of attribute Ai is computed with:
gain(D, Ai) = entropy(D) − entropy_Ai(D)
Answer of example 7.
Choosing splitting attribute
Gain Ratio (1)
• When an attribute has many possible values (an extreme example is an ID attribute), what happens to the
– entropy value?
– information gain?
• Information gain is biased toward such attributes. Gain ratio remedies this bias by normalizing the gain using the entropy of the data with respect to the values of the attribute. Our previous entropy computations were done with respect to the class attribute.
Gain Ratio (2)
gainRatio(D, Ai) = gain(D, Ai) / ( − Σ_{j=1..s} (|Dj| / |D|) log2(|Dj| / |D|) )
– where s is the number of possible values of Ai, and
– Dj is the subset of data that has the jth value of Ai.
• The attribute with the highest gainRatio value is chosen to extend the tree.
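The effect of the normalization can be seen by comparing the outlook split from the earlier weather example with an ID-like attribute that puts every instance in its own partition. This is a sketch working from class counts; `gain_ratio` is an illustrative name, and the sketch does not guard against an attribute with a single value, for which the denominator would be zero.

```python
import math

def info(counts):
    """Entropy in bits of a list of counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(class_counts, partitions):
    """Return (gain, gain ratio) for a split described by per-partition class counts."""
    total = sum(class_counts)
    remainder = sum(sum(p) / total * info(p) for p in partitions)
    gain = info(class_counts) - remainder
    # Entropy of the data with respect to the attribute's values (partition sizes).
    split_info = info([sum(p) for p in partitions])
    return gain, gain / split_info

# Outlook on the weather data: [yes, no] counts for sunny, overcast, rainy.
outlook_gain, outlook_ratio = gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]])

# ID-like attribute: 14 partitions holding one instance each (9 yes, 5 no).
id_partitions = [[1, 0]] * 9 + [[0, 1]] * 5
id_gain, id_ratio = gain_ratio([9, 5], id_partitions)

print(f"outlook: gain = {outlook_gain:.3f}, ratio = {outlook_ratio:.3f}")
print(f"ID:      gain = {id_gain:.3f}, ratio = {id_ratio:.3f}")
```

Every ID partition is pure, so the entropy after the split is 0 and the gain equals the full information at the root. The gain ratio reduces that advantage considerably, although on this small data set it does not remove it.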
Quiz:
Determine the splitting attribute using a) information gain and b) gain ratio!
Evaluation (1)
• Many ways and many measures to evaluate classifiers
– Accuracy
accuracy = number of correct classifications / total number of test cases
– Some use error rate
error rate = 1 − accuracy
Evaluation (2)
• Several methods to evaluate classifiers
– Holdout set
• the available data is split into a training set and a disjoint test set
• the test set is also called the holdout set
• mainly used when the data set D is large.
– Multiple Random Sampling
• When the available data set is small
• Perform random sampling n times.
– Each time a different training set and a different test set are
produced.
– This produces n accuracies.
– The final estimated accuracy on the data is the average of the n
accuracies.
Evaluation (3)
• Several methods to evaluate classifiers
– Cross-Validation
• When the data set is small, the n-fold cross-validation
method is very commonly used.
• available data is partitioned into n equal-size disjoint
subsets
– Each subset is then used as the test set and the remaining n-1
subsets are combined as the training set to learn a classifier.
– This procedure is then run n times, which gives n accuracies.
– Final estimated accuracy of learning from this data set is the
average of the n accuracies.
• 10-fold and 5-fold cross-validations are often used.
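The n-fold procedure can be sketched independently of any particular learner. In the sketch below, `train` and `predict` are hypothetical callables supplied by the caller, the majority-class learner is only a stand-in so that the example runs, and the toy data set is invented for illustration. Folds are formed by taking every n-th row after shuffling, so they are equal in size only when n divides the number of rows.

```python
import random

def cross_validation_accuracy(data, n_folds, train, predict, seed=0):
    """Average accuracy over n_folds disjoint test folds."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, test_fold in enumerate(folds):
        # The remaining n-1 folds are combined into the training set.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)
        correct = sum(1 for features, label in test_fold
                      if predict(model, features) == label)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / n_folds

# Hypothetical stand-in learner: always predict the majority class.
def train_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def predict_majority(model, features):
    return model

# Invented toy data: 50 instances, 40 labelled "yes" and 10 labelled "no".
data = [((i,), "yes" if i % 5 else "no") for i in range(50)]
accuracy = cross_validation_accuracy(data, 10, train_majority, predict_majority)
print(accuracy)
```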
Evaluation (4)
• In some applications, we are only interested in
one class.
– The class that the user is interested in is commonly
called the positive class, and the rest negative
classes (the negative classes may be combined into
one negative class).
– Accuracy may not be a good measure (intrusion
example).
• Example:
99% of the cases are normal in an intrusion detection data set. Then a
classifier can achieve 99% accuracy without doing anything by simply
classifying every test case as “not intrusion”. This is, however,
useless.

– Instead, we can use Precision and Recall.


Evaluation (4)
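• Precision and Recall
– measure how precise and how complete the classification is on the positive class.
• Confusion matrix:
                    Classified positive   Classified negative
Actual positive     TP                    FN
Actual negative     FP                    TN
– precision p = TP / (TP + FP)
– recall r = TP / (TP + FN)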
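The intrusion example can be made concrete with a few lines of code. This is a sketch on invented data: 100 test cases of which 1 is an intrusion, and a classifier that labels everything "normal". Returning 0.0 when a denominator is zero is a convention chosen here, since precision is strictly undefined when nothing is classified as positive.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall on the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented data: 99% of the cases are normal, and the classifier never reports an intrusion.
actual = ["intrusion"] * 1 + ["normal"] * 99
predicted = ["normal"] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision, recall = precision_recall(actual, predicted, positive="intrusion")
print(accuracy, precision, recall)
```

The accuracy looks excellent, yet the recall on the positive class shows that no intrusion is ever detected.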
4. Covering algorithm
• Take each class in turn and seek a way of
covering all instances in it, at the same time
excluding instances not in the class. This is
called a covering approach.
– At each stage we identify a rule that “covers” some
of the instances.
• By its very nature, this covering approach leads
to a set of rules rather than to a decision tree.
• If x > 1.2 then class = a
• However, the rule covers many b’s as well as a’s,
– so a new test is added to it by further splitting the space
horizontally as shown in the third diagram:
• If x > 1.2 and y > 2.6 then class = a
• This gives a rule covering all but one of the a’s.
– We could stop here, but if it were felt necessary to cover
the final a, another rule would be needed, perhaps
• If x > 1.4 and y < 2.4 then class = a
A simple covering algorithm
• Divide-and-conquer algorithms choose an
attribute to maximize information gain.
• Covering algorithm chooses an attribute–value
pair to maximize probability of the desired
classification.
– To include as many instances of the desired class as
possible and exclude as many instances of other
classes as possible.
• Suppose the new rule will cover a total of t instances
– p are positive examples
– Thus, t – p are other classes
Then choose the new term to maximize the ratio p/t.
A simple covering algorithm:
example
• We will form rules that cover each of the three
classes—hard, soft, and none—in turn. To
begin, we seek a rule:
• If ? then recommendation = hard
– For the unknown term ?, we have nine choices:
» age = young 2/8
» age = pre-presbyopic 1/8
» age = presbyopic 1/8
» spectacle prescription = myope 3/12
» spectacle prescription = hypermetrope 1/12
» astigmatism = no 0/12
» astigmatism = yes 4/12
» tear production rate = reduced 0/12
» tear production rate = normal 4/12
A simple covering algorithm:
example
• If astigmatism = yes then recommendation = hard
• This rule is quite inaccurate, getting only 4
instances correct out of the 12 that it covers. So
we refine it further:
• If astigmatism = yes AND ? then recommendation = hard
» age = young 2/4
» age = pre-presbyopic 1/4
» age = presbyopic 1/4
» spectacle prescription = myope 3/6
» spectacle prescription = hypermetrope 1/6
» tear production rate = reduced 0/6
» tear production rate = normal 4/6

• IF astigmatism = yes AND tear production rate = normal


THEN recommendation = hard
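The selection of each new term can be sketched directly from the p/t figures listed above. The candidates below are copied from the two lists. Breaking ties in favour of the larger p, and then the earlier candidate, is an assumption made for this sketch; the slides do not specify a tie-breaking rule.

```python
from fractions import Fraction

def best_term(candidates):
    """Pick the term with the largest p/t; ties go to the larger p, then the earlier entry."""
    return max(candidates,
               key=lambda term: (Fraction(*candidates[term]), candidates[term][0]))

# (p, t) for each candidate term of "If ? then recommendation = hard".
first_round = {
    "age = young": (2, 8),
    "age = pre-presbyopic": (1, 8),
    "age = presbyopic": (1, 8),
    "spectacle prescription = myope": (3, 12),
    "spectacle prescription = hypermetrope": (1, 12),
    "astigmatism = no": (0, 12),
    "astigmatism = yes": (4, 12),
    "tear production rate = reduced": (0, 12),
    "tear production rate = normal": (4, 12),
}

# (p, t) after restricting to the instances with astigmatism = yes.
second_round = {
    "age = young": (2, 4),
    "age = pre-presbyopic": (1, 4),
    "age = presbyopic": (1, 4),
    "spectacle prescription = myope": (3, 6),
    "spectacle prescription = hypermetrope": (1, 6),
    "tear production rate = reduced": (0, 6),
    "tear production rate = normal": (4, 6),
}

first = best_term(first_round)
second = best_term(second_round)
print(f"IF {first} AND {second} THEN recommendation = hard")
```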
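A simple covering algorithm:
example
– Refining the rule once more, until it covers only hard instances, adds the term spectacle prescription = myope (3/3).
– The resulting rule covers only 3 out of the 4 hard recommendations.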
• So, we delete these 3 from the set of instances and start
again, looking for another rule.
– Then, do the same process for class soft and none.
• What we have just described is the PRISM
method for constructing rules.