
CSD311: Artificial Intelligence

Kinds or types of learning problems

Main dimension:
I Supervised - answers/labels known; very common.
I Unsupervised - no answers/labels. Find patterns in data.
I Semi-supervised - not discussed in this course.
I Inductive learning - learning rules (will not be discussed).
I Reinforcement learning - reward based learning (already
discussed).
Secondary dimension:
I Offline or batch - all the data is available at once. Currently
the most common setting in applications.
I Online - data arrives sequentially and decisions have to be
made on each data item as it comes. All the data is not
available at once, so the model has to be updated as data
streams in.
Supervised learning

I Data can be divided into a set of independent variables and a
dependent or response variable.
I The dependence of the response variable on the independent
variable can be deterministic or probabilistic - most often it is
probabilistic.
I The goal is to learn a model that will predict the response
variable.
I Learning data is a set of pairs: L = {(xi , yi ) | xi ∈ X , yi ∈ Y},
where xi is a vector of independent variable values and yi is
the dependent/response variable. The dependent variable can
be:
I Discrete (and usually finite) - classification problems
I Continuous (a subset of R) - regression problems
Types of supervised learning
I Binary classification. Dependent variable has only one of two
values.
Examples: spam detection, specific disease diagnosis (e.g.
Covid test).
I Multi-class classification. Dependent variable can have one of
finite number (> 2) of values. Binary is a special case with 2
values.
Examples: disease diagnosis, predicting rating, OCR.
I Multi-label classification. Dependent variable can take a set
of values.
Examples: Objects in a scene, keywords for a piece of text,
categories for a product.
I Regression - dependent variable can take real values.
Examples: predicting numeric variables (wind speed,
precipitation, stock index). A short sketch of how the label
looks for each of these task types follows below.
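A minimal sketch (illustrative values only, using NumPy) of how the dependent variable y is typically represented for each of these task types:

```python
import numpy as np

# Toy feature matrix: 4 examples, 3 features each (values are made up).
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.3, 3.3, 6.0],
              [5.8, 2.7, 5.1]])

# Binary classification: one of two labels per example (e.g. spam / not spam).
y_binary = np.array([0, 1, 1, 0])

# Multi-class classification: one of C > 2 labels per example.
y_multiclass = np.array([0, 2, 1, 2])

# Multi-label classification: a *set* of labels per example,
# often encoded as a 0/1 indicator vector over all possible labels.
y_multilabel = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]])

# Regression: a real value per example (e.g. wind speed).
y_regression = np.array([3.7, 12.1, 8.4, 0.9])
```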
Unsupervised learning

I Find some kind of structure in data - typically cluster similar
objects together. Example: clustering regions in satellite
images.
I Needs notion of similarity/dissimilarity - some kind of a metric
between vectors.
I Often used as an initial step to simplify some supervised
problems.
Semi-supervised learning

I Basically a supervised problem but with very little labelled
data and a much larger amount of unlabelled data.
I Goal is to use unlabelled data to improve supervised learning.
Classification example: Iris data set

Figure: Three types of Iris flowers.

Figure: Sample feature vectors and labels from the Iris data set.
Data set URL: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
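The same data set also ships with scikit-learn; a short sketch of loading it that way (the UCI URL above serves the raw CSV files):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target        # X: 150 x 4 feature vectors, y: labels 0, 1, 2
print(iris.feature_names)            # sepal/petal lengths and widths (cm)
print(iris.target_names)             # setosa, versicolor, virginica
print(X[0], iris.target_names[y[0]]) # one sample feature vector and its label
```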
The Classification Problem I

I The data consists of the finite learning set L ⊂ X × Y, where
X is a feature space of vectors x ∈ X of dimension d and Y is a
finite set of labels. x = (x1 , . . . , xd ) where each xi is called a
feature or attribute and x is called the feature vector.
I x is the independent variable and y ∈ Y is the dependent or
response variable.
I Assume the true dependence of y on x is defined by
f : X → Y; then the classification problem requires that we
find a predictor f̂ that approximates f using the learning set L.
I Ideally, we want ∀x ∈ X , f̂(x) = f (x). Realistically, we expect
f̂ to be a good approximation of f , with a bound on the error
of f̂ over X . This error over X is called the generalization error.
I More commonly the error is estimated on a finite test set
T ⊂ X that is independent of L i.e. L ∩ T = ∅.
The Classification Problem II
I There are many ways to calculate error. One common error,
called the 0 − 1 classification error, is defined as:

\[
\mathrm{error}(x) = \begin{cases} 1 & f(x) \neq \hat{f}(x) \\ 0 & \text{otherwise} \end{cases}
\]

The average error on the test set T , where |T | = m, is:

\[
R_{\hat{f}} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{f(x_i) \neq \hat{f}(x_i)}
\]

where 1P is the indicator function whose value is 1 if predicate
P is true and 0 otherwise. Often Rf̂ is called the risk.
I The goal of a learning algorithm A : L → F is to use the
learning set L to infer a predictor f̂ ∈ F that minimizes error
on the test set T . Here F is the set of all possible predictors.
A short sketch of computing the 0 − 1 risk on a test set follows.
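A minimal sketch of estimating the 0 − 1 risk on a test set; `f_hat` is a hypothetical predictor, and `X_test`, `y_test` are assumed to hold the test vectors and their true labels:

```python
import numpy as np

def empirical_risk(f_hat, X_test, y_test):
    """Average 0-1 loss of predictor f_hat on the test set T."""
    y_pred = np.array([f_hat(x) for x in X_test])
    # (1/m) * sum of indicator[f(x_i) != f_hat(x_i)]
    return np.mean(y_pred != np.asarray(y_test))
```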
The Classification Problem III

I The data type for a feature can be categorical, ordinal or
numeric - i.e. X ⊆ D1 × D2 × · · · × Dd where Di , i = 1..d, is the
domain of feature xi . Since categorical and ordinal data are
usually converted to numbers, most often X = Rd .
The iid (independent, identically distributed) assumption

I Each feature xi can be thought of as a random variable that
has a distribution.
I The feature vector x then has a multivariate (d-variate) joint
distribution built from the individual feature distributions.
I L can be assumed to be constructed by randomly picking an
x ∈ X and adding it to L along with its label y .
I The iid assumption states that each x ∈ L is picked
independently and that the joint distribution of x is the same
for all vectors x ∈ X .
I Learning algorithms A and any error bound derivations make
the iid assumption.
I Further, the iid assumption also applies when the predictor/
model is tested or deployed: the vectors in the test set T ⊂ X
are assumed to have the same distribution as the vectors in L
(see the sketch below).
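A sketch of the iid assumption in code: L and T are both built by drawing (x, y) pairs independently from one fixed joint distribution (the particular distribution and the sample sizes below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair():
    """Draw one (x, y) pair from a fixed joint distribution (iid across draws)."""
    y = rng.integers(0, 2)                          # class label, P(C1) = P(C2) = 0.5
    x = rng.normal(loc=2.0 * y, scale=1.0, size=3)  # class-conditional feature vector
    return x, y

L = [sample_pair() for _ in range(100)]   # learning set
T = [sample_pair() for _ in range(30)]    # test set: same distribution as L
```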
Bayes decision theory

I The dependence of the label on the feature vector is generally
stochastic since we never really have full/perfect information.
There is always some uncertainty in both the actual data
values and in whether every independent variable influencing
the dependent variable is present.
I So, any predictor will make some error. A natural question is:
what is the least expected error that the best predictor can
achieve?
I This will be a lower bound on the expected error of any
predictor.
I The answer: the Bayes decision rule (BDR) is the least-error,
i.e. optimal, predictor.
An Example I
I Assume we want to predict whether a person originates from
the North of India or the South of India. Let North correspond
to class label C1 and South to C2 .
I In the absence of any information about the individual we can
reason as follows. Let fN be the fraction of the Indian
population in the North and fS = 1 − fN the fraction in the
South. Then use the rule:
If (fN > fS ) then C1 else C2 .
I Identify fN with the a priori probability P(C1 ) and fS with
P(C2 ); then the rule can be written as:
if (P(C1 ) > P(C2 )) then C1 else C2 .
I In addition, assume we have the melanin concentration (which
leads to differences in skin colour) of the individual and we
also have the distributions of melanin concentration for
individuals in classes C1 and C2 .
An Example II

Figure: Class-conditional distributions p(x|C1 ), p(x|C2 ) of melanin concentration
for C1 , C2 . R1 is the region to the left of threshold t and R2 the region to its right.
I Now the rule can be updated to:
if (P(C1 |x) > P(C2 |x)) then C1 else C2 .
An Example III

I We can calculate P(Ci |x) by using Bayes rule:

\[
P(C_i \mid x) = \frac{P(x \mid C_i)\, P(C_i)}{P(x)}, \qquad i \in \{1, 2\}
\]

I More generally, if X is d-dimensional then the multivariate
joint distribution together with the priors will carve out
subspaces or regions where P(Ci |x) > P(Cj |x), i ≠ j. The
label Ci is given to all x that fall in that region.
I In the earlier figure, if x ∈ R1 then C1 , else if x ∈ R2 then C2 .
At t either label can be given. A sketch of this rule in code follows.
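A sketch of this decision rule for the melanin example, assuming hypothetical Gaussian class-conditional densities and priors (all parameter values below are made up for illustration):

```python
from scipy.stats import norm

P = {"C1": 0.55, "C2": 0.45}                    # assumed priors (fractions fN, fS)
pdf = {"C1": norm(loc=0.4, scale=0.1).pdf,      # assumed p(x|C1): melanin in the North
       "C2": norm(loc=0.7, scale=0.15).pdf}     # assumed p(x|C2): melanin in the South

def bayes_decision(x):
    """Label with the larger posterior; P(x) cancels, so compare p(x|Ci) * P(Ci)."""
    score = {c: pdf[c](x) * P[c] for c in P}
    return max(score, key=score.get)

print(bayes_decision(0.5))   # lands in R1 or R2 depending on the assumed densities
```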
Errors I

Figure: Error area for the BDR shown on the class-conditional densities p(x|C1 ), p(x|C2 ) of melanin concentration for C1 , C2 .

I The BDR misclassifies those x that lie in the hatched region in
the figure above, leading to error.
I Visually, if the threshold t is moved either to the left or right
the error area increases. The minimum error is obtained when
classification is done by the BDR. This can be proved
rigorously.
Errors II

I The error probability can be written as:

\[
p(\mathrm{error}) = \sum_{i=1}^{C} p(\mathrm{error} \mid C_i)\, P(C_i)
\]

where p(error |Ci ) is the probability of misclassifying objects
from class Ci , given by:

\[
p(\mathrm{error} \mid C_i) = \int_{R_i^{c}} p(x \mid C_i)\, dx
\]

where Ri^c is the complement of region Ri in the feature space X .
Errors III

Substituting for p(error |Ci ) gives:

\[
\begin{aligned}
p(\mathrm{error}) &= \sum_{i=1}^{C} P(C_i) \int_{R_i^{c}} p(x \mid C_i)\, dx \\
&= \sum_{i=1}^{C} P(C_i)\Big[1 - \int_{R_i} p(x \mid C_i)\, dx\Big] \\
&= 1 - \overbrace{\Big[\sum_{i=1}^{C} P(C_i) \int_{R_i} p(x \mid C_i)\, dx\Big]}^{(A)} \qquad (1)
\end{aligned}
\]

I To minimize the error, maximize the term (A) in (1). That is, choose the
regions Ri such that ∫_{Ri} P(Ci )p(x|Ci )dx is maximum. To maximize the
sum, maximize each component of the sum (each of which is positive).
Errors IV

I This implies that the probability of a correct label is:

\[
p(\mathrm{correct}) = \sum_{i=1}^{C} \int_{R_i} \max_i \big[P(C_i)\, p(x \mid C_i)\big]\, dx
\]

and the probability of error is p(error ) = 1 − p(correct).
I Choosing each Ri as the region where P(Ci )p(x|Ci ) is maximal is exactly
what the BDR does. Therefore, the BDR indeed minimizes the probability
of error. A numerical sketch for the two-class 1-D case follows.
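A numerical sketch for the two-class 1-D case, using the same kind of assumed Gaussian class conditionals and priors as earlier: the integral of max_i P(Ci )p(x|Ci ) is approximated by a Riemann sum, and p(error) = 1 − p(correct):

```python
import numpy as np
from scipy.stats import norm

P1, P2 = 0.55, 0.45                   # assumed priors
p1 = norm(loc=0.4, scale=0.1).pdf     # assumed p(x|C1)
p2 = norm(loc=0.7, scale=0.15).pdf    # assumed p(x|C2)

xs = np.linspace(-1.0, 2.0, 20001)    # grid covering essentially all the mass
dx = xs[1] - xs[0]
joint1, joint2 = P1 * p1(xs), P2 * p2(xs)   # P(Ci) * p(x|Ci) on the grid

# p(correct) = integral of max_i P(Ci) p(x|Ci); p(error) = 1 - p(correct)
p_correct = np.sum(np.maximum(joint1, joint2)) * dx
p_error = 1.0 - p_correct
print(p_error)
```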
Not all errors are equal I

I All classification errors are not equally bad. Example: if a Covid
+ve person is classified as -ve the consequences are worse than
classifying a Covid -ve person as +ve.
I This is handled by weighting factors. Let λij be the weight when the
true label Ci is misclassified as Cj . For the case where there are only
two labels the risk r is:

\[
r = \lambda_{12} \int_{R_2} P(C_1)\, p(x \mid C_1)\, dx + \lambda_{21} \int_{R_1} P(C_2)\, p(x \mid C_2)\, dx
\]

The integral r1 = λ12 ∫_{R2} p(x|C1 )dx is the risk or loss associated with
giving the wrong label C2 to an object whose real label is C1 .
I More generally, for C class labels the risk or loss for label Ck is:

\[
r_k = \sum_{i=1}^{C} \lambda_{ki} \int_{R_i} p(x \mid C_k)\, dx
\]
Not all errors are equal II
I The goal is to choose the regions Ri such that the expected or average risk
is minimized. The expected risk is:

\[
\begin{aligned}
r &= \sum_{k=1}^{C} P(C_k)\, r_k \\
&= \sum_{k=1}^{C} P(C_k) \sum_{i=1}^{C} \lambda_{ki} \int_{R_i} p(x \mid C_k)\, dx \\
&= \sum_{i=1}^{C} \int_{R_i} \Big( \sum_{k=1}^{C} \lambda_{ki}\, p(x \mid C_k)\, P(C_k) \Big)\, dx \qquad (2)
\end{aligned}
\]

I To minimize the risk (2), choose the regions Ri so that each Ri receives a
suitable label Ci . This means minimizing each of the C integrals, which is
achieved by labelling x as follows: x ∈ Ri if ℓi < ℓj ∀ j ≠ i, where
ℓi = Σ_{k=1}^{C} λki p(x|Ck )P(Ck ) and ℓj = Σ_{k=1}^{C} λkj p(x|Ck )P(Ck ) with j ≠ i.
Not all errors are equal III
I The above is really a weighted version of the BDR with the λki as weights.
The loss/risk can be specified by a loss/risk matrix:

\[
\begin{pmatrix}
\lambda_{11} & \dots & \lambda_{1C} \\
\vdots & \ddots & \vdots \\
\lambda_{C1} & \dots & \lambda_{CC}
\end{pmatrix}
\]

I Usually λii is 0, that is, there is no loss for correct classification.
When we have the 0 − 1 loss defined by:
\[
\lambda_{ki} = \begin{cases} 0 & k = i \\ 1 & k \neq i \end{cases}
\]

then we get the minimum classification error probability (a sketch of the loss-weighted rule follows below).
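A sketch of the loss-weighted rule: given a matrix of λki values (the 2×2 matrix below is a hypothetical example that penalises misclassifying C1 more heavily), assign x to the class with the smallest ℓi = Σk λki p(x|Ck )P(Ck ):

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.55, 0.45])               # assumed P(C1), P(C2)
pdfs = [norm(loc=0.4, scale=0.1).pdf,         # assumed p(x|C1)
        norm(loc=0.7, scale=0.15).pdf]        # assumed p(x|C2)

# Loss matrix: lam[k, i] is the loss for giving label C_{i+1} when the true
# label is C_{k+1}; zero on the diagonal, misclassifying C1 costs more.
lam = np.array([[0.0, 5.0],
                [1.0, 0.0]])

def risk_weighted_label(x):
    """Return the index of the class with the smallest conditional risk l_i."""
    joint = np.array([pdf(x) * p for pdf, p in zip(pdfs, priors)])  # p(x|Ck) P(Ck)
    ell = lam.T @ joint        # ell[i] = sum_k lam[k, i] * p(x|Ck) * P(Ck)
    return int(np.argmin(ell))
```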


Discriminant functions
I Discriminant functions are a set of functions
gi : X → R, i = 1..C such that if gi (x) > gj (x), ∀j ≠ i then x
is given label Ci .
I The Bayes decision rule becomes: gi (x) = −ri , where ri is the
risk of giving label Ci to x (the sign is negative because the risk is minimized).
Discriminant functions vs Bayes decision rule (BDR) I

I To apply the BDR we needed the distribution p(x|Ci ) for each Ci and
the a priori probability P(Ci ).
I The products p(x|Ci )P(Ci ) demarcate regions where one particular P(Ci |x)
dominates the others, giving class label Ci to x. Here the P(Ci |x) become
the discriminant functions. Similarly, if ri is the risk/loss of giving
label Ci to x we can treat −ri as a discriminant function (the negative of
ri since ri is minimized).
I The above approach requires inferring the distributions p(x|Ci ) using L.
This means using L to estimate the parameters of a chosen
distribution. This is called the parametric approach (see the sketch
after this list).
I An alternative method is to directly use L to find class separation
boundaries by minimizing classification error on the learning set L.
This is the non-parametric approach.
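A sketch of the parametric approach under a Gaussian assumption: L is used to estimate a mean, covariance matrix and prior for each class, and these define p(x|Ci )P(Ci ) for the decision rule (class-conditional Gaussianity is an assumption here, not a property of the data):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classes(X, y):
    """Estimate (mean, covariance, prior) for each class label appearing in y."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),            # estimate of mu_i
                     np.cov(Xc, rowvar=False),   # estimate of Sigma_i
                     len(Xc) / len(X))           # estimate of P(Ci)
    return params

def predict(params, x):
    """Label x with the class maximizing p(x|Ci) * P(Ci)."""
    scores = {c: multivariate_normal(mean=m, cov=S).pdf(x) * p
              for c, (m, S, p) in params.items()}
    return max(scores, key=scores.get)
```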
Discriminant functions vs Bayes decision rule (BDR) II

I We start by studying the simplest possible boundaries - a straight
line, or in higher dimensions a hyperplane.
Discriminant functions vs Bayes decision rule (BDR) III

I One reason for studying linear discriminants (apart from their
simplicity) is that if the p(x|Ci ) are Gaussians then under some
conditions the boundaries are linear.
Example of a multi-variate Gaussian
Let x for class Ci be distributed as a multi-variate Gaussian of d
dimensions with mean µ̄i and covariance matrix Σi (|Σi | is the
determinant):

\[
p(x \mid C_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(x - \bar{\mu}_i)^{T} \Sigma_i^{-1} (x - \bar{\mu}_i)}
\]

Figure: Example distribution with d = 2 (Image from Mathworld Wolfram).


Define gi (x) = ln p(x|Ci ) + ln P(Ci ).
We write x instead of x|Ci . For a d-variate Gaussian:

\[
g_i(x) = -\tfrac{1}{2}(x - \bar{\mu}_i)^{T} \Sigma_i^{-1}(x - \bar{\mu}_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(C_i)
\]
Case 1: Σi = σ²I . The terms (d/2) ln 2π and (1/2) ln |Σi | can be ignored -
they are constants, the same for every class.

\[
g_i(x) = -\frac{1}{2\sigma^2}\big(x^{T}x - 2\bar{\mu}_i^{T}x + \bar{\mu}_i^{T}\bar{\mu}_i\big) + \ln P(C_i)
\]

Ignoring x^T x, which is the same for all Ci :

\[
g_i(x) = w_i^{T}x + w_{i0}, \quad \text{where } w_i = \frac{\bar{\mu}_i}{\sigma^{2}}, \quad w_{i0} = -\frac{1}{2\sigma^{2}}\bar{\mu}_i^{T}\bar{\mu}_i + \ln P(C_i)
\]

So, the discriminant function is linear - a hyperplane.


Case 2: Σi = Σ (the same covariance matrix for all classes).

\[
g_i(x) = -\tfrac{1}{2}(x - \bar{\mu}_i)^{T} \Sigma^{-1}(x - \bar{\mu}_i) + \ln P(C_i)
\]

Expanding and dropping x^T Σ^{-1} x, which is the same for all classes:

\[
g_i(x) = w_i^{T}x + w_{i0}, \quad \text{where } w_i = \Sigma^{-1}\bar{\mu}_i, \quad w_{i0} = -\tfrac{1}{2}\bar{\mu}_i^{T}\Sigma^{-1}\bar{\mu}_i + \ln P(C_i)
\]

The discriminant function is again a hyperplane.


Case 3: Σi arbitrary.

\[
g_i(x) = x^{T}W_i x + w_i^{T}x + w_{i0}, \quad \text{where}
\]
\[
W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\bar{\mu}_i, \qquad w_{i0} = -\tfrac{1}{2}\bar{\mu}_i^{T}\Sigma_i^{-1}\bar{\mu}_i - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(C_i)
\]

The discriminant function is now quadratic in x, so in general the decision boundaries are no longer hyperplanes.
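A sketch that turns the Case 2 formulas into code: given per-class means, a shared covariance Σ and priors (all supplied by the caller, e.g. estimated from L), the weights wi , wi0 are computed exactly as above and the class with the largest gi (x) is chosen:

```python
import numpy as np

def linear_discriminant(mu, Sigma, prior):
    """Return (w_i, w_i0) for Case 2: g_i(x) = w_i^T x + w_i0 with shared Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ mu                               # w_i = Sigma^{-1} mu_i
    w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)  # w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(Ci)
    return w, w0

def classify(x, means, Sigma, priors):
    """Assign x to the class with the largest linear discriminant g_i(x)."""
    gs = [w @ x + w0
          for w, w0 in (linear_discriminant(m, Sigma, p)
                        for m, p in zip(means, priors))]
    return int(np.argmax(gs))
```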
