Machine Learning

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
The process of machine learning is similar to that of data mining. Both systems search through data to look for patterns. However, instead of extracting data for human comprehension, as is the case in data mining applications, machine learning uses the data to improve the program's own understanding. Machine learning programs detect patterns in data and adjust program actions accordingly.
Key Points
• ML turns data into information (meaningful data).
• ML is a combination of computer science, engineering and statistics.
• ML uses statistical concepts to process data and to solve problems.
• ML interprets data and acts on it.
• ML programs use computers to optimize a performance criterion using past experience.
Expert Systems: The expert systems are the computer applications developed to solve complex problems in a particular domain, at the level of extraordinary human intelligence and expertise. They are:
• High performance
• Understandable
• Reliable
• Highly responsive
Features and attributes are individual measurable properties of the data. When learning, the algorithm generates a function that maps inputs to desired outputs. The standard formulation of the supervised learning task is the classification problem: the learner is required to learn a function which maps a vector into one of several classes.

Analyze the input data: This means looking at the data from the previous task. It involves recognizing patterns, identifying attributes and detecting novelty.
Train the algorithm: The actual machine learning takes place here. This step and the next step are where the "core" of the algorithm lies. Depending on the algorithm, feed the algorithm good clean data from the first two steps and extract knowledge or information. The knowledge extracted is stored in a format that is readily usable by a machine for the next two steps.
Test the algorithm: The information learned in the previous step is put to use and checked for quality.
Use it: Make a real program to do some task, and check once again if all the previous steps worked as you expected.
Applications of Machine Learning
Learning Associations: This is a method for discovering interesting relations between variables in large databases. It is used to find regularities among large-scale datasets. The concept of learning associations is mainly used in market basket analysis, e.g., finding an association between products bought by customers: if people who buy X typically also buy Y, and if there is a customer who buys X and does not buy Y, he or she is a potential Y customer. Once such customers are identified, they can be targeted for cross-selling.
Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
Pattern Recognition: It is a branch of machine learning that focuses on the recognition of patterns and regularities in data.
If the above model is reliable and valid, use this model to predict the Selling
Price of any house based on its area (Sq. ft.) and the number of bathrooms
(#Baths).
General Nature of Linear Regression
A model that is curvilinear in x can still be linear in the parameters. For example, in Y = b0 + b1*x1 + b2*x1^2, let x1^2 = x2. Therefore,
Y = b0 + b1*x1 + b2*x2
Simple Linear Regression
y = b0 + b1*x + e
Selecting Independent Variables: Scatter Plots
Scatter Plots are used to visualize the relationship between any two
variables. For regression analysis, we are looking for strong linear
relationship between the independent and dependent variable.
Select Variables: Identify possible independent variables (predictors; they should make sense). Use scatter plots and correlations for selection.
Test Model: Test to see if all coefficients are significant (reliability). Establish validity (are relationships as expected, do predictions match actuals?).
Implement and Use: Implement the model in a Decision Support System. Incorporate error in predictions. Outline limitations/constraints of the model.
where:
yi = observed value of the dependent variable for the ith observation
y^i = estimated value of the dependent variable for the ith observation
Least Square Method
(Simple Linear Regression)
Slope for the estimated regression line (hypothesis function):
b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
y-intercept for the estimated regression line (hypothesis function):
b0 = y_bar - b1 * x_bar
where:
xi = value of independent variable for ith observation
yi = value of dependent variable for ith observation
x_bar = mean value for independent variable
y_bar = mean value for dependent variable
n = total number of observations
Simple Linear Regression
Estimate the regression line (hypothesis function) for the given data.
Height (x) Weight (y)
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
Simple Linear Regression
S.No. | Height (x) cms | Weight (y) kgs | (xi - x_bar) | (yi - y_bar) | (xi - x_bar)(yi - y_bar) | (xi - x_bar)^2
[Scatter plot: Height (x) vs Weight (y)]
Simple Linear Regression
Estimate the regression line (hypothesis function) for the given data.
Area in Square feet (x)  House Price in $1000 (y)
1400 245
1600 312
1700 279
1875 308
1100 199
1550 219
2350 405
2450 324
1425 319
1700 255
Simple Linear Regression
[Scatter plot: Area (x) vs House Price in $1000 (y), with fitted trendline y = 0.1098x + 98.248]
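As a sanity check, the slope and intercept formulas applied to the table above reproduce the trendline shown on the chart. A minimal Python sketch with no libraries:

```python
# Least squares fit for the area vs. house-price data above.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # area (sq. ft.)
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # price ($1000)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar  # y-intercept

print(f"y = {b1:.4f}x + {b0:.3f}")  # matches the chart: y = 0.1098x + 98.248
```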
Simple Linear Regression
Estimate the regression line (hypothesis function) for the given data.
Year (x) Sales$ (y)
2000 1350
2001 1610
2002 1430
2003 1790
2004 1550
2005 1670
2006 1870
2007 1850
2008 1920
2009 1980
Simple Linear Regression
[Scatter plot: Year (x) vs Sales $ (y), with fitted trendline y = 62.424x - 123427]
Simple Linear Regression
Estimate the regression line (hypothesis function) for the given data.
Study Hours (x) Test Results % (y)
60 67
40 61
50 73
65 80
35 60
40 55
50 62
30 50
45 61
55 70
Simple Linear Regression
[Scatter plot: Study Hours (x) vs Test Results % (y), with fitted trendline y = 0.6955x + 31.212]
Student | xi | yi | (xi - x_bar) | (yi - y_bar) | (xi - x_bar)(yi - y_bar) | (xi - x_bar)^2
1       | 95 | 85 |  17 |   8 | 136 | 289
2       | 85 | 95 |   7 |  18 | 126 |  49
3       | 80 | 70 |   2 |  -7 | -14 |   4
4       | 70 | 65 |  -8 | -12 |  96 |  64
5       | 60 | 70 | -18 |  -7 | 126 | 324
(x_bar = 78, y_bar = 77)
Estimate the Karl Pearson's correlation coefficient for the given data.
Height (x) Weight (y)
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
Karl Pearson's Coefficient of Correlation
Height (x)  Weight (y)  xy  x^2  y^2
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
r = 0.977
R-Square
where
yi is the observed output for the input xi
y_bar is the mean of the observed outputs
y^i is the estimated output for the input xi
sum((yi - y_bar)^2) = SST is the total sum of squared deviations in y from its mean.
sum((y^i - y_bar)^2) = SSR is the sum of squares due to regression.
sum((yi - y^i)^2) = SSE is the sum of squared residuals (errors).
R-Square
Relationship Among SST, SSR, SSE
SST = SSR + SSE
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
R-Square
where:
SSR = sum of squares due to regression
SST = total sum of squares
R-Square
R^2 = SSR / SST = 1787.448 / 1872.1 = 0.955
Multiple R
Multiple R = sqrt(R-Square) = sqrt(SSR/SST) = sqrt(0.955) = 0.977
Standard Error of Estimate
The standard error of the estimate is a measure of the accuracy of
predictions. It is the measure of variation of an observation made around the
computed regression line.
sigma_est = sqrt( sum((Yi - Y^i)^2) / N )
The denominator N is the sample size (n) reduced by the number of model parameters (p) estimated from the same data:
N = (n - p) for p regressors
N = (n - p - 1) if an intercept is used
Standard Error of Estimate
sigma_est = sqrt( sum((yi - y^i)^2) / N ) = sqrt(84.652 / 8) = 3.253
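The 0.955 and 3.253 figures can be reproduced from the height-weight data with a few lines of Python (assuming, as the slides do, a simple regression with an intercept, so N = n - 2):

```python
# Verify SST, SSE, R^2 and the standard error for the height-weight data.
x = [151, 174, 138, 186, 128, 136, 179, 163, 152, 131]  # height (cms)
y = [63, 81, 56, 91, 47, 57, 76, 72, 62, 48]            # weight (kgs)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]                      # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual sum of squares
ssr = sst - sse                                         # regression sum of squares
r_squared = ssr / sst
std_error = (sse / (n - 2)) ** 0.5                      # N = n - p - 1 with intercept

print(round(sst, 1), round(sse, 3))              # 1872.1 84.652
print(round(r_squared, 3), round(std_error, 3))  # 0.955 3.253
```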
Logistic Regression
Why not linear regression?
Examples of Binary Classification Model
Odds = p / (1 - p) = probability of event / probability of no event
• Transforming the probability to odds removes the upper bound.
Why Odds?
[Plot: the logit (log-odds) transform of p; as p ranges over (0, 1) the log-odds range over (-inf, +inf)]
Logistic Regression
What is pi?

Logistic Regression
pi / (1 - pi) = e^(b0 + b1*xi1 + b2*xi2 + ... + bk*xik)

pi = e^(b0 + b1*xi1 + b2*xi2 + ... + bk*xik) / (1 + e^(b0 + b1*xi1 + b2*xi2 + ... + bk*xik))

• We can simplify it by dividing both numerator and denominator by the numerator to obtain:
pi = 1 / (1 + e^-(b0 + b1*xi1 + b2*xi2 + ... + bk*xik))
Maximum Likelihood Estimation
• In which
h(x) = 1 / (1 + e^-(a + b*x))
Maximum Likelihood Estimation
L(b) = prod_{i=1..n} (h(xi))^yi * (1 - h(xi))^(1 - yi)
l(b) = log(L(b))

Estimating the Parameters
l(b) = sum_{i=1..n} yi * log(h(xi)) + (1 - yi) * log(1 - h(xi))
Using Gradient Descent Algorithm
Cost Function: J(b)
Objective: min_b J(b)

Repeat {
  bj := bj - alpha * d/dbj J(b)   (simultaneously update all bj)
}

Repeat {
  bj := bj - alpha * sum_{i=1..n} (h(xi) - yi) * xij   (simultaneously update all bj)
}
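The update rule above can be run as-is on toy data. A small Python sketch; the dataset (hours studied vs. pass/fail), learning rate and iteration count are illustrative choices, not from the source:

```python
import math

# Logistic regression trained with the batch gradient-descent update above:
#   bj := bj - alpha * sum_i (h(xi) - yi) * xij
# Toy 1-feature data (hypothetical): hours studied -> pass (1) / fail (0).
# Each row carries a leading 1.0 for the intercept term b0.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.5], [1.0, 4.0], [1.0, 4.5]]
y = [0, 0, 0, 1, 1, 1]
beta = [0.0, 0.0]
alpha = 0.1

def h(xi, beta):
    """Sigmoid hypothesis h(x) = 1 / (1 + e^-(beta . x)), clamped for safety."""
    z = sum(b * v for b, v in zip(beta, xi))
    z = max(-500.0, min(500.0, z))
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(5000):
    # simultaneous update of all bj: compute the full gradient first
    grad = [sum((h(xi, beta) - yi) * xi[j] for xi, yi in zip(X, y))
            for j in range(len(beta))]
    beta = [b - alpha * g for b, g in zip(beta, grad)]

preds = [1 if h(xi, beta) >= 0.5 else 0 for xi in X]
print(preds)  # the learned boundary separates the two groups
```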
Decision Trees
Classification and Decision Tree
• A tree is built in which the leaf nodes contain the output category.
• The class of the output is predicted based on the rules generated
from the tree structure.
• Learned trees can be represented as a set of IF-THEN rules as well.
Decision Trees
Decision trees can be used to solve problems that have the following
features:
• Instances or tuples are represented as attribute value pairs, where the
attributes take a small number of disjoint possible values.
• The target function has discrete output values such as yes or no.
• Decision trees require disjunctive descriptions which implies that the output
of the decision tree can be represented using a rule-based classifier.
• Decision tree can be used when training data contains errors and when it
contains missing attribute values.
Worked Example: Solving ID3 by Hand

Let's see how to solve this. Note that in any dataset (table), the last column is the class attribute; its values end up as the leaf nodes of the tree.

Here the class attribute is Profit, with class P = Profit (Up) = 5 and class N = Profit (Down) = 5.

Step 1: Compute the entropy of the whole set:

I(P, N) = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))

With P = 5 and N = 5, I(P, N) = 1.

Step 2: Up and Down are the class values and will become leaf nodes, so some other attribute has to become the root node. For each candidate attribute, count P and N for every value and compute the entropy of the split. For Age:

Age  | Pi | Ni
Old  |  0 |  3
Mid  |  2 |  2
New  |  3 |  0

If either Pi or Ni is zero for a value, the entropy of that value is zero, so I(0, 3) = 0 and I(3, 0) = 0, while I(2, 2) = 1.

E(Age) = ((0+3)/10)(0) + ((2+2)/10)(1) + ((3+0)/10)(0) = 0.4
Information Gain(Age) = I(P, N) - E(Age) = 1 - 0.4 = 0.6

For Competition:

Competition | Pi | Ni | I(Pi, Ni)
Yes         |  1 |  3 | 0.8113
No          |  4 |  2 | 0.9183

E(Competition) = ((1+3)/(5+5))(0.8113) + ((4+2)/(5+5))(0.9183) = 0.8755
Information Gain(Competition) = 1 - 0.8755 = 0.1245

The remaining attribute (Type) is evaluated the same way and yields the lowest gain.

Now, from the root node we draw one branch for every value of the winning attribute. Age has the highest gain, so it becomes the root. The Old branch contains only Down tuples and the New branch only Up tuples, so both end in leaf nodes (Down and Up). The Mid branch is still mixed, so the same computation is repeated on its tuples with the remaining attributes, where Competition wins:

Competition = Yes -> Down, Competition = No -> Up

This is how the tree gets built.
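The entropy and gain arithmetic above is easy to check mechanically. A small Python sketch, with the attribute counts copied from the worked example:

```python
import math

def entropy(p, n):
    """I(p, n) = -p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n)); 0 if a count is 0."""
    if p == 0 or n == 0:
        return 0.0
    total = p + n
    fp, fn = p / total, n / total
    return -fp * math.log2(fp) - fn * math.log2(fn)

def gain(value_counts, P, N):
    """Information gain = I(P, N) - weighted entropy of the split."""
    e = sum((p + n) / (P + N) * entropy(p, n) for p, n in value_counts.values())
    return entropy(P, N) - e

# Counts (Up = P, Down = N) for each attribute value, as in the tables above.
P, N = 5, 5
age = {"old": (0, 3), "mid": (2, 2), "new": (3, 0)}
competition = {"yes": (1, 3), "no": (4, 2)}

print(round(gain(age, P, N), 4))          # 0.6
print(round(gain(competition, P, N), 4))  # 0.1245
```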
Decision Tree Induction: Attribute Selection
Intuitively: A good attribute splits the examples into subsets that are (ideally) all positive or all negative.
[Figure: candidate splits, one producing pure subsets and one producing mixed subsets]
Decision Tree Induction: Decision Boundary
[Figure sequence: a decision tree's axis-parallel splits progressively carving the plane into regions of + and • examples]
Basic Decision Tree Learning Algorithm
Disadvantages
• C4.5 constructs empty branches with zero values.
• Overfitting happens when the algorithm model picks up data with
uncommon characteristics, especially when data is noisy.
Basic Decision Tree Learning Algorithm
Disadvantages
• It can split on only one variable.
• Trees formed may be unstable.
ID3 Decision Tree
Simplified ID3 Algorithm

Problem based on ID3
S = [9+, 5-]
Calculate Entropy(S).
Entropy(S) = 0.94
Problem based on ID3
S = [9+, 5-]
S_Sunny = [2+, 3-]
S_Overcast = [4+, 0-]
S_Rain = [3+, 2-]
Gain(S, Outlook) = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) = 0.247
Problem based on ID3
S = [9+, 5-]
S_Hot = [2+, 2-]
S_Mild = [4+, 2-]
S_Cool = [3+, 1-]
Gain(S, Temperature) = 0.94 - (4/14)(1) - (6/14)(0.918) - (4/14)(0.811) = 0.029
Problem based on ID3
S = [9+, 5-]
S_High = [3+, 4-]
S_Normal = [6+, 1-]
Gain(S, Humidity) = 0.94 - (7/14)(0.985) - (7/14)(0.592) = 0.151
• The Outlook Attribute wins pretty handily, so it's placed at the root of
the Decision Tree.
• Afterwards, the Decision Tree is expanded to cover Outlook's possible
values.
• In Play, the Outlook can be Sunny, Overcast or Rain, so all of the
examples that have a Sunny Outlook are funneled through the Sunny
path, the examples with an Overcast Outlook are diverted to the
Overcast path and so on.
Problem based on ID3
The root is Outlook, covering instances (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14). Its Sunny branch is still mixed and splits on Humidity over instances (1, 2, 8, 9, 11), with branches High and Normal. Its Rain branch splits on Wind over instances (4, 5, 6, 10, 14). The Overcast branch is pure and becomes a Yes leaf.
[Tree diagram: Outlook -> { Sunny: Humidity (High / Normal), Overcast: Yes, Rain: Wind }]
Entropy(S) = - sum_{i=1..N} pi * log2(pi)
where,
N is the number of classes
Pi is the probability of an element to belong in class i
Problem based on ID3
Instance | Gender | Car Ownership (x1) | Travel Cost (x2) | Income Level (x3) | Transportation (y)
S(1)     | Male   | 0 | Cheap     | Low    | Bus
S(2)     | Male   | 1 | Cheap     | Medium | Bus
S(3)     | Female | 1 | Cheap     | Medium | Train
S(4)     | Female | 0 | Cheap     | Low    | Bus
S(5)     | Male   | 1 | Cheap     | Medium | Bus
S(6)     | Male   | 0 | Standard  | Medium | Train
S(7)     | Female | 1 | Standard  | Medium | Train
S(8)     | Female | 1 | Expensive | High   | Car
S(9)     | Male   | 2 | Expensive | Medium | Car
S(10)    | Female | 2 | Expensive | High   | Car
Gain Ratio(A) = Gain(A) / SplitInfo_A(D)
Problem based on C4.5
Instance | Outlook (x1) | Temperature (x2) | Humidity (x3) | Wind (x4) | Play? (y)
S(1)     | Sunny    | Hot  | High   | Weak   | No
S(2)     | Sunny    | Hot  | High   | Strong | No
S(3)     | Overcast | Hot  | High   | Weak   | Yes
S(4)     | Rain     | Mild | High   | Weak   | Yes
S(5)     | Rain     | Cool | Normal | Weak   | Yes
S(6)     | Rain     | Cool | Normal | Strong | No
S(7)     | Overcast | Cool | Normal | Strong | Yes
S(8)     | Sunny    | Mild | High   | Weak   | No
S(9)     | Sunny    | Cool | Normal | Weak   | Yes
S(10)    | Rain     | Mild | Normal | Weak   | Yes
S(11)    | Sunny    | Mild | Normal | Strong | Yes
S(12)    | Overcast | Mild | High   | Strong | Yes
S(13)    | Overcast | Hot  | Normal | Weak   | Yes
S(14)    | Rain     | Mild | High   | Strong | No
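The gains computed earlier (0.247 for Outlook, 0.029 for Temperature) can be cross-checked by running this table through a few lines of Python; note Humidity comes out as 0.152 when Entropy(S) = 0.940 is not pre-rounded:

```python
import math

# Information gain on the Play dataset above (Outlook/Temperature/Humidity/Wind).
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def entropy(rows):
    """Entropy of the Yes/No class label (last column)."""
    yes = sum(1 for r in rows if r[-1] == "Yes")
    result = 0.0
    for count in (yes, len(rows) - yes):
        if count:
            p = count / len(rows)
            result -= p * math.log2(p)
    return result

def gain(rows, col):
    """Entropy(S) minus the weighted entropy of the subsets split on column col."""
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for name, col in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
    print(name, round(gain(data, col), 3))
```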
CART (Decision Tree)
Simplified Algorithm
https://www.youtube.com/watch?v=nWuUahhK30c
Underfitting and Overfitting
• Approaches that stop the growth of tree before it reaches the point
where it perfectly classifies the training data.
• Approaches that allow the tree to overfit the data, and then post-
prune it.
Approaches to Identify Final Tree Size
• Use separate dataset for training and for evaluation. This is done by
the training and validation set approach.
• Use the entire dataset for training but use a statistical test (chi square
test) for estimation.
• Use an explicit measure of the complexity for encoding the training
examples and the decision tree, halting growth of the tree when this
encoding size is minimized. This is done using the minimum
description length principle.
• Continuous-valued attributes: use a threshold-based Boolean attribute approach.
• Missing attribute values: assign the value that is most common
among training examples at node n.
• Handling attributes with differing costs: low-cost attributes are preferred over high-cost attributes.
Rule-Based Classification
• To extract rules from a decision tree, one rule is created for each path
from the root node to the leaf node.
• Each splitting criterion in each path is logically ANDed to form the rule
antecedent. The leaf node holds the class prediction.
Pruning the Rule Set
• For a given rule antecedent, any condition that does not improve the
estimated accuracy of the rule can be pruned
• Sequential covering algorithm can be used for the same
Algorithm
1. Start with an empty cover
2. Find the best hypothesis using learn-one-rule
3. If the just-learnt-rule satisfies the threshold then
(a) Put just-learnt-rule to the cover.
(b) Remove examples covered by just-learnt-rule.
(c) Go to step 2.
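The three steps above can be sketched as follows; the toy dataset, the 0.9 accuracy threshold and the single-test learn-one-rule are illustrative stand-ins, not taken from the source:

```python
# Sequential covering: repeatedly learn one rule, keep it if it beats the
# accuracy threshold, remove the examples it covers, and repeat.
def learn_one_rule(examples, target):
    """Greedy stand-in for learn-one-rule: best single (attribute, value) test."""
    best, best_acc = None, 0.0
    for attr in examples[0]:
        if attr == target:
            continue
        for value in {e[attr] for e in examples}:
            covered = [e for e in examples if e[attr] == value]
            acc = sum(e[target] == "yes" for e in covered) / len(covered)
            if acc > best_acc or (acc == best_acc and best is None):
                best, best_acc = (attr, value), acc
    return best, best_acc

def sequential_covering(examples, target, threshold=0.9):
    cover = []                                    # 1. start with an empty cover
    remaining = list(examples)
    while remaining:
        rule, acc = learn_one_rule(remaining, target)   # 2. best hypothesis
        if rule is None or acc < threshold:       # 3. stop if below threshold
            break
        cover.append(rule)                        # 3a. keep just-learnt rule
        attr, value = rule
        remaining = [e for e in remaining if e[attr] != value]  # 3b. remove covered
    return cover                                  # 3c. loop back to step 2

toy = [
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
]
print(sequential_covering(toy, "play"))  # [('outlook', 'overcast')]
```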
Suppose you are given a plot of two label classes on a graph as shown in image 1 below. Can you decide a separating line for the classes?
[Image 1: two classes of points (circles and squares) scattered on a plane]
You might have come up with something similar to image 2. It fairly separates the two classes. Any point that is left of the line falls into the black circle class and any point on the right falls into the blue square class. Separation of classes: that's what SVM does. It finds out a line / hyper-plane (in a multidimensional space) that separates out the classes.
[Image: classes arranged so that no straight line in the x-y plane separates them (circles surrounded by a ring of squares)]
Clearly, there is no line that can separate the two classes in this x-y plane. For this we apply a transformation and add one more dimension, which we call the z-axis. Let's assume the value of points on the z plane is z = x^2 + y^2. In this case we can interpret it as the squared distance of the point from the origin. Now if we plot against the z-axis, a clear separation is visible and a line can be drawn.
When we transform this line back to the original plane, it maps to a circular boundary as shown in the image below. These transformations are called kernels.
[Image: transforming back to the x-y plane, the line transforms to a circle]
Thankfully, you don't have to guess / derive the transformation every time for your data set. The sklearn library's SVM implementation provides it inbuilt.
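Before reaching for a library, the z = x^2 + y^2 lift itself is easy to demonstrate; the coordinates below are made-up illustration points, one class near the origin and the other on a surrounding ring:

```python
# The z = x^2 + y^2 trick from the text: no line in (x, y) separates the two
# classes, but a single threshold on z (a "line" in the lifted space) does.
inner = [(0.5, 0.0), (-0.4, 0.3), (0.0, -0.6), (0.3, 0.3)]   # circles
outer = [(2.0, 0.0), (-1.5, 1.5), (0.0, -2.2), (1.8, -1.2)]  # squares

def z(p):
    x, y = p
    return x * x + y * y  # squared distance from the origin

threshold = 1.0  # the separating "line" in the lifted space
print([z(p) < threshold for p in inner])  # all True
print([z(p) < threshold for p in outer])  # all False
```

Mapping the threshold z = 1 back to the x-y plane gives exactly the circular boundary x^2 + y^2 = 1 described above.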
What in this case?
[Image 1 and Image 2: two candidate separating lines for data containing an outlier]
Which one do you think? Well, both the answers are correct. The first one tolerates some outlier points. The second one is trying to achieve 0 tolerance with a perfect partition.
But there is a trade-off. In a real-world application, finding a perfect class boundary for millions of training data points takes a lot of time. This trade-off is controlled by the regularization parameter.
Tuning parameters:
Kernel
The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra. This is where the kernel plays its role. For a linear kernel, the prediction for a new input using the dot product between the input (x) and each support vector (xi) is calculated as follows:
f(x) = B(0) + sum(ai * (x, xi))
This is an equation that involves calculating the inner products of a new input vector (x) with all support vectors in the training data. The coefficients B(0) and ai (for each input) must be estimated from the training data by the learning algorithm.
The polynomial kernel can be written as K(x, xi) = 1 + sum(x * xi)^d and the exponential kernel as K(x, xi) = exp(-gamma * sum((x - xi)^2)).
Polynomial and exponential kernels calculate the separation line in a higher dimension. This is called the kernel trick.
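The three kernels can be written out directly as functions. The gamma = 0.5 and d = 2 values, and the (1 + sum(x*xi))^d reading of the polynomial formula, are illustrative choices, not values from the source:

```python
import math

def linear_kernel(x, xi):
    """Plain dot product between the two vectors."""
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, d=2):
    """(1 + x . xi)^d, reading the text's 1 + sum(x * xi)^d this way."""
    return (1 + sum(a * b for a, b in zip(x, xi))) ** d

def rbf_kernel(x, xi, gamma=0.5):
    """exp(-gamma * ||x - xi||^2), the exponential (RBF) kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

x, xi = [1.0, 2.0], [2.0, 1.0]
print(linear_kernel(x, xi))         # 4.0
print(polynomial_kernel(x, xi))     # 25.0
print(round(rbf_kernel(x, xi), 4))  # exp(-0.5 * 2) = exp(-1) ~= 0.3679
```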
Regularization
The regularization parameter (C) tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
The images below are examples of two different regularization parameters. The left one has some misclassification due to a lower regularization value. A higher value leads to results like the right one.
[Images: left, low regularization value (some misclassification); right, high regularization value (all points classified correctly)]
Gamma
The gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. In other words, with low gamma, points far away from a reasonable separation line are considered in the calculation of the separation line, whereas high gamma means only the points close to a reasonable line are considered in the calculation.
[Image: high gamma, only nearby points are considered; low gamma, far away points are also considered]
Support Vector
Support vectors are the data points lying closest to the separating line / hyperplane; the margin can be defined as the distance from such a class element to the line / hyperplane.
[Image: separating hyperplane with dashed lines passing through the support vectors]
Margin
A good margin is one where this separation is larger for both the classes. The images below give a visual example of a good and a bad margin. A good margin allows the points to be in their respective classes without crossing over to the other class.
[Images: good margin, equidistant and as far as possible from both sides; bad margin, very close to the blue class]
The target is to determine the decision boundary (good margin). This can be done using a Lagrange multiplier as follows.
The Lagrangian is
L = (1/2) w^T w + sum_{i=1..n} ai * (1 - yi * (w^T xi + b))
Note that ||w||^2 = w^T w.
Setting the gradient of L w.r.t. w and b to zero, we have
w + sum_{i=1..n} ai * (-yi) * xi = 0  =>  w = sum_{i=1..n} ai * yi * xi
sum_{i=1..n} ai * yi = 0
where w is the normal vector to the separating hyperplane and the yi are the class labels.
If we substitute w = sum_i ai * yi * xi into L, we have the dual
L = sum_{i=1..n} ai - (1/2) sum_{i=1..n} sum_{j=1..n} ai * aj * yi * yj * xi^T xj
[Figure: points away from the margin have a = 0; only the support vectors have a > 0]
Example
Consider the following dataset (Table 1).
RID | age         | income | student | credit_rating | Class: buys_computer
1   | youth       | high   | no      | fair          | no
2   | youth       | high   | no      | excellent     | no
3   | middle_aged | high   | no      | fair          | yes
4   | senior      | medium | no      | fair          | yes
5   | senior      | low    | yes     | fair          | yes
6   | senior      | low    | yes     | excellent     | no
7   | middle_aged | low    | yes     | excellent     | yes
8   | youth       | medium | no      | fair          | no
9   | youth       | low    | yes     | fair          | yes
10  | senior      | medium | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | medium | no      | excellent     | no
If there is no rule satisfied by X, a fallback or default rule can be set up to specify a default class, based on a training set. The class used as the default class can be the class in majority, or the majority class of the tuples that were not covered by any rule.
Likelihood Statistics:
The likelihood ratio statistic is a statistical test of significance used to determine whether the apparent effect of a rule is not attributed to chance but instead indicates a genuine correlation between attribute values and classes. The test compares the observed distribution among classes of tuples covered by a rule with the expected distribution that would result if the rule made predictions at random.
Likelihood_Ratio = 2 * sum_{i=1..m} fi * log(fi / ei)
For tuples satisfying the rule, fi is the observed frequency of each class i among the tuples; ei is the expected frequency of each class i if the rule made random predictions.
Rule Pruning
To avoid overfitting (overfitting means the rules may perform well on the training data, but less well on subsequent data) of rules on data, a rule is pruned by removing a conjunct (attribute test). A rule R is chosen for pruning if the pruned version of R has greater quality, as assessed on an independent set of tuples known as the pruning set.
Pruning can be done in several ways, as in decision trees. In FOIL, the quality of pruning for a given rule R is evaluated as
FOIL_Prune(R) = (pos - neg) / (pos + neg)
Based on the above four measures (TP, TN, FP, FN), the classifier's performance can be evaluated using the following equations:
accuracy, recognition rate = (TP + TN) / (P + N)
error rate, misclassification rate = (FP + FN) / (P + N)

Predicted:    yes | no  | Total
Actual yes  | TP  | FN  | P
Actual no   | FP  | TN  | N
Total       | P'  | N'  | P + N
Classifiers can also be compared with respect to the following additional aspects:
Speed: This refers to the computational costs involved in generating and using the given classifier.
Robustness: This is the ability of the classifier to make correct predictions given noisy data or data with missing values.
Scalability: This refers to the ability to construct the classifier efficiently given large amounts of data which keep on increasing.
Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor.
Bayesian Classification
Bayesian classifiers are statistical classifiers that predict class membership probabilities, i.e., the probability that a given tuple belongs to a particular class. A simple Bayesian classifier is also known as the naive Bayesian classifier, which is comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.
Bayes' Theorem
Bayesian classification is based on Bayes' theorem. Let X be a data tuple, considered as "evidence" in Bayesian terms. X is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. In order to perform a classification, the conditional probability P(H|X) has to be determined. P(H|X) is the probability that tuple X belongs to class C, given that the attribute description of X is known. It is given as
P(H|X) = P(X|H) P(H) / P(X)
where:
P(H|X) is known as the posterior probability of H conditioned on X.
P(X|H) is the posterior probability of X conditioned on H.
P(H) and P(X) are the prior probabilities.
Thus P(Ci|X) has to be maximized. The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
4. If the dataset has too many attributes, then it is computationally expensive to compute P(X|Ci). The naive assumption of class-conditional independence simplifies this to
P(X|Ci) = prod_{k=1..n} P(xk|Ci)
5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j != i, i.e., the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.
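A minimal naive Bayes sketch over Table 1; the query tuple X = (youth, medium, yes, fair) is an assumed example for illustration, not taken from this text:

```python
# Naive Bayes on the buys_computer data (Table 1): score each class by
# P(Ci) * prod_k P(xk|Ci) and pick the maximum.
data = [
    ("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
    ("middle_aged","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
    ("middle_aged","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"), ("middle_aged","medium","no","excellent","yes"),
    ("middle_aged","high","yes","fair","yes"), ("senior","medium","no","excellent","no"),
]
X = ("youth", "medium", "yes", "fair")  # assumed query: age, income, student, credit

scores = {}
for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    score = len(rows) / len(data)                 # prior P(Ci)
    for k, value in enumerate(X):                 # class-conditional independence:
        score *= sum(1 for r in rows if r[k] == value) / len(rows)  # P(xk|Ci)
    scores[c] = score

print({c: round(s, 4) for c, s in scores.items()})  # {'yes': 0.0282, 'no': 0.0069}
print(max(scores, key=scores.get))                  # yes
```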
If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable is conditionally independent of its non-descendants in the graph, given its parents.
Example: A directed acyclic graph showing the six attributes of a person resulting in lung cancer is generated from the dataset. For example, having lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. Note that the variable PositiveXRay is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer. In other words, once we know the outcome of the variable LungCancer, the variables FamilyHistory and Smoker do not provide any additional information regarding PositiveXRay. The arcs also show that the variable LungCancer is conditionally independent of Emphysema, given its parents, FamilyHistory and Smoker.
      | FH, S | FH, ~S | ~FH, S | ~FH, ~S
LC    |  0.8  |  0.5   |  0.7   |  0.1
~LC   |  0.2  |  0.5   |  0.3   |  0.9
The above table represents the CPT for the variable LungCancer. The conditional probability for each known value of LungCancer is given for each possible combination of the values of its parents. From the above CPT it is clear that the upper leftmost and lower rightmost entries represent
P(LungCancer = yes | FamilyHistory = yes, Smoker = yes) = 0.8
P(LungCancer = no | FamilyHistory = no, Smoker = no) = 0.9
Let X = (x1, ..., xn) be a data tuple described by the variables or attributes Y1, ..., Yn, respectively. Each variable is conditionally independent of its non-descendants in the network graph, given its parents. The joint probability distribution is represented as
P(x1, ..., xn) = prod_{i=1..n} P(xi | Parents(Yi))
where P(x1, ..., xn) is the probability of a particular combination of values of X, and the values for P(xi | Parents(Yi)) correspond to the entries in the CPT for Yi.
A node within the network can be selected as an "output" node, representing a class label attribute. There
may be more than one output node. The classification process can return a probability distribution that
gives the probability of each class.
Classification by Backpropagation
Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output
units in which each connection has a weight associated with it. During the learning phase, the neural
network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.
The attributes measured for each tuple are fed into the input layer simultaneously. These inputs, along with the weights associated with each link, will be the input for the next layer, called the hidden layer. The output of the first hidden layer may be input to another hidden layer or to the output layer. The number of hidden layers is determined by the complexity of the problem to be solved; usually a single hidden layer is used. The output produced by the output layer corresponds to the network's prediction for the given tuples.
Input units: Units in the input layer that take attributes from outside into the network.
Neurodes: The hidden units present between the input units and output units.
Two-layer network: Has three layers (input layer, hidden layer and output layer); the input layer is not counted as it does not perform any computation.
Feed-forward network: A network which has connections only in the forward direction. There won't be any loops or backward connections.
Fully connected network: Each unit provides input to each unit in the next forward layer.
Inputs:
D: a data set consisting of the training tuples and their associated target values;
l: the learning rate;
Network: a multilayer feed-forward network.
Algorithm
(1) Initialize all weights and biases in network;
(2) while terminating condition is not satisfied {
(3)   for each training tuple X in D {
(4)     // Propagate the inputs forward:
(5)     for each input layer unit j {
(6)       Oj = Ij; } // output of an input unit is its actual input value
(7)     for each hidden or output layer unit j {
(8)       Ij = sum_i wij * Oi + theta_j; // net input of unit j w.r.t. the previous layer, i
(9)       Oj = 1 / (1 + e^(-Ij)); } // compute the output of each unit j
(10)    // Backpropagate the errors:
(11)    for each unit j in the output layer
(12)      Errj = Oj * (1 - Oj) * (Tj - Oj); // compute the error
(13)    for each unit j in the hidden layers, from the last to the first hidden layer
(14)      Errj = Oj * (1 - Oj) * sum_k Errk * wjk; // error w.r.t. the next higher layer, k
(15)    for each weight wij in network {
(16)      delta_wij = (l) * Errj * Oi; // weight increment
(17)      wij = wij + delta_wij; } // weight update
(18)    for each bias theta_j in network {
(19)      delta_theta_j = (l) * Errj; // bias increment
(20)      theta_j = theta_j + delta_theta_j; } // bias update
(21) }}
Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit
i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves to
vary the activity of the unit. Each unit in the hidden and output layers takes its net input and then applies
an activation function. The logistic, or sigmoid, function is used as the activation function. Given the net input
Ij to unit j, then Oj, the output of unit j, is computed as

Oj = 1 / (1 + e^(-Ij))

This function is also referred to as a squashing function, because it maps a large input domain onto the
smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the backpropagation
algorithm to model classification problems that are linearly inseparable. The output values Oj are computed
for each hidden layer, up to and including the output layer, which gives the network's prediction.
Backpropagate the error: The error is propagated backward by updating the weights and
biases to reflect the error of the network's prediction. For a unit j in the output layer, the error
Errj is computed by

Errj = Oj(1 - Oj)(Tj - Oj)

where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple. Oj(1 -
Oj) is the derivative of the logistic function. To compute the error of a hidden layer unit j, the weighted
sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer
unit j is

Errj = Oj(1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the
error of unit k. The weights and biases are updated to reflect the propagated errors.
Weights are updated as

Δwij = (l) Errj Oi
wij = wij + Δwij
The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0. Backpropagation
learns using a gradient descent method to search for a set of weights that fits the training data so as to
minimize the mean-squared distance between the network's class prediction and the known target value of
the tuples. The learning rate helps avoid getting stuck at a local minimum in decision space (i.e., where the
weights appear to converge, but are not the optimum solution) and encourages finding the global minimum.
If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large,
then oscillation between inadequate solutions may occur. Biases are updated as

Δθj = (l) Errj
θj = θj + Δθj

Updating the weights and biases after the presentation of each tuple is referred to as case updating. Alternatively,
the weight and bias increments could be accumulated in variables and updated after all the tuples in the training
set have been presented. This strategy is called epoch updating, where one iteration through the training set
is an epoch.
Terminating condition: Training stops when
► All Δwij in the previous epoch are so small as to be below some specified threshold.
► The percentage of tuples misclassified in the previous epoch is below some threshold.
► A prespecified number of epochs has expired.
The computational efficiency depends on the time spent in training the network. Given |D| tuples and w
weights, each epoch requires O(|D| × w) time.
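The forward pass and the update rules above can be sketched in Python. The network shape (2 inputs, 2 hidden units, 1 output), the initial weights, the learning rate and the single training tuple are illustrative assumptions, not values from the text:

```python
import math

def sigmoid(x):
    # squashing (logistic) function: maps any net input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W_h, b_h, W_o, b_o):
    # propagate the inputs forward through one hidden layer to one output unit
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_h, b_h)]
    o = sigmoid(sum(w * hj for w, hj in zip(W_o, h)) + b_o)
    return h, o

def train_step(x, target, W_h, b_h, W_o, b_o, lr):
    # one "case update": forward pass, backpropagate errors, adjust weights/biases
    h, o = forward(x, W_h, b_h, W_o, b_o)
    err_o = o * (1 - o) * (target - o)            # Err_j for the output unit
    err_h = [hj * (1 - hj) * err_o * w            # Err_j for each hidden unit
             for hj, w in zip(h, W_o)]
    for j in range(len(W_o)):
        W_o[j] += lr * err_o * h[j]               # weight increment (l) Err_j O_i
    for j in range(len(W_h)):
        for i in range(len(x)):
            W_h[j][i] += lr * err_h[j] * x[i]
        b_h[j] += lr * err_h[j]                   # bias increment (l) Err_j
    return b_o + lr * err_o                       # updated output bias

# assumed 2-2-1 network with arbitrary small initial weights
W_h = [[0.2, -0.3], [0.4, 0.1]]
b_h = [-0.4, 0.2]
W_o = [-0.3, -0.2]
b_o = 0.1
x, target, lr = [1.0, 0.0], 1.0, 0.9

_, o_before = forward(x, W_h, b_h, W_o, b_o)
for _ in range(50):
    b_o = train_step(x, target, W_h, b_h, W_o, b_o, lr)
_, o_after = forward(x, W_h, b_h, W_o, b_o)
print(o_before, o_after)
```

After repeated case updates on the same tuple, the output moves toward the target, i.e. the squared error on that tuple decreases.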
► An HMM consists of two types of states, the hidden states (w) and the visible states (v).
► Transition probability: the probability of transitioning from one state to another in a single step.
► Emission probability: the conditional distribution of the observed variables, P(vn | wn), from a
specific state.
Consider a hidden Markov model with three visible states and three hidden states:
W = w1, w2, w3
V = v1, v2, v3
In each hidden state wi, the system emits a visible output from V = v1, v2, v3. The probability of emitting any
visible output from a hidden state is the emission probability, represented as bij, i.e.
State w1 emits the outputs v1, v2, v3 with probabilities b11, b12, b13 respectively.
State w2 emits the outputs with probabilities b21, b22, b23 respectively.
State w3 also emits the outputs v1, v2, v3 with probabilities b31, b32, b33 respectively.
From any hidden state there is always a transition to any other state. The probability of such a transition
between states is represented as the transition probability aij.
The transition probability from state w1 to w2 is represented as a12, and from w2 to w1 as a21.
Similarly between w2 and w3.
It is also possible to have a transition from a state to itself, which is represented as aii. The
sum of the probabilities of transitions from one state to the others is one, i.e.

Σj aij = 1, for all i

Similarly, the sum of the probabilities of emitting any of the visible outputs from a state is also one, i.e.

Σk bjk = 1, for all j

After several transitions between hidden states, the system reaches a final state, represented by w0. Once
the system reaches the final state, there is a transition only to itself. This transition is represented as a00 with
probability 1. The final state emits only one type of visible symbol, v0, and the probability of emitting the visible
symbol v0 is given by b00, where b00 = 1.
A state diagram can be drawn showing the hidden states, the final state and their outputs.
• Decoding Problem: In the decoding problem, W^T, the sequence of states that generated the visible
symbol sequence V^T, has to be calculated.
i.e. W^T = ?
• Training Problem: For a known set of hidden states W and visible symbols V, the training problem
is to find the transition probabilities aij and emission probabilities bjk from the training set.
i.e. aij = ?
bjk = ?
Evaluation Problem
For a given HMM θ and sequence of visible symbols V^T, the conditional probability of getting the visible
symbol sequence has to be estimated.
P(V^T | θ) = ?
P(V^T | θ) can be written as

P(V^T | θ) = Σ (r = 1 to r_max) P(V^T | W_r^T) P(W_r^T) ---------------(1)

where W_r^T is one of the possible sequences of hidden states and P(W_r^T) is the probability of the
sequence W_r^T being selected. Since there are r_max different possible sequences of W, P(W_r^T) can be represented as

P(W_r^T) = Π (t = 1 to T) P(w(t) | w(t-1)) ---------------(2)

i.e. the product of the transition probabilities at consecutive time steps. The
P(V^T | W_r^T) term of equation (1) can be rewritten as

P(V^T | W_r^T) = Π (t = 1 to T) P(v(t) | w(t)) ---------------(3)

Substituting (2) and (3) back in equation (1) results in the final equation for P(V^T | θ):

P(V^T | θ) = Σ (r = 1 to r_max) Π (t = 1 to T) P(v(t) | w(t)) P(w(t) | w(t-1)) ---------------(4)

Equation (4) has high computational complexity, of the order O(N^T · T).
Consider the following trellis diagram, which shows the states at different instances of time.
At time t = 0, the state of the system is w1.
If the visible symbol emitted at time t = 1 is v(1) = v0, then there is a transition from w1 to w0.
From the diagram, the probability that the machine is in state wj at time t can be represented as αj(t).
So αj(t-1) is the probability that the machine is in state wj at time t-1; similarly for α1(t-1), α2(t-1), etc.
The probability of a transition from one state to another can be rewritten as the product of the probability of being in
a state and the probability of transitioning to the other state,
i.e. the probability of a transition from w1 to w2 can be represented as α1(t-1) a12.
Example:
Probability of a transition into w2(t) from w1 at time instant t = α1(t-1) a12.
Probability of a transition into w2(t) from w2 at time instant t = α2(t-1) a22.
Therefore, the probability of a visible state is the product of the sum of all these probabilities and the emission
probability. When the sum is calculated over the products of the probabilities of transitions from the states at
time t-1 to wj(t), it gives the probability that there is a transition from one of the states to wj at time instant t.
The probability that the machine makes a transition from a state at time t-1 to another state at time t is given by
αi(t-1) aij.
The probability that the machine is in state wj at time t, after emitting the first t visible symbols of the
sequence V^T, is given by this sum of products multiplied by the emission probability. In
simplified terms it can be written as

αj(t) = bjk v(t) Σi αi(t-1) aij

Using this, an algorithm can be formulated to find the probability that an HMM θ has generated a sequence V^T.
Forward Algorithm
Initialize: t=O, c!ij I bilt, vr. <Xj(O)
Fort=t+l
.\'
nJ{I) - l,;J.:\ "(I) . L
,-,
o i(t - l )111J
until t=T
Return P(VTl0) = aorn for final state
end
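As a sketch, the forward recursion can be written directly from the update αj(t) = bjk v(t) Σi αi(t-1) aij. The 2-state transition/emission matrices and initial distribution below are illustrative assumptions (no absorbing final state here: the first symbol is emitted from the initial distribution, and the total probability is taken as the sum of the final α values rather than α0(T)):

```python
def forward(a, b, pi, obs):
    """Forward algorithm: P(V^T | theta) for an HMM with transition matrix a,
    emission matrix b, initial distribution pi, and observation sequence obs."""
    n = len(a)
    alpha = [pi[j] * b[j][obs[0]] for j in range(n)]       # alpha_j(1)
    for v in obs[1:]:
        alpha = [b[j][v] * sum(alpha[i] * a[i][j] for i in range(n))
                 for j in range(n)]                        # alpha_j(t)
    return sum(alpha)

# illustrative 2-state, 2-symbol HMM (the numbers are assumptions for the sketch)
a  = [[0.7, 0.3], [0.4, 0.6]]
b  = [[0.9, 0.1], [0.2, 0.8]]
pi = [1.0, 0.0]
p = forward(a, b, pi, [0, 1])
print(p)   # ≈ 0.279
```

For this tiny example the value can be checked by enumerating both paths: 1×0.9×0.7×0.1 + 1×0.9×0.3×0.8 = 0.063 + 0.216 = 0.279, which is what the recursion computes in O(N²T) time instead of O(N^T · T).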
Decoding Problem
The decoding problem deals with finding the most probable sequence of states through which the machine
passed to reach the final state.
Algorithm for Decoding Problem
Let V^T be the visible symbol sequence.
Aim: find the most probable sequence of hidden states.
Initialize: Path = {}, t = 0
For t = t + 1
    compute αj(t) for j = 1, ..., N
    until j = N
    j' = arg maxj αj(t)
    Append j' to Path
Repeat until t = T
Return Path.
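The procedure above (compute αj(t) at each step, take the arg max, append it to the path) can be sketched as follows. Note that this greedy per-step choice is a simplification rather than the full Viterbi algorithm, and the example HMM numbers are assumptions:

```python
def decode(a, b, pi, obs):
    # at each time step, compute alpha_j(t) and record j' = arg max_j alpha_j(t)
    n = len(a)
    alpha = [pi[j] * b[j][obs[0]] for j in range(n)]
    path = [max(range(n), key=lambda j: alpha[j])]
    for v in obs[1:]:
        alpha = [b[j][v] * sum(alpha[i] * a[i][j] for i in range(n))
                 for j in range(n)]
        path.append(max(range(n), key=lambda j: alpha[j]))
    return path

# illustrative 2-state, 2-symbol HMM (assumed numbers)
a  = [[0.7, 0.3], [0.4, 0.6]]
b  = [[0.9, 0.1], [0.2, 0.8]]
pi = [1.0, 0.0]
path = decode(a, b, pi, [0, 1, 1])
print(path)   # [0, 1, 1]
```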
Training Problem
The hidden Markov model can be trained using a supervised learning method. It uses a number of known
visible symbol sequences for training. The main motive of training an HMM is to estimate the transition probabilities
aij and the emission probabilities bjk, which are the unknown terms in equation (5).
Example
Consider a visible symbol sequence <v1, v3, v1, v5, v1, v2, v0>.
Then α3(4) = the probability that the machine is in state w3 after generating the first four visible symbols of
the sequence, v1, v3, v1, v5.
β3(4) = the probability that the machine, being in state w3, will generate the remaining
visible symbols v1, v2, v0.
In the above example a new term, βi(t), is introduced, representing the probability that the model is in
wi(t) and will generate the remainder of the given target sequence V^T, i.e. all the symbols from v(t+1)
to v(T).
Using this, an algorithm can be formulated to find the probability that the HMM θ will generate the remaining
symbols of V^T by continuing from the present state. This algorithm is known as the backward algorithm.
Backward Algorithm
Initialize: t = T, aij, bjk, V^T, βi(T)
For t = t - 1
    βi(t) = Σj βj(t+1) aij bjk v(t+1)
until t = 0
Return P(V^T | θ) = βi(0) for the known initial state
end
Both αj(t) and βi(t), the forward probability and the backward probability, are used to calculate the aij and
bjk.
Using the above terms, the probability of a transition from wi(t-1) to wj(t), given that the model emitted the
visible sequence V^T, can be written as

γij(t) = αi(t-1) aij bjk v(t) βj(t) / P(V^T | θ)

Still, in the above equation for γij(t) there are terms aij and bjk which are unknown. So, in order to solve
the equation, assume random values for aij and bjk initially.
By using the random values of aij and bjk, γij(t) can be estimated, and later the values of aij and bjk are improved.
The expected number of transitions from wi(t-1) to wj(t) at any time in the sequence V^T is

Σ (t = 1 to T) γij(t)

Using the above, the value of P(V^T | θ) can be estimated for a particular model θi, and similarly for
other models θj. If the conditional probability for one model θi is greater than for another θj, then the sequence
belongs to the corresponding class of θi:

P(V^T | θi) > P(V^T | θj)  then  V^T ∈ θi
Example
Consider a model with 4 hidden states and 4 visible symbols:
W = w1, w2, w3, w4
V = v1, v2, v3, v4 and final symbol v0
Let the transition probabilities and emission probabilities be as follows:

          w1  w2  w3  w4                 v0  v1  v2  v3  v4
    w1  [  1   0   0   0 ]         w1  [  1   0   0   0   0 ]
A = w2  [ .2  .3  .1  .4 ]     B = w2  [  0  .3  .4  .1  .2 ]
    w3  [ .2  .5  .2  .1 ]         w3  [  0  .1  .1  .7  .1 ]
    w4  [ .7  .1  .1  .1 ]         w4  [  0  .5  .2  .1  .2 ]

where
Σj aij = 1 and Σk bjk = 1.

Let V^T = (v1, v3, v2, v0), and let the machine be at w1 at t = 0. Find P(V^T | θ).
Solution:
From above data, the trellis diagram can be generated.
[Trellis diagram showing the values of αj(t) for each state at the time instants t = 0, 1, 2, 3, 4.]
The given initial state is w1, so the probability of w1 is 1 at t = 0, and the remaining states have probability
zero. From the initial state, transitions are possible to the other states at t = 1. Considering the emission
probability for the first symbol of the given sequence, v1, and using the equation for αj(t), a probability of
0.09 (i.e. .3 × .3) is obtained for one of the states; similarly for the other states at t = 1.
The state with the highest probability at t = 1 (here 0.2) contributes most strongly to the subsequent states.
The same process is continued until all visible symbols of the given sequence are produced. At t = 4 the system
reaches the final state and emits v0 with probability 0.00138, which is P(V^T | θ) for the given HMM.
Why naive bayes is called naive?
(Naive Bayes is a classifier that classifies things. "Naive" means a person or action lacking
experience; that is why this classifier is called naive.)
• Naive Bayes classifiers are a collection of classification algorithms based on
Bayes' Theorem. It is not a single algorithm but a family of algorithms
where all of them share a common principle, i.e. every pair of features being
classified is independent of each other.
• The Naive Bayes Classifier assumes that all the features of a class are
independent of each other.
• This means that if any one of the features is missing, then classification may not
be perfect.
• For example, a fruit may be considered to be an apple if it is red, round and
about 4" in diameter, or a ball if we don't give the red colour classification.
• So even if the features are inter-dependent on each other, the Naive Bayes
classifier will consider them independent.
It's called naive also because it makes the assumption that all attributes are
independent of each other. This assumption is why it's called naive, as in lots of real-world
situations this does not fit. Despite this, the classifier works extremely well in
lots of real-world situations and has comparable performance to neural networks
and SVMs in certain cases (though not all).
Example: Naive Bayes classification (buys_computer)
Training data (age, income, student, credit_rating → buys_computer):

RID  age    income  student  credit_rating  buys_computer
1    <=30   high    no       fair           no
2    <=30   high    no       excellent      no
3    31-40  high    no       fair           yes
4    >40    medium  no       fair           yes
5    >40    low     yes      fair           yes
6    >40    low     yes      excellent      no
7    31-40  low     yes      excellent      yes
8    <=30   medium  no       fair           no
9    <=30   low     yes      fair           yes
10   >40    medium  yes      fair           yes
11   <=30   medium  yes      excellent      yes
12   31-40  medium  no       excellent      yes
13   31-40  high    yes      fair           yes
14   >40    medium  no       excellent      no

Classify the tuple X = (age <=30, income = medium, student = yes, credit_rating = fair).

P(buys = yes) = 9/14 = 0.643
P(buys = no)  = 5/14 = 0.357
P(age <=30 | yes) = 2/9 = 0.222            P(age <=30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444     P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667       P(student = yes | no) = 1/5 = 0.200
P(credit = fair | yes) = 6/9 = 0.667       P(credit = fair | no) = 2/5 = 0.400

P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | no)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
P(X | no) P(no)   = 0.019 × 0.357 = 0.007

Since 0.028 > 0.007, naive Bayes predicts buys_computer = yes for X.
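The worked example above can be checked in Python. The 14-tuple training set is assumed to be the classic buys_computer data that these conditional-probability counts match:

```python
# (age, income, student, credit_rating, buys_computer) -- the classic
# 14-tuple training set this worked example appears to be based on
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31-40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31-40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no",  "excellent", "yes"),
    ("31-40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def score(x, label):
    # P(label) * product over attributes of P(x_i | label)
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                      # prior P(label)
    for i, value in enumerate(x):
        p *= sum(1 for r in rows if r[i] == value) / len(rows)
    return p

x = ("<=30", "medium", "yes", "fair")              # the tuple to classify
p_yes, p_no = score(x, "yes"), score(x, "no")
print(round(p_yes, 3), round(p_no, 3))             # 0.028 0.007
```

The larger score wins, so the classifier predicts buys_computer = yes for this tuple.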
[Handwritten pages: a second naive Bayes worked example, computing P(short), P(medium) and P(tall)
for a height data set via P(height | class) · P(class); the scan of these pages is largely illegible.]
Module 6
Dimensionality Reduction
Subset Selection
Subset selection is the process of finding the best subset of the set of features. The best subset contains
the least number of dimensions that contribute most to the accuracy, discarding the remaining,
unimportant dimensions. Once the necessary dimensions are found, they can be used in both
regression and classification problems using a suitable error function. Suppose there are d dimensions
or variables; then there exist 2^d possible subsets. Unless d is small, it is difficult to test all the
subsets. There are two approaches:
• Sequential Forward Approach
• Sequential Backward Approach
Mean
The mean is the average of the numbers of a given set, or a calculated "central" value of a set of
numbers.
Standard Deviation
The standard deviation of a dataset is a measure of how spread out the data is: the average distance
from the mean of the dataset to a point. The standard deviation can be calculated by computing the squares
of the distances from each data point to the mean of the set, adding them all up, dividing by n-1, and taking
the positive square root. That is

s = sqrt( Σ (i = 1 to n) (Xi - X̄)² / (n - 1) )

where s is the symbol representing the standard deviation. Here n-1 is used instead of n in order to
represent a sample smaller in size than the population, so that the result obtained by considering the
small sample is approximately the same as that obtained from the actual population.
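A direct translation of the formula (the sample data below is an illustrative assumption):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    # sample standard deviation: squared distances to the mean,
    # summed, divided by n-1, then the positive square root
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]        # illustrative sample
print(mean(data), std_dev(data))       # mean 5.0; s = sqrt(32/7)
```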
Variance
Variance is another measure of the spread of data in a data set. In fact, it is almost identical
to the standard deviation. The formula is

s² = Σ (i = 1 to n) (Xi - X̄)² / (n - 1)

Both standard deviation and variance are used to find the relationships among data points in a one-dimensional
space.
Covariance
Covariance is always measured between 2 dimensions. If the covariance between one dimension and
itself is calculated, then the variance is obtained. Covariance is a measure of how much the
dimensions vary from the mean with respect to each other. The covariance between any two
dimensions X and Y in a multi-dimensional space can be given as

cov(X, Y) = Σ (i = 1 to n) (Xi - X̄)(Yi - Ȳ) / (n - 1)
Covariance Matrix
Covariance is always measured between 2 dimensions. If a data set contains more than 2 dimensions,
there is more than one covariance measurement that can be calculated. The matrix representation of
all the possible calculated values of covariance is known as the covariance matrix. A covariance matrix for
three dimensions x, y, z is given as follows

C = [ cov(x,x)  cov(x,y)  cov(x,z)
      cov(y,x)  cov(y,y)  cov(y,z)
      cov(z,x)  cov(z,y)  cov(z,z) ]
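The covariance formula and the matrix built from it can be sketched directly (the two example columns are an illustrative assumption, chosen so that y = 2x):

```python
def covariance(xs, ys):
    # cov(X, Y) = sum((x_i - x̄)(y_i - ȳ)) / (n - 1)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def covariance_matrix(columns):
    # one row/column per dimension; the diagonal holds the variances
    return [[covariance(a, b) for b in columns] for a in columns]

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]          # y = 2x, so cov(x, y) = 2 * var(x)
C = covariance_matrix([x, y])
print(C)
```

Note that the matrix is symmetric (cov(x, y) = cov(y, x)), as the text's three-dimensional example also shows.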
Two important points to be noted about Principal Component Analysis are that it lowers the
dimension of the input features, and that orthogonality of the new dimensions is always maintained.
Consider the data plot given below, which has two dimensions, one along axis X1 and the other along
X2.
[Scatter plot of the data points along axes X1 and X2.]
In the above data plot, the variability of the data points along both axes is large. That means both
dimensions provide significant information about the data points in the plot. So, each data point can
be represented using the corresponding coordinate values of both dimensions.
      X1   X2
X = [ x11  x12
      x21  x22
      ...
      xn1  xn2 ]

and Cov(X) is the corresponding 2 × 2 covariance matrix.
The main aim of PCA is to find new dimensions such that the variability across at least one of the
dimensions is small enough to be neglected. In order to find the new dimensions, rotate the axes
X1 and X2 by an angle θ. After rotating both axes, the new axes obtained are Z1 and Z2. The data
plot after transformation is shown below.
Now it can be noted that the variability of the data along the axis Z2 is very small. That means it does not
provide much information about the data points in the plot. So, ignoring such a dimension will not
affect the quality of the classification process. The two new dimensions obtained by transforming the
initial axes are known as principal components. Here the principal components are Z1 and Z2, with
Var(Z1) >> Var(Z2)
In the original data, X1 and X2 are highly correlated. So we get an ellipse whose major and minor axes
are not parallel to X1 and X2, which shows the dependency between x1 and x2. After
transformation the major and minor axes are along Z1 and Z2, and this is known as orthogonality.
The variability across Z2 is very small compared to Z1, so the dimension Z2 can be ignored.
Consider a single data point in the plot, say M(x1, x2). When the axes are changed to Z1 and Z2 from
X1 and X2, the corresponding coordinates have to be found for each point in the data plot.
Consider the diagram below.
[Handwritten derivation.] The rotation of the axes is given by the matrix

A = [ cos θ   -sin θ
      sin θ    cos θ ]

so that the new coordinates are

z1 = a11 x1 + a12 x2 + ... + a1p xp
z2 = a21 x1 + a22 x2 + ... + a2p xp

For z1 to have maximum variance subject to the constraint aᵀa = 1, the Lagrangian

L = aᵀ S a - λ(aᵀ a - 1)

is maximized, where S = Cov(X). Setting the derivative with respect to a to zero gives S a = λ a,
so a is an eigenvector of the covariance matrix S and λ is the corresponding eigenvalue; the eigenvector
with the largest eigenvalue gives the first principal component.
[A handwritten numerical example follows, computing the principal components of a small data set of
two variables with values around 9-12 and 94-110.]
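The eigendecomposition route to the principal components can be sketched for 2-D data, where the symmetric 2×2 covariance matrix has a closed-form eigendecomposition. The strongly correlated sample data is an illustrative assumption:

```python
import math

def pca_2d(xs, ys):
    """Principal components of 2-D data: eigenvalues of the 2x2 covariance
    matrix (Var(Z1), Var(Z2)) and the rotation angle theta of the new axes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of S = [[sxx, sxy], [sxy, syy]] in closed form
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    d = math.sqrt(tr ** 2 / 4 - det)
    lam1, lam2 = tr / 2 + d, tr / 2 - d            # Var(Z1) >= Var(Z2)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)   # rotation angle
    return lam1, lam2, theta

# strongly correlated illustrative data: x2 tracks x1, so Var(Z2) is tiny
x1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
x2 = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
lam1, lam2, theta = pca_2d(x1, x2)
print(lam1, lam2)
```

As the text argues, Var(Z1) >> Var(Z2) for correlated data, so Z2 can be dropped with little loss.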
Motivation
Imagine that you are in a room where two people are speaking simultaneously. You have two
microphones, which you hold in different locations. The microphones give you two recorded time
signals, which we could denote by x1(t) and x2(t), with x1 and x2 the amplitudes, and t the time
index. Each of these recorded signals is a weighted sum of the speech signals emitted by the two
speakers, which we denote by s1(t) and s2(t). We could express this as a linear equation:
x1(t) = a11 s1 + a12 s2 (1)
x2(t) = a21 s1 + a22 s2 (2)
where a11, a12, a21, and a22 are some parameters that depend on the distances of the microphones from
the speakers. It would be very useful if you could now estimate the two original speech signals s1(t)
and s2(t) using only the recorded signals x1(t) and x2(t). This is called the cocktail-party problem.
One approach to solving this problem would be to use some information on the statistical properties
of the signals si(t) to estimate the aij. Actually, and perhaps surprisingly, it turns out that it is enough
to assume that s1(t) and s2(t), at each time instant t, are statistically independent. Independent
Component Analysis, or ICA, can be used to estimate the aij based on the information of their
independence, which allows us to separate the two original source signals s1(t) and s2(t) from their
mixtures x1(t) and x2(t).
Definition of ICA
Assume that we observe n linear mixtures x1, ..., xn of n independent components:
xj = aj1 s1 + aj2 s2 + ... + ajn sn, for all j.
Using vector-matrix notation, let us denote by x the random vector whose elements are the mixtures
x1, ..., xn, and by s the random vector with elements s1, ..., sn. Let A be the matrix with elements aij. All
vectors are understood as column vectors; thus xᵀ, the transpose of x, is a row vector. Using vector-matrix
notation this can be represented as
x = A s. (3)
Denoting the columns of A by ai, the model can also be written as
x = Σi ai si.
The statistical model in Eq. 3 is called the independent component analysis, or ICA, model. The ICA
model is a generative model, which means that it describes how the observed data are generated by a
process of mixing the components si. The independent components are latent variables, meaning that
they cannot be directly observed. Also, the mixing matrix is assumed to be unknown. All we observe
is the random vector x, and we must estimate both A and s using it. The starting point for ICA is the
very simple assumption that the components si are statistically independent, and the independent
components must have non-gaussian distributions. After estimating the matrix A, compute its inverse,
say W, and obtain the independent components simply by: s = W x.
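The model x = A s and the recovery s = W x with W = A⁻¹ can be demonstrated directly. Note this sketch assumes A is known, whereas real ICA estimates W from the independence and non-gaussianity of the sources; the signals and mixing parameters below are illustrative assumptions:

```python
# two short "source" signals (assumed toy data)
s1 = [0.0, 1.0, 0.5, -1.0, 0.3]
s2 = [1.0, -0.5, 0.2, 0.4, -1.0]
a11, a12, a21, a22 = 0.6, 0.4, 0.3, 0.7      # assumed mixing parameters

# mix: x = A s, i.e. x1 = a11 s1 + a12 s2, x2 = a21 s1 + a22 s2
x1 = [a11 * u + a12 * v for u, v in zip(s1, s2)]
x2 = [a21 * u + a22 * v for u, v in zip(s1, s2)]

# unmix with W = inverse of the 2x2 mixing matrix A
det = a11 * a22 - a12 * a21
w11, w12 = a22 / det, -a12 / det
w21, w22 = -a21 / det, a11 / det

r1 = [w11 * u + w12 * v for u, v in zip(x1, x2)]   # recovered s1
r2 = [w21 * u + w22 * v for u, v in zip(x1, x2)]   # recovered s2
print(r1, r2)
```

The recovered signals match the originals up to floating-point error, illustrating why estimating W is all ICA has to do.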
Ambiguities of ICA
1. We cannot determine the variances of the independent components. Because both s and A are
unknown, any scalar multiplier in one of the sources si could always be cancelled by dividing the
corresponding column ai of A by the same scalar.
2. We cannot determine the order of the independent components.
Independence
Two random variables y1 and y2 are independent if and only if their joint pdf is factorizable in the
following way:
p(y1, y2) = p1(y1) p2(y2)
Uncorrelated variables
A weaker form of independence is uncorrelatedness. Two random variables y1 and y2 are said to be
uncorrelated if their covariance is zero:
E{y1 y2} - E{y1} E{y2} = 0
If the variables are independent, they are uncorrelated. On the other hand, uncorrelatedness does not
imply independence. Since independence implies uncorrelatedness, many ICA methods constrain the
estimation procedure so that it always gives uncorrelated estimates of the independent components.
This reduces the number of free parameters and simplifies the problem.
Measures of non-gaussianity
Kurtosis: The classical measure of non-gaussianity is kurtosis, or the fourth-order cumulant. The
kurtosis of y is classically defined by
kurt(y) = E{y⁴} - 3(E{y²})²
Assuming that y is of unit variance, the right-hand side simplifies to E{y⁴} - 3. This shows that
kurtosis is simply a normalized version of the fourth moment E{y⁴}. For a gaussian y, the fourth
moment equals 3(E{y²})². Thus, kurtosis is zero for a gaussian random variable. For most (but not
quite all) non-gaussian random variables, kurtosis is non-zero. Kurtosis can be either positive or
negative. Random variables that have negative kurtosis are called sub-gaussian, and those with
positive kurtosis are called super-gaussian.
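The sub-/super-gaussian distinction can be checked empirically with the definition above. The choice of a uniform distribution (sub-gaussian, kurtosis -1.2 in theory) and a two-sided exponential, i.e. Laplace, distribution (super-gaussian, kurtosis +3) is an illustrative assumption:

```python
import random

def kurtosis(ys):
    # kurt(y) = E{y^4} - 3 (E{y^2})^2, after centering y at zero mean
    n = len(ys)
    m = sum(ys) / n
    ys = [y - m for y in ys]
    e2 = sum(y ** 2 for y in ys) / n
    e4 = sum(y ** 4 for y in ys) / n
    return e4 - 3 * e2 ** 2

random.seed(0)
uniform = [random.uniform(-1, 1) for _ in range(100_000)]    # sub-gaussian
laplace = [random.expovariate(1) * random.choice((-1, 1))    # super-gaussian
           for _ in range(100_000)]
print(kurtosis(uniform), kurtosis(laplace))
```

The sample kurtosis comes out negative for the uniform data and positive for the Laplace data, matching the definitions above.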
Module 7
Learning with Clustering
K-means Clustering
The k-means algorithm takes the input parameter k and partitions a set of n objects into k
clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is
low. Cluster similarity is measured in regard to the mean value of the objects in a cluster,
which can be viewed as the cluster's centroid or centre of gravity.
Steps:
► First, it randomly selects k of the objects, each of which initially represents a cluster
mean or centre.
► Each of the remaining objects is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
► It then computes the new mean for each cluster.
► This process iterates until the criterion function converges.
► Typically, the square-error criterion is used, defined as

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p - mi|²

where E is the sum of the squared error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are
multidimensional).
Algorithm:
The k-means algorithm for partitioning, where each cluster's centre is represented by the mean
value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centres;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the
mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
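Steps (1)-(5) can be sketched on one-dimensional points; the data set, k = 2, and the initial centres are illustrative assumptions:

```python
def kmeans(points, centres, iters=100):
    """Plain k-means on 1-D points: assign each point to the nearest centre,
    recompute each centre as its cluster's mean, repeat until no change."""
    clusters = [[] for _ in centres]
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for p in points:                       # step (3): reassign
            i = min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centres[i]
               for i, c in enumerate(clusters)]  # step (4): update means
        if new == centres:                     # step (5): until no change
            break
        centres = new
    return centres, clusters

points = [2, 3, 4, 10, 11, 12, 20, 25, 30]
centres, clusters = kmeans(points, [2.0, 30.0])
print(centres, clusters)
```

With these inputs the algorithm settles on the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}, with means 7 and 25.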
Hierarchical Clustering
A hierarchical clustering method works by grouping data objects into a tree of clusters. The
quality of a pure hierarchical clustering method suffers from its inability to perform adjustment
once a merge or split decision has been executed. That is, if a particular merge or split decision
later turns out to have been a poor choice, the method cannot backtrack and correct it.
In general, there are two types of hierarchical clustering methods:
Agglomerative hierarchical clustering:
This bottom-up strategy starts by placing each object in its own cluster and then merges these
atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or
until certain termination conditions are satisfied. Most hierarchical clustering methods belong
to this category. They differ only in their definition of inter-cluster similarity.
In either agglomerative or divisive hierarchical clustering, the user can specify the desired
number of clusters as a termination condition.
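The bottom-up merge loop can be sketched with single linkage (inter-cluster distance = minimum pairwise distance) as one concrete choice of inter-cluster similarity. The four objects and their distance matrix are illustrative assumptions:

```python
def single_linkage(labels, dist):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest inter-cluster distance (single linkage) until one cluster
    remains. Returns the merge order as (members, merge distance) pairs."""
    clusters = [frozenset([l]) for l in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum of the pairwise object distances
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# illustrative symmetric distance matrix over objects A-D (assumed values)
dist = {
    "A": {"A": 0, "B": 1, "C": 4, "D": 5},
    "B": {"A": 1, "B": 0, "C": 3, "D": 6},
    "C": {"A": 4, "B": 3, "C": 0, "D": 2},
    "D": {"A": 5, "B": 6, "C": 2, "D": 0},
}
merges = single_linkage(list("ABCD"), dist)
print(merges)
```

The merge order ((A,B) at distance 1, then (C,D) at 2, then everything at 3) is exactly the dendrogram of the clustering.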
Derivation
l.c:l X IK? rnrulnm vnc:t.or wliid1 rc,:mll1' from n pnrnmnl<:rl:wcl frunily. \Vr. wil'h
lo fiml O i-11d1 tlmt 'P(XIO) 11' n llll\xln111111. 1111N IN known n..'l tl,r. Mnxlmnm
Maximum likelihood (ML) estimate for θ: In order to estimate θ, it is typical to introduce the log likelihood function defined as,

L(θ) = ln P(X|θ)        (7)

The likelihood function is considered to be a function of the parameter θ given the data X. Since ln(x) is a strictly increasing function, the value of θ which maximizes P(X|θ) also maximizes L(θ).

The EM algorithm is an iterative procedure for maximizing L(θ). Assume that after the nth iteration the current estimate for θ is given by θn. Since the objective is to maximize L(θ), we wish to compute an updated estimate θ such that,

L(θ) > L(θn)        (8)

Equivalently, we want to maximize the difference,

L(θ) − L(θn) = ln P(X|θ) − ln P(X|θn)        (9)

So far, we have not considered any unobserved or missing variables. In problems where such data exist, the EM algorithm provides a natural framework for their inclusion. Alternately, hidden variables may be introduced purely as an artifice for making the maximum likelihood estimation of θ tractable. In this case, it is assumed that knowledge of the hidden variables will make the maximization of the likelihood function easier. Either way, denote the hidden random vector by Z and a given realization by z. The total probability P(X|θ) may be written in terms of the hidden variables z as,

P(X|θ) = Σz P(X|z, θ) P(z|θ)

Bounding the resulting log likelihood from below (via Jensen's inequality) yields a function l(θ|θn) with

L(θ) ≥ l(θ|θn)        (13)

We now have a function, l(θ|θn), which is bounded above by the likelihood function L(θ). Additionally, observe that,

l(θn|θn) = L(θn)        (14)

Our objective is to choose a value of θ so that L(θ) is maximized. We have shown that the function l(θ|θn) is bounded above by the likelihood function L(θ), and that the values of the functions l(θ|θn) and L(θ) are equal at the current estimate, θ = θn. Therefore, any θ which increases l(θ|θn) in turn increases L(θ). In order to achieve the greatest possible increase in the value of L(θ), the EM algorithm calls for selecting θ such that l(θ|θn) is maximized. We denote this updated value as θn+1. This process is illustrated in Figure (2).

Formally we have,

θn+1 = arg maxθ { E Z|X,θn [ ln P(X, z|θ) ] }        (15)

In Equation (15) the expectation and maximization steps are apparent. The EM algorithm thus consists of iterating the:
1. E-step: determine the conditional expectation E Z|X,θn [ln P(X, z|θ)].
2. M-step: maximize this expression with respect to θ.
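The E- and M-steps above can be sketched for a simple concrete case. Below is a minimal, illustrative EM loop for a two-component 1D Gaussian mixture, where the hidden variable z is the component label; the function name, initialization, and toy data are assumptions for illustration, not from the source.

```python
import math

def em_gmm_1d(data, iters=50):
    """EM sketch for a two-component 1D Gaussian mixture.
    theta = (pi, mu1, mu2, var1, var2); the hidden z picks the component."""
    pi, mu1, mu2, var1, var2 = 0.5, min(data), max(data), 1.0, 1.0
    for _ in range(iters):
        # E-step: responsibilities P(z = 1 | x, theta_n).
        # The 1/sqrt(2*pi) constant cancels in the ratio, so it is omitted.
        resp = []
        for x in data:
            p1 = pi * math.exp(-(x - mu1) ** 2 / (2 * var1)) / math.sqrt(var1)
            p2 = (1 - pi) * math.exp(-(x - mu2) ** 2 / (2 * var2)) / math.sqrt(var2)
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate theta to maximize the expected complete log likelihood
        n1 = sum(resp)
        n2 = len(data) - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1 + 1e-6
        var2 = sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2 + 1e-6
        pi = n1 / len(data)
    return pi, mu1, mu2

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
pi, mu1, mu2 = em_gmm_1d(data)
```

Each iteration leaves the log likelihood no worse than before, so the two estimated means drift toward the two groups in the data.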
What is Clustering?
Clustering is the process of making groups of similar objects. It is used in pattern recognition, data analysis, and image processing.
k-means
This is used in exclusive clustering: each object belongs to exactly one cluster, and each cluster is represented by the mean of its members. For example, suppose the data are split into two clusters:

K1 = {2, 3, 4, 10, 11, 12}        K2 = {20, 25, 30}
m1 = (2 + 3 + 4 + 10 + 11 + 12)/6 = 7        m2 = (20 + 25 + 30)/3 = 25

Each point is assigned to the cluster whose mean is nearer, and the means are then recomputed. When the assignments no longer change, the process stops, and K1 and K2 are the final clusters.
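The worked example above can be sketched in code. This is a minimal two-cluster k-means on the same 1D data (the function name and structure are illustrative); it assumes both clusters stay non-empty throughout.

```python
def kmeans_1d(points, m1, m2, iters=10):
    """Two-cluster k-means sketch on 1D data (assumes non-empty clusters)."""
    for _ in range(iters):
        # assignment step: each point joins the cluster with the nearer mean
        k1 = [p for p in points if abs(p - m1) <= abs(p - m2)]
        k2 = [p for p in points if abs(p - m1) > abs(p - m2)]
        # update step: recompute each cluster mean
        new_m1 = sum(k1) / len(k1)
        new_m2 = sum(k2) / len(k2)
        if (new_m1, new_m2) == (m1, m2):  # no change in the means -> stop
            break
        m1, m2 = new_m1, new_m2
    return k1, k2, m1, m2

points = [2, 3, 4, 10, 11, 12, 20, 25, 30]
k1, k2, m1, m2 = kmeans_1d(points, m1=2, m2=30)
```

Starting from the extreme points as initial means, the procedure settles on the clusters and means given in the example above.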
[Scanned worked example: each object A–E is described by two attributes (Attribute X, Attribute Y), and pairwise Euclidean distances d(i, j) = √((x_i − x_j)² + (y_i − y_j)²) are computed to build a 5×5 distance matrix. The figures then show agglomerative clustering: at each step the pair with the minimum distance in the matrix is found and merged (e.g. "min dist = 3, hence merge"), the distance matrix is recomputed over the new clusters, and the process repeats until a single cluster remains, drawn as a dendrogram over A, B, C, D, E.]
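The merging procedure in the scanned example can be sketched as follows: repeatedly find the pair of clusters with the minimum distance, merge them, and measure cluster-to-cluster distance as the minimum over member pairs (single linkage). The positions and object indices below are illustrative, not the values from the scanned matrix.

```python
def single_linkage(dist, n):
    """Agglomerative clustering sketch over n objects.
    dist[i][j] is the object-to-object distance; cluster-to-cluster distance
    is the minimum over member pairs (single linkage)."""
    def cdist(c1, c2):
        return min(dist[i][j] for i in c1 for j in c2)

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the minimum distance
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cdist(clusters[ij[0]], clusters[ij[1]]))
        d = cdist(clusters[a], clusters[b])
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# five points on a line; distances are absolute differences
pos = [0, 1, 4, 5, 10]
D = [[abs(p - q) for q in pos] for p in pos]
merges = single_linkage(D, 5)
```

The returned merge list records, in order, which clusters were joined and at what distance, which is exactly the information a dendrogram displays.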
Radial Basis Function
A radial basis function (RBF) is a term that describes any real valued function whose output
depends exclusively on the distance of its input from some origin. Typically, radial basis
functions are defined in terms of the standard Euclidean norm of the input vector, but
technically speaking one can use any other norm as well.
A radial basis function gives value to each point based on its distance from the origin or a
fixed centre, usually on a Euclidean space. It is spherically symmetrical.
In machine learning, radial basis functions are most commonly used as a kernel for
classification with the support vector machine (SVM). For this application of RBFs,
the Gaussian Radial Basis Function is virtually always selected.
One of the commonly used types, the Gaussian, is usually chosen as the activation function to compute the derived features (the units in the middle of the network) in neural networks known as radial basis function networks; this use of RBFs as network activation functions is less common than their use as SVM kernels.
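A minimal sketch of a Gaussian RBF follows; the `gamma` width parameter is an assumed convention (one of several parameterizations in use).

```python
import math

def gaussian_rbf(x, center, gamma=1.0):
    """Gaussian RBF: the output depends only on the squared Euclidean
    distance of x from the centre; gamma controls the width."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-gamma * sq_dist)
```

The value is 1 at the centre and decays with distance; because only the distance enters, points at equal distance from the centre get equal values, i.e. the function is spherically symmetric.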
MODULES
Reinforcement Learning
Introduction
► Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial
Intelligence.
► It allows machines and software agents to automatically determine the ideal behaviour within
a specific context, in order to maximize its performance.
► Simple reward feedback is required for the agent to learn its behaviour; this is known as the
reinforcement signal.
► Reinforcement Learning allows the machine or software agent to learn its behaviour based on
feedback from the environment.
► If the problem is modelled with care, some Reinforcement Learning algorithms can converge
to the global optimum; this is the ideal behaviour that maximises the reward.
Functioning
► In reinforcement learning, the learner is a decision-making agent that takes actions in an
environment and receives reward (or penalty) for its actions in trying to solve a problem.
► After a set of trial and error runs, it should learn the best policy, which is the sequence of
actions that maximize the total reward.
Core Concept
► Reinforcement learning differs from standard supervised learning in that correct
input/output pairs are never presented, nor are sub-optimal actions explicitly corrected.
► There is a focus on 'on-line' performance, which involves finding a balance between
exploration (of uncharted territory) and exploitation (of current knowledge).
► It is called "learning with a critic," as opposed to learning with a teacher which we have in
supervised learning.
► A critic differs from a teacher in that it does not tell us what to do but only how well we have
been doing in the past; the critic never informs in advance.
► The feedback from the critic is scarce and when it comes, it comes late. This leads to the credit
assignment problem.
► Reinforcement learning generates internal values for the intermediate actions in terms of
how good they are in leading to the goal and getting us to the real reward.
► Once such an internal reward mechanism is learned, the agent can just take the local actions
to maximize it.
► The solution to the task requires a sequence of actions, which resembles a Markov process. A
Markov decision process is used to model the agent.
► The difference is that in the case of Markov models, there is an external process that generates
a sequence of signals. In this case it is the agent that generates the sequence of actions.
► Sometimes a partially observable Markov decision process is seen in cases where the agent
does not know its state exactly but should infer it with some uncertainty through observations
using sensors.
[Figure: the agent–environment loop — the agent performs an Action on the Environment, and the Environment returns a Reward and the new state.]
Episode/Trial: The sequence of actions from the start to the terminal state is an episode or a trial.
Based on the application, any state may be designated as the initial state. There is also an absorbing
terminal (goal) state where the search ends; all actions in this terminal state transition to itself with
probability 1 and without any reward.
Policy: The policy, π, defines the agent's behaviour and is a mapping from the states of the
environment to actions: π: S → A. The policy defines the action to be taken in any state s_t: a_t = π(s_t).
The value of a policy, V^π(s_t), is the expected cumulative reward that will be received while the
agent follows the policy, starting from state s_t.
Finite Horizon: In the finite horizon or episodic model, the agent tries to maximize the expected
reward for the next T steps:
V^π(s_t) = E[ r_{t+1} + r_{t+2} + · · · + r_{t+T} ] = E[ Σ_{i=1..T} r_{t+i} ]
Infinite Horizon: Tasks that are continuing, and there is no prior fixed limit to the episode. In the
infinite horizon model, there is no sequence limit, but future rewards are discounted:

V^π(s_t) = E[ Σ_{i=1..∞} γ^{i−1} r_{t+i} ],   0 ≤ γ < 1

Discount Rate: The discount rate γ is used to keep the return finite. If γ = 0, then only the immediate
reward counts. As γ approaches 1, rewards further in the future count more. γ is taken less than 1 because
there generally is a time limit to the sequence of actions needed to solve the task.
Optimal Policy: For each policy π, there is a V^π(s_t), and we want to find the optimal policy π* such
that

V*(s_t) = max_π V^π(s_t),   ∀ s_t
In some cases, instead of working with the values of states, V(s_t), it is preferred to work with the
values of state–action pairs, Q(s_t, a_t). V(s_t) denotes how good it is for the agent to be in state s_t,
whereas Q(s_t, a_t) denotes how good it is to perform action a_t when in state s_t. We define Q*(s_t, a_t) as
the value, that is, the expected cumulative reward, of action a_t taken in state s_t and then obeying the
optimal policy afterwards. The value of a state is equal to the value of the best possible action:
V*(s_t) = max_{a_t} Q*(s_t, a_t)
        = max_{a_t} E[ Σ_{i=1..∞} γ^{i−1} r_{t+i} ]
        = max_{a_t} E[ r_{t+1} + γ Σ_{i=1..∞} γ^{i−1} r_{t+i+1} ]
        = max_{a_t} E[ r_{t+1} + γ V*(s_{t+1}) ]

V*(s_t) = max_{a_t} ( E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) V*(s_{t+1}) )        ----------(1)
Equation (1) is known as Bellman's equation. Similarly, the value of the expected cumulative reward
of a state–action pair can be represented by

Q*(s_t, a_t) = E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})
Once the value of Q*(s_t, a_t) is calculated, the policy can be defined by taking the action a*_t which has
the highest value among all Q*(s_t, a_t):

π*(s_t): Choose a*_t where Q*(s_t, a*_t) = max_{a_t} Q*(s_t, a_t)

If the values of Q*(s_t, a_t) are known, then using a greedy approach at each local step, the optimal sequence
of steps that maximises the cumulative reward can be calculated.
Model-Based Learning
In model-based learning, the environment model parameters, P(r_{t+1}|s_t, a_t) and P(s_{t+1}|s_t, a_t), are known.
If the environment parameters are known, the optimal value function and policy can be solved directly
using dynamic programming. The optimal value function is unique and is the solution to the
simultaneous equations (Bellman's equation). The optimal policy is to choose the action that
maximizes the value of the optimal value function in the next state.

Value Iteration
The value function is computed iteratively:

V_{l+1}(s_t) = max_{a_t} ( E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) V_l(s_{t+1}) )

where l is the iteration counter. It is possible that the policy converges to the optimal one even before
the values converge to their optimal values, because only the actions with maximum values are
considered.
Policy Iteration
In policy iteration, the policies are stored and updated directly, rather than doing this indirectly over the
values. The process starts with a policy and improves it repeatedly until there is no change. The value
function of the current policy can be calculated by solving the linear equations. The policy is then
checked for improvement. This step is guaranteed to improve the policy, and when no improvement is
possible, the policy is guaranteed to be optimal.
Exploration Strategies
c-greedv search: c-grccdy search is where with probability E, we choose one action uniformly
randomly among all possible actions reffcrcd to as explore, und with probability 1-c, we choose the
best action, rcffcred us exploit. Start with a high value and gradually decrease it inorder to exploit.
Soft Policy: A policy is soft if the probability of choosing any action a ∈ A in state s ∈ S is greater than
0. A soft-max function is used to convert the Q values to probabilities:

P(a|s) = exp Q(s, a) / Σ_{b∈A} exp Q(s, b)

and then sample according to these probabilities. The above equation can be rewritten using a
"temperature" variable T, defining the probability of choosing action a as

P(a|s) = exp[Q(s, a)/T] / Σ_{b∈A} exp[Q(s, b)/T]
When T is large, all probabilities are nearly equal and we have exploration. When T is small, better actions
are favored. The strategy is to start with a large T and decrease it gradually, a procedure named
annealing, which in this case moves from exploration to exploitation.
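The soft-max with temperature can be sketched directly from the equation above (function name illustrative):

```python
import math

def softmax_probs(q_values, T):
    """P(a|s) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T): convert Q values
    to action probabilities using temperature T."""
    exps = [math.exp(q / T) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

high_T = softmax_probs([1.0, 2.0], T=100.0)  # near-uniform: exploration
low_T = softmax_probs([1.0, 2.0], T=0.1)     # best action dominates: exploitation
```

Annealing then corresponds to calling this with a T that shrinks over time.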
In the deterministic case, the backed-up value can be used as an assignment to update Q(s_t, a_t). At state s_t, action a_t is chosen, which returns a reward
r_{t+1} and takes us to state s_{t+1}. The value of the previous action can be updated as

Q̂(s_t, a_t) ← r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1})

where the hat denotes that the value is an estimate. Q̂(s_{t+1}, a_{t+1}) is a later value and has a higher
chance of being correct. We discount this by γ, add the immediate reward (if any), and take this as
the new estimate for the previous Q̂(s_t, a_t). This is called a backup because it can be viewed as taking
the estimated value of an action in the next time step and "backing it up" to revise the estimate for
the value of the current action.
Assume that all Q̂(s, a) values are stored in a table. Initially all Q̂(s_t, a_t) are 0, and they are updated
in time as a result of trial episodes. If there is a sequence of moves, then at each move the estimate of
the Q value of the previous state–action pair is updated using the Q value of the current state–action
pair by the formula

Q̂(s_t, a_t) ← r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1})
In the intermediate states, all rewards are 0, so no update is done. When the goal state is
reached, the reward r is obtained, and the Q value of the previous state–action pair is updated to r. As for
the preceding state–action pair, its immediate reward is 0, and the contribution from the next
state–action pair is discounted by γ because it is one step later. Then in another episode, if we reach
this state, we can update the one preceding that as γr, and so on. This way, after many episodes,
this information is backed up to earlier state–action pairs. Q values increase until they reach their
optimal values as we find paths with higher cumulative reward.
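The backing-up behaviour described above can be demonstrated on a tiny deterministic chain of states with a single "move right" action and a reward of 1 only on entering the goal; the setup is an illustrative assumption, not from the source. After enough episodes, the Q value k transitions before the goal-entering move settles at γ^k.

```python
GAMMA = 0.9
N = 4  # states 0..3; moving right from state 3 enters the (terminal) goal

def run_episode(Q):
    """One left-to-right episode, applying the deterministic backup
    Q(s) <- r + GAMMA * max_next. With a single action, max_next is Q[s+1]."""
    for s in range(N):
        reward = 1.0 if s + 1 == N else 0.0           # reward only on entering the goal
        next_value = Q[s + 1] if s + 1 < N else 0.0   # terminal state has value 0
        Q[s] = reward + GAMMA * next_value

Q = [0.0] * N
for _ in range(N):  # after N episodes the reward has been backed up to state 0
    run_episode(Q)
```

Episode by episode, the reward propagates one state further back, exactly the "backing up" described in the text.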
If the rewards and the results of actions are not deterministic, then there is a probability distribution
for the reward, p(r_{t+1}|s_t, a_t), from which rewards are sampled, and there is a probability distribution for
the next state, P(s_{t+1}|s_t, a_t). These help us model the uncertainty in the system that may be due to forces
we cannot control in the environment. For example, we may have an imperfect robot which sometimes fails
to go in the intended direction and deviates, or advances shorter or longer than expected. In this case we can write

Q(s_t, a_t) = E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) max_{a_{t+1}} Q(s_{t+1}, a_{t+1})

A direct assignment is not possible, because for the same state and action, different rewards may be received
or different next states may be reached. This type of problem is handled by keeping a running average,
and the method is known as Q-learning.
Q̂(s_t, a_t) ← Q̂(s_t, a_t) + η ( r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}) − Q̂(s_t, a_t) )
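The running-average update above can be sketched as a one-line function (names and the alternating-reward demonstration are illustrative):

```python
def q_update(q_sa, reward, max_q_next, eta=0.1, gamma=0.9):
    """Move the estimate a fraction eta toward the backed-up sample
    r + gamma * max_a' Q(s', a'), i.e. keep a running average over
    noisy samples rather than assigning a sample directly."""
    return q_sa + eta * (reward + gamma * max_q_next - q_sa)

# with rewards alternating between 1 and 0 (a nondeterministic reward),
# the estimate settles near the average backed-up value of 0.5
q = 0.0
for i in range(1000):
    q = q_update(q, reward=1.0 if i % 2 == 0 else 0.0, max_q_next=0.0)
```

A direct assignment would oscillate between 0 and 1 here; the η-weighted update instead hovers near the mean.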
Temporal difference (TD) algorithms are those reducing the difference between the current Q value
and the backed-up estimate from one time step later. This is an off-policy method, as the value of the
best next action is used without using the policy. In an on-policy method, the policy is used to
determine also the next action. The on-policy version of Q learning is the Sarsa algorithm.
Eligibility Traces: The temporal difference is used to update only the previous value (of the state or
state–action pair). An eligibility trace is a record of the occurrence of past visits that enables us to
implement temporal credit assignment, allowing us to update the values of previously occurring visits
as well. To store the eligibility trace, an additional memory variable is required, associated with each
state–action pair, e(s, a), initialized to 0. When the state–action pair (s, a) is visited, i.e., when action a is
taken in state s, its eligibility is set to 1; the eligibilities of all other state–action pairs are multiplied
by γλ, where 0 ≤ λ ≤ 1 is the trace decay parameter.
e_t(s, a) = 1                       if s = s_t and a = a_t
e_t(s, a) = γλ e_{t−1}(s, a)        otherwise
If a state–action pair has never been visited, its eligibility remains 0; if it has been, as time passes and
other state–action pairs are visited, its eligibility decays depending on the values of γ and λ. In Sarsa, the
temporal error at time t is

δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)

In Sarsa with an eligibility trace, named Sarsa(λ), all state–action pairs are updated as

Q(s, a) ← Q(s, a) + η δ_t e_t(s, a),   ∀ s, a
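The trace mechanics can be sketched directly from the eligibility rule above (state and action names are illustrative):

```python
GAMMA, LAM = 0.9, 0.8  # discount and trace decay parameters

def step_traces(e, visited):
    """Decay every eligibility by GAMMA * LAM, then set the eligibility of
    the state-action pair visited at this step to 1."""
    for key in e:
        e[key] *= GAMMA * LAM
    e[visited] = 1.0

e = {("s1", "a1"): 0.0, ("s2", "a1"): 0.0}
step_traces(e, ("s1", "a1"))  # visit (s1, a1): its eligibility becomes 1
step_traces(e, ("s2", "a1"))  # one step later, (s1, a1) has decayed to GAMMA * LAM
```

When the TD error δ_t arrives, every pair is updated in proportion to its current eligibility, so recently visited pairs receive most of the credit.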
Generalization
In all the reinforcement learning algorithms above, it is assumed that the Q(s, a) values (or V(s), if values of
states are estimated) are stored in a lookup table, and the algorithms are called tabular algorithms.
Some issues with this tabular approach are:
• When the number of states and the number of actions is large, the size of the table may become
quite large.
• States and actions may be continuous, and to use a table they must be discretized, which
may cause error.
• When the search space is large, too many episodes may be needed to fill in all the entries of
the table with acceptable accuracy.
Instead of storing the Q values, consider a regression problem: a supervised learning problem
with a regressor Q(s, a|θ), taking s and a as inputs and parameterized by a vector of parameters, θ,
used to learn the Q values. This can be an artificial neural network with s and a as its inputs, one output, and
θ its connection weights.
A good approximation may be achieved with a simple model without explicitly storing the training
instances; it can use continuous inputs; and it allows generalization. If we know that similar (s, a) pairs
have similar Q values, we can generalize from past cases and come up with good Q(s, a) values even
if that state–action pair has never been encountered before.
To be able to train the regressor, a training set is required. In the case of Sarsa(0), we saw before that
we would like Q(s_t, a_t) to get close to r_{t+1} + γ Q(s_{t+1}, a_{t+1}). So we can form a set of training
samples where the input is the state–action pair (s_t, a_t) and the required output is
r_{t+1} + γ Q(s_{t+1}, a_{t+1}).
If using a gradient-descent method, as in training neural networks, the parameter vector is updated as

Δθ = η [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_θ Q(s_t, a_t)
This is a one-step update. In the case of Sarsa(λ), the eligibility trace is also taken into account,
with e initialized to all zeros. In the case of a tabular algorithm, the eligibilities are stored for the state–action pairs
because they are the parameters (stored as a table). In the case of an estimator, eligibility is associated
with the parameters of the estimator. Any regression method can be used to train the Q function, but
the particular task has a number of requirements. First, it should allow generalization; that is, we need to
guarantee that similar states and actions have similar Q values. This also requires a good coding of s
and a, as in any application, to make the similarities apparent. Second, reinforcement learning updates
provide instances one by one and not as a whole training set, and the learning algorithm should be
able to do individual updates to learn the new instance without forgetting what has been learned
before.
Well Posed Learning Problem:
A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
It includes three features: the class of tasks (T), the measure of performance to be improved (P), and the source of experience (E).