Module 1

Introduction to Machine Learning

Machine learning can be defined as a branch of artificial intelligence that has the ability to learn on its own, without having all the information hard-wired by a domain expert into the program itself.

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.

The process of machine learning is similar to that of data mining. Both systems search through data to look for patterns. However, instead of extracting data for human comprehension, as is the case in data mining applications, machine learning uses the data to improve the program's own understanding. Machine learning programs detect patterns in data and adjust program actions accordingly.

Key Points
• ML turns data into information (meaningful data).
• ML is a combination of computer science, engineering and statistics.
• ML uses statistical concepts to process data and to solve problems.
• ML interprets data and acts on it.
• ML programs optimize a performance criterion using past experience.
Expert Systems

Expert systems (ES) are computer applications developed to solve complex problems in a particular domain, at the level of extraordinary human intelligence and expertise, and to support decision-making like a human expert.

Characteristics of Expert Systems
• High performance
• Understandable
• Reliable
• Highly responsive

Capabilities of Expert Systems
An expert system is capable of:
• Advising
• Instructing and assisting humans in decision making
• Demonstrating
• Deriving a solution
• Diagnosing
• Explaining
• Interpreting input
• Predicting results
• Justifying the conclusion
• Suggesting alternative options to a problem
The components of an expert system include:
• Knowledge Base: Knowledge is required to exhibit intelligence. The success of any ES majorly depends upon the collection of highly accurate and precise knowledge.
  • Factual knowledge: Factual knowledge is the knowledge of the task domain that is widely shared, typically found in textbooks or journals, and commonly agreed upon by those knowledgeable in the particular field.
  • Heuristic knowledge: Heuristic knowledge is the more judgemental knowledge that comes with experience and characterizes expert-level performance. In contrast to factual knowledge, heuristic knowledge is rarely discussed and is largely individualistic.
• Knowledge Representation: Knowledge representation is the method used to organize and formalize the knowledge in the knowledge base. It is in the form of IF-THEN-ELSE rules: if the IF part of the rule is satisfied, then, consequently, the THEN part can be concluded.
• Knowledge Acquisition: The success of any expert system majorly depends on the quality, completeness, and accuracy of the information stored in the knowledge base. The knowledge base is formed by readings from various experts, scholars, and knowledge engineers.
• Inference Engine: The inference engine is essential in deducing a correct, flawless solution. It acquires and manipulates the knowledge from the knowledge base to arrive at a particular solution.
• User Interface: Provides interaction between the user of the ES and the ES itself. It explains how the ES has arrived at a particular recommendation.
Basic Terminology

Training set: The set of examples used for learning, i.e., to fit the parameters of the classifier. The training set is often selected by applying a random filter to the data, e.g., select 90% of the points at random to generate the model and test against the rest.

Validation data: A set of examples used to tune the parameters of a classifier. It is usually used to adjust the classification parameters in order to avoid overfitting.

Test set: The test set is the data whose outcome is already known; it is used to determine the accuracy of the machine learning algorithm, based on the training.

Target variable: It is the value to be predicted by a machine learning algorithm. It could be binary (0 or 1) under a classifier, or it could be a continuous variable if you are doing a regression.

Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.

Regression: Regression analysis is a statistical process for estimating the relationships among variables.

Feature: A feature is an individual measurable property of a phenomenon being observed. Features and attributes are individual measurements which, when combined with other features, make a training example.
Types of Machine Learning

Supervised Learning: Training with the help of a teacher is called supervised learning. In supervised learning, the algorithm generates a function that maps inputs to desired outputs. The standard formulation of the supervised learning task is the classification problem: the learner is required to learn a function which maps a vector into one of several classes by looking at several input-output examples of the function.

Unsupervised Learning: A learning process without a teacher is called unsupervised learning. During training, the machine learning algorithm receives the input patterns and organizes these patterns to form clusters. When a new input pattern is applied, the network gives an output response indicating the cluster to which the input pattern belongs.

Reinforcement Learning: Similar to supervised learning, but in this learning process only critic information is available, not the exact (target) information. The learning is done based on the critic information, and a feedback signal called the reinforcement signal is sent back as output to the input.
Issues in Machine Learning

1. What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data?
2. Which algorithms perform best for which types of problems and representations?
3. How much training data is sufficient?
4. What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
5. When and how can prior knowledge held by the learner guide the process of generalizing from examples?
6. Can prior knowledge be helpful even when it is only approximately correct?
7. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
8. What is the best way to reduce the learning task to one or more function approximation problems?
9. What specific functions should the system attempt to learn? Can this process itself be automated?
10. How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
-

Steps in Developing a Machine Learning Application

1. Collect data: This includes collecting samples by scraping a website and extracting data from an RSS feed or an API, accessing sensors, or making use of publicly available data.
2. Prepare the input data: Once data is collected, make sure it is in a usable format. The benefit of having a standard format is that you can mix and match algorithms and data sources.
3. Analyze the input data: This is looking at the data from the previous task. It involves recognizing patterns, identifying outliers, and detecting novelty.
4. Train the algorithm: The actual machine learning takes place here. This step and the next are where the "core" algorithms lie, depending on the algorithm. Feed the algorithm good clean data from the first two steps and extract knowledge or information. The knowledge extracted is stored in a format that is readily usable by a machine for the next two steps.
5. Test the algorithm: The information learned in the previous step is put to use and checked for accuracy.
6. Use it: Make use of a real program to do some task, and check once again whether all the previous steps worked as expected.
Applications of Machine Learning

1. Learning Associations: This is a method for discovering interesting relations between variables in large databases. It is used to identify regularities among large-scale datasets. Learning associations is mainly used in market basket analysis: finding associations between products bought by customers. If people who buy X typically also buy Y, and there is a customer who buys X and does not buy Y, he or she is a potential Y customer. Once such patterns are identified, they can be targeted for cross-selling.

2. Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.

3. Pattern Recognition: This is a branch of machine learning that focuses on the recognition of patterns, similarities and regularities in data.
• Optical character recognition: the recognition of character codes from their images. In this case there are multiple classes, as many as there are characters to be recognized.
• Handwriting recognition: people have different handwriting styles; characters may be written small or large, slanted, with a pen or pencil, and there are many possible images corresponding to the same character.
• Face recognition: the input is an image and the classes are the people to be recognized. The learning program should learn to associate the face images with identities.
• Medical diagnosis: the inputs are the relevant information about the patient and the classes are the illnesses.
• Speech recognition: the input is acoustic and the classes are words that can be uttered. The association to be learned is from an acoustic signal to a word of some language.

4. Natural Language Processing: a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

5. Biometrics: the recognition or authentication of people using their physiological and/or behavioral characteristics; it requires an integration of inputs from different modalities.

6. Knowledge Extraction: Learning a rule from data is known as knowledge extraction. The rule is a simple model that explains the data; using it, an explanation of the process underlying the data is generated.

7. Compression: By fitting a rule to the data, an explanation that is simpler than the data, requiring less memory to store and less computation to process, can be generated. This process is known as compression.

8. Outlier Detection: Outlier detection is the process of finding the instances that do not obey the rule and are exceptions to the standard case.

9. Regression: Regression analysis is a statistical process for estimating the relationships among variables.

10. Density Estimation: There is a structure to the input space such that certain patterns occur more often than others; based on the frequency of occurrence, it can be concluded what will happen and what will not. One type of density estimation is clustering.

Real-World Applications of Machine Learning
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Text-based sentiment analysis in social networks
• Network intrusion detection and fraud detection
• E-mail spam filtering
• Prediction of equipment failures
• Linguistic rule generation
Module 04 - Learning with
Regression and Trees
Contents

• Learning with Regression


• Linear Regression
• Logistic Regression
• Learning with Trees
• Decision Trees
• Constructing Decision Trees using Gini Index
• Classification and Regression Trees (CART)
Linear Regression
Supervised Learning and Regression

• Regression analysis falls under supervised machine learning.


• The system tries to predict a value for an input based on previous
information.
• Characteristics of regression:
• The responses obtained from the model are always quantitative in nature.
• The model can be constructed only if past data is available.
Regression Analysis: Estimating Relationships

Regression Analysis is a study of the relationship between a set of


independent variables and the dependent variable.
Independent variables are characteristics that can be measured directly
(for example, the area of a house). These variables are also called predictor
variables (used to predict the dependent variable) or explanatory
variables (used to explain the behavior of the dependent variable).
Dependent variable is a characteristic whose value depends on the
values of independent variables.
Regression Analysis: Estimating Relationships
Purpose of Regression Analysis

Past / Experience / Known  →  Time  →  Future / Unknown

Explanation: Use regression analysis to develop a mathematical model to explain the variance in the dependent variable based on values of independent variables.

Prediction: If the regression model adequately explains the dependent variable, use the model to predict values of the dependent variable.
Purpose of Regression Analysis
Explain Selling Price of a house (dependent) based on its characteristics
(independents). If the model is valid, use it for prediction.

Develop Regression Model using known data (sample)

Selling Price = 40,000 + 100*(Area in Sq. ft.) + 20,000*(#Baths)

If the above model is reliable and valid, use this model to predict the Selling
Price of any house based on its area (Sq. ft.) and the number of bathrooms
(#Baths). For example, a 2,000 sq. ft. house with 2 baths would be predicted to sell for 40,000 + 100×2,000 + 20,000×2 = 280,000.
General Nature of Linear Regression

Most types of nonlinear relationships may be reduced to linear cases


by means of appropriate preliminary transformations to the original
observations.
General Nature of Linear Regression

Let x2 = x1², so that the quadratic term becomes a new linear predictor.

Therefore,
y = β0 + β1x1 + β2x2
General Nature of Linear Regression

Let y' = ln y

Therefore, an exponential relationship becomes linear in the transformed variable:
y' = β0 + β1x
Simple Linear Regression

• Is a statistical method that allows us to summarize and study the


relationship between two continuous (quantitative) variables.
• Between predictor, explanatory, or independent variable and
response, outcome, or dependent variable.
• Statistical relationships are different from deterministic
relationships.
Simple Linear Regression

y = β0 + β1x + ε

[Figure: example data plotted against temperature in Celsius]
Selecting Independent Variables: Scatter Plots

Scatter Plots are used to visualize the relationship between any two
variables. For regression analysis, we are looking for strong linear
relationship between the independent and dependent variable.

[Scatter plot: Sales vs. Promotion, with fitted line y = 0.7623x + 25.126]
Selecting Independent Variables: Scatter Plots

• In statistical relationships, the plot is called a scatter diagram or


scatter plot because of the scattering of points in the relation.
• The purpose of regression models is to identify a functional
relationship between the target variable and a subset of the
remaining attributes contained in the dataset.
• It enables us to:
• Highlight and interpret the dependency of the target variable on other
variables.
• Predict the future value of the target variable
Procedure for Building Regression Models

Define Objectives: Define/clarify purpose. Identify and describe the measurement of the dependent variable.

Select Variables: Identify possible independent variables (predictors - should make sense). Use scatter plots and correlations for selection.

Estimate Model: Estimate regression coefficients (using the least squares method).

Test Model: Test to see if all coefficients are significant (reliability). Establish validity (are relationships as expected, do predictions match actuals).

Implement and Use: Implement the model in a Decision Support System. Incorporate error in predictions. Outline limitations/constraints of the model.

Monitor Performance: Compare predictions with actual values. Modify/refine/expand the model if necessary. It is about continuous improvement.
Least Square Method
(Simple Linear Regression)

The least squares criterion minimizes the sum of squared residuals:

min Σ (yi - ŷi)²

where:
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
Least Square Method
(Simple Linear Regression)

Slope for the estimated regression line (hypothesis function):

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

y-intercept for the estimated regression line (hypothesis function):

b0 = ȳ - b1x̄

where:
xi = value of independent variable for the ith observation
yi = value of dependent variable for the ith observation
x̄ = mean value for the independent variable
ȳ = mean value for the dependent variable
n = total number of observations
Simple Linear Regression

Estimate the regression line (hypothesis function) for the given data.
Height (x) Weight (y)
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
Simple Linear Regression

S.No.  Height (x) cms  Weight (y) kgs  (xi - x̄)  (yi - ȳ)  (xi - x̄)(yi - ȳ)  (xi - x̄)²
1      151             63              -2.8       -2.3        6.44              7.84
2      174             81              20.2       15.7      317.14            408.04
3      138             56             -15.8       -9.3      146.94            249.64
4      186             91              32.2       25.7      827.54           1036.84
5      128             47             -25.8      -18.3      472.14            665.64
6      136             57             -17.8       -8.3      147.74            316.84
7      179             76              25.2       10.7      269.64            635.04
8      163             72               9.2        6.7       61.64             84.64
9      152             62              -1.8       -3.3        5.94              3.24
10     131             48             -22.8      -17.3      394.44            519.84
x̄ = 153.8, ȳ = 65.3                                      Σ = 2649.6        Σ = 3927.6

Hence b1 = 2649.6 / 3927.6 ≈ 0.6746 and b0 = 65.3 - 0.6746 × 153.8 ≈ -38.45, so the estimated line is ŷ = -38.45 + 0.6746x.
Simple Linear Regression
[Scatter plot of Weight (y) against Height (x) for the sample data]
Simple Linear Regression

Estimate the regression line (hypothesis function) for the given data.
Area in Square feet (x)  House Price in $1000 (y)
1400 245
1600 312
1700 279
1875 308
1100 199
1550 219
2350 405
2450 324
1425 319
1700 255
Simple Linear Regression
[Scatter plot: House Price in $1000 (y) vs. Area in Square feet (x), with fitted line y = 0.1098x + 98.248]
Simple Linear Regression

Estimate the regression line (hypothesis function) for the given data.
Year (x) Sales$ (y)
2000 1350
2001 1610
2002 1430
2003 1790
2004 1550
2005 1670
2006 1870
2007 1850
2008 1920
2009 1980
Simple Linear Regression
[Scatter plot: Sales $ (y) vs. Year (x), with fitted line y = 62.424x - 123427]
Simple Linear Regression

Estimate the regression line (hypothesis function) for the given data.
Study Hours (x) Test Results % (y)
60 67
40 61
50 73
65 80
35 60
40 55
50 62
30 50
45 61
55 70
Simple Linear Regression
[Scatter plot: Test Results % (y) vs. Study Hours (x), with fitted line y = 0.6955x + 31.212]
Worked example: estimate the slope from the following data.

Student  xi   yi   (xi - x̄)  (yi - ȳ)  (xi - x̄)(yi - ȳ)  (xi - x̄)²
1        95   85      17         8           136             289
2        85   95       7        18           126              49
3        80   70       2        -7           -14               4
4        70   65      -8       -12            96              64
5        60   70     -18        -7           126             324
Total   390  385                             470             730

x̄ = (95 + 85 + 80 + 70 + 60) / 5 = 78

ȳ = (85 + 95 + 70 + 65 + 70) / 5 = 77

Now
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 470 / 730 ≈ 0.644, and b0 = 77 - 0.644 × 78 ≈ 26.8.
Evaluation of Model Estimators

• Once the model is established, you need to confirm whether the


model is good enough to make predictions
• Various metrics are used
• Karl Pearson's Coefficient of Correlation
• R-Square
• Multiple-R
• Standard Error of Estimate
Karl Pearson's Coefficient of Correlation

Karl Pearson's coefficient of correlation is a helpful statistical formula


that quantifies the strength of the relationship between two variables.
This coefficient value helps in determining how strong that relationship
is between the two variables .
r = [N Σxy - (Σx)(Σy)] / √([N Σx² - (Σx)²] × [N Σy² - (Σy)²])

where x and y are variables and N is the number of instances.


Karl Pearson's Coefficient of Correlation

• It has a value between +1 and -1


• 1 is total positive linear correlation
• 0 is no linear correlation
• -1 is total negative linear correlation
• Limitations of Pearson coefficients
• It always assumes linear relationship.
• Interpreting the value of r is difficult.
• Value of the correlation coefficient is affected by extreme values.
• It is time consuming.
Karl Pearson's Coefficient of Correlation

Estimate the Karl Pearson's correlation coefficient for the given data.
Height (x) Weight (y)
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
Karl Pearson's Coefficient of Correlation
Height (x)  Weight (y)  xy  x²  y²
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48

Σx  Σy  Σxy  Σx²  Σy²


Karl Pearson's Coefficient of Correlation
Height (x)  Weight (y)  xy     x²     y²
151         63          9513   22801  3969
174         81          14094  30276  6561
138         56          7728   19044  3136
186         91          16926  34596  8281
128         47          6016   16384  2209
136         57          7752   18496  3249
179         76          13604  32041  5776
163         72          11736  26569  5184
152         62          9424   23104  3844
131         48          6288   17161  2304

Σx = 1538  Σy = 653  Σxy = 103081  Σx² = 240472  Σy² = 44513
Karl Pearson's Coefficient of Correlation

r = (10 × 103081 - 1538 × 653) / √[(10 × 240472 - 1538²) × (10 × 44513 - 653²)]

r = 0.977
R-Square

• R-square gives information about the goodness-of-fit measure for


linear regression models
• It indicates the percentage of the variance in the dependent
variable that is explained by the independent variable(s).
• It measures the strength of the relationship on a 0 to 100% scale.
R-Square
For the regression line ŷ = b0 + b1x:

where
yi is the observed output for the input xi
ȳ is the mean of the observed outputs
ŷi is the estimated output for the input xi

Σ(yi - ȳ)² = SST is the total sum of squared deviations in y from its mean.
Σ(ŷi - ȳ)² = SSR is the sum of squares due to regression.
Σ(yi - ŷi)² = SSE is the sum of squared residuals (errors).
R-Square
Relationship Among SST, SSR, SSE

SST = SSR + SSE

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
R-Square

The coefficient of determination is:

R² = SSR / SST

where:
SSR = sum of squares due to regression
SST = total sum of squares
R-Square

Estimate the R-Square for the given data.


Height (xi)  Weight (yi)
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
R-Square
Height (xi)  Weight (yi)  ŷi  (yi - ȳ)²  (ŷi - ȳ)²  (yi - ŷi)²
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
ȳ  SST  SSR  SSE
R-Square
Height (xi)  Weight (yi)  ŷi      (yi - ȳ)²  (ŷi - ȳ)²  (yi - ŷi)²
151          63           63.410    5.29       3.574      0.168
174          81           78.925  246.49     185.652      4.304
138          56           54.640   86.49     113.640      1.850
186          91           87.021  660.49     471.784     15.836
128          47           47.894  334.89     302.976      0.799
136          57           53.291   68.89     144.226     13.760
179          76           82.298  114.49     288.946     39.670
163          72           71.505   44.89      38.500      0.245
152          62           64.084   10.89       1.478      4.344
131          48           49.918  299.29     236.618      3.677
ȳ = 65.3   SST = 1872.1   SSR = 1787.392   SSE = 84.652
R-Square

R² = SSR / SST = 1787.392 / 1872.1 = 0.955
Multiple R

• Multiple R is a correlation coefficient.
• It gives the strength of a linear relationship.
• It is the square root of R-Square.

Multiple R = √(R-Square) = √(SSR/SST) = √0.955 = 0.977
Standard Error of Estimate
The standard error of the estimate is a measure of the accuracy of
predictions. It is the measure of variation of an observation made around the
computed regression line.
σ_est = √( Σ(yi - ŷi)² / N )

The denominator N is the sample size (n) reduced by the number of model
parameters (p) estimated from the same data:
N = (n - p) for p regressors
N = (n - p - 1) if an intercept is used

Standard Error of Estimate

σ_est = √( Σ(yi - ŷi)² / N ) = √(84.652 / 8) = 3.253
Logistic Regression
Why not linear regression?
Examples of Binary Classification Model

• Spam Detection: Predicting if an email is Spam or not.


• Credit Card Fraud: Predicting if a given credit card transaction is fraud
or not.
• Health: Predicting if a given mass of tissue is benign or malignant.
• Marketing: Predicting if a given user will buy an insurance product or
not.
• Banking: Predicting if a customer will default on a loan.
Logistic Regression

• In linear regression, the response Y is continuous.


• If output is discrete, it is a classification problem.

• Logistic Regression is a binary classification algorithm used when the


response variable is dichotomous (1 or 0).
• Output: A random variable Yi that takes values (1 and 0) with
probabilities pi and 1 - pi, respectively.
Why not linear regression?

• Let p denote the probability that Y = 1 when X = x.

• For a linear model to describe p, the model for the probability would
be
p = Probability(Y = 1 | X = x) = β0 + β1x
• Since p is a probability, it must lie between 0 and 1.
• The linear function is unbounded, and hence cannot be used to
model a probability.
Odds
• The odds of an event is the ratio of the expected number of times
that an event will occur to the expected number of times it will not
occur.
• If p is the probability of an event and O is the odds of the event, then

O = p / (1 - p) = probability of event / probability of no event

• Transforming the probability to odds removes the upper bound; for example, p = 0.8 gives odds O = 0.8/0.2 = 4.
• Transforming the probability to odds removes the upper bound.
Why Odds?

• Unlike probabilities, there is no upper bound on the odds.


• However, odds still have a lower bound (zero).
Logit Function

• If we take the logarithm of the odds, we also remove the lower


bound.
• Thus, we get the mathematical model as

log[ pi / (1 - pi) ] = β0 + β1xi1 + β2xi2 + ... + βkxik

• The expression log[ pi / (1 - pi) ] is called the logit function.

Note: In the above model, we have taken the natural log.


Logit Function

[Plot of the logit function f(x) = log(x / (1 - x)) for x in (0, 1); the curve is unbounded in both directions]
Logistic Regression

What is Pi ?
Logistic Regression

• We can solve the logit equation for pi to obtain

pi / (1 - pi) = e^(β0 + β1xi1 + β2xi2 + ... + βkxik)

(1 - pi) / pi = 1 / e^(β0 + β1xi1 + β2xi2 + ... + βkxik)

Logistic Regression

1 / pi = [1 + e^(β0 + β1xi1 + ... + βkxik)] / e^(β0 + β1xi1 + ... + βkxik)

pi = e^(β0 + β1xi1 + ... + βkxik) / [1 + e^(β0 + β1xi1 + ... + βkxik)]

• We can simplify it by dividing both numerator and denominator by the
numerator to obtain:

pi = 1 / [1 + e^-(β0 + β1xi1 + ... + βkxik)]
Logistic Regression

This function is the Logistic Function.

[S-shaped curve of the logistic function, rising from 0 toward the asymptote y = 1 as x increases]
Maximum Likelihood Estimation

• In linear regression, we used the method of least squares to estimate


regression coefficients.
• In logistic regression, we use another approach called maximum
likelihood estimation .
• The maximum likelihood estimate of a parameter is that value that
maximizes the probability of the observed data.
Maximum Likelihood Estimation

• The likelihood function is the probability that the observed values of the
dependent variable may be predicted from the observed values of the
independent variables.
• The likelihood varies from 0 to 1.
• It is easier to work with the logarithm of the likelihood function. This
function is known as the log-likelihood.
Maximum Likelihood Estimation

• Suppose in a population, each individual has the same probability p
that an event occurs.
• For a sample of size n, Yi = 1 indicates that the event occurs for the ith
subject; otherwise, Yi = 0.
• The observed data are Y1, Y2, ..., Yn and X1, X2, ..., Xn.
Maximum Likelihood Estimation

• The joint probability of the data (the likelihood) is given by

L = Π pi^yi (1 - pi)^(1 - yi)

• The natural logarithm of the likelihood is

ln L = Σ [ yi ln(pi) + (1 - yi) ln(1 - pi) ]

• In which

pi = e^(α + βxi) / (1 + e^(α + βxi))
Maximum Likelihood Estimation

• Estimating the parameters α and β is done using the first derivatives
of the log-likelihood, and solving them for α and β.
• For this, iterative computing is used.
• An arbitrary value for the coefficients (usually 0) is first chosen.
• Then the log-likelihood is computed and the variation of coefficient
values observed.
• Reiteration is then performed until maximization of L.
• The results are the maximum likelihood estimates of α and β.
Estimating the Parameters

P(yi = 1 | xi; β) = h(xi)

P(yi = 0 | xi; β) = 1 - h(xi)

P(yi | xi; β) = h(xi)^yi × (1 - h(xi))^(1 - yi)

Estimating the Parameters

L(β) = Π_{i=1..n} h(xi)^yi × (1 - h(xi))^(1 - yi)

ℓ(β) = log(L(β))
Estimating the Parameters

ℓ(β) = Σ_{i=1..n} [ yi log(h(xi)) + (1 - yi) log(1 - h(xi)) ]
Using Gradient Descent Algorithm

Cost Function:

J(β) = -(1/n) Σ_{i=1..n} [ yi log(h(xi)) + (1 - yi) log(1 - h(xi)) ]

Objective:
min_β J(β)
Using Gradient Descent Algorithm

Repeat {
  βj := βj - α × ∂J(β)/∂βj   (simultaneously update all βj)
}

Repeat {
  βj := βj - α × Σ_{i=1..n} (h(xi) - yi) xij   (simultaneously update all βj)
}
Decision Trees

Classification and Decision Tree

• A tree is built in which the leaf nodes contain the output category.
• The class of the output is predicted based on the rules generated
from the tree structure.
• Learned trees can be represented as a set of IF-THEN rules as well.
Decision Trees

Should I wait at this restaurant?


Patrons?
Classification and Decision Tree
• A decision tree is a classifier that partitions data recursively to form
groups or classes.
• This is a supervised learning algorithm which can be used on discrete or
continuous data for classification or regression.
• The algorithms used in decision trees are ID3, C4.5, CART, C5.0, CHAID,
QUEST, CRUISE, etc.
• The decision tree consists of nodes that form a rooted tree, meaning it is a
directed tree with a node called "root" that has no incoming edges.
• All other nodes have exactly one incoming edge.
• A node with outgoing edges is called an internal or test node.
• All other nodes are called leaves.
Problem Solving Using Decision Trees

Decision trees can be used to solve problems that have the following
features:
• Instances or tuples are represented as attribute value pairs, where the
attributes take a small number of disjoint possible values.
• The target function has discrete output values such as yes or no.
• Decision trees require disjunctive descriptions which implies that the output
of the decision tree can be represented using a rule-based classifier.
• Decision tree can be used when training data contains errors and when it
contains missing attribute values.
Decision Tree Notes

Draw a decision tree for the given dataset. (This type of question is commonly asked in university exams.)

Age | Competition | Type     | Profit
Old | Yes         | Software | Down
Old | No          | Software | Down
Old | No          | Hardware | Down
Mid | Yes         | Software | Down
Mid | Yes         | Hardware | Down
Mid | No          | Hardware | Up
Mid | No          | Software | Up
New | Yes         | Software | Up
New | No          | Hardware | Up
New | No          | Software | Up

. -
Let Se.~ how b-o So I Vt, J:h,S ----~---:---- -
0 Dekh bhcti _koi bhi cJatase6 .J°' ·. /C1ble. cJi_Jo ho
vs kci. Jo /rut colomn hotC/ ha i no voh hoC-a
h~, class l) TrR11Bvr E

~ /'!I St apnc f-obl.t. fr1.< PROFIT 1,af au, uskf


/o valr.u. o nhe,r~i vo h ho~ rn.vu /eQ(! 0 (!) r&._

1' \\
'--- r~ k-t-
Jo
f~t wale
node Ko lea P nocb-
1<..q ht:-a ha i
j
I
I

I
I

,.I
I
~

'
- - - -- -- - ---
P-= O I N~Z ~

lO t \{h 'onC{ \
bhi valu..t
:rrr;,N ,J -: . D
OldG'
5 I m I I$'.]' l.l,l.L ~nJ 8>{ __________
p;!- 2., (V :- 2-
T (P,·,Ni) ~
m,d
~loq
"1
(J=) -~
Y
loq (-3y:J- - -
Ji 4 -:Jr :;.----

--
- ~ ----

----- --~- --·---- ------ ------- - --


P-=3,fv=O

TC0,rvi) = 0 ~

NeuJ

P+N "- :Je.h Po.nJ IV cl~--s>


attt"lb&>h~, }{j vq/LU. hai
evru>u_ P=6, N-:: 5

== o+.3 _ (oJ + 2+:2. (J.) + 3+0 (oj


10 10 lo

- 10
- - - - -- --
Step 3: Compute the entropy of the class attribute.

class P = Profit(Up) = 5
class N = Profit(Down) = 5

(Our class attribute is Profit, so we compute its entropy.)

Entropy(class) = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))
               = -(5/10) log2(5/10) - (5/10) log2(5/10)
               = 1
Now we know that Up and Down will be the leaf nodes, so the root node will be one of the remaining attributes. To find it, we compute the gain of every column and compare; whichever column gives the largest gain becomes the root node.

For Age:
Value | Pi | Ni | I(Pi, Ni)
Old   | 0  | 3  | 0
Mid   | 2  | 2  | 1
New   | 3  | 0  | 0

Entropy(Age) = 0.4 (computed above)
Now for Competition:
Value | Pi | Ni | I(Pi, Ni)
Yes   | 1  | 3  | 0.8113
No    | 4  | 2  | 0.9183

Entropy(Competition) = Σ (Pi + Ni)/(P + N) × I(Pi, Ni)
                     = (1+3)/10 × 0.8113 + (4+2)/10 × 0.9183
                     = 0.8755
For Type:
Value    | Pi | Ni | I(Pi, Ni)
Software | 3  | 3  | 1
Hardware | 2  | 2  | 1

(I(Pi, Ni) = 1 for both values because Pi = Ni.)

Entropy(Type) = (3+3)/10 × 1 + (2+2)/10 × 1 = 1
Gain = Entropy(class) - Entropy(attribute):

Information Gain
Age         : 1 - 0.4    = 0.6
Competition : 1 - 0.8755 = 0.1245
Type        : 1 - 1      = 0

The attribute with the largest gain becomes the root node, so Age is the root. Now we write all the values of Age (Old, Mid, New) as branches below the root node.
If you look at the Old rows, Profit is Down in every one of them, so the Old branch goes directly to a Down leaf. The same holds for New: every New row has Profit = Up, so the New branch goes directly to an Up leaf. But the Mid rows contain both Up and Down values, so we cannot decide the Mid branch directly. So we take only the Mid rows and repeat the procedure on that subset:

Age | Competition | Type     | Profit
Mid | Yes         | Software | Down
Mid | Yes         | Hardware | Down
Mid | No          | Hardware | Up
Mid | No          | Software | Up

The same process is applied again: compute the class entropy and the entropies of Competition and Type on this subset, and whichever attribute gives the larger gain becomes the node below Mid.

class P = Profit(Up) = 2
class N = Profit(Down) = 2
Entropy(Profit) = 1
Competition (within Age = Mid):
Value | Pi | Ni | I(Pi, Ni)
Yes   | 0  | 2  | 0
No    | 2  | 0  | 0

(Whenever either Pi or Ni is zero, I(Pi, Ni) is zero.)

Entropy(Competition) = (2/4) × 0 + (2/4) × 0 = 0

Gain(Competition) = Entropy(class) - Entropy(Competition) = 1 - 0 = 1

So Competition becomes the node below Mid, with branches:
Competition = Yes → Down
Competition = No  → Up
Obviously you may have a doubt about why Type was not chosen. Look: when we compare the Competition and Profit values within the Mid subset, Competition separates them perfectly (Yes → Down, No → Up), whereas each of Type's two values (Software/Hardware) has one Up and one Down row, so its gain is 0. Competition therefore wins:

Competition = Yes → Down
Competition = No  → Up

So this is how you can draw the whole decision tree:

Age = Old → Down
Age = New → Up
Age = Mid → Competition: Yes → Down, No → Up

When a new instance comes in, we match it against the tree: if its Age is New, Profit will be Up; if Age is Old, Profit will be Down; and for Mid we check Competition. This is how the decision tree does its work.
Note: The solution method shown above is the standard exam procedure; the calculation is a bit lengthy, so practise it for understanding rather than memorizing it.
Attribute Selection Measures

• Information Gain/ Entropy Reduction


• Gain Ratio
• Gini Index
Decision Tree Induction: Attribute Selection

Intuitively: A good attribute splits the examples into subsets that are
(ideally) all positive or all negative.

[Illustration: candidate attribute splits of positive and negative examples into subsets]
Decision Tree Induction: Decision Boundary

[Illustration: a sequence of axis-parallel splits progressively partitioning a 2-D plot of positive (+) and negative (•) examples into purer regions]
Basic Decision Tree Learning Algorithm

• Attribute selection measures are also known as splitting rules because


they determine how the tuples at a given node are to be split.
• The attribute having the best score for the measure is chosen as the
splitting attribute for the given tuples.
Basic Decision Tree Learning Algorithm

• Iterative Dichotomiser 3 (ID3) Algorithm was the first of three

Decision Tree implementations developed by Ross Quinlan.
• It is intended for use with nominal (categorical) inputs only.
• Real-valued variables are first binned into intervals, with each interval
being treated as an unordered nominal attribute.
• It builds a decision tree for the given data in a top-down fashion,
starting from a set of objects and a specification of properties
Resources and Information.
Basic Decision Tree Learning Algorithm

• Each node of the tree, one property is tested based on maximizing


information gain and minimizing entropy, and the results are used to
split the object set.
• This process is recursively done until the set in a given sub-tree is
homogeneous (i.e. it contains objects belonging to the same
category).
• The ID3 algorithm uses a greedy search.
• It selects a test using the Information Gain criterion, and then never
explores the possibility of alternate choices.
Basic Decision Tree Learning Algorithm
Disadvantages
• Data may be over-fitted or over-classified, if a small sample is tested.
• Only one attribute at a time is tested for making a decision.
• Does not handle numeric attributes and missing values.
• It maintains only a single current hypothesis as it searches through the space of
decision trees.
• It does not have the ability to determine alternative decision trees.
• It does not perform backtracking in search. Hence, there are chances of getting
stuck in local optima.
• It is less sensitive to errors because information gain, which is a
statistical property, is used.
Basic Decision Tree Learning Algorithm

• C4.5 Algorithm developed by Ross Quinlan as a successor and


refinement of ID3, is the most popular tree-based classification
method.
• It uses Gain Ratio as an impurity measure.
• In C4.5, multiway splits for categorical variables are treated the same
way as in ID3.
• Continuous-valued attributes have been incorporated by dynamically
defining new discrete-valued attributes that partition the continuous-
values into binary set of intervals.
Basic Decision Tree Learning Algorithm

The new features (versus ID3) are:


• accepts both continuous and discrete features.
• handles incomplete data points.
• solves over-fitting problem by (very clever) bottom-up technique
usually known as "pruning".
• different weights can be applied to the features that comprise the
training data.
Basic Decision Tree Learning Algorithm

Disadvantages
• C4.5 constructs empty branches with zero values.
• Overfitting happens when the algorithm model picks up data with
uncommon characteristics, especially when data is noisy.
Basic Decision Tree Learning Algorithm

• Classification and Regression Trees (CART) which uses Gini Index as


impurity measure, is characterized by the fact that it constructs
binary trees, namely each internal node has exactly two outgoing
edges.
• CART can handle both numeric and categorical variables and it can
easily handle outliers.
Basic Decision Tree Learning Algorithm

Disadvantages
• It can split on only one variable.
• Trees formed may be unstable.
ID3 Decision Tree
Simplified ID3 Algorithm

1. Compute the Entropy for the training data set S.


2. For every attribute/feature:
a. Calculate entropy for all categorical values.
b. Take average entropy for the current attribute.
c. Calculate Information Gain for the current Attribute.
3. Pick the highest Information Gain Attribute.
4. Repeat until the desired Decision Tree is generated.
Entropy

• Entropy is an entity that controls the split in data.
• It computes the homogeneity of examples.
• Entropy ranges between 0 and 1:
  • 0 if all members of S belong to the same class
  • 1 if there are equal numbers of positive and negative examples

Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖


Information Gain/ Entropy Reduction

Information gain is the measure of effectiveness of an attribute in


classifying the training data.
Information Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) × Entropy(Sv)
Problem based on ID3
Instance Outlook (x1) Temperature (x 2) Humidity (X3) Wind (X4) Play? (y)
S(1) Sunny Hot High Weak No
S(2) Sunny Hot High Strong No
S(3) Overcast Hot High Weak Yes
S(4) Rain Mild High Weak Yes
S(5) Rain Cool Normal Weak Yes
S(6) Rain Cool Normal Strong No
S(7) Overcast Cool Normal Strong Yes
S(8) Sunny Mild High Weak No
S(9) Sunny Cool Normal Weak Yes
S(10) Rain Mild Normal Weak Yes
S(11) Sunny Mild Normal Strong Yes
S(12) Overcast Mild High Strong Yes
S(13) Overcast Hot Normal Weak Yes
S(14) Rain Mild High Strong No
Problem based on ID3

S is a collection of 14 samples.


Its attributes are outlook, temperature, humidity and wind.
They can have the following values:
• Outlook= {Sunny, Overcast, Rain}
•Temperature= {Hot, Mild, Cool}
• Humidity= {High, Normal}
• Wind= {Strong, Weak}
Problem based on ID3

S = [9+, 5-]
Calculate Entropy(S).
Problem based on ID3

S = [9+, 5-]
Calculate Entropy(S).

Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14)

Entropy(S) = 0.94
Problem based on ID3

Values(Outlook) = Sunny, Overcast, Rain


S = [9+, 5-]
S_Sunny = [2+, 3-]
S_Overcast = [4+, 0-]
S_Rain = [3+, 2-]

Information Gain(S, Outlook) = ?


Problem based on ID3

Values(Outlook) = Sunny, Overcast, Rain

S = [9+, 5-]
S_Sunny = [2+, 3-]
S_Overcast = [4+, 0-]
S_Rain = [3+, 2-]

Information Gain(S, Outlook)

= Entropy(S) - (5/14) × Entropy(S_Sunny) - (4/14) × Entropy(S_Overcast)
  - (5/14) × Entropy(S_Rain)
Problem based on ID3
Values(Outlook) = Sunny, Overcast, Rain
S = [9+, 5-]
S_Sunny = [2+, 3-]
S_Overcast = [4+, 0-]
S_Rain = [3+, 2-]

Information Gain(S, Outlook)

= Entropy(S) - (5/14) × Entropy(S_Sunny) - (4/14) × Entropy(S_Overcast) - (5/14) × Entropy(S_Rain)

= 0.94 - (5/14) × 0.971 - (4/14) × 0 - (5/14) × 0.971 = 0.247
Problem based on ID3

Values(Temperature) = Hot, Mild, Cool

S = [9+, 5-]
S_Hot = [2+, 2-]
S_Mild = [4+, 2-]
S_Cool = [3+, 1-]

Information Gain(S, Temperature) = ?


Problem based on ID3
Values(Temperature) = Hot, Mild, Cool

Information Gain(S, Temperature)

= Entropy(S) - (4/14) × Entropy(S_Hot) - (6/14) × Entropy(S_Mild) - (4/14) × Entropy(S_Cool)

= 0.94 - (4/14) × 1 - (6/14) × 0.918 - (4/14) × 0.811 = 0.029
Problem based on ID3

Values(Humidity) = High, Normal

S = [9+, 5-]
S_High = [3+, 4-]
S_Normal = [6+, 1-]

Information Gain(S, Humidity) = ?


Problem based on ID3

Values(Humidity) = High, Normal

S = [9+, 5-]
S_High = [3+, 4-]
S_Normal = [6+, 1-]

Information Gain(S, Humidity)

= Entropy(S) - (7/14) × Entropy(S_High) - (7/14) × Entropy(S_Normal)
= 0.94 - (7/14) × 0.985 - (7/14) × 0.592 = 0.151
Problem based on ID3

Values(Wind) = Strong, Weak

S = [9+, 5-]
S_Strong = [3+, 3-]
S_Weak = [6+, 2-]

Information Gain(S, Wind) = ?


Problem based on ID3

Values(Wind) = Strong, Weak

S = [9+, 5-]
S_Strong = [3+, 3-]
S_Weak = [6+, 2-]

Information Gain(S, Wind)

= Entropy(S) - (6/14) × Entropy(S_Strong) - (8/14) × Entropy(S_Weak)
= 0.94 - (6/14) × 1 - (8/14) × 0.811 = 0.048
Problem based on ID3

• The Outlook Attribute wins pretty handily, so it's placed at the root of
the Decision Tree.
• Afterwards, the Decision Tree is expanded to cover Outlook's possible
values.
• In Play, the Outlook can be Sunny, Overcast or Rain, so all of the
examples that have a Sunny Outlook are funneled through the Sunny
path, the examples with an Overcast Outlook are diverted to the
Overcast path and so on.
Problem based on ID3
Outlook
(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14)

Sunny Overcast Rain

/ \
Problem based on ID3

Sunny outlook on decision


Instance Outlook (x 1 ) Temperature (x 2 ) Humidity (x 3 ) Wind (X4) Play? (y)
S(1) Sunny Hot High Weak No
S(2) Sunny Hot High Strong No
S(8) Sunny Mild High Weak No
S(9) Sunny Cool Normal Weak Yes
S(11) Sunny Mild Normal Strong Yes
Problem based on ID3

Gain(Outlook = Sunny | Temperature) = ?

Gain(Outlook = Sunny | Humidity) = ?
Gain(Outlook = Sunny | Wind) = ?
Problem based on ID3

Gain(Outlook = Sunny | Temperature) = 0.971 - (2/5)×0 - (2/5)×1 - (1/5)×0 = 0.571

Gain(Outlook = Sunny | Humidity) = 0.971 - (3/5)×0 - (2/5)×0 = 0.971
Gain(Outlook = Sunny | Wind) = 0.971 - (2/5)×1 - (3/5)×0.918 = 0.02
Problem based on ID3
Outlook
(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14)

Sunny Overcast Rain

\
Humidity
(1, 2, 8, 9, 11)

High Normal

I \
Problem based on ID3

High humidity on decision


Instance Outlook (x 1 ) Temperature (x 2 ) Humidity (x 3 ) Wind (X4) Play? (y)
S(1) Sunny Hot High Weak No
S(2) Sunny Hot High Strong No
S(8) Sunny Mild High Weak No
Problem based on ID3

Normal humidity on decision


Instance Outlook (x 1 ) Temperature (x 2 ) Humidity (x 3 ) Wind (X4) Play? (y)
S(9) Sunny Cool Normal Weak Yes
S(11) Sunny Mild Normal Strong Yes
Problem based on ID3
Outlook
(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14)

Sunny Overcast Rain

\
Humidity
(1, 2, 8, 9, 11)

High Normal
Problem based on ID3

Overcast outlook on decision


Instance Outlook (x 1 ) Temperature (x 2 ) Humidity (X3) Wind (X4) Play? (y)
S(3) Overcast Hot High Weak Yes
S(7) Overcast Cool Normal Strong Yes
S(12) Overcast Mild High Strong Yes
S(13) Overcast Hot Normal Weak Yes
Problem based on ID3
Outlook
(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14)

Sunny Overcast Rain

\
Humidity
(1, 2, 8, 9, 11)

High Normal
Problem based on ID3

Rain outlook on decision


Instance Outlook (x 1 ) Temperature (x 2 ) Humidity (x 3 ) Wind (X4) Play? (y)
S(4) Rain Mild High Weak Yes
S(5) Rain Cool Normal Weak Yes
S(6) Rain Cool Normal Strong No
S(10) Rain Mild Normal Weak Yes
S(14) Rain Mild High Strong No
Problem based on ID3

Gain(Outlook = Rain | Temperature) = ?

Gain(Outlook = Rain | Humidity) = ?
Gain(Outlook = Rain | Wind) = ?
Problem based on ID3

Gain(Outlook = Rain | Temperature) = 0.971 - (3/5)×0.918 - (2/5)×1 = 0.02

Gain(Outlook = Rain | Humidity) = 0.971 - (2/5)×1 - (3/5)×0.918 = 0.02
Gain(Outlook = Rain | Wind) = 0.971 - (2/5)×0 - (3/5)×0 = 0.971
Problem based on ID3
Outlook
(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14)

Sunny Overcast Rain

Humidity Wind
(1, 2, 8, 9, 11) (4, 5, 6, 10, 14)

High Normal Strong Weak

I \
Problem based on ID3

Strong wind on decision


Instance Outlook (x 1 ) Temperature (x 2 ) Humidity (x 3 ) Wind (X4) Play? (y)
S(6) Rain Cool Normal Strong No
S(14) Rain Mild High Strong No
Problem based on ID3

Weak wind on decision


Instance Outlook (x 1 ) Temperature (x 2 ) Humidity (x 3 ) Wind (X4) Play? (y)
S(4) Rain Mild High Weak Yes
S(5) Rain Cool Normal Weak Yes
S(10) Rain Mild Normal Weak Yes
Problem based on ID3
Outlook
(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14)

Sunny Overcast Rain

Humidity Wind
(1, 2, 8, 9, 11) (4, 5, 6, 10, 14)

High Normal Strong Weak


Problem based on ID3
Instance Eye-colour (x1) Married (x2) Gender (x3) Hair length (x4) Class (y)
S(1) Brown Yes Male Long Football
S(2) Blue Yes Male Short Football
S(3) Brown Yes Male Long Football
S(4) Brown No Female Long Netball
S(5) Brown No Female Long Netball
S(6) Blue No Male Long Football
S(7) Brown No Female Long Netball
S(8) Brown No Male Short Football
S(9) Brown Yes Female Short Netball
S(10) Brown No Female Long Netball
S(11) Blue No Male Long Football
S(12) Blue No Male Short Football
Problem based on ID3
Instance Temperature (x1) Wind (x2) Humidity (y)
S(1) Hot Weak Normal
S(2) Hot Strong High
S(3) Mild Weak Normal
S(4) Mild Strong High
S(5) Cool Weak Normal
S(6) Mild Strong Normal
S(7) Mild Weak High
S(8) Hot Strong Normal
S(9) Mild Strong Normal
S(10) Cool Strong Normal
Problem based on ID3
Name Hair Height Weight Location Class

Sunita Blonde Average Light No Yes


Anita Blonde Tall Average Yes No

Kavita Brown Short Average Yes No


Sushma Blonde Short Average No Yes
Xavier Red Average Heavy No Yes

Balaji Brown Tall Heavy No No


Ramesh Brown Average Heavy No No
Swetha Blonde Short Light Yes No
Problem based on ID3
Instance x1 x2 x3 x4 y
S(1) Low Medium Medium Low False
S(2) Low Medium High Medium False
S(3) Low Medium Medium High False
S(4) High Medium Medium Low False
S(5) High High Medium Medium True
S(6) Low High Medium Low False
S(7) Low Medium High Low True
S(8) High High Low Medium False
S(9) Low Medium Medium Medium False
S(10) Low Low High Medium False
S(11) High Low Medium Medium True
S(12) Low Low Medium High False
Problem based on ID3
RID age income student credit_rating Class: buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Multiclass Classification
Entropy(S) = - Σ_{i=1..N} pi log2 pi

where,
N is the number of classes
Pi is the probability of an element to belong in class i
Problem based on ID3
Transportation
Instance Gender Car Ownership (x 1 ) Travel Cost (x 2 ) Income Level (x 3 )
(y)
S(1) Male 0 Cheap Low Bus
S(2) Male 1 Cheap Medium Bus
S(3) Female 1 Cheap Medium Train
S(4) Female 0 Cheap Low Bus
S(5) Male 1 Cheap Medium Bus
S(6) Male 0 Standard Medium Train
S(7) Female 1 Standard Medium Train
S(8) Female 1 Expensive High Car
S(9) Male 2 Expensive Medium Car
S(10) Female 2 Expensive High Car

Decision Tree Tutorial - Kardi Teknomo


Problem based on ID3
Instance Weather (x1) Parents (x2) Money (x3) Decision (y)
S(1) Sunny Yes Rich Cinema
S(2) Sunny No Rich Tennis
S(3) Windy Yes Rich Cinema
S(4) Rainy Yes Poor Cinema
S(5) Rainy No Rich Stay in
S(6) Rainy Yes Poor Cinema
S(7) Windy No Poor Cinema
S(8) Windy No Rich Shopping
S(9) Windy Yes Rich Cinema
S(10) Sunny No Rich Tennis
C4.5 Decision Tree
Simplified Algorithm

1. Let S be the set of training instances.


2. Choose an attribute that best differentiates the instances contained
in S (C4.5 uses the Gain Ratio to determine).
3. Create a tree node whose value is the chosen attribute:
a. Create child links from this node where each link represents a
unique value for the chosen attribute.
b. Use the child link values to further subdivide the instances into
subclasses.
Gain Ratio and Split Information

Split Info_A(D) = - Σ_{j=1..v} (|Dj| / |D|) × log2(|Dj| / |D|)

Gain Ratio(A) = Gain(A) / Split Info_A(D)
Problem based on C4.5
Instance Outlook (x1) Temperature (x 2) Humidity (X3) Wind (X4) Play? (y)
S(1) Sunny Hot High Weak No
S(2) Sunny Hot High Strong No
S(3) Overcast Hot High Weak Yes
S(4) Rain Mild High Weak Yes
S(5) Rain Cool Normal Weak Yes
S(6) Rain Cool Normal Strong No
S(7) Overcast Cool Normal Strong Yes
S(8) Sunny Mild High Weak No
S(9) Sunny Cool Normal Weak Yes
S(10) Rain Mild Normal Weak Yes
S(11) Sunny Mild Normal Strong Yes
S(12) Overcast Mild High Strong Yes
S(13) Overcast Hot Normal Weak Yes
S(14) Rain Mild High Strong No
CART (Decision Tree)
Simplified Algorithm

1. Let S be the set of training instances.


2. Choose an attribute that best differentiates the instances contained
in S (CART uses the Gini Index to determine).
3. Create a tree node whose value is the chosen attribute:
a. Create child links from this node where each link represents a
unique value for the chosen attribute.
b. Use the child link values to further subdivide the instances into
subclasses.
Gini Index
Gini Index(Attribute = Value) = 1 - Σ_{i=1..N} (pi)²

Gini Index(Attribute) = Σ_{v ∈ Values} pv × Gini Index(v)
Problem based on CART
Instance Outlook (x1) Temperature (x 2) Humidity (X3) Wind (X4) Play? (y)
S(1) Sunny Hot High Weak No
S(2) Sunny Hot High Strong No
S(3) Overcast Hot High Weak Yes
S(4) Rain Mild High Weak Yes
S(5) Rain Cool Normal Weak Yes
S(6) Rain Cool Normal Strong No
S(7) Overcast Cool Normal Strong Yes
S(8) Sunny Mild High Weak No
S(9) Sunny Cool Normal Weak Yes
S(10) Rain Mild Normal Weak Yes
S(11) Sunny Mild Normal Strong Yes
S(12) Overcast Mild High Strong Yes
S(13) Overcast Hot Normal Weak Yes
S(14) Rain Mild High Strong No
Problem based on CART
RID age income student credit_rating Class: buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
CART (Regression Tree)
Standard Deviation: for regression trees, a split is scored by how much it reduces the standard deviation of the numeric target.
Problems based on CART
Day Outlook Temp. Humidity Wind Golf Players
1 Sunny Hot High Weak 25
2 Sunny Hot High Strong 30
3 Overcast Hot High Weak 46
4 Rain Mild High Weak 45
5 Rain Cool Normal Weak 52
6 Rain Cool Normal Strong 23
7 Overcast Cool Normal Strong 43
8 Sunny Mild High Weak 35
9 Sunny Cool Normal Weak 38
10 Rain Mild Normal Weak 46
11 Sunny Mild Normal Strong 48
12 Overcast Mild High Strong 52
13 Overcast Hot Normal Weak 44
14 Rain Mild High Strong 30
Tutorials Labs Exam
All Complete 74
Some Partial 23
All Complete 61
All Complete 74
Some Partial 25
All Complete 61
Some Complete 54
Some Partial 42
Some Complete 55
All Complete 75
Some Partial 13
All Complete 73
Some Partial 31
Some Partial 12
Some Partial 11

https://www.youtube.com/watch?v=nWuUahhK30c
Underfitting and Overfitting

• Underfitting occurs when a machine learning algorithm cannot
capture the underlying trend of the data.
• Overfitting occurs when a machine learning algorithm captures the
noise of the data.
• Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the
training data if there exists some alternative hypothesis h' ∈ H, such
that h has smaller error than h' over the training examples, but h'
has a smaller error than h over the entire distribution of instances.
Avoiding Overfitting

• Approaches that stop the growth of the tree before it reaches the point
where it perfectly classifies the training data.
• Approaches that allow the tree to overfit the data, and then post-
prune it.
Approaches to Identify Final Tree Size

• Use separate datasets for training and for evaluation. This is done by
the training and validation set approach.
• Use the entire dataset for training, but apply a statistical test (e.g. the
chi-square test) to estimate whether expanding a node is likely to improve
performance.
• Use an explicit measure of the complexity for encoding the training
examples and the decision tree, halting growth of the tree when this
encoding size is minimized. This is done using the minimum
description length principle.
• Continuous-valued attributes: use a threshold-based Boolean
attribute approach.
• Missing attribute values: assign the value that is most common
among the training examples at node n.
• Handling attributes with differing costs: prefer low-cost attributes
over high-cost attributes.
Rule-Based Classification

• Using IF-THEN Rules for Classification:


• The coverage and accuracy of a rule R are defined as

coverage(R) = n_covers / |D|

accuracy(R) = n_correct / n_covers

where n_covers is the number of tuples covered by R, n_correct is the number of tuples correctly classified by R, and |D| is the number of tuples in the dataset.

• Properties of the rules generated by a rule-based classifier:

• Mutually exclusive rules
• Exhaustive rules
Rule Ordering Schemes

• The rule ordering scheme prioritizes the rules.

• Ordering based on class: the classes are sorted in order of
decreasing "importance".
• With rule ordering, the triggering rule that appears first in the list has
the highest priority, and it fires the class prediction.
Rule Extraction from a Decision Tree

• To extract rules from a decision tree, one rule is created for each path
from the root node to the leaf node.
• Each splitting criterion in each path is logically ANDed to form the rule
antecedent. The leaf node holds the class prediction.
Pruning the Rule Set

• For a given rule antecedent, any condition that does not improve the
estimated accuracy of the rule can be pruned
• Sequential covering algorithm can be used for the same

Algorithm
1. Start with an empty cover.
2. Find the best hypothesis using learn-one-rule.
3. If the just-learnt rule satisfies the threshold then
(a) add the just-learnt rule to the cover,
(b) remove the examples covered by the just-learnt rule,
(c) go to step 2.
4. Sort the cover according to its performance.
5. Return the cover.
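A minimal Python sketch of this loop; learn_one_rule, rule.quality and rule.covers are hypothetical stand-ins for whatever rule learner and rule representation are actually used:

def sequential_covering(examples, target_class, learn_one_rule, threshold):
    # Sketch of the algorithm above; the rule API here is a placeholder.
    cover = []                                  # 1. start with an empty cover
    remaining = list(examples)
    while remaining:
        rule = learn_one_rule(remaining, target_class)  # 2. best hypothesis
        if rule is None or rule.quality < threshold:    # 3. threshold check
            break
        cover.append(rule)                      # 3a. add rule to the cover
        remaining = [x for x in remaining       # 3b. drop covered examples
                     if not rule.covers(x)]
    cover.sort(key=lambda r: r.quality, reverse=True)   # 4. sort by performance
    return cover                                # 5. return the cover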
Comparison

Criterion            ID3                       C4.5                       CART
Splitting Criteria   Information Gain          Gain Ratio                 Gini Index
Attribute Type       Handles only categorical  Handles both categorical   Handles both categorical
                     values                    and numerical values       and numerical values
Missing Values       Does not handle           Handles missing            Handles missing
                     missing values            values                     values
Pruning Strategy     No pruning is done        Error-based pruning        Cost-complexity
                                               is used                    pruning is used
Outlier Detection    Susceptible to outliers   Susceptible to outliers    Can handle outliers
Module 4
Support Vector Machines
Support Vector Machines:
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyperplane. In other words, given labelled training data (supervised learning), the
algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional
space this hyperplane is a line dividing the plane in two parts, with each class lying on either
side.

Let's have a simple approach to it...

Suppose you are given a plot of two labelled classes on a graph as shown in image 1 below. Can you
decide a separating line for the classes?


[Image 1: scatter plot of the two labelled classes (black circles and blue squares).]
You might have come up with something similar to image 2. It fairly separates the two classes.
Any point left of the line falls into the black circle class, and any point on the right falls into the blue square class.
Separation of classes. That's what SVM does. It finds out a line/hyperplane (in
multidimensional space) that separates out the classes.

1. Making it a bit complex ...

Now consider what if we had data as shown in the image below?

[Image: blue squares surrounding a cluster of black circles; no straight line in the x-y plane can separate the two classes.]
Clearly, there is no line that can separate the two classes in this x-y plane. For this we apply a
transformation and add one more dimension, which we call the z-axis. Let's assume the value of a point
on the z-axis is z = x² + y². In this case we can treat it as the distance of the point from the origin.
Now if we plot along the z-axis, a clear separation is visible and a line can be drawn.

Can you draw a separating line in this plane?



[Plot in the z-y plane: the transformed points are linearly separable; a separating line can be drawn here.]

When we transform this line back to the original plane, it maps to a circular boundary as shown in
the image below. These transformations are called kernels.

[Image: transforming back to the x-y plane, the separating line becomes a circle around the black circles.]
Thankfully, you don't have to guess/derive the transformation every time for your data set.
The sklearn library's SVM implementation provides it inbuilt.
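As a hedged illustration of this (the dataset and parameters here are made up), scikit-learn's SVC applies such a kernel internally, so a circular boundary like the one above can be learned without deriving z = x² + y² by hand:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the x-y plane.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf")            # the kernel trick happens inside
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # typically close to 1.0 on this data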

2. Making it a little more complex ... What if the data plots overlap?

[Image: overlapping classes; some black circles lie among the blue squares.]
What about this case?

[Image 1: a straight separating line that tolerates a few outlier points. Image 2: a curved boundary that partitions the training data perfectly.]

Which one do you think is right? Well, both answers are correct. The first one tolerates some outlier
points. The second one is trying to achieve zero tolerance with a perfect partition.

But there is a trade-off. In real-world applications, finding a perfect class boundary for millions of
training examples takes a lot of time. This trade-off is controlled by the regularization parameter.

Tuning parameters:
Kernel
The learning of the hyperplane in a linear SVM is done by transforming the problem using some
linear algebra. This is where the kernel plays its role. For a linear kernel, the prediction for a new
input is calculated using the dot product between the input (x) and each support vector (xi) as follows:

f(x) = B0 + sum(ai * (x, xi))

This is an equation that involves calculating the inner products of a new input vector (x) with
all the support vectors in the training data. The coefficients B0 and ai (for each input) must be
estimated from the training data by the learning algorithm.
The polynomial kernel can be written as K(x, xi) = 1 + sum(x * xi)^d, and the exponential (RBF) kernel as
K(x, xi) = exp(-gamma * sum((x - xi)^2)).
Polynomial and exponential kernels calculate the separation line in a higher dimension. This is called the
kernel trick.

Regularization

The regularization parameter (C) tells the SVM optimization how much you want to avoid
misclassifying each training example. For large values of C, the optimization will choose a
smaller-margin hyperplane if that hyperplane does a better job of getting all the training points
classified correctly. Conversely, a very small value of C will cause the optimizer to look for a
larger-margin separating hyperplane, even if that hyperplane misclassifies more points.

The images below are examples of two different regularization parameter values. The left one has some
misclassification due to a lower regularization value. A higher value leads to results like the right one.

[Image: low regularization value (some points misclassified) vs. high regularization value (all training points classified correctly).]

Gamma
The gamma parameter defines how far the influence of a single training example reaches, with
low values meaning 'far' and high values meaning 'close'. In other words, with low gamma,
points far away from a plausible separation line are considered in the calculation of the separation
line, whereas with high gamma only the points close to the plausible line are considered in the
calculation.
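A hedged sketch of tuning C and gamma together with scikit-learn's GridSearchCV (the grid values are illustrative):

from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# Cross-validated search over the two tuning parameters discussed above.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)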

[Image: with high gamma, only nearby points are considered; with low gamma, far-away points are also considered.]

Support Vector

Support vectors are the data points that lie closest to the separating line/hyperplane; the distance from these points to the hyperplane defines the margin.

[Image: a separating hyperplane with the nearest point marked as a support vector.]
Margin

A margin is the separation of the line from the closest class points.

A good margin is one where this separation is larger for both the classes. The images below give
a visual example of a good and a bad margin. A good margin allows the points to be in their
respective classes without crossing over to the other class.

[Image: a good margin is equidistant from, and as far as possible from, both classes; a bad margin passes very close to the blue class.]

The target is to determine the decision boundary (the good margin). This can be done using
Lagrange multipliers as follows.

The Lagrangian is

L = (1/2) wᵀw + Σ_{i=1}^{n} α_i (1 - y_i (wᵀx_i + b))

Note that ||w||² = wᵀw.

Setting the gradient of L w.r.t. w and b to zero, we have

w + Σ_{i=1}^{n} α_i (-y_i) x_i = 0  ⟹  w = Σ_{i=1}^{n} α_i y_i x_i

Σ_{i=1}^{n} α_i y_i = 0

where w is the weight vector normal to the hyperplane and the y_i are the class labels.

If we substitute w = Σ_i α_i y_i x_i into L, we have

L = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_iᵀx_j + Σ_{i=1}^{n} α_i (1 - y_i (Σ_{j=1}^{n} α_j y_j x_jᵀx_i + b))

  = (1/2) Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j + Σ_i α_i - Σ_i Σ_j α_i α_j y_i y_j x_jᵀx_i - b Σ_i α_i y_i

  = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_iᵀx_j

Note that Σ_{i=1}^{n} α_i y_i = 0.

If we know w, we know all α_i; if we know all α_i, we know w.

Quadratic Programming Problem

For SVMs, sequential minimal optimization (SMO) seems to be the most popular solver.
A QP with two variables is trivial to solve. Each iteration of SMO picks a pair (αi, αj) and
solves the QP with these two variables; this repeats until convergence.
In practice, we can just regard the QP solver as a "black box" without bothering about how it works.

[Figure: two classes separated by the hyperplane wᵀx + b = 0, with margin boundaries wᵀx + b = 1 and wᵀx + b = -1. Points away from the margin have α_i = 0; only the support vectors lying on the margin have α_i > 0.]
Module 5
Learning with Classification
Rule Based Classification
In rule-based classification, the learned model is represented as a set of IF-THEN rules, which is an easy
way of representing information or bits of knowledge. The rules are inexpensive to compute and
interpretable by humans. An IF-THEN rule is expressed as IF condition THEN conclusion,
where the "IF" part (or left side) of a rule is known as the rule antecedent or precondition, and the "THEN"
part (or right side) is the rule consequent. If the condition consists of one or more attribute tests, then all
those attribute tests are logically ANDed.

Example
Consider the following dataset (Table: 1).
RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

From the above dataset a rule R1 can be derived as follows:

R1: IF age = youth AND student = yes THEN buys_computer = yes.
In rule R1, the rule antecedent consists of the attribute tests "age = youth" and "student = yes", which are
joined together using the AND operator.

Characteristics of Rule based Classifier


The rules generated for rule-based classification can be of either of the following types:
• Mutually Exclusive Rules: The rules are said to be mutually exclusive if all the rules generated
for a dataset are independent of each other, that is, each record in the dataset is covered by at most
one rule. Mutually exclusive means that we cannot have rule conflicts here, because no two rules
will be triggered for the same tuple.
• Exhaustive Rules: The rules are said to be exhaustive in nature if they account for every possible
combination of attribute values, so that every record is covered by at least one rule. This does not
require a default rule, as the rule set covers every tuple.

Conflict Resolution Strategy


1. The size ordering scheme assigns the highest priority to the triggering rule that has the "toughest"
requirements, where toughness is measured by the rule antecedent size. That is, the triggering rule with the
most attribute tests is fired.
2. The rule ordering scheme prioritizes the rules beforehand. The ordering may be class-based or rule-based.
• Class-based ordering: the classes are sorted in order of decreasing prevalence (or
importance). All the rules for the most prevalent class come first, the rules for the next prevalent
class come next, and so on. The rules are then sorted based on the misclassification cost per class.
Within each class, the rules are not ordered, because they all predict the same class and no class
conflict can occur there.
• Rule-based ordering: the rules are organized into one long priority list, according to some measure
of rule quality, such as accuracy, coverage, or size, or based on advice from domain experts. The
rule set here is known as a decision list. Only the first triggering rule fires; any other rule that satisfies X is ignored.

If there is no rule satisfied by X, a fallback or default rule can be set up to specify a default class, based on
the training set. The class used as the default class can be the majority class, or the majority class of the tuples
that were not covered by any rule.

Approaches for Creating Rules


Direct Method: This method directly learns the required classification rules from the training dataset, e.g.
the sequential covering algorithm.
Indirect Method: These approaches first create a decision tree or neural network using the training
dataset. The classification rules are later derived from the structure of the decision tree or from the neural
network.

Rule Induction using the Sequential Covering Algorithm (Direct Method)


IF-THEN rules can be extracted directly from the training data using a sequential covering algorithm. The
rules are learned sequentially, one at a time, where each rule for a given class will ideally cover many of the
class's tuples. Each time a rule is learned, the tuples covered by the rule are removed, and the process
repeats on the remaining tuples. When learning a rule for a class C, the rule should cover all (or many) of the
training tuples of class C and none (or few) of the tuples of other classes, so that the learned rules possess
high accuracy. AQ, CN2 and RIPPER are common sequential covering algorithms.
The sequential covering algorithm to learn a set of IF-THEN rules for classification is as follows:
Input:
D: a data set of class-labeled tuples;
Att_vals: the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Algorithm:
(1) Rule_set = {}; // initial set of rules learned is empty
(2) for each class c do
(3)   repeat
(4)     Rule = Learn_One_Rule(D, Att_vals, c);
(5)     remove tuples covered by Rule from D;
(6)     Rule_set = Rule_set + Rule; // add new rule to rule set
(7)   until terminating condition;
(8) endfor
(9) return Rule_set;

Learn_One_Rule
1. Consider one class.
2. Pass through the training data.
3. Find the attribute test that most increases the accuracy of the current (initially empty) rule.
4. Append the attribute test to the rule.

Rules are grown in a general-to-specific manner. The Learn_One_Rule procedure finds the "best" rule for
the current class, given the current set of training tuples. Initially the algorithm starts with an empty rule
and then gradually keeps appending attribute tests to it. Appending is done by adding the attribute test as a
logical conjunct to the existing condition of the rule antecedent.
Rule Extraction from a Decision Tree (Indirect Method)
Decision tree classifiers are a popular method of classification, easily understandable and having high
accuracy. But decision trees can become large and difficult to interpret. A rule-based classifier can be
created from a decision tree by extracting IF-THEN rules, which are easier for humans to interpret.
To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each
splitting criterion along a given path is logically ANDed to form the rule antecedent ("IF" part). The leaf
node holds the class prediction, forming the rule consequent ("THEN" part). A disjunction (logical OR) is
implied between each of the extracted rules. Because the rules are extracted directly from the tree, they are
mutually exclusive and exhaustive. The set of rules generated in this manner from a decision tree may
contain subtree repetition and replication, which has to be removed by rule pruning.
Rule Pruning: Any rule that does not contribute to the overall accuracy of the entire rule set can be
pruned. The rules are individually pruned by removing any condition that does not improve the rule
accuracy.
One problem of pruning is that the rules may no longer be mutually exclusive and exhaustive after pruning. This
can be avoided using conflict resolution strategies (e.g. class-based ordering).

Rule Quality Measures


There needs to be some measure that checks for the improvement in rule quality when an attribute
test is appended to a rule. Some of the measures used to evaluate rule quality are as follows.
(Refer to Module 3 for an explanation of Entropy and Information Gain.)

Coverage & Accuracy:

Coverage of a rule can be defined as the fraction of tuples that satisfy the antecedent of the rule, whereas the
accuracy of the rule is the fraction of tuples that satisfy both the antecedent and the consequent of the rule.
Given a tuple X from a class-labelled data set D, let n_covers be the number of tuples covered by R, n_correct
be the number of tuples correctly classified by R, and |D| be the number of tuples in D. The coverage and
accuracy of R can be defined as follows:

coverage(R) = n_covers / |D|

accuracy(R) = n_correct / n_covers

In other words, a rule's coverage is the percentage of tuples that are covered by the rule, and a rule's accuracy
is the percentage of the covered tuples that are correctly classified.
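A minimal Python sketch of these two measures; the rule antecedent is modelled as a hypothetical predicate function over a tuple:

def coverage_and_accuracy(antecedent, rule_class, dataset):
    # coverage(R) = n_covers / |D|; accuracy(R) = n_correct / n_covers.
    # `antecedent` is a predicate over a tuple; `rule_class` is the consequent.
    n_covers = sum(1 for x, label in dataset if antecedent(x))
    n_correct = sum(1 for x, label in dataset
                    if antecedent(x) and label == rule_class)
    coverage = n_covers / len(dataset)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN buys_computer = yes
D = [({"age": "youth", "student": "yes"}, "yes"),
     ({"age": "youth", "student": "no"}, "no"),
     ({"age": "senior", "student": "yes"}, "yes")]
r1 = lambda x: x["age"] == "youth" and x["student"] == "yes"
print(coverage_and_accuracy(r1, "yes", D))  # (0.333..., 1.0)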

First Order Inductive Learner (FOIL)


FOIL is a type of sequential covering algorithm that learns first-order logic rules. Learning first-order rules
is more complex because those rules contain variables, whereas ordinary rules are propositional in nature
and variable-free.
The tuples of the class for which the rules are being learned are called positive tuples, while the remaining
tuples are negative. Let
pos: the number of positive tuples covered by R;
neg: the number of negative tuples covered by R;
pos': the number of positive tuples covered by R';
neg': the number of negative tuples covered by R'.
FOIL assesses the information gained by extending the condition of R to obtain R' as

FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) - log2( pos / (pos + neg) ) )

It favours rules that have high accuracy and cover many positive tuples.
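A one-function Python sketch of FOIL_Gain (the example counts are made up):

import math

def foil_gain(pos, neg, pos_new, neg_new):
    # FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))

# Extending a rule R (covering 10 pos, 5 neg) to R' (covering 8 pos, 1 neg):
print(round(foil_gain(pos=10, neg=5, pos_new=8, neg_new=1), 2))  # ≈ 3.32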

Likelihood Statistics:
The likelihood ratio statistic is a statistical test of significance used to determine whether the apparent effect of a rule is
not attributable to chance but instead indicates a genuine correlation between attribute values and classes.
The test compares the observed distribution among the classes of tuples covered by a rule with the expected
distribution that would result if the rule made predictions at random.

Likelihood_Ratio = 2 Σ_{i=1}^{m} f_i log(f_i / e_i)

For the tuples satisfying the rule, f_i is the observed frequency of each class i among the tuples, and e_i is the
expected frequency of class i if the rule made random predictions.

Rule Pruning
To avoid overfitting (overfitting means the rules may perform well on the training data but less
well on subsequent data), a rule is pruned by removing a conjunct (attribute test). A rule
R is chosen for pruning if the pruned version of R has greater quality, as assessed on an independent set of
tuples known as the pruning set.

Pruning can be done in several ways, as in decision trees. In FOIL, the quality of a rule R after pruning is evaluated as

FOIL_Prune(R) = (pos - neg) / (pos + neg)

Metrics for Evaluating Classifier Performance

Positive and Negative Tuples:


Let positive tuples be the set of tuples of the class under consideration by a rule; negative tuples are all
the remaining tuples other than the positive tuples.
There are four basic terms that are used in computing many evaluation measures:
True positives (TP): the positive tuples that were correctly labelled by the classifier.
True negatives (TN): the negative tuples that were correctly labelled by the classifier.
False positives (FP): the negative tuples that were incorrectly labelled as positive.
False negatives (FN): the positive tuples that were incorrectly labelled as negative.

Based on the above four measures (TP, TN, FP, FN) the classifier's performance can be evaluated using
the following equations
accuracy (recognition rate) = (TP + TN) / (P + N)

error rate (misclassification rate) = (FP + FN) / (P + N)

sensitivity (true positive rate, recall) = TP / P

specificity (true negative rate) = TN / N

precision = TP / (TP + FP)
Confusion matrix: The confusion matrix is a useful tool for analysing how well the classifier can
recognize tuples of different classes. TP and TN tell us when the classifier is getting things right, while FP and
FN tell us when the classifier is getting things wrong.

                       Predicted class
                       yes     no      Total
Actual class    yes    TP      FN      P
                no     FP      TN      N
                Total  P'      N'      P + N
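A small Python helper computing the measures above from the four counts (the example counts are made up):

def classifier_metrics(tp, tn, fp, fn):
    # Evaluation measures defined above, from TP, TN, FP, FN.
    p, n = tp + fn, tn + fp          # actual positives and negatives
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),
        "sensitivity": tp / p,       # recall / true positive rate
        "specificity": tn / n,       # true negative rate
        "precision":   tp / (tp + fp),
    }

print(classifier_metrics(tp=90, tn=80, fp=20, fn=10))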

Classifiers can also be compared with respect to the following additional aspects:
Speed: the computational costs involved in generating and using the given classifier.
Robustness: the ability of the classifier to make correct predictions given noisy data or data with
missing values.
Scalability: the ability to construct the classifier efficiently given large amounts of data which
keep on increasing.
Interpretability: the level of understanding and insight that is provided by the classifier or
predictor.

Bayesian Classification
Bayesian classifiers are statistical classifiers that predict class membership probabilities, i.e. the probability
that a given tuple belongs to a particular class. A simple Bayesian classifier, known as the naive
Bayesian classifier, is comparable in performance with decision tree and selected neural network
classifiers. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of
the values of the other attributes. This assumption is called class-conditional independence.

Bayes' Theorem
Bayesian classification is based on Bayes' theorem. Let X be a data tuple, considered as "evidence" in
Bayesian terms. X is described by measurements made on a set of n attributes. Let H be some
hypothesis, such as that the data tuple X belongs to a specified class C. In order to perform
classification, the conditional probability P(H|X) has to be determined. P(H|X) is the probability that tuple
X belongs to class C, given that the attribute description of X is known. It is given as

P(H|X) = P(X|H) P(H) / P(X)

where
P(H|X) is known as the posterior probability of H conditioned on X,
P(X|H) is the posterior probability of X conditioned on H,
P(H) and P(X) are the prior probabilities.

Naive Bayesian Classification


The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-
dimensional attribute vector X = (x1, x2, ..., xn), depicting n measurements made on the tuple from the n
attributes A1, A2, ..., An respectively.
2. Suppose that there are m classes C1, C2, ..., Cm. Given a tuple X, the classifier will predict that
X belongs to the class having the highest posterior probability conditioned on X,
i.e. the naive Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X)  for 1 ≤ j ≤ m, j ≠ i.

Thus P(Ci|X) has to be maximized. The class Ci for which P(Ci|X) is maximized is called the maximum
posteriori hypothesis.

Using Bayes' theorem, P(Ci|X) can be calculated as

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.

If the class prior probabilities are not known, assume the same probability for every class, that is,
P(C1) = P(C2) = ... = P(Cm), and only maximize P(X|Ci).
Otherwise,
maximize P(X|Ci) P(Ci).
The class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training
tuples of class Ci in D.

4. If the dataset has many attributes, it would be computationally expensive to compute P(X|Ci) directly.
Under the naive assumption of class-conditional independence,

P(X|Ci) = Π_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

The probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) are estimated from the training tuples, where xk refers to the value of
attribute Ak for tuple X.

The attributes can be either categorical or continuous.
A. If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for
Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
B. If Ak is continuous-valued, the attribute is assumed to have a Gaussian distribution with mean
μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (σ √(2π))) e^( -(x-μ)² / (2σ²) )

so that P(xk|Ci) = g(xk, μ_Ci, σ_Ci).

5. To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The classifier predicts that
the class label of tuple X is the class Ci if and only if P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i,

i.e. the predicted class label is the class Ci for which P(X|Ci) P(Ci) is the maximum.

Bavesian Belief Networks


Bayesian belief networks are classifiers that make use of probabilistic graphical models. The naive Bayesian
classifier relies on class-conditional independence, which makes computation easy, but the attributes are not
always independent of each other in real-world classification problems; there exist relationships
between different attributes of a dataset. Bayesian belief networks specify joint conditional probability
distributions, which allows class-conditional dependencies between attributes to be represented. A Bayesian belief
network provides a graphical model of the relationships, which can be trained and used for classification.
Bayesian belief networks are also known as belief networks, Bayesian networks, and probabilistic
networks.

A belief network is defined by two components:

• a directed acyclic graph
• a set of conditional probability tables (CPTs)

Directed Acyclic Graph
Each node in the directed acyclic graph represents a random variable. The variables may be discrete- or
continuous-valued. The variables may be actual attributes given in the dataset or "hidden
variables". A hidden variable can be an undefined entity that is not specified in the dataset but has an
influence on decision making, or it can be a collective effect of actual variables which results in a new
attribute.

If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a
descendant of Y. Each variable is conditionally independent of its non-descendants in the graph, given its
parents.

Example: A directed acyclic graph showing six attributes of a person resulting in lung cancer is
generated from the dataset. For example, having lung cancer is influenced by a person's family history of
lung cancer, as well as whether or not the person is a smoker. Note that the variable PositiveXRay is
independent of whether the patient has a family history of lung cancer or is a smoker, given that we know
the patient has lung cancer. In other words, once we know the outcome of the variable LungCancer, the
variables FamilyHistory and Smoker do not provide any additional information regarding
PositiveXRay. The arcs also show that the variable LungCancer is conditionally independent of
Emphysema, given its parents, FamilyHistory and Smoker.

Conditional probability table (CPT)


A belief network has one conditional probability table (CPT) for each variable. The CPT for a variable Y
specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y.

        FH, S   FH, ~S   ~FH, S   ~FH, ~S
LC      0.8     0.5      0.7      0.1
~LC     0.2     0.5      0.3      0.9

The above table represents the CPT for the variable LungCancer (LC), whose parents are FamilyHistory (FH)
and Smoker (S). The conditional probability for each known value of LungCancer is given for each possible
combination of the values of its parents. From the CPT it is clear that the upper leftmost and lower
rightmost entries represent

P(LungCancer = yes | FamilyHistory = yes, Smoker = yes) = 0.8
P(LungCancer = no | FamilyHistory = no, Smoker = no) = 0.9
Let X = (x1, ..., xn) be a data tuple described by the variables or attributes Y1, ..., Yn
respectively. Each variable is conditionally independent of its non-descendants in the network graph, given
its parents. The joint probability distribution is represented as

P(x1, ..., xn) = Π_{i=1}^{n} P(xi | Parents(Yi))

where P(x1, ..., xn) is the probability of a particular combination of values of X, and the values for
P(xi | Parents(Yi)) correspond to the entries in the CPT for Yi.
A node within the network can be selected as an "output" node, representing a class label attribute. There
may be more than one output node. The classification process can return a probability distribution that
gives the probability of each class.
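A minimal Python sketch of this product for the LungCancer fragment; the priors for FamilyHistory and Smoker below are illustrative assumptions, since only the LungCancer CPT is given above:

# Joint probability P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S).
# P(LC | FH, S) comes from the CPT above; the priors P(FH) and P(S)
# are illustrative assumptions.
p_fh = {True: 0.1, False: 0.9}
p_s = {True: 0.3, False: 0.7}
p_lc_given = {(True, True): 0.8, (True, False): 0.5,
              (False, True): 0.7, (False, False): 0.1}

def joint(fh, s, lc):
    p_lc = p_lc_given[(fh, s)]
    return p_fh[fh] * p_s[s] * (p_lc if lc else 1 - p_lc)

print(joint(fh=True, s=True, lc=True))  # 0.1 * 0.3 * 0.8 = 0.024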

Classification by Backpropagation
Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output
units in which each connection has a weight associated with it. During the learning phase, the neural
network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.

Disadvantages of Neural Networks

► A long training time is required to make the neural network efficient for performing
classification or prediction.
► Compared to decision trees, neural networks are less interpretable to humans.

Advantages of Neural Networks
► Highly tolerant to noisy data.
► Ability to classify patterns on which they have not been trained.
► Well suited for continuous-valued output, unlike decision trees.
► Decision trees are sequential in nature, whereas the parallelization of neural networks speeds
up computation.

Multilayer Feed-Fonvard Neural Nehvork


A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an
output layer. It iteratively learns a set of weights for prediction of the class label of tuples.

[Figure: a multilayer feed-forward neural network with an input layer, a hidden layer and an output layer.]

The attributes measured for each tuple are fed into the input layer simultaneously. These inputs, along with
the weights associated with each link, form the input to the next layer, called the hidden layer. The output of
the first hidden layer may be input to another hidden layer or to the output layer. The number of hidden layers
is determined by the complexity of the problem to be solved; usually a single hidden layer is used. The
output produced by the output layer corresponds to the network's prediction for the given tuples.

Input units: units in the input layer that take attributes from outside into the network.
Neurodes: the hidden units present between the input units and output units.
Two-layer network: has three layers (input layer, hidden layer and output layer); the input layer is not
counted, as it does not perform any computation.
Feed-forward network: a network which has connections only in the forward direction; there are no
loops or backward connections.
Fully connected network: each unit provides input to each unit in the next forward layer.

Back Propagation Algorithm


Backpropagation learns by iteratively processing a data set of training tuples, comparing the network's
prediction for each tuple with the actual known target value. The target value may be the known class label
of the training tuple (for classification problems) or a continuous value (for numeric prediction). For each
training tuple, the weights are modified so as to minimize the mean-squared error between the network's
prediction and the actual target value. These modifications are made in the "backwards" direction (i.e.,
from the output layer) through each hidden layer down to the first hidden layer (hence the name
backpropagation). The weights eventually converge, and the learning process stops.

Inputs
D: a data set consisting of the training tuples and their associated target values;
l: the learning rate;
Network: a multilayer feed-forward network.

Algorithm
(1) Initialize all weights and biases in the network;
(2) while terminating condition is not satisfied {
(3)   for each training tuple X in D {
(4)     // Propagate the inputs forward:
(5)     for each input layer unit j {
(6)       Oj = Ij; } // the output of an input unit is its actual input value
(7)     for each hidden or output layer unit j {
(8)       Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous layer, i
(9)       Oj = 1 / (1 + e^(-Ij)); } // compute the output of each unit j
(10)    // Backpropagate the errors:
(11)    for each unit j in the output layer
(12)      Errj = Oj (1 - Oj)(Tj - Oj); // compute the error
(13)    for each unit j in the hidden layers, from the last to the first hidden layer
(14)      Errj = Oj (1 - Oj) Σk Errk wjk; // compute the error with respect to the next higher layer, k
(15)    for each weight wij in the network {
(16)      Δwij = (l) Errj Oi; // weight increment
(17)      wij = wij + Δwij; } // weight update
(18)    for each bias θj in the network {
(19)      Δθj = (l) Errj; // bias increment
(20)      θj = θj + Δθj; } // bias update
(21) }}

The steps can be described as follows:


Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging
from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it. The biases are similarly initialized to small
random numbers.
Each training tuple, X, is processed by the following steps.
Propagate the inputs forward: The training tuple is fed to the network's input layer. The inputs pass
through the input units, unchanged. For an input unit j, its output Oj is equal to its input value Ij. The
net input and output of each unit in the hidden and output layers are then computed. The net input to a unit in
the hidden or output layers is computed as a linear combination of its inputs. Given a unit j in a hidden or
output layer, the net input Ij to unit j is

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit
i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold, in that it serves to
vary the activity of the unit. Each unit in the hidden and output layers takes its net input and then applies
an activation function. The logistic, or sigmoid, function is used as the activation function. Given the net input
Ij to unit j, the output Oj of unit j is computed as

Oj = 1 / (1 + e^(-Ij))

This function is also referred to as a squashing function, because it maps a large input domain onto the
smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the backpropagation
algorithm to model classification problems that are not linearly separable. The output values Oj are computed
for each hidden layer, up to and including the output layer, which gives the network's prediction.

Backpropagate the error: The error is propagated backward by updating the weights and
biases to reflect the error of the network's prediction. For a unit j in the output layer, the error
Errj is computed by

Errj = Oj (1 - Oj)(Tj - Oj)

where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple. Oj (1 -
Oj) is the derivative of the logistic function. To compute the error of a hidden layer unit j, the weighted
sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer
unit j is

Errj = Oj (1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the
error of unit k. The weights and biases are updated to reflect the propagated errors.
Weights are updated as

Δwij = (l) Errj Oi
wij = wij + Δwij

The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0. Backpropagation
learns using a gradient descent method to search for a set of weights that fits the training data so as to
minimize the mean-squared distance between the network's class prediction and the known target value of
the tuples. The learning rate helps avoid getting stuck at a local minimum in decision space (i.e., where the
weights appear to converge but are not the optimum solution) and encourages finding the global minimum.
If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large,
then oscillation between inadequate solutions may occur. Biases are updated as

Δθj = (l) Errj
θj = θj + Δθj

Updating the weights and biases after the presentation of each tuple is referred to as case updating. Alternatively, the
weight and bias increments could be accumulated in variables and updated after all the tuples in the training
set have been presented. This strategy is called epoch updating, where one iteration through the training set
is an epoch.
Terminating condition: Training stops when
► all Δwij in the previous epoch are so small as to be below some specified threshold, or
► the percentage of tuples misclassified in the previous epoch is below some threshold, or
► a prespecified number of epochs has expired.

The computational efficiency depends on the time spent training the network. Given |D| tuples and w
weights, each epoch requires O(|D| × w) time.
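A minimal numpy sketch of the algorithm above, using case updating and sigmoid activations on a small 2-3-1 network; the XOR-style data, architecture and learning rate are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # logistic squashing function, as defined above
    return 1.0 / (1.0 + np.exp(-x))

# XOR-style training data (illustrative)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)

# Initialize weights and biases to small random numbers in [-0.5, 0.5]
W1 = rng.uniform(-0.5, 0.5, (2, 3)); b1 = rng.uniform(-0.5, 0.5, 3)
W2 = rng.uniform(-0.5, 0.5, (3, 1)); b2 = rng.uniform(-0.5, 0.5, 1)
l = 0.5  # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        # Propagate the inputs forward
        o1 = sigmoid(x @ W1 + b1)           # hidden layer outputs
        o2 = sigmoid(o1 @ W2 + b2)          # output layer output
        # Backpropagate the errors
        err2 = o2 * (1 - o2) * (t - o2)     # output-layer error
        err1 = o1 * (1 - o1) * (W2 @ err2)  # hidden-layer error
        # Case updating of weights and biases
        W2 += l * np.outer(o1, err2); b2 += l * err2
        W1 += l * np.outer(x, err1);  b1 += l * err1

# Typically approaches [0, 1, 1, 0] after training
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel().round(2))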

Hidden Markov Model


A hidden Markov model (HMM) is a statistical model based on the Markov process with unobserved
(hidden) states. A Markov process is referred to as a memoryless process which satisfies the Markov property.
The property states that one can make predictions for the future of the process based solely on its present
state. The Markov property states that the conditional probability distribution of future states of a process
depends only upon the present state, not on the sequence of events that preceded it.
The hidden Markov model (HMM) is a variant of a finite state machine having a set of hidden states (W),
an output alphabet (observations) or visible states (V), transition probabilities (A), output (emission)
probabilities (B), and initial state probabilities (Π). The current state is not observable; instead, each state
produces an output with a certain probability (B). Usually the states W and outputs V are understood, so
an HMM is said to be a triple (A, B, Π).

Examples of Markov Processes

► Measurement of weather patterns.
► Daily stock market prices.

Explanation of the Hidden Markov Model

► An HMM consists of two types of states, the hidden states (w) and the visible states (v).
► Transition probability: the probability of transitioning from one state to another in a single step.
► Emission probability: the conditional distribution of the observed variables, P(vk | wj), for a
specific state.
Consider a hidden Markov model with three visible states and three hidden states:
W = {w1, w2, w3}
V = {v1, v2, v3}

In each hidden state the system emits a visible output from V = {v1, v2, v3}. The probability of a hidden state
emitting any visible output is the emission probability, represented as bjk, i.e.
State w1 emits the outputs v1, v2, v3 with probabilities b11, b12, b13 respectively.
State w2 emits the outputs with probabilities b21, b22, b23 respectively.
State w3 also emits the outputs v1, v2, v3 with probabilities b31, b32, b33 respectively.

From any hidden state there is always a transition to any other state. The probability of such a transition
between states is represented by the transition probability aij.
The transition probability from state wi to wj is represented as aij, and from wj to wi as aji;
similarly between the other pairs of states.

It is also possible to have a transition from a state to itself, which is represented as aii. The
sum of the probabilities of transitions from one state to the others is one, i.e.

Σ_j a_ij = 1   for all i

Similarly, the sum of the probabilities of a state emitting any of the visible outputs is also one, i.e.

Σ_k b_jk = 1   for all j

After several transitions between hidden states, the system reaches a final state, represented by w0. Once
the system reaches the final state, there are transitions only to itself; this self-transition a00 has
probability 1. The final state emits only one type of visible symbol, v0, and the probability of emitting the visible
symbol v0 is b00 = 1.

The above state diagram can be drawn with the final state and the output as shown below.

The hidden Markov model has to address three central issues.


• Evaluation problem: for a given model θ, with
W: hidden states; V: visible symbols; Aij: transition probabilities; Bjk: emission probabilities;
V^T: a sequence of visible symbols emitted,
what is the probability that the visible symbol sequence V^T will be emitted by the model θ,
i.e. P(V^T | θ) = ?

• Decoding problem: in the decoding problem, W^T, the sequence of hidden states that generated the visible
symbol sequence V^T, has to be calculated,
i.e. W^T = ?
• Training problem: for a known set of hidden states W and visible symbols V, the training problem
is to find the transition probabilities Aij and emission probabilities Bjk from the training set,
i.e. Aij = ?
Bjk = ?
Evaluation Problem
For a given HMM θ and a sequence of visible symbols V^T, the conditional probability of observing the visible
symbol sequence has to be estimated:
P(V^T | θ) = ?
P(V^T | θ) can be written as

P(V^T | θ) = Σ_{r=1}^{r_max} P(V^T | W_r^T) P(W_r^T)          ...(1)

where W_r^T is one of the possible sequences of hidden states, P(W_r^T) is the probability of that sequence
being taken, and r_max is the number of possible sequences of W:

W_r^T = {w(1), w(2), ..., w(T)}

If N is the number of hidden states, then r_max = N^T.

In equation (1), the term P(W_r^T) can be written as a product of transition terms, as the system is a Markov
process. In order to compute the probability of a state sequence over the entire time sequence, the individual
probabilities have to be multiplied:

P(W_r^T) = Π_{t=1}^{T} P(w(t) | w(t-1))          ...(2)

i.e. the product of the transition probabilities at consecutive time steps. The
P(V^T | W_r^T) term of equation (1) can be rewritten as

P(V^T | W_r^T) = Π_{t=1}^{T} P(v(t) | w(t))          ...(3)

i.e. the product of the emission probabilities along the sequence.

Substituting (2) and (3) back into equation (1) gives the final equation for P(V^T | θ):

P(V^T | θ) = Σ_{r=1}^{r_max} Π_{t=1}^{T} P(v(t) | w(t)) P(w(t) | w(t-1))          ...(4)

Equation (4) has high computational complexity, of the order O(N^T · T).

Consider the following trellis diagram, which shows the states at different instants of time.
At time t = 0, the state of the system is w1.
If the visible symbol emitted at time t = 1 is v(1) = v0, then there is a transition from w1 to w0.
From the diagram, the probability that the machine is in state wj at time t can be represented as αj(t).
So αj(t-1) is the probability that the machine is in state wj at time t-1, and similarly for the other states.
The probability of a transition from one state to another can be rewritten as the product of the probability of being in
the first state and the probability of the transition to the other state,
i.e. the probability of a transition from w1 to wj can be represented as α1(t-1) a1j.
Example:
α1(t-1) a12 = probability of a transition from w1(t-1) to w2(t) at time instant t.

α2(t-1) a22 = probability of a transition from w2(t-1) to w2(t) at time instant t.

Therefore, the probability of a visible state is the product of the sum of all the transition terms and the emission
probability.

When the sum is calculated over the products of the probabilities of transitions from the states at t-1 into a state at t,
it gives the probability that there will be a transition from one of the states into that state at time instant t.

The probability that the machine makes a transition from a state at time t-1 to another state at time t is given by
Σ_i αi(t-1) aij.

The probability that the machine is in state wj at time t, after emitting the first t visible symbols of the
sequence V^T, is given by this sum of products multiplied by the emission probability. In
simplified terms it can be written as

αj(t) = 0                                   if t = 0 and j ≠ initial state
      = 1                                   if t = 0 and j = initial state
      = [ Σ_i αi(t-1) aij ] bjk v(t)        otherwise

Using this, an algorithm can be generated to find the probability that the HMM θ generated a sequence V^T.

Forward Algorithm
Initialize: t = 0, aij, bjk, V^T, αj(0)
For t = t + 1:
    αj(t) = bjk v(t) · Σ_{i=1}^{N} αi(t-1) aij
until t = T
Return P(V^T | θ) = α0(T) for the final state
end
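A minimal Python (numpy) sketch of this forward recursion, assuming 0-based state and symbol indices:

import numpy as np

def forward(A, B, obs, initial_state, final_state):
    # alpha_j(t) = b_{j, v(t)} * sum_i alpha_i(t-1) * a_{ij}
    # A[i, j]: transition probability from state i to state j
    # B[j, k]: probability that state j emits symbol k
    # obs: sequence of observed symbol indices
    alpha = np.zeros(A.shape[0])
    alpha[initial_state] = 1.0            # t = 0
    for v in obs:                         # one recursion step per symbol
        alpha = B[:, v] * (alpha @ A)
    return alpha[final_state]             # P(V^T | theta)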

Decoding Problem
The decoding problem deals with finding the most probable sequence of hidden states through which the machine
passed to reach the final state.
Algorithm for the Decoding Problem
Let V^T be the visible symbol sequence.
Aim: find the most probable sequence of hidden states.

Initialize: Path = {}; t = 0; j = 0;

For t = t + 1:
    For j = j + 1:
        αj(t) = bjk v(t) · Σ_{i=1}^{N} αi(t-1) aij
    until j = N
    j' = arg max_j [ αj(t) ]
    append j' to Path
Repeat until t = T
Return Path.

Training Problem
The hidden Markov model can be trained using a supervised learning method, which uses a number of known
visible symbol sequences for training. The main motive of training an HMM is to estimate the transition probabilities
and the emission probabilities, aij and bjk, which are the unknown terms.

Example
Consider a visible symbol sequence (v1, v3, v1, v5, v1, v2, v0).
Then α3(4) is the probability that the machine is in state w3 after generating the first four visible symbols of
the sequence, v1, v3, v1, v5.
β3(4) is the probability that the machine, being in state w3 at t = 4, will generate the remaining
visible symbols of the sequence, v1, v2, v0.

In the above example a new term βi(t) is introduced; it represents the probability that the model is in
wi(t) and will generate the remainder of the given target sequence V^T, i.e. all the symbols from v(t+1)
to v(T).

In simplified terms it can be written as

βi(t) = 0                              if t = T and wi(t) ≠ w0
      = 1                              if t = T and wi(t) = w0
      = Σ_j βj(t+1) aij bjk v(t+1)     otherwise

Using this, an algorithm can be generated to find the probability that the HMM θ will generate the remaining
symbols of V^T by continuing from the present state. This algorithm is known as the backward algorithm.

Backward Algorithm
Initialize: t = T, aij, bjk, V^T, βi(T)
For t = t - 1:

    βi(t) = Σ_j βj(t+1) aij bjk v(t+1)

until t = 0
Return P(V^T | θ) = βi(0) for the known initial state
end
Both αj(t) and βi(t), the forward probability and the backward probability, are used to calculate aij and
bjk.

Using the above terms, the probability of a transition from wi(t-1) to wj(t), given that the model emitted the
visible sequence V^T, can be written as

γij(t) = αi(t-1) aij bjk v(t) βj(t) / P(V^T | θ)

Still, the above equation for γij(t) contains the terms aij and bjk, which are unknown. So, in order to solve
the equation, assume random values for aij and bjk initially.
Using the random values of aij and bjk, γij(t) can be estimated, and the values of aij and bjk can then be improved iteratively.

The expected number of transitions from wi(t-1) to wj(t) at any time in the sequence V^T is

Σ_{t=1}^{T} γij(t)

The total expected number of transitions from wi to any other state is

Σ_{t=1}^{T} Σ_k γik(t)

The estimated value of the transition probability aij can then be given as

âij = Σ_{t=1}^{T} γij(t) / ( Σ_{t=1}^{T} Σ_k γik(t) )

Similarly, the estimated value of the emission probability bjk can be given as

b̂jk = Σ_{t=1, v(t)=vk}^{T} γj(t) / ( Σ_{t=1}^{T} γj(t) )

where the numerator counts the transitions that emitted the symbol vk and the denominator counts the transitions over
all symbols emitted.

Using the above procedure, the value of P(V^T | θi) can be estimated for a particular model θi, and similarly for
other models θj. If the conditional probability for one model θi is greater than that for another model θj, then the
sequence is assigned to the class corresponding to θi:
P(V^T | θi) > P(V^T | θj)  ⟹  V^T ∈ θi
Example
Consider a model with four hidden states and four visible symbols, plus the final state and the final symbol:
W = {w0, w1, w2, w3}, where w0 is the final state
V = {v1, v2, v3, v4}, with final symbol v0
Let the transition probabilities and emission probabilities be as follows (rows and columns of A are ordered
w0, w1, w2, w3; columns of B are ordered v0, v1, v2, v3, v4):

            w0   w1   w2   w3                v0   v1   v2   v3   v4
A =   w0 [ 1.0  0.0  0.0  0.0 ]   B =  w0 [ 1.0  0.0  0.0  0.0  0.0 ]
      w1 [ 0.2  0.3  0.1  0.4 ]        w1 [ 0.0  0.3  0.4  0.1  0.2 ]
      w2 [ 0.2  0.5  0.2  0.1 ]        w2 [ 0.0  0.1  0.1  0.7  0.1 ]
      w3 [ 0.7  0.1  0.1  0.1 ]        w3 [ 0.0  0.5  0.2  0.1  0.2 ]

where
Σ_j a_ij = 1 and Σ_k b_jk = 1

Let V^T = (v1, v3, v2, v0) and let the machine be in state w1 at t = 0. Find P(V^T | θ).

Solution:
From the above data, the trellis diagram can be generated.

[Trellis diagram: states w0..w3 against time t = 0..4, with the α values α(1), ..., α(4) computed at each step.]

Given that the initial state is w1, the probability of w1 is 1 at t = 0, and the remaining states have probability
zero. From the initial state, transitions are possible to w1, w2 and w3 at t = 1. Considering the emission
probability of the first symbol in the sequence, v1, at state w1, and using the equation for αj(t), the probability is
obtained as 0.09, i.e. 0.3 × 0.3; the values for the other states w2 and w3 are obtained similarly.
At time instant t = 1, the state with the highest probability fires for the subsequent states, i.e. state
w3, which has probability 0.2, the highest. The same process is continued until all visible symbols are
produced as per the given sequence. At t = 4 the system reaches the final state w0 and emits v0, giving a
probability of 0.00138, which is P(V^T | θ) for the given HMM.
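As a cross-check, the forward recursion can be run on the tables above. Note that summing over all paths (the exact forward algorithm) evaluates to roughly 0.0018 here, slightly larger than the value read off the trellis, which follows only the most probable state at each step:

import numpy as np

# Transition and emission tables from the example (rows w0..w3; w0 is the
# absorbing final state that emits only v0). The chain starts in w1.
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.2, 0.3, 0.1, 0.4],
              [0.2, 0.5, 0.2, 0.1],
              [0.7, 0.1, 0.1, 0.1]])
B = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.1, 0.2],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.5, 0.2, 0.1, 0.2]])
obs = [1, 3, 2, 0]                 # V^T = (v1, v3, v2, v0)

alpha = np.zeros(4)
alpha[1] = 1.0                     # start in w1 at t = 0
for v in obs:
    alpha = B[:, v] * (alpha @ A)
print(alpha[0])                    # ≈ 0.0018: sum over all paths into w0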
Why is naive Bayes called naive?

(Naive Bayes is a classifier that classifies things. "Naive" means a person or action lacking
experience; that is why this classifier is called naive, and that is what this answer brings out.)
• Naive Bayes classifiers are a collection of classification algorithms based on
Bayes' theorem. It is not a single algorithm but a family of algorithms,
where all of them share a common principle, i.e. every pair of features being
classified is independent of each other.
• The naive Bayes classifier assumes that all the features of a class are
independent of each other.
• This means that if any one of the features is missing, then classification may not
be perfect.
• For example, a fruit may be considered to be an apple if it is red, round and
about 4" in diameter, or a ball if we don't give the red colour classification.
• So even if the features are inter-dependent on each other, the naive Bayes
classifier will consider them independent.
It's called naive also because it makes the assumption that all attributes are
independent of each other. This assumption is why it's called naive, as in lots of real-
world situations this does not fit. Despite this, the classifier works extremely well in
lots of real-world situations and has performance comparable to neural networks
and SVMs in certain cases (though not all).
Naive Bayes worked example (buys_computer)

Using the buys_computer dataset above (in the handwritten notes the age attribute is recorded as the ranges <30, 31-40 and >40, corresponding to youth, middle_aged and senior), classify the tuple

X = (age = youth, income = medium, student = yes, credit_rating = fair).

Step 1: class priors:
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357

Step 2: conditional probabilities of each attribute value given the class:
P(age = youth | yes) = 2/9 = 0.222          P(age = youth | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444      P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667        P(student = yes | no) = 1/5 = 0.200
P(credit_rating = fair | yes) = 6/9 = 0.667 P(credit_rating = fair | no) = 2/5 = 0.400

Step 3: likelihoods:
P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

Step 4: compare P(X | Ci) P(Ci):
P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
P(X | no) P(no) = 0.019 × 0.357 = 0.007

Since P(X | yes) P(yes) is greater than P(X | no) P(no), X belongs to the class buys_computer = yes.
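The same computation as a small Python sketch over the buys_computer table (pure frequency counting, no smoothing):

from collections import Counter

# (age, income, student, credit_rating, class)
data = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

X = ("youth", "medium", "yes", "fair")    # tuple to classify

priors = Counter(row[-1] for row in data)
scores = {}
for c, count in priors.items():
    rows_c = [row for row in data if row[-1] == c]
    p = count / len(data)                 # P(Ci)
    for k, value in enumerate(X):         # P(xk | Ci), counted from the table
        p *= sum(1 for row in rows_c if row[k] == value) / len(rows_c)
    scores[c] = p

print(scores)                       # {'no': ≈0.0069, 'yes': ≈0.0282}
print(max(scores, key=scores.get))  # 'yes'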

Na iVe __ _B~e,._'3_ _
J

- ---· ----- -

- ··-- -·-·-- - - - - - - - -- - - - - - - - -·- - - -------- -

- - - - - ----- - -· --- - -

p~;OI) io -~~-N~~ ~--- Gitod~--~_Hl~ h ~_-__:-~_C- La;;-


- ------- - ---- - - - - -- - - - -~ - ----- __ ,...._ ....

__ _ _____·A ____f~al_e,_ f..6'__m ______5 boY t.__


--- _Q_ ·-- - rtl.C<IL 8-
___ ___ _ 2~Q_rr, ___ _____k_
gJ_(_ ___ _
3_ ______ G __ _ _f €mE_f_t_ 1·9m ffild.iu_rn__ _
Li ____Q FemqLe \''65,n ____ O)_e,d,om__ _
__$____ __ _E OO_q_Le 2~~ ro _ _ _____,{:q /1 _
_ 6_ _______E_ m_au 1·"tm sh<'.tb~ -
1- ____(R ___r(}_q_u, t►i m med,_u_m_
'1 H ~_fem_qj_e I •( rn % _o'(_t_ _
____9_ _ _I · Fernq/,e rCsrn Sho_Y_t___

- -··- -

___Q_gbse.~ pLlbe io ~Lass_ a tff(b_jj_k ~g_,,__---'------


. , ·.pxob_mb'\~
___tlS_,c.t,_ _u -J J_ML Lo ·- - - - - - -- _____ _
.~ ~
- - --
,/" eta% r:n£- 3 -11.t1.JlJ.L_=-----'-'hcc_
0
. L. - ______ _____

__ · - 6_boy_t_ _ ________ ____


- --- ~ -e.d_w.ro . - - - - --- -- --
-._- R·--· , ,. - ----=- .________. :____
---· ---- -=-t.al\ __< _______- :- --:-----
-... _JX,UJJY___}_1_ __ :(Q btiJ~l ~ _ r.t1Jiq Lf)f_ _ ___-- . . ----···-·

. '• .
---- - -- - - - - ---- . -- ----·· - - ·- ---~·-----
. .

..
- .
. . .

Scanned by CamScanner
- ~- - --- - ·- -·- -- -- ----·- - ---· -

h,rrnu 1Ct o-f -p~hq-bJH~ _ ($_ _______ -------~---=--- ~


-- -
.

p(&ho, tJ
-- ----- ----
-~-
- ---
no . a h"™- shoY t____C4JnA-_,,_____ r -
--
. ____ ____ kol:-4/_ oo_, ol__i_l:w)_ __ __
-- - -- - - --- - - -- - - - - ------

-- ·-· - ·~9Y-·- ----. ---·:Tfs- n5r


-:= r- - - - ---~
u-f v ncJJjs>ark}__ ~
- -- .--- . - - - --5----
- _ CS ab _hJ oLA ____ oiJ(b t, kL.jaJ.on r:nL _Yla_bLr?tJJ
YJY)
_ _ _ S'i l'flf Iofl IJ-_r; y_ J:l}l df om Md_ frl 11 fill_ tU_ilL Y1-t1_d~
- - - -· -- - ---- -

_ _p(mLc!S\JmJ__ ~ ~
-- . --- -- - - _ g___ ___:____ _ _ _ _ __
- - - -· ---- ·- -- - - - - -- -- -- - -- - - -- -

·--- --- --

r cta_f_Q ____
---'---C - _2--_ _ _ _ _~ -- -
-- -- ,,t,

___ · __Ab . ~vm1w_ he1fbt ko •an~~~~ · ___ _


_ d12ll_d_e_J_ia}j_tl_a. _Q_p-f2- a{J OL~ S {1' q b- -~ ~
- 1 -·-·_,sKc;_ JJ_vi ~ - ~an sq111G'----
-h_,_o_,:.___ _ _ _
----·- --,--- - -
- - - - - - -- - -- - - - - - - - - --

- -- --- - - -- - - - - - - - - - - -

-- - -------- -- . ·~

. ., .
---...-- I- - - ' - - - - -- - - - - -- - - - ---'-____...:.,....- - ' - - - - : - -~ --

. .• •' • . : .J

Scanned by CamScanner
---.- ~ r====== = = = = = = = = = = ==

[r-i, ~ ,,11·
[ \, qI - 2 .. oJ _
[2 ·0 -aj

Gien ci(h ~ ma Ll
F'emal e, - -- - -- - - --- - - -· - - - -- ·
- ------- - - - -- - -· - - - - -- -·

···---·
.
. .--. . . --- - -- ---1--------- - i· ... ' ..
#' • •• " .,

• L - -- - -• • - _ . . .. •• ~ _, .,J
_ ..

Scanned by CamScanner
·r

· Pago Iii;,:.. ·: ' ·_

A.h it!h U., ci pn q io. iJ ao.J1


/4 G:C!rf)__g roe m os1- o (' h'rvu,, _jfh jo llurnrv~
1-cib'Le_ b!Jf)a!f-0- h~,' voh df:r~ 'Ythhl hai ~I
k: .

COu.0 ht! ri puch/;o, h q; hvrf¼ lK l'1l.ilJ

blpk d/J°' ·rrar)}.J)JC{ a.vi pvc.h~ c.lo.JSS-ic


3"1Wn /,du~lt iS 51wYb; mtJJum//~I/
v-~
J )J

/f(ls Jsno,t) :, p(,vi/ sh~'f D~fC(!!· O-Oo) /shovhJ


-i-l X 0
y ·
. ;Q ,•
.. . . . .
.
. 4,,
.. .
. . .··

Scanned by CamScanner
Puyo No; i. •,

s !:.{p 2 J- Hvf(),(_, 1l\Lffl h(')O J ni kal nc:t


~ah\ J0 M.V1t ruw val~ hen VO Y)
shoY ~ let, 1-(\fna /1'(l}U..l,b I rneJ lorn \'Lt Kan t1i
l<C\JJ.ll b hq t' ; fa I) /(-(. KIl::n a k Q)l iU. 6 V'<A. i

::. . PO:J5hoYt) X f'{ShoYl;J


/- 0 X ~
. ' . 9·. ,.
~o
Lil-u.lf he, o J (;o' ~JtoM
~ fC 1-;/meJfurrj)x P(m,edil)I\/U
~ ox 2
1

.· • ' . .
. . .

Scanned by CamScanner
i---·---- .
Page No . .

~ l x~ -
2- q

, 2_ ~ J_ = o·_...,.l I .
1g q <--

q J) n1~a Io e_ Sn mA fi v a Iut

e5'h·rna Kr ::: AJclih'OI') o-C q/1 li'4\tho od

I
I
.,
' .

. ( . . .

l//- ~~ _,. :----~-~~- ----:------------- s:~n~~ ;;~~~;:::~r ..


Step 4 (posterior probabilities): now compute the actual probability of each class using Bayes' rule,

P(X | Y) = P(Y | X) P(X) / P(Y)

so that

P(short | t) = P(t | short) P(short) / P(t)
P(medium | t) = P(t | medium) P(medium) / P(t)
P(tall | t) = P(t | tall) P(tall) / P(t)

(in the notes these evaluate to values such as 0.11). Finally, compare the posterior values and assign the new tuple to the class whose value is the largest.
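The four steps can be condensed into a short program. The following is a minimal Python sketch, not from the notes: the table values and the test tuple are illustrative stand-ins (the handwritten data is only partially legible), and the height bins follow the intervals above.

from collections import Counter

# Illustrative training data (the table in the notes is only partially
# legible); each tuple is (gender, height_bin, class).
data = [
    ("M", "(0,1.9]", "short"), ("F", "(0,1.9]", "short"), ("F", "(0,1.9]", "short"),
    ("M", "(1.9,2.0]", "medium"), ("F", "(1.9,2.0]", "medium"), ("M", "(1.9,2.0]", "medium"),
    ("F", "(2.0,inf)", "tall"), ("M", "(2.0,inf)", "tall"), ("F", "(2.0,inf)", "tall"),
]

classes = Counter(c for _, _, c in data)          # class frequencies (step 1)
n = len(data)

def likelihood(gender, height_bin, cls):
    """P(t | cls) * P(cls), assuming the attributes are independent (steps 2-3)."""
    members = [(g, h) for g, h, c in data if c == cls]
    p_gender = sum(1 for g, _ in members if g == gender) / len(members)
    p_height = sum(1 for _, h in members if h == height_bin) / len(members)
    prior = classes[cls] / n
    return p_gender * p_height * prior

def classify(gender, height_bin):
    scores = {c: likelihood(gender, height_bin, c) for c in classes}
    evidence = sum(scores.values())               # estimate of P(t)
    posteriors = {c: s / evidence for c, s in scores.items()}   # step 4
    return max(posteriors, key=posteriors.get), posteriors

print(classify("F", "(1.9,2.0]"))                 # picks the largest posterior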
Module 6
Dimensionality Reduction

Dimensionality reduction or dimension reduction is the process of reducing the
number of random variables under consideration, by obtaining a set of "uncorrelated" principal
variables. Dimension reduction refers to the process of converting a set of data having vast
dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely.

Advantages of Dimensionality Reduction


• It helps in compressing the data and reducing the storage space required.
• It reduces the time required for performing computations.
• It allows the usage of algorithms that are unfit for a large number of dimensions.
• Removal of multi-collinearity improves the performance of the machine learning model.
• It becomes easier to visualize the data when reduced to very low dimensions such as 2D or
3D.
• It is helpful in noise removal.

Methods for Reducing Dimensionality


There are two main methods for reducing dimensionality:
• Feature selection: Feature selection deals with finding k of the d dimensions that give the
most information and discarding the other (d − k) dimensions. Feature selection approaches try
to find a subset of the original variables. There are three strategies: filter (e.g. information
gain), wrapper (e.g. search guided by accuracy), and embedded (features are
selected to be added or removed while building the model, based on the prediction errors).
• Feature extraction: Feature extraction deals with finding a new set of k dimensions that are
combinations of the original d dimensions. (It does not eliminate the remaining dimensions
the way feature selection does.) Feature extraction transforms the data in the high-dimensional space
to a space of fewer dimensions. The data transformation may be linear, as in principal
component analysis and linear discriminant analysis, or non-linear, as in isometric
feature mapping, the Laplacian eigenmap, etc.

Some other methods of dimensionality reduction are as follows; a small sketch of the first
three filters appears after this list.

• Missing Values: Given a training sample with a set of n features, suppose a particular feature
has many missing values. A feature or dimension with more missing
values contributes less to the process of classification. So, such a feature with missing values
can be completely eliminated.
• Low Variance: Consider that a particular feature has a constant value for all the training
samples. That means the variance of that feature across different samples is comparatively low.
This implies that a feature with low variance, or constant in nature, has less impact on the
process of classification, thereby resulting in the elimination of such a feature.
• High Correlation: If there are two features having high correlation with each other, then it
implies that both features contribute almost the same to the classification process. In such a
case, instead of using two features which are more or less the same, they can be represented by a
single feature.
• Principal Component Analysis (PCA): In this technique, variables are transformed into a
new set of variables, which are linear combinations of the original variables. These new
variables are known as principal components.
• Backward Feature Elimination: In this technique, at a given iteration, the selected
classification algorithm is trained on n input features. Then we remove one input feature at a
time and train the same model on n − 1 input features, n times. The input feature whose removal
has produced the smallest increase in the error rate is removed.
• Forward Feature Construction: This is the inverse of Backward Feature
Elimination. This method starts with a single feature only, progressively adding one feature
at a time, i.e. the feature that produces the highest increase in performance.
• Random Forests / Ensemble Trees: Decision tree ensembles, also referred to as random
forests, generate a large and carefully constructed set of trees against a target attribute and then
use each attribute's usage statistics to find the most informative subset of features.
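As a concrete illustration of the first three filters (missing values, low variance, high correlation), here is a minimal NumPy sketch; the matrix and thresholds are illustrative assumptions, not values from the text.

import numpy as np

# Illustrative design matrix: rows = samples, columns = features.
X = np.array([
    [1.0, 5.0, 1.1, np.nan],
    [2.0, 5.0, 2.2, np.nan],
    [3.0, 5.0, 2.9, 7.0],
    [4.0, 5.0, 4.1, np.nan],
])

# Missing-value filter: drop features with too many NaNs.
missing_ratio = np.isnan(X).mean(axis=0)
X = X[:, missing_ratio < 0.5]          # the last feature is dropped

# Low-variance filter: drop (near-)constant features.
X = X[:, np.nanvar(X, axis=0) > 1e-3]  # the constant feature (5.0) is dropped

# High-correlation filter: of two highly correlated features, keep only one.
corr = np.corrcoef(X, rowvar=False)
print(corr)                            # remaining features correlate ~ 0.99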

Subset Selection
Subset selection is the process of finding the best subset of the set of features. The best subset contains
the least number of dimensions that contribute most to the accuracy, discarding the remaining,
unimportant dimensions. Once the necessary dimensions are found, they can be used in both
regression and classification problems with a suitable error function. Suppose there are d dimensions
or variables; then there exist 2^d possible subsets. Unless d is small, it is infeasible to test all of
them. There are two approaches:
• Sequential Forward Approach
• Sequential Backward Approach

Sequential Forward Approach

Forward selection starts with no variables and then adds variables one by one, each time the one with the least possible error,
until any further addition does not decrease the error (or decreases it only slightly).
Let F be a feature set of input dimensions, x_i, i = 1, ..., d.
E(F) = the error incurred on the validation sample when only the inputs in F are used.
Depending on the application, the error is either the mean square error or the misclassification error.

In sequential forward selection,

Initially there are no features: F = ∅
At each step, for all possible x_i, train the model on the training set and calculate E(F ∪ x_i) on the validation
set.
Choose the input x_j that causes the least error

j = arg min_i E(F ∪ x_i)

and

add x_j to F if E(F ∪ x_j) < E(F)

Feature adding stops when no feature decreases E (the error). Adding may also stop earlier
if there is a user-defined threshold that depends on the application constraints.
The process of sequential forward selection may be costly because, to decrease the dimensions from
d to k, training and testing the system is O(d²). This is a local search procedure and does not guarantee
finding the optimal subset, namely the minimal subset causing the smallest error. For example, x_i
and x_j by themselves may not be good but together may decrease the error a lot; but because this
algorithm is greedy and adds attributes one by one, it may not be able to detect this. To avoid this
drawback, floating search methods are used, where the number of added and removed features
can also change at each step. A runnable sketch of the greedy forward loop follows.
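The sketch below assumes that E(F) is the validation mean square error of a least-squares fit; any other error function could be substituted. The data and target are illustrative.

import numpy as np

def val_error(X_cols, y):
    """E(F): MSE of a least-squares fit, measured on a held-out half."""
    n = len(y) // 2
    Xtr, Xva, ytr, yva = X_cols[:n], X_cols[n:], y[:n], y[n:]
    w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return float(np.mean((Xva @ w - yva) ** 2))

def forward_selection(X, y):
    d = X.shape[1]
    F, best_err = [], float("inf")
    while len(F) < d:
        errs = {j: val_error(X[:, F + [j]], y) for j in range(d) if j not in F}
        j = min(errs, key=errs.get)            # j = arg min_i E(F ∪ x_i)
        if errs[j] >= best_err:                # E no longer decreases: stop
            break
        F.append(j); best_err = errs[j]
    return F

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 2 * X[:, 1] - X[:, 3] + 0.1 * rng.normal(size=40)   # only features 1 and 3 matter
print(forward_selection(X, y))                           # typically selects [1, 3]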

Sequential Backward Approach

Backward selection starts with all variables and removes them one by one, at each step removing
the one that decreases the error the most (or increases it only slightly), until any further removal
increases the error significantly. In sequential backward selection, the process starts with F containing
all features and then removes one attribute from F at a time, the one that causes the least error:

j = arg min_i E(F − x_i)

and

remove x_j from F if E(F − x_j) < E(F)

The process stops if removing a feature does not decrease the error. To decrease complexity, one may decide
to remove a feature even if its removal causes only a slight increase in error. Backward
search has the same order of complexity as forward search, except that training a system with more
features is costlier than training a system with fewer features, so forward search may be preferable,
especially if many useless features are expected. Subset selection is supervised in that outputs are
used by the regressor or classifier to calculate the error.

Principal Component Analysis (PCA)

PCA is a useful statistical technique that has found application in fields such as face recognition and
image compression and is a common technique for finding patterns in data of high dimension.
Principal component analysis is a statistical procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of principal components is less
than or equal to the number of original variables. This transformation is defined in such a way that
the first principal component has the largest possible variance, and each succeeding component in
turn has the highest variance possible under the constraint that it is orthogonal to the preceding
components.
Principal Component Analysis is a projection method that finds a mapping from the input in the original
d-dimensional space to a new k-dimensional space (k < d), with minimum loss of information. The projection
of x on w is given by z = wᵀx. PCA is an unsupervised method that does not use output
information.
Some of the mathematical concepts used in PCA are summarized below.
Statistics: Statistics is based around the idea that there exists a big set of data, and it has to be analysed
in terms of the relationships between the individual points in that data set. Some of the measures used
for finding the relationships among data points are as follows.

Mean

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

The mean is the average of the numbers of a given set, i.e. a calculated "central" value of a set of
numbers.
Standard Deviation
The standard deviation of a dataset is a measure of how spread out the data is. It is the average distance
from the mean of the dataset to a point. The standard deviation is calculated by computing the squares
of the distances from each data point to the mean of the set, adding them all up, dividing by n − 1, and taking
the positive square root. That is

s = √( Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) )

where s is the symbol representing the standard deviation. Here n − 1 is used instead of n in order to
correct for a sample smaller in size than the population, so that the result obtained from a
small sample will be more or less the same as that obtained from the whole population.
Variance
Variance is another measure of the spread of data in a data set. In fact, it is almost identical
to the standard deviation. The formula is

s² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1)

Both standard deviation and variance are used to find the relationships among data points in a one-
dimensional space.
Covariance
Covariance is always measured between 2 dimensions. If the covariance between one dimension and
itself is calculated, the variance is obtained. Covariance is a measure to find out how much the
dimensions vary from the mean with respect to each other. The covariance between any two
dimensions X and Y in a multi-dimensional space can be given as

cov(X, Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

Covariance Matrix
Covariance is always measured between 2 dimensions. If a data set contains more than 2 dimensions,
there is more than one covariance measurement that can be calculated. The matrix representation of
all the possible calculated values of covariance is known as the covariance matrix. A covariance matrix for
three dimensions x, y, z is given as follows:

C = | cov(x, x)  cov(x, y)  cov(x, z) |
    | cov(y, x)  cov(y, y)  cov(y, z) |
    | cov(z, x)  cov(z, y)  cov(z, z) |
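The four measures above can be checked numerically. A small NumPy sketch with illustrative data (not the values from the text):

import numpy as np

x = np.array([10, 12, 11, 9, 10, 11, 12], dtype=float)
y = np.array([100, 110, 105, 95, 99, 103, 110], dtype=float)

mean_x = x.sum() / len(x)                                  # mean
s_x = np.sqrt(((x - mean_x) ** 2).sum() / (len(x) - 1))    # sample std (n - 1)
var_x = s_x ** 2                                           # variance

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
C = np.cov(np.stack([x, y]))   # 2x2 covariance matrix (uses n - 1 by default)
print(mean_x, s_x, var_x, cov_xy, C)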

Two important points to be noted about Principal Component Analysis are that it lowers the
dimension of the input features and that orthogonality of the new dimensions is always maintained.

Consider the data plot given below, which has two dimensions, one along axis X1 and the other along
X2.

[Figure: scatter plot of data points spread out along both the X1 and X2 axes.]

In the above data plot, the variability of the data points along both axes is large. That means both
dimensions provide significant information about the data points in the plot. So, each data point can
be represented using the corresponding coordinate values of both dimensions:

      | x₁₁  x₁₂ |
X  =  | x₂₁  x₂₂ |
      |  ⋮    ⋮  |
      | xₙ₁  xₙ₂ |

And the covariance between both dimensions can be denoted as

Cov(X) = | var(X1)      cov(X1, X2) |
         | cov(X2, X1)  var(X2)     |

The main aim of PCA is to find new dimensions such that the variability across at least one of the
dimensions will be small enough to be neglected. In order to find the new dimensions, rotate the axes
X1 and X2 by an angle θ. After rotating both axes, the new axes obtained are Z1 and Z2. [Figure: the
data plot after transformation, with the new axes Z1 and Z2 along the major and minor axes of the
ellipse of points.]

Now it can be noted that the variability of the data along the axis Z2 is very small. That means it does not
provide much information about the data points in the plot. So, ignoring such a dimension will not
affect the quality of the classification process. The two new dimensions obtained by transforming the
initial axes are known as principal components. Here the principal components are Z1 and Z2, with
Var(Z1) >> Var(Z2).
In the original data, X1 and X2 are highly correlated, so we get an ellipse whose major and minor axes
are not parallel to X1 and X2, which shows the dependency between x1 and x2. After the
transformation, the major and minor axes are along Z1 and Z2, and this is known as orthogonality.
The variability across Z2 is very small compared to Z1, so the dimension Z2 can be ignored.

Consider a single data point in the plot, say M(x1, x2). When the axes are changed to Z1 and Z2 from
X1 and X2, the corresponding coordinates have to be found out for each point in the data plot.
[Figure: point M shown with its coordinates in both the (X1, X2) and (Z1, Z2) systems.]

The new coordinates can be written in terms of the old as

z1 =  x1 cos θ + x2 sin θ
z2 = −x1 sin θ + x2 cos θ

which can be written as Z = AᵀX, where

A = | cos θ   −sin θ |
    | sin θ    cos θ |

If the number of dimensions is p, then A is the p × p matrix

A = | a₁₁  a₁₂ ... a₁ₚ |
    | a₂₁  a₂₂ ... a₂ₚ |
    |  ⋮    ⋮       ⋮  |
    | aₚ₁  aₚ₂ ... aₚₚ |
Since A is a rotation matrix, it is orthogonal:

AᵀA = | cos θ   sin θ | | cos θ  −sin θ |  =  | 1  0 |
      | −sin θ  cos θ | | sin θ   cos θ |     | 0  1 |

because cos²θ + sin²θ = 1, so the new axes remain mutually perpendicular.

If there are p dimensions, each new coordinate (principal component) is a linear combination of the original ones:

z₁ = a₁₁x₁ + a₁₂x₂ + ... + a₁ₚxₚ
z₂ = a₂₁x₁ + a₂₂x₂ + ... + a₂ₚxₚ
  ⋮
zₚ = aₚ₁x₁ + aₚ₂x₂ + ... + aₚₚxₚ

subject to the constraint that each zⱼ is uncorrelated with the earlier components.

In statistical terms, the mean and covariance of a population are denoted µ and Σ, while the mean and covariance estimated from a sample are denoted x̄ and S. The j-th principal component zⱼ = aⱼᵀx is chosen to maximize its variance,

V(zⱼ) = V(aⱼᵀx) = aⱼᵀ S aⱼ

and the expected value of zⱼ is

E(zⱼ) = E(aⱼᵀx) = aⱼᵀ E(x),   which in the case of a sample is E(zⱼ) = aⱼᵀ x̄.

Note that V(zⱼ) = aⱼᵀ S aⱼ can be made arbitrarily large by scaling aⱼ, so the maximization is carried out subject to the normalization aⱼᵀaⱼ = 1. So the Lagrange multiplier method is used:

L = aⱼᵀ S aⱼ − λ(aⱼᵀ aⱼ − 1),   where λ is the multiplier.

Differentiating with respect to aⱼ and equating to zero gives

S aⱼ = λ aⱼ,   i.e.  (S − λI) aⱼ = 0

so λ is an eigenvalue of the sample covariance matrix S and aⱼ is the corresponding unit eigenvector. Taking the eigenvalues in decreasing order, λ₁ ≥ λ₂ ≥ ... ≥ λₚ, gives the principal components PC1, PC2, ..., in order of decreasing variance.

Example: consider the data set below. [The handwritten table is only partially legible; it lists roughly a dozen two-dimensional samples such as (10, 100), (12, 110), (11, 105), (9, 95), (10, 99), (11, 101), (12, 105), (11, 103), ...]

From the data, compute the sample covariance matrix S and solve the characteristic equation

|S − λI| = 0

The resulting equation in λ is a quadratic, so it can be solved by

λ = (−b ± √(b² − 4ac)) / 2a

where a, b, c are its coefficients. In the notes the two roots come out as approximately

λ₁ = 30.60,   λ₂ = 0.03

When λ = λ₁ = 30.60, substituting it in (S − λI)a = 0 and solving gives the first eigenvector. Hence the first principal component is

PC1:  z₁ = 0.19 x₁ + 0.98 x₂
Independent Component Analysis (ICA)
ICA is a statistical and computational technique for revealing hidden factors that underlie sets of
random variables, measurements, or signals. It is a statistical technique for decomposing a complex
dataset into independent sub-parts. One of the major applications of ICA is solving the Blind
Source Separation (BSS) problem. The BSS problem is to determine the source signals given only
their mixtures — e.g. linear mixtures of two source signals which are known to be independent of each
other, i.e. observing the value of one signal does not give any information about the value of the other.
The above problem can be mathematically represented as

x = As

where s is a two-dimensional random vector containing the independent source signals, A is the
two-by-two mixing matrix, and x contains the observed (mixed) signals. The main aim of
independent component analysis is to find an unmixing matrix W where

W = A⁻¹

Motivation
Imagine that you are in a room where two people are speaking simultaneously. You have two
microphones, which you hold in different locations. The microphones give you two recorded time
signals, which we could denote by x1(t) and x2(t), with x1 and x2 the amplitudes, and t the time
index. Each of these recorded signals is a weighted sum of the speech signals emitted by the two
speakers, which we denote by s1(t) and s2(t). We could express this as a linear equation:

x1(t) = a11 s1 + a12 s2     (1)
x2(t) = a21 s1 + a22 s2     (2)

where a11, a12, a21, and a22 are some parameters that depend on the distances of the microphones from
the speakers. It would be very useful if you could now estimate the two original speech signals s1(t)
and s2(t), using only the recorded signals x1(t) and x2(t). This is called the cocktail-party problem.

One approach to solving this problem would be to use some information on the statistical properties
of the signals si(t) to estimate the aij. Actually, and perhaps surprisingly, it turns out that it is enough
to assume that s1(t) and s2(t), at each time instant t, are statistically independent. Independent
Component Analysis, or ICA, can be used to estimate the aij based on the information of their
independence, which allows us to separate the two original source signals s1(t) and s2(t) from their
mixtures x1(t) and x2(t).

Definition of ICA
Assume that we observe n linear mixtures x1, ..., xn of n independent components,

xj = aj1 s1 + aj2 s2 + ... + ajn sn,  for all j.

Using vector–matrix notation, let us denote by x the random vector whose elements are the mixtures
x1, ..., xn, and by s the random vector with elements s1, ..., sn. Let A be the matrix with elements aij. All
vectors are understood as column vectors; thus xᵀ, the transpose of x, is a row vector. Using vector–
matrix notation this can be represented as

x = As     (3)

or, writing A in terms of its columns aᵢ,

x = Σᵢ aᵢ sᵢ.

The statistical model in Eq. 3 is called the independent component analysis, or ICA, model. The ICA
model is a generative model, which means that it describes how the observed data are generated by a
process of mixing the components si. The independent components are latent variables, meaning that
they cannot be directly observed. The mixing matrix is also assumed to be unknown. All we observe
is the random vector x, and we must estimate both A and s using it. The starting point for ICA is the
very simple assumption that the components si are statistically independent, and the independent
components must have non-gaussian distributions. After estimating the matrix A, compute its inverse,
say W, and obtain the independent components simply by: s = Wx.
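As a minimal sketch of estimating W in practice, the following assumes scikit-learn's FastICA is available (the text does not prescribe any particular implementation). Note, per the ambiguities listed next, that the recovered sources come back only up to order and scale:

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # source 1
s2 = np.sign(np.sin(3 * t))                 # source 2 (square wave, non-gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 1.0]])      # mixing matrix
X = S @ A.T                                  # observed mixtures x = As

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                 # recovered sources s = Wx
print(S_est.shape)                           # (2000, 2), up to order and scale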

Ambiguities of ICA
1. Cannot determine the variances of the independent components. Because both s and A are
unknown, any scalar multiplier in one of the sources si could always be cancelled by dividing the
corresponding column ai of A by the same scalar.
2. Cannot determine the order of the independent components.

Definition and fundamental properties

What is independence?
Consider two scalar-valued random variables y1 and y2. Basically, the variables y1 and y2 are said to
be independent if information on the value of y1 does not give any information on the value of y2, and
vice versa. Independence can be defined by the probability densities. Let us denote by p(y1, y2) the
joint probability density function (pdf) of y1 and y2. Let us further denote by p1(y1) the marginal pdf
of y1, i.e. the pdf of y1 when it is considered alone:

p1(y1) = ∫ p(y1, y2) dy2

and similarly for y2. Then y1 and y2 are independent if and only if the joint pdf is factorizable in the
following way:

p(y1, y2) = p1(y1) p2(y2).

Uncorrelated variables
A weaker form of independence is uncorrelatedness. Two random variables y1 and y2 are said to be
uncorrelated if their covariance is zero:

E{y1 y2} − E{y1} E{y2} = 0

If the variables are independent, they are uncorrelated. On the other hand, uncorrelatedness does not
imply independence. Since independence implies uncorrelatedness, many ICA methods constrain the
estimation procedure so that it always gives uncorrelated estimates of the independent components.
This reduces the number of free parameters and simplifies the problem.

Non-Gaussianity and Independence

A sum of two independent signals has a distribution closer to gaussian. So a gaussian can be considered
as a linear combination of many independent signals. Separation of signals can therefore be accomplished
by making the linear signal transformation as non-gaussian as possible.

Measures of non-gaussianity
Kurtosis: The classical measure of non-gaussianity is kurtosis, or the fourth-order cumulant. The
kurtosis of y is classically defined by

kurt(y) = E{y⁴} − 3 (E{y²})²

If y is assumed to be of unit variance, the right-hand side simplifies to E{y⁴} − 3. This shows that
kurtosis is simply a normalized version of the fourth moment E{y⁴}. For a gaussian y, the fourth
moment equals 3 (E{y²})². Thus, kurtosis is zero for a gaussian random variable. For most (but not
quite all) non-gaussian random variables, kurtosis is non-zero. Kurtosis can be both positive and
negative. Random variables that have a negative kurtosis are called sub-gaussian, and those with
positive kurtosis are called super-gaussian.
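The definition can be verified numerically: a gaussian sample gives roughly zero, a uniform sample a negative value (sub-gaussian), and a Laplacian sample a positive value (super-gaussian). A short sketch:

import numpy as np

def kurt(y):
    """kurt(y) = E[y^4] - 3 (E[y^2])^2; zero for a gaussian variable."""
    y = y - y.mean()
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
g = rng.normal(size=100_000)             # gaussian: kurtosis ~ 0
u = rng.uniform(-1, 1, size=100_000)     # sub-gaussian: negative kurtosis
l = rng.laplace(size=100_000)            # super-gaussian: positive kurtosis
print(kurt(g), kurt(u), kurt(l))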
Module 7
Learning with Clustering
K-means Clustering
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is
low. Cluster similarity is measured in regard to the mean value of the objects in a cluster,
which can be viewed as the cluster's centroid or centre of gravity.

Steps:
► First, it randomly selects k of the objects, each of which initially represents a cluster
mean or centre.
► Each of the remaining objects is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
► It then computes the new mean for each cluster.
► This process iterates until the criterion function converges.
► Typically, the square-error criterion is used, defined as

E = Σᵢ₌₁ᵏ Σ_{p∈Cᵢ} |p − mᵢ|²

where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mᵢ is the mean of cluster Cᵢ (both p and mᵢ are
multidimensional).

Algorithm:
The k-means algorithm for partitioning, where each cluster's centre is represented by the mean
value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centres;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the
mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
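A direct transcription of steps (1)–(5) into Python — a bare-bones sketch that ignores the empty-cluster corner case. The one-dimensional test data matches the worked example further below:

import numpy as np

def k_means(D, k, rng=np.random.default_rng(0)):
    """Plain k-means on an (n, d) array D, following steps (1)-(5)."""
    centres = D[rng.choice(len(D), size=k, replace=False)]   # step (1)
    while True:
        # step (3): assign each object to the nearest cluster centre
        labels = np.argmin(np.linalg.norm(D[:, None] - centres[None], axis=2), axis=1)
        # step (4): recompute each cluster mean
        new = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centres):                        # step (5): no change
            return labels, centres
        centres = new

D = np.array([[2.], [3.], [4.], [10.], [11.], [12.], [20.], [25.], [30.]])
labels, centres = k_means(D, k=2)
print(labels, centres.ravel())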

Hierarchical Clustering
A hierarchical clustering method works by grouping data objects into a tree of clusters. The
quality of a pure hierarchical clustering method suffers from its inability to perform adjustment
once a merge or split decision has been executed. That is, if a particular merge or split decision
later turns out to have been a poor choice, the method cannot backtrack and correct it.
In general, there are two types of hierarchical clustering methods:
Agglomerative hierarchical clustering:
This bottom-up strategy starts by placing each object in its own cluster and then merges these
atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or
until certain termination conditions are satisfied. Most hierarchical clustering methods belong
to this category. They differ only in their definition of inter-cluster similarity.

Divisive hierarchical clustering:

This top-down strategy does the reverse of agglomerative hierarchical clustering by starting
with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until it satisfies certain termination conditions, such
as a desired number of clusters being obtained or the diameter of each cluster being within a
certain threshold.

In either agglomerative or divisive hierarchical clustering, the user can specify the desired
number of clusters as a termination condition.

Difficulties with Hierarchical Clustering:

► The hierarchical clustering method, though simple, often encounters difficulties
regarding the selection of merge or split points.
► Such a decision is critical because once a group of objects is merged or split, the process
at the next step will operate on the newly generated clusters.
► It will neither undo what was done previously nor perform object swapping between
clusters.
► Thus, merge or split decisions, if not well chosen at some step, may lead to low-quality
clusters.
► Moreover, the method does not scale well, because each decision to merge or split
requires the examination and evaluation of a good number of objects or clusters.

Solution to the above problems:

One promising direction for improving the clustering quality of hierarchical methods is to
integrate hierarchical clustering with other clustering techniques, resulting in multiple-phase
clustering. There are three such methods:
► The first, called BIRCH, begins by partitioning objects hierarchically using tree
structures, where the leaf or low-level non-leaf nodes can be viewed as "micro-clusters"
depending on the scale of resolution. It then applies other clustering algorithms to
perform macro-clustering on the micro-clusters.
► The second method, called ROCK, merges clusters based on their interconnectivity.
► The third method, called Chameleon, explores dynamic modelling in hierarchical
clustering.

Expectation Maximization Algorithm

The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood
(ML) estimate in the presence of missing or hidden data. In ML estimation, we wish to estimate
the model parameter(s) for which the observed data are the most likely.
Each iteration of the EM algorithm consists of two processes: the E-step and the M-step. In
the expectation, or E-step, the missing data are estimated given the observed data and the current
estimate of the model parameters. This is achieved using the conditional expectation,
explaining the choice of terminology. In the M-step, the likelihood function is maximized under
the assumption that the missing data are known. The estimates of the missing data from the E-
step are used in lieu of the actual missing data. Convergence is assured since the algorithm is
guaranteed to increase the likelihood at each iteration.

Derivation
Let X be a random vector which results from a parameterized family. We wish to find θ such that P(X|θ) is a maximum. This is known as the Maximum Likelihood (ML) estimate for θ. In order to estimate θ, it is typical to introduce the log likelihood function defined as

L(θ) = ln P(X|θ).     (7)

The likelihood function is considered to be a function of the parameter θ given the data X. Since ln(x) is a strictly increasing function, the value of θ which maximizes P(X|θ) also maximizes L(θ).
The EM algorithm is an iterative procedure for maximizing L(θ). Assume that after the n-th iteration the current estimate for θ is given by θₙ. Since the objective is to maximize L(θ), we wish to compute an updated estimate θ such that

L(θ) > L(θₙ)     (8)

Equivalently, we want to maximize the difference,

L(θ) − L(θₙ) = ln P(X|θ) − ln P(X|θₙ).     (9)

So far, we have not considered any unobserved or missing variables. In problems where such data exist, the EM algorithm provides a natural framework for their inclusion. Alternately, hidden variables may be introduced purely as an artifice for making the maximum likelihood estimation of θ tractable. In this case, it is assumed that knowledge of the hidden variables will make the maximization of the likelihood function easier. Either way, denote the hidden random vector by Z and a given realization by z. The total probability P(X|θ) may be written in terms of the hidden variables z as,

P(X|θ) = Σ_z P(X|z, θ) P(z|θ).     (10)

We may then rewrite Equation (9) as,

L(θ) − L(θₙ) = ln Σ_z P(X|z, θ) P(z|θ) − ln P(X|θₙ).     (11)

Notice that this expression involves the logarithm of a sum. In Section (1), using Jensen's inequality, it was shown that

ln Σᵢ₌₁ⁿ λᵢ xᵢ ≥ Σᵢ₌₁ⁿ λᵢ ln(xᵢ)

for constants λᵢ ≥ 0 with Σᵢ₌₁ⁿ λᵢ = 1. This result may be applied to Equation (11), which involves the logarithm of a sum, provided that the constants λᵢ can be identified. Consider letting the constants be of the form P(z|X, θₙ). Since P(z|X, θₙ) is a probability measure, we have that P(z|X, θₙ) ≥ 0 and that Σ_z P(z|X, θₙ) = 1, as required.
Then, starting with Equation (11), the constants P(z|X, θₙ) are introduced:

L(θ) − L(θₙ) = ln Σ_z P(X|z, θ) P(z|θ) − ln P(X|θₙ)
            = ln Σ_z P(z|X, θₙ) · [P(X|z, θ) P(z|θ) / P(z|X, θₙ)] − ln P(X|θₙ)
            ≥ Σ_z P(z|X, θₙ) ln [P(X|z, θ) P(z|θ) / P(z|X, θₙ)] − ln P(X|θₙ)
            = Σ_z P(z|X, θₙ) ln [P(X|z, θ) P(z|θ) / (P(z|X, θₙ) P(X|θₙ))]
            ≜ Δ(θ|θₙ)     (12)

Equivalently, we may write,

L(θ) ≥ L(θₙ) + Δ(θ|θₙ)     (13)

and for convenience define,

l(θ|θₙ) ≜ L(θₙ) + Δ(θ|θₙ)

so that the relationship in Equation (13) can be made explicit as,

L(θ) ≥ l(θ|θₙ).

We have now a function, l(θ|θₙ), which is bounded above by the likelihood function L(θ). Additionally, observe that,

l(θₙ|θₙ) = L(θₙ) + Δ(θₙ|θₙ)
         = L(θₙ) + Σ_z P(z|X, θₙ) ln [P(X|z, θₙ) P(z|θₙ) / (P(z|X, θₙ) P(X|θₙ))]
         = L(θₙ) + Σ_z P(z|X, θₙ) ln [P(X, z|θₙ) / P(X, z|θₙ)]
         = L(θₙ)     (14)

so for θ = θₙ the functions l(θ|θₙ) and L(θ) are equal.
Our objective is to choose a value of θ so that L(θ) is maximized. We have shown that the function l(θ|θₙ) is bounded above by the likelihood function L(θ) and that the values of the functions l(θ|θₙ) and L(θ) are equal at the current estimate θ = θₙ. Therefore, any θ which increases l(θ|θₙ) in turn increases L(θ). In order to achieve the greatest possible increase in the value of L(θ), the EM algorithm calls for selecting θ such that l(θ|θₙ) is maximized. We denote this updated value as θₙ₊₁. This process is illustrated in Figure (2).
Formally we have,

θₙ₊₁ = arg max_θ { l(θ|θₙ) }
     = arg max_θ { L(θₙ) + Σ_z P(z|X, θₙ) ln [P(X|z, θ) P(z|θ) / (P(X|θₙ) P(z|X, θₙ))] }

Now drop terms which are constant w.r.t. θ:

     = arg max_θ { Σ_z P(z|X, θₙ) ln P(X|z, θ) P(z|θ) }
     = arg max_θ { Σ_z P(z|X, θₙ) ln [P(X, z, θ)/P(z, θ) · P(z, θ)/P(θ)] }
     = arg max_θ { Σ_z P(z|X, θₙ) ln P(X, z|θ) }
     = arg max_θ { E_{Z|X,θₙ} { ln P(X, z|θ) } }     (15)

In Equation (15) the expectation and maximization steps are apparent. The EM algorithm thus consists of iterating the:

1. E-step: Determine the conditional expectation E_{Z|X,θₙ} { ln P(X, z|θ) }
2. M-step: Maximize this expression with respect to θ.

At this point it is fair to ask what has been gained, given that we have simply traded the maximization of L(θ) for the maximization of l(θ|θₙ). The answer lies in the fact that l(θ|θₙ) takes into account the unobserved or missing data Z. In the case where we wish to estimate these variables, the EM algorithm provides a framework for doing so. Also, as alluded to earlier, it may be convenient to introduce such hidden variables so that the maximization of l(θ|θₙ) is simplified given knowledge of the hidden variables (as compared with a direct maximization of L(θ)).
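As a concrete instance of the E- and M-steps, here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture — a standard application in which the hidden variable z is the unknown component of each observation. The mixture and its parameters are illustrative:

import numpy as np

rng = np.random.default_rng(0)
x = np.r_[rng.normal(0, 1, 200), rng.normal(5, 1, 200)]   # observed data X

pi, mu, sd = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])  # initial theta_0
for _ in range(50):
    # E-step: responsibilities P(z = component | x, theta_n)
    p0 = pi * np.exp(-0.5 * ((x - mu[0]) / sd[0]) ** 2) / sd[0]
    p1 = (1 - pi) * np.exp(-0.5 * ((x - mu[1]) / sd[1]) ** 2) / sd[1]
    r = p0 / (p0 + p1)
    # M-step: maximize the expected complete-data log likelihood
    pi = r.mean()
    mu = np.array([(r * x).sum() / r.sum(), ((1 - r) * x).sum() / (1 - r).sum()])
    sd = np.sqrt(np.array([(r * (x - mu[0]) ** 2).sum() / r.sum(),
                           ((1 - r) * (x - mu[1]) ** 2).sum() / (1 - r).sum()]))
print(pi, mu, sd)   # approaches the true mixture: weight 0.5, means {0, 5}, stds {1, 1}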
What is Clustering?
Clustering is the process of making groups ("clusters") of abstract objects, forming classes of similar objects — that is, objects that resemble each other are placed together in the same group. Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.

k-means
• This is used in cluster analysis for quantitatively grouping similar objects into clusters.
• k-means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean, the mean serving as a prototype of the cluster.
• This behaves as a partitioning of the data space into k regions, one per cluster mean.

Steps:
Step 1) Take k initial mean values.
Step 2) Find the nearest mean for each object and put the object into that cluster.
Step 3) Recompute the means and repeat steps one and two until the same means are obtained twice; the clusters then no longer change.

Example (k = 2): cluster the data set {2, 3, 4, 10, 11, 12, 20, 25, 30}.

Step 1) Take initial mean values. Start with the clusters K1 = {2, 3, 4} and K2 = {10, 11, 12, 20, 25, 30}, with means

m1 = (2 + 3 + 4) / 3 = 3
m2 = (10 + 11 + 12 + 20 + 25 + 30) / 6 = 18

Step 2) Reassign each value to the cluster with the nearer mean and recompute the means. After a few passes the assignment settles at

K1 = {2, 3, 4, 10, 11, 12},  m1 = (2 + 3 + 4 + 10 + 11 + 12) / 6 = 7
K2 = {20, 25, 30},           m2 = (20 + 25 + 30) / 3 = 25

Step 3) A further pass leaves every point with its nearer mean (each of 2, ..., 12 is closer to 7; each of 20, ..., 30 is closer to 25). Since we are getting the same mean values again, we stop: K1 and K2 are the final clusters.
Part (1)
Following the clustering above, for agglomerative clustering we first draw an adjacency (distance) matrix. [The handwritten table of five objects A–E with two attributes, Attribute X and Attribute Y, is only partially legible.]

Before computing the adjacency matrix, we need to find the distance between each object and every other object, using the Euclidean distance formula

d(P, Q) = √( (x_P − x_Q)² + (y_P − y_Q)² )

For example, the notes work out values such as

d(A, B) = 1,   d(A, C) = √2 ≈ 1.41,   d(A, D) = √((3 − 2)² + (1 − 2)²) = √2 ≈ 1.41,
d(A, E) = √(1.5² + 0.5²) ≈ 1.58,   d(B, C) = √5 ≈ 2.24,   d(B, D) = 1,
d(B, E) ≈ 2.12,   d(D, E) ≈ 1.58

The resulting (symmetric) distance matrix over the objects is

      A      B      C      D      E
A     0
B     1      0
C     1.41   2.24   0
D     1.41   1      2      0
E     1.58   2.12   0.5    1.58   0
The notes then walk through single-linkage merging on a simpler, integer-valued distance matrix over the items E, A, C, B, D:

      E   A   C   B   D
E     0
A     1   0
C     2   2   0
B     2   5   1   0
D     3   3   6   3   0

Step 1: the minimum off-diagonal distance is 1, between E and A, so merge them into the cluster (E, A). Under single linkage, the distance from (E, A) to any other item is the minimum over its members:

min dist[(E,A), C] = min[dist(E,C), dist(A,C)] = min[2, 2] = 2
min dist[(E,A), B] = min[dist(E,B), dist(A,B)] = min[2, 5] = 2
min dist[(E,A), D] = min[dist(E,D), dist(A,D)] = min[3, 3] = 3

The updated matrix is

        (E,A)  C   B   D
(E,A)   0
C       2      0
B       2      1   0
D       3      6   3   0

Step 2: now the minimum distance is 1, between C and B, so merge them into (C, B). Again find the minimum distances to the remaining clusters:

min dist[(E,A), (C,B)] = min[dist(E,C), dist(E,B), dist(A,C), dist(A,B)] = min[2, 2, 2, 5] = 2
min dist[(C,B), D] = min[dist(C,D), dist(B,D)] = min[6, 3] = 3

        (E,A)  (C,B)  D
(E,A)   0
(C,B)   2      0
D       3      3      0

Step 3: the minimum distance is now 2, so merge (E, A) with (C, B):

min dist[((E,A),(C,B)), D] = min[dist(E,D), dist(A,D), dist(C,D), dist(B,D)] = min[3, 3, 6, 3] = 3

Step 4: finally D joins at distance 3, leaving a single cluster containing all five items.

Dendrogram: E and A join at height 1, C and B join at height 1, the pairs (E, A) and (C, B) join at height 2, and D joins the whole cluster at height 3.
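The same single-linkage merge sequence can be reproduced programmatically, assuming SciPy is available (the notes do not prescribe a library). The merge heights printed by linkage come out as 1, 1, 2, 3, matching the dendrogram above:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Pairwise distance matrix for E, A, C, B, D (order as in the worked example).
labels = ["E", "A", "C", "B", "D"]
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 5, 3],
    [2, 2, 0, 1, 6],
    [2, 5, 1, 0, 3],
    [3, 3, 6, 3, 0],
], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z)  # each row: (cluster i, cluster j, merge distance, new cluster size)
# dendrogram(Z, labels=labels)  # draws the tree when matplotlib is available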
Radial Basis Function

A radial basis function (RBF) is a term that describes any real valued function whose output
depends exclusively on the distance of its input from some origin. Typically, radial basis
functions are defined in terms of the standard Euclidean norm of the input vector, but
technically speaking one can use any other norm as well.

A radial basis function gives value to each point based on its distance from the origin or a
fixed centre, usually on a Euclidean space. It is spherically symmetrical.

In machine learning, radial basis functions are most commonly used as a kernel for
classification with the support vector machine (SVM). For this application of RBFs,
the Gaussian radial basis function,

φ(x) = exp( −‖x − c‖² / (2σ²) ),

is virtually always selected.

The Gaussian is also usually chosen as the activation function used to compute the derived
features (the units in the middle of the network) in the neural networks known as radial basis
function networks, although this application of RBFs is less common.
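A one-function sketch of the Gaussian RBF just described (the centre, width, and test point are illustrative):

import numpy as np

def gaussian_rbf(x, c, sigma=1.0):
    """Gaussian radial basis function: its value depends only on ||x - c||."""
    return np.exp(-np.linalg.norm(x - c) ** 2 / (2 * sigma ** 2))

print(gaussian_rbf(np.array([1.0, 2.0]), c=np.array([0.0, 0.0])))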
MODULE 8

Reinforcement Learning

Introduction
► Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial
Intelligence.
► It allows machines and software agents to automatically determine the ideal behaviour within
a specific context, in order to maximize their performance.
► Simple reward feedback is required for the agent to learn its behaviour; this is known as the
reinforcement signal.
► Reinforcement Learning allows the machine or software agent to learn its behaviour based on
feedback from the environment.
► If the problem is modelled with care, some Reinforcement Learning algorithms can converge
to the global optimum; this is the ideal behaviour that maximises the reward.
Functioning
► In reinforcement learning, the learner is a decision-making agent that takes actions in an
environment and receives reward (or penalty) for its actions in trying to solve a problem.
► After a set of trial-and-error runs, it should learn the best policy, which is the sequence of
actions that maximizes the total reward.

Core Concept
► Reinforcement learning differs from standard supervised learning in that correct
input/output pairs are never presented, nor sub-optimal actions explicitly corrected.
► There is a focus on 'on-line' performance, which involves finding a balance between
exploration (of uncharted territory) and exploitation.
► It is called "learning with a critic," as opposed to learning with a teacher, which we have in
supervised learning.
► A critic differs from a teacher in that it does not tell us what to do but only how well we have
been doing in the past; the critic never informs in advance.
► The feedback from the critic is scarce, and when it comes, it comes late. This leads to the credit
assignment problem.
► Reinforcement learning generates internal values for the intermediate actions in terms of
how good they are in leading to the goal and getting us to the real reward.
► Once such an internal reward mechanism is learned, the agent can just take the local actions
to maximize it.
► The solution to the task requires a sequence of actions, which resembles a Markov process. A
Markov decision process is used to model the agent.
► The difference is that in the case of Markov models, there is an external process that generates
a sequence of signals. In this case it is the agent that generates the sequence of actions.
► Sometimes a partially observable Markov decision process is seen, in cases where the agent
does not know its state exactly but should infer it with some uncertainty through observations
using sensors.

Elements of Reinforcement Learning


The main components of reinforcement learning are
• States and Actions
• Environment and Agent
• Reward
The learning decision maker is called the agent. The agent interacts with the environment, which
includes everything outside the agent. The agent has sensors to decide on its state in the environment
and takes an action that modifies its state. When the agent takes an action, the environment provides
a reward. Time is discrete, t = 0, 1, 2, ..., and sₜ ∈ S denotes the state of the agent at time t, where
S is the set of all possible states. aₜ ∈ A(sₜ) denotes the action that the agent takes at time t, where A(sₜ)
is the set of possible actions in state sₜ. When the agent in state sₜ takes the action aₜ, the clock ticks,
reward rₜ₊₁ ∈ ℝ is received, and the agent moves to the next state sₜ₊₁. The problem is modelled
using a Markov decision process (MDP). The reward and next state are sampled from their respective
probability distributions, p(rₜ₊₁|sₜ, aₜ) and P(sₜ₊₁|sₜ, aₜ). This is similar to a Markov system, where the
state and reward in the next time step depend only on the current state and action.

[Diagram: the Agent takes an Action on the Environment; the Environment returns a Reward and the new State to the Agent.]

Episode/Trial: The sequence of actions from the start to the terminal state is an episode, or a trial.
Based on the application, any state may be designated as the initial state. There is also an absorbing
terminal (goal) state where the search ends; all actions in this terminal state transition to itself with
probability 1 and without any reward.

Policy: The policy π defines the agent's behaviour and is a mapping from the states of the
environment to actions: π : S → A. The policy defines the action to be taken in any state sₜ: aₜ = π(sₜ).
The value of a policy, V^π(sₜ), is the expected cumulative reward that will be received while the
agent follows the policy, starting from state sₜ.

Finite Horizon: In the finite-horizon or episodic model, the agent tries to maximize the expected
reward for the next T steps:

V^π(sₜ) = E[rₜ₊₁ + rₜ₊₂ + ... + rₜ₊T] = E[ Σᵢ₌₁ᵀ rₜ₊ᵢ ]

Infinite Horizon: Some tasks are continuing, and there is no prior fixed limit to the episode. In the
infinite-horizon model there is no sequence limit, but future rewards are discounted:

V^π(sₜ) = E[rₜ₊₁ + γ rₜ₊₂ + γ² rₜ₊₃ + ...] = E[ Σᵢ₌₁^∞ γ^(i−1) rₜ₊ᵢ ]

where 0 ≤ γ < 1 is the discount rate.

Discount Rate: The discount rate γ is used to keep the return finite. If γ = 0, then only the immediate
reward counts. As γ approaches 1, rewards further in the future count more. γ is kept less than 1 because
there generally is a time limit to the sequence of actions needed to solve the task.
Optimal Policy: For each policy π, there is a V^π(sₜ), and we want to find the optimal policy π* such
that

V*(sₜ) = max_π V^π(sₜ),  ∀ sₜ

In some cases, instead of working with the values of states, V(sₜ), it is preferred to work with the
values of state–action pairs, Q(sₜ, aₜ). V(sₜ) denotes how good it is for the agent to be in state sₜ,
whereas Q(sₜ, aₜ) denotes how good it is to perform action aₜ when in state sₜ. We define Q*(sₜ, aₜ) as
the value, that is, the expected cumulative reward, of action aₜ taken in state sₜ and then obeying the
optimal policy afterwards. The value of a state is equal to the value of the best possible action:

V*(sₜ) = max_{aₜ} Q*(sₜ, aₜ)
       = max_{aₜ} E[ Σᵢ₌₁^∞ γ^(i−1) rₜ₊ᵢ ]
       = max_{aₜ} E[ rₜ₊₁ + γ Σᵢ₌₁^∞ γ^(i−1) rₜ₊ᵢ₊₁ ]
       = max_{aₜ} E[ rₜ₊₁ + γ V*(sₜ₊₁) ]

V*(sₜ) = max_{aₜ} ( E[rₜ₊₁] + γ Σ_{sₜ₊₁} P(sₜ₊₁|sₜ, aₜ) V*(sₜ₊₁) )   ----------- (1)

Equation (1) is known as Bellman's equation. Similarly, the value of the expected cumulative reward
of a state–action pair can be represented by

Q*(sₜ, aₜ) = E[rₜ₊₁] + γ Σ_{sₜ₊₁} P(sₜ₊₁|sₜ, aₜ) max_{aₜ₊₁} Q*(sₜ₊₁, aₜ₊₁)

Once the value of Q*(sₜ, aₜ) is calculated, the policy can be defined by taking the action aₜ* which has
the highest value among all Q*(sₜ, aₜ):

π*(sₜ): choose aₜ* where Q*(sₜ, aₜ*) = max_{aₜ} Q*(sₜ, aₜ)

If the values of Q*(sₜ, aₜ) are known, then using a greedy approach at each local step, the optimal
sequence of steps that maximizes the cumulative reward can be calculated.

Learning in reinforcement learning can be done in two ways:

► Model-Based Learning
► Model-Free Learning

Model-Based Learning
In model-based learning, the environment model parameters, p(rₜ₊₁|sₜ, aₜ) and P(sₜ₊₁|sₜ, aₜ), are known.
If the environment parameters are known, the optimal value function and policy can be solved directly
using dynamic programming. The optimal value function is unique and is the solution to the
simultaneous equations (Bellman's equations). The optimal policy is to choose the action that
maximizes the value of the optimal value function in the next state:

π*(sₜ) = arg max_{aₜ} ( E[rₜ₊₁|sₜ, aₜ] + γ Σ_{sₜ₊₁∈S} P(sₜ₊₁|sₜ, aₜ) V*(sₜ₊₁) )

The optimal policy can be evaluated using either
► the Value Iteration algorithm
► the Policy Iteration algorithm
Value Iteration
To find the optimal policy, use the optimal value function and apply the iterative algorithm called
value iteration, which will converge to the correct V* values.

Initialize V(s) to arbitrary values
Repeat
    For all s ∈ S
        For all a ∈ A
            Q(s, a) ← E[r|s, a] + γ Σ_{s′∈S} P(s′|s, a) V(s′)
        V(s) ← max_a Q(s, a)
Until V(s) converges

If the maximum value difference between two iterations is less than a certain threshold δ,

max_{s∈S} |V^(l+1)(s) − V^(l)(s)| < δ

where l is the iteration counter, then the values can be assumed to have converged. It is possible that
the policy converges to the optimal one even before the values converge to their optimal values,
because only the actions with maximum values are considered.
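A compact NumPy sketch of value iteration on a small randomly generated MDP. The MDP itself is illustrative; the arrays P and R play the roles of P(s′|s, a) and E[r|s, a] above:

import numpy as np

n_s, n_a, gamma, delta = 3, 2, 0.9, 1e-6
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']: valid transition probs
R = rng.normal(size=(n_s, n_a))                    # R[s, a]: expected immediate reward

V = np.zeros(n_s)
while True:
    Q = R + gamma * P @ V          # Q(s,a) = E[r|s,a] + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)          # V(s) = max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < delta:   # convergence test with threshold delta
        break
    V = V_new
policy = Q.argmax(axis=1)          # greedy policy from the converged values
print(V, policy)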

Policy Iteration
In policy iteration, the policies themselves are stored and updated, rather than doing this indirectly over
the values. The process starts with a policy and improves it repeatedly until there is no change. The
value function can be calculated by solving the linear equations. Then an improvement in the policy
is checked for. This step is guaranteed to improve the policy, and when no improvement is possible,
the policy is guaranteed to be optimal.

Initialize a policy π′ arbitrarily
Repeat
    π ← π′
    Compute the values using π by solving the linear equations
        V^π(s) = E[r|s, π(s)] + γ Σ_{s′∈S} P(s′|s, π(s)) V^π(s′)
    Improve the policy at each state
        π′(s) ← arg max_a ( E[r|s, a] + γ Σ_{s′∈S} P(s′|s, a) V^π(s′) )
Until π = π′
Temporal Difference Learning
Temporal difference (TD) learning is a prediction-based machine learning method. It has primarily
been used for the reinforcement learning problem and is said to be "a combination of Monte Carlo
ideas and dynamic programming (DP) ideas". TD resembles a Monte Carlo method because it learns
by sampling the environment according to some policy, and is related to dynamic programming
techniques as it approximates its current estimate based on previously learned estimates.
Temporal difference learning is one of the most used approaches for policy evaluation. It is a central
part of solving reinforcement learning tasks. For deriving optimal control, policies have to be
evaluated. This task requires value function approximation.
The optimal policy can be solved using dynamic programming, but this needs good knowledge about
the environment. Otherwise, the environment is explored to get the values of the next state and the
reward, and this information is used to update the values of the current state. These algorithms are
called temporal difference algorithms because what is considered is the difference between the current
estimate of the value of a state (or a state–action pair) and the discounted value of the next state plus
the reward received.

Exploration Strategies
ε-greedy search: ε-greedy search is where, with probability ε, we choose one action uniformly
randomly among all possible actions (referred to as explore), and with probability 1 − ε, we choose the
best action (referred to as exploit). Start with a high ε value and gradually decrease it in order to exploit more.
Soft Policy: A policy is soft if the probability of choosing any action a ∈ A in state s ∈ S is greater than
0. A soft-max function is a function used to convert values to probabilities:

P(a|s) = exp Q(s, a) / Σ_{b∈A} exp Q(s, b)

and we then sample according to these probabilities. The above equation can be rewritten using a
"temperature" variable T, defining the probability of choosing action a as

P(a|s) = exp[Q(s, a)/T] / Σ_{b∈A} exp[Q(s, b)/T]

When T is large, all probabilities are nearly equal and we have exploration. When T is small, better
actions are favoured. The strategy is to start with a large T and decrease it gradually, a procedure
named annealing, which in this case moves from exploration to exploitation.
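Both strategies are a few lines each; a sketch (the Q values are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_row, eps):
    """With probability eps explore uniformly; else exploit the best action."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmax(Q_row))

def softmax_action(Q_row, T):
    """Soft-max selection with temperature T: P(a|s) = exp[Q/T] / sum_b exp[Q/T]."""
    z = (Q_row - Q_row.max()) / T          # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(Q_row), p=p))

Q_row = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy(Q_row, eps=0.1), softmax_action(Q_row, T=0.5))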

Model-Free Learning is divided into

• Deterministic Rewards and Actions
• Non-Deterministic Rewards and Actions

Deterministic Rewards and Actions

In the deterministic case, at any state–action pair, there is a single reward and a single next state possible:

Q(sₜ, aₜ) = rₜ₊₁ + γ max_{aₜ₊₁} Q(sₜ₊₁, aₜ₊₁)

and we use this as an assignment to update Q(sₜ, aₜ). At state sₜ, action aₜ is chosen, which returns a
reward rₜ₊₁ and takes us to state sₜ₊₁. The value of the previous action can be updated as

Q̂(sₜ, aₜ) ← rₜ₊₁ + γ max_{aₜ₊₁} Q̂(sₜ₊₁, aₜ₊₁)

where the hat denotes that the value is an estimate. Q̂(sₜ₊₁, aₜ₊₁) is a later value and has a higher
chance of being correct. We discount it by γ, add the immediate reward (if any), and take this as
the new estimate for the previous Q̂(sₜ, aₜ). This is called a backup because it can be viewed as taking
the estimated value of an action in the next time step and "backing it up" to revise the estimate for
the value of the current action.

Assume that all Q̂(s, a) values are stored in a table. Initially all Q̂(sₜ, aₜ) are 0, and they are updated
in time as a result of trial episodes. There is a sequence of moves, and at each move the estimate of
the Q value of the previous state–action pair is updated using the Q value of the current state–action
pair by the formula

Q̂(sₜ, aₜ) ← rₜ₊₁ + γ max_{aₜ₊₁} Q̂(sₜ₊₁, aₜ₊₁)

In the intermediate states, all rewards are 0, so no update is done. When the goal state is reached, the
reward r is obtained, and the Q value of the previous state–action pair is updated to r. As for the
preceding state–action pair, its immediate reward is 0, and the contribution from the next state–action
pair is discounted by γ because it is one step later. Then in another episode, if we reach this state, we
can update the one preceding that as γ²r, and so on. This way, after many episodes, this information
is backed up to earlier state–action pairs. Q values increase until they reach their optimal values as we
find paths with higher cumulative reward.

Nondeterministic Rewards and Actions

If the rewards and the results of actions are not deterministic, then there is a probability distribution
for the reward, p(rₜ₊₁|sₜ, aₜ), from which rewards are sampled, and there is a probability distribution
for the next state, P(sₜ₊₁|sₜ, aₜ). These help us model the uncertainty in the system that may be due to
factors in the environment that we cannot control.

For example, we may have an imperfect robot which sometimes fails to go in the intended direction
and deviates, or advances a shorter or longer distance than expected. In that case, we can write

Q(sₜ, aₜ) = E[rₜ₊₁] + γ Σ_{sₜ₊₁} P(sₜ₊₁|sₜ, aₜ) max_{aₜ₊₁} Q(sₜ₊₁, aₜ₊₁)

A direct assignment is not possible because, for the same state and action, different rewards may be
received or moves to different next states may occur. This type of problem is handled by keeping a
running average, and the method is known as Q-learning:

Q̂(sₜ, aₜ) ← Q̂(sₜ, aₜ) + η( rₜ₊₁ + γ max_{aₜ₊₁} Q̂(sₜ₊₁, aₜ₊₁) − Q̂(sₜ, aₜ) )

In the above equation, the term

rₜ₊₁ + γ max_{aₜ₊₁} Q̂(sₜ₊₁, aₜ₊₁)

is treated as a sample of instances for each (sₜ, aₜ) pair, and Q̂(sₜ, aₜ) converges to its mean. η is
gradually decreased in time for convergence, and it has been shown that this algorithm converges to
the optimal Q* values. The pseudocode of the Q-learning algorithm is given below.
Initialize all Q(s, a) arbitrarily
For all episodes
    Initialize s
    Repeat
        Choose a using policy derived from Q, e.g., ε-greedy
        Take action a, observe r and s′
        Update Q(s, a):
            Q(s, a) ← Q(s, a) + η( r + γ max_{a′} Q(s′, a′) − Q(s, a) )
        s ← s′
    Until s is a terminal state

Q-learning, which is an off-policy temporal difference algorithm.
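A tabular Q-learning sketch on a toy "corridor" environment invented for illustration (five states, two actions, reward 1 only on reaching the terminal state), following the pseudocode above:

import numpy as np

n_s, n_a, gamma, eta, eps = 5, 2, 0.9, 0.1, 0.1
Q = np.zeros((n_s, n_a))                            # initialize all Q(s, a)
rng = np.random.default_rng(0)

for episode in range(2000):
    s = 0                                           # initialize s
    while s != 4:                                   # until s is terminal
        # choose a with an epsilon-greedy policy derived from Q
        a = int(rng.integers(n_a)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2 = max(s - 1, 0) if a == 0 else s + 1     # deterministic move: left/right
        r = 1.0 if s2 == 4 else 0.0                 # reward only at the goal
        Q[s, a] += eta * (r + gamma * Q[s2].max() - Q[s, a])   # backup
        s = s2
print(np.round(Q, 2))   # values back up as gamma^k along the path to the goal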

Temporal difference (TD) algorithms are those reducing the difference between the current Q value
and the backed-up estimate from one time step later. Q-learning is an off-policy method, as the value
of the best next action is used without using the policy. In an on-policy method, the policy is used to
determine the next action as well. The on-policy version of Q-learning is the Sarsa algorithm.

Initialize all Q(s, a) arbitrarily
For all episodes
    Initialize s
    Choose a using policy derived from Q, e.g., ε-greedy
    Repeat
        Take action a, observe r and s′
        Choose a′ using policy derived from Q, e.g., ε-greedy
        Update Q(s, a):
            Q(s, a) ← Q(s, a) + η( r + γ Q(s′, a′) − Q(s, a) )
        s ← s′, a ← a′
    Until s is a terminal state

Sarsa algorithm, which is an on-policy version of Q-learning.


The on-policy Sarsa uses the policy derived from the Q values to choose one next action a′ and uses its
Q value to calculate the temporal difference. On-policy methods estimate the value of a policy while
using it to take actions. In off-policy methods, these are separated: the policy used to generate
behaviour, called the behaviour policy, is different from the policy that is evaluated and improved,
called the estimation policy.

Eligibility Traces: The temporal difference above is used to update only the previous value (of the
state or state–action pair). An eligibility trace is a record of the occurrence of past visits that enables us
to implement temporal credit assignment, allowing us to update the values of previously occurring
visits as well. To store the eligibility trace, an additional memory variable is required, associated with
each state–action pair, e(s, a), initialized to 0. When the state–action pair (s, a) is visited, i.e. an action
a is taken in state s, its eligibility is set to 1; the eligibilities of all other state–action pairs are multiplied
by γλ, where 0 ≤ λ ≤ 1 is the trace decay parameter:

eₜ(s, a) = 1                  if s = sₜ and a = aₜ
eₜ(s, a) = γλ eₜ₋₁(s, a)      otherwise

If a state–action pair has never been visited, its eligibility remains 0; if it has been, as time passes and
other state–actions are visited, its eligibility decays depending on the values of γ and λ. In Sarsa, the
temporal error at time t is

δₜ = rₜ₊₁ + γ Q(sₜ₊₁, aₜ₊₁) − Q(sₜ, aₜ)

In Sarsa with an eligibility trace, named Sarsa(λ), all state–action pairs are updated as

Q(s, a) ← Q(s, a) + η δₜ eₜ(s, a),  ∀ s, a

The algorithm for Sarsa(λ) is as follows:
Initialize all Q(s, a) arbitrarily, e(s, a) ← 0, ∀ s, a
For all episodes
    Initialize s
    Choose a using policy derived from Q, e.g., ε-greedy
    Repeat
        Take action a, observe r and s′
        Choose a′ using policy derived from Q, e.g., ε-greedy
        δ ← r + γ Q(s′, a′) − Q(s, a)
        e(s, a) ← 1
        For all s, a:
            Q(s, a) ← Q(s, a) + η δ e(s, a)
            e(s, a) ← γλ e(s, a)
        s ← s′, a ← a′
    Until s is a terminal state

Sarsa(λ) algorithm.

Generalization
In all the reinforcement learning algorithms above, it is assumed that the Q(s, a) values (or V(s), if
values of states are estimated) are stored in a lookup table, and the algorithms are called tabular
algorithms. Some issues with this tabular approach are:
• When the number of states and the number of actions is large, the size of the table may become
quite large.
• States and actions may be continuous, and to use a table they should be discretized, which
may cause error.
• When the search space is large, too many episodes may be needed to fill in all the entries of
the table with acceptable accuracy.
Instead of storing the Q values, consider a regression problem, which is a supervised learning problem
with a regressor Q(s, a | θ), taking s and a as inputs and parameterized by a vector of parameters, θ,
to learn the Q values. This can be an artificial neural network with s and a as its inputs, one output, and
θ its connection weights.
A good approximation may be achieved with a simple model without explicitly storing the training
instances; it can use continuous inputs; and it allows generalization. If we know that similar (s, a) pairs
have similar Q values, we can generalize from past cases and come up with good Q(s, a) values even
if that state–action pair has never been encountered before.
To be able to train the regressor, a training set is required. In the case of Sarsa(0), we saw before that
we would like Q(sₜ, aₜ) to get close to rₜ₊₁ + γQ(sₜ₊₁, aₜ₊₁). So, we can form a set of training
samples where the input is the state–action pair (sₜ, aₜ) and the required output is

rₜ₊₁ + γ Q(sₜ₊₁, aₜ₊₁).

The squared error is as follows:

Eₜ(θ) = [rₜ₊₁ + γ Q(sₜ₊₁, aₜ₊₁) − Q(sₜ, aₜ)]²

If using a gradient-descent method, as in training neural networks, the parameter vector is updated as

Δθ = η [rₜ₊₁ + γ Q(sₜ₊₁, aₜ₊₁) − Q(sₜ, aₜ)] ∇_θ Q(sₜ, aₜ)

This is a one-step update. In the case of Sarsa(λ), the eligibility trace is also taken into account:

Δθ = η δₜ eₜ

where the temporal difference error is

δₜ = rₜ₊₁ + γ Q(sₜ₊₁, aₜ₊₁) − Q(sₜ, aₜ)

and the vector of eligibilities of the parameters is updated as

eₜ = γλ eₜ₋₁ + ∇_θ Q(sₜ, aₜ)

with e₀ all zeros. In the case of a tabular algorithm, the eligibilities are stored for the state–action
pairs, because they are the parameters (stored as a table). In the case of an estimator, eligibility is
associated with the parameters of the estimator. Any regression method can be used to train the Q
function, but the particular task has a number of requirements. First, it should allow generalization;
that is, we need to guarantee that similar states and actions have similar Q values. This also requires a
good coding of s and a, as in any application, to make the similarities apparent. Second, reinforcement
learning updates provide instances one by one and not as a whole training set, and the learning
algorithm should be able to do individual updates to learn the new instance without forgetting what
has been learned before.
Well-Posed Learning Problem:
A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
It includes 3 features:

Learning Task: the thing that needs to be learnt.

Training Experience: the fetching and processing of data for training the model.
Performance Measure: the critic that helps us recognize whether the results were correct or
incorrect.
• Example
Task T: playing checkers
Training Experience E: playing games against itself
Performance Measure P: percentage of games won against opponents.
