
SIMS 290-2: Applied Natural Language Processing

Algorithms for Classification
Barbara Rosario
October 4, 2004

Today
Binary classification
  Perceptron
  Winnow
  Support Vector Machines (SVM)
  Kernel Methods
Multi-class classification
  Decision Trees
  Naïve Bayes
  K nearest neighbor

Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent vs. not urgent)
Information retrieval (relevant, not relevant)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way problem like a binary one: one class versus all the others, for all classes (see the sketch below).
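The one-versus-all reduction can be made concrete with a short sketch (not from the slides; it assumes scikit-learn is available and uses its Perceptron as a stand-in for any binary learner):

    import numpy as np
    from sklearn.linear_model import Perceptron   # any binary classifier would do

    def train_one_vs_rest(X, y, classes):
        """Train one binary classifier per class: that class (+1) versus all the others (-1)."""
        models = {}
        for c in classes:
            binary_labels = np.where(y == c, 1, -1)
            models[c] = Perceptron().fit(X, binary_labels)
        return models

    def predict_one_vs_rest(models, x):
        """Predict the class whose one-vs-rest classifier is most confident about x."""
        return max(models, key=lambda c: models[c].decision_function([x])[0])

A new data item is then assigned to whichever of the binary classifiers scores it highest.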

Binary Classification
Given: some data items that belong to a positive (+1) or a negative (-1) class
Task: train the classifier and predict the class for a new data item
Geometrically: find a separator


Linear versus Non Linear algorithms

Linearly separable data

Linearly separable data: all the data points can be correctly classified by a linear (hyperplanar) decision boundary

[Figure: Class1 and Class2 separated by a linear decision boundary]

Non linearly separable data
[Figure: Class1 and Class2, not separable by a linear decision boundary]

Non Linear Classifier
[Figure: Class1 and Class2 separated by a non-linear decision boundary]

Linear versus Non Linear algorithms
Linear or non-linear separable data? We can find out only empirically.
Linear algorithms (algorithms that find a linear decision boundary): used when we think the data is linearly separable
  Advantages: simpler, fewer parameters
  Disadvantages: high-dimensional data (as in NLP) is usually not linearly separable
  Examples: Perceptron, Winnow, SVM
  Note: we can use linear algorithms also for non-linear problems (see Kernel methods)

Linear versus Non Linear algorithms
Non-linear algorithms: used when the data is non-linearly separable
  Advantages: more accurate
  Disadvantages: more complicated, more parameters
  Example: Kernel methods
  Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)

Linear binary classification
Data: {(xi, yi)} for i = 1..n, with xi in R^d (a feature vector in d-dimensional space) and yi in {-1, +1} the label (class, category)
Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error
Classification rule: y = sign(wx + b), that is:
  if wx + b > 0 then y = +1
  if wx + b < 0 then y = -1

Simple linear algorithms
Perceptron and Winnow algorithms
  Linear, binary classification
  Online (process data sequentially, one data point at a time)
  Mistake driven
  Simple single-layer neural networks
From Gert Lanckriet, Statistical Learning Theory Tutorial
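As a concrete rendering of the classification rule above (a minimal sketch, not part of the slides; it assumes NumPy):

    import numpy as np

    def linear_decision(x, w, b):
        """y = sign(wx + b): +1 on one side of the hyperplane, -1 on the other."""
        return 1 if np.dot(w, x) + b > 0 else -1

    # With w = [1, 1] and b = -4, the point (3, 2) gives wx + b = 1 > 0, so y = +1.
    print(linear_decision(np.array([3.0, 2.0]), np.array([1.0, 1.0]), -4.0))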

Linear binary classification
Find a good hyperplane (w, b) in R^(d+1) that correctly classifies data points as much as possible
In online fashion: one data point at a time, update the weights as necessary
Classification rule: y = sign(wx + b)

Perceptron algorithm
Initialize: w1 = 0
Updating rule, for each data point x:
  If class(x) != decision(x, w) then
    wk+1 = wk + yi xi
    k = k + 1
  Else
    wk+1 = wk
Function decision(x, w):
  If wx + b > 0 return +1
  Else return -1

Perceptron algorithm
Online: can adjust to a changing target over time
Advantages
  Simple and computationally efficient
  Guaranteed to learn a linearly separable problem (convergence, global optimum)
Limitations
  Only linear separations
  Only converges for linearly separable data
  Not really "efficient with many features"

Winnow algorithm
Another online algorithm for learning perceptron weights: f(x) = sign(wx + b)
Linear, binary classification
Update rule: again error driven, but multiplicative (instead of additive)

Winnow algorithm
Initialize: w1 = 0
Updating rule, for each data point x:
  If class(x) != decision(x, w) then
    wk+1 = wk * exp(yi xi)   (Perceptron: wk+1 = wk + yi xi)
    k = k + 1
  Else
    wk+1 = wk
Winnow is more robust to high-dimensional feature spaces

Perceptron vs. Winnow
Assume N available features, only K relevant ones, with K << N
Perceptron: number of mistakes O(K N)
Winnow: number of mistakes O(K log N)
From Gert Lanckriet, Statistical Learning Theory Tutorial
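The two update rules above can be sketched directly in code (illustrative Python with NumPy; the epoch count, the bias handling, and Winnow's all-ones initialization are assumptions, not taken from the slides):

    import numpy as np

    def perceptron_train(X, y, epochs=10):
        """Additive, mistake-driven updates: w <- w + yi*xi whenever a point is misclassified."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):                 # online: one data point at a time
                if yi * (np.dot(w, xi) + b) <= 0:    # mistake
                    w += yi * xi
                    b += yi
        return w, b

    def winnow_train(X, y, eta=0.5, epochs=10):
        """Multiplicative updates: w <- w * exp(eta*yi*xi) on a mistake."""
        w, b = np.ones(X.shape[1]), 0.0              # start from uniform positive weights
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:
                    w *= np.exp(eta * yi * xi)
        return w, b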

Perceptron vs. Winnow
Perceptron
  Online: can adjust to a changing target over time
  Advantages
    Simple and computationally efficient
    Guaranteed to learn a linearly separable problem
  Limitations
    Only linear separations
    Only converges for linearly separable data
    Not really "efficient with many features"
Winnow
  Online: can adjust to a changing target over time
  Advantages
    Simple and computationally efficient
    Guaranteed to learn a linearly separable problem
    Suitable for problems with many irrelevant attributes
  Limitations
    Only linear separations
    Only converges for linearly separable data
    Not really "efficient with many features"
  Winnow in Weka
Used in NLP

Large margin classifier
Another family of linear algorithms
Intuition (Vapnik, 1965): if the classes are linearly separable
  Separate the data
  Place the hyperplane "far" from the data: large margin
  Statistical results guarantee good generalization
[Figure: Maximal Margin Classifier]

Large margin classifier
If not linearly separable
  Allow some errors
  Still, try to place the hyperplane "far" from each class

Large Margin Classifiers
Advantages
  Theoretically better (better error bounds)
Limitations
  Computationally more expensive, (large) quadratic programming
From Gert Lanckriet, Statistical Learning Theory Tutorial
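In practice the quadratic program behind the maximal margin classifier is handled by a library. A minimal sketch with scikit-learn (an assumption; the course itself uses Weka):

    import numpy as np
    from sklearn.svm import SVC

    # Toy linearly separable data: two clusters along the diagonal.
    X = np.array([[0.0, 0.5], [1.0, 1.0], [0.5, 0.0],
                  [3.0, 3.5], [4.0, 4.0], [3.5, 3.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates the hard-margin case

    w = clf.coef_[0]
    print(clf.support_vectors_)                   # the points that sit on the margin
    print(2.0 / np.linalg.norm(w))                # width of the margin the classifier maximized
    print(clf.predict([[2.0, 2.0]]))              # classify a new data item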

Support Vector Machine (SVM): Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the margin
[Figure: separating hyperplane wT x + b = 0, margin boundaries wT xa + b = 1 and wT xb + b = -1, margin M, support vectors]

Support Vector Machine (SVM)
Applications
  Text classification
  Hand-writing recognition
  Computational biology (e.g., micro-array data)
  Face detection
  Face expression recognition
  Time series prediction

Non Linear problem
[Figure: data that no linear boundary can separate]

Kernel methods
A family of non-linear algorithms
Transform the non-linear problem into a linear one (in a different feature space)
Use linear algorithms to solve the linear problem in the new space

Main intuition of kernel methods
(Copy here from blackboard)
From Gert Lanckriet, Statistical Learning Theory Tutorial
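A sketch of this transform-then-solve-linearly idea (not from the slides; it assumes NumPy and scikit-learn, and uses the mapping Φ(x, z) = [x^2, z^2, xz] that appears on the next slides):

    import numpy as np
    from sklearn.linear_model import Perceptron

    def phi(X):
        """Map 2-d points [x, z] to [x^2, z^2, x*z]."""
        x, z = X[:, 0], X[:, 1]
        return np.column_stack([x * x, z * z, x * z])

    # Points inside a circle are +1, outside are -1: not linearly separable in the
    # original space, but the boundary x^2 + z^2 = 1.5 is linear in the mapped space.
    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, 1, -1)

    clf = Perceptron().fit(phi(X), y)     # a linear algorithm, applied in the new feature space
    print(clf.score(phi(X), y))           # close to 1.0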

Basic principle of kernel methods
Φ : R^d -> R^D  (D >> d)
Example: X = [x z], Φ(X) = [x^2 z^2 xz]
Classifier in the new space: f(x) = sign(w1 x^2 + w2 z^2 + w3 xz + b), i.e. the decision boundary is wT Φ(x) + b = 0

Basic principle of kernel methods
Linear separability: more likely in high dimensions
Mapping: Φ maps the input into a high-dimensional feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: an appropriate choice of Φ leads to linear separability
We can do this efficiently! (see the kernel-function sketch after the multi-class examples below)

Basic principle of kernel methods
We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher-dimensional space
From Gert Lanckriet, Statistical Learning Theory Tutorial

Multi-class classification
Given: some data items that belong to one of M possible classes
Task: train the classifier and predict the class for a new data item
Geometrically: a harder problem, no more simple geometry

Multi-class classification: Examples
Author identification
Language identification
Text categorization (topics)
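The "efficiently" above is the kernel trick: a kernel function returns the dot product in the high-dimensional space without ever computing Φ. A small numeric check (illustrative; the degree-2 polynomial kernel and its feature map are standard choices, not taken from the slides):

    import numpy as np

    def phi2(v):
        """One explicit feature map whose dot products the degree-2 polynomial kernel reproduces."""
        x, z = v
        return np.array([x * x, z * z, np.sqrt(2) * x * z, np.sqrt(2) * x, np.sqrt(2) * z, 1.0])

    def poly_kernel(u, v):
        """k(u, v) = (u.v + 1)^2, computed in the original low-dimensional space."""
        return (np.dot(u, v) + 1.0) ** 2

    u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(poly_kernel(u, v))            # 0.25
    print(np.dot(phi2(u), phi2(v)))     # 0.25 as well: same inner product, no explicit mapping needed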

Algorithms for Multi-class classification
Linear
  Parallel class separators: Decision Trees
  Non-parallel class separators: Naïve Bayes
Non-linear
  K-nearest neighbors

[Figure: linear, parallel class separators (ex: Decision Trees)]
[Figure: linear, non-parallel class separators (ex: Naïve Bayes)]
[Figure: non-linear class separators (ex: k Nearest Neighbor)]

Training Examples
Goal: learn when we can play Tennis and when we cannot

Day   Outlook   Temp.  Humidity  Wind    PlayTennis
D1    Sunny     Hot    High      Weak    No
D2    Sunny     Hot    High      Strong  No
D3    Overcast  Hot    High      Weak    Yes
D4    Rain      Mild   High      Weak    Yes
D5    Rain      Cool   Normal    Weak    Yes
D6    Rain      Cool   Normal    Strong  No
D7    Overcast  Cool   Normal    Strong  Yes
D8    Sunny     Mild   High      Weak    No
D9    Sunny     Cool   Normal    Weak    Yes
D10   Rain      Mild   Normal    Weak    Yes
D11   Sunny     Mild   Normal    Strong  Yes
D12   Overcast  Mild   High      Strong  Yes
D13   Overcast  Hot    Normal    Weak    Yes
D14   Rain      Mild   High      Strong  No

Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
  a leaf node, which indicates the value of the target attribute (class) of examples, or
  a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node, which provides the classification of the instance.
http://dms.irb.hr/tutorial/tut_dtrees.php
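The root-to-leaf classification procedure can be sketched in a few lines (not from the slides; the tree encoded here is the PlayTennis tree shown on the following slides):

    # Each decision node is (attribute, {value: subtree}); each leaf is a class label.
    play_tennis_tree = (
        "Outlook",
        {
            "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
            "Overcast": "Yes",
            "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
        },
    )

    def classify(tree, example):
        """Start at the root and follow the branch for the attribute's value until a leaf."""
        while isinstance(tree, tuple):
            attribute, branches = tree
            tree = branches[example[attribute]]
        return tree

    print(classify(play_tennis_tree,
                   {"Outlook": "Sunny", "Temperature": "Hot",
                    "Humidity": "High", "Wind": "Weak"}))     # -> "No"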

Decision Tree for PlayTennis
[Figure: Outlook at the root (Sunny, Overcast, Rain); under Sunny, Humidity (High -> No, Normal -> Yes); under Overcast, Yes; under Rain, Wind (Strong -> No, Weak -> Yes)]
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp

Decision Tree for PlayTennis
Classifying the example Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak:
  Outlook = Sunny, then Humidity = High, so PlayTennis = No

Decision Tree for Reuters classification
Foundations of Statistical Natural Language Processing, Manning and Schuetze

Building Decision Trees
Given training data, how do we construct them?
The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples.
Top-down, greedy search through the space of possible decision trees: it picks the best attribute and never looks back to reconsider earlier choices.
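One standard way to score "the attribute that is most useful" is information gain, which the next slide names as the splitting criterion. A minimal sketch (plain Python, not from the slides):

    from collections import Counter, defaultdict
    from math import log2

    def entropy(labels):
        """H(S) = -sum p(c) * log2 p(c) over the class distribution of the labels."""
        n = len(labels)
        return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """Entropy of the node minus the weighted entropy of the children after splitting."""
        groups = defaultdict(list)
        for ex, lab in zip(examples, labels):
            groups[ex[attribute]].append(lab)
        n = len(labels)
        remainder = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    # On the PlayTennis data above, Outlook has the highest gain, so it becomes the root test.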

Building Decision Trees
Splitting criterion: finding the features and the values to split on
  For example, why test first "cts" and not "vs"?
  Why test on "cts = 2" and not on some other value?
  Split that gives us the maximum information gain (or the maximum reduction of uncertainty)
Stopping criterion: when all the elements at one node have the same class, no need to split further
In practice, one first builds a large tree and then prunes it back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing, Manning and Schuetze, for a good introduction

Decision Trees: Strengths
Decision trees are able to generate understandable rules
Decision trees perform classification without requiring much computation
Decision trees are able to handle both continuous and categorical variables
Decision trees provide a clear indication of which features are most important for prediction or classification
http://dms.irb.hr/tutorial/tut_dtrees.php

Decision Trees: Weaknesses
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples
Decision trees can be computationally expensive to train
  Need to compare all possible splits
  Pruning is also expensive
Most decision tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space
http://dms.irb.hr/tutorial/tut_dtrees.php

Decision Trees
Decision Trees in Weka

Naïve Bayes
More powerful than Decision Trees
[Figure: decision boundaries of Decision Trees versus Naïve Bayes]

Naïve Bayes Models
Graphical Models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
[Figure: A -> B, A -> C, with P(A), P(B|A), P(C|A)]
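The factorization the edges encode can be spelled out numerically (a toy example with made-up probabilities, only to illustrate how the graph A -> B, A -> C is read):

    # P(A, B, C) = P(A) * P(B | A) * P(C | A)   when C depends only on A (no edge B -> C)
    p_a = {True: 0.3, False: 0.7}
    p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
    p_c_given_a = {True: {True: 0.5, False: 0.5}, False: {True: 0.4, False: 0.6}}

    def joint(a, b, c):
        return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

    print(joint(True, True, False))   # 0.3 * 0.8 * 0.5 = 0.12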

Naïve Bayes Models
Graphical Models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
Absence of an edge between nodes implies independence between the variables of the nodes
[Figure: A -> B, A -> C, with P(A), P(B|A), P(C|A); with an extra edge into C the last factor becomes P(C|A,B)]
Foundations of Statistical Natural Language Processing, Manning and Schuetze

Naïve Bayes for text classification
[Figure: Topic node with children w1 w2 w3 w4 ... wn-1 wn]
The words depend on the topic: P(wi | Topic)
P(cts | earn) > P(tennis | earn)

Naïve Bayes for text classification
[Figure: Topic = earn generating the words of a Reuters headline "Shr 34 cts vs per shr"]
Naïve Bayes assumption: all words are independent given the topic
From the training set we learn the probabilities P(wi | Topic) for each word and for each topic

Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
  P(Topic, w1, w2, ..., wn) = P(Topic) ∏ P(wi | Topic)
We learn P(Topic) and P(wi | Topic) in training
At test time we need P(Topic | w1, w2, ..., wn):
  P(Topic | w1, w2, ..., wn) = P(Topic, w1, w2, ..., wn) / P(w1, w2, ..., wn)

Naïve Bayes for text classification
To classify a new example:
  Calculate P(Topic | w1, w2, ..., wn) for each topic
  Bayes decision rule: choose the topic T' for which
    P(T' | w1, w2, ..., wn) > P(T | w1, w2, ..., wn) for each T ≠ T'
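The training and decision rule above translate almost line for line into code. A minimal sketch (plain Python; the add-one smoothing and the log-space arithmetic are practical assumptions, not on the slides):

    from collections import Counter, defaultdict
    from math import log

    def train_naive_bayes(docs):
        """docs: list of (topic, list of words). Learn P(Topic) and P(w | Topic)."""
        topic_counts = Counter(topic for topic, _ in docs)
        word_counts = defaultdict(Counter)
        vocab = set()
        for topic, words in docs:
            word_counts[topic].update(words)
            vocab.update(words)
        return topic_counts, word_counts, vocab, len(docs)

    def classify_naive_bayes(words, model):
        """Bayes decision rule: choose the topic maximizing log P(Topic) + sum log P(wi | Topic)."""
        topic_counts, word_counts, vocab, n_docs = model
        def score(topic):
            total = sum(word_counts[topic].values())
            s = log(topic_counts[topic] / n_docs)
            for w in words:
                s += log((word_counts[topic][w] + 1) / (total + len(vocab)))   # add-one smoothing
            return s
        return max(topic_counts, key=score)

    model = train_naive_bayes([("earn", ["shr", "cts", "vs", "shr"]),
                               ("sport", ["tennis", "match", "win"])])
    print(classify_naive_bayes(["cts", "shr"], model))   # -> "earn"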

a4ve Bayes: #tren!t"s Very simple model 3asy to understand Very easy to implement a4ve Bayes: 'ea)nesses #a$%e Bayes independence assumption has two conse7uences+ "he linear ordering of words is ignored (&ag of words model) "he words are independent of each other gi%en the class+ @alse 1 President is more li*ely to occur in a conte4t that contains election than in a conte4t that contains poet Very efficient' fast training and classification Modest space storage Widely used &ecause it wor*s really well for te4t categoriAation /inear' &ut non parallel decision &oundaries #a$%e Bayes assumption is inappropriate if there are strong conditional dependencies &etween the %aria&les (But e%en if the model is not ErightF' #a$%e Bayes models do well in a surprisingly large num&er of cases &ecause often we are interested in classification accuracy and not in accurate pro&a&ility estimations) 61 62 a4ve Bayes #a$%e Bayes in We*a k earest ei!"bor Classification #earest #eigh&or classification rule+ to classify a new o&Dect' find the o&Dect in the training set that is most similar( "hen assign the category of this nearest neigh&or K #earest #eigh&or (K##)+ consult * nearest neigh&ors( !ecision &ased on the maDority category of these neigh&ors( More ro&ust than * : . 34ample of similarity measure often used in #/P is cosine similarity 63 64 %1 earest ei!"bor %1 earest ei!"bor 65 66 .

3-Nearest Neighbor
[Figure: the new point and its three nearest neighbors]
Assign the category of the majority of the neighbors

3-Nearest Neighbor
But this one is closer... We can weight neighbors according to their similarity
[Figure: the same neighbors, weighted by similarity]

k Nearest Neighbor Classification
Strengths
  Robust
  Conceptually simple
  Often works well
  Powerful (arbitrary decision boundaries)
Weaknesses
  Performance is very dependent on the similarity measure used (and, to a lesser extent, on the number of neighbors k used)
  Finding a good similarity measure can be difficult
  Computationally expensive

Summary
Algorithms for Classification
Linear versus non-linear classification
Binary classification
  Perceptron
  Winnow
  Support Vector Machines (SVM)
  Kernel Methods
Multi-class classification
  Decision Trees
  Naïve Bayes
  K nearest neighbor
On Wednesday: Weka