
The Perceptron

Piyush Rai

Machine Learning (CS771A)

Aug 20, 2016

Recall the gradient descent (GD) update rule for (unregularized) logistic regression:

$$w^{(t+1)} = w^{(t)} - \eta_t \sum_{n=1}^{N} (\mu_n^{(t)} - y_n)\, x_n \qquad \text{where } \mu_n^{(t)} = \frac{1}{1 + \exp(-w^{(t)\top} x_n)}$$

The corresponding stochastic (single-example) update is

$$w^{(t+1)} = w^{(t)} - \eta_t (\mu_n^{(t)} - y_n)\, x_n$$

where $\eta_t$ is the learning rate at update $t$ (typically decreasing with $t$).
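As a minimal sketch of this update (function and variable names are mine, not from the lecture), assuming labels in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_step(w, x_n, y_n, lr):
    """One SGD update for logistic regression with labels in {0, 1}.

    mu_n is the predicted probability of the positive class; the
    per-example gradient of the log loss is (mu_n - y_n) * x_n.
    """
    mu_n = sigmoid(w @ x_n)
    return w - lr * (mu_n - y_n) * x_n

# Starting from w = 0, mu_n = 0.5, so a positive example pulls w
# toward x_n by lr * 0.5 * x_n.
w = np.zeros(2)
w = sgd_logistic_step(w, np.array([1.0, 2.0]), 1, lr=0.1)
```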

Let's replace the predicted label probability $\mu_n^{(t)}$ by the predicted binary label $\hat{y}_n^{(t)}$:

$$w^{(t+1)} = w^{(t)} - \eta_t (\hat{y}_n^{(t)} - y_n)\, x_n$$

where

$$\hat{y}_n^{(t)} = \begin{cases} 1 & \text{if } \mu_n^{(t)} \ge 0.5 \text{ (equivalently, } w^{(t)\top} x_n \ge 0\text{)} \\ 0 & \text{if } \mu_n^{(t)} < 0.5 \text{ (equivalently, } w^{(t)\top} x_n < 0\text{)} \end{cases}$$

Thus $w^{(t)}$ gets updated only when $\hat{y}_n^{(t)} \ne y_n$ (i.e., when $w^{(t)}$ mispredicts).

Mistake-Driven Learning

Consider the mistake-driven SGD update rule

$$w^{(t+1)} = w^{(t)} - \eta_t (\hat{y}_n^{(t)} - y_n)\, x_n$$

Let's, from now on, assume the binary labels to be $\{-1,+1\}$, not $\{0,1\}$. Then

$$\hat{y}_n^{(t)} - y_n = \begin{cases} -2y_n & \text{if } \hat{y}_n^{(t)} \ne y_n \\ 0 & \text{if } \hat{y}_n^{(t)} = y_n \end{cases}$$

so the update fires only on mistakes:

$$w^{(t+1)} = w^{(t)} + 2\eta_t\, y_n x_n$$

For $\eta_t = 0.5$, this is akin to the Perceptron (a hyperplane based learning algorithm):

$$w^{(t+1)} = w^{(t)} + y_n x_n$$

Note: There are other ways of deriving the Perceptron rule (will see shortly).
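The derivation above can be sketched directly in code (names are mine, not from the lecture): with labels in {-1, +1}, the step changes $w$ only on a mistake, and for a learning rate of 0.5 it reduces to the Perceptron rule.

```python
import numpy as np

def mistake_driven_step(w, x_n, y_n, lr=0.5):
    """Mistake-driven update with labels in {-1, +1}.

    y_hat - y_n is -2*y_n on a mistake and 0 otherwise, so
    w <- w - lr*(y_hat - y_n)*x_n becomes w + 2*lr*y_n*x_n on
    mistakes; for lr = 0.5 this is exactly w <- w + y_n*x_n.
    """
    y_hat = 1 if w @ x_n >= 0 else -1
    return w - lr * (y_hat - y_n) * x_n

# A mistake (w = 0 predicts +1, true label -1) triggers w + y*x = -x;
# a correct prediction leaves w untouched.
w = mistake_driven_step(np.zeros(3), np.array([1.0, -1.0, 2.0]), -1)
```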


The Perceptron is one of the earliest algorithms for binary classification (Rosenblatt, 1958). It learns a linear hyperplane to separate the two classes.

It is an online algorithm: it learns from one example at a time. It is also highly scalable for large training data sets (like SGD-based logistic regression and other SGD-based learning algorithms).

It is guaranteed to find a separating hyperplane if the data is linearly separable.

If the data is not linearly separable:
- First make it linearly separable (more when we discuss Kernel Methods)
- .. or use multi-layer Perceptrons (more when we discuss Deep Learning)


Hyperplanes

A hyperplane $w^\top x + b = 0$ separates a $D$-dimensional space into two half-spaces (positive and negative). The weight vector $w \in \mathbb{R}^D$ is orthogonal to any vector $x$ lying on the hyperplane; for a hyperplane through the origin, $w^\top x = 0$.

The bias $b$ moves the hyperplane parallel to itself: $b > 0$ means moving it along $w$ ($b < 0$ means in the opposite direction).

The decision boundary is represented by the hyperplane defined by $w$ and $b$, with classification rule $y = \text{sign}(w^\top x + b)$:

$$w^\top x + b > 0 \;\Rightarrow\; y = +1 \qquad\qquad w^\top x + b < 0 \;\Rightarrow\; y = -1$$

Note: Some algorithms that we have already seen (e.g., distance from means, logistic regression) can also be viewed as learning hyperplanes.

Notion of Margins

The geometric margin $\gamma_n$ of an example $x_n$ is its signed distance from the hyperplane:

$$\gamma_n = \frac{w^\top x_n + b}{\|w\|}$$

The margin of a set $\{x_1, \ldots, x_N\}$ w.r.t. $w$ is the minimum absolute geometric margin:

$$\gamma = \min_{1 \le n \le N} |\gamma_n| = \min_{1 \le n \le N} \frac{|w^\top x_n + b|}{\|w\|}$$

The signed quantity $y_n \gamma_n$ is positive if $w$ predicts $y_n$ correctly, and negative if $w$ predicts $y_n$ incorrectly.
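These two definitions can be sketched numerically (a minimal illustration; the helper names are mine, not from the lecture):

```python
import numpy as np

def geometric_margins(w, b, X):
    """Signed distance of each row of X from the hyperplane w^T x + b = 0."""
    return (X @ w + b) / np.linalg.norm(w)

def set_margin(w, b, X):
    """Margin of a set of points: the minimum absolute geometric margin."""
    return np.min(np.abs(geometric_margins(w, b, X)))

# Hyperplane x1 = 0 (w = [1, 0], b = 0): signed distances are just
# the first coordinates of the points.
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 1.0], [-3.0, 5.0]])
margins = geometric_margins(w, b, X)   # signed distances 2 and -3
gamma = set_margin(w, b, X)            # min |distance| = 2
```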

For a hyperplane based model, let's consider the following loss function:

$$\ell(w, b) = \sum_{n=1}^{N} \ell_n(w, b) = \sum_{n=1}^{N} \max\{0,\; -y_n(w^\top x_n + b)\}$$

The loss is zero when an example is classified correctly; otherwise the model incurs some positive loss when $y_n(w^\top x_n + b) < 0$.

We perform stochastic optimization on this loss. On a mistake, the stochastic (sub)gradients are

$$\frac{\partial \ell_n(w, b)}{\partial w} = -y_n x_n \qquad\qquad \frac{\partial \ell_n(w, b)}{\partial b} = -y_n$$

Upon every mistake, the update rule for $w$ and $b$ (assuming learning rate $\eta_t = 1$) is

$$w \leftarrow w + y_n x_n \qquad\qquad b \leftarrow b + y_n$$


Given: Training data $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

Initialize: $w^{old} = [0, \ldots, 0]$, $b^{old} = 0$

Repeat until convergence:
- Pick a random $(x_n, y_n) \in D$
- If $\text{sign}(w^\top x_n + b) \ne y_n$, i.e., $y_n(w^\top x_n + b) \le 0$ (i.e., a mistake is made), update
  - $w^{new} \leftarrow w^{old} + y_n x_n$
  - $b^{new} \leftarrow b^{old} + y_n$

Convergence means all training examples are classified correctly (running until then may overfit, so it is less common in practice). One pass over the data, with each example seen once, is called an epoch. The Perceptron also works in the purely online setting, e.g., with examples arriving in a streaming fashion that can't be stored in memory.
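The algorithm above can be sketched as follows (a minimal implementation for illustration; the function name, toy data, and the epoch cap are my additions):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100, seed=0):
    """Train a Perceptron on labels in {-1, +1}.

    Runs until a full pass over the data makes no mistakes
    (convergence on linearly separable data) or max_epochs is hit.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for n in rng.permutation(N):          # random example order
            if y[n] * (w @ X[n] + b) <= 0:    # mistake (ties count)
                w += y[n] * X[n]
                b += y[n]
                mistakes += 1
        if mistakes == 0:                     # clean pass: converged
            break
    return w, b

# A linearly separable toy set: positives upper-right, negatives lower-left.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
preds = np.sign(X @ w + b)
```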


Let's look at a misclassified positive example ($y_n = +1$): the Perceptron (wrongly) predicts $w_{old}^\top x_n + b_{old} < 0$.

The updates would be

$$w_{new} = w_{old} + y_n x_n = w_{old} + x_n \quad (\text{since } y_n = +1)$$
$$b_{new} = b_{old} + y_n = b_{old} + 1$$

After the update,

$$w_{new}^\top x_n + b_{new} = (w_{old} + x_n)^\top x_n + b_{old} + 1 = (w_{old}^\top x_n + b_{old}) + x_n^\top x_n + 1$$

Thus $w_{new}^\top x_n + b_{new}$ is less negative than $w_{old}^\top x_n + b_{old}$.

(Figure: geometric illustration of the update $w_{new} = w_{old} + x_n$ on the misclassified positive example.)

Now let's look at a misclassified negative example ($y_n = -1$): the Perceptron (wrongly) predicts $w_{old}^\top x_n + b_{old} > 0$.

The updates would be

$$w_{new} = w_{old} + y_n x_n = w_{old} - x_n \quad (\text{since } y_n = -1)$$
$$b_{new} = b_{old} + y_n = b_{old} - 1$$

After the update,

$$w_{new}^\top x_n + b_{new} = (w_{old} - x_n)^\top x_n + b_{old} - 1 = (w_{old}^\top x_n + b_{old}) - x_n^\top x_n - 1$$

Thus $w_{new}^\top x_n + b_{new}$ is less positive than $w_{old}^\top x_n + b_{old}$.

(Figure: geometric illustration of the update $w_{new} = w_{old} - x_n$ on the misclassified negative example.)

Convergence of Perceptron

Theorem (Block & Novikoff): If the training data is linearly separable with margin $\gamma$ by a unit norm hyperplane $w^*$ ($\|w^*\| = 1$) with $b = 0$, then the Perceptron converges after at most $R^2/\gamma^2$ mistakes during training (assuming $\|x_n\| \le R$ for all $n$).

Proof: Since $\|w^*\| = 1$, linear separability with margin $\gamma$ means

$$y_n w^{*\top} x_n = \frac{y_n w^{*\top} x_n}{\|w^*\|} \ge \gamma \quad \text{for all } n$$

Suppose the Perceptron makes its $k$-th mistake on example $(x_n, y_n)$, i.e., $y_n w_k^\top x_n \le 0$, and updates $w_{k+1} = w_k + y_n x_n$. Then:

$$w_{k+1}^\top w^* = w_k^\top w^* + y_n x_n^\top w^* \ge w_k^\top w^* + \gamma \quad (1)$$

(why is this nice? every mistake grows the projection on $w^*$ by at least $\gamma$), so after $k$ mistakes, $w_{k+1}^\top w^* \ge k\gamma$.

$$\|w_{k+1}\|^2 = \|w_k\|^2 + 2 y_n w_k^\top x_n + \|x_n\|^2 \le \|w_k\|^2 + R^2 \quad (2)$$

(using $y_n w_k^\top x_n \le 0$ and $\|x_n\| \le R$), so after $k$ mistakes, $\|w_{k+1}\|^2 \le kR^2$, i.e., $\|w_{k+1}\| \le \sqrt{k}\,R$.

Combining (1) and (2) via the Cauchy–Schwarz inequality,

$$k\gamma \le w_{k+1}^\top w^* \le \|w_{k+1}\|\,\|w^*\| = \|w_{k+1}\| \le \sqrt{k}\,R$$

which gives $k \le R^2/\gamma^2$. $\square$

Nice Thing: The convergence rate depends neither on the number of training examples $N$ nor on the data dimensionality $D$. It depends only on the margin!
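The bound can be checked empirically on a toy separable dataset (a sketch; the data, the chosen unit-norm $w^*$, and the pass cap are my assumptions, and the bias term is dropped to match the theorem's $b = 0$ setting):

```python
import numpy as np

# Toy separable data and a unit-norm separator w_star (an assumption).
X = np.array([[1.0, 0.2], [0.8, 0.5], [-1.0, -0.1], [-0.6, -0.9]])
y = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 0.0])           # ||w_star|| = 1, separates the data

gamma = np.min(y * (X @ w_star))        # margin w.r.t. w_star
R = np.max(np.linalg.norm(X, axis=1))   # radius of the data

# Run the Perceptron (no bias) and count mistakes until a clean pass.
w = np.zeros(2)
mistakes = 0
for _ in range(100):
    made_mistake = False
    for n in range(len(y)):
        if y[n] * (w @ X[n]) <= 0:
            w += y[n] * X[n]
            mistakes += 1
            made_mistake = True
    if not made_mistake:
        break

bound = R**2 / gamma**2                 # Novikoff bound on mistakes
```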


The Perceptron finds one of the many possible hyperplanes separating the data (if one exists). Intuitively, a hyperplane with a large margin leads to good generalization on the test data.


Support Vector Machine (SVM)

One of the most popular/influential classification algorithms, backed by solid theoretical groundings (Cortes and Vapnik, 1995).

A hyperplane based classifier (like the Perceptron) that additionally uses the Maximum Margin Principle: it finds the hyperplane with the maximum separation margin on the training data.


A hyperplane based linear classifier is defined by $w$ and $b$, with prediction rule $y = \text{sign}(w^\top x + b)$.

Given: Training data $\{(x_1, y_1), \ldots, (x_N, y_N)\}$

Goal: Learn $w$ and $b$ that achieve a small training error and the maximum margin.

For now, assume the entire training data can be correctly classified by $(w, b)$, i.e., zero loss on the training examples (the non-zero loss case comes later). Scale $(w, b)$ so that

$$w^\top x_n + b \ge +1 \text{ for } y_n = +1 \qquad\qquad w^\top x_n + b \le -1 \text{ for } y_n = -1$$

or equivalently $y_n(w^\top x_n + b) \ge 1$, with $\min_{1 \le n \le N} |w^\top x_n + b| = 1$.

The hyperplane's margin is then

$$\gamma = \min_{1 \le n \le N} \frac{|w^\top x_n + b|}{\|w\|} = \frac{1}{\|w\|}$$


We want to maximize the margin $\gamma = \frac{1}{\|w\|}$, which is the same as minimizing $\|w\|^2$. Our optimization problem would be:

$$\text{Minimize } f(w, b) = \frac{\|w\|^2}{2} \qquad \text{subject to } y_n(w^\top x_n + b) \ge 1, \quad n = 1, \ldots, N$$
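The canonical-scaling step and the resulting margin can be sketched numerically (an illustration, not an SVM solver; the helper name, candidate hyperplane, and toy data are my assumptions):

```python
import numpy as np

def canonicalize(w, b, X, y):
    """Rescale a separating (w, b) so min_n y_n (w^T x_n + b) = 1,
    putting it in the canonical form assumed by the SVM constraints."""
    s = np.min(y * (X @ w + b))
    assert s > 0, "the hyperplane must separate the data"
    return w / s, b / s

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane, rescaled to canonical form.
w, b = canonicalize(np.array([1.0, 1.0]), 0.0, X, y)
margin = 1.0 / np.linalg.norm(w)                    # gamma = 1 / ||w||
feasible = np.all(y * (X @ w + b) >= 1 - 1e-12)     # constraint check
```

The SVM then searches over all such canonical $(w, b)$ for the one with the smallest $\|w\|$, i.e., the largest margin.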


We can give a slightly more formal justification to this. Recall: margin $\gamma = \frac{1}{\|w\|}$.

A small $\|w\|$ gives regularized/simple solutions (the individual weights $w_i$ don't become too large), and simple solutions give good generalization on test data.


Next class..

Introduction to kernel methods (nonlinear SVMs)

