
Sargur Srihari

srihari@cedar.buffalo.edu


Machine Learning Srihari

Simplest form of linear regression: a linear function of a single input variable

  y(x,w) = w0 + w1 x

The task is to learn the weights w0, w1 from data D = {(xn, tn)}, n = 1,...,N.

Polynomial curve fitting:

  y(x,w) = w0 + w1 x + w2 x² + ... = Σi wi xⁱ

A linear combination of nonlinear functions of the input variable, φi(x), instead of xⁱ: these are called basis functions. Such models are linear functions of the parameters (which gives them simple analytical properties), yet nonlinear with respect to the input variables.
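As a minimal sketch of polynomial curve fitting by least squares (the data, degree, and noise level below are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy samples of a smooth function
N = 50
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Polynomial basis: phi_i(x) = x^i, i = 0..M-1
M = 4
Phi = np.vander(x, M, increasing=True)   # N x M design matrix

# Least-squares fit of the weights w in y(x, w) = sum_i w_i x^i
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

y = Phi @ w   # predictions at the training inputs
```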

Plan of Discussion

Discuss supervised learning, starting with regression.

Goal: predict the value of one or more target variables t, given a d-dimensional vector x of input variables.

Terminology:
- Regression: t is continuous-valued
- Classification: t takes values from a set of labels (unordered categories)
- Ordinal regression: t takes discrete values from ordered categories

[Figure: polynomial fit of target variable t against a single scalar input variable x]

Generalizations

Predict the value of a continuous target variable t given the values of d input variables x = [x1,...,xd].
t can also be a set of variables (multiple regression).
Use linear functions of adjustable parameters; specifically, linear combinations of nonlinear functions of the input variables.

Regression with d input variables

  y(x,w) = w0 + w1 x1 + ... + wd xd = wᵀx

with parameters w0,...,wd and input variables x1,...,xd. This differs from linear regression with one variable and from polynomial regression with one variable. In the one-dimensional case it amounts to a straight-line fit (a degree-one polynomial):

  y(x,w) = w0 + w1 x

Assume t is a function of the inputs x1, x2,...,xd (independent variables). The goal is to find the best linear regressor of t on all the inputs. For d = 2 this is fitting a plane through the N input samples; in d dimensions, a hyperplane.

[Table: sample data with columns x1, x2, t]
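The d = 2 plane fit above can be sketched with ordinary least squares; the synthetic data and the true coefficients (1, 2, −3) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: t depends linearly on two inputs plus a little noise
N = 100
X = rng.standard_normal((N, 2))                  # columns x1, x2
t = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.01 * rng.standard_normal(N)

# Augment with a column of ones so w0 is the intercept of the plane
Xa = np.hstack([np.ones((N, 1)), X])             # y = w0 + w1 x1 + w2 x2
w, *_ = np.linalg.lstsq(Xa, t, rcond=None)
```

The recovered w should be close to the generating coefficients.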

LeToR (Learning to Rank): Multiple Inputs

Target value: t is discrete (e.g., 1, 2,..., 6) in the training set, but a continuous value in [1,6] is learnt and used to rank objects.

Input (xi): d features of a query–URL pair (d > 200), e.g.:
- Log frequency of query in anchor text
- Query word in color on page
- Number of images on page
- Number of (out) links on page
- PageRank of page
- URL length
- URL contains ~
- Page length

Output (y): relevance value.

Target variable:
- Point-wise (0, 1, 2, 3)
- Regression returns a continuous value, which allows fine-grained ranking of URLs

The data set has 46 query–document features and a maximum of 124 URLs per query; the Yahoo! data set has d = 700.

Input Features

See http://research.microsoft.com/en-us/projects/mslr/feature.aspx

Feature List of Microsoft Learning to Rank Datasets. Each description spans five streams, in the order body, anchor, title, url, whole document:

feature ids   description
1-5           covered query term number
6-10          covered query term ratio
11-15         stream length
16-20         IDF (inverse document frequency)
21-25         sum of term frequency
26-30         min of term frequency
31-35         max of term frequency
36-40         mean of term frequency
41-45         variance of term frequency
46-50         sum of stream length normalized term frequency

Feature Statistics

Most of the 46 features are normalized as continuous values from 0 to 1; the exception is that some features are all 0s.

Feature 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Min 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Max 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1

Mean 0.254 0.1598 0.1392 0.2158 0.1322 0.1614 0 0 0 0 0 0.2841 0.1382 0.2109 0.1218

16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

0.2879 0.1533 0.2258 0.3057 0.3332 0.1534 0.5473 0.5592 0.5453 0.5622 0.1675 0.1377 0.1249 0.126 0.2109 0.1705

32 33 34 35 36 37 38 39 40 41 42 43 44 45 46

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 0 1 1 1

0.1694 0.1603 0.1275 0.0762 0.0762 0.0728 0.5479 0.5574 0.5502 0.5673 0.4321 0.3361 0 0.1065 0.1211

Linear models can be extended by considering nonlinear functions of the input variables:

  y(x,w) = w0 + Σ_{j=1}^{M−1} wj φj(x)

We now need M weights for the basis functions instead of d weights for the features. Defining φ0(x) = 1, this can be written as

  y(x,w) = Σ_{j=0}^{M−1} wj φj(x) = wᵀφ(x)
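A design matrix for an arbitrary list of basis functions can be built as below; the particular basis (1, x, x²) and weights are illustrative assumptions:

```python
import numpy as np

def design_matrix(x, basis_funcs):
    """N x M matrix Phi with Phi[n, j] = phi_j(x_n)."""
    return np.column_stack([phi(x) for phi in basis_funcs])

# M = 3 basis functions, with phi_0(x) = 1 absorbing the bias w0
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

x = np.linspace(-1, 1, 20)
Phi = design_matrix(x, basis)          # shape (20, 3)
w = np.array([0.5, 1.0, -2.0])         # some weights
y = Phi @ w                            # y(x, w) = w^T phi(x) for every sample
```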

Linear Basis Function Model

  y(x,w) = Σ_{j=0}^{M−1} wj φj(x) = wᵀφ(x)

With φj(x) = xʲ this is a degree M−1 polynomial.

Disadvantages:
Global: changes in one region of input space affect others.
Difficult to formulate: the number of polynomial terms increases exponentially with M.

Can divide the input space into regions and use a different polynomial in each region: this is equivalent to spline functions.

(Not used in practice!)

Applying a scalar polynomial basis directly to the input vector converts the vector into a scalar value, thereby losing information.

Better approach: a polynomial of degree M−1 has terms with the variables taken none, one, two, ..., up to M−1 at a time. Use a multi-index j = (j1, j2,..., jd) such that j1 + j2 + ... + jd ≤ M−1.

For a quadratic (M = 3) with three variables (d = 3):

  y(x,w) = Σ_{(j1,j2,j3)} w_{j1,j2,j3} φ_{j1,j2,j3}(x)
         = w_{0,0,0} + w_{1,0,0} x1 + w_{0,1,0} x2 + w_{0,0,1} x3 + w_{1,1,0} x1 x2 + w_{1,0,1} x1 x3 + ...

For d = 46 the total number of such terms is 1128.

Gaussian basis function:

  φj(x) = exp( −(x − μj)² / (2s²) )

It does not necessarily have a probabilistic interpretation. The usual normalization term is unimportant, since the basis function is multiplied by a weight wj.

Choice of parameters:
The μj govern the locations of the basis functions. They can be an arbitrary set of points within the range of the data; one can choose some representative data points.
s governs the spatial scale. It could be chosen from the data set, e.g., the average variance.

Several variables: a Gaussian kernel would be chosen for each dimension; for each j a different set of means would be needed, perhaps chosen from the data:

  φj(x) = exp( −½ (x − μj)ᵀ Σ⁻¹ (x − μj) )
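A sketch of the one-dimensional Gaussian basis above; the choice of centres (evenly spread over the data range) and scale s = 0.2 are illustrative assumptions:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)); mu holds the M centres."""
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

x = np.linspace(0, 1, 30)
mu = np.linspace(0, 1, 5)          # centres spread over the data range
s = 0.2                            # spatial scale
Phi = gaussian_basis(x, mu, s)     # shape (30, 5); each column peaks at its centre
```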

Sigmoid basis function:

  φj(x) = σ( (x − μj)/s )   where   σ(a) = 1 / (1 + exp(−a))

The logistic sigmoid is related to tanh by tanh(a) = 2σ(2a) − 1, so an equivalent tanh basis can also be used.

[Figure: logistic sigmoid]

Fourier basis: expansion in sinusoidal functions; each has infinite spatial extent.

In signal processing, functions localized in both time and frequency are called wavelets; they are useful for lattices such as images and time series.

The further discussion is independent of the choice of basis, including φ(x) = x.

Probabilistic Formulation

Given: a data set of N observations {xn}, n = 1,..., N, with corresponding target values {tn}.

Goal: predict the value of t for a new value of x.

Simplest approach: construct a function y(x) whose values are the predictions.

Probabilistic approach: model the predictive distribution p(t|x), which expresses the uncertainty about the value of t for each x. From this conditional distribution we can predict t for any value of x so as to minimize a loss function. The typical loss is squared loss, for which the solution is the conditional expectation of t.

Maximum Likelihood and Least Squares

We will show that minimizing the sum-of-squared errors is the same as the maximum likelihood solution under a Gaussian noise model.

The target variable is a scalar t given by a deterministic function y(x,w) with additive Gaussian noise:

  t = y(x,w) + ε

where ε is zero-mean Gaussian with precision β. Thus the distribution of t is univariate normal:

  p(t | x, w, β) = N( t | y(x,w), β⁻¹ )

with mean y(x,w) and variance β⁻¹.

Likelihood Function

Data set: input X = {x1,..., xN} with targets t = {t1,..., tN}. The target variables tn are scalars forming a vector of size N.

The likelihood is the probability of observing the data, assuming the samples are independent:

  p(t | X, w, β) = ∏_{n=1}^{N} N( tn | wᵀφ(xn), β⁻¹ )

Log-Likelihood Function

Likelihood:

  p(t | X, w, β) = ∏_{n=1}^{N} N( tn | wᵀφ(xn), β⁻¹ )

Log-likelihood:

  ln p(t | w, β) = Σ_{n=1}^{N} ln N( tn | wᵀφ(xn), β⁻¹ )
                 = (N/2) ln β − (N/2) ln 2π − β E_D(w)

where

  E_D(w) = ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }²   (the sum-of-squares error function)

using the univariate Gaussian

  N(x | μ, σ²) = (2πσ²)^{−1/2} exp( −(x − μ)² / (2σ²) )

To determine w by maximum likelihood, take the derivative of the log-likelihood

  ln p(t | w, β) = (N/2) ln β − (N/2) ln 2π − β E_D(w),   E_D(w) = ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }²

with respect to w, set it equal to zero, and solve for w; equivalently, take the derivative of −βE_D(w).

The gradient is

  ∇w ln p(t | w, β) = β Σ_{n=1}^{N} { tn − wᵀφ(xn) } φ(xn)ᵀ

since ∇w [ ½ (a − wb)² ] = −(a − wb) b.

Setting the gradient to zero:

  0 = Σ_{n=1}^{N} tn φ(xn)ᵀ − wᵀ ( Σ_{n=1}^{N} φ(xn) φ(xn)ᵀ )

Solving for w, as shown on the next slide; this stationary point is a maximum of the likelihood.

Solving for w:

  w_ML = Φ⁺ t

where X = {x1,..., xN} are the samples (vectors of d variables) and t = {t1,..., tN} are the targets (scalars). Φ⁺ is the Moore–Penrose pseudo-inverse of the design matrix Φ: a generalization of the notion of matrix inverse to non-square matrices. If the design matrix is square and invertible, the pseudo-inverse is the same as the inverse.

Design matrix Φ: rows correspond to the N samples, columns to the M basis functions:

  Φ = [ φ0(x1) ... φ_{M−1}(x1)
        ...
        φ0(xN) ... φ_{M−1}(xN) ]

where, e.g., φj(x) = exp( −½ (x − μj)ᵀ Σ⁻¹ (x − μj) ).
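The pseudo-inverse solution can be sketched directly with NumPy; the random design matrix and targets below are stand-ins, not slide data:

```python
import numpy as np

rng = np.random.default_rng(2)

N, M = 40, 6
Phi = rng.standard_normal((N, M))          # stand-in design matrix
t = rng.standard_normal(N)                 # stand-in targets

# Maximum likelihood weights via the Moore-Penrose pseudo-inverse
w_ml = np.linalg.pinv(Phi) @ t

# Equivalent normal-equations form (when Phi^T Phi is invertible)
w_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
```

Both forms agree when Φ has full column rank; the pseudo-inverse is the numerically safer route.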

Regression function:

  y(x,w) = w0 + Σ_{j=1}^{M−1} wj φj(x)

Error function:

  E_D(w) = ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }²

With an explicit bias parameter:

  E_D(w) = ½ Σ_{n=1}^{N} { tn − w0 − Σ_{j=1}^{M−1} wj φj(xn) }²

Setting the derivative with respect to w0 equal to zero and solving for w0:

  w0 = t̄ − Σ_{j=1}^{M−1} wj φ̄j

where t̄ = (1/N) Σ_{n=1}^{N} tn and φ̄j = (1/N) Σ_{n=1}^{N} φj(xn).

The first term is the average of the N values of t; the second is the weighted sum of the averaged basis function values. Thus the bias w0 compensates for the difference between the average target value and the weighted sum of the averages of the basis function values.

We have determined the maximum likelihood solution for w using the probabilistic formulation p(t|x,w,β) = N( t | y(x,w), β⁻¹ ):

  w_ML = Φ⁺ t

with log-likelihood

  ln p(t | w, β) = (N/2) ln β − (N/2) ln 2π − β E_D(w),   E_D(w) = ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }²

Taking the gradient with respect to β gives

  1/β_ML = (1/N) Σ_{n=1}^{N} { tn − w_MLᵀ φ(xn) }²

the residual variance of the target values around the regression function.
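The noise-precision estimate 1/β_ML is just the mean squared residual; a sketch on synthetic data with a known noise standard deviation of 0.5 (so 1/β = 0.25), which is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Targets generated with known noise standard deviation 0.5 (so 1/beta = 0.25)
N = 20000
Phi = np.column_stack([np.ones(N), rng.standard_normal(N)])
w_true = np.array([1.0, 2.0])
t = Phi @ w_true + 0.5 * rng.standard_normal(N)

w_ml = np.linalg.pinv(Phi) @ t
residuals = t - Phi @ w_ml
inv_beta_ml = np.mean(residuals ** 2)      # 1/beta_ML, the residual variance
```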

Consider an N-dimensional space with axes tn, so that t = (t1,..., tN)ᵀ. Each basis function φj(xn), corresponding to the jth column of Φ, is represented in this space as a vector. If M < N, these M vectors span a subspace S of dimension M. Let y be the N-dimensional vector with elements y(xn, w). The sum-of-squares error is equal to the squared Euclidean distance between y and t. The least-squares solution w corresponds to the y that lies in S closest to t, i.e., the orthogonal projection of t onto S.

Design matrix (N data rows; M basis-function columns, each column an N-dimensional vector φ0,..., φ_{M−1}):

  Φ = [ φ0(x1) ... φ_{M−1}(x1)
        ...
        φ0(xN) ... φ_{M−1}(xN) ]

Direct solution of the normal equations:

  w_ML = Φ⁺ t,   Φ⁺ = (ΦᵀΦ)⁻¹ Φᵀ

When two basis functions are collinear, ΦᵀΦ is singular (determinant = 0) and the parameters can have large magnitudes. This is not uncommon with real data sets. It can be addressed using:
- Singular value decomposition
- Addition of a regularization term, which ensures the matrix is non-singular

The maximum likelihood solution

  w_ML = (ΦᵀΦ)⁻¹ Φᵀ t

is a batch technique: it processes the entire training set in one go. It is computationally expensive for large data sets, due to the huge N × M design matrix. The solution is to use a sequential algorithm in which samples are presented one at a time, called stochastic gradient descent.

  E_D(w) = ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }²

Denoting E_D(w) = Σn En, update the parameter vector w using

  w^(τ+1) = w^(τ) − η ∇En

Substituting for the derivative:

  w^(τ+1) = w^(τ) + η ( tn − w^(τ)ᵀ φ(xn) ) φ(xn)

w is initialized to some starting vector w^(0), and the learning rate η is chosen with care to ensure convergence. This is known as the least-mean-squares (LMS) algorithm.
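The LMS update can be sketched as a single pass over the data; the basis φ(x) = (1, x), the true weights (1, 2), and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Basis phi(x) = (1, x): learn w0 = 1, w1 = 2 one sample at a time
N = 5000
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(N)
Phi = np.column_stack([np.ones(N), x])

eta = 0.05                   # learning rate, small enough for convergence
w = np.zeros(2)              # starting vector w^(0)
for n in range(N):           # one pass, one sample at a time
    err = t[n] - w @ Phi[n]              # t_n - w^T phi(x_n)
    w = w + eta * err * Phi[n]           # LMS update
```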

Machine Learning Srihari

every weight vector, a parabola with

30 a single global minimum

Procedure Gradient-Descent (
  θ¹   // initial starting point
  f    // function to be minimized
  δ    // convergence threshold
)
1  t ← 1
2  do
3    θ^(t+1) ← θ^t − η ∇f(θ^t)
4    t ← t + 1
5  while ||θ^t − θ^(t−1)|| > δ
6  return θ^t

Intuition: the Taylor expansion of the function f(θ) in the neighborhood of θ^t is

  f(θ) ≈ f(θ^t) + (θ − θ^t)ᵀ ∇f(θ^t)

Let θ = θ^(t+1) = θ^t + h; thus f(θ^(t+1)) ≈ f(θ^t) + hᵀ ∇f(θ^t). At h = η∇f(θ^t) the linear term is positive, so f increases; at h = −η∇f(θ^t) it is negative, so f decreases. Alternatively: the slope ∇f(θ^t) points in the direction of steepest ascent; if we take a step in the opposite direction, we decrease the value of f.

One-dimensional example: let f(θ) = θ². This function has its minimum at θ = 0, which we want to determine using gradient descent. We have f′(θ) = 2θ. For gradient descent, we update θ by −η f′(θ). If θ^t > 0, then θ^(t+1) < θ^t; if θ^t < 0, then f′(θ^t) = 2θ^t is negative, thus θ^(t+1) > θ^t.
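The one-dimensional example above can be run directly; the starting point, learning rate, and iteration count are illustrative assumptions:

```python
# Gradient descent on f(theta) = theta^2, with f'(theta) = 2 * theta
eta = 0.1           # learning rate
theta = 3.0         # initial starting point
for _ in range(100):
    theta = theta - eta * 2 * theta    # theta^(t+1) = theta^t - eta f'(theta^t)
# theta shrinks geometrically (factor 1 - 2*eta per step) towards the minimum at 0
```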

Performance depends on the choice of the learning rate η. A large learning rate overshoots; a small learning rate is extremely slow. Solution: start with a large η and settle on an optimal value, which requires a schedule for shrinking η.

Line search: adaptively choose η at each step of the gradient-descent procedure. Define a line in the direction of the gradient:

  g(η) = θ^t + η ∇f(θ^t)

Find three points η1 < η2 < η3 such that f(g(η2)) is smaller than at both of the others; the three points then bracket the minimum along the line. Continue, e.g., with η = (η1 + η2)/2.

[Figure: three points bracketing the optimum; the dashed line shows a quadratic fit.]

In simple gradient ascent, two consecutive gradients are orthogonal: ∇f(θ^(t+1)) is orthogonal to ∇f(θ^t). Progress is zig-zag, and progress towards the maximum slows down. Solution: remember past directions and bias the new direction h^t to be a combination of the current gradient g^t and the past direction h^(t−1). This gives faster convergence than simple gradient ascent.

Steepest descent

Gradient descent search determines a weight vector w that minimizes E_D(w) by starting with an arbitrary initial weight vector and repeatedly modifying it in small steps. At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

The gradient (derivative) of E with respect to each component of the vector w:

  ∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]

It specifies the direction that produces the steepest increase in E; the negative of this vector specifies the direction of steepest decrease.

[Figure: the negated gradient gives the direction in the w0–w1 plane producing the steepest descent]

Training rule:

  w ← w + Δw,   where Δw = −η ∇E[w]

η is a positive constant called the learning rate; it determines the step size of the gradient descent search.

Component form of gradient descent: this can also be written as

  wi ← wi + Δwi,   where Δwi = −η ∂E/∂wi

Add a regularization term to the error to control overfitting. The error function to minimize is

  E(w) = E_D(w) + λ E_W(w)

where λ is the regularization coefficient; it controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w).

A simple form of regularizer is E_W(w) = ½ wᵀw, giving the quadratic regularizer

  E(w) = ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }² + (λ/2) wᵀw

This is known as weight decay, because in sequential learning the weight values decay towards zero unless supported by the data.

Quadratic Regularizer

  E(w) = ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }² + (λ/2) wᵀw

This remains a quadratic function of w, so its exact minimizer can be found in closed form. Setting the gradient to zero and solving for w:

  w = ( λI + ΦᵀΦ )⁻¹ Φᵀ t

a simple extension of the least-squares solution w_ML = (ΦᵀΦ)⁻¹ Φᵀ t.
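The closed-form regularized solution can be sketched as below; the random design matrix, targets, and λ = 0.1 are stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

N, M = 30, 8
Phi = rng.standard_normal((N, M))      # stand-in design matrix
t = rng.standard_normal(N)             # stand-in targets
lam = 0.1                              # regularization coefficient lambda

# Regularized least squares: w = (lambda I + Phi^T Phi)^(-1) Phi^T t
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Unregularized solution for comparison
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
```

The regularized weights are shrunk towards zero relative to the unregularized solution.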

A more general regularized error:

  ½ Σ_{n=1}^{N} { tn − wᵀφ(xn) }² + (λ/2) Σ_{j=1}^{M} |wj|^q

q = 2 gives the quadratic regularizer; q = 1 is known as the lasso.

[Figure: contours of constant regularization term Σj |wj|^q in the (w1, w2) space, e.g. |w1| + |w2| = const for the lasso and w1⁴ + w2⁴ = const for q = 4.]

Equivalently, we are trying to find the w that minimizes the unregularized error function

  Σ_{n=1}^{N} { tn − wᵀφ(xn) }²

choosing that value of w subject to the constraint

  Σ_{j=1}^{M} |wj|^q ≤ η

for some constant η. The two approaches are related by Lagrange multipliers.

[Figure: contours of the unregularized error in the (w1, w2) plane; every point on a contour gives the same error.]

[Figure: minimization subject to the constraint, with the quadratic regularizer (q = 2). Blue: contours of the unregularized error function; the disc is the constraint region; w* is the optimum value.]

With q = 1 and λ sufficiently large, some of the coefficients wj are driven to zero. This leads to a sparse model, in which the corresponding basis functions play no role. The origin of the sparsity is illustrated by comparing the two cases: with the quadratic regularizer the solution has both w1* and w2* nonzero, while minimization with the lasso regularizer gives a sparse solution with w1* = 0.

Regularization: Conclusion

Regularization allows complex models to be trained on small data sets without severe over-fitting. It limits model complexity, i.e., how many basis functions to use. The problem of limiting complexity is thereby shifted to one of determining a suitable value of the regularization coefficient λ.

Try several basis functions and quadratic regularization. Express the results as the root-mean-square error

  E_RMS = √( 2 E(w*) / N )

rather than as the squared error E(w*) or as an error rate with thresholded results.
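The E_RMS formula is a one-liner; the tiny design matrix and weights below are illustrative assumptions:

```python
import numpy as np

def e_rms(w, Phi, t):
    """Root-mean-square error E_RMS = sqrt(2 E(w) / N)."""
    e = 0.5 * np.sum((t - Phi @ w) ** 2)    # sum-of-squares error E(w)
    return np.sqrt(2 * e / len(t))

# On perfect predictions the RMS error is zero
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
w = np.array([1.0, 2.0])
t = Phi @ w
```

Unlike E(w*), E_RMS is on the same scale as the targets and is comparable across data-set sizes.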

Multiple Outputs

Several target variables t = (t1,..., tK) with K > 1. These can be treated as multiple (K) independent regression problems, with different basis functions for each component of t. The more common solution is to use the same set of basis functions to model all components of the target vector:

  y(x,w) = Wᵀφ(x)

where y is a K-dimensional column vector, W is an M × K matrix of weights, and φ(x) is an M-dimensional column vector with elements φj(x).

The set of observations t1,..., tN is combined into a matrix T of size N × K such that the nth row is given by tnᵀ. Similarly, combine the input vectors x1,..., xN into a matrix X. Maximizing the log-likelihood function gives the same pseudo-inverse solution for every output at once:

  W_ML = (ΦᵀΦ)⁻¹ Φᵀ T
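A sketch of the multiple-output solution: all K regressions share one design matrix and one pseudo-inverse. The dimensions and the noiseless random data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

N, M, K = 50, 4, 3                      # samples, basis functions, outputs
Phi = rng.standard_normal((N, M))       # shared design matrix
W_true = rng.standard_normal((M, K))
T = Phi @ W_true                        # N x K target matrix; nth row is t_n^T

# All K regressions solved at once with the same pseudo-inverse
W_ml = np.linalg.pinv(Phi) @ T
```

Because the data are noiseless and Φ has full column rank, the weights are recovered exactly.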
