
Machine Learning
Sargur Srihari
srihari@cedar.buffalo.edu

The simplest form of linear regression is a linear function of a single input variable:

  y(x,w) = w0 + w1 x

The task is to learn the weights w0, w1 from data D = {(xi, ti)}, i = 1,…,N.

Polynomial curve fitting:

  y(x,w) = w0 + w1 x + w2 x² + … = Σ_j wj x^j

More generally, a linear combination of nonlinear functions of the input variables, φj(x) instead of x^j, called basis functions, is used. Such models are linear functions of the parameters (which gives them simple analytical properties), yet are nonlinear with respect to the input variables.

Plan of Discussion

We discuss supervised learning, starting with regression. The goal is to predict the value of one or more target variables t, given a d-dimensional vector x of input variables.

Terminology:
- Regression: t is continuous-valued
- Classification: t takes values that are labels (non-ordered categories)
- Ordinal regression: t takes discrete values from ordered categories

Generalizations of Polynomial Curve Fitting

Polynomial curve fitting has a single scalar input variable x and a scalar target variable t. Generalizations:
- Predict the value of a continuous target variable t given the values of d input variables x = [x1,…,xd]
- t can also be a set of variables (multiple regression)
- Models are linear functions of adjustable parameters; specifically, linear combinations of nonlinear functions of the input variables

Regression with d Input Variables

  y(x,w) = w0 + w1 x1 + … + wd xd = wᵀx

with parameters w0,…,wd and input variables x1,…,xd. This differs from linear regression with one variable and from polynomial regression with one variable. In the one-dimensional case this amounts to a straight-line fit (degree-one polynomial):

  y(x,w) = w0 + w1 x

Assume t is a function of the inputs x1, x2,…,xd (the independent variables). The goal is to find the best linear regressor of t on all the inputs. For d = 2 this amounts to fitting a plane through the N input samples; in d dimensions, a hyperplane.

Example: Learning to Rank (LeToR)

Multiple inputs; the target value t is discrete (e.g., 1, 2,…,6) in the training set, but a continuous value in [1,6] is learnt and used to rank objects.

Input (xi): d features of a query-URL pair (d > 200), e.g.:
- Log frequency of query in anchor text
- Query word in color on page
- # of images on page
- # of (out) links on page
- PageRank of page
- URL length
- URL contains ~
- Page length

Output (y): relevance value
- Point-wise target variable (0, 1, 2, 3)
- Regression returns a continuous value, which allows fine-grained ranking of URLs

One benchmark data set has 46 query-document features and a maximum of 124 URLs/query; the Yahoo! data set has d = 700.

Input Features

See http://research.microsoft.com/en-us/projects/mslr/feature.aspx for the feature list of the Microsoft Learning to Rank data sets. Each feature type is computed over five streams (body, anchor, title, url, whole document), giving five feature ids per type:

  feature ids   feature description
  1-5           covered query term number
  6-10          covered query term ratio
  11-15         stream length
  16-20         IDF (inverse document frequency)
  21-25         sum of term frequency
  26-30         min of term frequency
  31-35         max of term frequency
  36-40         mean of term frequency
  41-45         variance of term frequency
  46-50         sum of stream length normalized term frequency

Feature Statistics

Most of the 46 features are normalized as continuous values from 0 to 1; the exception is that some features are all 0s.

Features 1-15:
  Min:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  Max:  1 1 1 1 1 0 0 0 0 0 1 1 1 1 1
  Mean: 0.254 0.1598 0.1392 0.2158 0.1322 0.1614 0 0 0 0 0 0.2841 0.1382 0.2109 0.1218

Features 16-31:
  Min:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  Max:  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  Mean: 0.2879 0.1533 0.2258 0.3057 0.3332 0.1534 0.5473 0.5592 0.5453 0.5622 0.1675 0.1377 0.1249 0.126 0.2109 0.1705

Features 32-46:
  Min:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  Max:  1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
  Mean: 0.1694 0.1603 0.1275 0.0762 0.0762 0.0728 0.5479 0.5574 0.5502 0.5673 0.4321 0.3361 0 0.1065 0.1211

Linear Basis Function Models

Linear regression is extended by considering nonlinear functions of the input variables:

  y(x,w) = w0 + Σ_{j=1}^{M-1} wj φj(x)

We now need M weights for basis functions instead of d weights for features. Defining an extra basis function φ0(x) = 1, this can be written compactly as

  y(x,w) = Σ_{j=0}^{M-1} wj φj(x) = wᵀφ(x)
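The model y(x,w) = wᵀφ(x) can be sketched in a few lines of NumPy. This is a minimal illustration with a polynomial basis; the names poly_basis and predict are illustrative, not from the slides:

```python
import numpy as np

# Minimal sketch: evaluate y(x, w) = w^T phi(x) with the polynomial basis
# phi_j(x) = x**j, j = 0..M-1 (phi_0(x) = 1 supplies the bias w0).
def poly_basis(x, M):
    """Map scalar inputs x (shape (N,)) to a design matrix Phi (shape (N, M))."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**j for j in range(M)], axis=1)

def predict(x, w):
    """y(x, w) = Phi(x) @ w for every input in x."""
    return poly_basis(x, len(w)) @ w

w = np.array([1.0, 2.0, 3.0])                 # y = 1 + 2x + 3x^2
print(predict(np.array([0.0, 1.0, 2.0]), w))  # [ 1.  6. 17.]
```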

Polynomial Basis Functions

  y(x,w) = Σ_{j=0}^{M-1} wj φj(x) = wᵀφ(x),  with φj(x) = x^j: a degree M-1 polynomial

Disadvantages:
- Global: changes in one region of input space affect others
- Difficult to formulate for several variables: the number of polynomial terms increases exponentially with M

One can divide the input space into regions and use a different polynomial in each region: this is equivalent to spline functions.

Polynomial with Several Variables (not used in practice!)

Applying a scalar basis function φj to the vector x = [x1,…,xd] would convert the vector into a scalar value, thereby losing information.

Better approach: a polynomial of degree M-1 has terms with the variables taken none, one, two, …, M-1 at a time. Use a multi-index j = (j1, j2,…,jd) such that j1 + j2 + … + jd ≤ M-1. For a quadratic (M = 3) with three variables (d = 3):

  y(x,w) = Σ_{(j1,j2,j3)} w_(j1,j2,j3) φ_(j1,j2,j3)(x)
         = w_(0,0,0) + w_(1,0,0) x1 + w_(0,1,0) x2 + w_(0,0,1) x3 + w_(1,1,0) x1x2 + w_(1,0,1) x1x3 + …

For d = 46, the number of such terms is 1128 (1 constant + 46 linear + 46·47/2 = 1081 quadratic terms).

Gaussian Basis Functions

  φj(x) = exp( -(x - μj)² / (2s²) )

These do not necessarily have a probabilistic interpretation; the usual normalization term is unimportant, since each basis function is multiplied by a weight wj.

Choice of parameters:
- The μj govern the locations of the basis functions. They can be an arbitrary set of points within the range of the data; one can choose some representative data points.
- s governs the spatial scale. It could be chosen from the data set, e.g., the average variance.

Several variables: a Gaussian kernel is chosen for each dimension; for each j a different mean vector μj is needed, perhaps chosen from the data:

  φj(x) = exp( -½ (x - μj)ᵀ Σ⁻¹ (x - μj) )
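A minimal sketch of the scalar Gaussian basis above, with centers chosen as representative points in the data range (the function name gaussian_basis and the particular centers are illustrative assumptions):

```python
import numpy as np

# Minimal sketch: Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).
def gaussian_basis(x, centers, s):
    """x: shape (N,); centers mu_j: shape (M,); returns Phi of shape (N, M)."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centers, dtype=float)[None, :]
    return np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))

# Three basis functions centered at 0, 0.5, 1 with spatial scale s = 0.2
Phi = gaussian_basis(np.linspace(0, 1, 5), centers=[0.0, 0.5, 1.0], s=0.2)
print(Phi.shape)  # (5, 3)
```

Each column of Phi is one basis function evaluated at all inputs; each peaks at 1 when x equals its center.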

Sigmoidal Basis Functions

  φj(x) = σ( (x - μj)/s ),  where σ(a) = 1 / (1 + exp(-a)) is the logistic sigmoid

Equivalently, the tanh function can be used, since it is related to the logistic sigmoid by

  tanh(a) = 2σ(2a) - 1

Other Basis Functions

- Fourier: an expansion in sinusoidal functions, each with infinite spatial extent
- Signal processing: functions localized in both time and frequency, called wavelets; useful for lattices such as images and time series

The further discussion is independent of the choice of basis, including the identity φ(x) = x.

Probabilistic Formulation

Given: a data set of N observations {xn}, n = 1,…,N, with corresponding target values {tn}.
Goal: predict the value of t for a new value of x.

Simplest approach: construct a function y(x) whose values are the predictions.

Probabilistic approach: model the predictive distribution p(t|x), which expresses the uncertainty about the value of t for each x. From this conditional distribution we can predict t for any value of x so as to minimize a loss function. The typical loss is the squared loss, for which the solution is the conditional expectation of t.

Maximum Likelihood and Least Squares

We will show that minimizing the sum-of-squares error is the same as the maximum likelihood solution under a Gaussian noise model. The target variable is a scalar t given by a deterministic function y(x,w) with additive Gaussian noise:

  t = y(x,w) + ε

where ε is zero-mean Gaussian with precision β. Thus the distribution of t is univariate normal, with mean y(x,w) and variance β⁻¹:

  p(t|x,w,β) = N( t | y(x,w), β⁻¹ )

Likelihood Function

Data set: input X = {x1,…,xN} with targets t = {t1,…,tN}; the target variables tn are scalars forming a vector t of size N. The likelihood is the probability of observing the data, assuming the observations are independent:

  p(t|X,w,β) = Π_{n=1}^N N( tn | wᵀφ(xn), β⁻¹ )

Log-Likelihood Function

Likelihood:  p(t|X,w,β) = Π_{n=1}^N N( tn | wᵀφ(xn), β⁻¹ )

Log-likelihood:

  ln p(t|w,β) = Σ_{n=1}^N ln N( tn | wᵀφ(xn), β⁻¹ )
              = (N/2) ln β - (N/2) ln 2π - β E_D(w)

where

  E_D(w) = ½ Σ_{n=1}^N { tn - wᵀφ(xn) }²     (sum-of-squares error function)

using the univariate Gaussian density

  N(x|μ,σ²) = (2πσ²)^(-1/2) exp( -(x - μ)² / (2σ²) )

Maximizing the Log-Likelihood

Recall ln p(t|w,β) = (N/2) ln β - (N/2) ln 2π - β E_D(w), with the sum-of-squares error E_D(w) = ½ Σ_{n=1}^N { tn - wᵀφ(xn) }². To maximize, take the derivative of ln p(t|w,β) with respect to w, set it equal to zero, and solve for w; equivalently, take the derivative of -β E_D(w).

The gradient is

  ∇w ln p(t|w,β) = β Σ_{n=1}^N { tn - wᵀφ(xn) } φ(xn)ᵀ

since ∇w [ ½ (a - wᵀb)² ] = -(a - wᵀb) bᵀ. Setting this gradient to zero:

  0 = Σ_{n=1}^N tn φ(xn)ᵀ - wᵀ ( Σ_{n=1}^N φ(xn) φ(xn)ᵀ )

Solving this for w, as shown on the next slide, gives the maximum.

Design Matrix and Pseudo-Inverse

Solving for w:

  w_ML = Φ⁺ t

where X = {x1,…,xN} are the samples (vectors of d variables) and t = {t1,…,tN} are the targets (scalars). Here Φ⁺ is the pseudo-inverse of the design matrix Φ — a generalization of the notion of matrix inverse to non-square matrices. If the design matrix is square and invertible, then the pseudo-inverse is the same as the inverse.

The N × M design matrix Φ has rows corresponding to the N samples and columns to the M basis functions:

  Φ = [ φ0(x1)  …  φM-1(x1)
        ⋮             ⋮
        φ0(xN)  …  φM-1(xN) ]
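The solution w_ML = Φ⁺t can be computed directly with NumPy's pseudo-inverse. A minimal sketch on synthetic data (the noise level and the true weights [1, 2] are assumptions for illustration):

```python
import numpy as np

# Minimal sketch: maximum-likelihood weights via the Moore-Penrose
# pseudo-inverse, w_ML = pinv(Phi) @ t, with basis phi(x) = [1, x].
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(50)   # noisy line t = 1 + 2x

Phi = np.stack([np.ones_like(x), x], axis=1)         # N x M design matrix
w_ml = np.linalg.pinv(Phi) @ t                       # w_ML = Phi^+ t
print(w_ml)   # close to [1.0, 2.0]
```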

Role of the Bias Parameter

Regression function:  y(x,w) = w0 + Σ_{j=1}^{M-1} wj φj(x)

Error function:  E_D(w) = ½ Σ_{n=1}^N { tn - wᵀφ(xn) }²

With an explicit bias parameter:

  E_D(w) = ½ Σ_{n=1}^N { tn - w0 - Σ_{j=1}^{M-1} wj φj(xn) }²

Setting the derivative with respect to w0 equal to zero and solving for w0:

  w0 = t̄ - Σ_{j=1}^{M-1} wj φ̄j

where  t̄ = (1/N) Σ_{n=1}^N tn  and  φ̄j = (1/N) Σ_{n=1}^N φj(xn)

The first term is the average of the N values of t; the second is a weighted sum of the averages of the basis-function values. Thus the bias w0 compensates for the difference between the average target value and the weighted sum of the averages of the basis-function values.

Maximum Likelihood for the Precision

We have determined the m.l.e. solution for w using the probabilistic formulation p(t|x,w,β) = N(t|y(x,w), β⁻¹), namely w_ML = Φ⁺t, with log-likelihood

  ln p(t|w,β) = (N/2) ln β - (N/2) ln 2π - β E_D(w),  E_D(w) = ½ Σ_{n=1}^N { tn - wᵀφ(xn) }²

Taking the gradient with respect to β gives

  1/β_ML = (1/N) Σ_{n=1}^N { tn - w_MLᵀ φ(xn) }²

i.e., the inverse precision is the residual variance of the target values around the regression function.
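The precision estimate is just the mean squared residual after the fit. A minimal sketch (the four data points are invented for illustration):

```python
import numpy as np

# Minimal sketch: after fitting w_ML, estimate 1/beta_ML as the mean
# squared residual of the targets around the regression function.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 0.9, 2.1, 2.9])
Phi = np.stack([np.ones_like(x), x], axis=1)   # basis phi(x) = [1, x]
w_ml = np.linalg.pinv(Phi) @ t
inv_beta_ml = np.mean((t - Phi @ w_ml) ** 2)   # 1/beta_ML
print(inv_beta_ml)
```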

Geometry of Least Squares

Consider an N-dimensional space with axes tn, so that t = (t1,…,tN)ᵀ. Each basis function φj(xn), corresponding to the jth column of Φ, is represented in this space as a vector ϕj. If M < N, then the M vectors ϕ0, ϕ1,…,ϕM-1 span a subspace S of dimension M. Let y be the N-dimensional vector with elements y(xn,w); the sum-of-squares error is equal to the squared Euclidean distance between y and t. The least-squares solution for w therefore corresponds to the y that lies in S closest to t: the orthogonal projection of t onto S.

(Figure: the M basis functions — columns of the design matrix Φ = [φj(xn)], rows = N data points — represented as N-dimensional vectors ϕ0, ϕ1,…,ϕM-1.)

Practical Issues with the Normal Equations

Direct solution of the normal equations:

  w_ML = Φ⁺ t,  Φ⁺ = (ΦᵀΦ)⁻¹ Φᵀ

When two basis functions are collinear, ΦᵀΦ is singular (determinant = 0), and parameters can have large magnitudes. This is not uncommon with real data sets. It can be addressed using:
- singular value decomposition (SVD)
- addition of a regularization term, which ensures the matrix is non-singular

Batch vs Sequential Learning

The maximum likelihood solution

  w_ML = (ΦᵀΦ)⁻¹ Φᵀ t

is a batch technique: it processes the entire training set in one go. This is computationally expensive for large data sets, due to the huge N × M design matrix. The solution is to use a sequential algorithm, in which samples are presented one at a time, called stochastic gradient descent.

Stochastic Gradient Descent

  E_D(w) = ½ Σ_{n=1}^N { tn - wᵀφ(xn) }²

Denoting E_D(w) = Σn En, update the parameter vector w using

  w^(τ+1) = w^(τ) - η ∇En

Substituting for the derivative of En:

  w^(τ+1) = w^(τ) + η ( tn - w^(τ)ᵀ φ(xn) ) φ(xn)

where η is the learning rate and w is initialized to some starting vector w^(0); η is chosen with care to ensure convergence. This is known as the Least Mean Squares (LMS) algorithm.
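The LMS update above can be sketched as a per-sample loop. The learning rate, epoch count, and synthetic data below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the LMS update:
#   w <- w + eta * (t_n - w . phi_n) * phi_n, one sample at a time,
# with basis phi(x) = [1, x].
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
t = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(500)  # noisy line t = 1 + 2x

w = np.zeros(2)        # starting vector w^(0)
eta = 0.1              # learning rate, small enough to converge
for epoch in range(50):
    for xn, tn in zip(x, t):
        phi = np.array([1.0, xn])
        w += eta * (tn - w @ phi) * phi
print(w)   # approaches [1.0, 2.0]
```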

Error Surface

The sum-of-squares error E_D(w) is quadratic in the weights: along every weight-vector direction, a parabola with a single global minimum.

Gradient Descent Procedure

Procedure Gradient-Descent(
  θ¹   // initial starting point
  f    // function to be minimized
  δ    // convergence threshold
)
1  t ← 1
2  do
3    θ^(t+1) ← θ^t - η ∇f(θ^t)
4    t ← t + 1
5  while ||θ^t - θ^(t-1)|| > δ
6  return θ^t

Intuition: the Taylor expansion of f(θ) in the neighborhood of θ^t is f(θ) ≈ f(θ^t) + (θ - θ^t)ᵀ ∇f(θ^t). Let θ = θ^(t+1) = θ^t + h, thus f(θ^(t+1)) ≈ f(θ^t) + hᵀ ∇f(θ^t). The slope ∇f(θ^t) points in the direction of steepest ascent; if we take a step in the opposite direction, we decrease the value of f.

One-dimensional example: let f(θ) = θ². This function has its minimum at θ = 0, which we want to determine using gradient descent. We have f′(θ) = 2θ; for gradient descent, we update θ by -η f′(θ):
- If θ^t > 0 then θ^(t+1) < θ^t
- If θ^t < 0 then f′(θ^t) = 2θ^t is negative, thus θ^(t+1) > θ^t
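The one-dimensional example can be run directly; this minimal sketch follows the procedure above for f(θ) = θ² (the function name grad_descent is illustrative):

```python
# Minimal sketch: gradient descent on f(theta) = theta^2, minimum at theta = 0.
# f'(theta) = 2*theta, so the update is theta <- theta - eta * 2 * theta.
def grad_descent(theta, eta=0.1, delta=1e-8):
    while True:
        new = theta - eta * 2.0 * theta          # step against the gradient
        if abs(new - theta) <= delta:            # convergence threshold
            return new
        theta = new

print(grad_descent(3.0))    # close to 0
print(grad_descent(-2.0))   # close to 0, approached from below
```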

Choice of Learning Rate

Performance depends on the choice of learning rate η:
- Large learning rate: overshoots
- Small learning rate: extremely slow

Solution: start with a large η and settle on an optimal value; this requires a schedule for shrinking η.

Line Search

Adaptively choose η at each step of the gradient-descent procedure. Define a line in the direction of the gradient:

  g(η) = θ^t + η ∇f(θ^t)

Find three points η1 < η2 < η3 so that f(g(η2)) is smaller than at both others; the three points then bracket the minimum along the line. Refine, e.g., with η = (η1 + η2)/2. (In the illustration, the dashed line shows a quadratic fit to the three bracketing points.)

Avoiding Zig-Zag Progress

In simple gradient ascent, two consecutive gradients are orthogonal: ∇f_obj(θ^(t+1)) is orthogonal to ∇f_obj(θ^t). Progress is zig-zag, and progress toward the optimum slows down. Solution: remember past directions and bias the new direction h^t as a combination of the current gradient g^t and the past direction h^(t-1). This gives faster convergence than simple gradient ascent.

Steepest Descent

Gradient descent search determines a weight vector w that minimizes E_D(w) by:
- starting with an arbitrary initial weight vector
- repeatedly modifying it in small steps
- at each step, modifying the weight vector in the direction that produces the steepest descent along the error surface

The gradient (derivative) of E with respect to each component of the vector w is

  ∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn ]

It specifies the direction that produces the steepest increase in E; the negative of this vector specifies the direction of steepest decrease.

(Figure: the negated gradient gives the direction in the w0-w1 plane producing steepest descent.)

Gradient Descent Rule

  w ← w + Δw,  where Δw = -η ∇E[w]

η is a positive constant called the learning rate; it determines the step size of the gradient descent search.

Component form of gradient descent: can also be written as

  wi ← wi + Δwi,  where Δwi = -η ∂E/∂wi

Regularization

A regularization term is added to the error to control over-fitting. The error function to minimize is

  E(w) = E_D(w) + λ E_W(w)

where λ is the regularization coefficient; it controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w). A simple form of regularizer is E_W(w) = ½ wᵀw, giving the quadratic regularizer:

  E(w) = ½ Σ_{n=1}^N { tn - wᵀφ(xn) }² + (λ/2) wᵀw

This is known as weight decay, because in sequential learning the weight values decay towards zero unless supported by the data.

Quadratic Regularizer

  E(w) = ½ Σ_{n=1}^N { tn - wᵀφ(xn) }² + (λ/2) wᵀw

This remains a quadratic function of w, so its exact minimizer can be found in closed form. Setting the gradient to zero and solving for w:

  w = ( λI + ΦᵀΦ )⁻¹ Φᵀ t

a simple extension of the least-squares solution w_ML = (ΦᵀΦ)⁻¹Φᵀt.
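The closed-form regularized solution above can be sketched directly; the helper name ridge and the synthetic data are illustrative assumptions:

```python
import numpy as np

# Minimal sketch: regularized least squares (quadratic regularizer),
# w = (lambda*I + Phi^T Phi)^(-1) Phi^T t; lambda -> 0 recovers w_ML.
def ridge(Phi, t, lam):
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

x = np.linspace(0, 1, 20)
t = 1.0 + 2.0 * x                               # exact line t = 1 + 2x
Phi = np.stack([np.ones_like(x), x], axis=1)    # basis phi(x) = [1, x]
print(ridge(Phi, t, lam=0.0))    # ~[1.0, 2.0] (unregularized)
print(ridge(Phi, t, lam=10.0))   # weights shrunk towards zero
```

Increasing λ trades data fit for smaller weight magnitudes, which is the weight-decay behavior described above.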

Generalized Regularizer

Regularized error:

  ½ Σ_{n=1}^N { tn - wᵀφ(xn) }² + (λ/2) Σ_{j=1}^M |wj|^q

q = 1 is known as the lasso; q = 2 gives the quadratic regularizer.

(Figure: contours of the constant regularization term Σj |wj|^q in the space of (w1, w2), e.g., |w1| + |w2| = const for the lasso and w1⁴ + w2⁴ = const for q = 4, alongside the circular quadratic contour.)

Constrained Minimization View

We are trying to find the w that minimizes the unregularized error function

  Σ_{n=1}^N { tn - wᵀφ(xn) }²

(each contour of which, in the space of (w1, w2), gives the same error), choosing that value of w subject to the constraint

  Σ_{j=1}^M |wj|^q ≤ η  (for some constant η)

The two approaches, regularization and constrained minimization, are related by Lagrange multipliers.

(Figure: minimization subject to the constraint, with quadratic regularizer q = 2. Blue: contours of the unregularized error function; the shaded region is the constraint region; w* is the optimum value.)

Sparsity with the Lasso

With q = 1, if λ is sufficiently large, some of the coefficients wj are driven to zero. This leads to a sparse model, in which the corresponding basis functions play no role. The origin of the sparsity is illustrated in the figure: the quadratic regularizer gives a solution where w1* and w2* are both nonzero, while minimization with the lasso regularizer gives a sparse solution with w1* = 0.

Regularization: Conclusion

Regularization allows complex models to be trained on small data sets without severe over-fitting. It limits model complexity, i.e., how many basis functions to use. The problem of limiting complexity is shifted to one of determining a suitable value of the regularization coefficient λ.

Experiments to Try

- Several basis functions
- Quadratic regularization
- Express results as E_RMS, rather than as squared error E(w*) or as error rate with thresholded results:

  E_RMS = √( 2E(w*)/N )
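The RMS error is a one-liner once the sum-of-squares error is computed; a minimal sketch with invented targets and predictions:

```python
import numpy as np

# Minimal sketch: root-mean-square error E_RMS = sqrt(2 E(w*) / N),
# where E(w*) = (1/2) * sum of squared residuals over the N points.
def e_rms(t, y):
    E = 0.5 * np.sum((t - y) ** 2)
    return np.sqrt(2.0 * E / len(t))

t = np.array([1.0, 2.0, 3.0])   # targets
y = np.array([1.0, 2.0, 4.0])   # predictions (one residual of 1)
print(e_rms(t, y))              # sqrt(1/3)
```

Unlike the raw squared error, E_RMS is on the same scale as the targets and is comparable across data sets of different size N.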

Multiple Outputs

Several target variables t = (t1,…,tK), where K > 1. These can be treated as multiple (K) independent regression problems, with different basis functions for each component of t. The more common solution is to use the same set of basis functions to model all components of the target vector:

  y(x,w) = Wᵀφ(x)

where y is a K-dimensional column vector, W is an M × K matrix of weights, and φ(x) is an M-dimensional column vector with elements φj(x).

The set of observations t1,…,tN is combined into a matrix T of size N × K such that the nth row is given by tnᵀ; the input vectors x1,…,xN are combined into a matrix X. Maximizing the log-likelihood function then gives the same pseudo-inverse form as before, applied column-wise:

  W_ML = (ΦᵀΦ)⁻¹ Φᵀ T
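The multiple-output solution shares one design matrix across all K targets. A minimal sketch with K = 2 invented targets and the basis φ(x) = [1, x]:

```python
import numpy as np

# Minimal sketch: multiple-output least squares, W_ML = pinv(Phi) @ T.
# Each column of T is one target variable; each column of W_ML is the
# weight vector for the corresponding output.
x = np.linspace(0, 1, 30)
T = np.stack([1.0 + 2.0 * x,            # target 1: t = 1 + 2x
              3.0 - 1.0 * x], axis=1)   # target 2: t = 3 - x;  T is (N, K)
Phi = np.stack([np.ones_like(x), x], axis=1)   # (N, M) design matrix
W_ml = np.linalg.pinv(Phi) @ T                 # (M, K) weight matrix
print(W_ml)
```

Note the design matrix (and its pseudo-inverse) is computed once, regardless of K — the advantage of sharing basis functions across outputs.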
