
Chapter 4: Linear Functions and Differentiation

4.1 Linear functions


Let $f: V \to \mathbb{R}$ be a function on a vector space $V$.

$f$ is a linear function if
$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$, $\forall \alpha, \beta \in \mathbb{R}$ and $x, y \in V$.
Example 1: The mean of vectors in $\mathbb{R}^n$, i.e.,
$f(x) = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$ for $x \in \mathbb{R}^n$,
is linear.
Example 2: The maximum element of vectors in $\mathbb{R}^n$, i.e.,

$f(x) = \max\{x_1, \ldots, x_n\}$ for $x \in \mathbb{R}^n$,

is NOT linear.
Example 3: $f: \mathbb{R} \to \mathbb{R}$ with $f(x) = ax$ is a linear function.
Example 4: $F: C[a,b] \to \mathbb{R}$ defined by
$F(f) = f(x_0)$, where $x_0 \in [a,b]$ is a given number,
is linear.
Example 5: $F: C[a,b] \to \mathbb{R}$ defined by

$F(f) = \int_a^b f(x)\,dx$ for $f \in C[a,b]$ is linear.


Example 6: $f: V \to \mathbb{R}$, where $V$ is an inner product space, defined by

$f(x) = \langle x, z \rangle$, where $z \in V$ is a given vector in $V$, is linear.

Example 7: A norm function on a vector space $V$ is NOT linear.

To see this, note that
$\|\alpha x\| = |\alpha|\,\|x\|$,
which contradicts
$f(\alpha x) = \alpha f(x)$
for $f$ linear on $V$ (take, e.g., $\alpha = -1$ and $x \neq 0$).

Properties of linear functions

Homogeneity: $f(\alpha x) = \alpha f(x)$, $\forall \alpha \in \mathbb{R}$ and $x \in V$.
Because
$f(\alpha x) = f(\alpha x + 0 \cdot x) = \alpha f(x) + 0 \cdot f(x) = \alpha f(x)$.

$f(0) = 0$: because $0 \cdot x = 0$, homogeneity implies $f(0) = f(0 \cdot x) = 0 \cdot f(x) = 0$, $\forall x \in V$.

Additivity: $f(x + y) = f(x) + f(y)$, $\forall x, y \in V$. More generally,
$f(\alpha_1 x_1 + \cdots + \alpha_k x_k) = \alpha_1 f(x_1) + \cdots + \alpha_k f(x_k)$, $\forall \alpha_1, \ldots, \alpha_k \in \mathbb{R}$ and $x_1, \ldots, x_k \in V$.

To see this, we note that

$f(\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k) = \alpha_1 f(x_1) + f(\alpha_2 x_2 + \cdots + \alpha_k x_k) = \cdots = \alpha_1 f(x_1) + \alpha_2 f(x_2) + \cdots + \alpha_k f(x_k)$.

Inner product representation of a linear function on Hilbert spaces


For simplicity, let's consider a linear function on $\mathbb{R}^n$ equipped with the
standard inner product $\langle x, y \rangle = x^T y$ and the induced norm $\|x\| = \sqrt{\langle x, x \rangle}$.
From the discussion above:

For any given $a \in \mathbb{R}^n$, the function $f(x) = \langle a, x \rangle$ is linear.


The reverse is also true, i.e.,

any linear function $f: \mathbb{R}^n \to \mathbb{R}$ must be of the form

$f(x) = \langle a, x \rangle$ for some $a \in \mathbb{R}^n$.

To see this, let $e_1, e_2, \ldots, e_n$ be the standard basis of $\mathbb{R}^n$,
so that any $x \in \mathbb{R}^n$ is written as $x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n$.
Therefore, if $f$ is a linear function, then

$f(x) = f(x_1 e_1 + x_2 e_2 + \cdots + x_n e_n)$
$\quad\;\; = x_1 f(e_1) + x_2 f(e_2) + \cdots + x_n f(e_n)$  (by the properties of linear functions)
$\quad\;\; = \langle a, x \rangle$,

where $a = \big(f(e_1), f(e_2), \ldots, f(e_n)\big)^T \in \mathbb{R}^n$.

Furthermore, the representation $f(x) = \langle a, x \rangle$ of a linear function is unique,
which means there is only one vector $a \in \mathbb{R}^n$ for which $f(x) = \langle a, x \rangle$ holds for all $x$.
Indeed, suppose that $a$ is not unique, i.e., we have two vectors $a, b$
such that $f(x) = \langle a, x \rangle$ and $f(x) = \langle b, x \rangle$ for all $x \in \mathbb{R}^n$.
Then let $x = e_i$: $f(e_i) = \langle a, e_i \rangle = a_i$ and $f(e_i) = \langle b, e_i \rangle = b_i$.
So $a_i = b_i$, $i = 1, 2, \ldots, n$. Therefore $a = b$.

Altogether, we see that

a function $f: \mathbb{R}^n \to \mathbb{R}$ is linear if and only if $f(x) = \langle a, x \rangle$ for some unique $a \in \mathbb{R}^n$.

The above holds true for any bounded linear function on a Hilbert space;
this is widely known as the Riesz representation theorem.

Theorem (Riesz representation theorem):

Let $H$ be a Hilbert space and $f$ be a function $H \to \mathbb{R}$. Then

$f$ is linear and bounded if and only if $f(x) = \langle a, x \rangle$ for some unique $a \in H$.
Example 1: We know $\mathrm{mean}: \mathbb{R}^n \to \mathbb{R}$, $x \mapsto \mathrm{mean}(x)$, is linear on $\mathbb{R}^n$. Since $\mathbb{R}^n$ is a
Hilbert space, we can find a unique $a \in \mathbb{R}^n$ s.t.
$\mathrm{mean}(x) = \langle a, x \rangle$.
Indeed,
$\mathrm{mean}(x) = \tfrac{1}{n}(x_1 + x_2 + \cdots + x_n) = \tfrac{1}{n}x_1 + \tfrac{1}{n}x_2 + \cdots + \tfrac{1}{n}x_n = \langle a, x \rangle$,
where $a = \left(\tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)^T$.
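As a tiny numerical illustration of this representation (a sketch; the variable names are ours):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])
n = x.size
a = np.full(n, 1.0 / n)          # a = (1/n, ..., 1/n)
print(np.mean(x), np.dot(a, x))  # both equal 4.0
```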

Example 2: Let $H$ be a Hilbert space and $\|\cdot\|$ be the norm on $H$.

It is known that the norm function is NOT linear.

Therefore, there doesn't exist an $a \in H$ such that $\|x\| = \langle a, x \rangle$ for all $x \in H$.

Hyperplanes

Again, we consider $\mathbb{R}^n$ as a Hilbert space, where any linear
function is written as $\langle a, x \rangle$ for some $a \in \mathbb{R}^n$.
Consider the set
$S_{a,0} = \{x \mid \langle a, x \rangle = 0\}$.
Then $\forall x, y \in S_{a,0}$ and $\alpha, \beta \in \mathbb{R}$:
$\langle a, \alpha x + \beta y \rangle = \alpha \langle a, x \rangle + \beta \langle a, y \rangle = 0$, so $\alpha x + \beta y \in S_{a,0}$.
Therefore, $S_{a,0}$ is a linear subspace (a plane).
Since the codimension of $S_{a,0}$ is 1 (because it is defined by one equation),

$S_{a,0}$ is called a hyperplane.


Now, let's consider

$S_{a,b} = \{x \mid \langle a, x \rangle = b\}$, where $b \in \mathbb{R}$ is given.

Let $x_0 \in S_{a,b}$, i.e., $\langle a, x_0 \rangle = b$, be fixed.

Then $S_{a,b} = S_{a,0} + x_0$, because:

$\forall x \in S_{a,b}$: $\langle a, x - x_0 \rangle = \langle a, x \rangle - \langle a, x_0 \rangle = 0 \Rightarrow x - x_0 \in S_{a,0}$, i.e., $x \in S_{a,0} + x_0$;

$\forall x \in S_{a,0}$: $\langle a, x + x_0 \rangle = \langle a, x \rangle + \langle a, x_0 \rangle = b \Rightarrow x + x_0 \in S_{a,b}$.

In other words,
$S_{a,b}$ is a shift of a hyperplane, still called a hyperplane.

This concept can be generalized to any inner product space $V$:
$S_{a,b} = \{x \in V \mid \langle a, x \rangle = b\}$, where $a \in V$ with $a \neq 0$ and $b \in \mathbb{R}$ are given.

Projection onto hyperplanes
Consider a Hilbert space $V$ and a hyperplane $S$:
$S = \{x \in V \mid \langle a, x \rangle = b\}$.

Let $y \in V$ be a given vector.
The vector on $S$ that is closest to $y$ is called the projection of $y$ onto $S$, denoted by $P_S y$:

$P_S y = \arg\min_{x \in S} \|x - y\|$.

Let us find an explicit expression of $P_S y$ in terms of $a$, $b$, and $y$.

Theorem: $z$ is a solution of $\min_{x \in S} \|x - y\|$ if and only if

$z \in S$ and $\langle z - y, z - x \rangle = 0$, $\forall x \in S$.
Proof: (1) We first prove that if $z \in S$ is a solution of $\min_{x \in S} \|x - y\|$,
then $\langle z - y, z - x \rangle = 0$, $\forall x \in S$.
Since $z$ is a solution, $z \in S$, i.e., $\langle a, z \rangle = b$.
$\forall x \in S$ and $t \in \mathbb{R}$, it is easy to see that
$\langle a, (1+t)z - tx \rangle = (1+t)\langle a, z \rangle - t\langle a, x \rangle = (1+t)b - tb = b$.
Therefore, $(1+t)z - tx \in S$.
Since $z$ is closest to $y$ on $S$, we have

$\|z - y\|^2 \le \|(1+t)z - tx - y\|^2 = \|(z - y) + t(z - x)\|^2 = \|z - y\|^2 + 2t\langle z - y, z - x \rangle + t^2\|z - x\|^2$,

i.e.,
$0 \le 2t\langle z - y, z - x \rangle + t^2\|z - x\|^2$, $\forall t \in \mathbb{R}$.
If we choose $t > 0$, dividing by $t$ and letting $t \to 0^+$ gives $\langle z - y, z - x \rangle \ge 0$.
If we choose $t < 0$, dividing by $t$ and letting $t \to 0^-$ gives $\langle z - y, z - x \rangle \le 0$.

Altogether, $z$ satisfies $\langle z - y, z - x \rangle = 0$, $\forall x \in S$.


(2) We then show that if $z \in S$ satisfies $\langle z - y, z - x \rangle = 0$, $\forall x \in S$, then $z$ is
a solution of $\min_{x \in S} \|x - y\|$, by direct calculation.

Since $\langle z - y, z - x \rangle = 0$, $\forall x \in S$:

$\|x - y\|^2 = \|(x - z) + (z - y)\|^2 = \|x - z\|^2 + 2\langle x - z, z - y \rangle + \|z - y\|^2 = \|x - z\|^2 + \|z - y\|^2 \ge \|z - y\|^2$, $\forall x \in S$.

This, together with $z \in S$, implies $z$ minimizes $\|x - y\|$ over $x \in S$.  ∎

Theorem: The solution of $\min_{x \in S} \|x - y\|$ exists and is unique, and it is
given by $P_S y = y - \dfrac{\langle a, y \rangle - b}{\|a\|^2}\, a$.

Proof: Denote $z = y - \dfrac{\langle a, y \rangle - b}{\|a\|^2}\, a$.

$\langle a, z \rangle = \langle a, y \rangle - \dfrac{\langle a, y \rangle - b}{\|a\|^2}\langle a, a \rangle = \langle a, y \rangle - (\langle a, y \rangle - b) = b$. So $z \in S$.

$\forall x \in S$:
$\langle z - y, z - x \rangle = \Big\langle -\dfrac{\langle a, y \rangle - b}{\|a\|^2}\, a,\; z - x \Big\rangle = -\dfrac{\langle a, y \rangle - b}{\|a\|^2}\big(\langle a, z \rangle - \langle a, x \rangle\big) = 0$,
because $\langle a, z \rangle = \langle a, x \rangle = b$.

By the previous theorem, $z$ is a solution of $\min_{x \in S} \|x - y\|$.
It remains to show the uniqueness.

Suppose we have two solutions $z_1$ and $z_2$. Then:

$z_1$ is a solution $\Rightarrow \langle z_1 - y, z_1 - z_2 \rangle = 0$ (taking $x = z_2$);
$z_2$ is a solution $\Rightarrow \langle z_2 - y, z_2 - z_1 \rangle = 0$ (taking $x = z_1$), i.e., $\langle z_2 - y, z_1 - z_2 \rangle = 0$.

Taking the difference leads to $\langle z_1 - z_2, z_1 - z_2 \rangle = 0$,
i.e., $\|z_1 - z_2\|^2 = 0 \Rightarrow z_1 = z_2$.  ∎

In summary, the projection of $y$ onto the hyperplane $S = \{x \mid \langle a, x \rangle = b\}$
exists and is unique. Furthermore,

$P_S y = y - \dfrac{\langle a, y \rangle - b}{\|a\|^2}\, a$.
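As a quick check of this formula, here is a minimal NumPy sketch (the function name `project_onto_hyperplane` is ours for illustration):

```python
import numpy as np

def project_onto_hyperplane(y, a, b):
    """Project y onto the hyperplane S = {x : <a, x> = b}."""
    return y - (np.dot(a, y) - b) / np.dot(a, a) * a

a = np.array([1.0, 2.0, -1.0])
b = 3.0
y = np.array([4.0, 0.0, 2.0])
z = project_onto_hyperplane(y, a, b)
x = np.array([3.0, 0.0, 0.0])                 # another point of S, since <a, x> = 3 = b
print(np.dot(a, z))                            # = b, so z lies on S
print(np.dot(z - y, z - x))                    # = 0, the orthogonality condition of the theorem
```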

Affine functions

A linear function plus a constant is called an affine function.

That is, a function $f: V \to \mathbb{R}$ is affine if $f(x) = g(x) + b$,
where $g: V \to \mathbb{R}$ is linear and $b \in \mathbb{R}$ is a constant.
Properties:

If $f: V \to \mathbb{R}$ is affine, then
$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$, $\forall x, y \in V$ and $\alpha, \beta \in \mathbb{R}$ s.t. $\alpha + \beta = 1$.

To see this, write $f(x) = g(x) + b$ with $g$ linear. Then
$f(\alpha x + \beta y) = g(\alpha x + \beta y) + b = \alpha g(x) + \beta g(y) + (\alpha + \beta)b = \alpha\big(g(x) + b\big) + \beta\big(g(y) + b\big) = \alpha f(x) + \beta f(y)$.
If $f: V \to \mathbb{R}$, where $V$ is a Hilbert space, is affine, then

$f$ must be of the form

$f(x) = \langle a, x \rangle + b$, where $a \in V$ and $b \in \mathbb{R}$.
4.2 Case Studies: Regression and Classification

4.2.1 Linear Regression


Given a set of data
$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$,
where

$x_i \in \mathbb{R}^n$ is an input feature vector, $i = 1, 2, \ldots, N$;
$y_i \in \mathbb{R}$ is the corresponding response to $x_i$.

Given a new input feature vector $x \in \mathbb{R}^n$, how do we predict the
corresponding response $y \in \mathbb{R}$?

For example, $x_i \in \mathbb{R}^n$ represents $n$ attributes of a house,
and $y_i \in \mathbb{R}$ is the selling price.
We want to predict the selling price of a house with feature $x \in \mathbb{R}^n$.

This task is called regression. In this context:
$x_i$ are called regressors (independent variables);
$y_i$ are called dependent variables (outcomes / labels).

The class of all functions $\mathbb{R}^n \to \mathbb{R}$ is too large, and the given
dataset $\{(x_i, y_i)\}$ is not enough to determine a function uniquely.
So we need to choose a function class within which we search for $f$.
Intuitively, a larger $N$ allows a larger function class.
Linear model

We search for $f$ in the class of all affine functions,
i.e., $f(x) = \langle a, x \rangle + b$ for some $a \in \mathbb{R}^n$, $b \in \mathbb{R}$.
Thus we find $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$ s.t.

$\langle a, x_i \rangle + b \approx y_i$, $i = 1, 2, \ldots, N$,

by minimizing the error of the linear equations.

While there are many possible definitions of error, it is popular to
consider the squared error as follows:
$\big(\langle a, x_i \rangle + b - y_i\big)^2$, $i = 1, 2, \ldots, N$.

Therefore, we find $a \in \mathbb{R}^n$, $b \in \mathbb{R}$ by solving

$\min_{a \in \mathbb{R}^n,\, b \in \mathbb{R}} \sum_{i=1}^N \big(\langle a, x_i \rangle + b - y_i\big)^2$.

This problem is called the least squares (LS) problem.

Write
$X = \begin{bmatrix} x_1^T & 1 \\ \vdots & \vdots \\ x_N^T & 1 \end{bmatrix} \in \mathbb{R}^{N \times (n+1)}$, $\quad \beta = \begin{bmatrix} a \\ b \end{bmatrix} \in \mathbb{R}^{n+1}$, $\quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^N$.
Then the LS problem becomes
$\min_{\beta} \|X\beta - y\|_2^2$.

Since we have $N$ equations to fit and $n+1$ unknowns,

$N \ge n+1$ is required to have a unique solution.
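As an illustration, here is a minimal NumPy sketch of solving the LS problem on synthetic data (all names and data are ours):

```python
import numpy as np

# Toy data: N samples, n features
N, n = 50, 3
rng = np.random.default_rng(0)
X_feat = rng.normal(size=(N, n))
y = X_feat @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.01 * rng.normal(size=N)

# Build the design matrix [x_i^T, 1] and solve min_beta ||X beta - y||_2^2
X = np.hstack([X_feat, np.ones((N, 1))])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = beta[:n], beta[n]
print(a, b)  # close to (1, -2, 0.5) and 3
```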

However, in practice, $N$ can be much smaller than $n$.

For example, if the $x_i$ are images of about 10M pixels each, then $n \approx 10^7$,
and it is very difficult to have a database of $10^7$ images
(i.e., $N \ll 10^7$).

Regularization

In many applications we don't have enough data, e.g., $N < n+1$,
so that the class of all affine functions is too large to search for $f$.
We search for $f$ in a subclass of the affine functions instead.

Therefore, instead of searching over $\beta \in \mathbb{R}^{n+1}$, we search over $\beta \in S \subset \mathbb{R}^{n+1}$:
$\min_{\beta \in S} \|X\beta - y\|_2^2$.
In other words, we search for $f$ in
$\mathcal{F} = \{f: \mathbb{R}^n \to \mathbb{R} \mid f(x) = \langle a, x \rangle + b,\ \beta = (a, b) \in S\}$.
By choosing different $S$, we obtain different approaches.

Ridge regression

We choose $S = \{\beta = (a, b) \mid \|a\|_2 \le c\}$ for some $c > 0$.

Then we solve $\min_{\beta \in S} \|X\beta - y\|_2^2$.

By convex optimization theory, this is equivalent to

$\min_{\beta}\ \|X\beta - y\|_2^2 + \lambda \|a\|_2^2$,

where $\lambda > 0$ depends on $c$ and the data.
Here $\lambda\|a\|_2^2$ is the regularization term,
and $\|X\beta - y\|_2^2$ is the data-fitting term.
In other words, we find $\beta = (a, b) \in \mathbb{R}^{n+1}$ such that

the data-fitting error and $\|a\|_2^2$

are minimized simultaneously.

Therefore, ridge regression gives $(a, b)$ such that

$X\beta \approx y$ and $\|a\|_2$ is small, so that $\beta \in S$.
The parameter $\lambda$ is tuned s.t. $\|a\|_2 \le c$.
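A minimal sketch of ridge regression via its closed-form normal equations, penalizing only $a$ and not the intercept $b$, as in the formulation above (the helper name is ours):

```python
import numpy as np

def ridge_fit(X_feat, y, lam):
    """Solve min_{a,b} ||X_feat a + b - y||_2^2 + lam * ||a||_2^2."""
    N, n = X_feat.shape
    X = np.hstack([X_feat, np.ones((N, 1))])
    # Penalize only the first n coefficients (a), not the intercept b.
    D = np.eye(n + 1)
    D[n, n] = 0.0
    beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return beta[:n], beta[n]
```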
LASSO regression

We choose $S = \{\beta = (a, b) \mid \|a\|_1 \le c\}$.

Then we solve $\min_{\beta \in S} \|X\beta - y\|_2^2$.

By convex optimization theory, this is equivalent to

$\min_{\beta}\ \|X\beta - y\|_2^2 + \lambda \|a\|_1$,

where $\lambda > 0$ depends on $c$ and the data.
Here the regularization term is $\lambda\|a\|_1$.
So we find $\beta = (a, b)$ such that
the data-fitting error and $\|a\|_1$
are minimized simultaneously.

A small $\|a\|_1$ tends to give a sparse vector $a \in \mathbb{R}^n$

(i.e., many entries of $a$ are zeros).
Consequently, LASSO gives a solution $(a, b)$ such that
$X\beta \approx y$ and $a$ is sparse.

Thus, given $x \in \mathbb{R}^n$, the prediction is given by

$\langle a, x \rangle + b = \sum_{i=1}^n a_i x_i + b = \sum_{i \in I} a_i x_i + b$,
where $I = \{i \mid a_i \neq 0\}$.
Since $I$ is a small set, the prediction depends only on
a small portion of the entries of $x$.
This is preferred because the prediction is interpretable.
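A minimal sketch using scikit-learn's Lasso (assuming scikit-learn is available; its alpha plays the role of $\lambda$ up to a scaling of the data-fitting term):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(30, 10))
# Only features 0 and 3 matter in this synthetic example.
y = 2.0 * X_feat[:, 0] - 1.5 * X_feat[:, 3] + 0.5

model = Lasso(alpha=0.1)            # alpha ~ lambda (sklearn scales the fit term by 1/(2N))
model.fit(X_feat, y)
a, b = model.coef_, model.intercept_
print(np.nonzero(a)[0])             # sparse: mostly the informative indices 0 and 3
```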
4.2.2 Kernel regression

Linear regression has its limitations.
We extend it to nonlinear regression by the kernel trick.

Feature map: $\Phi: \mathbb{R}^n \to H$, where $H$ is some Hilbert space.
Then we do regression in $H$.
However, since $H$ is very large, the set of all linear functions on $H$
is also too large. We need regularization.
We solve
$\min_{a \in H}\ \dfrac{1}{N}\sum_{i=1}^N \big(\langle a, \Phi(x_i) \rangle - y_i\big)^2 + \lambda \|a\|^2$.

Representer Theorem

Theorem: Any solution of the above minimization is of the form $a = \sum_{i=1}^N c_i \Phi(x_i)$
for some $c = (c_1, \ldots, c_N)^T \in \mathbb{R}^N$.

Proof: For any $a \in H$, we claim that $a$ can be decomposed as

$a = a_S + \sum_{i=1}^N c_i \Phi(x_i)$,
where $c = (c_1, \ldots, c_N)^T \in \mathbb{R}^N$ and $\langle a_S, \Phi(x_i) \rangle = 0$ for $i = 1, 2, \ldots, N$.

Indeed, consider $N = 1$ for simplicity.
Let $S = \{z \in H \mid \langle z, \Phi(x_1) \rangle = 0\}$. Then $S$ is a hyperplane through 0.
Decompose $a = P_S a + (a - P_S a)$. We have
$\langle P_S a - a, P_S a - 0 \rangle = 0$, because $0 \in S$,
i.e., $\langle P_S a - a, P_S a \rangle = 0$.
Also, by direct calculation (using the projection formula),
$a - P_S a = c_1 \Phi(x_1)$ for some $c_1 \in \mathbb{R}$.
So $a_S = P_S a$ and $a = a_S + c_1 \Phi(x_1)$.
For general $N$, it can be done similarly.


Substituting $a = a_S + \sum_{j=1}^N c_j \Phi(x_j)$ into the objective, and using $\langle a_S, \Phi(x_i) \rangle = 0$:

$\langle a, \Phi(x_i) \rangle = \sum_{j=1}^N c_j \langle \Phi(x_j), \Phi(x_i) \rangle = \sum_{j=1}^N c_j K(x_j, x_i)$,
$\|a\|^2 = \Big\|\sum_{j=1}^N c_j \Phi(x_j)\Big\|^2 + \|a_S\|^2 = c^T K c + \|a_S\|^2$,

so the objective becomes

$\dfrac{1}{N}\sum_{i=1}^N \Big(\sum_{j=1}^N c_j K(x_j, x_i) - y_i\Big)^2 + \lambda\, c^T K c + \lambda \|a_S\|^2 = \dfrac{1}{N}\|Kc - y\|^2 + \lambda\, c^T K c + \lambda \|a_S\|^2$,

where $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$ is the kernel function,
$K = \big[K(x_i, x_j)\big] \in \mathbb{R}^{N \times N}$, and $c = (c_1, \ldots, c_N)^T \in \mathbb{R}^N$.

Let $F(c) = \dfrac{1}{N}\|Kc - y\|^2 + \lambda\, c^T K c$ (depends only on $c$)
and $E(a_S) = \lambda \|a_S\|^2$ (depends only on $a_S$).
Then the minimization is the same as

$\min_{c \in \mathbb{R}^N,\ a_S \in H,\ \langle a_S, \Phi(x_i) \rangle = 0}\ F(c) + E(a_S) = \min_{c \in \mathbb{R}^N} F(c) + \min_{a_S} E(a_S)$.

Obviously, $\min_{a_S} E(a_S)$ is solved by $a_S = 0$, because $E(a_S) = \lambda\|a_S\|^2 \ge 0$.

Thus, the solution of the original minimization is

$a = \sum_{i=1}^N c_i \Phi(x_i)$,

where $c \in \mathbb{R}^N$ is a solution of $\min_{c} F(c)$.  ∎
From the above proof, we see that the problem reduces to

$\min_{c \in \mathbb{R}^N}\ \dfrac{1}{N}\|Kc - y\|^2 + \lambda\, c^T K c$.

Let the solution be $c \in \mathbb{R}^N$.
Then the predicted output $y$ for an input $x \in \mathbb{R}^n$ is
$y = \langle a, \Phi(x) \rangle = \Big\langle \sum_{i=1}^N c_i \Phi(x_i), \Phi(x) \Big\rangle = \sum_{i=1}^N c_i K(x_i, x)$.
All the computation involves only the kernel function $K(\cdot, \cdot)$;

no explicit feature map $\Phi(\cdot)$ is needed.

(Figure: a kernel regression result on nonlinear data.)
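A minimal sketch of this kernel regression with a Gaussian (RBF) kernel; assuming $K$ is positive definite, one way to minimize $F(c)$ is to solve its normal equation $(K + \lambda N I)c = y$ (all names and data are ours):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

# Toy 1-D nonlinear data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)

lam, N = 0.1, X.shape[0]
K = rbf_kernel(X, X)
c = np.linalg.solve(K + lam * N * np.eye(N), y)   # minimizer of F(c) when K is positive definite

# Prediction at new points: y(x) = sum_i c_i K(x_i, x)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_new, X) @ c
print(y_pred)
```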
4.2.3 Classification

Classification: Given training data
$(x_1, y_1), \ldots, (x_N, y_N)$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$, $i = 1, \ldots, N$,
find a classifier, i.e., a function $f$ such that

$y_i = \begin{cases} +1 & \text{if } f(x_i) \ge 1, \\ -1 & \text{if } f(x_i) \le -1. \end{cases}$

We use hyperplanes to separate the points:

$f(x) = \langle a, x \rangle + b$, where $a \in \mathbb{R}^n$, $b \in \mathbb{R}$.
The weights $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are normalized such that
$\langle a, x_i \rangle + b \ge +1$ if $y_i = +1$;
$\langle a, x_i \rangle + b \le -1$ if $y_i = -1$.
Support Vector Machine (SVM)

There exist many such linear classifiers.

(Figure: two candidate classifiers; for each, the separating hyperplane $\langle a, x \rangle + b = 0$ and the margin between $\langle a, x \rangle + b = +1$ and $\langle a, x \rangle + b = -1$; the data points lying on these two hyperplanes are the support vectors.)

Which one is better?

From the figure, the blue one is better, because it has a larger margin
and hence a larger buffer zone against misclassification.

Therefore, we want to maximize the margin among all candidates.


Let us calculate the margin in terms of $a$ and $b$.
The margin is the distance between the two hyperplanes

$S_+ = \{x \mid \langle a, x \rangle + b = +1\}$ and $S_- = \{x \mid \langle a, x \rangle + b = -1\}$.

Let $x_+ \in S_+$ and $x_- \in S_-$ be such that
$\|x_+ - x_-\| = \mathrm{dist}(S_+, S_-)$.
Since $x_+$ is the projection of $x_-$ onto $S_+ = \{x \mid \langle a, x \rangle = 1 - b\}$,

$x_+ = x_- - \dfrac{\langle a, x_- \rangle - (1 - b)}{\|a\|^2}\, a = x_- + \dfrac{2}{\|a\|^2}\, a$,

since $x_- \in S_-$, i.e., $\langle a, x_- \rangle = -1 - b$.

Thus $\|x_+ - x_-\| = \Big\|\dfrac{2}{\|a\|^2}\, a\Big\| = \dfrac{2}{\|a\|}$,
i.e., the margin is $\dfrac{2}{\|a\|}$.
Support Vector Machine (SVM):

$\max_{a, b}\ \dfrac{2}{\|a\|}$ s.t. $\langle a, x_i \rangle + b \ge +1$ if $y_i = +1$, and $\langle a, x_i \rangle + b \le -1$ if $y_i = -1$,

which is equivalent to

$\min_{a, b}\ \|a\|^2$ s.t. $y_i\big(\langle a, x_i \rangle + b\big) \ge 1$, $i = 1, \ldots, N$.   (SVM-1)

The above SVM (SVM-1) is NOT robust to noise.

For example (as shown in the figure on the right), even if we
have only two noisy points, there may be no solution to (SVM-1) at all.

We consider a soft version of (SVM-1) as follows.

We minimize the error to the separation:

$\sum_{i=1}^N h\big(y_i(\langle a, x_i \rangle + b) - 1\big)$,

where the function $h$ is
$h(t) = \begin{cases} 0 & \text{if } t \ge 0, \\ |t| & \text{if } t < 0. \end{cases}$

In other words, if $y_i(\langle a, x_i \rangle + b) \ge 1$, the error is 0;
if $y_i(\langle a, x_i \rangle + b) < 1$, the error is the absolute value of $y_i(\langle a, x_i \rangle + b) - 1$.

Also, maximizing the margin $\dfrac{2}{\|a\|}$, i.e., keeping $\|a\|^2$ small, can be viewed as a regularization.

Altogether, we solve

$\min_{a, b}\ \dfrac{1}{N}\sum_{i=1}^N h\big(y_i(\langle a, x_i \rangle + b) - 1\big) + \lambda \|a\|^2$.   (SVM-2)

We call this SVM with a soft margin.

$h$ can be approximated by some smooth function. If $h$ is replaced by
the so-called logistic function, we call it logistic regression.
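A minimal sketch of the soft-margin SVM using scikit-learn (assuming it is available); `LinearSVC` with the hinge loss minimizes an objective of the same hinge-loss-plus-$\|a\|^2$ form as (SVM-2), with the trade-off expressed through $C$ (roughly $C \approx 1/(2\lambda N)$):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(50, 2))
X_neg = rng.normal(loc=-2.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = LinearSVC(C=1.0, loss='hinge')   # hinge loss h plus ||a||^2 regularization
clf.fit(X, y)
a, b = clf.coef_[0], clf.intercept_[0]
print(np.sign(X @ a + b))              # predicted labels sgn(<a, x> + b)
```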
Kernel SVM

The linear SVMs don't work for curved (nonlinearly separable) data.
The kernel method can be used here as well.
Let $\Phi: \mathbb{R}^n \to H$ be a feature map.
We use linear functions on $H$ to classify the points.
By the Riesz representation theorem,
any bounded linear function on $H$ is of the form
$\langle a, x \rangle$ for some $a \in H$.
Therefore, (SVM-2) becomes

$\min_{a \in H,\, b \in \mathbb{R}}\ \dfrac{1}{N}\sum_{i=1}^N h\big(y_i(\langle a, \Phi(x_i) \rangle + b) - 1\big) + \lambda \|a\|_H^2$.   (K-SVM)


Again, one can prove the following representer theorem.

Theorem: Any solution of (K-SVM) is of the form

$a = \sum_{i=1}^N c_i \Phi(x_i)$.

Proof: Write $a = \sum_{i=1}^N c_i \Phi(x_i) + a_S$ for some $c \in \mathbb{R}^N$ and $a_S \in H$ with $\langle a_S, \Phi(x_i) \rangle = 0$.
The rest is the same as in the kernel regression case.  ∎

Thus, (K-SVM) becomes

$\min_{c \in \mathbb{R}^N,\, b \in \mathbb{R}}\ \dfrac{1}{N}\sum_{i=1}^N h\Big(y_i\Big(\sum_{j=1}^N K(x_i, x_j)\, c_j + b\Big) - 1\Big) + \lambda\, c^T K c$,

where $K = \big[K(x_i, x_j)\big]_{i,j} \in \mathbb{R}^{N \times N}$.

The prediction for an input $x$ is given by

$\mathrm{sgn}\Big(\sum_{j=1}^N K(x, x_j)\, c_j + b\Big)$.

Again, only $K(\cdot, \cdot)$ is needed in the kernel SVM,
and no explicit feature map $\Phi(\cdot)$ is required.
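A minimal sketch with scikit-learn's SVC (assuming scikit-learn is available), which solves a kernelized soft-margin problem of this kind using only the kernel; the trade-off is again expressed through $C$ rather than $\lambda$:

```python
import numpy as np
from sklearn.svm import SVC

# Nonlinearly separable toy data: label by distance from the origin
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

clf = SVC(kernel='rbf', C=1.0)   # Gaussian kernel K(x, x') = exp(-gamma ||x - x'||^2)
clf.fit(X, y)
print(clf.predict([[0.1, 0.1], [2.0, 2.0]]))   # expect roughly [-1, 1]
```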
4.3 Linear approximation and Differentiation

Recall that for a function $f: \mathbb{R} \to \mathbb{R}$, the derivative at $x_0$ is

$f'(x_0) = \lim_{x \to x_0} \dfrac{f(x) - f(x_0)}{x - x_0}$,

which is the same as

$\lim_{x \to x_0} \dfrac{\big|f(x) - f(x_0) - f'(x_0)(x - x_0)\big|}{|x - x_0|} = 0$.

Notice that $f(x_0) + f'(x_0)(x - x_0)$ is an affine function on $\mathbb{R}$ that

passes through $(x_0, f(x_0))$.

(Figure: the graph of $f$ and the affine approximation $f(x_0) + f'(x_0)(x - x_0)$ at $x_0$.)

In other words, in differentiation at $x_0$:

(1) $f$ is approximated by an affine function that passes through $(x_0, f(x_0))$;
(2) the error of the approximation is $o(|x - x_0|)$,
i.e., $\lim_{x \to x_0} \dfrac{|\mathrm{error}(x)|}{|x - x_0|} = 0$.

This can be used to define differentiation of functions on Hilbert spaces.

(We use Hilbert spaces for simplicity. It can be easily adapted to Banach spaces.)

Let $f: V \to \mathbb{R}$, with $V$ a Hilbert space.

Consider the differentiation of $f$ at $x_0 \in V$:

(1) By the Riesz representation theorem, any affine function is of the form
$\langle v, x \rangle + a$ for some $v \in V$ and $a \in \mathbb{R}$. Since it passes through
$(x_0, f(x_0))$: $\langle v, x_0 \rangle + a = f(x_0)$. Therefore, the affine function
is of the form
$\langle v, x \rangle + a = \langle v, x - x_0 \rangle + \langle v, x_0 \rangle + a = f(x_0) + \langle v, x - x_0 \rangle$.

(2) The approximation error is
$\mathrm{error}(x) = f(x) - f(x_0) - \langle v, x - x_0 \rangle$.
The error should be of order $o(\|x - x_0\|)$, i.e.,
$\lim_{x \to x_0} \dfrac{|\mathrm{error}(x)|}{\|x - x_0\|} = 0$.

Definition: Let $V$ be a Hilbert space and let $f: V \to \mathbb{R}$. Then $f$ is

said to be Fréchet differentiable at $x_0$ if there exists a $v \in V$ such that

$\lim_{x \to x_0} \dfrac{\big|f(x) - f(x_0) - \langle v, x - x_0 \rangle\big|}{\|x - x_0\|} = 0$.

If $f$ is differentiable at $x_0$, then $v$ is called the gradient of $f$

at $x_0$, denoted by $\nabla f(x_0)$.

Example 1: $f(x) = \|x\|^2$, where $\|\cdot\|$ is the norm on $V$.

At any $x_0 \in V$:

$\|x\|^2 = \|(x - x_0) + x_0\|^2 = \|x_0\|^2 + 2\langle x_0, x - x_0 \rangle + \|x - x_0\|^2$.

Therefore, $\|x_0\|^2 + \langle 2x_0, x - x_0 \rangle$ is the affine approximation, and

$\lim_{x \to x_0} \dfrac{\big|\,\|x\|^2 - \|x_0\|^2 - \langle 2x_0, x - x_0 \rangle\,\big|}{\|x - x_0\|} = \lim_{x \to x_0} \dfrac{\|x - x_0\|^2}{\|x - x_0\|} = \lim_{x \to x_0} \|x - x_0\| = 0$.

Thus $\nabla f(x_0) = 2x_0$.

Example 2: $f(x) = \langle a, x \rangle$ for some $a \in V$.

At any $x_0 \in V$:
$\langle a, x \rangle = \langle a, x_0 \rangle + \langle a, x - x_0 \rangle$.
Therefore,

$\dfrac{\big|f(x) - f(x_0) - \langle a, x - x_0 \rangle\big|}{\|x - x_0\|} = 0$.

Thus $\nabla f(x_0) = a$.

Example 3: $f(x) = \|x - a\|^2$ for some $a \in V$.

At any $x_0 \in V$:

$f(x) = \|x - a\|^2 = \|(x - x_0) + (x_0 - a)\|^2 = \|x_0 - a\|^2 + 2\langle x_0 - a, x - x_0 \rangle + \|x - x_0\|^2$
$\quad\;\;\, = f(x_0) + \langle 2(x_0 - a), x - x_0 \rangle + \|x - x_0\|^2$.

So

$\lim_{x \to x_0} \dfrac{\big|f(x) - f(x_0) - \langle 2(x_0 - a), x - x_0 \rangle\big|}{\|x - x_0\|} = \lim_{x \to x_0} \dfrac{\|x - x_0\|^2}{\|x - x_0\|} = 0$.

Therefore, $\nabla f(x_0) = 2(x_0 - a)$.
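A quick numerical check of this gradient in NumPy by finite differences (all names are ours):

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])
f = lambda x: np.sum((x - a) ** 2)          # f(x) = ||x - a||^2
grad = lambda x: 2.0 * (x - a)              # claimed gradient 2(x - a)

x0 = np.array([0.3, 0.7, -1.2])
eps = 1e-6
# Finite-difference approximation of each partial derivative at x0
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(num_grad, grad(x0), atol=1e-5))   # True
```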

Properties

Fréchet differentiation is linear, i.e.,
$\nabla(\alpha f + \beta g)(x) = \alpha \nabla f(x) + \beta \nabla g(x)$.

Chain rule: Let $f: V \to \mathbb{R}$ and $g: \mathbb{R} \to \mathbb{R}$. Then $g \circ f: V \to \mathbb{R}$ and
$\nabla(g \circ f)(x) = g'\big(f(x)\big)\, \nabla f(x)$,
if $f$ and $g$ are differentiable at $x$ and $f(x)$, respectively.

Example 4: $f(x) = \|x\|$, $\forall x \in V$.

This is a composition of $f_1(x) = \|x\|^2$ from $V \to \mathbb{R}$
and $f_2(t) = \sqrt{t}$ from $\mathbb{R} \to \mathbb{R}$.
When $\|x\| \neq 0$, both $f_1$ and $f_2$ are differentiable.
Also, $\nabla f_1(x) = 2x$ and $f_2'(t) = \dfrac{1}{2\sqrt{t}}$ if $t \neq 0$.
So $\nabla f(x) = \nabla (f_2 \circ f_1)(x) = f_2'\big(f_1(x)\big)\, \nabla f_1(x) = \dfrac{1}{2\sqrt{\|x\|^2}}\, 2x = \dfrac{x}{\|x\|}$.
When $\|x\| = 0$ (i.e., $x = 0$), $f_2$ is NOT differentiable at $f_1(x) = 0$.

It can be shown that $f(x) = \|x\|$ is NOT differentiable at $x_0 = 0$.

For functions on $\mathbb{R}^n$, $f: \mathbb{R}^n \to \mathbb{R}$:

$\nabla f(x) = \Big(\dfrac{\partial f}{\partial x_1}(x), \ldots, \dfrac{\partial f}{\partial x_n}(x)\Big)^T$,

i.e., Fréchet differentiation is the same as the standard
differentiation (the gradient) in multivariate calculus.

Taylor's expansion

From the definition, we see that
$f(x) \approx f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle$,

or, more precisely,

$f(x) = f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle + o(\|x - x_0\|)$.

This is a generalization of Taylor's expansion.

In particular, if $f: \mathbb{R}^n \to \mathbb{R}$,
$f(x) = f(x_0) + \sum_{i=1}^n \dfrac{\partial f}{\partial x_i}(x_0)\,(x_i - x_{0,i}) + o(\|x - x_0\|)$.
Differentiation on normed vector spaces

Let $V$ be a normed vector space.

Let $f: V \to \mathbb{R}$ be a function, and let $x_0 \in V$.

To define differentiation, we still use an affine function approximation,
and the affine function passes through $(x_0, f(x_0))$.
Thus, we use $f(x_0) + L(x - x_0)$, where $L: V \to \mathbb{R}$ is a linear
function, to approximate $f(x)$.

However, because there is no Riesz representation theorem on normed spaces, we
keep the linear function $L$ in the definition of differentiation.

Definition: $f$ is differentiable at $x_0 \in V$ if
there exists a linear function $L: V \to \mathbb{R}$ such that

$\lim_{x \to x_0} \dfrac{\big|f(x) - f(x_0) - L(x - x_0)\big|}{\|x - x_0\|} = 0$.

The linear function $L$ is called the differential of $f$ at $x_0$.

You might also like