
Applied Data Science BootCamp

Machine Learning / Recommendation Systems
Lecture 1

John Tsitsiklis
Devavrat Shah

2020
Introductions

• http://www.mit.edu/~jnt/home.html
• Privacy can be protected by obfuscating actions
• Tradeoff: privacy level versus obfuscation effort
• Toy models suggest generic obfuscation strategies
• Connections to the real world? [DS18; YP19]

Introductions

• http://www.mit.edu/~jnt/home.html
• Overview of this week/module
• Central methods in Machine Learning
  – supervised: learn from examples
  – we will only discuss "supervised learning"
• Predict the value of an unobserved variable
  – Regression (linear) [FIGURE]
• Predict the type of an individual
  – classification
• Assessment
  – How good is our model and our prediction?
  – Testing, validation
Today's agenda

• Machine learning and statistics
  – the bigger picture
  – some key statistical approaches (plugin; maximum likelihood; Bayesian)
• Regression
  – formulation
  – solution
  – interpretation
  – (classical) performance assessment
• Further topics (to continue in next session)
  – what can go wrong
  – ridge regression
  – sparse regression and lasso
  – using nonlinear features of the data
  – overfitting and regularization
  – nonlinear regression
MACHINE LEARNING AND STATISTICS

Plugin estimators
Maximum likelihood estimators
Bayesian estimators
A conceptual big picture

$X$: symptoms, test results, etc.
$Y$: state of health
  – $Y$: life expectancy (real number) [regression]
  – $Y$: sick or not (binary) [classification]

• Understand
  – build a model, a theory, a narrative, a mechanism
• Act/change
• "Predict" $Y$, based on $X$

[Diagram: Data $(X_1, \dots, X_n,\ Y_1, \dots, Y_n)$ → (ML) Prediction / (Stats) Model, via Prob/Stats, with parameter $\theta \in \Theta$]

"All models are wrong, some are useful" (George E. P. Box)

• Quick demo in this Friday's recitation
The pillars of machine learning

• Need a language: probability
• Should not rediscover the wheel: statistics

Statistics versus machine learning — a caricature of a spectrum

• Statistics: deep understanding of simple models/methods; deep theorems on complex models
• Machine learning: fearless methods; algorithmic emphasis; a cumulative art (theory does not always explain success)
• Models: structured (statistics) vs. unstructured or absent (machine learning)
• Data: few/moderate (statistics) vs. big (machine learning)
• Typical question: "Does this treatment help?" (statistics) vs. "Is it a cat or a dog?" (machine learning)
• Algorithms: clean (statistics) vs. heavy (machine learning)
A basic statistical framework

• Unknown parameter (vector) $\theta$, to estimate; e.g., $\theta \in \mathbb{R}^m$ or $\theta \in \{0, 1\}$
• An uncertain/random phenomenon generates data $X_1, \dots, X_n$
• Assumption: the data are drawn independently from a distribution $P_\theta$ (i.i.d.)
  – e.g., $P_\theta(X_i = k) = \theta(1 - \theta)^{k-1}$, for $k = 1, 2, \dots$
• Estimator, to design: $\hat{\Theta} = g(X_1, \dots, X_n)$
• Various approaches to the design:
  – plugin
  – maximum likelihood
  – Bayesian
  – method of moments
  – empirical risk minimization

Notation key:
  – vectors: boldface; scalars: normal font
  – $\mathbf{X}_2$: second data record; $X_2$: second component of a vector $\mathbf{X}$

[Figure: the pmf $P_\theta(X_i = k)$ plotted over $k = 1, \dots, 8$]

Data: 11, 4, 3
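As a concrete illustration of these design approaches on the slide's toy sample (a minimal Python sketch of my own; it assumes the geometric model and the data 11, 4, 3 shown above):

```python
# Plugin / maximum-likelihood sketch for the geometric model
# P_theta(X = k) = theta * (1 - theta)**(k - 1), k = 1, 2, ...
import numpy as np

data = np.array([11, 4, 3])                 # the slide's toy sample

# E_theta[X] = 1/theta, so the plugin estimate substitutes the sample mean
# for E[X]; for this model it coincides with the maximum-likelihood estimate.
theta_hat = 1.0 / data.mean()
print(theta_hat)                            # 1/6 ≈ 0.1667

# Cross-check: maximize the log-likelihood
#   n * log(theta) + sum_i (X_i - 1) * log(1 - theta)  over a grid of theta values.
grid = np.linspace(1e-3, 0.999, 10_000)
loglik = len(data) * np.log(grid) + (data - 1).sum() * np.log1p(-grid)
print(grid[np.argmax(loglik)])              # also ≈ 0.1667
```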
The basic statistical framework (continued)

• Assessment of the estimator $\hat{\Theta}$:
  $\mathbb{E}\big[(\hat{\Theta} - \theta)^2\big]$  or  $\mathbb{P}\big(\hat{\Theta} \neq \theta\big)$
• Bayesian approach: posterior density $f_{\Theta|X}(\,\cdot \mid x)$, posterior mean $\mathbb{E}[\Theta \mid X = x]$

Data: 11, 4, 3
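A Monte Carlo sketch (my own, under the geometric model with θ = 1/6 from the previous example) of the first assessment criterion, the mean squared error of the estimator:

```python
# Estimate E[(Theta_hat - theta)^2] by simulation, for the MLE Theta_hat = 1 / (sample mean).
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 1 / 6, 3, 100_000

samples = rng.geometric(theta, size=(trials, n))   # support {1, 2, ...}, matching the slide's pmf
theta_hat = 1.0 / samples.mean(axis=1)

# Mean squared error at this theta; the criterion P(Theta_hat != theta) is the
# natural alternative when theta ranges over a discrete set such as {0, 1}.
print(((theta_hat - theta) ** 2).mean())
```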

(LINEAR) REGRESSION
An example: Advertising and Sales

• Advertising budget across channels: TV, Radio, and Newspaper
• Data across 200 markets
  – spending for TV, Radio, Newspaper
  – resulting Sales
• Questions
  – Is there a relationship between marketing channel budgets and Sales?
  – If yes, can we "predict" Sales given the channel budgets?

Sample data: $X$ (vector), $Y$ (scalar)

Linear regression
Regression

• Regressor/predictor: $\hat{Y} = g(X)$ or $\hat{Y}(X)$; $X$ has dimension $m$
• "Learn" a "good" $g$ from the data
• Objective: make the risk
  $\mathbb{E}\big[(g(X) - Y)^2\big]$
  small; the ideal minimizer is $g^*(X) = \mathbb{E}[Y \mid X]$
  – turns out: unavailable, since it requires the unknown distribution of $(X, Y)$
• Available proxy, "empirical risk minimization" (ERM): minimize
  $\dfrac{1}{n}\sum_{i=1}^{n}\big(g(X_i) - Y_i\big)^2$
• Restrict to a limited class of predictors:
  $\hat{Y} = \theta_0 + \theta_1 X^{(1)} + \cdots + \theta_m X^{(m)} = \theta^T X$, where $X = (1, X^{(1)}, \dots, X^{(m)})$

Preview of further topics (next session):
• using nonlinear features of the data — e.g., try a multiplicative feature in our example:
  $\widehat{\text{Sales}} = 6.57 + 0.019 \cdot (\text{TV}) + 0.029 \cdot (\text{Radio}) + 0.001 \cdot (\text{TV}) \cdot (\text{Radio})$, with $R^2 = 0.967$
• overfitting (https://en.wikipedia.org/wiki/Overfitting)
  – statistics texts have procedures for adding and removing variables
  – but: they rely on "standard" assumptions, often violated
  – two approaches, often used in combination: regularization methods (penalize overfitting) and data-driven methods (not relying on formulas)
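Back to the risk vs. empirical-risk distinction above: a toy simulation (my own, not from the slides) of why ERM is the workable proxy — the true risk $\mathbb{E}[(g(X) - Y)^2]$ requires the unknown joint distribution of $(X, Y)$, while the empirical risk only needs the sample:

```python
# True risk vs. empirical risk for a fixed candidate predictor g.
import numpy as np

rng = np.random.default_rng(1)

def g(x):                                   # a deliberately suboptimal candidate
    return 2.0 * x

# Simulated "truth": Y = 3X + W with unit-variance X and W, so g*(x) = E[Y | X = x] = 3x.
X = rng.normal(size=1_000_000)
Y = 3.0 * X + rng.normal(size=X.shape)

print(np.mean((g(X) - Y) ** 2))             # Monte Carlo true risk: ≈ 2 (= Var(X) + Var(W))
print(np.mean((g(X[:30]) - Y[:30]) ** 2))   # empirical risk on n = 30 points: a noisy proxy
```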
Linear regression

• With the linear class, $g(X) = \theta^T X$ and $X = (1, X^{(1)}, \dots, X^{(m)})$, empirical risk minimization becomes: minimize, over $\theta$, the sum over data points of the squared residuals,
  $\min_\theta \sum_{i=1}^{n} \big(\theta^T X_i - Y_i\big)^2$
Solution to the regression problem

• $X_i$ and $\theta$ have dimension $m + 1$; there are $n$ data points
• $\min_\theta H(\theta)$, where $H(\theta) = \sum_{i=1}^{n} (\theta^T X_i - Y_i)^2$ is quadratic in $\theta$
• Optimality conditions: $\nabla H(\theta) = 0$, i.e., $\dfrac{\partial H}{\partial \theta_j} = 0$, for $j = 0, 1, \dots, m$
  – a linear system of $m + 1$ equations
  – linear algebra solvers, packages
• Formulas: stack the data as
  $\mathbf{X} = \begin{bmatrix} 1 & X_1^{(1)} & \cdots & X_1^{(m)} \\ \vdots & \vdots & & \vdots \\ 1 & X_n^{(1)} & \cdots & X_n^{(m)} \end{bmatrix}$ (an $n \times (m+1)$ matrix), $\quad \mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}$

  $\hat{\theta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$

  – assuming invertibility (the data vectors $X_i$ are not confined to a lower-dimensional subspace)
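A minimal numpy sketch (my own) of this closed form on synthetic data; in practice a packaged least-squares solver is preferred over forming the inverse explicitly:

```python
# Solving min_theta sum_i (theta^T X_i - Y_i)^2 via the normal equations.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
features = rng.uniform(0.0, 100.0, size=(n, m))
X = np.column_stack([np.ones(n), features])       # n x (m+1) design matrix; first column all ones
theta_star = np.array([3.0, 0.05, 0.2, 0.001])
Y = X @ theta_star + rng.normal(0.0, 1.5, size=n)

# Solve (X^T X) theta = X^T Y rather than inverting X^T X.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Same answer from a library solver, which is more robust when X^T X is near-singular.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_hat)
print(theta_lstsq)
```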
Results for our example

$n = 200$, $m + 1 = 4$

$\hat{\theta} = \begin{bmatrix} 2.94 \\ 0.046 \\ 0.19 \\ 0.001 \end{bmatrix}$

$\widehat{\text{Sales}} = 2.94 + 0.046 \cdot (\text{TV}) + 0.19 \cdot (\text{Radio}) + 0.001 \cdot (\text{NewsP})$

• Compare with simple linear regression:
  $\widehat{\text{Sales}} = 12.35 + 0.055 \cdot (\text{NewsP})$
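A sketch of how one could reproduce these numbers, assuming the data are the ISLR "Advertising" dataset saved locally as Advertising.csv with columns TV, Radio, Newspaper, Sales (the file name and column names are my assumptions, not given on the slide):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("Advertising.csv")              # 200 markets (assumed local copy)
X = np.column_stack([np.ones(len(df)), df["TV"], df["Radio"], df["Newspaper"]])
Y = df["Sales"].to_numpy()

theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_hat)                                 # expect ≈ [2.94, 0.046, 0.19, 0.001]

# Simple linear regression on the Newspaper budget alone:
Xn = np.column_stack([np.ones(len(df)), df["Newspaper"]])
print(np.linalg.lstsq(Xn, Y, rcond=None)[0])     # expect ≈ [12.35, 0.055]
```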
How to generate these results — interpretation and justification

• Empirical risk minimization
  – the true relation may be complex, but we find the best linear predictor, based on the data
  – guarantees?
• Maximum likelihood — illustrated for $m = 1$
  – assume a structural model: $Y_i = \theta_0^* + \theta_1^* X_i + W_i$, with $W_i \sim N(0, \sigma^2)$
  – conditioned on all the $X_i$: all the $W_i$ are independent and Normal$(0, \sigma^2)$
  – factor the likelihood, $P(X, Y) = P(X) \cdot P(Y \mid X)$, and maximize
    $\max_\theta\ P(Y \mid X; \theta) = \max_\theta\ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(Y_i - \theta_0 - \theta_1 X_i)^2}{2\sigma^2}\Big)$
  – since $\prod_i \exp(\cdot) = \exp\big(\sum_i \cdot\big)$: maximizing the likelihood function = minimizing the empirical risk
• Maximum likelihood estimates have appealing theoretical guarantees
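Spelling out that step in the slide's notation (my expansion; nothing beyond the algebra already stated):

```latex
\max_{\theta_0, \theta_1}\ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(Y_i - \theta_0 - \theta_1 X_i)^2}{2\sigma^2}\right)
=
\max_{\theta_0, \theta_1}\ \frac{1}{(2\pi\sigma^2)^{n/2}}
  \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \theta_0 - \theta_1 X_i)^2\right)
```

Since $\exp$ is increasing and the constant factor does not depend on $\theta$, this is equivalent to $\min_{\theta_0, \theta_1} \sum_{i=1}^{n} (Y_i - \theta_0 - \theta_1 X_i)^2$, i.e., to empirical risk minimization.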

PERFORMANCE ASSESSMENT: $R^2$ (R-squared)
$R^2$ (R-squared)

• We ran our regression; how well did we do?
• Total sum of squares: $\text{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$, where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
  – the variation in $Y$, without doing any regression
• Residual sum of squares: $\text{RSS} = \sum_{i=1}^{n} (Y_i - \hat{\theta}^T X_i)^2$
  – the unexplained variation in $Y$, after taking $X$ into account
• $R^2 = 1 - \dfrac{\text{RSS}}{\text{TSS}}$
  – the fraction of the variation in $Y$ that has been explained
  – $0 \le R^2 \le 1$; high $R^2$ is generally preferred
  – in simple regression, $R^2$ is an estimate of the square of the correlation coefficient between $X$ and $Y$

$R^2$ for our example

$\widehat{\text{Sales}} = 2.94 + 0.046 \cdot (\text{TV}) + 0.19 \cdot (\text{Radio}) + 0.001 \cdot (\text{NewsP})$

• Newspaper alone: $R^2 = 0.05$ — the Newspaper budget explains little
• All budgets together: $R^2 = 0.897$ — the budgets together explain a lot
• For TV alone: $R^2 = 0.61$; for Radio alone: $R^2 = 0.33$
• More variables: $R^2$ can only go up (or stay the same)
  – but this may be a mirage
  – adjusted $R^2$: $1 - \dfrac{\text{RSS}/(n - m - 1)}{\text{TSS}/(n - 1)}$; here, $0.897 \to 0.896$
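A short helper (my own sketch) implementing these definitions directly; with the advertising design matrix from the earlier sketches, it should return roughly (0.897, 0.896):

```python
# R^2 and adjusted R^2, following the slide's definitions.
import numpy as np

def r_squared(X, Y):
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = np.sum((Y - X @ theta_hat) ** 2)       # unexplained variation
    tss = np.sum((Y - Y.mean()) ** 2)            # variation without any regression
    n, p = X.shape                               # p = m + 1, counting the intercept column
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (rss / (n - p)) / (tss / (n - 1))
    return r2, adj_r2
```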
Performance assessment: back to basics

• Introduce concepts through a simple example
• Structural model: $Y_i = \theta^* + W_i$, $i = 1, \dots, n$; data generated as $Y = \theta^* + W$
  – $W_i$: independent, zero mean, variance $\sigma^2$
  – $\theta^* = \mathbb{E}[Y]$
• "Plugin" estimate, the sample mean:
  $\hat{\Theta} = \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
  – the same as empirical risk minimization: $\min_\theta \sum_{i=1}^{n} (Y_i - \theta)^2$
  – the same as maximum likelihood, if the $W_i$ are normal

Performance assessment for the sample mean

• $\mathbb{E}[\hat{\Theta}] = \theta^*$  (unbiased)
• Model/parameter estimation error:
  $\mathbb{E}\big[(\hat{\Theta} - \theta^*)^2\big] = \dfrac{\sigma^2}{n}$
• Prediction error, with $\hat{Y} = \hat{\Theta}$: write $Y - \hat{\Theta} = (Y - \theta^*) + (\theta^* - \hat{\Theta})$; the two terms are uncorrelated, so
  $\mathbb{E}\big[(Y - \hat{\Theta})^2\big] = \mathbb{E}\big[(Y - \theta^*)^2\big] + \mathbb{E}\big[(\theta^* - \hat{\Theta})^2\big] = \sigma^2 + \dfrac{\sigma^2}{n}$
  (irreducible error plus estimation error)
• Estimating the noise variance:
  $\sigma^2 = \mathbb{E}\big[(Y - \theta^*)^2\big] \approx \frac{1}{n}\sum_{i=1}^{n}(Y_i - \theta^*)^2 \approx \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{\Theta})^2 = \hat{\sigma}^2$

Sample mean: confidence intervals and hypothesis tests

• $\hat{\Theta} \sim \text{Normal}(\theta^*, \sigma^2/n)$
  [Figure: normal density of $\hat{\Theta}$, with 2.5% of the probability in each tail beyond $\theta^* \pm 1.96\,\sigma/\sqrt{n}$]
• What does this say about the "accuracy" of $\hat{\Theta}$?
  $\mathbb{P}\big(|\hat{\Theta} - \theta^*| \le 1.96 \cdot \hat{\sigma}/\sqrt{n}\big) \approx 95\%$
• 95% confidence interval:
  $\text{CI} = \Big[\hat{\Theta} - \frac{1.96\,\hat{\sigma}}{\sqrt{n}},\ \hat{\Theta} + \frac{1.96\,\hat{\sigma}}{\sqrt{n}}\Big]$, with $\mathbb{P}(\theta^* \in \text{CI}) \approx 95\%$
• Hypothesis testing: null hypothesis $\theta^* = \theta_0 = 0$
  – reject the null if $|\hat{\Theta}| \ge 1.96\,\hat{\sigma}/\sqrt{n}$
  – equivalently: if the confidence interval does not contain 0
  – false rejection rate: $\mathbb{P}(\text{reject} \mid \theta^* = 0) \approx 5\%$
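A quick simulation plus helper (my own sketch) tying the last few slides together: it checks $\mathbb{E}[\hat{\Theta}] = \theta^*$ and $\mathbb{E}[(\hat{\Theta} - \theta^*)^2] = \sigma^2/n$, then computes the 95% interval and test using the estimated $\hat{\sigma}$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star, sigma, n, trials = 5.0, 2.0, 25, 200_000

# Check unbiasedness and the sigma^2 / n estimation error.
Y = theta_star + sigma * rng.normal(size=(trials, n))
theta_hat = Y.mean(axis=1)                      # one sample mean per simulated dataset
print(theta_hat.mean())                         # ≈ 5.0
print(((theta_hat - theta_star) ** 2).mean())   # ≈ sigma^2 / n = 4 / 25 = 0.16

def ci_and_test(y, z=1.96):
    """95% confidence interval for theta*, and the test of H0: theta* = 0."""
    n = len(y)
    theta_hat = y.mean()
    sigma_hat = y.std(ddof=1)                   # estimated noise standard deviation
    half_width = z * sigma_hat / np.sqrt(n)
    ci = (theta_hat - half_width, theta_hat + half_width)
    reject_null = abs(theta_hat) >= half_width  # equivalently: 0 is not in the interval
    return ci, reject_null

print(ci_and_test(Y[0]))
```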
How noisy/reliable are my estimates of θ* and my predictions of Y?

• Assume the structural model  Yᵢ = (θ*)ᵀXᵢ + Wᵢ ,  i = 1, . . . , n

  – independence:  Wᵢ ∼ N(0, σ²)
  – (If not, need to resort to simulation/bootstrap methods)

• Assume the Xᵢ's have been set

• θ̂ⱼ is a noisy estimate of θ*ⱼ, because of the noises Wᵢ

• Estimate σ² by

  σ̂² = (1/n) Σᵢ (Yᵢ − θ̂ᵀXᵢ)²

  – slight downwards bias (cf. the unbiased TSS/(n − 1) in the sample-mean case);
    negligible bias if m ≪ n

• Why? For large samples, θ̂ ≈ θ*, and

  σ² = E[Wᵢ²] ≈ (1/n) Σᵢ Wᵢ² = (1/n) Σᵢ (Yᵢ − (θ*)ᵀXᵢ)² ≈ (1/n) Σᵢ (Yᵢ − θ̂ᵀXᵢ)²

25
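A sketch of this estimate in code (my own helper; it assumes a design matrix X whose first column is all ones, and uses the least-squares fit for θ̂):

```python
import numpy as np

def fit_and_noise_variance(X, y):
    """Least-squares fit and the residual-variance estimate sigma_hat^2.

    X: (n, m+1) design matrix, first column all ones (intercept); y: (n,) responses.
    """
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes sum (Y_i - theta^T X_i)^2
    residuals = y - X @ theta_hat
    sigma2_hat = np.mean(residuals ** 2)               # (1/n) sum (Y_i - theta_hat^T X_i)^2
    return theta_hat, sigma2_hat
```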
Error covariance matrix

• θ̂ = (XᵀX)⁻¹ XᵀY, assuming invertibility
  (the data vectors Xᵢ are not confined to a lower-dimensional plane)

• E[θ̂] = θ* (unbiased)

• Covariance matrix of θ̂:

  E[ (θ̂ − θ*)(θ̂ − θ*)ᵀ ] = σ² (XᵀX)⁻¹

  – dimensions: (m + 1) × (m + 1)
  – in practice, use σ̂²

• Variance of θ̂ⱼ:  Var(θ̂ⱼ) = Var(θ̂ⱼ − θ*ⱼ): the diagonal entries of this matrix

• θ̂ is approximately (multivariate) normal

• standard error:  ŝe(θ̂ⱼ) = √Var(θ̂ⱼ)

26
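A sketch of these formulas in code (same assumptions as the earlier sketch: X is the (n, m+1) design matrix with an intercept column):

```python
import numpy as np

def covariance_and_standard_errors(X, y):
    """Estimated covariance matrix sigma_hat^2 (X^T X)^{-1} and standard errors."""
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T Y
    sigma2_hat = np.mean((y - X @ theta_hat) ** 2)
    cov = sigma2_hat * np.linalg.inv(X.T @ X)       # (m+1) x (m+1) covariance matrix
    se = np.sqrt(np.diag(cov))                      # se(theta_hat_j): sqrt of diagonal
    return theta_hat, cov, se
```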
Confidence intervals and hypothesis tests

• sampling distribution (of θ̂ⱼ):

  θ̂ⱼ − θ*ⱼ ≈ normal, zero mean, standard deviation ≈ ŝe(θ̂ⱼ)

  P( |θ̂ⱼ − θ*ⱼ| ≤ 1.96 · ŝe(θ̂ⱼ) ) ≈ 95%

• if we construct confidence intervals of the form

  [ θ̂ⱼ − 1.96 · ŝe(θ̂ⱼ) , θ̂ⱼ + 1.96 · ŝe(θ̂ⱼ) ]

  then 95% of the time, θ*ⱼ will be inside

• "Null hypothesis:" θ*ⱼ = 0

  – Wald test: reject the "null" if |θ̂ⱼ| ≥ 1.96 · ŝe(θ̂ⱼ)
  – equivalently: if the confidence interval does not contain 0
  – 5% probability of false rejection

27
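Putting the last two slides together, a sketch (my own helper, not library code) that builds the 95% interval and runs the Wald test on every coefficient at once:

```python
import numpy as np

def wald_tests(X, y, z=1.96):
    """Per-coefficient 95% confidence intervals and Wald tests of theta*_j = 0."""
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_hat = np.mean((y - X @ theta_hat) ** 2)
    se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
    ci = np.column_stack([theta_hat - z * se, theta_hat + z * se])
    reject = np.abs(theta_hat) >= z * se   # equivalently: 0 lies outside the CI
    return theta_hat, se, ci, reject
```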
Back to our example: the software side

• Example: Sales data; estimated σ² and the error covariance matrix of the parameters:

               const             TV                Radio             Newspaper
  const        9.72867479E-02   -2.65727337E-04   -1.11548946E-03   -5.91021239E-04
  TV          -2.65727337E-04    1.9457371E-06    -4.47039463E-07   -3.26595026E-07
  Radio       -1.11548946E-03   -4.47039463E-07    7.41533504E-05   -1.78006245E-05
  Newspaper   -5.91021239E-04   -3.26595026E-07   -1.78006245E-05    3.44687543E-05

• (estimated) standard errors; 95% confidence intervals

• Wald test:
  – intercept, TV, Radio are "significant" (reject the hypothesis that they are zero)
  – Newspaper: the hypothesis that θ*NewsP = 0 "survives" (not rejected);
    equivalently, its confidence interval contains 0

• A conceptual demo:
  – No competing stores/dealers: advertise little, strong sales no matter what
  – Competing stores/dealers: advertise a lot, but sales can never be too high

28
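The output above is what a standard regression package reports. A sketch of how one might reproduce it (the file name "Advertising.csv" and its column names are assumptions; statsmodels is one common choice, not the tool the slides prescribe):

```python
import pandas as pd
import statsmodels.api as sm

# Assumed: an Advertising.csv with columns TV, Radio, Newspaper, Sales.
data = pd.read_csv("Advertising.csv")
X = sm.add_constant(data[["TV", "Radio", "Newspaper"]])
fit = sm.OLS(data["Sales"], X).fit()

print(fit.cov_params())  # error covariance matrix of the parameter estimates
print(fit.bse)           # (estimated) standard errors
print(fit.conf_int())    # 95% confidence intervals
print(fit.summary())     # Wald-type t tests: is each coefficient zero?
```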
Making new predictions

• After running the regression (based on n data), you are given some new X*;
  you want to predict the corresponding Y*

• Prediction:  Ŷ* = θ̂ᵀX*

• Keep assuming the structural model:  Y* = (θ*)ᵀX* + W*

• Prediction error:  Ŷ* − Y* = (θ̂ − θ*)ᵀX* − W*

• E[Ŷ* | X*] = E[Y* | X*] (unbiased);
  θ̂ᵀX* is an unbiased estimate of (θ*)ᵀX* (and of Y*)

• Two sources of error:
  – unavoidable, from W*; variance σ²
  – due to the estimation error θ̂ − θ*; variance of (θ̂ − θ*)ᵀX*: σ² X*ᵀ(XᵀX)⁻¹X*

• Total prediction error variance:  σ² + σ² X*ᵀ(XᵀX)⁻¹X*

• Ŷ* is approximately normal

• 95% confidence interval about the value of (θ*)ᵀX*:

  θ̂ᵀX* plus or minus 1.96 · σ̂ · √( X*ᵀ(XᵀX)⁻¹X* )

  – confidence interval width changes with X*
  – in simple regression, this gives a confidence band

• 95% confidence interval about the value Y*:

  θ̂ᵀX* plus or minus 1.96 · σ̂ · √( 1 + X*ᵀ(XᵀX)⁻¹X* )

29
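A sketch of the prediction recipe (same design-matrix assumptions as before; x_star is the new feature vector including the leading 1, and the helper name is mine):

```python
import numpy as np

def predict_with_interval(X, y, x_star, z=1.96):
    """Point prediction theta_hat^T x* and a 95% interval for the new Y*."""
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma_hat = np.sqrt(np.mean((y - X @ theta_hat) ** 2))
    leverage = x_star @ np.linalg.inv(X.T @ X) @ x_star   # x*^T (X^T X)^{-1} x*
    y_hat = x_star @ theta_hat
    half_width = z * sigma_hat * np.sqrt(1.0 + leverage)  # the 1 accounts for W*
    return y_hat, (y_hat - half_width, y_hat + half_width)
```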
Confidence intervals and bands

(Notation: here X denotes the new feature vector at which we predict; the matrix
XᵀX is built from the training data.)

• 95% confidence interval about the value of (θ*)ᵀX:

  θ̂ᵀX plus or minus 1.96 · σ̂ · √( Xᵀ(XᵀX)⁻¹X )

  – confidence interval width changes with X
  – in simple regression, this gives a confidence band

• 95% confidence interval about the value Y*:

  θ̂ᵀX* plus or minus 1.96 · σ̂ · √( 1 + X*ᵀ(XᵀX)⁻¹X* )

  – in simple regression, this gives a prediction band

30
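To make the bands concrete, a sketch for simple regression that evaluates both bands on a grid of x values (plotting is left out; all names are my own):

```python
import numpy as np

def regression_bands(x, y, grid, z=1.96):
    """Confidence band for theta_0 + theta_1*x and prediction band for Y on a grid."""
    X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma_hat = np.sqrt(np.mean((y - X @ theta_hat) ** 2))
    XtX_inv = np.linalg.inv(X.T @ X)
    G = np.column_stack([np.ones_like(grid), grid])  # grid points as feature vectors
    center = G @ theta_hat
    h = np.einsum("ij,jk,ik->i", G, XtX_inv, G)      # x^T (X^T X)^{-1} x per grid point
    conf_half = z * sigma_hat * np.sqrt(h)           # band for the model (theta*)^T x
    pred_half = z * sigma_hat * np.sqrt(1.0 + h)     # wider band for the outcome Y*
    return center, conf_half, pred_half
```

The prediction band is strictly wider than the confidence band: it carries the extra irreducible term from W* inside the square root.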
Summary

• Are we interested in the model θᵀX, or in the predictions?

• Linear regression — everything there is to know about:
  – formulation
  – underlying assumptions
  – formulas
  – results: their interpretation and usage

• Still, many things can go wrong or be misinterpreted

• New issues when θ has high dimension

• Next session. . .

31