Linear Regression Lecture Slides in 40 Characters

Linear Regression and the Bias-Variance Tradeoff
Guest Lecturer: Joseph E. Gonzalez
Slides available here: http://tinyurl.com/reglecture
Simple Linear Regression
[Figure: scatter plot of the response variable Y against the covariate X, with a fitted line]
Linear Model: $Y = mX + b$, where m is the slope and b is the intercept (bias)
Motivation
•  One of the most widely used techniques
•  Fundamental to many larger models
–  Generalized Linear Models
–  Collaborative filtering
•  Easy to interpret
•  Efficient to solve
Multiple Linear Regression
The Regression Model
•  For a single data point (x, y):
–  Independent variable (vector), observed and conditioned on: $x \in \mathbb{R}^p$
–  Response variable (scalar): $y \in \mathbb{R}$
•  Joint probability:
$$p(x, y) = p(x)\, p(y \mid x)$$
–  Discriminative model: $p(y \mid x)$
The Linear Model
$$y = \theta^T x + \epsilon$$
•  $y$: scalar response
•  $\theta$: vector of parameters
•  $x$: vector of covariates
•  $\epsilon$: real-valued noise
•  Linear combination of covariates: $\theta^T x = \sum_{i=1}^{p} \theta_i x_i$
•  Noise model: $\epsilon \sim N(0, \sigma^2)$
What about the bias/intercept term ($+\, b$)?
•  Define: $x_{p+1} = 1$
•  Then redefine $p := p + 1$ for notational simplicity
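Conditional Likelihood p(y|x)
•  Conditioned on x, $\theta^T x$ is a constant (the mean) and the noise is normally distributed with variance $\sigma^2$:
$$y = \theta^T x + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
•  Conditional distribution of Y:
$$Y \sim N(\theta^T x, \sigma^2)$$
$$p(y \mid x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(y - \theta^T x)^2}{2\sigma^2}\right)$$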
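As a small illustration of the density above, the following sketch evaluates $p(y \mid x)$ directly from the formula. The function name `conditional_density` and the numeric values are hypothetical, chosen only to make the example concrete.

```python
import math

def conditional_density(y, x, theta, sigma):
    """Evaluate p(y | x) for y ~ N(theta^T x, sigma^2)."""
    mean = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical values, for illustration only.
theta = [2.0, -1.0]
x = [1.0, 3.0]  # theta^T x = 2 - 3 = -1
peak = conditional_density(-1.0, x, theta, sigma=1.0)      # density at the mean
off_peak = conditional_density(0.0, x, theta, sigma=1.0)   # density one sigma away
```

The density is largest at $y = \theta^T x$ and falls off symmetrically on either side.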
Parameters and Random Variables
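$$y \sim N(\theta^T x, \sigma^2), \quad \text{with parameters } \theta \text{ and } \sigma^2$$
•  Conditional distribution of y:
–  Bayesian: parameters as random variables: $p(y \mid x, \theta, \sigma^2)$
–  Frequentist: parameters as (unknown) constants: $p_{\theta, \sigma^2}(y \mid x)$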
So far …
[Figure: a single data point plotted in (X1, X2, Y) space, captioned "I'm lonely"]
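Independent and Identically Distributed (iid) Data
•  For n data points:
$$D = \{(x_1, y_1), \ldots, (x_n, y_n)\} = \{(x_i, y_i)\}_{i=1}^{n}$$
Plate Diagram
[Figure: plate diagram with nodes $x_i$ (independent variable, vector, $x_i \in \mathbb{R}^p$) and $y_i$ (response variable, scalar, $y_i \in \mathbb{R}$) inside a plate indexed by $i \in \{1, \ldots, n\}$]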
Joint Probability
[Figure: plate diagram over $x_i$ and $y_i$, repeated n times]
•  For n data points independent and identically distributed (iid):
$$p(D) = \prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(x_i)\, p(y_i \mid x_i)$$
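Rewriting with Matrix Notation
•  Represent data $D = \{(x_i, y_i)\}_{i=1}^{n}$ as:
–  Covariate (design) matrix, one data point per row:
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}$$
–  Response vector:
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^{n}$$
•  Assume X has rank p (not degenerate)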
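Rewriting with Matrix Notation
•  Rewriting the model using matrix operations:
$$Y = X\theta + \epsilon$$
where $Y \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times p}$, $\theta \in \mathbb{R}^{p \times 1}$, and $\epsilon \in \mathbb{R}^{n \times 1}$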
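The matrix form can be made concrete with a short NumPy sketch that builds a design matrix, appends the constant column $x_{p+1} = 1$ for the intercept, and simulates $Y = X\theta + \epsilon$. The sizes, parameter values, and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                  # illustrative sizes (assumptions)

raw = rng.normal(size=(n, p))                  # one covariate vector per row
X = np.hstack([raw, np.ones((n, 1))])          # append x_{p+1} = 1 for the intercept
theta = np.array([1.5, -2.0, 0.5, 4.0])        # last entry plays the role of b
sigma = 0.1
eps = rng.normal(scale=sigma, size=n)          # epsilon ~ N(0, sigma^2), iid

Y = X @ theta + eps                            # Y = X theta + epsilon
```

After the intercept column is appended, the model has p + 1 = 4 parameters, which is the "redefine p := p + 1" step from the slides.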
Estimating the Model
•  Given data, how can we estimate θ?
$$Y = X\theta + \epsilon$$
•  Construct the maximum likelihood estimator (MLE):
–  Derive the log-likelihood
–  Find $\hat{\theta}_{MLE}$ that maximizes the log-likelihood
•  Analytically: take the derivative and set it to 0
•  Iteratively: (stochastic) gradient descent
Joint Probability
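•  For n data points:
$$p(D) = \prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(x_i)\, p(y_i \mid x_i)$$
•  Discriminative model: $p(x_i)$ does not depend on θ, so it is treated as a constant ("1") and only $p(y_i \mid x_i)$ is modeled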
Defining the Likelihood
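$$p_\theta(y \mid x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(y - \theta^T x)^2}{2\sigma^2}\right)$$
$$L(\theta \mid D) = \prod_{i=1}^{n} p_\theta(y_i \mid x_i)$$
$$= \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)$$
$$= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2\right)$$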
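Maximizing the Likelihood
•  Want to compute:
$$\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} L(\theta \mid D)$$
•  To simplify the calculations we take the log:
$$\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} \log L(\theta \mid D)$$
which does not affect the maximization because log is a monotone function.
[Figure: plot of the log function, which is monotonically increasing]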
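$$L(\theta \mid D) = \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2\right)$$
•  Take the log:
$$\log L(\theta \mid D) = -\log\left(\sigma^n (2\pi)^{n/2}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
•  Removing terms that are constant with respect to θ (the additive constant and the positive factor $1/(2\sigma^2)$, neither of which changes the maximizer):
$$\log L(\theta) = -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$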
Monotone Function (Easy to maximize)
$$\log L(\theta) = -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
•  Want to compute:
$$\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} \log L(\theta \mid D)$$
•  Plugging in the log-likelihood:
$$\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
•  Dropping the sign and flipping from maximization to minimization:
$$\hat{\theta}_{MLE} = \arg\min_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \quad \text{(minimize the sum of squared errors)}$$
•  Gaussian noise model ⇒ squared loss
–  Least squares regression
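The equivalence between maximizing the Gaussian log-likelihood and minimizing the sum of squared errors can be checked numerically. The sketch below ranks a few candidate parameter vectors both ways; the data, the candidates, and the function names are assumptions for illustration.

```python
import numpy as np

def log_likelihood(theta, X, Y, sigma):
    """Gaussian log-likelihood: -log(sigma^n (2 pi)^(n/2)) - SSE / (2 sigma^2)."""
    n = len(Y)
    resid = Y - X @ theta
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - resid @ resid / (2 * sigma**2)

def sum_squared_error(theta, X, Y):
    resid = Y - X @ theta
    return resid @ resid

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
Y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=50)

# Hypothetical candidate parameter vectors.
candidates = [np.array([1.0, -1.0]), np.array([0.0, 0.0]), np.array([2.0, 1.0])]
best_by_likelihood = max(range(3), key=lambda k: log_likelihood(candidates[k], X, Y, 0.5))
best_by_sse = min(range(3), key=lambda k: sum_squared_error(candidates[k], X, Y))
```

Because the log-likelihood is a decreasing function of the sum of squared errors, both criteria always select the same candidate.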
Pictorial Interpretation of Squared Error
[Figure: squared error illustrated on a plot of y against x]
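Maximizing the Likelihood (Minimizing the Squared Error)
$$\hat{\theta}_{MLE} = \arg\min_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
[Figure: convex function $-\log L(\theta)$ plotted against θ; the slope is 0 at the minimizer $\hat{\theta}_{MLE}$]
•  Take the gradient and set it equal to zero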
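Minimizing the Squared Error
$$\hat{\theta}_{MLE} = \arg\min_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
•  Taking the gradient:
$$-\nabla_\theta \log L(\theta) = \nabla_\theta \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
$$= -2 \sum_{i=1}^{n} (y_i - \theta^T x_i)\, x_i \quad \text{(chain rule)}$$
$$= -2 \sum_{i=1}^{n} y_i x_i + 2 \sum_{i=1}^{n} (\theta^T x_i)\, x_i$$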
•  Rewriting the gradient in matrix form:
$$-\nabla_\theta \log L(\theta) = -2 \sum_{i=1}^{n} y_i x_i + 2 \sum_{i=1}^{n} (\theta^T x_i)\, x_i = -2 X^T Y + 2 X^T X \theta$$
•  To make sure the negative log-likelihood is convex, compute the second derivative (Hessian):
$$-\nabla_\theta^2 \log L(\theta) = 2 X^T X$$
•  If X is full rank then $X^T X$ is positive definite and therefore $\hat{\theta}_{MLE}$ is the minimum
–  Address the degenerate cases with regularization
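A finite-difference check is one way to confirm the matrix form of the gradient and the positive definiteness of the Hessian. This is a minimal sketch on random data; the function names and problem sizes are assumptions.

```python
import numpy as np

def neg_log_lik(theta, X, Y):
    """Sum of squared errors, i.e. -log L(theta) after dropping constants."""
    r = Y - X @ theta
    return r @ r

def gradient(theta, X, Y):
    """Matrix form of the gradient: -2 X^T Y + 2 X^T X theta."""
    return -2 * X.T @ Y + 2 * X.T @ X @ theta

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
Y = rng.normal(size=30)
theta = rng.normal(size=3)

# Central differences along each coordinate direction.
h = 1e-6
numeric = np.array([
    (neg_log_lik(theta + h * e, X, Y) - neg_log_lik(theta - h * e, X, Y)) / (2 * h)
    for e in np.eye(3)
])

hessian = 2 * X.T @ X
eigenvalues = np.linalg.eigvalsh(hessian)  # all positive when X has full column rank
```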
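$$-\nabla_\theta \log L(\theta) = -2 X^T Y + 2 X^T X \theta = 0$$
•  Setting the gradient equal to 0 and solving for $\hat{\theta}_{MLE}$:
$$(X^T X)\, \hat{\theta}_{MLE} = X^T Y \quad \text{(normal equations; write on board)}$$
$$\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y$$
Dimensions: $\hat{\theta}_{MLE}$ is $p \times 1$, $(X^T X)^{-1}$ is $p \times p$, $X^T$ is $p \times n$, and $Y$ is $n \times 1$.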
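The normal equations can be solved directly as a linear system. The sketch below does so with `np.linalg.solve` and compares the result with NumPy's least-squares routine `np.linalg.lstsq`; the simulated data and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical true parameters
Y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Solve (X^T X) theta = X^T Y without forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Solving the system is preferred to computing $(X^T X)^{-1}$ and multiplying, which costs more and is less accurate numerically.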
Geometric Interpretation
•  View the MLE as finding a projection onto col(X)
–  Define the estimator: $\hat{Y} = X\theta$
–  Observe that $\hat{Y}$ is in col(X)
•  linear combination of the columns of X
–  Want the $\hat{Y}$ closest to Y
•  Implies $(Y - \hat{Y})$ is normal to col(X):
$$X^T (Y - \hat{Y}) = X^T (Y - X\theta) = 0 \;\Rightarrow\; X^T X \theta = X^T Y$$
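The orthogonality condition is easy to verify numerically: at the least-squares solution the residual has zero inner product with every column of X, and any other θ gives a prediction farther from Y. The data and the perturbation below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
Y = rng.normal(size=40)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ theta_hat                  # projection of Y onto col(X)
residual = Y - Y_hat

orthogonality = X.T @ residual         # should be (numerically) the zero vector

# A perturbed parameter vector gives a point in col(X) farther from Y.
theta_other = theta_hat + np.array([0.1, -0.2, 0.3])
dist_hat = np.linalg.norm(residual)
dist_other = np.linalg.norm(Y - X @ theta_other)
```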
Connection to Pseudo-Inverse
$$\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y = X^\dagger Y$$
where $X^\dagger = (X^T X)^{-1} X^T$ is the Moore-Penrose pseudoinverse
•  Generalization of the inverse:
–  Consider the case when X is square and invertible:
$$X^\dagger = (X^T X)^{-1} X^T = X^{-1} (X^T)^{-1} X^T = X^{-1}$$
–  Which implies $\hat{\theta}_{MLE} = X^{-1} Y$, the solution to $X\theta = Y$ when X is square and invertible
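Both claims about the pseudoinverse can be checked with `np.linalg.pinv`. The sketch assumes a random full-rank tall matrix and a small hand-picked invertible square matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))                     # tall matrix, full column rank

pinv_formula = np.linalg.inv(X.T @ X) @ X.T      # (X^T X)^{-1} X^T
pinv_numpy = np.linalg.pinv(X)                   # Moore-Penrose pseudoinverse

# Square invertible case (determinant is 25), chosen by hand for illustration.
S = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 4.0]])
pinv_square = np.linalg.pinv(S)
inv_square = np.linalg.inv(S)
```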
Computing the MLE
$$\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y$$
•  Not typically solved by inverting $X^T X$
•  Solved using direct methods:
–  Cholesky factorization:
•  Up to a factor of 2 faster
–  QR factorization:
•  More numerically stable
–  Or use the built-in solver in your math library. R: solve(Xt %*% X, Xt %*% y)
•  Solved using various iterative methods:
–  Krylov subspace methods
–  (Stochastic) gradient descent
http://www.seas.ucla.edu/~vandenbe/103/lectures/qr.pdf
Cholesky Factorization
Solve $(X^T X)\, \hat{\theta}_{MLE} = X^T Y$, written as $C\, \hat{\theta}_{MLE} = d$
•  Compute the symmetric matrix $C = X^T X$: $O(np^2)$
•  Compute the vector $d = X^T Y$: $O(np)$
•  Cholesky factorization $L L^T = C$: $O(p^3)$
–  L is lower triangular
•  Forward substitution to solve $Lz = d$: $O(p^2)$
•  Backward substitution to solve $L^T \hat{\theta}_{MLE} = z$: $O(p^2)$
Connections to graphical model inference:
http://ssg.mit.edu/~willsky/publ_pdfs/185_pub_MLR.pdf and
http://yaroslavvb.blogspot.com/2011/02/junction-trees-in-numerical-analysis.html with illustrations
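The steps above can be sketched with NumPy's `np.linalg.cholesky`. For brevity the two triangular systems are handed to the general-purpose `np.linalg.solve`, which does not exploit the triangular structure; a dedicated triangular solver would be needed to reach the $O(p^2)$ cost. The data are simulated under assumed sizes and parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 5
X = rng.normal(size=(n, p))
Y = X @ np.arange(1.0, p + 1.0) + rng.normal(scale=0.1, size=n)

C = X.T @ X                          # symmetric p x p matrix, O(n p^2)
d = X.T @ Y                          # length-p vector, O(n p)
L = np.linalg.cholesky(C)            # C = L L^T with L lower triangular, O(p^3)

# General solver used here for simplicity (an assumption of this sketch).
z = np.linalg.solve(L, d)            # forward step:  L z = d
theta_hat = np.linalg.solve(L.T, z)  # backward step: L^T theta = z
```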
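Solving Triangular System
$$\begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ 0 & A_{22} & A_{23} & A_{24} \\ 0 & 0 & A_{33} & A_{34} \\ 0 & 0 & 0 & A_{44} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}$$
Solving Triangular System
Row by row, the system reads:
$$A_{11} x_1 + A_{12} x_2 + A_{13} x_3 + A_{14} x_4 = b_1$$
$$A_{22} x_2 + A_{23} x_3 + A_{24} x_4 = b_2$$
$$A_{33} x_3 + A_{34} x_4 = b_3$$
$$A_{44} x_4 = b_4$$
Solve from the last row upward:
$$x_4 = b_4 / A_{44}$$
$$x_3 = (b_3 - A_{34} x_4) / A_{33}$$
$$x_2 = (b_2 - A_{23} x_3 - A_{24} x_4) / A_{22}$$
$$x_1 = (b_1 - A_{12} x_2 - A_{13} x_3 - A_{14} x_4) / A_{11}$$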
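The row-by-row procedure above is back substitution. A minimal sketch, assuming an upper-triangular matrix with nonzero diagonal entries; the function name `back_substitute` and the example matrix are hypothetical.

```python
import numpy as np

def back_substitute(A, b):
    """Solve A x = b for upper-triangular A with nonzero diagonal, in O(n^2)."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Subtract the already-solved terms A[i, j] * x[j] for j > i.
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, -1.0, 3.0],
              [0.0, 1.0, 4.0, 2.0],
              [0.0, 0.0, 3.0, -1.0],
              [0.0, 0.0, 0.0, 5.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
x = back_substitute(A, b)
```

Forward substitution for a lower-triangular system is the same loop run from the first row downward.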
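Distributed Direct Solution (Map-Reduce)
$$\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y$$
•  Distribute the computation of the sums:
$$C = X^T X = \sum_{i=1}^{n} x_i x_i^T \quad (p \times p), \quad O(np^2)$$
$$d = X^T Y = \sum_{i=1}^{n} x_i y_i \quad (p \times 1), \quad O(np)$$
•  Solve the system $C\, \hat{\theta}_{MLE} = d$ on the master: $O(p^3)$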
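Because C and d are sums over data points, each partition of the data can compute its partial sums independently and the results can be added together. The sketch below imitates this on one machine with `functools.reduce`; the partition count and the function names `map_partition` and `reduce_stats` are assumptions, not part of any particular map-reduce framework.

```python
import numpy as np
from functools import reduce

def map_partition(X_part, Y_part):
    """Partial sums for one partition: (sum x_i x_i^T, sum x_i y_i)."""
    return X_part.T @ X_part, X_part.T @ Y_part

def reduce_stats(a, b):
    """Combine two partial results by adding them."""
    return a[0] + b[0], a[1] + b[1]

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Split the rows into 4 partitions, as if they lived on 4 workers.
partitions = zip(np.array_split(X, 4), np.array_split(Y, 4))
C, d = reduce(reduce_stats, (map_partition(Xp, Yp) for Xp, Yp in partitions))

theta_hat = np.linalg.solve(C, d)    # the small p x p solve happens on the "master"
```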
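Gradient Descent:
What if p is large? (e.g., n/2)
•  The cost of $O(np^2) = O(n^3)$ could be prohibitive
•  Solution: Iterative methods
–  Gradient descent on the negative log-likelihood:
For τ from 0 until convergence:
$$\theta^{(\tau+1)} = \theta^{(\tau)} - \rho(\tau)\, \nabla\left(-\log L(\theta^{(\tau)} \mid D)\right)$$
where $\rho(\tau)$ is the learning rate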
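Gradient Descent Illustrated:
[Figure: convex function $-\log L(\theta)$ plotted against θ, with iterates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}$ descending to the minimum, where the slope is 0 and $\theta^{(3)} = \hat{\theta}_{MLE}$]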
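Gradient Descent:
What if p is large? (e.g., n/2)
•  The cost of $O(np^2) = O(n^3)$ could be prohibitive
•  Solution: Iterative methods
–  Gradient descent on the negative log-likelihood, absorbing the constant factor 2 and the $1/n$ scaling into the learning rate:
$$\theta^{(\tau+1)} = \theta^{(\tau)} - \rho(\tau)\, \nabla\left(-\log L(\theta^{(\tau)} \mid D)\right)$$
$$= \theta^{(\tau)} + \rho(\tau)\, \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta^{(\tau)T} x_i)\, x_i \quad O(np) \text{ per step}$$
•  Can we do better? Use an estimate of the gradient.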
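A minimal sketch of the batch update above, assuming a constant learning rate and a fixed number of steps in place of a convergence test. Both choices, along with the simulated data, are assumptions for illustration.

```python
import numpy as np

def gradient_descent(X, Y, rho=0.1, steps=500):
    """Batch gradient descent for least squares with a constant step size."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(steps):
        # (1/n) * sum_i (y_i - theta^T x_i) x_i, computed in O(n p)
        direction = X.T @ (Y - X @ theta) / n
        theta = theta + rho * direction
    return theta

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

theta_gd = gradient_descent(X, Y)
theta_direct = np.linalg.lstsq(X, Y, rcond=None)[0]
```

With a constant rate the iteration converges only if the rate is below $2 / \lambda_{\max}(X^T X / n)$, which holds for this well-conditioned example.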
Stochastic Gradient Descent
•  Construct a noisy estimate of the gradient:
1) pick a random i
2) update at cost $O(p)$:
$$\theta^{(\tau+1)} = \theta^{(\tau)} + \rho(\tau)\, (y_i - \theta^{(\tau)T} x_i)\, x_i$$
•  Sensitive to the choice of $\rho(\tau)$; typically $\rho(\tau) = 1/\tau$
•  Also known as Least-Mean-Squares (LMS)
•  Applies to streaming data with $O(p)$ storage
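The two-step update can be sketched as follows. The learning rate here is $1/(\tau + \tau_0)$ with an offset $\tau_0 = 10$, a variation on the $1/\tau$ schedule that is assumed in order to damp the first few steps; the step count, data, and function name are also assumptions.

```python
import numpy as np

def sgd_least_squares(X, Y, steps=20000, offset=10.0, seed=0):
    """Stochastic gradient descent (LMS) using one random data point per step."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for tau in range(1, steps + 1):
        i = rng.integers(n)                    # 1) pick a random i
        rho = 1.0 / (tau + offset)             # decaying learning rate (assumed form)
        theta = theta + rho * (Y[i] - theta @ X[i]) * X[i]   # 2) O(p) update
    return theta

data_rng = np.random.default_rng(9)
X = data_rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])        # hypothetical true parameters
Y = X @ theta_true + data_rng.normal(scale=0.1, size=500)

theta_sgd = sgd_least_squares(X, Y)
```

Each step touches a single data point, so only θ and the current point need to be held in memory.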
Fitting Non-linear Data
•  What if Y has a non-linear response?
[Figure: scatter plot of data with a non-linear trend; horizontal axis from 1 to 6, vertical axis from -1.5 to 2.0]
•  Can we still use a linear model?
Transforming the Feature Space
•  Transform features $x_i$:
$$x_i = (X_{i,1}, X_{i,2}, \ldots, X_{i,p})$$
•  By applying a non-linear transformation φ:
$$\phi : \mathbb{R}^p \to \mathbb{R}^k$$
•  Example:
$$\phi(x) = \{1, x, x^2, \ldots, x^k\}$$
–  others: splines, radial basis functions, …
–  Expert engineered features (modeling)
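The polynomial example can be sketched with `np.vander`, which builds the columns $1, x, x^2, \ldots, x^k$. The model stays linear in θ, so the same least-squares machinery applies to the transformed features. The cubic test function and noise level are assumptions for illustration.

```python
import numpy as np

def poly_features(x, k):
    """Map scalar inputs to phi(x) = (1, x, x^2, ..., x^k), one row per input."""
    return np.vander(x, k + 1, increasing=True)

rng = np.random.default_rng(10)
x = np.linspace(-1.0, 1.0, 50)
# Hypothetical non-linear response: a cubic plus a little noise.
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.01, size=50)

Phi = poly_features(x, 3)                           # 50 x 4 design matrix
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```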
Under-fitting and Over-fitting
[Figure: four panels of fits to the same data with bases $\{1\}$, $\{1, x\}$, $\{1, x, x^2, x^3\}$, and $\{1, x, x^2, x^3, x^4, x^5\}$, labeled from under-fitting to over-fitting; horizontal axis from 1 to 6, vertical axis from -2 to 2]
Really Over-fitting!
[Figure: fit with the degree-14 polynomial basis $\{1, x, x^2, \ldots, x^{14}\}$; horizontal axis from 1 to 6, vertical axis from -2 to 2]
•  Errors on training data are small
•  But errors on new points are likely to be large
What if I train on different data?
Low Variability:
[Figure: three panels, each showing the basis $\{1, x, x^2, x^3\}$ fit to a different training set]
High Variability:
[Figure: three panels, each showing the basis $\{1, x, x^2, \ldots, x^{14}\}$ fit to a different training set]
Bias-Variance Tradeoff
•  So far we have minimized the error (loss) with respect to training data
–  Low training error does not imply good expected performance: over-fitting
•  We would like to reason about the expected loss (prediction risk) over:
–  Training data: {(y1, x1), …, (yn, xn)}
–  Test point: (y*, x*)
•  We will decompose the expected loss into:
$$E_{D,(y_*,x_*)}\left[(y_* - f(x_* \mid D))^2\right] = \text{Noise} + \text{Bias}^2 + \text{Variance}$$
•  Define the (unobserved) true model h, assuming zero-mean noise (any bias goes in $h(x_*)$):
$$y_* = h(x_*) + \epsilon_*$$
•  Completing the square with $h(x_*) = h_*$ and $f(x_* \mid D) = f_*$:
$$E_{D,(y_*,x_*)}\left[(y_* - f(x_* \mid D))^2\right] \quad \text{(expected loss)}$$
$$= E_{D,(y_*,x_*)}\left[(y_* - h(x_*) + h(x_*) - f(x_* \mid D))^2\right]$$
With $a = y_* - h(x_*)$ and $b = h(x_*) - f(x_* \mid D)$, apply $(a + b)^2 = a^2 + b^2 + 2ab$:
$$= E_{\epsilon_*}\left[(y_* - h(x_*))^2\right] + E_D\left[(h(x_*) - f(x_* \mid D))^2\right] + 2 E_{D,(y_*,x_*)}\left[y_* h_* - y_* f_* - h_* h_* + h_* f_*\right]$$
•  Substitute the definition $y_* = h_* + \epsilon_*$ into the cross term:
$$E\left[(h_* + \epsilon_*) h_* - (h_* + \epsilon_*) f_* - h_* h_* + h_* f_*\right]$$
$$= h_* h_* + E[\epsilon_*]\, h_* - h_* E[f_*] - E[\epsilon_*]\, E[f_*] - h_* h_* + h_* E[f_*] = 0$$
since $E[\epsilon_*] = 0$ and $\epsilon_*$ is independent of the training data D.
•  The expected loss therefore splits into two terms:
$$E_{D,(y_*,x_*)}\left[(y_* - f(x_* \mid D))^2\right] = E_{\epsilon_*}\left[(y_* - h(x_*))^2\right] + E_D\left[(h(x_*) - f(x_* \mid D))^2\right]$$
–  First term: noise (out of our control)
–  Second term: model estimation error (we want to minimize this)
•  Minimum error is governed by the noise.
•  Expanding on the model estimation error:
$$E_D\left[(h(x_*) - f(x_* \mid D))^2\right]$$
•  Completing the square with $E[f(x_* \mid D)] = \bar{f}_*$:
$$E_D\left[(h(x_*) - f(x_* \mid D))^2\right]$$
$$= E\left[(h(x_*) - E[f(x_* \mid D)] + E[f(x_* \mid D)] - f(x_* \mid D))^2\right]$$
$$= E\left[(h(x_*) - E[f(x_* \mid D)])^2\right] + E\left[(f(x_* \mid D) - E[f(x_* \mid D)])^2\right] + 2 E\left[h_* \bar{f}_* - h_* f_* - \bar{f}_*^2 + \bar{f}_* f_*\right]$$
•  The cross term vanishes because $E[f_*] = \bar{f}_*$:
$$E\left[h_* \bar{f}_* - h_* f_* - \bar{f}_*^2 + \bar{f}_* f_*\right] = h_* \bar{f}_* - h_* E[f_*] - \bar{f}_*^2 + \bar{f}_* E[f_*] = h_* \bar{f}_* - h_* \bar{f}_* - \bar{f}_*^2 + \bar{f}_*^2 = 0$$
•  The first term is not random, so its expectation can be dropped:
$$E_D\left[(h(x_*) - f(x_* \mid D))^2\right] = \underbrace{(h(x_*) - E[f(x_* \mid D)])^2}_{(\text{Bias})^2} + \underbrace{E\left[(f(x_* \mid D) - E[f(x_* \mid D)])^2\right]}_{\text{Variance}}$$
•  Tradeoff between bias and variance:
–  Simple Models: High Bias, Low Variance
–  Complex Models: Low Bias, High Variance
Summary of Bias Variance Tradeoff
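$$E_{D,(y_*,x_*)}\left[(y_* - f(x_* \mid D))^2\right] \quad \text{(expected loss)}$$
$$= E_{\epsilon_*}\left[(y_* - h(x_*))^2\right] \quad \text{(noise)}$$
$$+ \left(h(x_*) - E_D[f(x_* \mid D)]\right)^2 \quad ((\text{bias})^2)$$
$$+ E_D\left[(f(x_* \mid D) - E_D[f(x_* \mid D)])^2\right] \quad \text{(variance)}$$
•  Choice of models balances bias and variance.
–  Over-fitting ⇒ variance is too high
–  Under-fitting ⇒ bias is too high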
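The decomposition can be estimated by simulation: refit a model on many independently drawn training sets and measure, at each test point, how far the average prediction is from the truth (bias squared) and how much the predictions scatter (variance). This sketch assumes a true model $h(x) = \sin(\pi x)$, a fixed set of training inputs with only the noise redrawn, and polynomial models of degree 1 and 9; all of these are illustrative assumptions.

```python
import numpy as np

def h(x):
    return np.sin(np.pi * x)          # assumed "true" model, for illustration only

def fit_predict(x_train, y_train, x_test, degree):
    """Least-squares polynomial fit on the training set, evaluated at x_test."""
    Phi = np.vander(x_train, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    return np.vander(x_test, degree + 1, increasing=True) @ theta

rng = np.random.default_rng(11)
x_train = np.linspace(-1.0, 1.0, 30)  # fixed design; only the noise is redrawn
x_test = np.linspace(-0.9, 0.9, 19)
sigma, trials = 0.3, 300

results = {}
for degree in (1, 9):
    preds = np.array([
        fit_predict(x_train, h(x_train) + rng.normal(scale=sigma, size=30),
                    x_test, degree)
        for _ in range(trials)
    ])                                                   # shape (trials, 19)
    bias_sq = np.mean((h(x_test) - preds.mean(axis=0)) ** 2)
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)                # averaged over test points
```

In this setup the degree-1 model shows the larger bias and the degree-9 model the larger variance, matching the simple-versus-complex pattern in the slides.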
Bias Variance Plot
Image from http://scott.fortmann-roe.com/docs/BiasVariance.html
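Analyze bias of $f(x_* \mid D) = x_*^T \hat{\theta}_{MLE}$
•  Assume the true model is linear: $h(x_*) = x_*^T \theta$
$$\text{bias} = h(x_*) - E_D[f(x_* \mid D)]$$
$$= x_*^T \theta - E_D\left[x_*^T \hat{\theta}_{MLE}\right] \quad \text{(substitute the MLE)}$$
$$= x_*^T \theta - E_D\left[x_*^T (X^T X)^{-1} X^T Y\right]$$
$$= x_*^T \theta - E_D\left[x_*^T (X^T X)^{-1} X^T (X\theta + \epsilon)\right] \quad \text{(plug in the definition of Y)}$$
$$= x_*^T \theta - E_D\left[x_*^T (X^T X)^{-1} X^T X \theta + x_*^T (X^T X)^{-1} X^T \epsilon\right] \quad \text{(expand)}$$
$$= x_*^T \theta - E_D\left[x_*^T \theta + x_*^T (X^T X)^{-1} X^T \epsilon\right] \quad \text{(cancel)}$$
$$= x_*^T \theta - x_*^T \theta - x_*^T (X^T X)^{-1} X^T E_D[\epsilon] \quad \text{(assumption: } E_D[\epsilon] = 0\text{)}$$
$$= x_*^T \theta - x_*^T \theta = 0$$
•  $\hat{\theta}_{MLE}$ is unbiased!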
Analyze Variance of $f(x_* \mid D) = x_*^T \hat{\theta}_{MLE}$
•  Assume the true model is linear: $h(x_*) = x_*^T \theta$
$$\text{Var.} = E\left[(f(x_* \mid D) - E_D[f(x_* \mid D)])^2\right]$$
$$= E\left[(x_*^T \hat{\theta}_{MLE} - x_*^T \theta)^2\right] \quad \text{(substitute the MLE and the unbiased result)}$$
$$= E\left[(x_*^T (X^T X)^{-1} X^T Y - x_*^T \theta)^2\right]$$
$$= E\left[(x_*^T (X^T X)^{-1} X^T (X\theta + \epsilon) - x_*^T \theta)^2\right] \quad \text{(plug in the definition of Y)}$$
$$= E\left[(x_*^T \theta + x_*^T (X^T X)^{-1} X^T \epsilon - x_*^T \theta)^2\right] \quad \text{(expand and cancel)}$$
$$= E\left[(x_*^T (X^T X)^{-1} X^T \epsilon)^2\right]$$
•  Use the property of a scalar: $a^2 = a\, a^T$
$$= E\left[(x_*^T (X^T X)^{-1} X^T \epsilon)(x_*^T (X^T X)^{-1} X^T \epsilon)^T\right]$$
$$= E\left[x_*^T (X^T X)^{-1} X^T \epsilon \epsilon^T (x_*^T (X^T X)^{-1} X^T)^T\right]$$
$$= x_*^T (X^T X)^{-1} X^T E\left[\epsilon \epsilon^T\right] (x_*^T (X^T X)^{-1} X^T)^T$$
$$= x_*^T (X^T X)^{-1} X^T \sigma_\epsilon^2 I\, (x_*^T (X^T X)^{-1} X^T)^T$$
$$= \sigma_\epsilon^2\, x_*^T (X^T X)^{-1} X^T X (x_*^T (X^T X)^{-1})^T$$
$$= \sigma_\epsilon^2\, x_*^T (x_*^T (X^T X)^{-1})^T$$
$$= \sigma_\epsilon^2\, x_*^T (X^T X)^{-1} x_*$$
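Both the unbiasedness result and the variance formula can be checked by simulation. The sketch below holds the design matrix X fixed, redraws the noise many times, refits, and compares the empirical mean and variance of the prediction at a test point against $x_*^T \theta$ and $\sigma_\epsilon^2 x_*^T (X^T X)^{-1} x_*$. All sizes and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(12)
n, p, sigma, trials = 50, 3, 0.5, 20000
X = rng.normal(size=(n, p))                     # fixed design matrix
theta = np.array([1.0, -1.0, 2.0])              # hypothetical true parameters
x_star = np.array([0.5, 1.0, -1.5])             # hypothetical test point

eps = rng.normal(scale=sigma, size=(trials, n)) # one noise vector per trial
Y = X @ theta + eps                             # shape (trials, n)

# Solve the normal equations for all trials at once: columns are theta-hats.
theta_hats = np.linalg.solve(X.T @ X, X.T @ Y.T)    # shape (p, trials)
preds = x_star @ theta_hats                          # f(x_* | D) for each trial

empirical_mean = preds.mean()
empirical_var = preds.var()
predicted_var = sigma**2 * x_star @ np.linalg.solve(X.T @ X, x_star)
```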
Consequence of Variance Calculation
$$\text{Var.} = E\left[(f(x_* \mid D) - E_D[f(x_* \mid D)])^2\right] = \sigma_\epsilon^2\, x_*^T (X^T X)^{-1} x_*$$
[Figure: two scatter plots of y against x, labeled "Higher Variance" and "Lower Variance"]
Figure from http://people.stern.nyu.edu/wgreene/MathStat/GreeneChapter4.pdf
Summary
•  Least-squares regression is unbiased:
$$E_D\left[x_*^T \hat{\theta}_{MLE}\right] = x_*^T \theta$$
•  Variance depends on:
$$E\left[(f(x_* \mid D) - E[f(x_* \mid D)])^2\right] = \sigma_\epsilon^2\, x_*^T (X^T X)^{-1} x_* \approx \sigma_\epsilon^2\, \frac{p}{n}$$
–  Number of data points n
–  Dimensionality p
–  Not on observations Y
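Deriving the final identity
•  Assume the entries of $x_i$ and $x_*$ are independent N(0, 1)
$$E_{X,x_*}[\text{Var.}] = \sigma_\epsilon^2\, E_{X,x_*}\left[x_*^T (X^T X)^{-1} x_*\right]$$
$$= \sigma_\epsilon^2\, E_{X,x_*}\left[\mathrm{tr}\left(x_* x_*^T (X^T X)^{-1}\right)\right]$$
$$= \sigma_\epsilon^2\, \mathrm{tr}\left(E_{X,x_*}\left[x_* x_*^T (X^T X)^{-1}\right]\right)$$
$$= \sigma_\epsilon^2\, \mathrm{tr}\left(E_{x_*}\left[x_* x_*^T\right] E_X\left[(X^T X)^{-1}\right]\right)$$
$$\approx \frac{\sigma_\epsilon^2}{n}\, \mathrm{tr}\left(E_{x_*}\left[x_* x_*^T\right]\right) \quad \text{(using } E_X\left[(X^T X)^{-1}\right] \approx \tfrac{1}{n} I \text{ for large n)}$$
$$= \frac{\sigma_\epsilon^2}{n}\, p$$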
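The $p/n$ approximation can be checked by averaging $x_*^T (X^T X)^{-1} x_*$ over random draws of X and $x_*$. For standard Gaussian data the exact expectation is $p / (n - p - 1)$, which is close to $p/n$ when n is much larger than p. The sizes and trial count below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(13)
n, p, trials = 200, 5, 2000

values = np.empty(trials)
for t in range(trials):
    X = rng.normal(size=(n, p))          # rows x_i with iid N(0, 1) entries
    x_star = rng.normal(size=p)          # test point drawn the same way
    values[t] = x_star @ np.linalg.solve(X.T @ X, x_star)

average = values.mean()                  # Monte Carlo estimate of the expectation
approx = p / n                           # approximation from the slides
exact = p / (n - p - 1)                  # exact value for Gaussian covariates
```

Multiplying `average` by $\sigma_\epsilon^2$ gives the expected prediction variance.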
Gauss-Markov Theorem
•  The linear model:
$$f(x_*) = x_*^T \hat{\theta}_{MLE} = x_*^T (X^T X)^{-1} X^T Y$$
has the minimum variance among all unbiased linear estimators
–  Note that this is linear in Y
•  BLUE: Best Linear Unbiased Estimator
Summary
•  Introduced the least-squares regression model
–  Maximum likelihood: Gaussian noise
–  Loss function: squared error
–  Geometric interpretation: minimizing projection
•  Derived the normal equations:
–  Walked through the process of constructing the MLE
–  Discussed efficient computation of the MLE
•  Introduced basis functions for non-linearity
–  Demonstrated issues with over-fitting
•  Derived the classic bias-variance tradeoff
–  Applied to the least-squares model
Additional Reading I Found Helpful
•  http://www.stat.cmu.edu/~roeder/stat707/lectures.pdf
•  http://people.stern.nyu.edu/wgreene/MathStat/GreeneChapter4.pdf
•  http://www.seas.ucla.edu/~vandenbe/103/lectures/qr.pdf
•  http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf
