
1 Probability and Bayes rule

Marginalization rule

P(X) = \sum_{y \in \mathrm{Val}(Y)} P(X, Y = y)

P(B_2 = G) = P(B_2 = G, B_1 = G) + P(B_2 = G, B_1 = M) + P(B_2 = G, B_1 = A)
Chain Rule
P(X, Y) = P(X|Y)\,P(Y) = P(Y|X)\,P(X)

P(B_2 = G, B_1 = G) = P(B_2 = G | B_1 = G)\,P(B_1 = G) = 1 \times 0.9 = 0.9
P(B_2 = G, B_1 = M) = P(B_2 = G | B_1 = M)\,P(B_1 = M) = 0.1 \times 0.1 = 0.01
P(B_2 = G, B_1 = A) = P(B_2 = G | B_1 = A)\,P(B_1 = A) = 0.2 \times 0 = 0
P(B_2 = G) = 0.9 + 0.01 + 0 = 0.91

b. What is P(B1 = G|B2 = G) in this example?

Bayes Rule
P(X|Y) = \frac{P(Y|X)\,P(X)}{P(Y)}

P(B_1 = G | B_2 = G) = \frac{P(B_2 = G | B_1 = G)\,P(B_1 = G)}{P(B_2 = G)} = \frac{1 \times 0.9}{0.91} \approx 0.989
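
As a quick numeric check, the marginalization and Bayes-rule steps above can be reproduced in a few lines of Python, using only the probabilities given in the problem:

# Prior over B1 and conditional P(B2 = G | B1), as given above.
p_b1 = {"G": 0.9, "M": 0.1, "A": 0.0}
p_b2G_given_b1 = {"G": 1.0, "M": 0.1, "A": 0.2}

# Marginalization: P(B2 = G) = sum over b1 of P(B2 = G | B1 = b1) P(B1 = b1)
p_b2G = sum(p_b2G_given_b1[b] * p_b1[b] for b in p_b1)

# Bayes rule: P(B1 = G | B2 = G)
p_b1G_given_b2G = p_b2G_given_b1["G"] * p_b1["G"] / p_b2G

print(p_b2G)            # 0.91
print(p_b1G_given_b2G)  # ~0.989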

2 MLE and MAP (15 points)


Suppose that the sample is D = \{X_1, X_2, \ldots, X_n\}, where k_i denotes the i-th observed count.
P(D|\lambda) = \prod_{i=1}^{n} \frac{\lambda^{k_i} e^{-\lambda}}{k_i!}, \qquad \hat{\lambda}_{\mathrm{MLE}} = \arg\max_{\lambda} P(D|\lambda)

Log-likelihood of data:
\ln P(D|\lambda) = \ln \prod_{i=1}^{n} \frac{\lambda^{k_i} e^{-\lambda}}{k_i!} = \ln\left(e^{-n\lambda} \prod_{i=1}^{n} \frac{\lambda^{k_i}}{k_i!}\right) = -n\lambda + (\ln \lambda) \sum_{i=1}^{n} k_i - \sum_{i=1}^{n} \ln k_i!

Taking the derivative with respect to λ and setting it to zero:


\frac{d \ln P(D|\lambda)}{d\lambda} = \frac{d}{d\lambda}\left(-n\lambda + (\ln \lambda)\sum_{i=1}^{n} k_i - \sum_{i=1}^{n} \ln k_i!\right) = -n + \frac{\sum_{i=1}^{n} k_i}{\lambda} = 0

\hat{\lambda}_{\mathrm{MLE}} = \frac{\sum_{i=1}^{n} k_i}{n}
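
As a sanity check of this closed form, the following sketch (on synthetic Poisson counts, so the rate 3.5 and sample size are placeholders) compares the sample mean with a grid search over the log-likelihood derived above:

import numpy as np

rng = np.random.default_rng(0)
k = rng.poisson(lam=3.5, size=200)          # synthetic counts (assumed data)

def log_likelihood(lam, k):
    # ln P(D | lambda) up to the constant term -sum(ln k_i!)
    return -len(k) * lam + np.log(lam) * k.sum()

lams = np.linspace(0.1, 10, 1000)
lam_grid = lams[np.argmax(log_likelihood(lams, k))]

print(k.mean(), lam_grid)   # the grid maximizer agrees with the sample mean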

b.
P(\lambda|D) \propto P(D|\lambda)\,P(\lambda)

P(\lambda|D) \propto \frac{\lambda^{\alpha-1} e^{-\lambda/\beta}}{\Gamma(\alpha)\beta^{\alpha}} \prod_{i=1}^{n} \frac{\lambda^{k_i} e^{-\lambda}}{k_i!}

Since factors that do not depend on λ are eliminated in the differentiation step, I drop them when taking the log.
\ln P(\lambda|D) \propto \ln\left(\lambda^{\alpha-1} e^{-\lambda/\beta} e^{-n\lambda} \lambda^{\sum_{i=1}^{n} k_i}\right) = -\left(n + \frac{1}{\beta}\right)\lambda + (\ln \lambda)\left(\alpha - 1 + \sum_{i=1}^{n} k_i\right)

Taking the derivative with respect to λ and setting it to zero:


\frac{d \ln P(\lambda|D)}{d\lambda} = -\left(n + \frac{1}{\beta}\right) + \frac{\alpha - 1 + \sum_{i=1}^{n} k_i}{\lambda} = 0

\hat{\lambda}_{\mathrm{MAP}} = \frac{\alpha - 1 + \sum_{i=1}^{n} k_i}{n + \frac{1}{\beta}}

c.

\lim_{n\to\infty} \hat{\lambda}_{\mathrm{MAP}} = \lim_{n\to\infty} \frac{\alpha - 1 + \sum_{i=1}^{n} k_i}{n + \frac{1}{\beta}} = \lim_{n\to\infty}\left(\frac{n}{n + \frac{1}{\beta}} \cdot \frac{\sum_{i=1}^{n} k_i}{n} + \frac{\alpha - 1}{n + \frac{1}{\beta}}\right) = \frac{\sum_{i=1}^{n} k_i}{n} = \hat{\lambda}_{\mathrm{MLE}}
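
A quick numeric illustration of this limit (the Gamma hyperparameters α, β and the true rate below are assumed values, and the counts are synthetic): as n grows, the MAP estimate approaches the MLE, i.e. the sample mean.

import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.0                      # assumed Gamma prior hyperparameters
true_rate = 4.0

for n in (5, 50, 500, 50000):
    k = rng.poisson(lam=true_rate, size=n)
    lam_mle = k.sum() / n
    lam_map = (alpha - 1 + k.sum()) / (n + 1 / beta)
    print(n, round(lam_mle, 4), round(lam_map, 4))   # the two estimates converge as n grows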

3 Ridge regression

E_{\mathrm{ridge}}(w) = (Xw - t)^T (Xw - t) + \lambda w^T w


(Xw - t)^T (Xw - t) = (w^T X^T - t^T)(Xw - t)

\frac{\partial E_{\mathrm{ridge}}(w)}{\partial w} = \frac{\partial}{\partial w}\left(w^T X^T X w - w^T X^T t - t^T X w + t^T t + \lambda w^T w\right)
Based on the two facts:
\frac{\partial}{\partial x}(x^T a) = \frac{\partial}{\partial x}(a^T x) = a

\frac{\partial}{\partial x}(x^T A x) = (A + A^T)\,x

\frac{\partial}{\partial w}(w^T X^T X w) = \left(X^T X + (X^T X)^T\right) w = 2 X^T X w

\frac{\partial}{\partial w}(w^T X^T t) = \frac{\partial}{\partial w}(t^T X w) = X^T t

\frac{\partial}{\partial w}(\lambda w^T w) = \frac{\partial}{\partial w}(\lambda w^T I w) = 2\lambda I w

\frac{\partial E_{\mathrm{ridge}}(w)}{\partial w} = 2 X^T X w - X^T t - X^T t + 2\lambda I w = 0

2(X^T X + \lambda I)\,w = 2 X^T t

\hat{w}_{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T t
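
The closed-form solution maps directly to a few lines of NumPy. A minimal sketch on synthetic data (the design matrix X, targets t, and λ below are placeholders, not values from the assignment):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))               # assumed design matrix
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

lam = 0.1
d = X.shape[1]
# w_ridge = (X^T X + lambda I)^{-1} X^T t, solved without forming an explicit inverse
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)
print(w_ridge)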

4 Robust Linear Regression


P(t|X, w, b) = \prod_{i=1}^{N} P(t_i | x_i, w, b) = \prod_{i=1}^{N} \frac{1}{2b} \exp\left(-\frac{|t_i - x_i^T w|}{b}\right) = \left(\frac{1}{2b}\right)^{N} \exp\left(-\frac{\sum_{i=1}^{N} |t_i - x_i^T w|}{b}\right)
The ML estimate of w maximizes the log-likelihood; since the term -N \ln(2b) does not depend on w, this is equivalent to minimizing the sum of absolute residuals:

\hat{w}_{\mathrm{ML}} = \arg\min_{w} \sum_{i=1}^{N} |t_i - x_i^T w|
The ML estimate of b is:
\frac{d \ln P(t|X, w, b)}{db} = -\frac{N}{b} + \frac{\sum_{i=1}^{N} |t_i - x_i^T w|}{b^2} = 0

\hat{b} = \frac{\sum_{i=1}^{N} |t_i - x_i^T w|}{N}
So, if E_{\mathrm{Laplace}}(w) denotes the absolute error, minimizing it is equivalent to the MLE of w under a Laplacian noise model:

E_{\mathrm{Laplace}}(w) = \sum_{i=1}^{N} |t_i - x_i^T w|
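
To make the equivalence concrete, the sketch below (on synthetic data; it uses scipy.optimize.minimize because the absolute error is not differentiable everywhere) estimates w by minimizing E_Laplace(w) and then sets b to the mean absolute residual:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))                       # assumed inputs
t = X @ np.array([2.0, -1.0, 0.5]) + rng.laplace(scale=0.3, size=200)

def abs_error(w):
    # E_Laplace(w) = sum_i |t_i - x_i^T w|
    return np.abs(t - X @ w).sum()

w_ml = minimize(abs_error, x0=np.zeros(3), method="Powell").x
b_ml = np.abs(t - X @ w_ml).mean()                  # b = (1/N) sum_i |t_i - x_i^T w|
print(w_ml, b_ml)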

(b)
Fig. 1 shows the two distributions (Gaussian and Laplacian) with the same mean and with the standard deviation and scale parameter set equal (σ = b = 1/√2). As shown, the Laplacian density is higher at the mean. Regarding the tails, the Laplacian distribution keeps a higher value (heavier tails) than the Gaussian distribution, which assigns very little probability to points far from the mean. In terms of the mathematical form, for large residuals |ŷ − y| the squared error grows much faster than the absolute error, which means that a Gaussian noise model is more sensitive to outlier data than a Laplacian one.

Figure 1 Gaussian and Laplacian Distribution with σ=b=1/√2
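
A simple numeric illustration of this sensitivity (the residual values below are arbitrary): up to constants, the Gaussian penalty grows quadratically with the residual while the Laplacian penalty grows only linearly.

import numpy as np

residuals = np.array([0.5, 1.0, 3.0, 10.0])   # arbitrary residuals |y_hat - y|
squared_penalty = residuals ** 2               # growth of the Gaussian negative log-likelihood
absolute_penalty = np.abs(residuals)           # growth of the Laplacian negative log-likelihood
for r, sq, ab in zip(residuals, squared_penalty, absolute_penalty):
    print(r, sq, ab)   # the squared penalty dominates for large residuals (outliers)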

5 Programming: Nonlinear Regression


Figure 2 shows the root-mean-square error for the training and the test (unseen) data with respect to the polynomial degree, from 0 to 10, using polynomial basis functions. There is a fluctuation in the training error because of the high number of features (7) combined with the high polynomial degree. As the degree of the polynomial increases, the model becomes more complex, which can cause overfitting. At the same time, the number of coefficient parameters (the vector w) grows with the degree: for degree M it is 7M + 1, whereas there are only 100 training examples. The model therefore has more flexibility to fit the training data, and having more parameters makes it more sensitive to noise, which leads to a less accurate fit; this is why the training error fluctuates. From a mathematical point of view, the basic assumption of linear regression is that the Φ-matrix in w_ML = (Φ^T Φ)^{-1} Φ^T t is full rank. As the degree and the number of features increase, Φ^T Φ becomes closer to singular, which leads to inaccurate parameter estimates and a significant increase in the entries of the coefficient vector w.


Figure 2 Root-mean-square error (RMSE) vs polynomial degree for a) training, b) test dataset
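
To make the setup concrete, the sketch below shows the basic procedure on synthetic placeholder data (not the assignment dataset): build the polynomial design matrix Φ feature-wise for each degree M, solve the least-squares problem, and report the training and test RMSE.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=(392, 7))              # placeholder features (7 per example)
t = np.sin(x).sum(axis=1) + 0.1 * rng.normal(size=392)
x_tr, t_tr, x_te, t_te = x[:100], t[:100], x[100:], t[100:]

def poly_design(x, degree):
    # [1, x, x^2, ..., x^degree] applied feature-wise: 7*degree + 1 columns
    cols = [np.ones((x.shape[0], 1))]
    for d in range(1, degree + 1):
        cols.append(x ** d)
    return np.hstack(cols)

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

for M in range(11):
    Phi_tr, Phi_te = poly_design(x_tr, M), poly_design(x_te, M)
    w, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    print(M, rmse(Phi_tr @ w, t_tr), rmse(Phi_te @ w, t_te))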

Figure 3 shows the norm of the coefficient vector w with respect to the polynomial degree. The norm of the coefficient vector increases significantly for high-order polynomials.

Figure 3 Norm of coefficient parameters vs degree of polynomial

In addition, the training set is small compared to the number of estimated parameters; the larger the training set, the lower the chance of spurious estimates and the lower the sensitivity to noise. To mitigate these issues with a small dataset, cross-validation, regularization, and random permutation of the data are very helpful.

Therefore, I used two strategies to check the results. First, a random permutation applied to the whole dataset for each feature. Second, using the pseudo-inverse of Φ^T Φ to estimate the coefficients. As shown in Figure 4, the fluctuation in the training RMSE becomes smaller, which indicates that randomly selecting the training points from the whole dataset decreases the estimation error of the coefficient vector w (the Python script for this estimation is ppolynomial_regression_rendompermutation.py). The ordering of the feature values differs considerably between the training data (examples 1 to 100) and the test data (examples 101 to 392), as shown in the second part of this question; therefore, regression with the fixed permutation produces a larger test error than with the random permutation.


Figure 4 RMSE vs polynomial degree using random permutation for a) training, b) test dataset
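
The two strategies can be summarized by the following sketch (hypothetical helper functions; the actual implementations are in the .py scripts mentioned above, and Phi, x, t are assumed to come from the polynomial basis expansion):

import numpy as np

def fit_pinv(Phi, t):
    # w = pinv(Phi^T Phi) Phi^T t: stable even when Phi^T Phi is close to singular
    return np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ t

def random_split(x, t, n_train=100, seed=0):
    # shuffle the whole dataset before taking the first n_train points for training
    perm = np.random.default_rng(seed).permutation(len(t))
    x, t = x[perm], t[perm]
    return x[:n_train], t[:n_train], x[n_train:], t[n_train:]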

Figure 5 shows the training and test error with respect to the polynomial degree using the pseudo-inverse together with the random permutation (the Python script for this estimation is ppolynomial_regression_pseudoinverse.py).

Figure 5 RMSE vs polynomial degree using pseudo inverse and random permutation

(b)
In this part the training is based on the fixed permutation of the data and the 3rd feature. The results are shown for M = 2, 4, and 10. Figure 6 shows the training and test data fitted by polynomials of degree 2 and 4. In Fig. 6b the prediction on the training data fits better as the polynomial degree increases. However, the predicted values of both polynomials do not match the scattered test data well, because the data were not randomly permuted before training and the ordering of the feature values seems to differ significantly between the training part (1 to 100) and the test part (101 to 392). This data visualization helps to explain why there is a fluctuation in the training error and such a large test error.


Figure 6 Training and test data and learned polynomial for a) M=2 and b) M=4

In Figure 7, overfitting occurs because of the high polynomial degree (M = 10), which tries to fit every data point regardless of the underlying pattern.

Figure 7 Training and test data and learned polynomial for M=10

(c)
In this part the training is also based on the fixed permutation of the data and the 3rd feature. The results are shown for M = 0 and λ = 0, 0.01, and 0.1. Figure 8 shows the average training and validation error. The validation and training errors follow opposite trends with respect to the regularization parameter, because of overfitting for small λ and underfitting for large λ. When the model is complex and λ is small, it tends to overfit the training data, so the training error is small; however, the validation error is large because the model does not generalize well to the validation data. On the other hand, when λ is large the model becomes very simple and the learning method cannot fit complex structure well; in this case both the training and the validation error are large. So, for complex models, as the regularization parameter increases from zero the training error increases, since the model becomes simpler, while the validation error decreases because the model overfits less. The optimum λ is the one that balances overfitting against underfitting. Therefore, λ = 10 seems to be the optimum selection, as the validation and training errors sit between the steeply decreasing and increasing parts of their curves. This can also be observed in Fig. 11, which compares the two fitted models on the training and test data for λ = 1 and λ = 10: the model with λ = 10 gives slightly better predictions than the one with λ = 1. Figures 9 and 10 show the training and test data and the learned polynomial for λ = 0 and 100, and for λ = 1 and 10, respectively. Both very large and very small values of the regularization parameter lead to weak model predictions.

Figure 8 RMSE vs log-scaled λ for the training and validation data

Figure 9 Training and test data and learned polynomial M=8 with λ=0 and λ=100
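
A minimal sketch of how λ can be selected by cross-validation with the regularized closed-form solution (Φ and t are assumed to be the design matrix and targets; the fold count and the λ grid are illustrative, not the assignment's exact settings):

import numpy as np

def ridge_fit(Phi, t, lam):
    # w = (Phi^T Phi + lambda I)^{-1} Phi^T t
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

def cv_rmse(Phi, t, lam, K=10):
    # average validation RMSE over K folds for a given lambda
    folds = np.array_split(np.arange(len(t)), K)
    errs = []
    for val_idx in folds:
        tr_idx = np.setdiff1d(np.arange(len(t)), val_idx)
        w = ridge_fit(Phi[tr_idx], t[tr_idx], lam)
        errs.append(np.sqrt(np.mean((Phi[val_idx] @ w - t[val_idx]) ** 2)))
    return np.mean(errs)

# choose the lambda with the smallest average validation error (Phi, t assumed given)
# lambdas = [0, 0.01, 0.1, 1, 10, 100, 1000]
# best_lam = min(lambdas, key=lambda lam: cv_rmse(Phi, t, lam))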
