Lecture 6

Data Science for Molecular
Engineering
ILOs
• Understand the formulation and assumptions of linear regression;
• Know how to find the optimal parameters for a linear regression problem;
• Understand the concept and execution of the gradient descent method.
Linear regression
• One of the most widely used techniques
• Fundamental to many larger models
• Generalized Linear Models
• Collaborative filtering
• Easy to interpret
• Efficient to solve
Linear regression examples
[Figure panels: Simple linear regression; Multiple linear regression]
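The linear model
$$y = \theta^T x + \epsilon$$
• y: scalar response
• θ: vector of parameters (length p)
• x: vector of covariates/inputs (length p)
• θᵀx: linear combination of covariates
• ε: model error
What about the bias/intercept term?
$$y = \theta^T x + b + \epsilon$$
Append a constant 1 to x and append b to θ, so that b becomes one more parameter. Then redefine p := p+1 for notational simplicity.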
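Redefining p := p+1 amounts to appending a constant 1 to every input vector, so the intercept becomes one more entry of θ. A minimal NumPy sketch follows; the data values and the names `add_intercept_column`, `X`, `theta`, and `b` are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def add_intercept_column(X):
    """Append a column of ones so the bias b becomes the last entry of theta."""
    n = X.shape[0]
    return np.hstack([X, np.ones((n, 1))])

# Hypothetical data: n = 3 samples, p = 2 covariates
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
theta = np.array([0.5, -1.0])  # illustrative parameters
b = 2.0                        # illustrative intercept

X_aug = add_intercept_column(X)   # shape (3, 3): p := p + 1
theta_aug = np.append(theta, b)   # b absorbed into theta

# Both forms give the same predictions
pred_with_b = X @ theta + b
pred_absorbed = X_aug @ theta_aug
```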

Example
What are y and x?
How do you define 𝛉?
Assumptions
• Linear regression assumes that…
• 1. The relationship between x and y is linear
• 2. y is distributed normally at each value of x
• 3. The variance of y at every value of x is the same (homogeneity of variances)
• 4. The observations are independent
Finding the best fit
Least mean squared error
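$$\mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
(The 1/n factor of the mean is omitted here; it does not change the minimizing θ.)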
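The squared-error criterion can be evaluated directly for any candidate θ. The sketch below assumes a small made-up data set generated by y = 2x + 1 with the intercept stored as a column of ones; the function name `squared_error_loss` is hypothetical.

```python
import numpy as np

def squared_error_loss(theta, X, y):
    """Sum of squared residuals: sum_i (y_i - theta^T x_i)^2."""
    residuals = y - X @ theta
    return float(residuals @ residuals)

# Made-up data: first column is x, second column is the constant 1
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([3.0, 5.0, 7.0])   # generated by y = 2*x + 1

loss_exact = squared_error_loss(np.array([2.0, 1.0]), X, y)  # true parameters
loss_off = squared_error_loss(np.array([1.0, 1.0]), X, y)    # wrong slope
```

The true parameters give zero loss, while the wrong slope leaves residuals 1, 2, 3 and a loss of 14.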
Loss, training, testing
• Loss
• A function to measure the model performance (e.g. MSE)
• Training
• Tuning model parameters to minimize the value of the loss function
$$\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
• Testing
• Using the trained models to make predictions on new data and evaluating
model performance
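The loss / training / testing split can be sketched as follows. The synthetic data, the 80/20 split, and the use of `np.linalg.lstsq` as the training step are assumptions made for illustration; the lecture does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2*x + 1 + Gaussian noise
n = 100
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=n)
X = np.column_stack([x, np.ones(n)])   # intercept absorbed as a column of ones

# Hold out the last 20 points for testing
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

def mse(theta, X, y):
    """Loss: mean squared error of the linear model on (X, y)."""
    return float(np.mean((y - X @ theta) ** 2))

# Training: choose theta to minimize the loss on the training set
theta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Testing: evaluate the trained model on data it has not seen
train_mse = mse(theta_hat, X_train, y_train)
test_mse = mse(theta_hat, X_test, y_test)
```

A test loss close to the training loss suggests the fitted parameters carry over to new data.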
Gradient Descent Illustrated:
[Figure: convex function; the minimum is where the slope = 0]
Gradient descent
$$\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
Sum over all data points: batch gradient descent
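A minimal sketch of batch gradient descent for this loss is given below. It assumes a fixed learning rate `lr` and divides the gradient by n (the mean rather than the sum) so the step size does not depend on the data set size; both choices, and the function name, are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize the mean squared error using the gradient over all n points."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_steps):
        residuals = y - X @ theta
        grad = -2.0 / n * (X.T @ residuals)  # gradient of the mean squared error
        theta = theta - lr * grad            # step against the gradient
    return theta

# Made-up data on the line y = 2*x + 1 (second column is the constant 1)
X = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = batch_gradient_descent(X, y)   # approaches [2, 1]
```

If `lr` is too large the iterates diverge; for this data the update is stable for learning rates below roughly 0.23.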
Stochastic and Minibatch Gradient Descent
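Batch (all n data points):
$$\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
Stochastic (a single data point i):
$$\mathrm{Loss} = \epsilon_i^2 = \left( y_i - \theta^T x_i \right)^2$$
Minibatch (a subset of n_1 < n data points):
$$\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n_1} \epsilon_i^2 = \sum_{i=1}^{n_1} \left( y_i - \theta^T x_i \right)^2$$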
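The minibatch variant can be sketched as below; setting `batch_size=1` gives stochastic gradient descent. The shuffling scheme, learning rate, epoch count, and noiseless synthetic data are assumptions chosen so the example converges cleanly, not settings from the lecture.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=8, lr=0.05, n_epochs=200, seed=0):
    """Gradient descent where each step uses only a random subset of the data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of this minibatch
            residuals = y[idx] - X[idx] @ theta
            grad = -2.0 / len(idx) * (X[idx].T @ residuals)
            theta = theta - lr * grad
    return theta

# Noiseless made-up data on y = 2*x + 1, so every minibatch agrees on the optimum
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=64)
X = np.column_stack([x, np.ones(64)])
y = 2.0 * x + 1.0

theta = minibatch_sgd(X, y)   # approaches [2, 1]
```

With noisy data and a fixed learning rate, the iterates would instead fluctuate around the minimizer rather than settle exactly on it.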
Minimizing the Squared Error
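• Taking the gradient (chain rule):
$$\nabla_\theta \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2 = -2 \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right) x_i$$
• Rewriting the gradient in matrix form, where X is the n × p matrix whose rows are the x_iᵀ and y is the vector of responses:
$$\nabla_\theta \lVert y - X\theta \rVert^2 = -2 X^T (y - X\theta)$$
• To make sure the squared error (the negative log-likelihood) is convex, compute the second derivative (Hessian):
$$\nabla_\theta^2 \lVert y - X\theta \rVert^2 = 2 X^T X$$
• If X has full column rank then XᵀX is positive definite and therefore θ_MLE is the minimum
• Address the degenerate cases with regularization
• Setting the gradient equal to 0 and solving for θ_MLE:
$$\hat{\theta}_{MLE} = (X^T X)^{-1} X^T y$$
Dimensions: θ is p × 1, (XᵀX)⁻¹ is p × p, Xᵀ is p × n, y is n × 1.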
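Setting the gradient to zero leads to the linear system XᵀXθ = Xᵀy, which can be solved directly. The sketch below also shows one common way to handle a degenerate (rank-deficient) X by adding a small ridge term to XᵀX; the ridge form, the value 1e-3, and the function name `solve_normal_equations` are assumptions for illustration, since the lecture does not specify which regularizer it has in mind.

```python
import numpy as np

def solve_normal_equations(X, y, ridge=0.0):
    """Solve (X^T X + ridge * I) theta = X^T y for theta."""
    p = X.shape[1]
    A = X.T @ X + ridge * np.eye(p)
    return np.linalg.solve(A, X.T @ y)   # avoids forming the inverse explicitly

# Full-rank case: made-up data on y = 2*x + 1
X = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = solve_normal_equations(X, y)

# Degenerate case: a duplicated column makes X^T X singular,
# so a small ridge term is added to keep the system solvable
X_dup = np.column_stack([X[:, 0], X[:, 0], X[:, 1]])
theta_ridge = solve_normal_equations(X_dup, y, ridge=1e-3)
```

In the degenerate case the ridge solution splits the slope evenly across the two identical columns while still reproducing the responses closely.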
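Thinking about the regression model from a probability point of view
• For a single data point (x, y):
  x: independent variable (vector), observed (the condition)
  y: response variable (scalar)
• Joint probability:
$$p(x, y) = p(x)\, p(y \mid x)$$
Discriminative model: model only p(y | x), the conditional probability distribution of y based on a single input x.
$$y = \theta^T x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Here θᵀx is deterministic and ε follows a normal distribution, so
$$p(y \mid x; \theta) = \mathcal{N}\left( y;\ \theta^T x,\ \sigma^2 \right)$$
with mean θᵀx and variance σ².
For multiple data points…
Assuming all data points are independent and identically distributed (iid)…
$$p(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta)$$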
Rewriting with Matrix Notation
• Rewriting the model using matrix operations:
$$y = X\theta + \epsilon$$
Dimensions: y is n × 1, X is n × p, θ is p × 1, ε is n × 1.
Estimating the Model
• Given data, how can we estimate θ?
• Construct the maximum likelihood estimator (MLE):
• Derive/calculate θ so that the observed outcome has the maximum likelihood
Defining the Likelihood
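For independent data points (x_i, y_i) with Gaussian noise of variance σ²:
$$L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{\left( y_i - \theta^T x_i \right)^2}{2\sigma^2} \right)$$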
Maximizing the Likelihood
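• Want to compute:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta)$$
• To simplify the calculations we take the log, which does not affect the maximization because log is a monotone function (and the result is easy to maximize).
• Take the log (σ² is the noise variance):
$$\log L(\theta) = -n \log\left( \sigma\sqrt{2\pi} \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
• Removing constant terms with respect to θ (the first term and the positive factor 1/(2σ²)):
$$-\sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
• Want to compute:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \log L(\theta)$$
• Plugging in log-likelihood:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \; -\sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
• Dropping the sign and flipping from maximization to minimization:
$$\hat{\theta}_{MLE} = \arg\min_{\theta} \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2$$
Minimize Sum (Error)²
• Gaussian Noise Model → Squared Loss
• Least Squares Regression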
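The link between the Gaussian noise model and least squares can be checked numerically: the least-squares θ should give a higher Gaussian log-likelihood than any perturbed θ. The sketch below assumes synthetic data and a known noise level `sigma`; the function names and the particular perturbations are illustrative.

```python
import numpy as np

def sum_squared_error(theta, X, y):
    """Sum of squared residuals for the linear model."""
    residuals = y - X @ theta
    return float(residuals @ residuals)

def gaussian_log_likelihood(theta, X, y, sigma):
    """Log-likelihood of y under y_i ~ N(theta^T x_i, sigma^2), iid."""
    n = len(y)
    return float(-n * np.log(sigma * np.sqrt(2.0 * np.pi))
                 - sum_squared_error(theta, X, y) / (2.0 * sigma ** 2))

# Synthetic data (assumed): y = 2*x + 1 + Gaussian noise with sigma = 0.3
rng = np.random.default_rng(0)
sigma = 0.3
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([x, np.ones(50)])
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=50)

# Least-squares solution
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perturbing theta away from the least-squares solution lowers the likelihood
perturbations = [np.array([0.1, 0.0]), np.array([0.0, -0.1]), np.array([-0.05, 0.05])]
ll_best = gaussian_log_likelihood(theta_ls, X, y, sigma)
ll_others = [gaussian_log_likelihood(theta_ls + d, X, y, sigma) for d in perturbations]
```

Because the log-likelihood is a constant minus the squared error scaled by 1/(2σ²), the ranking of candidate θ values is the same under both criteria regardless of σ.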
Maximizing the Likelihood
(Minimizing the Squared Error)
[Figure: convex function; the minimum is where the slope = 0]
• Take the gradient and set it equal to zero

Simple example
$$\hat{\theta}_{MLE} = (X^T X)^{-1} X^T y$$
Dimensions: θ is p × 1, (XᵀX)⁻¹ is p × p, Xᵀ is p × n, y is n × 1.
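A small worked instance of the closed-form solution, using made-up data (three points on the line y = 2x + 1, not taken from the lecture). The first column of X holds the inputs and the second column is the constant 1 for the intercept, so n = 3 and p = 2.

```latex
X = \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix}, \quad
y = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix}
\qquad (n = 3,\ p = 2)

X^T X = \begin{bmatrix} 5 & 3 \\ 3 & 3 \end{bmatrix}, \quad
X^T y = \begin{bmatrix} 13 \\ 9 \end{bmatrix}

(X^T X)^{-1} = \frac{1}{6} \begin{bmatrix} 3 & -3 \\ -3 & 5 \end{bmatrix}

\hat{\theta} = (X^T X)^{-1} X^T y
= \frac{1}{6} \begin{bmatrix} 3 \cdot 13 - 3 \cdot 9 \\ -3 \cdot 13 + 5 \cdot 9 \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \end{bmatrix}
```

The recovered slope 2 and intercept 1 match the line that generated the points.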
