
Data Science for Molecular Engineering
Lecture 6
ILOs
• Understand the formulation and assumptions of linear regression;
• Know how to find the optimal parameters for a linear regression problem;
• Understand the concept and execution of the gradient descent method.
Linear regression
• One of the most widely used techniques
• Fundamental to many larger models
• Generalized Linear Models
• Collaborative filtering
• Easy to interpret
• Efficient to solve
Linear regression examples

[Figure: example fits for simple linear regression (one covariate) and multiple linear regression (several covariates)]
The linear model
$$y = \theta^T x + b + \epsilon$$

• y: scalar response
• x: vector of covariates/inputs
• θ: vector of parameters
• b: bias/intercept
• ε: model error
• θᵀx: linear combination of covariates

What about the bias/intercept term?

Append a constant 1 to each covariate vector x, so the intercept b becomes one more entry of θ; then redefine p := p + 1 for notational simplicity (a sketch follows below).
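A minimal sketch of absorbing the intercept into the parameter vector, assuming NumPy and a hypothetical design matrix X of shape (n, p):

```python
import numpy as np

# Hypothetical design matrix: n = 4 samples, p = 2 covariates.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

# Append a column of ones so the intercept b becomes the last entry of theta.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # shape (n, p + 1)

# The model y = theta^T x + b + eps is now simply y = theta_aug^T x_aug + eps.
print(X_aug.shape)  # (4, 3), i.e. p has been redefined as p + 1
```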


Example

What are y and x?
How do you define θ?
Assumptions
• Linear regression assumes that:
1. The relationship between x and y is linear;
2. y is distributed normally at each value of x;
3. The variance of y is the same at every value of x (homogeneity of variances);
4. The observations are independent.
Finding the best fit

Least mean squared error

$$MSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
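A minimal sketch of evaluating this loss for a candidate θ, assuming NumPy arrays X (n × p, intercept column already appended) and y (length n); the numbers are illustrative:

```python
import numpy as np

def squared_error_loss(theta, X, y):
    """Sum of squared errors: sum_i (y_i - theta^T x_i)^2."""
    residuals = y - X @ theta
    return np.sum(residuals ** 2)

# Illustrative data: the last column of X is the intercept term.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([2.1, 3.9, 6.2])
theta_guess = np.array([2.0, 0.0])
print(squared_error_loss(theta_guess, X, y))
```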
Loss, training, testing
• Loss
• A function to measure the model performance (e.g. MSE)

• Training
• Tuning model parameters to minimize the value of the loss function
$$Loss = MSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
• Testing
• Using the trained model to make predictions on new data and evaluating its performance
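A minimal training/testing sketch, assuming scikit-learn is available; the dataset below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 covariates, linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=100)

# Training: fit parameters on one split; testing: evaluate the loss on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```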
Gradient Descent Illustrated

[Figure: a convex loss function; gradient descent steps move downhill toward the minimum, where the slope = 0]
Gradient descent

$$Loss = MSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$

The sum runs over all n data points: batch gradient descent.
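A minimal batch gradient descent sketch for this loss, assuming NumPy arrays X (n × p) and y (length n); the learning rate and iteration count are illustrative choices, not prescribed by the lecture:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.005, n_iters=1000):
    """Minimize sum_i (y_i - theta^T x_i)^2 using the full-data gradient at each step."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2.0 * X.T @ (y - X @ theta)  # gradient of the summed squared error
        theta -= lr * grad                   # step in the negative gradient direction
    return theta

# Illustrative data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)
print(batch_gradient_descent(X, y))  # approximately [2.0, -1.0]
```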
Stochastic and Minibatch Gradient Descent
Batch (sum over all n data points):

$$Loss = MSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$

Stochastic (a single randomly chosen data point i):

$$Loss = \epsilon_i^2 = (y_i - \theta^T x_i)^2$$

Minibatch (a random subset of $n_1 < n$ data points):

$$Loss = \sum_{i=1}^{n_1} \epsilon_i^2 = \sum_{i=1}^{n_1} (y_i - \theta^T x_i)^2$$
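A minimal minibatch stochastic gradient descent sketch under the same assumptions as above; batch size, learning rate, and epoch count are illustrative. Setting batch_size=1 gives stochastic gradient descent, and batch_size=n recovers batch gradient descent.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=8, lr=0.01, n_epochs=100, seed=0):
    """Update theta with the squared-error gradient of a random minibatch at each step."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)               # shuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -2.0 * Xb.T @ (yb - Xb @ theta)
            theta -= lr * grad
    return theta

# Illustrative usage with synthetic data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.1, size=200)
print(minibatch_sgd(X, y))  # approximately [1.0, 3.0]
```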
Minimizing the Squared Error

• Taking the gradient (chain rule):

$$\nabla_\theta \, Loss = -2 \sum_{i=1}^{n} x_i (y_i - \theta^T x_i)$$

• Rewriting the gradient in matrix form:

$$\nabla_\theta \, Loss = -2 X^T (y - X\theta)$$
• To make sure the loss (the negative log-likelihood) is convex, compute the second derivative (Hessian):

$$\nabla_\theta^2 \, Loss = 2 X^T X$$

• If X is full rank, then XᵀX is positive definite and therefore θ_MLE is the minimum
• Address the degenerate cases with regularization
• Setting the gradient equal to 0 and solving for θ_MLE:

$$\theta_{MLE} = (X^T X)^{-1} X^T y$$

  Dimensions: (p × 1) = [(p × n)(n × p)]⁻¹ (p × n)(n × 1)
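A minimal sketch of this closed-form solution, assuming NumPy arrays X and y; np.linalg.solve is used rather than explicitly inverting XᵀX, which is the numerically preferred route:

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative data with an intercept column appended.
rng = np.random.default_rng(2)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=200)
print(fit_least_squares(X, y))  # approximately [1.0, -0.5, 2.0]
```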
Thinking about the regression model from a probability point of view
• For a single data point (x, y):
  • x: independent variable (vector); y: response variable (scalar)
  • We observe (condition on) x and model the response y
• Joint probability: $p(x, y) = p(x)\, p(y \mid x)$
• Discriminative model: model only the conditional probability distribution of y given a single input x
• That conditional is a normal distribution whose mean, θᵀx, is a deterministic function of x:

$$p(y \mid x) = \mathcal{N}(y;\ \theta^T x,\ \sigma^2) \qquad \text{mean } \theta^T x, \text{ variance } \sigma^2$$
For multiple data points…

Assuming all data points are independent and identically distributed (iid):

$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} p(y_i \mid x_i) = \prod_{i=1}^{n} \mathcal{N}(y_i;\ \theta^T x_i,\ \sigma^2)$$
Rewriting with Matrix Notation
• Rewriting the model using matrix operations:

$$y = X\theta + \epsilon$$

  Dimensions: y is n × 1, X is n × p, θ is p × 1, ε is n × 1
Estimating the Model
• Given data, how can we estimate θ?

• Construct the maximum likelihood estimator (MLE):

$$\hat{\theta}_{MLE} = \arg\max_\theta\ p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, \theta)$$

• Derive/calculate θ so that the observed outcomes have the maximum likelihood
Defining the Likelihood

$$L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i, \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)$$
Maximizing the Likelihood
• Want to compute:

$$\hat{\theta}_{MLE} = \arg\max_\theta\ L(\theta)$$

• To simplify the calculations we take the log:

$$\log L(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)$$

  which does not affect the maximization because log is a monotone function (easy to maximize).

• Taking the log of the Gaussian likelihood:

$$\log L(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \theta^T x_i)^2$$

• Removing constant terms with respect to θ:

$$\arg\max_\theta\ \log L(\theta) = \arg\max_\theta\ \left[-\sum_{i=1}^{n}(y_i - \theta^T x_i)^2\right]$$
• Want to compute:

$$\hat{\theta}_{MLE} = \arg\max_\theta\ \log L(\theta)$$

• Plugging in the log-likelihood:

$$\hat{\theta}_{MLE} = \arg\max_\theta\ \left[-\sum_{i=1}^{n}(y_i - \theta^T x_i)^2\right]$$

• Dropping the sign and flipping from maximization to minimization:

$$\hat{\theta}_{MLE} = \arg\min_\theta\ \sum_{i=1}^{n}(y_i - \theta^T x_i)^2 \quad \text{(minimize the sum of squared errors)}$$

• Gaussian noise model → squared loss


• Least Squares Regression
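A minimal numerical check of this equivalence, assuming SciPy is available: minimizing the Gaussian negative log-likelihood over θ recovers the same estimate as the closed-form least-squares solution (data and noise variance below are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data with an intercept column.
rng = np.random.default_rng(3)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.2, size=100)
sigma2 = 0.04  # assumed (fixed) noise variance; it only rescales the objective

def neg_log_likelihood(theta):
    """Negative Gaussian log-likelihood of y given X and theta."""
    r = y - X @ theta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + np.sum(r ** 2) / (2 * sigma2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_mle, theta_ls, atol=1e-4))  # True: the same minimizer
```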
Maximizing the Likelihood
(Minimizing the Squared Error)

[Figure: the squared-error loss is a convex function; its minimum is where the slope = 0]

• Take the gradient and set it equal to zero


Simple example
$$\theta = (X^T X)^{-1} X^T y$$

  Dimensions: (p × 1) = [(p × n)(n × p)]⁻¹ (p × n)(n × 1)
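A minimal worked version of such an example, with hypothetical numbers: three observations, one covariate plus an intercept column, solved with the formula above:

```python
import numpy as np

# Hypothetical data: n = 3 observations, p = 2 (one covariate plus an intercept column).
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([2.0, 3.0, 5.0])

# theta = (X^T X)^{-1} X^T y; dimensions: (2 x 1) = [(2 x 3)(3 x 2)]^{-1} (2 x 3)(3 x 1)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # slope = 1.5, intercept ≈ 0.33
```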
