Lecture 6
ILOs
• Understand the formulation and assumptions of linear regression
• Know how to find the optimal parameters for a linear regression problem
• Understand the concept and execution of the gradient descent method
Linear regression
• One of the most widely used techniques
• Fundamental to many larger models
• Generalized Linear Models
• Collaborative filtering
• Easy to interpret
• Efficient to solve
Linear regression examples
The linear model
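$y = \theta^T x + b + \epsilon$
• $y$: scalar response
• $\theta$: vector of parameters
• $x$: vector of covariates/inputs
• $\theta^T x$: linear combination of covariates
• $b$: intercept
• $\epsilon$: model error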
What is y, x?
How do you define 𝛉?
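A minimal sketch of the linear model as a prediction function, assuming NumPy. The numbers and the names `predict`, `theta_aug` and `x_aug` are hypothetical and serve only as illustration. It also shows one common way to define 𝛉 so that the intercept b is absorbed: append a constant 1 to x.

```python
import numpy as np

def predict(theta, x, b=0.0):
    """Return the linear-model prediction theta^T x + b for one input vector x."""
    return float(np.dot(theta, x) + b)

# Hypothetical numbers, chosen only for illustration.
theta = np.array([2.0, -1.0])   # vector of parameters
x = np.array([3.0, 4.0])        # vector of covariates/inputs
b = 0.5                         # intercept

y_hat = predict(theta, x, b)    # 2*3 + (-1)*4 + 0.5

# The intercept can be folded into theta by appending a constant 1 to x.
theta_aug = np.append(theta, b)
x_aug = np.append(x, 1.0)
y_hat_aug = predict(theta_aug, x_aug)
```

With the augmented vectors the model reduces to a single inner product, which is the form used in the loss formulas.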
Assumptions
• Linear regression assumes that…
• 1. The relationship between x and y is linear
• 2. y is distributed normally at each value of x
• 3. The variance of y at every value of x is the same (homogeneity of variances)
• 4. The observations are independent
Finding the best fit
$\mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
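The squared-error criterion can be evaluated directly. The sketch below assumes NumPy and a matrix `X` whose rows are the inputs x_i; the toy data and the function name `squared_error_loss` are hypothetical.

```python
import numpy as np

def squared_error_loss(theta, X, y):
    """Sum of squared residuals (y_i - theta^T x_i)^2 over all data points."""
    residuals = y - X @ theta
    return float(np.sum(residuals ** 2))

# Toy data: the first column of ones plays the role of the intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

loss_good = squared_error_loss(np.array([1.0, 2.0]), X, y)  # fits every point exactly
loss_bad = squared_error_loss(np.array([0.0, 2.0]), X, y)   # each residual equals 1
```

A better fit gives a smaller loss, which is what "finding the best fit" means here.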
Loss, training, testing
• Loss
• A function to measure the model performance (e.g. MSE)
• Training
• Tuning model parameters to minimize the value of the loss function
$\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• Testing
• Using the trained model to make predictions on new data and evaluating model performance
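The loss/training/testing workflow can be sketched as follows, assuming NumPy, synthetic data, and an arbitrary 150/50 split. `np.linalg.lstsq` stands in for the training step, and the loss is averaged per data point so that the training and test values are comparable despite the different set sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
true_theta = np.array([1.5, -2.0, 0.5])            # hypothetical ground truth
X = rng.normal(size=(n, p))
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Hold out the last 50 points as "new data" for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Training: choose theta to minimize the squared error on the training set.
theta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def mse(theta, X, y):
    """Mean of the squared residuals."""
    return float(np.mean((y - X @ theta) ** 2))

train_mse = mse(theta_hat, X_train, y_train)
test_mse = mse(theta_hat, X_test, y_test)          # performance on unseen data
```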
Gradient Descent Illustrated
[Figure: a convex function; the minimum is where the slope = 0]
Gradient descent
$\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• Sum over all data points: batch gradient descent
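A minimal sketch of batch gradient descent on this loss, assuming NumPy. The learning rate `lr`, the step count `n_steps`, and the synthetic data are hypothetical choices, and the gradient is divided by n purely to keep the step size independent of the data set size.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize the squared error with full-batch gradient descent."""
    n, p = X.shape
    theta = np.zeros(p)                      # start from the zero vector
    for _ in range(n_steps):
        residuals = y - X @ theta            # epsilon_i for every data point
        # Gradient of the summed squared error is -2 X^T residuals;
        # dividing by n only rescales the step size.
        grad = -2.0 * (X.T @ residuals) / n
        theta = theta - lr * grad            # step against the gradient
    return theta

# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.05, size=100)

theta_gd = batch_gradient_descent(X, y)
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # reference least-squares solution
```

Because the loss is convex, the iterates approach the same solution that a direct least-squares solver returns.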
Stochastic and Minibatch Gradient Descent
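• Batch (all $n$ data points): $\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• Stochastic (a single data point $i$): $\mathrm{Loss} = \epsilon_i^2 = (y_i - \theta^T x_i)^2$
• Minibatch ($n_1$ data points): $\mathrm{Loss} = \mathrm{MSE} = \sum_{i=1}^{n_1} \epsilon_i^2 = \sum_{i=1}^{n_1} (y_i - \theta^T x_i)^2$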
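The three variants differ only in how many data points enter each gradient step. The sketch below assumes NumPy; `batch_size`, `lr`, `n_epochs` and the synthetic data are hypothetical. Setting `batch_size=1` gives stochastic gradient descent, and `batch_size=n` recovers batch gradient descent.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=10, lr=0.05, n_epochs=50, seed=0):
    """Gradient descent where each step uses only `batch_size` data points."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)                    # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]     # indices of this minibatch
            residuals = y[idx] - X[idx] @ theta
            # Gradient of the minibatch squared error, averaged over the batch.
            grad = -2.0 * (X[idx].T @ residuals) / len(idx)
            theta = theta - lr * grad
    return theta

# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(1)
true_theta = np.array([3.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ true_theta + rng.normal(scale=0.05, size=100)

theta_minibatch = minibatch_sgd(X, y, batch_size=10)
theta_stochastic = minibatch_sgd(X, y, batch_size=1)
```

With a constant learning rate the stochastic variants keep fluctuating around the minimizer instead of settling exactly on it, so the results are close to the batch solution but not identical.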
Minimizing the Squared Error
Chain Rule
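• Rewriting the gradient in matrix form: $\nabla_\theta \mathrm{Loss} = -2 X^T (y - X\theta)$, where $X$ is the $n \times p$ matrix whose rows are $x_i^T$ and $y$ is the $n \times 1$ vector of responses
• Setting the gradient to zero: $\hat{\theta} = (X^T X)^{-1} X^T y$, a $p \times 1$ vector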
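The closed-form minimizer can be sketched in NumPy by solving the normal equations $(X^T X)\,\theta = X^T y$. This assumes $X^T X$ is invertible; the toy data and the name `fit_normal_equations` are hypothetical.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve (X^T X) theta = X^T y without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: the first column of ones plays the role of the intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta_hat = fit_normal_equations(X, y)

# At the minimizer the gradient -2 X^T (y - X theta) vanishes.
gradient_at_solution = -2.0 * X.T @ (y - X @ theta_hat)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical design choice; both express the same formula.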
Thinking about the regression model from a probabilistic point of view
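• For a single data point $(x, y)$:
• $x$: independent variable (vector), observed (the condition)
• $y$: response variable (scalar)
• Joint probability: $p(x, y) = p(x)\, p(y \mid x)$
• Discriminative model: model only $p(y \mid x)$, the conditional probability distribution of $y$ based on a single input $x$
• $y = \theta^T x + \epsilon$, where $\theta^T x$ is deterministic and $\epsilon \sim \mathcal{N}(0, \sigma^2)$ follows a normal distribution
• Hence $y \mid x \sim \mathcal{N}(\theta^T x, \sigma^2)$, with mean $\theta^T x$ and variance $\sigma^2$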
For multiple data points…
Assuming all data points are independent and identically distributed (iid)…
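Under the iid assumption the joint conditional density factorizes into a product over the data points. A sketch of that step, assuming Gaussian noise with variance $\sigma^2$; the symbols $L(\theta)$ and $\sigma^2$ are notation assumed here:

```latex
L(\theta)
  = p(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta)
  = \prod_{i=1}^{n} p(y_i \mid x_i; \theta)
  = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)
```

Independence justifies the product, and "identically distributed" means every factor has the same form with the same $\theta$ and $\sigma^2$.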
Rewriting with Matrix Notation
• Rewriting the model using matrix operations:
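$y = X\theta + \epsilon$, where $y$ is $n \times 1$, $X$ is $n \times p$, $\theta$ is $p \times 1$ and $\epsilon$ is $n \times 1$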
Estimating the Model
• Given data $(x_i, y_i)$, $i = 1, \dots, n$, how can we estimate $\theta$?
Maximizing the Likelihood
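• Want to compute: $\hat{\theta} = \arg\max_\theta L(\theta)$, where $L(\theta)$ is the likelihood of the data
• The logarithm is a monotone function, so $\log L(\theta)$ has the same maximizer and is easy to maximize
• Want to compute: $\hat{\theta} = \arg\max_\theta \log L(\theta)$
• Plugging in the log-likelihood (Gaussian noise with variance $\sigma^2$): $\hat{\theta} = \arg\max_\theta \left[ -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \right]$
• Dropping the sign (and the terms that do not depend on $\theta$) and flipping from maximization to minimization: $\hat{\theta} = \arg\min_\theta \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$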
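[Figure: the squared error is a convex function of $\theta$; the minimum is where the slope = 0]
$\hat{\theta} = (X^T X)^{-1} X^T y$, where $X$ ($n \times p$) stacks the inputs $x_i^T$ as rows, $y$ ($n \times 1$) stacks the responses, and $\hat{\theta}$ is $p \times 1$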
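The equivalence between maximizing the likelihood and minimizing the squared error can be checked numerically. The sketch below assumes NumPy, Gaussian noise with a known `sigma` (set to 1 for simplicity), and synthetic data; all names are hypothetical.

```python
import numpy as np

def negative_log_likelihood(theta, X, y, sigma=1.0):
    """Gaussian negative log-likelihood of y given X and theta."""
    n = len(y)
    residuals = y - X @ theta
    return float(0.5 * n * np.log(2 * np.pi * sigma ** 2)
                 + np.sum(residuals ** 2) / (2 * sigma ** 2))

def squared_error(theta, X, y):
    """Sum of squared residuals."""
    return float(np.sum((y - X @ theta) ** 2))

# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=50)

theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares minimizer

# Moving away from the least-squares solution raises both objectives.
theta_other = theta_ls + np.array([0.1, -0.1])
nll_gap = (negative_log_likelihood(theta_other, X, y)
           - negative_log_likelihood(theta_ls, X, y))
sse_gap = squared_error(theta_other, X, y) - squared_error(theta_ls, X, y)
```

The two objectives differ only by a constant and a positive scale factor of $1/(2\sigma^2)$, so they share the same minimizer.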