This document provides an overview of linear regression. It begins with an example of modeling life satisfaction based on GDP per capita. It then introduces simple and generalized linear regression models. It discusses using the normal equation and gradient descent to learn the parameters, noting gradient descent is better for large numbers of features. It also compares different gradient descent algorithms and their advantages, like stochastic gradient descent being fast for large training sets.
ELE5453, 30102425
Fall Semester 2022-2023
Dr. Abeer Al-Hyari
Abeer.Hyari@bau.edu.jo

Content
• Life satisfaction example
• A simple linear regression model
• A more generalized model
• Normal equation
• Gradient Descent (GD)
• Other GD algorithms

What makes people happy?
• Suppose you want to know if money makes people happy, so you download the Better Life Index data from the OECD's website as well as stats about GDP per capita from the IMF's website. Then you join the tables and sort by GDP per capita.

Life satisfaction example
• There does seem to be a trend here, although the data is noisy (i.e., partly random).
• It looks like life satisfaction goes up more or less linearly as the country's GDP per capita increases.
• So you decide to model life satisfaction as a linear function of GDP per capita.
• This step is called model selection: you selected a linear model of life satisfaction with just one attribute, GDP per capita (Equation 1-1).

Simple linear model

Which model is the best?
• Before you can use your model, you need to define the parameter values θ0 and θ1.
• How can you know which values will make your model perform best?
• To answer this question, you need to specify a performance measure.
• You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is.

Cost function
• For linear regression problems, people typically use a cost function that measures the distance between the linear model's predictions and the training examples.
• The objective is to minimize this distance: less is better.

A more general linear model
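The cost function above can be sketched as a mean squared error (MSE) over the training examples. A minimal sketch for the one-feature life-satisfaction model, using made-up GDP numbers (not the real OECD/IMF figures):

```python
# MSE cost for the simple model: prediction = theta0 + theta1 * x
# (toy data; the real OECD/IMF values are not reproduced here)

def predict(theta0, theta1, x):
    return theta0 + theta1 * x

def mse_cost(theta0, theta1, xs, ys):
    """Mean squared distance between the model's predictions and the examples."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / m

gdp = [10.0, 20.0, 30.0, 40.0]        # GDP per capita in thousands (made up)
satisfaction = [5.0, 5.5, 6.0, 6.5]   # life satisfaction scores (made up)

# These toy points lie exactly on the line theta0=4.5, theta1=0.05, so that
# parameter choice has cost 0; any other choice has a larger cost ("less is better").
assert mse_cost(4.5, 0.05, gdp, satisfaction) == 0.0
assert mse_cost(4.5, 0.05, gdp, satisfaction) < mse_cost(4.0, 0.1, gdp, satisfaction)
```

Choosing the θ values that minimize this cost is exactly the "which model is the best?" question: the performance measure turns model selection into an optimization problem.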
• 5 features → 6 parameters, because we also have θ0 (the bias term).
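The feature/parameter count above can be sketched in vectorized form: prepending a constant input x0 = 1 to every example turns the model into a single dot product θᵀx (variable names below are my own illustration):

```python
import numpy as np

# General linear model: y_hat = theta0 + theta1*x1 + ... + thetan*xn
n_features = 5
theta = np.zeros(n_features + 1)   # 5 features -> 6 parameters (theta0 is the bias)

def predict(theta, x):
    """x: 1-D array of n feature values; returns theta^T [1, x]."""
    x_b = np.concatenate(([1.0], x))   # add the bias input x0 = 1
    return x_b @ theta

assert theta.shape == (6,)
assert predict(theta, np.zeros(n_features)) == 0.0   # all-zero theta predicts 0
```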
Linear regression model

Cost function

Normal equation

Normal equation experiment

Normal equation complexity
• Using the normal equation takes time roughly O(mn² + n³): forming XᵀX costs O(mn²), where n is the number of features and m is the number of examples, and inverting it costs about O(n³).
• So, it scales well with a large set of examples, but poorly with a big number of features.
• Now let's look at other ways of learning θ that may be better when there are a lot of features, or too many training examples to fit in memory.

Gradient descent
• The general idea is to tweak the parameters iteratively in order to minimize a cost function.
• Analogy: suppose you are lost in the mountains in a dense fog. You can only feel the slope of the ground below your feet. A good strategy to get to the bottom quickly is to go downhill in the direction of the steepest slope.
• The step size taken while walking downhill is called the learning rate.

Learning rate too small

Learning rate too big

Gradient descent pitfalls

Gradient descent with feature scaling
• When all features have a similar scale, GD tends to converge quickly.

Batch gradient descent

Batch GD: Learning rates

Other GD algorithms
• Batch GD: uses the whole training set to compute each gradient step.
• Stochastic GD: uses one random training example to compute each gradient step.
• Mini-batch GD: uses a small random set of training examples to compute each gradient step.

Comparing linear regression algorithms

| Algorithm       | Large m | Large n | Hyper-params | Scaling required | Scikit-learn     |
|-----------------|---------|---------|--------------|------------------|------------------|
| Normal equation | Fast    | Slow    | 0            | No               | LinearRegression |
| Batch GD        | Slow    | Fast    | 2            | Yes              | N/A              |
| Stochastic GD   | Fast    | Fast    | >= 2         | Yes              | SGDRegressor     |
| Mini-batch GD   | Fast    | Fast    | >= 2         | Yes              | N/A              |
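The normal equation discussed above, θ̂ = (XᵀX)⁻¹ Xᵀ y, can be sketched in NumPy. The data below is synthetic and noiseless so the recovered parameters can be checked exactly, and `np.linalg.pinv` stands in for a raw matrix inverse for numerical robustness:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 100, 3                                   # m examples, n features
X = rng.normal(size=(m, n))
true_theta = np.array([4.0, 3.0, -2.0, 0.5])    # bias + 3 weights (made up)
y = true_theta[0] + X @ true_theta[1:]          # noiseless targets

X_b = np.c_[np.ones((m, 1)), X]                 # prepend x0 = 1 for the bias
# Normal equation: theta = (X^T X)^(-1) X^T y
theta_hat = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y

assert np.allclose(theta_hat, true_theta)       # exact recovery on noiseless data
```

The O(n³) matrix inversion in the middle line is exactly why this approach slows down as the number of features grows, per the complexity slide above.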
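Batch gradient descent for the same MSE cost can be sketched as follows. The learning rate and iteration count are hand-picked values, and the single feature is already on a reasonable scale, so no explicit feature-scaling step is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200                                    # number of training examples
X = rng.normal(size=(m, 1))                # one feature (synthetic data)
y = 4.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=m)

X_b = np.c_[np.ones((m, 1)), X]            # prepend x0 = 1 for the bias theta0
theta = np.zeros(2)
eta = 0.1                                  # learning rate: too small is slow, too big may diverge

for _ in range(1000):
    # Gradient of the MSE cost over the WHOLE training set (hence "batch")
    gradients = (2.0 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients               # step downhill along the steepest slope

# theta should now be close to the generating parameters [4.0, 3.0]
assert np.allclose(theta, [4.0, 3.0], atol=0.1)
```

Stochastic GD replaces the full-batch gradient with the gradient at one randomly drawn example per step, which is what makes it fast for large training sets; scikit-learn's `SGDRegressor` (listed in the comparison table above) implements that variant.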