
# Linear regression

## with gradient descent

Ingmar Schuster
Patrick Jähnichen
using slides by Andrew Ng

Institut für Informatik

## This lecture covers

Linear Regression
Hypothesis formulation, hypothesis space
Optimizing Cost with Gradient Descent
Using multiple input features with Linear Regression
Feature Scaling
Nonlinear Regression
Optimizing cost using derivatives

## Linear regression w. gradient descent

Linear Regression


## Regression Problem

Expected answer available for each example in data
Prediction of continuous output


## Notation

m: number of training examples
x: input (predictor) variable, "features" in ML-speak
y: output (response) variable

| Square meters | Price in 1000 |
|---|---|
| 73 | 174 |
| 146 | 367 |
| 38 | 69 |
| 124 | 257 |
| ... | ... |


## Learning procedure

Training data → Learning Algorithm → hypothesis (mapping between input and output), described by the hypothesis parameters
Size of flat → hypothesis → Estimated price
Linear regression, one input variable (univariate)
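For the univariate case, a hypothesis of this kind can be written as a straight line. The symbols $\theta_0$ (intercept) and $\theta_1$ (slope) are assumed here, following the notation common in Andrew Ng's course material:

```latex
h_\theta(x) = \theta_0 + \theta_1 x
```

For instance, with the hypothetical values $\theta_0 = 0$ and $\theta_1 = 2.4$, a flat of 73 square meters would be estimated at $2.4 \cdot 73 = 175.2$ (price in 1000).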

## How to choose parameters?


## Optimization objective

Purpose of learning algorithm expressed in optimization objective and cost function (often called J)
Objective: smallest average distance to points in training data (h(x) close to y for (x, y) in training data)
m: number of training instances
Squaring: penalty for large deviations is stronger
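The objective above can be sketched as a squared-error cost. The function names, the parameter names `theta0` and `theta1`, and the factor 1/(2m) are assumptions (the factor follows a common convention; any positive constant gives the same minimum). The slope 2.4 is a hypothetical value chosen only for comparison.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J: sum of squared deviations divided by 2m.

    The factor 1/(2m) is an assumed convention; any positive constant
    factor leads to the same minimizing parameters.
    """
    m = len(xs)  # number of training instances
    squared_errors = [(hypothesis(theta0, theta1, x) - y) ** 2
                      for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)


# The four example flats from the training data (square meters, price in 1000).
square_meters = [73, 146, 38, 124]
prices = [174, 367, 69, 257]

# A line through the origin with the hypothetical slope 2.4 is much closer
# to the data than the all-zero hypothesis, so its cost is lower.
print(cost(0.0, 2.4, square_meters, prices))
print(cost(0.0, 0.0, square_meters, prices))
```

Because the deviations are squared, the single large miss on the 124 square meter flat contributes more to the first cost than the other three flats together.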


Optimizing Cost

## Gradient Descent Outline

Want to minimize J
Keep changing the parameters to reduce J until we end up at a minimum
Stepwise descent towards minimum
Derivatives work only for few parameters

[plot by Andrew Ng]

Update uses the partial derivative of J with respect to each parameter
Beware: incremental update is incorrect! All parameters must be updated simultaneously
Steps become smaller without changing the learning rate (the derivative shrinks near the minimum)
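A minimal sketch of the update step for the univariate case, assuming the squared-error cost with factor 1/(2m). The function name, the parameter names, and the toy data are hypothetical. The tuple assignment is what makes the update simultaneous.

```python
def gradient_descent(xs, ys, learning_rate, steps):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = 1/(2m) * sum(errors^2).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the OLD
        # parameters before either parameter is overwritten.
        theta0, theta1 = (theta0 - learning_rate * grad0,
                          theta1 - learning_rate * grad1)
    return theta0, theta1


# Toy data generated from y = 1 + 2x (hypothetical, not from the slides).
theta0, theta1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7],
                                  learning_rate=0.1, steps=5000)
print(theta0, theta1)
```

The incorrect incremental variant would overwrite `theta0` first and then compute the gradient for `theta1` from the already changed `theta0`.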
Convergence: an overly large learning rate may fail to converge or may even diverge


## Checking convergence

Gradient descent works correctly if J decreases with every step
Possible convergence criterion: converged if J decreases by less than a constant in one step
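The criterion can be sketched by recording J after every step and stopping once the decrease falls below a threshold. The name `epsilon`, its default value, and the toy data are assumptions made for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost with the assumed factor 1/(2m)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)


def descend_until_converged(xs, ys, learning_rate=0.1, epsilon=1e-9,
                            max_steps=100_000):
    """Gradient descent that stops when J decreases by less than epsilon."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = [cost(theta0, theta1, xs, ys)]  # J before and after each step
    for _ in range(max_steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = (theta0 - learning_rate * grad0,
                          theta1 - learning_rate * grad1)
        history.append(cost(theta0, theta1, xs, ys))
        # Note: an INCREASE of J also ends the loop here, so the caller
        # should inspect the history to tell convergence from divergence.
        if history[-2] - history[-1] < epsilon:
            break
    return theta0, theta1, history


# Toy data generated from y = 1 + 2x (hypothetical).
theta0, theta1, history = descend_until_converged([0, 1, 2, 3], [1, 3, 5, 7])
decreasing = all(a >= b for a, b in zip(history, history[1:]))
print(len(history) - 1, "steps, J decreased at every step:", decreasing)
```

Plotting or printing `history` is a simple way to see whether the chosen learning rate behaves as expected.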


## Local Minima

Gradient descent can get stuck at local minima (e.g. when J is not the squared error of a regression with only one variable)
Random restart with different initial parameter(s)
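A small sketch of random restarts on a one-parameter cost with two local minima. The function `f` is a hypothetical stand-in chosen only because it is non-convex; it is not a cost function from the lecture.

```python
import random


def f(t):
    """A non-convex stand-in cost with two local minima (hypothetical)."""
    return t ** 4 - 3 * t ** 2 + t


def f_prime(t):
    """Derivative of f."""
    return 4 * t ** 3 - 6 * t + 1


def descend(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent on f from a given starting parameter."""
    t = start
    for _ in range(steps):
        t -= learning_rate * f_prime(t)
    return t


def random_restart(starts):
    """Run gradient descent from every start and keep the best result."""
    results = [descend(s) for s in starts]
    return min(results, key=f)


# Starting right of the hump ends in the shallower local minimum,
# starting left of it ends in the deeper one.
print(descend(2.0), descend(-2.0))

rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
best = random_restart(starts)
print(best, f(best))
```

Each restart is independent, so the restarts can also be run in parallel when a single descent is expensive.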


## Multiple features

| Square meters | Bedrooms | Floors | Age of building (years) | Price in 1000 |
|---|---|---|---|---|
| x1 | x2 | x3 | x4 | y |
| 200 | | | 45 | 460 |
| 131 | | | 40 | 232 |
| 142 | | | 30 | 315 |
| 756 | | | 36 | 178 |

Notation


Hypothesis representation

More compact

with definition
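With n input features, a hypothesis of this kind is usually written as a weighted sum, and the compact form follows from adding a constant feature. The symbols below ($\theta_j$, $x_j$, $n$, and the definition $x_0 = 1$) are assumed from the notation common in Andrew Ng's course material:

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
            = \sum_{j=0}^{n} \theta_j x_j
            = \theta^{\top} x
\qquad \text{with the definition } x_0 = 1
```

For the flat example with four features (square meters, bedrooms, floors, age of building), this gives n = 4 and five parameters $\theta_0, \dots, \theta_4$.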


With multiple variables, comparison of variance in data is lost (scales can vary strongly)
Square meters: 30 - 400
Bedrooms: 1 - 10
Price: 80 000 - 2 000 000

Feature Scaling


## Feature scaling

Scale each feature (for single data point of feature j) by Z-score conversion


## Z-Score conversion

Center data on 0
Scale data so majority falls into range [-1, 1]
mu: mean / empirical expected value
sigma: empirical standard deviation
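A minimal sketch of the conversion for one feature: subtract mu from every value, then divide by sigma. The function name is hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) rather than the sample version (`statistics.stdev`) is an assumption.

```python
import statistics


def z_score(values):
    """Return the values centered on 0 and divided by their standard deviation."""
    mu = statistics.mean(values)
    # Population standard deviation; using the sample version
    # (statistics.stdev) instead is an equally plausible reading.
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Square meters of the four example flats from the training data.
square_meters = [73, 146, 38, 124]
print(z_score(square_meters))
```

The same mu and sigma computed on the training data would also have to be applied to any new input before the hypothesis is evaluated on it.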


Nonlinear Regression
(by cheap trickery)
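One common way to get nonlinear regression out of a linear model, assumed here to be the kind of trick meant, is to add powers of the input as extra features and then fit an ordinary linear hypothesis on them. The function names and parameter values below are hypothetical.

```python
def polynomial_features(x, degree):
    """Map a single input x to the feature list [x, x^2, ..., x^degree]."""
    return [x ** k for k in range(1, degree + 1)]


def hypothesis(theta, x, degree):
    """Linear in the parameters: theta[0] is the intercept, theta[k] weights x^k."""
    features = [1.0] + polynomial_features(x, degree)
    return sum(t * f for t, f in zip(theta, features))


# Hypothetical parameters describing the curve 1 + 0*x + 0.5*x^2.
print(hypothesis([1.0, 0.0, 0.5], 4.0, degree=2))
```

Powers of a feature differ strongly in scale, so feature scaling matters even more with this approach.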



## Optimizing cost using derivatives

Set the partial derivative of J with respect to each parameter i to 0 and solve for all i
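For the univariate squared-error cost, setting both partial derivatives to 0 can be solved by hand. The sketch below uses the resulting closed form (slope as covariance over variance, intercept from the means); the function name and the toy data are hypothetical.

```python
def fit_closed_form(xs, ys):
    """Univariate least squares from setting both partial derivatives of J to 0."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # dJ/dtheta1 = 0 gives the slope as covariance over variance.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    theta1 = covariance / variance
    # dJ/dtheta0 = 0 gives the intercept.
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Toy data generated from y = 1 + 2x (hypothetical).
print(fit_closed_form([0, 1, 2, 3], [1, 3, 5, 7]))

# The four example flats from the training data (square meters, price in 1000).
print(fit_closed_form([73, 146, 38, 124], [174, 367, 69, 257]))
```

No learning rate and no iteration count appear anywhere in this function, which is the main practical difference to gradient descent.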


## Comparison Gradient Descent vs. Setting derivative = 0

| Gradient Descent | Derivation |
|---|---|
| Need to choose learning rate | No need to choose learning rate |
| Needs many iterations, random restarts etc. | No iterations |
| Works well for many features | |

## This lecture covers

Linear Regression
Hypothesis formulation, hypothesis space
Optimizing Cost with Gradient Descent
Using multiple input features with Linear Regression
Feature Scaling
Nonlinear Regression
Optimizing cost using derivatives


## Pictures

Some public domain plots from en.wikipedia.org and de.wikipedia.org