
|section|Linear regression|

|video|https://www.youtube.com/embed/YGBZE6RA7aU|

Linear Models and Search


Part 1: Linear regression

Machine Learning
mlvu.github.io
Vrije Universiteit Amsterdam

machine learning: the basic recipe

Abstract your problem to a standard task.
Classification, Regression, Clustering, Density estimation, Generative Modeling, Online learning, Reinforcement Learning, Structured Output Learning

Choose your instances and their features.
For supervised learning, choose a target.

Choose your model class.
Linear models

Search for a good model.
Choose a loss function, choose a search method to minimise the loss.

Here is the "basic recipe" for machine learning we saw in the last lecture. In this lecture, we'll have a look at linear models, both for regression and classification. We'll see how to define a linear model, how to formulate a loss function and how to search for a model that minimises that loss function.

Most of the lecture will be focused on search methods. The linear models themselves aren't that strong, but because they're pretty simple, we can use them to explain various search methods that we can also apply to more complex models as the course progresses. Specifically, the method of gradient descent, which we'll introduce here, will be the search method used for almost all approaches we will discuss.

regression

We'll start with regression. Here's how we explained regression in the last lecture.

[figure: a dataset of instances with features and target values is fed to a learner, which produces a model; the model then predicts the target value (here 1.8) for a new instance whose target is unknown]
This is the example data we used to illustrate regression: predicting the body mass of a penguin from its flipper length.

flipper length (dm)   body mass (kg)
1.93                  3.65
1.88                  4.70
1.90                  4.95
2.17                  5.70
1.90                  3.32
1.92                  5.70
2.23                  5.00
2.05                  3.45
2.08                  3.05
1.93                  5.30
1.88                  5.55

data source: https://allisonhorst.github.io/palmerpenguins/, https://github.com/mcnakhaee/palmerpenguins (python package)
image source: https://allisonhorst.github.io/palmerpenguins/

As we saw, the linear regression model is simply a linear function that maps the feature(s) to the target value. In the case of one feature, such a function looks like a line. The only decision we have to make is which line fits the data best.

[figure: the penguin data plotted in feature space (flipper length against body mass), with a candidate regression line]

regression: example data

x    t
1    0.7
2    2.3
3    2.7
4    4.3
5    4.7
6    6.3

To simplify things we'll use this very simple data set in the rest of this lecture. There is one input feature x, one output value t (for target) and we have six instances.

We will assume that all our features are numeric. In a later lecture we will see how best to convert categorical features to numeric ones. We will develop linear regression for an arbitrary number of features m, but we will keep visualizing what we're doing on this one-feature dataset.
notation

x, y, z          scalar (i.e. a single number)
x, y, z (bold)   vector (a column of numbers)
X, Y, Z (bold)   matrix (a rectangular grid of numbers)
x_i              scalar element of a vector x
X_ij             scalar element of a matrix X

Throughout the course, we will use the following notation: lowercase non-bold for scalars, lowercase bold for vectors and uppercase bold for matrices. When we're indexing individual elements of vectors and matrices, these are scalars, so they are non-bold.

multiple features

X = x_1, x_2, x_3, ...
T = t_1, t_2, t_3, ...

x = (x_1, x_2, ..., x_m)^T

x_i (bold)   instance i in the data
x_j          feature j (of some instance)
X_ij         feature j of instance i

As we saw in the last lecture, an instance in machine learning is described by m features (with m fixed for a given dataset). We will represent this as a vector for each instance, with each element of the vector representing a feature.

This can be a little confusing, since we sometimes want to index the instance within the dataset and sometimes the features of a given instance. Pay attention to whether the letter we're indexing is bold or non-bold: a bold letter x with a subscript i refers to the i-th instance in the data (containing all features). A non-bold letter x with an index i refers to the i-th scalar feature of some instance x.

In the rare cases where we need to refer to both the index of the instance, and the index of the feature within the instance, we will usually use an uppercase X. This makes sense if you imagine the data as a big matrix X, with the instances as rows, and the features as columns.

We'll occasionally deviate from this notation when doing so makes things clearer, but we'll point it out when that happens.
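As an illustration of this indexing convention, here is a sketch in NumPy (the numbers are just the first few penguin rows, used for illustration):

```python
import numpy as np

# Data matrix X: instances as rows, features as columns.
X = np.array([[1.93, 3.65],
              [1.88, 4.70],
              [1.90, 4.95]])

x_0 = X[0]      # bold x with subscript 0: instance 0, a vector of all its features
X_01 = X[0, 1]  # X_{0,1}: the scalar value of feature 1 of instance 0
print(x_0)      # [1.93 3.65]
print(X_01)     # 3.65
```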
defining a model: one feature

1 feature x_1:   f_{w,b}(x) = w x_1 + b

If we have one feature (as in this example) a standard linear regression model has two parameters (the numbers that determine which line we fit through our data): w, the weight, and b, the bias. The weight is also sometimes called the slope and the bias is also sometimes called the intercept.

b determines where the line crosses the vertical axis. That is, what value f takes when x = 0.

w determines how much the line rises if we move one step to the right (i.e. increase x by 1).

For the line drawn here, we have b = 3 and w = 0.5.

Note that this isn't a very good fit for the data. Our job is to find better numbers w and b.
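As a sanity check, the model on this slide is a two-line Python function; the values w = 0.5 and b = 3 are the ones from the drawn line:

```python
def f(x, w=0.5, b=3.0):
    """One-feature linear model: f_{w,b}(x) = w*x + b."""
    return w * x + b

print(f(0))  # f(0) = b = 3.0: where the line crosses the vertical axis
print(f(2))  # each unit step in x adds w = 0.5, so f(2) = 4.0
```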

two features

1 feature x_1:         f_{w,b}(x) = w x_1 + b
2 features x_1, x_2:   f_{w_1,w_2,b}(x) = w_1 x_1 + w_2 x_2 + b

If we have multiple features, each feature gets its own weight (also known as a coefficient).

a 2D linear function

f(x_1, x_2) = w_1 x_1 + w_2 x_2 + b

Here's what that looks like. The thick orange lines together indicate a plane (which rises in the x_2 direction, and declines in the x_1 direction). The parameter b describes how high above the origin this plane lies (what the value of f is if both features are 0). The value w_1 indicates how much f increases if we take a step of 1 along the x_1 axis, and the value w_2 indicates how much f increases if we take a step of size 1 along the x_2 axis.
f
for n features

f_{w,b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + b
           = w^T x + b

with w = (w_1, ..., w_n)^T and x = (x_1, ..., x_n)^T

For an arbitrary number of features, the pattern continues as you'd expect. We summarize the w's in a vector w with the same number of elements as x.

We call the w's the weights, and b the bias. The weights and the bias are the parameters of the model. We need to choose these to fit the model to our data.

The operation of multiplying elements of w by the corresponding elements of x and summing them is the dot product of w and x.
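In code, the general model is one dot product plus a scalar; a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def linear_model(w, b, x):
    """f_{w,b}(x) = w^T x + b, for any number of features."""
    return np.dot(w, x) + b

w = np.array([1.0, -2.0, 0.5])  # one weight per feature
x = np.array([3.0, 1.0, 2.0])
print(linear_model(w, 1.0, x))  # 1*3 - 2*1 + 0.5*2 + 1 = 3.0
```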

dot product

w^T x   or   w · x

w^T x = Σ_i w_i x_i

w^T x = ||w|| ||x|| cos α

The dot product of two vectors is simply the sum of the products of their elements. If we place the features into one vector and the weights into another, then a linear function is simply their dot product (plus the b parameter).

The transpose (superscript T) notation arises from the fact that if we make one vector a row vector and one a column vector, and matrix-multiply them, the result is the dot product (try it).

The dot product also has a geometric interpretation: the dot product is equal to the lengths of the two vectors, multiplied by the cosine of the angle α between them. We won't give you a proof, but we'll occasionally make use of this form of the dot product, so make sure you remember this.

The dot product will come back a lot in the rest of the course. We don't have time to discuss it in depth, but if your memory is hazy, we strongly recommend that you take a minute to go back to your linear algebra book and look up the various interpretations of what the dot product means.
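We can check numerically that the two forms agree, measuring the angle directly from the vectors' directions (a sketch for one pair of vectors, not a proof):

```python
import math

w = (3.0, 4.0)
x = (4.0, 3.0)

# Algebraic form: sum of element-wise products.
dot = sum(wi * xi for wi, xi in zip(w, x))  # 3*4 + 4*3 = 24

# Geometric form: ||w|| ||x|| cos(alpha), with alpha measured directly.
alpha = math.atan2(w[1], w[0]) - math.atan2(x[1], x[0])
geometric = math.hypot(*w) * math.hypot(*x) * math.cos(alpha)  # 5 * 5 * 0.96

print(dot, round(geometric, 9))  # both 24.0
```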

example: predicting high blood pressure

instances: patients
features: job stress, healthy diet, age

To build some intuition for the meaning of the weights w, let's look at an example. Imagine we are trying to predict the risk of high blood pressure based on these three features. We'll assume that the features are expressed in some number that measures these properties.
dot product

patient x, weights w:   w^T x = x_1 w_1 + x_2 w_2 + x_3 w_3

features: has stressful job, has healthy diet, age
weights:  how predictive stress is, how predictive diet is, how predictive age is

Here's what the dot product expresses. For some features, like job stress, we want to learn a positive weight (since more job stress should contribute to higher risk of high blood pressure). For others, we want to learn a negative weight (the healthier your diet, the lower your risk of high blood pressure). Finally, we can control the magnitude of the weights to control their relative importance: if age and job stress both contribute positively, but age is the bigger risk factor, we make both weights positive, but we make the weight for age bigger.
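A toy version in Python; the weight values here are invented for illustration, not learned from anything:

```python
# Hypothetical weights: positive for risk-increasing features, negative for
# risk-decreasing ones; a larger magnitude means the feature matters more.
weights = {"job stress": 0.8, "healthy diet": -0.5, "age": 1.5}

def risk_score(patient):
    """w^T x: the weighted sum of the patient's feature values."""
    return sum(weights[name] * value for name, value in patient.items())

patient = {"job stress": 2.0, "healthy diet": 3.0, "age": 6.0}
print(risk_score(patient))  # 0.8*2 - 0.5*3 + 1.5*6 = 9.1
```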

But which model fits our data best?

two more ingredients:
• loss function
• search method (next video)

So, that's our model defined in detail. But we still don't know which model to choose for a given dataset. Given some data, which values should we choose for the parameters w and b?

In order to answer this question, we need two more ingredients. First, we need a loss function, which tells us how well a particular choice of model does (for the given data) and second, we need a way to search the space of all models for a particular model that results in a low loss (a model for which the loss function returns a low value).

mean squared error loss Here is a common loss function for regression: the mean-
squared error (MSE) loss. We saw this briefly already in
the previous lecture.

loss_{X,T}(p) = (1/n) Σ_j (f_p(x_j) - t_j)²

loss_{X,T}(w, b) = (1/n) Σ_j (wᵀx_j + b - t_j)²

Note that the loss function takes a model as its argument.
The model maps the data to the output, the loss function
maps a model to a loss value. The data is a constant in the
loss function.

The main thing a regression loss should do is to compare


the model predictions to the actual values in our dataset,
and return a large value if they are all very different, and a
17
small value if they are all very close together. The difference
between the prediction and the actual value is called the
residual. We've drawn these here as green bars.

The MSE loss takes the residual for each instance in our
data, squares them, and returns the average. One reason for
the squaring step is to ensure that negative and positive
residuals don’t cancel out (giving us a small loss even
though we have big residuals). But that's not the only
reason.
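As a sketch, the MSE loss for the one-feature linear model can be computed like this (the data values in the example call are invented):

```python
def mse_loss(w, b, xs, ts):
    """Mean squared error of the model f(x) = w*x + b on inputs xs with targets ts."""
    residuals = [(w * x + b) - t for x, t in zip(xs, ts)]   # prediction minus target
    return sum(r * r for r in residuals) / len(residuals)   # mean of the squared residuals

# a model that fits the data exactly has zero loss
zero = mse_loss(2.0, 1.0, [0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

Because of the squaring, doubling one residual adds four times as much to the loss, which is the effect discussed next.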
The squares also ensure that big errors affect the loss more
heavily than small errors. You can visualise this as shown
here: the mean squared error is the mean of the areas of the
green squares (it’s also called sum-of-squares loss).

When we search for a well-fitting model, the search will try
to reduce the big squares much more than the small
squares.

If we think of the residuals as rubber bands, pulling on the


regression line to pull it closer to the points, the rubber
band on the bottom left pulls much harder than all the
18
other ones. Therefore, any search algorithm trying to
minimize this loss will be much more interested in moving
the left of the line down than in moving the right of the line
up.

It's not guaranteed that this is a good thing. Sometimes this


behavior is desirable and sometimes it isn't. For now, this is
just a simple loss function to get us started.

In later lectures, we will say more about when this kind of


loss is appropriate and when it isn't. We will also see that
this loss function follows from the assumption that our
data contains noise coming from a normal distribution.

visualization stolen from https://machinelearningflashcards.com/

slight variations You may see slightly different versions of the MSE loss:
sometimes we take the average of the squares, sometimes
just the sum. Sometimes we multiply by 1/2 to make the
derivative simpler. In practice, the differences don’t mean
much because we’re not interested in the absolute value,
just in how the loss changes from one model to another.

Σ_j (f_p(x_j) - t_j)²

(1/n) Σ_j (f_p(x_j) - t_j)²

(1/2) Σ_j (f_p(x_j) - t_j)²

√( (1/n) Σ_j (f_p(x_j) - t_j)² )

We will switch between these based on what is most useful
in a given context.

19
|section|Searching for a good model|
|video|https://youtube.com/embed/q97nOAYfpHg|

Linear Models and Search


Part 2: Searching for a good model

Machine Learning
mlvu.github.io
Vrije Universiteit Amsterdam

model space Remember the two most important spaces of machine


learning: the feature space and the model space. The loss
function maps every point in the model space to a loss
value.

y b In a single-feature regression problem plotted like this, the


feature space is just the horizontal axis.

x w

feature space model space


aka instance space aka hypothesis space
21

loss surface As we saw in the previous lecture, we can plot the loss for
every point in our model space. This is called the loss
surface or sometimes the loss landscape. If you imagine a
2D model space, you can think of the loss surface as a
feature space landscape of rolling hills (or sometimes of jagged cliffs).

b
Here is what that actually looks like for the two parameters
of the one-feature linear regression. Note that this is
specific to the data we saw earlier. For a different dataset,
we get a different loss landscape.

model space To minimize the loss, we need to search this space to find
the brightest point in this picture. Or, the lowest point in the
w 22

loss landscape. Remember that, normally, we may have


hundreds of parameters so it isn’t as easy as it looks. Any
method we come up with, needs to work in any number of
dimensions.

We’ve plotted the logarithm of the loss as a trick to make this


image visually easier to understand (it maps the values that
are easy to tell apart to the values we care about). The
logarithm is a monotonic function so log(loss(w, b)) has its
minimum at the same place as loss(w, b).

optimization The mathematical name for this sort of search is
optimization. That is, we are trying to find the input (p, the
model parameters) for which a particular function (the
loss) is at its optimum (a maximum or minimum, in this
case a minimum). Failing that, we’d like to find as low a
value as possible.

p̂ = arg min_p loss_{X,T}(p)

We’ll start by looking at some very simple approaches.

in our example: p = {w1 , w2 , . . . , wm , b}



23

machine learning vs. optimization We often frame machine learning as an optimization


problem, and we use many techniques from optimization,
but it’s important to recognize that there is a difference
optimization: find the minimum, or the best possible
approximation between optimization and machine learning.

machine learning: find the lowest loss that generalizes Optimization is concerned with finding the absolute
minimum (or maximum) of a function. The lower the better,
with no ifs or buts. In machine learning, if we have a very
Minimize the loss on the test data, seeing only the training
expressive model class (like the regression tree from the
data.
last lecture), the model that actually minimizes the loss on
the training data is the one that overfits. In such cases,
we’re not looking to minimize the loss on the training data,
24
since that would mean overfitting, we’re looking to
minimize the loss on the test data. Of course, we don’t get to
see the test data, so we use the training data as a stand-in,
and try to control against overfitting as best we can.

In the case of underpowered models like the linear model,


this distinction isn’t too important, since they’re very
unlikely to overfit. Here, the model that minimizes the loss
on the training data is likely the model that minimizes the
loss on the test data as well. For now, we'll just try some
simple optimization algorithms to find the absolute
minimum of the loss, and worry about overfitting later.

random search Let’s start with a very simple example: random search. We
simply make a small step to a nearby point. If the loss goes
up, we move back to our previous point. If it goes down we
start with a random point p in the model space
stay in the new point. Then we repeat the process.
loop:

pick a random point p’ close to p


We usually stop the loop when the loss gets to a pre-defined
level, or we just run it for a fixed number of iterations, and
if loss(p’) < loss(p):
we see how well we've done.
p <- p’

25
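The pseudocode above maps directly to code. This is a minimal sketch for a two-parameter model space, with "close to" taken as a fixed-radius step; the toy loss function is invented for illustration:

```python
import math
import random

def random_step(p, r):
    """Pick a point uniformly on the circle of radius r around p (2D model space)."""
    angle = random.uniform(0.0, 2.0 * math.pi)
    return (p[0] + r * math.cos(angle), p[1] + r * math.sin(angle))

def random_search(loss, p, r=0.1, iterations=2000):
    current = loss(p)
    for _ in range(iterations):
        q = random_step(p, r)
        if loss(q) < current:          # only keep steps that go downhill
            p, current = q, loss(q)
    return p

# toy convex loss with its minimum at (1, 2), not derived from the penguin data
toy_loss = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2
```

Note that the search only ever compares two loss values; it uses no other information about the loss surface.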

analogy: hiker in a snowstorm A common analogy is a hiker in a snowstorm. Imagine you’re


hiking in the mountains, and you’re caught in a snowstorm.
You can’t see a thing, and you’d like to get down to your
hotel in the valley, or failing that, you’d like to get to as low
a point as possible.

You take a step in a random direction. If you're moving up,


you step back to where you came from, if you're moving
down, you repeat the process with a new random direction.
This is, in effect, what random search is doing. More
importantly, it shows how blind random search is to the larger
structure of the landscape. It can only see what's right in
26
front of it.

image source: https://www.wbur.org/hereandnow/


2016/12/19/rescue-algonquin-mountain
“close to” To implement the random search we need to define how to
pick a point “close to” another in model space.

One simple option is to choose the next point by sampling


uniformly among all points with some pre-chosen distance
r from the current point.

Put more formally: we pick from the hypersphere (or circle, in
2D) with radius r, centered on p.

27

random search Here is random search in action. The transparent red


offshoots are steps that turned out to be worse than the
current point (steps that went uphill). The algorithm starts
on the left, and slowly (with a bit of a detour) stumbles in
the direction of the low loss region.

As we can see, it doesn't exactly make a beeline for the


lowest point, but it gets there eventually.

28

Here is what it looks like in feature space. The first model


(bottom-most line) is entirely wrong, and the search slowly
moves, step by step, towards a reasonable fit on the data.

Every blue line in this image corresponds to a red dot in the


model space (inset).

29
convexity One of the reasons such a simple approach works well
enough here is that our problem is convex. A surface (like
our loss landscape) is convex if a line drawn between any
two points on the surface lies entirely above the surface.
One of the implications of convexity is that any point that
looks like a minimum locally (because all nearby points are
higher) must be the global minimum: it’s lower than any
other point on the surface.

global minimum > This means that so long as we know we’re moving down (to
model space a point with lower loss), we can be sure we’re moving
towards the global minimum: the best of all possible
30
models.

local vs global minima Let’s look at what happens if the loss surface isn’t convex:
what if the loss surface has multiple local minima? These
global minimum are points that are lower than all nearby points, but if we
move far enough away from them, we can find a point that
is even lower.

Here’s a loss surface with a more complex structure. The


two purple diamonds are the lowest point in their
respective neighborhoods, but the red disc is the lowest
point globally.

This loss surface isn't based on actual data. It's just some
local minima
31
function that illustrates the idea.

Here we see random search on our more complex loss


surface. As you can see, it heads quickly for one of the local
minima, and then gets stuck there. No matter how many
more iterations we give it, it will never escape.

Note that changing the step size will not help us here. Once
the search is stuck, it stays stuck.

32
simulated annealing There are a few tricks that can help us to escape local
minima. Here’s a popular one, called simulated annealing:
if the next point chosen isn’t better than the current one, we
pick a random point p in the model space
still pick it, but only with some small probability. In other
loop:
words, we allow the algorithm to occasionally travel uphill.
pick a random point p’ close to p
This means that whenever it gets stuck in a local minimum,
if loss(p’) < loss(p):
it still has some probability of escaping, and finding the
p <- p’
global minimum.
else:
The name "simulated annealing" is a bit of a historical
with probability q: p <- p’
accident, so don't read too much into it. It comes from the
33
fact that this algorithm can be used to simulate the cooling of
a material like metal. The carefully controlled cooling of a
material to promote the growth of particular kinds of
crystals is called annealing. In physical terms this is like
looking for the minimum in an energy landscape, which is
mathematically similar to our loss landscape.
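The pseudocode maps to code almost line for line. Here is a sketch in one dimension that also keeps track of the best point seen so far, as suggested below; the double-well loss function and all parameter choices are invented for illustration:

```python
import random

def simulated_annealing(loss, p, step_size=0.3, q=0.2, iterations=10000):
    """Random search that also accepts uphill steps with probability q."""
    best, best_loss = p, loss(p)
    for _ in range(iterations):
        p_new = p + random.gauss(0.0, step_size)
        if loss(p_new) < loss(p) or random.random() < q:
            p = p_new                      # accept: downhill, or uphill with probability q
        if loss(p) < best_loss:
            best, best_loss = p, loss(p)   # remember the best model seen over the run
    return best

# a double-well loss: local minimum near x = 1, global minimum near x = -1
double_well = lambda x: (x * x - 1.0) ** 2 + 0.3 * x
```

Started at the local minimum x = 1, the occasional uphill moves let the search cross the barrier at x = 0 and reach the global basin.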

Here is a run of simulated annealing on our non-convex


problem. We see that it still hits the local minimum first, but
after a while it manages to jump out, and to find the global
minimum.

Of course, with this algorithm, there is always the


possibility that it will jump out of the global minimum again
and move to a worse minimum. That shouldn’t worry us,
however, so long as we remember the best model we’ve
observed over the entire run. Then we can just let
simulated annealing jump around the model space driven
partly by random noise, and partly by the loss surface.
34

All this talk about global minima may suggest that the local
minima are always terrible. Remember, however that if we
have a complex model, the global minimum will probably
overfit. In such cases, we may actually be more interested in
finding a good local minimum.
Note: in many situations, the local minima are fine. We do
not always need an algorithm that is guaranteed to find
the global minimum. In short, we want to think carefully about whether our
algorithm can escape bad local minima, but that doesn't
mean that local minima are always bad solutions.

35
variations on random search The fixed step size we used so far is just one way to sample
the next point. To allow the algorithm to occasionally make
smaller steps, you can sample p’ so that it is at most some
distance away from p, instead of exactly. Another approach
is to sample the distance from a Normal distribution. That
way, most points will be close to the original p, but every
point in the model space can theoretically be reached in one
step.
Fixed radius Random uniform Normal

36
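As a sketch, the normal-distribution variant replaces the fixed-radius step with Gaussian noise on each parameter, so most steps are small but any point in the model space is reachable in one step:

```python
import random

def normal_step(p, scale=0.1):
    """Perturb every parameter by Gaussian noise with standard deviation `scale`."""
    return tuple(pi + random.gauss(0.0, scale) for pi in p)

step = normal_step((1.0, 2.0))   # usually a point close to (1, 2)
```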

Here is what random search looks like when the steps are


sampled from a normal distribution. Note that the “failed”
steps all have different sizes.

37

discrete model spaces The space of linear models is continuous: between every
two models, there is always another model, no matter how
close they are together. *

The alternative is a discrete model space. For instance, the


space of all trees is discrete. If our model takes the form of a
tree (like the decision tree we saw in the last lecture), then
we don't always have another model "in between" any two
given models. In this case, some search algorithms no
longer work, but random search and simulated annealing
can still be used.

38
You just need to define which models are “close” to each
other. In this slide, we've decided that two trees are close if we
can turn one into the other by adding or removing a single
node.

Random search and simulated annealing can now be used


to search this space to find the tree model that gives the
best performance.

In practice, we usually use a different method to search for


decision trees and regression trees, which we will introduce
in a later lecture. The point here is just that if you
are searching a discrete space, random search and simulated
annealing still work.

* Strictly speaking, this is not a complete definition of a


continuous space, but it's the property that matters for us.

parallel search Another thing you can do is just to run random search a
couple of times independently (one after the other, or in
parallel). If you’re lucky one of these runs may start you off
close enough to the global minimum.

For simulated annealing, doing multiple runs makes less


sense. We can show that there’s not much difference
between 10 runs of 100 iterations and one run of 1000. The
only reason to do multiple runs of simulated annealing is
because it’s easier to parallelize over multiple cores or
machines.

39

population methods To make parallel search even more useful, we can introduce
some form of communication or synchronization between
the searches happening in parallel. If we see the parallel
evolutionary algorithms
searches as a population of agents that occasionally
• genetic algorithms “communicate” in some way, we can guide the search a lot
• evolutionary strategies more. Here are some examples of such population
methods. We won’t go into this too deeply. We will only
particle swarm optimization
take a (very) brief look at evolutionary algorithms.
ant colony optimization
Often, there are specific variants for discrete and for
continuous model spaces.

40
evolutionary algorithms Here is a basic outline of an evolutionary method (although
many other variations exist). We start with a population of
models, we remove the half with the worst loss, and pair up
Start with a population of k models.
the remainder to breed a new population.
loop:
In order to instantiate this, we need to define what it means
rank the population by loss
to “breed” a population of new models from an existing
remove the half with the worst loss population. A common approach is to select two random
“breed” a new population of k models
parents and to somehow average their models. This is easy
to do in a continuous model space: we can literally average
optional: add a little noise to each child.
the two parent models to create a child.

41
In a discrete model space, it’s more difficult, and it depends
more on the specifics of the model space. In such cases,
designing the breeding process (sometimes called the
crossover operator) is usually the most difficult part of
designing an effective evolutionary algorithm.
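For a continuous model space, the outline above can be sketched as follows. Averaging two parents plus a little noise is the "breeding" step; the toy loss function and parameter choices are invented for illustration:

```python
import random

def evolutionary_search(loss, population, generations=30, noise=0.1):
    """Rank by loss, drop the worst half, breed children by averaging parent pairs."""
    k = len(population)
    for _ in range(generations):
        population = sorted(population, key=loss)[: k // 2]   # keep the best half
        children = []
        while len(population) + len(children) < k:
            a, b = random.sample(population, 2)               # pick two random parents
            children.append(tuple((ai + bi) / 2 + random.gauss(0.0, noise)
                                  for ai, bi in zip(a, b)))
        population = population + children
    return min(population, key=loss)

toy_loss = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2   # minimum at (1, -2)
```

The added noise keeps some diversity in the population, so the search does not collapse onto a single point too early.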

Here’s what a very basic evolutionary search looks like on


our non-convex loss surface. We start with a population of
50 models, and compute the loss for each. We kill the worst
50% (the red dots) and keep the best 50% (the green dots).

We then create a new population (the blue crosses), by


randomly pairing up parents from the green population,
and taking the point halfway between the two parents, with
a little noise added. Finally, we take the blue crosses as the
new population and repeat the process.

42

Here are five iterations of the algorithm. Note that in the


intermediate stages, the population covers both the local
and the global minima.

43
population methods Population methods are very powerful, but computing the
loss for so many different models is often expensive. They
can also come with a lot of different parameters to control
Powerful
the search, each of which you will need to carefully tune.
Easy to parallelise

Slow/expensive for complex models

Difficult to tune

44

review

To escape local minima:

• add randomness, add multiple models

To converge faster:

• combine known good models (population methods)

45

black box optimization All these search methods are instances of black box
optimization.

random search, simulated annealing: Black box optimization refers to those methods that only
• very simple
require us to be able to compute the loss function. We don’t
need to know anything about the internals of the model.
• we only need to compute the loss function for each
model
These are usually very simple starting points. Often, there is
some knowledge about your model that you can add to
• can require many iterations
improve the search, but sometimes the black box approach
• also works for discrete model spaces (like tree models) is good enough. If nothing else, they serve as a good starting
point and point of comparison for the more sophisticated
approaches.
46

In the next video we’ll look at a way to improve the search


by opening up the black box for continuous models:
gradient descent.
|section|Gradient descent|
|video|https://youtube.com/embed/DlMKkPrKg5c|

Linear Models and Search


Part 3: Gradient descent

Machine Learning
mlvu.github.io
Vrije Universiteit Amsterdam

towards gradient descent: branching search As a stepping stone to what we’ll discuss in this video, let’s
take the random search from the previous video, and add a
little more inspection of the local neighborhood before
pick a random point p in the model space
taking a step. Instead of taking one random step, we’ll look
loop:
at k random steps and move in the direction of the one that
pick k random points {pi} close to p gives us the lowest loss.
p’ <- argminpi loss(pi)
In the hiker analogy, you can think of this algorithm as the
if loss(p’) < loss(p): case where the hiker taps his foot on the ground in a couple
p <- p’ of random directions, and then moves in the direction with
the strongest downward slope.

48
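A sketch of this branching search, reusing a fixed-radius step in a 2D model space; the toy loss function is invented for illustration:

```python
import math
import random

def branching_search(loss, p, r=0.1, k=15, iterations=300):
    """At every step, sample k nearby points and move to the best one, if it improves."""
    for _ in range(iterations):
        candidates = []
        for _ in range(k):
            angle = random.uniform(0.0, 2.0 * math.pi)
            candidates.append((p[0] + r * math.cos(angle),
                               p[1] + r * math.sin(angle)))
        q = min(candidates, key=loss)      # argmin over the k sampled points
        if loss(q) < loss(p):
            p = q
    return p

toy_loss = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2
```

The cost of the better direction estimate is k loss evaluations per step, which is the price discussed below.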

Here's what that looks like for a few values of k.

As you can see, the more samples we take, the more directly
k=2 k=5 k=15
we head for the region of low loss. The more closely we
inspect our local neighbourhood, to determine in which
direction the function decreases quickest, the faster we
converge.

The lesson here is that the better we know in which


direction the loss decreases, the faster our search
converges. In this case we pay a steep price: we have to
evaluate our function 15 times to work out a better
49
direction.
gradient descent However, if our model space is continuous, and if our loss
function is smooth, we don’t need to take multiple samples
to guess the direction of fastest descent: we can simply
derive it, using calculus. This is the basis of the gradient
descent algorithm.

image source: http://charlesfranzen.com/posts/multiple-


regression-in-python-gradient-descent/

50

gradient descent: outline

Using calculus, we can find the


direction in which the loss drops most
quickly.
We also find how quickly it drops.

This direction is the opposite of the


gradient.
An n-dimensional version of the derivative.

Gradient descent takes small steps in


this direction in order to find the
minimum of a function.

51

calculus basics: slope Before we dig in to the gradient descent algorithm, let’s
review some basic principles from calculus. First up, slope.
The slope of a linear function is simply how much it moves
up if we move one step to the right. In the case of f(x) = -1x + 3 in this
picture, the slope is negative, because the line moves down
(the second line, g(x), has slope 0.5 and moves up).

In our 1D regression model, the parameter w was the slope.
In this case, however, we will investigate the slope of a linear
approximation of the loss landscape, not of the model
(be sure not to confuse these two).

52
tangent line, derivative The tangent line of a function at a particular point p is the
line that just touches the function at that point. The derivative of
the function gives us the slope of the tangent line.

g(x) = slope · x + c
g(x) = f′(p)x + c

Traditionally we find the minimum of a function by setting
the derivative equal to 0 and solving for x. This gives us the
point where the tangent line has slope 0, and is therefore
horizontal.

For complex models, it may not be possible to solve for x in
this way. However, we can still use the gradient to search
for the minimum. Looking at the example in the slide, we
note that the tangent line moves down (i.e. the slope is
negative). This tells us that we should move to the right to
follow the function downward. As we take small steps to the
right, the derivative stays negative, but gets smaller and
smaller as we close in on the minimum. This suggests that
the magnitude of the slope lets us know how big the steps
are that we should take, and the sign gives us the direction.

53

A useful analogy is to think of putting a marble at some


point on the curve and following it as it rolls downhill to
find the lowest point.

What we need to do now, is to take the derivative of the


function describing the loss surface, and work out

• The direction in which the function decreases the


quickest. We call this the direction of steepest descent.

• How quickly the function decreases in that direction.

the gradient To apply this principle in machine learning, we’ll need to


generalise it for loss functions with multiple inputs (i.e. for
models with multiple parameters). We do this by
generalising the derivative to the gradient. The tangent line
then becomes a tangent (hyper)plane.

Once we have this hyperplane, we can use it to work out in


which direction the function grows and shrinks the
quickest. One way of thinking about this is that the tangent
hyperplane is a local approximation of the function.
Zoomed out like this, the hyperplane and the function look
nothing alike, but if we zoom in close enough on the point
image source: https://nl.mathworks.com/help/matlab/math/calculate-tangent-plane-to-surface.html?requestedDomain=true 54
where they touch, they behave almost exactly the same.

This is useful, because in a hyperplane it's very easy to see


in which direction it goes down the quickest. Much easier
than it is for a complicated beast like our loss function itself.
Since the hyperplane approximates the loss function, this is
also the direction in which the loss decreases the quickest.
At least, so long as we don't move away too far from the
neighborhood where the hyperplane is a good
approximation of the loss function.


vectors as directions (with magnitudes) To represent this direction we'll use a vector.

We’re used to thinking of vectors as points in the plane. But


they can also be used to represent directions. In this case we
use the vector to represent the arrow from the origin to the
point. This gives us a direction (the direction in which the
arrow points), and a magnitude (the length of the arrow).

x = (3, 1)
||x|| = √(x₁² + x₂²)
So this direction of steepest descent that we’re looking for
(in model space) can be expressed as a vector.

55

a 2D linear function Remember, that this is how we express a linear function in


n dimensions: we assign each dimension a slope, and add a
x2
single bias (c).
f(x1, x2) = ax1 + bx2 + c
In this image, the two weights of a linear 2D function (a and
b) representing a hyperplane, are just one slope per
dimension. If we move one step in the direction of x1, we
move up by a, and if we move one step in the direction of
x2, we move up by b.

This is the same picture we saw earlier for the linear


function, but we're using it in a different way. Earlier, it
x1 56
represented a model with two features. Here, it serves as an
approximation for a loss landscape for a model with two
parameters.
gradient We are now ready to define the gradient. Any function from
n inputs to one output has n variables for which we can take
the derivative. These are called partial derivatives: they
work the same way as regular derivatives, except that when
you take the derivative with respect to one variable x, you
treat the other variables as constants.

∇f(x, y) = (∂f/∂x, ∂f/∂y)

tangent hyperplane:
g(x, y) = f′ₓ(p)x + f′ᵧ(p)y + c
g(x) = ∇f(p)ᵀx + c

One thing that is sometimes a little confusing is that the
gradient of a function f(·) is another function ∇f(·), which
defines the slopes of the tangent hyperplane for all points.
At a specific point p, the gradient provides the slope of the
tangent hyperplane g. The function g is then a function of x
where this gradient is a constant.
57

If a particular function f(x) has gradient g for input x, then


f’(x) = gTx + b (for some b) is the tangent hyperplane at x.
This is a simple function, so we can easily work out what
the direction of steepest ascent/descent is on this
hyperplane.

The gradient is sometimes defined as a row vector, and


sometimes as a column vector. In machine learning contexts,
the latter usually makes most sense.

the direction of steepest ascent

g(x) = wTx + b
wTx = ||w|| ||x|| cos α
||x|| = 1  →  ||w|| cos α

58

So, now that we have a local linear approximation g to our function f, which is the direction of steepest ascent on the approximation g?

Since g is linear, many details don’t matter: we can set b to zero, since that just translates the hyperplane up or down. It doesn’t matter how big a step we take in any direction, so we’ll take a step of size 1. Finally, it doesn’t matter where we start from, so we will just start from the origin. So the question becomes: for which input x of magnitude 1 (which unit vector) does g(x) provide the biggest output?

To see the answer, we need to use the geometric definition of the dot product. Since we required that ||x|| = 1, this disappears from the equation, and we only need to maximise the quantity ||w|| cos(α) (where only α depends on our choice of x, and w is the gradient we computed). cos(α) is maximal when α is zero: that is, when x and w are pointing in the same direction.
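This claim is easy to check numerically. A minimal sketch (the vector w here is an arbitrary made-up gradient, not one from the lecture): sample many unit vectors x and see which one maximises wTx.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0])  # an arbitrary made-up gradient

# Sample many unit vectors x and evaluate g(x) = w^T x for each.
xs = rng.normal(size=(10_000, 2))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
best = xs[np.argmax(xs @ w)]

# The maximising unit vector is (very nearly) w scaled to unit length.
print(best, w / np.linalg.norm(w))
```

The best sampled direction lines up with w itself, as the dot-product argument predicts.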

In short: w, the gradient, is the direction of steepest ascent. This means that -w is the direction of steepest descent.

summary

The tangent hyperplane of a function f approximates the function f locally.

The gradient, ∇f, gives the slope of this tangent hyperplane.

The vector expressing the slope of a hyperplane is also the direction of steepest ascent on that hyperplane.

The opposite vector is the direction of steepest descent.

Conclusion: to move to a lower point of f, we can compute the gradient and take a small step in the opposite direction.

59

magnitude of the gradient: speed of ascent

Note that the gradient is a vector: it has a direction and a magnitude (the length of the arrow). The magnitude tells us how quickly the linear function is rising. This is very useful in search, since the more the function is changing, the bigger the steps that we want to take. Once the function stops changing as much, it's a good bet we are approaching a minimum, so we'd like to slow down.

60
gradient descent

pick a random point p in the model space
loop:
    p ← p - η∇loss(p)

we usually set η somewhere between 0.0001 and 0.1

61

Here is the gradient descent algorithm. Starting from some candidate p, we simply compute the gradient at p, subtract it from the current choice, and iterate this process:

• We subtract, because the gradient points uphill. Since the gradient is the direction of steepest ascent, the negative gradient is the direction of steepest descent.

• Since the gradient is only a local approximation to our loss function, the bigger our step, the more we go wrong because the approximation is incorrect. Usually, we scale down the step size indicated by the gradient by multiplying it by a value η (eta), called the learning rate. This value is chosen by trial and error, and remains constant throughout the search (at least in the simplest version of the algorithm).

Note again a potential point of confusion: we have two linear functions here. One is the model, whose parameters are indicated by w and b. The other is the tangent hyperplane to the loss function, whose slope is indicated by ∇loss(p) here. These are different functions on different spaces.

We can iterate for a fixed number of iterations, until the loss gets low enough, or until the gradient gets close enough to the zero vector, which implies we've reached a local minimum.
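In code, the loop above is only a few lines. A minimal sketch (the example function, starting point, and step count are made up for illustration):

```python
import numpy as np

def gradient_descent(grad, p, eta=0.01, steps=1000):
    """Minimise a function by repeatedly stepping against its gradient.

    grad: function returning the gradient of the loss at p
    p: starting point (numpy array)
    eta: learning rate
    """
    for _ in range(steps):
        p = p - eta * grad(p)  # step in the direction of steepest descent
    return p

# Example: minimise f(x, y) = (x - 3)^2 + (y + 1)^2,
# whose gradient is (2(x - 3), 2(y + 1)) and whose minimum is (3, -1).
p = gradient_descent(lambda p: 2 * (p - np.array([3.0, -1.0])),
                     p=np.array([0.0, 0.0]))
print(p)  # close to [3, -1]
```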

gradient descent: outline

Using calculus, we can find the direction in which the loss drops most quickly.
We also find how quickly it drops.

This direction is the opposite of the gradient.
It is an n-dimensional version of the derivative.

Gradient descent takes small steps in this direction in order to find the minimum of a function.
Like a marble rolling down a hill

62
Let’s go back to our example problem, and see how we can
apply gradient descent here.

Unlike random search, it’s not enough to just compute the


loss for a given model, we need the gradient of the loss.
We'll start by working this out.

63

Here is our loss function again, and the two partial derivatives we need to work out to find the gradient.

To simplify the notation we’ll let xi refer to the only feature of instance i.

loss(w, b) = 1/n Σi (wxi + b - ti)²

∇loss(w, b) = ( ∂loss(w, b)/∂w , ∂loss(w, b)/∂b )

64

Here are the derivations of the two partial derivatives:

∂loss(w, b)/∂w = ∂( 1/n Σi (wxi + b - ti)² )/∂w
               = 1/n Σi ∂(wxi + b - ti)²/∂w
               = 1/n Σi ∂(wxi + b - ti)²/∂(wxi + b - ti) · ∂(wxi + b - ti)/∂w
               = 2/n Σi (wxi + b - ti) xi

∂loss(w, b)/∂b = ∂( 1/n Σi (wxi + b - ti)² )/∂b
               = 2/n Σi (wxi + b - ti)

• first we use the sum rule, moving the derivative inside the sum symbol
• then we use the chain rule, to split the function into the composition of computing the residual and squaring, computing the derivative of each with respect to its argument.

65

The second homework exercise and the formula sheet both provide a list of the most common rules for derivatives.

On your first pass through the slides, it's ok to take my word for it that these are the derivatives and to skip the derivation. However, there are a lot more derivations like these coming up, so you should work through every step before moving on to the next lecture, or you'll struggle in the later parts of the course.
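One way to convince yourself that the derivatives are right is to check them numerically against a finite-difference approximation. A sketch, using the first five (flipper length, body mass) pairs from the example data, at an arbitrary point (w, b):

```python
import numpy as np

# Flipper length (dm) and body mass (kg): the first five rows of the example data.
x = np.array([1.93, 1.88, 1.90, 2.17, 1.90])
t = np.array([3.65, 4.70, 4.95, 5.70, 3.32])

def loss(w, b):
    return np.mean((w * x + b - t) ** 2)

def gradient(w, b):
    # The two partial derivatives worked out above.
    r = w * x + b - t                      # residuals
    return 2 * np.mean(r * x), 2 * np.mean(r)

# Compare the analytic gradient to central finite differences.
w, b, eps = 1.5, 0.5, 1e-6
dw, db = gradient(w, b)
dw_num = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
db_num = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
print(dw - dw_num, db - db_num)  # both close to zero
```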
gradient descent for our example

pick a random point (w, b) in the model space
loop:
    (w, b) ← (w, b) - η ( 2/n Σi (wxi + b - ti)xi , 2/n Σi (wxi + b - ti) )

66

Here's what we've just worked out: gradient descent, but specific to this particular model. We start with some initial guess, compute the gradient of the loss with the two functions we've just worked out, and we subtract that vector (times some scalar η) from our current guess.

Hopefully, repeating this process a number of times in small steps will directly follow the loss surface down to a (local) minimum.
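As a sketch in code, using the same five example rows as before (the starting point, learning rate, and iteration count are arbitrary choices of mine):

```python
import numpy as np

# Flipper length (dm) and body mass (kg): the first five rows of the example data.
x = np.array([1.93, 1.88, 1.90, 2.17, 1.90])
t = np.array([3.65, 4.70, 4.95, 5.70, 3.32])

w, b, eta = 0.0, 0.0, 0.1  # arbitrary start and learning rate
for _ in range(50_000):
    r = w * x + b - t       # residuals
    w -= eta * 2 * np.mean(r * x)
    b -= eta * 2 * np.mean(r)

print(w, b)  # the learned regression line
```

Many iterations are needed here because the loss surface for this data is quite elongated: the gradient is large in one direction and tiny in the other.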

Here is the result on our dataset. Note how the iteration


converges directly to the minimum. Note also that we have
no rejections anymore. The algorithm is fully deterministic:
it computes the optimal step, and takes it. There is no trial
and error.

Note also that the gradient gives us a direction and a step


size. As we get closer to the minimum, the function flattens
out and the magnitude of the gradient decreases. The effect
is that as we approach the minimum the algorithm
automatically takes smaller and smaller steps, preventing
us from overshooting the optimum.
67

Here is what it looks like in feature space.

68
playground.tensorflow.org (bit.ly/2MnehJp)

Here is a very helpful little browser app that we’ll return to a few times during the course. It contains a few things that we haven't discussed yet, but if you remove all hidden layers, and set the target to regression, you'll get a linear model of the kind that we've been discussing. Click the following link to see a version with only the currently relevant features: playground.tensorflow.org. We will enable different additional features as we discuss them in the course.

The output for the data is indicated by the color of the


points, the output of the model is indicated by the colouring
of the plane.

Note that the page calls this model a neural network (which
we won’t discuss for a few more weeks). Linear models are
just a very simple neural network.

If our function is non-convex, gradient descent doesn’t help


us with the problem of local minima. As we see here, it
heads straight for the nearest minimum and stays there. To
make the algorithm more robust against this type of thing,
we need to add a little randomness back in, preferably
without destroying the behaviour of moving so cleanly to a
minimum once one is found.

We can also try multiple runs from different starts. Later we will see stochastic gradient descent, which computes the gradient only over subsets of the data (making the algorithm more efficient, and adding a little randomness at the same time).

70

Here is a run with a more fortunate starting point.

The point of convergence seems a little off in these images. The partial derivatives for this function are very complex (I used Wolfram Alpha to find them), so most likely, the implementation has some numerical instability.

71
Here, we see the effect of the learning rate. If we set it too high, the gradient descent jumps out of the first minimum it finds. A little lower and it stays in the neighborhood of the first minimum, but it sort of bounces from side to side, only very slowly moving towards the actual minimum.

At 0.01, we find a sweet spot where it finds the local minimum pretty quickly. At 0.005 we see the same behavior, but we need to wait much longer, because the step sizes are so small.

The best value of the learning rate is different for each dataset and each model. You'll usually have to find it by trial and error. We'll talk a little more about how this looks in practice in the next lecture.

72

gradient descent

only works for continuous model spaces

… with smooth loss functions

… for which we can work out the gradient

does not escape local minima

very fast, low memory

very accurate

backbone of 99% of modern machine learning.


73

but actually…

∂loss(w, b)/∂w = 0
∂loss(w, b)/∂b = 0

there’s an analytical solution (for this model)

74

It’s worth saying that for linear regression, although it makes a nice, simple illustration, none of this searching is actually necessary. For linear regression, we can set the derivatives equal to zero and solve explicitly for w and for b. This would give us the optimal solution directly without searching.

However, this trick requires more advanced linear algebra to work out than we want to introduce here. You should learn about this in most linear algebra courses, where the problem is called ordinary least squares, and is solved by computing the pseudo-inverse of the data matrix. We won't go down this route in this course because it'll stop working very quickly once we start looking at more complicated models.
75
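For the curious, here is a sketch of that analytic solution on the five example rows from earlier, using numpy's pseudo-inverse (the variable names are mine):

```python
import numpy as np

x = np.array([1.93, 1.88, 1.90, 2.17, 1.90])
t = np.array([3.65, 4.70, 4.95, 5.70, 3.32])

# Data matrix with a column of ones, so the bias b is absorbed into the weights.
X = np.vstack([x, np.ones_like(x)]).T

# Ordinary least squares: multiply the targets by the pseudo-inverse of X.
w, b = np.linalg.pinv(X) @ t

# At the optimum, both partial derivatives of the loss are zero.
r = w * x + b - t
print(np.mean(r * x), np.mean(r))  # both (numerically) zero
```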

|section|Gradient Descent and Classification|


|video|https://youtube.com/embed/LjhzJUy-w k|

Linear Models and Search


Part 4: Gradient descent and classification

Machine Learning
mlvu.github.io
Vrije Universiteit Amsterdam

classification

Now, let’s look at how this works for classification.

The first question we need to answer is how do we define a linear classifier: that is, a classifier whose decision boundary is always a line (or hyperplane) in feature space.

77
To define a linear decision boundary, we take the same functional form we used for the linear regression: some weight vector w, and a bias b.

wTx + b > 0
wTx + b = 0
wTx + b < 0

The way we define the decision boundary is a little different than the way we defined the regression line. Here, we say that if wTx + b is larger than 0, we call x one class, if it is smaller than 0, we call it the other (we’ll stick to binary classification for now).

Note that we are drawing a line again, but in a different space: in the regression example we draw a line in the combined feature and output space (a function from the feature to the output). Here, we have two features, and we are drawing a line in only the feature space.

78

1D linear classifier

The actual hyperplane this function y = wTx + b defines can be thought of as lying above and below the feature space.

Here it is visualized for the case of one feature. We are defining a linear function from the feature to some output y. Wherever this line lies above the feature space (i.e. is positive), we classify things as the blue/disc class, and wherever the line lies below the feature space (i.e. is negative) we classify them as the red/diamond class.

wx + b > 0

79

Here it is in 2D: wTx + b describes a plane that intersects the feature space. The line of intersection is our decision boundary.

80
This also shows us another interpretation of w. Since it is the direction of steepest ascent on this hyperplane, it is the vector perpendicular to the decision boundary, pointing to the class we assigned to the case where wTx + b is larger than 0 (the blue class in this case).

We never want to "ascend" this plane like we do with the hyperplane approximating the loss landscape, but it's useful for our geometric intuition to know where w points, relative to our decision boundary. We will use this fact at different points in the future.

81

example data

Here is a simple classification dataset, which we’ll use to illustrate the principle.

x1 x2 label
1 2 true
1 1 false
2 3 true
3 3 true
3 1 false
3 2 false

82
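As a quick sketch, the decision rule wTx + b > 0 applied to this dataset looks like the following. The particular w and b are my own choice, found by inspection rather than by training; any separating hyperplane would do:

```python
import numpy as np

# The six instances from the example dataset.
X = np.array([[1, 2], [1, 1], [2, 3], [3, 3], [3, 1], [3, 2]], dtype=float)
y = np.array([True, False, True, True, False, False])

def classify(X, w, b):
    # The class is decided by which side of the hyperplane w^T x + b = 0 we are on.
    return X @ w + b > 0

# A hand-picked separating hyperplane (an assumption for illustration).
w = np.array([-0.5, 1.0])
b = -1.0
print(classify(X, w, b))  # matches the true/false labels of all six instances
```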

what loss function do we use?

nr. of misclassified examples (error)?

This gives us a model space, but how do we decide the quality of any particular model? What is our loss function for classification?

The thing we are usually trying to minimise is the error: the number of misclassified examples. Sometimes we are looking for something else, but in the simplest classification problems, this is what we are ultimately interested in: a classifier that makes as few mistakes as possible. So let's start there: can we use the error as a loss function?

83
This is what our loss surface looks like for the error function on our simple dataset. Note that it consists almost entirely of flat regions. This is because changing a model a tiny bit will usually not change the number of misclassified examples. And if it does, the loss function will suddenly jump a lot.

In these flat regions, random search would have to do a random walk, stumbling around until it finds a ridge by accident.

Gradient descent would fare even worse: the gradient is zero everywhere in this picture, except exactly on the ridges, where it is undefined. Gradient descent would either crash, or simply never move.

Note that our model now has three parameters w1, w2 and b, so the loss surface is a function on a 3d space (a 4d "surface"). In order to plot it in two dimensions, we have fixed w2 = 1.

84

Sometimes your loss function should not be the same as your evaluation function.

This is an important lesson about loss functions. They serve two purposes:

1. To express what quantity we want to maximise in our search for a good model.
2. To provide a smooth loss surface, so that we can find a path from a bad model to a good one.

For this reason, it’s common not to use the error as a loss function, even when it’s the thing we’re actually interested in minimizing. Instead, we’ll replace it by a loss function that has its minimum at (roughly) the same model, but that provides a smooth, differentiable loss surface.

85

After we have trained a model we can still evaluate it with the function we're actually interested in (that is, we can still count how many mistakes it makes). We'll discuss evaluation in-depth in the next lecture.

classification losses

Least squares loss (this video)
Log loss / Cross entropy (Lecture 5, Probability)
SVM loss (Lecture 6, Linear Models 2)

In this course, we will investigate three common loss functions for classification. The first, least-squares loss, is just an application of MSE loss to classification; we will discuss that in the remainder of the lecture. It's not usually that good, but it gives you an idea of what a classification loss might look like.

The others require a bit more background, so we’ll save them for later.

86
least-squares loss

loss(w, b) = Σi∈pos (wTxi + b - 1)² + Σi∈neg (wTxi + b + 1)²

(the first sum is the loss for instances in the positive class, the second for instances in the negative class)

The least squares classifier essentially turns the classification problem into a regression problem: it assigns points in one class the numeric value +1 and points in the other class the value -1, and we then use the basic MSE loss we saw before the break to train a regression model to predict these numeric values.

Performing gradient descent with this loss function will result in a line that minimises the green residuals. Hopefully the points are far enough apart that the decision boundary (the single point where the orange line crosses the x axis) separates the two classes.

87

As you can see, we always get very big residuals whatever we do. That is because the points simply do not lie on a single line, so the linear model is not appropriate. Still, with a little luck, the best fitting line will be positive for the +1 class and negative for the -1 class. If so the classifier will make the right predictions, even if the model is way off as a regression model for the numeric labels we introduced.

With this loss function, we note that our loss surface is perfectly smooth. If we overlay the error loss, we see that the minima of the two losses coincide pretty well (for this dataset at least).

88

And gradient descent has no problem finding a solution.

Note, however, that the optimum under this loss function may not always perfectly separate the classes, even if they are linearly separable. It does in our case, but this result is not guaranteed.

89
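Here is a sketch of the whole procedure on the example dataset from before: assign the targets ±1, then run the gradient descent update we derived for regression (the starting point, learning rate, and iteration count are arbitrary choices of mine):

```python
import numpy as np

# The example dataset, with true -> +1 and false -> -1 as numeric targets.
X = np.array([[1, 2], [1, 1], [2, 3], [3, 3], [3, 1], [3, 2]], dtype=float)
y = np.array([1, -1, 1, 1, -1, -1], dtype=float)

w = np.zeros(2)
b = 0.0
eta = 0.01
for _ in range(20_000):
    r = X @ w + b - y                  # residuals against the ±1 targets
    w -= eta * 2 * (X.T @ r) / len(y)  # ∂loss/∂w, averaged over the data
    b -= eta * 2 * r.mean()            # ∂loss/∂b

pred = X @ w + b > 0
print(pred)  # the predicted classes for all six instances
```

On this dataset, the resulting decision boundary happens to classify every instance correctly, though (as noted above) least squares does not guarantee this in general.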
Here is the result in feature space, with the final decision boundary in orange.

90

playground.tensorflow.org (bit.ly/2Me1fxU)

The tensorflow playground also allows us to play around with linear classifiers. Note that only for one of the two datasets, the linear decision boundary is appropriate.

Here is a link with the relevant features enabled.

This example actually uses a logarithmic loss, rather than a least squares loss, but it should still be instructive to play around with it. We’ll discuss the logarithmic loss in the first probability lecture.

91

mlcourse@peterbloem.nl
92