
SVM Tutorial

SVM - Understanding the math - Part 1 - The margin

Introduction

This is the first article in a series I will be writing about the math behind SVM. There is a lot to talk about, and a lot of mathematical background is often necessary. However, I will try to keep a slow pace and give in-depth explanations, so that everything is crystal clear, even for beginners.

If you are new and wish to know a little bit more about SVMs before diving into the math, you can read the article: an overview of
Support Vector Machine.

What is the goal of the Support Vector Machine (SVM)?

The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training
data.

The first thing we can see from this definition is that an SVM needs training data, which means it is a supervised learning algorithm.

It is also important to know that SVM is a classification algorithm, which means we will use it to predict whether something belongs to a particular class.

For instance, we can have the training data below:


Figure 1

We have plotted the height and weight of several people, and each point is marked so that we can distinguish men from women.

With such data, using an SVM will allow us to answer the following question:

Given a particular data point (weight and height), is the person a man or a woman?

For instance: if someone measures 175 cm and weighs 80 kg, is that person a man or a woman?

What is a separating hyperplane?

Just by looking at the plot, we can see that it is possible to separate the data. For instance, we could trace a line and then all
the data points representing men will be above the line, and all the data points representing women will be below the line.

Such a line is called a separating hyperplane and is depicted below:

If it is just a line, why do we call it a hyperplane?

Even though we use a very simple example with data points lying in $\mathbb{R}^2$, the support vector machine can work with any number of dimensions!

A hyperplane is a generalization of a plane:

in one dimension, a hyperplane is a point
in two dimensions, it is a line
in three dimensions, it is a plane
in more dimensions, you can call it a hyperplane

The point L is a separating hyperplane in one dimension


What is the optimal separating hyperplane?

The fact that you can find a separating hyperplane does not mean it is the best one! In the example below there are several separating hyperplanes. Each of them is valid, as it successfully separates our data set with men on one side and women on the other side.

There can be a lot of separating hyperplanes

Suppose we select the green hyperplane and use it to classify real-life data.

This hyperplane does not generalize well

This time, it makes some mistakes: it wrongly classifies three women. Intuitively, we can see that if we select a hyperplane which is close to the data points of one class, then it might not generalize well.

So we will try to select a hyperplane as far as possible from the data points of each category:

This one looks better. When we use it with real-life data, we can see it still classifies every point correctly.

The black hyperplane classifies more accurately than the green one

That's why the objective of an SVM is to find the optimal separating hyperplane:

because it correctly classifies the training data
and because it is the one which will generalize better with unseen data

What is the margin and how does it help choosing the optimal hyperplane?

The margin of our optimal hyperplane

Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we get what is called the margin.

Basically the margin is a no man's land: there will never be any data point inside the margin. (Note: this can cause problems when the data is noisy, which is why the soft margin classifier will be introduced later.)

For another hyperplane, the margin will look like this:

As you can see, Margin B is smaller than Margin A.

We can make the following observations:

If a hyperplane is very close to a data point, its margin will be small.
The further a hyperplane is from its closest data point, the larger its margin will be.

This means that the optimal hyperplane will be the one with the biggest margin.

That is why the objective of the SVM is to find the optimal separating hyperplane which maximizes the margin of the training data.
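
To make the comparison between a large and a small margin concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the data points and the two candidate lines are made up, and the distance from a point to a line is computed with the standard formula |w . x + b| / ||w||, which this article does not introduce (the geometric reasoning behind it comes in Part 2).

```python
import math

def distance_to_line(point, w, b):
    """Distance from a 2-D point to the line w[0]*x + w[1]*y + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

def margin(points, w, b):
    """Twice the distance between the line and its closest data point."""
    return 2 * min(distance_to_line(p, w, b) for p in points)

# Made-up (height in cm, weight in kg) points, for illustration only.
points = [(160, 55), (165, 60), (180, 85), (185, 90)]

# Two hypothetical separating lines of the form x + y + b = 0.
margin_a = margin(points, (1.0, 1.0), -245.0)  # passes midway between the two groups
margin_b = margin(points, (1.0, 1.0), -230.0)  # passes close to the point (165, 60)

print(margin_a, margin_b)  # margin_a is the larger of the two
```

Both lines separate the made-up points, but the second one runs close to one of them, so its margin is smaller.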

This concludes this introductory post about the math behind SVM. There were not many formulas, but in the next article we will put in some numbers and try to get the mathematical view of this using geometry and vectors.

If you want to learn more, read it now:

SVM - Understanding the math - Part 2 : Calculate the margin

Alexandre KOWALCZYK
I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share
my knowledge with people from around the world. If you wish you can add me to linkedin, I like to connect
with my readers.

November 2, 2014

SVM - Understanding the math - Part 2

This is Part 2 of my series of tutorials about the math behind Support Vector Machines.
If you did not read the previous article, you might want to start the series at the beginning by reading this article: an overview of Support Vector Machine.

In the first part, we saw what the aim of the SVM is. Its goal is to find the hyperplane which maximizes the margin.

But how do we calculate this margin?

SVM = Support VECTOR Machine

In Support Vector Machine, there is the word vector.

That means it is important to understand vectors well and how to use them.

Here is a short summary of what we will see today:

What is a vector?
    its norm
    its direction
How to add and subtract vectors?
What is the dot product?
How to project a vector onto another?

Once we have all these tools in our toolbox, we will then see:

What is the equation of the hyperplane?
How to compute the margin?

What is a vector?

If we define a point $A(3, 4)$ in $\mathbb{R}^2$ we can plot it like this.

Figure 1: a point

Definition: Any point $\mathbf{x} = (x_1, x_2)$, $\mathbf{x} \neq 0$, in $\mathbb{R}^2$ specifies a vector in the plane, namely the vector starting at the origin and ending at $\mathbf{x}$.

This definition means that there exists a vector between the origin and A.

Figure 2 - a vector

If we say that the point at the origin is the point $O(0, 0)$ then the vector above is the vector $\vec{OA}$. We could also give it an arbitrary name such as $\mathbf{u}$.

Note: You can notice that we write vectors either with an arrow on top of them, or in bold. In the rest of this text I will use the arrow when there are two letters, like $\vec{OA}$, and the bold notation otherwise.

OK, so now we know that there is a vector, but we still don't know what a vector IS.

Definition: A vector is an object that has both a magnitude and a direction.

We will now look at these two concepts.

1) The magnitude

The magnitude or length of a vector $\mathbf{x}$ is written $\|\mathbf{x}\|$ and is called its norm.

For our vector $\vec{OA}$, $\|\vec{OA}\|$ is the length of the segment OA.

Figure 3

From Figure 3 we can easily calculate the distance OA using Pythagoras' theorem:

$OA^2 = OB^2 + AB^2$

$OA^2 = 3^2 + 4^2$

$OA^2 = 25$

$OA = \sqrt{25}$

$\|\vec{OA}\| = OA = 5$
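
The same computation can be sketched in a few lines of Python. This is only an illustration: the helper name `norm` is my own choice, and it simply applies Pythagoras' theorem to the coordinates of the vector.

```python
import math

def norm(v):
    """Euclidean norm (length) of a vector given as a sequence of coordinates."""
    return math.sqrt(sum(c * c for c in v))

oa = (3, 4)       # the vector from O(0, 0) to A(3, 4)
print(norm(oa))   # 5.0
```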

2) The direction

The direction is the second component of a vector.

Definition: The direction of a vector $\mathbf{u}(u_1, u_2)$ is the vector $\mathbf{w}\left(\frac{u_1}{\|\mathbf{u}\|}, \frac{u_2}{\|\mathbf{u}\|}\right)$

Where do the coordinates of $\mathbf{w}$ come from?

Understanding the definition


To find the direction of a vector, we need to use its angles.

Figure 4 - direction of a vector

Figure 4 displays the vector $\mathbf{u}(u_1, u_2)$ with $u_1 = 3$ and $u_2 = 4$.

We could say that:

Naive definition 1: The direction of the vector $\mathbf{u}$ is defined by the angle $\theta$ with respect to the horizontal axis, and by the angle $\alpha$ with respect to the vertical axis.

This is tedious. Instead, we will use the cosines of the angles.

In a right triangle, the cosine of an angle $\beta$ is defined by:

$\cos(\beta) = \frac{\text{adjacent}}{\text{hypotenuse}}$

In Figure 4 we can see that we can form two right triangles, and in both cases the adjacent side lies on one of the axes. This means that the definition of the cosine implicitly contains the axis related to an angle. We can rephrase our naive definition as:

Naive definition 2: The direction of the vector $\mathbf{u}$ is defined by the cosine of the angle $\theta$ and the cosine of the angle $\alpha$.

Now if we look at their values:

$\cos(\theta) = \frac{u_1}{\|\mathbf{u}\|}$

$\cos(\alpha) = \frac{u_2}{\|\mathbf{u}\|}$

Hence the original definition of the vector $\mathbf{w}$. That's why its coordinates are also called direction cosines.

Computing the direction vector

We will now compute the direction of the vector $\mathbf{u}$ from Figure 4:

$\cos(\theta) = \frac{u_1}{\|\mathbf{u}\|} = \frac{3}{5} = 0.6$

and

$\cos(\alpha) = \frac{u_2}{\|\mathbf{u}\|} = \frac{4}{5} = 0.8$

The direction of $\mathbf{u}(3, 4)$ is the vector $\mathbf{w}(0.6, 0.8)$.

If we draw this vector we get Figure 5:

Figure 5: the direction of u

We can see that $\mathbf{w}$ indeed looks the same as $\mathbf{u}$ except that it is smaller. Something interesting about direction vectors like $\mathbf{w}$ is that their norm is equal to 1. That's why we often call them unit vectors.
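
As a quick check, here is a minimal Python sketch of the direction computation. The function name `direction` is hypothetical, and the zero-vector guard is my own addition (the definition above only applies to non-zero vectors).

```python
import math

def direction(v):
    """Unit vector with the same direction as the 2-D vector v (v must not be zero)."""
    length = math.hypot(*v)
    if length == 0:
        raise ValueError("the zero vector has no direction")
    return tuple(c / length for c in v)

w = direction((3, 4))
print(w)                # (0.6, 0.8)
print(math.hypot(*w))   # 1.0, up to floating-point rounding
```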

The sum of two vectors


Figure 6: two vectors u and v

Given two vectors $\mathbf{u}(u_1, u_2)$ and $\mathbf{v}(v_1, v_2)$ then:

$\mathbf{u} + \mathbf{v} = (u_1 + v_1, u_2 + v_2)$

This means that adding two vectors gives us a third vector whose coordinates are the sums of the coordinates of the original vectors.

You can convince yourself with the example below:

Figure 7: the sum of two vectors

The difference between two vectors

The difference works the same way:

$\mathbf{u} - \mathbf{v} = (u_1 - v_1, u_2 - v_2)$

Figure 8: the difference of two vectors

Since subtraction is not commutative, we can also consider the other case:

$\mathbf{v} - \mathbf{u} = (v_1 - u_1, v_2 - u_2)$

Figure 9: the difference v-u

The last two pictures describe the "true" vectors generated by the difference of u and v.

However, since a vector has a magnitude and a direction, we often consider that parallel translates of a given vector (vectors with the same magnitude and direction but with a different origin) are the same vector, just drawn in a different place in space.

So don't be surprised if you meet the following:

Figure 10: another way to view the difference v-u

and

Figure 11: another way to view the difference u-v

If you do the math, it looks wrong, because the end of the vector $\mathbf{u} - \mathbf{v}$ is not at the right point, but it is a convenient way of thinking about vectors which you'll encounter often.
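
The coordinate-wise sum and difference can be sketched in Python as follows. The vectors `u` and `v` below are made-up values chosen for illustration (the figures' actual coordinates are not given in the text), and the helper names are my own.

```python
def add(u, v):
    """Coordinate-wise sum of two vectors of the same dimension."""
    return tuple(a + b for a, b in zip(u, v))

def subtract(u, v):
    """Coordinate-wise difference u - v."""
    return tuple(a - b for a, b in zip(u, v))

u = (3, 1)   # made-up example vectors
v = (2, 2)

print(add(u, v))        # (5, 3)
print(subtract(u, v))   # (1, -1)
print(subtract(v, u))   # (-1, 1)
```

Note how `u - v` and `v - u` have the same magnitude but opposite directions, which is why subtraction is not commutative.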

The dot product

One very important notion for understanding SVM is the dot product.

Definition: Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them.

This means that if we have two vectors $\mathbf{x}$ and $\mathbf{y}$ and there is an angle $\theta$ (theta) between them, their dot product is:

$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta)$

Why ?

To understand let's look at the problem geometrically.


Figure 12

The definition talks about $\cos(\theta)$, so let's see what it is.

By definition we know that in a right-angled triangle:

$\cos(\theta) = \frac{\text{adjacent}}{\text{hypotenuse}}$

In our example, we don't have a right-angled triangle.

However, if we take a different look at Figure 12 we can find two right-angled triangles formed by each vector with the horizontal axis.

Figure 13

and
Figure 14

So now we can view our original schema like this:

Figure 15

We can see that

θ = β−α

So computing $\cos(\theta)$ is like computing $\cos(\beta - \alpha)$.

There is a special formula called the difference identity for cosine which says that:

$\cos(\beta - \alpha) = \cos(\beta)\cos(\alpha) + \sin(\beta)\sin(\alpha)$

(if you want, you can read the proof here)

Let's use this formula!

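$\cos(\beta) = \frac{\text{adjacent}}{\text{hypotenuse}} = \frac{x_1}{\|\mathbf{x}\|}$

$\sin(\beta) = \frac{\text{opposite}}{\text{hypotenuse}} = \frac{x_2}{\|\mathbf{x}\|}$

$\cos(\alpha) = \frac{\text{adjacent}}{\text{hypotenuse}} = \frac{y_1}{\|\mathbf{y}\|}$

$\sin(\alpha) = \frac{\text{opposite}}{\text{hypotenuse}} = \frac{y_2}{\|\mathbf{y}\|}$

So if we replace each term:

$\cos(\theta) = \cos(\beta - \alpha) = \cos(\beta)\cos(\alpha) + \sin(\beta)\sin(\alpha)$

$\cos(\theta) = \frac{x_1}{\|\mathbf{x}\|} \frac{y_1}{\|\mathbf{y}\|} + \frac{x_2}{\|\mathbf{x}\|} \frac{y_2}{\|\mathbf{y}\|}$

$\cos(\theta) = \frac{x_1 y_1 + x_2 y_2}{\|\mathbf{x}\| \|\mathbf{y}\|}$

If we multiply both sides by $\|\mathbf{x}\| \|\mathbf{y}\|$ we get:

$\|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta) = x_1 y_1 + x_2 y_2$

Which is the same as:

$\|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta) = \mathbf{x} \cdot \mathbf{y}$

We just found the geometric definition of the dot product!

Finally, from the last two equations we can see that:

$\mathbf{x} \cdot \mathbf{y} = x_1 y_1 + x_2 y_2 = \sum_{i=1}^{2} (x_i y_i)$

This is the algebraic definition of the dot product!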
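
To see that the two definitions agree numerically, here is a small Python sketch. The example vectors are made up, and the angles are obtained with `math.atan2`, which is my own choice for measuring each vector's angle with the horizontal axis.

```python
import math

def dot(x, y):
    """Algebraic definition: sum of the products of matching coordinates."""
    return sum(a * b for a, b in zip(x, y))

def dot_geometric(x, y):
    """Geometric definition for 2-D vectors: ||x|| ||y|| cos(theta)."""
    beta = math.atan2(x[1], x[0])    # angle of x with the horizontal axis
    alpha = math.atan2(y[1], y[0])   # angle of y with the horizontal axis
    theta = beta - alpha             # angle between the two vectors
    return math.hypot(*x) * math.hypot(*y) * math.cos(theta)

x = (3, 5)   # made-up example vectors
y = (8, 2)

print(dot(x, y))             # 34
print(dot_geometric(x, y))   # approximately 34.0
```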

 A few words on notation

The dot product gets its name from the dot we write between the two vectors.
Talking about the dot product $\mathbf{x} \cdot \mathbf{y}$ is the same as talking about

the inner product $\langle \mathbf{x}, \mathbf{y} \rangle$ (in linear algebra)
the scalar product, because we take the product of two vectors and it returns a scalar (a real number)

The orthogonal projection of a vector

Given two vectors $\mathbf{x}$ and $\mathbf{y}$, we would like to find the orthogonal projection of $\mathbf{x}$ onto $\mathbf{y}$.

Figure 16

To do this we project the vector x onto y

Figure 17

This gives us the vector z

Figure 18 : z is the projection of x onto y

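By definition:

$\cos(\theta) = \frac{\|\mathbf{z}\|}{\|\mathbf{x}\|}$

$\|\mathbf{z}\| = \|\mathbf{x}\| \cos(\theta)$

We saw in the section about the dot product that

$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$

So we replace $\cos(\theta)$ in our equation:

$\|\mathbf{z}\| = \|\mathbf{x}\| \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$

$\|\mathbf{z}\| = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{y}\|}$

If we define the vector $\mathbf{u}$ as the direction of $\mathbf{y}$ then

$\mathbf{u} = \frac{\mathbf{y}}{\|\mathbf{y}\|}$

and

$\|\mathbf{z}\| = \mathbf{u} \cdot \mathbf{x}$

We now have a simple way to compute the norm of the vector $\mathbf{z}$.

Since this vector is in the same direction as $\mathbf{y}$, it has the direction $\mathbf{u}$:

$\mathbf{u} = \frac{\mathbf{z}}{\|\mathbf{z}\|}$

$\mathbf{z} = \|\mathbf{z}\| \mathbf{u}$

And we can say:

The vector $\mathbf{z} = (\mathbf{u} \cdot \mathbf{x}) \mathbf{u}$ is the orthogonal projection of $\mathbf{x}$ onto $\mathbf{y}$.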

Why are we interested in the orthogonal projection? Well, in our example, it allows us to compute the distance between $\mathbf{x}$ and the line which goes through $\mathbf{y}$.

Figure 19

We see that this distance is $\|\mathbf{x} - \mathbf{z}\|$

$\|\mathbf{x} - \mathbf{z}\| = \sqrt{(3 - 4)^2 + (5 - 1)^2} = \sqrt{17}$
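
The projection formula can be sketched in Python as follows. The numbers in the equation above suggest x = (3, 5) and z = (4, 1); the value y = (8, 2) used below is an assumption on my part (it is not stated in the text), but any non-zero multiple of (4, 1) would give the same projection.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def project(x, y):
    """Orthogonal projection of x onto y, computed as z = (u . x) u with u = y / ||y||."""
    length = math.sqrt(dot(y, y))
    u = tuple(c / length for c in y)
    scale = dot(u, x)
    return tuple(scale * c for c in u)

x = (3, 5)
y = (8, 2)   # assumed value, see the note above
z = project(x, y)
distance = math.hypot(x[0] - z[0], x[1] - z[1])

print(z)          # approximately (4.0, 1.0)
print(distance)   # approximately 4.123, i.e. sqrt(17)
```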

The SVM hyperplane

Understanding the equation of the hyperplane

You probably learnt that the equation of a line is $y = ax + b$. However, when reading about hyperplanes, you will often find that the equation of a hyperplane is defined by:

$\mathbf{w}^T \mathbf{x} = 0$

How do these two forms relate?

In the hyperplane equation you can see that the names of the variables are in bold, which means that they are vectors! Moreover, $\mathbf{w}^T \mathbf{x}$ is how we compute the inner product of two vectors, and if you recall, the inner product is just another name for the dot product!

Note that

$y = ax + b$

is the same thing as

$y - ax - b = 0$

Given two vectors $\mathbf{w} = \begin{pmatrix} -b \\ -a \\ 1 \end{pmatrix}$ and $\mathbf{x} = \begin{pmatrix} 1 \\ x \\ y \end{pmatrix}$

$\mathbf{w}^T \mathbf{x} = -b \times (1) + (-a) \times x + 1 \times y$

$\mathbf{w}^T \mathbf{x} = y - ax - b$

The two equations are just different ways of expressing the same thing.

It is interesting to note that $w_0$ is $-b$, which means that this value determines the intersection of the line with the vertical axis.

Why do we use the hyperplane equation $\mathbf{w}^T \mathbf{x}$ instead of $y = ax + b$?

For two reasons:

it is easier to work in more than two dimensions with this notation,
the vector w will always be normal to the hyperplane (Note: I received a lot of questions about this last remark. w will always be normal because we use this vector to define the hyperplane, so by definition it will be normal. As you can see on this page, when we define a hyperplane, we suppose that we have a vector that is orthogonal to the hyperplane)

And this last property will come in handy to compute the distance from a point to the hyperplane.
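
Both claims can be checked numerically with a short Python sketch. The slope and intercept are made-up values; the sketch assumes the augmented vectors w = (-b, -a, 1) and x = (1, x, y) introduced above, and treats (-a, 1) as the part of w that lives in the plane of the line.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

a, b = 2.0, 1.0   # made-up slope and intercept of the line y = a*x + b

# Augmented form: w = (-b, -a, 1) and x = (1, x, y), so that w . x = y - a*x - b.
w = (-b, -a, 1.0)
on_line = [(1.0, x, a * x + b) for x in (-1.0, 0.0, 2.5)]
print([dot(w, p) for p in on_line])   # [0.0, 0.0, 0.0]

# The last two coordinates of w, (-a, 1), are normal to the line:
# their dot product with a vector running along the line, (1, a), is zero.
normal = (-a, 1.0)
along = (1.0, a)
print(dot(normal, along))             # 0.0
```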

Compute the distance from a point to the hyperplane

In Figure 20 we have a hyperplane, which separates two groups of data.

Figure 20

To simplify this example, we have set $w_0 = 0$.

As you can see in Figure 20, the equation of the hyperplane is:

$x_2 = -2x_1$

which is equivalent to

$\mathbf{w}^T \mathbf{x} = 0$

with $\mathbf{w} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$ and $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$

Note that the vector w is shown in Figure 20. (w is not a data point)

We would like to compute the distance between the point A(3, 4) and the hyperplane.

This is the distance between A and its projection onto the hyperplane

Figure 21

We can view the point A as a vector from the origin to A.


If we project it onto the normal vector w
Figure 22 : projection of a onto w

We get the vector p

Figure 23: p is the projection of a onto w

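Our goal is to find the distance between the point $A(3, 4)$ and the hyperplane.
We can see in Figure 23 that this distance is the same thing as $\|\mathbf{p}\|$.
Let's compute this value.

We start with two vectors, $\mathbf{w} = (2, 1)$ which is normal to the hyperplane, and $\mathbf{a} = (3, 4)$ which is the vector between the origin and A.

$\|\mathbf{w}\| = \sqrt{2^2 + 1^2} = \sqrt{5}$

Let the vector $\mathbf{u}$ be the direction of $\mathbf{w}$

$\mathbf{u} = \left(\frac{2}{\sqrt{5}}, \frac{1}{\sqrt{5}}\right)$

$\mathbf{p}$ is the orthogonal projection of $\mathbf{a}$ onto $\mathbf{w}$ so:

$\mathbf{p} = (\mathbf{u} \cdot \mathbf{a}) \mathbf{u}$

$\mathbf{p} = \left(3 \times \frac{2}{\sqrt{5}} + 4 \times \frac{1}{\sqrt{5}}\right) \mathbf{u}$

$\mathbf{p} = \left(\frac{6}{\sqrt{5}} + \frac{4}{\sqrt{5}}\right) \mathbf{u}$

$\mathbf{p} = \frac{10}{\sqrt{5}} \mathbf{u}$

$\mathbf{p} = \left(\frac{10}{\sqrt{5}} \times \frac{2}{\sqrt{5}}, \frac{10}{\sqrt{5}} \times \frac{1}{\sqrt{5}}\right)$

$\mathbf{p} = \left(\frac{20}{5}, \frac{10}{5}\right)$

$\mathbf{p} = (4, 2)$

$\|\mathbf{p}\| = \sqrt{4^2 + 2^2} = 2\sqrt{5}$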

Compute the margin of the hyperplane

Now that we have the distance $\|\mathbf{p}\|$ between A and the hyperplane, the margin is defined by:

$\text{margin} = 2\|\mathbf{p}\| = 4\sqrt{5}$

We did it! We computed the margin of the hyperplane!
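
The whole computation fits in a few lines of Python. This is just a sketch that replays the steps above with the same values, w = (2, 1) and a = (3, 4); the helper `dot` is my own.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

w = (2, 1)   # normal vector of the hyperplane x2 = -2*x1
a = (3, 4)   # the point A seen as a vector from the origin

norm_w = math.sqrt(dot(w, w))
u = tuple(c / norm_w for c in w)         # direction of w
scale = dot(u, a)                        # u . a
p = tuple(scale * c for c in u)          # projection of a onto w
distance = math.sqrt(dot(p, p))          # ||p||
margin = 2 * distance

print(p)          # approximately (4.0, 2.0)
print(distance)   # approximately 4.472, i.e. 2 * sqrt(5)
print(margin)     # approximately 8.944, i.e. 4 * sqrt(5)
```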

Conclusion

This ends Part 2 of this tutorial about the math behind SVM.
There was a lot more math, but I hope you have been able to follow the article without problems.

What's next?

Now that we know how to compute the margin, we might want to know how to select the best hyperplane. This is described in Part 3 of the tutorial: How to find the optimal hyperplane?

SVM - Understanding the math - the optimal hyperplane

This is Part 3 of my series of tutorials about the math behind Support Vector Machine.

If you did not read the previous articles, you might want to start the series at the beginning by reading this article: an overview of Support Vector Machine.

What is this article about?

The main focus of this article is to show you the reasoning allowing us to select the optimal hyperplane.

Here is a quick summary of what we will see:

How can we find the optimal hyperplane ?


How do we calculate the distance between two hyperplanes ?
What is the SVM optimization problem ?

How to find the optimal hyperplane ?

At the end of Part 2 we computed the distance ∥p∥ between a point A and a hyperplane. We then computed the margin which
was equal to 2∥p∥ .

However, even if it did quite a good job at separating the data it was not the optimal hyperplane.

Figure 1: The margin we calculated in Part 2 is shown as M1


As we saw in Part 1, the optimal hyperplane is the one which maximizes the margin of the training data.

In Figure 1, we can see that the margin $M_1$, delimited by the two blue lines, is not the biggest margin separating the data perfectly. The biggest margin is the margin $M_2$ shown in Figure 2 below.

Figure 2: The optimal hyperplane is slightly on the left of the one we used in Part 2.

You can also see the optimal hyperplane in Figure 2. It is slightly to the left of our initial hyperplane. How did I find it? I simply traced a line crossing $M_2$ in its middle.

Right now you should have the feeling that hyperplanes and margins are closely related. And you would be right!

If I have a hyperplane, I can compute its margin with respect to some data point. If I have a margin delimited by two hyperplanes (the dark blue lines in Figure 2), I can find a third hyperplane passing right in the middle of the margin.

Finding the biggest margin, is the same thing as finding the optimal hyperplane.

How can we find the biggest margin ?

It is rather simple:

1. You have a dataset.
2. Select two hyperplanes which separate the data with no points between them.
3. Maximize their distance (the margin).

The region bounded by the two hyperplanes will be the biggest possible margin.

If it is so simple, why does everybody have so much trouble understanding SVM?

It is because, as always, simplicity requires some abstraction and mathematical terminology to be well understood.

So we will now go through this recipe step by step:

Step 1: You have a dataset D and you want to classify it

Most of the time your data will be composed of $n$ vectors $\mathbf{x}_i$.

Each $\mathbf{x}_i$ will also be associated with a value $y_i$ indicating whether the element belongs to the class (+1) or not (-1).
Note that $y_i$ can only have two possible values: -1 or +1.

Moreover, most of the time, for instance when you do text classification, your vector $\mathbf{x}_i$ ends up having a lot of dimensions. We say that $\mathbf{x}_i$ is a $p$-dimensional vector if it has $p$ dimensions.

So your dataset $\mathcal{D}$ is the set of $n$ pairs $(\mathbf{x}_i, y_i)$.

The more formal definition of an initial dataset in set theory is:

$\mathcal{D} = \left\{ (\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^p, \; y_i \in \{-1, 1\} \right\}_{i=1}^{n}$
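
In code, such a dataset could be represented as a plain list of pairs. The sketch below is only one possible representation, with made-up values; the checks at the end mirror the two conditions of the formal definition.

```python
# Each element is a pair (x_i, y_i): x_i is a p-dimensional vector, y_i is -1 or +1.
# The values are made up for illustration; here n = 4 and p = 2.
dataset = [
    ((1.0, 3.0), 1),
    ((2.0, 5.0), 1),
    ((4.0, 1.0), -1),
    ((6.0, 2.0), -1),
]

n = len(dataset)
p = len(dataset[0][0])

# Sanity checks matching the formal definition of D.
assert all(len(x) == p for x, _ in dataset)    # every x_i is in R^p
assert all(y in (-1, 1) for _, y in dataset)   # every y_i is -1 or +1

print(n, p)   # 4 2
```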

Step 2: You need to select two hyperplanes separating the data with no points between them

Finding two hyperplanes separating some data is easy when you have a pencil and a paper. But with some p -dimensional data
it becomes more difficult because you can't draw it.

Moreover, even if your data is only 2-dimensional it might not be possible to find a separating hyperplane!

You can only do that if your data is linearly separable.

Figure 3: Data on the left can be separated by a hyperplane, while data on the right can't

So let's assume that our dataset D IS linearly separable. We now want to find two hyperplanes with no points between them,
but we don't have a way to visualize them.

What do we know about hyperplanes that could help us ?

Taking another look at the hyperplane equation


We saw previously that the equation of a hyperplane can be written

$\mathbf{w}^T \mathbf{x} = 0$

However, in the Wikipedia article about Support Vector Machine it is said that:

Any hyperplane can be written as the set of points $\mathbf{x}$ satisfying $\mathbf{w} \cdot \mathbf{x} + b = 0$.

First, we recognize another notation for the dot product: the article uses $\mathbf{w} \cdot \mathbf{x}$ instead of $\mathbf{w}^T \mathbf{x}$.

You might wonder... Where does the $+b$ come from? Is our previous definition incorrect?

Not quite. Once again it is a question of notation. In our definition the vectors $\mathbf{w}$ and $\mathbf{x}$ have three dimensions, while in the Wikipedia definition they have two dimensions:

Given two 3-dimensional vectors $\mathbf{w}(b, -a, 1)$ and $\mathbf{x}(1, x, y)$

$\mathbf{w} \cdot \mathbf{x} = b \times (1) + (-a) \times x + 1 \times y$

$\mathbf{w} \cdot \mathbf{x} = y - ax + b \quad (1)$

Given two 2-dimensional vectors $\mathbf{w}'(-a, 1)$ and $\mathbf{x}'(x, y)$

$\mathbf{w}' \cdot \mathbf{x}' = (-a) \times x + 1 \times y$

$\mathbf{w}' \cdot \mathbf{x}' = y - ax \quad (2)$

Now if we add $b$ on both sides of equation (2) we get:

$\mathbf{w}' \cdot \mathbf{x}' + b = y - ax + b$

$\mathbf{w}' \cdot \mathbf{x}' + b = \mathbf{w} \cdot \mathbf{x} \quad (3)$

For the rest of this article we will use 2-dimensional vectors (as in equation (2)).
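
Equation (3) can be checked on a concrete example with the Python sketch below. The coefficients and the point are made-up values, and the variable names (`w3`, `x3`, `w2`, `x2`) are my own labels for the 3-dimensional and 2-dimensional vectors.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

a, b = 2.0, -3.0        # made-up coefficients
x, y = 1.5, 4.0         # an arbitrary point of the plane

w3 = (b, -a, 1.0)       # 3-dimensional w from equation (1)
x3 = (1.0, x, y)        # 3-dimensional x from equation (1)
w2 = (-a, 1.0)          # 2-dimensional w' from equation (2)
x2 = (x, y)             # 2-dimensional x' from equation (2)

lhs = dot(w2, x2) + b   # w' . x' + b
rhs = dot(w3, x3)       # w . x
print(lhs, rhs)         # -2.0 -2.0
```

The offset b is either stored inside the 3-dimensional w or added explicitly after the 2-dimensional dot product; both give the same number.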

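Given a hyperplane $H_0$ separating the dataset and satisfying:

$\mathbf{w} \cdot \mathbf{x} + b = 0$

We can select two other hyperplanes $H_1$ and $H_2$ which also separate the data and have the following equations:

$\mathbf{w} \cdot \mathbf{x} + b = \delta$

and

$\mathbf{w} \cdot \mathbf{x} + b = -\delta$

so that $H_0$ is equidistant from $H_1$ and $H_2$.

However, here the variable $\delta$ is not necessary: dividing $\mathbf{w}$ and $b$ by $\delta$ describes the same hyperplanes with 1 and -1 on the right-hand side. So we can set $\delta = 1$ to simplify the problem.

$\mathbf{w} \cdot \mathbf{x} + b = 1$

and

$\mathbf{w} \cdot \mathbf{x} + b = -1$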

Now we want to be sure that they have no points between them.

We won't select just any hyperplanes; we will only select those which meet the two following constraints:

For each vector $\mathbf{x}_i$ either:

$\mathbf{w} \cdot \mathbf{x}_i + b \geq 1$ for $\mathbf{x}_i$ having the class 1 $\quad (4)$

or

$\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$ for $\mathbf{x}_i$ having the class -1 $\quad (5)$

Understanding the constraints


On the following figures, all red points have the class 1 and all blue points have the class −1.

So let's look at Figure 4 below and consider the point A . It is red so it has the class 1 and we need to verify it does not violate
the constraint w ⋅ xi + b ≥ 1 

When xi = A we see that the point is on the hyperplane so w ⋅ xi + b = 1  and the constraint is respected. The same applies
for B .

When xi = C we see that the point is above the hyperplane so w ⋅ xi + b > 1  and the constraint is respected. The same
applies for D , E, F and G.

With an analogous reasoning you should find that the second constraint is respected for the class −1.
Figure 4: Two hyperplanes satisfying the constraints

In Figure 5, we see another pair of hyperplanes respecting the constraints:

Figure 5: Two hyperplanes also satisfying the constraints

And now we will examine cases where the constraints are not respected:

Figure 6: The right hyperplane does not satisfy the first constraint

Figure 7: The left hyperplane does not satisfy the second constraint

Figure 8: Neither constraint is satisfied

What does it mean when a constraint is not respected? It means that we cannot select these two hyperplanes. You can see that every time the constraints are not satisfied (Figures 6, 7 and 8) there are points between the two hyperplanes.

By defining these constraints, we found a way to reach our initial goal of selecting two hyperplanes without points between them. And it works not only in our examples but also in $p$ dimensions!

Combining both constraints


In mathematics, people like things to be expressed concisely.

Equations (4) and (5) can be combined into a single constraint.

We start with equation (5):

for $\mathbf{x}_i$ having the class -1

$\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$

and multiply both sides by $y_i$ (which is always -1 in this equation). Multiplying by a negative number reverses the inequality:

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq y_i(-1)$

Which means equation (5) can also be written:

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$ for $\mathbf{x}_i$ having the class -1 $\quad (6)$

In equation (4), as $y_i = 1$, multiplying by $y_i$ doesn't change the direction of the inequality:

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$ for $\mathbf{x}_i$ having the class 1 $\quad (7)$

We combine equations (6) and (7):

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$ for all $1 \leq i \leq n \quad (8)$

We now have a single constraint (equation 8) instead of two (equations 4 and 5), but they are mathematically equivalent. So their effect is the same (there will be no points between the two hyperplanes).
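
Here is a minimal Python sketch of how constraint (8) could be checked for a candidate pair (w, b). The dataset and the two candidates are made up for illustration, and the function name `satisfies_constraints` is hypothetical.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def satisfies_constraints(dataset, w, b):
    """True if y_i * (w . x_i + b) >= 1 holds for every pair (x_i, y_i)."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in dataset)

# Made-up, linearly separable 2-D data: class +1 on the right, class -1 on the left.
dataset = [((3.0, 1.0), 1), ((4.0, 2.0), 1), ((0.0, 1.0), -1), ((1.0, 3.0), -1)]

print(satisfies_constraints(dataset, (1.0, 0.0), -2.0))   # True
print(satisfies_constraints(dataset, (1.0, 0.0), -3.0))   # False
```

With the second candidate, the point (3, 1) gives w . x + b = 0, so it falls strictly between the two hyperplanes and the constraint is violated.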

Step 3: Maximize the distance between the two hyperplanes

This is probably the hardest part of the problem. But don't worry, I will explain everything along the way.

a) What is the distance between our two hyperplanes ?


Before trying to maximize the distance between the two hyperplanes, we will first ask ourselves: how do we compute it?

Let:

$\mathcal{H}_0$ be the hyperplane having the equation $\mathbf{w} \cdot \mathbf{x} + b = -1$
$\mathcal{H}_1$ be the hyperplane having the equation $\mathbf{w} \cdot \mathbf{x} + b = 1$
$\mathbf{x}_0$ be a point in the hyperplane $\mathcal{H}_0$.

We will call $m$ the perpendicular distance from $\mathbf{x}_0$ to the hyperplane $\mathcal{H}_1$. By definition, $m$ is what we usually call the margin.

As $\mathbf{x}_0$ is in $\mathcal{H}_0$, $m$ is the distance between hyperplanes $\mathcal{H}_0$ and $\mathcal{H}_1$.

We will now try to find the value of $m$.

Figure 9: m is the distance between the two hyperplanes

You might be tempted to think that if we add $m$ to $\mathbf{x}_0$ we will get another point, and this point will be on the other hyperplane!

But it does not work, because $m$ is a scalar and $\mathbf{x}_0$ is a vector, and adding a scalar to a vector is not possible. However, we know that adding two vectors is possible, so if we transform $m$ into a vector we will be able to do an addition.

We can find the set of all points which are at a distance $m$ from $\mathbf{x}_0$. It can be represented as a circle:

Figure 10: All points on the circle are at the distance m from x0

Looking at the picture, the necessity of a vector becomes clear. With just the length $m$ we are missing one crucial piece of information: the direction. (Recall from Part 2 that a vector has a magnitude and a direction.)

We can't add a scalar to a vector, but we know that if we multiply a vector by a scalar we get another vector.

From our initial statement, we want this vector:

1. to have a magnitude of $m$
2. to be perpendicular to the hyperplane $\mathcal{H}_1$

Fortunately, we already know a vector perpendicular to $\mathcal{H}_1$, that is $\mathbf{w}$ (because $\mathcal{H}_1$ is defined by $\mathbf{w} \cdot \mathbf{x} + b = 1$).

Figure 11: w is perpendicular to H1

Let's define $\mathbf{u} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$, the unit vector of $\mathbf{w}$. As it is a unit vector, $\|\mathbf{u}\| = 1$, and it has the same direction as $\mathbf{w}$, so it is also perpendicular to the hyperplane.

Figure 12: u is also perpendicular to H1

If we multiply $\mathbf{u}$ by $m$ we get the vector $\mathbf{k} = m\mathbf{u}$ and:

1. $\|\mathbf{k}\| = m$
2. $\mathbf{k}$ is perpendicular to $\mathcal{H}_1$ (because it has the same direction as $\mathbf{u}$)

From these properties we can see that $\mathbf{k}$ is the vector we were looking for.
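Figure 13: k is a vector of length m perpendicular to H1

$\mathbf{k} = m\mathbf{u} = m \frac{\mathbf{w}}{\|\mathbf{w}\|} \quad (9)$

We did it! We transformed our scalar $m$ into a vector $\mathbf{k}$ which we can use to perform an addition with the vector $\mathbf{x}_0$.

If we start from the point $\mathbf{x}_0$ and add $\mathbf{k}$ we find that the point $\mathbf{z}_0 = \mathbf{x}_0 + \mathbf{k}$ is in the hyperplane $\mathcal{H}_1$ as shown in Figure 14.

Figure 14: z0 is a point on H1

The fact that $\mathbf{z}_0$ is in $\mathcal{H}_1$ means that

$\mathbf{w} \cdot \mathbf{z}_0 + b = 1 \quad (10)$

We can replace $\mathbf{z}_0$ by $\mathbf{x}_0 + \mathbf{k}$ because that is how we constructed it.

$\mathbf{w} \cdot (\mathbf{x}_0 + \mathbf{k}) + b = 1 \quad (11)$

We can now replace $\mathbf{k}$ using equation (9)

$\mathbf{w} \cdot \left(\mathbf{x}_0 + m \frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + b = 1 \quad (12)$

We now expand equation (12)

$\mathbf{w} \cdot \mathbf{x}_0 + m \frac{\mathbf{w} \cdot \mathbf{w}}{\|\mathbf{w}\|} + b = 1 \quad (13)$

The dot product of a vector with itself is the square of its norm so:

$\mathbf{w} \cdot \mathbf{x}_0 + m \frac{\|\mathbf{w}\|^2}{\|\mathbf{w}\|} + b = 1 \quad (14)$

$\mathbf{w} \cdot \mathbf{x}_0 + m\|\mathbf{w}\| + b = 1 \quad (15)$

$\mathbf{w} \cdot \mathbf{x}_0 + b = 1 - m\|\mathbf{w}\| \quad (16)$

As $\mathbf{x}_0$ is in $\mathcal{H}_0$ then $\mathbf{w} \cdot \mathbf{x}_0 + b = -1$

$-1 = 1 - m\|\mathbf{w}\| \quad (17)$

$m\|\mathbf{w}\| = 2 \quad (18)$

$m = \frac{2}{\|\mathbf{w}\|} \quad (19)$

This is it! We found a way to compute $m$.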
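
The derivation can be checked numerically with a small Python sketch. The values of w, b and x0 below are made up (x0 is chosen so that it lies on the hyperplane w . x + b = -1); the sketch builds k from equation (9) and verifies that x0 + k lands on the hyperplane w . x + b = 1.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

w = (3.0, 4.0)    # made-up normal vector, ||w|| = 5
b = -2.0          # made-up offset

norm_w = math.sqrt(dot(w, w))
m = 2 / norm_w                                # equation (19)

x0 = (-1.0, 1.0)                              # a point of H0: w . x0 + b = -1
k = tuple(m * c / norm_w for c in w)          # equation (9): k = m * w / ||w||
z0 = tuple(p + q for p, q in zip(x0, k))      # z0 = x0 + k

print(dot(w, x0) + b)   # -1.0, so x0 lies on H0
print(dot(w, z0) + b)   # approximately 1.0, so z0 lies on H1
print(m)                # 0.4
```

Notice that m only depends on the norm of w: the smaller the norm, the bigger the margin.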