Bruno Gonçalves
www.data4sci.com/newsletter
https://github.com/DataForScience/DeepLearning
Question https://github.com/DataForScience/DeepLearning
• Where are you located?
• Europe
• Asia
• Africa
• US
• Canada
• Latin America
• Oceania
@bgoncalves www.data4sci.com
Question https://github.com/DataForScience/DeepLearning
• What’s your job title?
• Data Scientist
• Statistician
• Data Engineer
• Researcher
• Business Analyst
• Software Engineer
• Other
@bgoncalves www.data4sci.com
Question https://github.com/DataForScience/DeepLearning
• How experienced are you in Python?
@bgoncalves www.data4sci.com
References https://github.com/DataForScience/DeepLearning
https://github.com/DataForScience/DeepLearning
@bgoncalves www.data4sci.com
Machine Learning
@bgoncalves www.data4sci.com
3 Types of Machine Learning https://github.com/bmtgoncalves/Neural-Networks/
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
@bgoncalves www.data4sci.com
Optimization Problem https://github.com/bmtgoncalves/Neural-Networks/
@bgoncalves www.data4sci.com
Supervised Learning
@bgoncalves www.data4sci.com
Supervised Learning - Regression
• Dataset formatted as an MxN matrix of M samples and N features, plus a column of target values
• Linear Regression
• Neural Networks
• Two fundamental types of problems: Regression (continuous output value) and Classification (discrete label)
[Figure: data matrix with rows Sample 1 … Sample M, columns Feature 1 … Feature N, and the value column]
@bgoncalves www.data4sci.com
Linear Regression
• Each point is represented by a vector x⃗i = (x0, x1, ⋯, xn)ᵀ
• Add x0 ≡ 1 to account for the intercept
y ≈ f(x⃗) = w0 + w1 x1
[Figure: scatter plot of y against x1 with the fitted line]
@bgoncalves www.data4sci.com
Linear Regression
• We are assuming that our functional dependence is of the form:
  f(x⃗) = w0 + w1 x1 + ⋯ + wn xn ≡ X w⃗
  and it imposes a Constraint on the solutions that can be found.
• We quantify how far our hypothesis is from the correct value using an Error Function:
  Jw(X, y⃗) = 1/(2m) Σi [hw(x⃗(i)) − y(i)]²
  or, vectorially:
  Jw(X, y⃗) = 1/(2m) [X w⃗ − y⃗]²
[Figure: the data matrix X (Sample 1 … Sample M by Feature 1 … Feature N) and the target vector y]
@bgoncalves www.data4sci.com
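A minimal NumPy sketch of this error function (the names X, w, and y are illustrative, not taken from the notebooks):

import numpy as np

def mse_cost(X, w, y):
    # J_w(X, y) = 1/(2m) * sum((X·w - y)^2)
    m = X.shape[0]
    residuals = X @ w - y          # h_w(x) - y for every sample
    return residuals @ residuals / (2 * m)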
Geometric Interpretation
Jw(X, y⃗) = 1/(2m) [X w⃗ − y⃗]²
[Figure: scatter plot of y against x1 with the fitted line; each term of the cost is a squared vertical distance between a point and the line]
@bgoncalves www.data4sci.com
Gradient Descent
• Goal: Find the minimum of Jw(X, y⃗) by varying the components of w⃗
• At each step, move along − δ/δw⃗ Jw(X, y⃗)
• Algorithm:
[Diagram: the Learning Algorithm combines the Error Function with the Constraint]
@bgoncalves www.data4sci.com
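A short gradient-descent sketch for the error function above (the learning rate alpha and number of steps are illustrative choices, not values from the slides):

import numpy as np

def gradient_descent(X, y, alpha=0.01, n_steps=1000):
    # minimize J_w(X, y) by repeatedly stepping against the gradient
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / m   # dJ/dw for the squared-error cost
        w -= alpha * grad              # step in the direction of -grad
    return w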
Geometric Interpretation
• 2D: y = w0 + w1 x1    3D: y = w0 + w1 x1 + w2 x2    nD: y = X w⃗
• Add x0 ≡ 1 to account for the intercept
• Finds the hyperplane that splits the points in two such that the errors on each side balance out
[Figure: the same fit shown as a line in 2D, a plane in 3D, and a hyperplane in n dimensions]
@bgoncalves www.data4sci.com
Code - Linear Regression
https://github.com/DataForScience/DeepLearning
Linear Regression
@bgoncalves
Supervised Learning - Classification
• Dataset formatted as an NxM matrix of N samples and M features, plus a column of labels
• Logistic Regression
• Neural Networks
• Two fundamental types of problems:
  • Regression (continuous output value)
  • Classification (discrete output label)
[Figure: data matrix with rows Sample 1 … Sample N, columns Feature 1 … Feature M, and the label column]
@bgoncalves www.data4sci.com
Logistic Regression (Classification)
• Not actually regression, but rather Classification
z ≡ X w⃗
ϕ(z) = 1 / (1 + e^(−z))
• z encapsulates all the parameters and input values
@bgoncalves www.data4sci.com
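A small sketch of the logistic hypothesis (function and variable names are assumptions):

import numpy as np

def sigmoid(z):
    # phi(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    # h_w(X): probability that each sample belongs to the positive class
    return sigmoid(X @ w)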
Geometric Interpretation
ϕ(z) ≥ 1/2
[Figure: the sigmoid curve with the classification threshold at ϕ(z) = 1/2]
@bgoncalves www.data4sci.com
Logistic Regression
• Error Function - Cross Entropy:
  Jw(X, y⃗) = −(1/m) [yᵀ log(hw(X)) + (1 − y)ᵀ log(1 − hw(X))]
  measures the "distance" between two probability distributions, with
  hw(X) = 1 / (1 + e^(−X w⃗))
• Effectively treating the labels as probabilities (an instance with label=1 has Probability 1 of belonging to the class).
@bgoncalves www.data4sci.com
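A NumPy sketch of this cross-entropy error function (array names are placeholders):

import numpy as np

def cross_entropy_cost(X, w, y):
    # J_w(X, y) = -1/m * [y·log(h) + (1-y)·log(1-h)], with h = sigmoid(X·w)
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m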
Learning Procedure
[Diagram: the Learning Algorithm combines the Error Function with the Constraint]
@bgoncalves www.data4sci.com
Iris dataset
@bgoncalves www.data4sci.com
Code - Logistic Regression
https://github.com/DataForScience/DeepLearning
Logistic Regression
@bgoncalves www.data4sci.com
Practical Considerations
• So far we have looked at very idealized cases. Reality is never this
simple!
• Data normalization
• Overfitting
• Hyperparameters
• etc…
@bgoncalves www.data4sci.com
Linear boundaries
• Both Linear Regression and Logistic Regression rely on hyperplanes to separate the data points. Unfortunately, this is not always possible:
[Figure: points labeled 0 and 1 for the AND, OR, and XOR problems]
@bgoncalves www.data4sci.com
Data Normalization https://github.com/bmtgoncalves/Neural-Networks/
• In the rest of the discussion we will assume that the data has been normalized in some
suitable way
@bgoncalves www.data4sci.com
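One common choice is to standardize each feature; a sketch (not necessarily the scheme used in the notebooks):

import numpy as np

def standardize(X):
    # rescale each feature (column) to zero mean and unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)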
Supervised Learning - Overfitting
• "Learning the noise"
• "Memorization" instead of "generalization"
• How can we prevent it?
  • Train the model using only the Training dataset and evaluate the results on the previously unseen Testing dataset.
  • Single split
[Figure: the data matrix split by rows into a Training set and a Testing set]
@bgoncalves www.data4sci.com
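A minimal sketch of a single train/test split (the 80/20 ratio is an assumption, not from the slides):

import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    # shuffle the samples and hold out a fraction of them for testing
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]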
Bias-Variance Tradeoff
@bgoncalves www.data4sci.com
Bias-Variance Tradeoff
[Figure: Training and Testing Error as a function of Model Complexity; Bias decreases and Variance increases with complexity, going from High Bias / Low Variance to Low Bias / High Variance]
@bgoncalves www.data4sci.com
Comparison
• In both cases z = X w⃗ maps the features to a continuous variable.
• Linear Regression:
  Jw(X, y⃗) = 1/(2m) [hw(X) − y⃗]²
  δ/δwj Jw(X, y⃗) = (1/m) Xᵀ · (hw(X) − y⃗)
• Logistic Regression:
  Jw(X, y⃗) = −(1/m) [yᵀ log(hw(X)) + (1 − y)ᵀ log(1 − hw(X))]
  δ/δwj Jw(X, y⃗) = (1/m) Xᵀ · (hw(X) − y⃗)
• The gradient takes exactly the same form in both cases.
@bgoncalves www.data4sci.com
Similar Structure
[Diagram: a single unit; Inputs x1 … xN plus a Bias input (1), Weights w0j … wNj, weighted sum zj = wᵀx, and Activation function ϕ(z)]
@bgoncalves www.data4sci.com
What about Neurons?
Biological Neuron
@bgoncalves www.data4sci.com
How the Brain “Works” (Cartoon version)
@bgoncalves www.data4sci.com
How the Brain “Works” (Cartoon version)
• Each neuron receives input from other neurons
• ~10¹¹ neurons, each with ~10⁴ weights
• Weights can be positive or negative
• Weights adapt during the learning process
• "neurons that fire together wire together" (Hebb)
• Different areas perform different functions using the same structure (Modularity)
@bgoncalves www.data4sci.com
How the Brain “Works” (Cartoon version)
@bgoncalves www.data4sci.com
@bgoncalves www.data4sci.com
Perceptron
[Diagram: a perceptron; Inputs x1 … xN plus a Bias input (1), Weights w0j … wNj, weighted sum zj = wᵀx, Activation function ϕ(z), and Output aj]
@bgoncalves www.data4sci.com
Activation Function http://github.com/bmtgoncalves/Neural-Networks
• Non-Linear function
• Differentiable
• non-decreasing
@bgoncalves www.data4sci.com
Activation Function - Linear
• Non-Linear function
• Differentiable
• non-decreasing
ϕ (z) = z
• Compute new sets of features
• The simplest
@bgoncalves www.data4sci.com
Activation Function - Linear
• Non-Linear function
Linear Regression
• Differentiable
• non-decreasing
ϕ (z) = z
• Compute new sets of features
• The simplest
@bgoncalves www.data4sci.com
Activation Function - Sigmoid
• Non-Linear function
• Differentiable
• non-decreasing
ϕ(z) = 1 / (1 + e^(−z))
• Compute new sets of features
@bgoncalves www.data4sci.com
Activation Function - Sigmoid
• Non-Linear function
Logistic Regression
• Differentiable
• non-decreasing
ϕ(z) = 1 / (1 + e^(−z))
• Compute new sets of features
@bgoncalves www.data4sci.com
Activation Function - ReLU
• Non-Linear function
• Differentiable
• non-decreasing
ϕ(z) = max(0, z)
• Compute new sets of features
@bgoncalves www.data4sci.com
Activation Function - ReLU
• Non-Linear function
Stepwise Regression
• Differentiable
• non-decreasing
ϕ(z) = max(0, z)
• Compute new sets of features
@bgoncalves www.data4sci.com
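A quick sketch of the three activation functions discussed above:

import numpy as np

def linear(z):
    return z                           # phi(z) = z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # phi(z) = 1 / (1 + exp(-z))

def relu(z):
    return np.maximum(0, z)            # phi(z) = max(0, z)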
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
f̂(x) = Σi ci Bi(x)
• The basis functions Bi(x) can be:
  • Constant
  • Products of hinges
@bgoncalves www.data4sci.com
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
f̂(x) = Σi ci Bi(x)

ŷ(x) = 1.013
       + 1.198 max(0, x − 0.485)
       − 1.803 max(0, 0.485 − x)
       − 1.321 max(0, x − 0.283)
       − 1.609 max(0, x − 0.640)
       + 1.591 max(0, x − 0.907)
@bgoncalves www.data4sci.com
Perceptron - Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
aj = ϕ(wᵀx)
[Diagram: Inputs x1 … xN, weights w0j … wNj, weighted sum wᵀx, activation ϕ, output aj]
@bgoncalves www.data4sci.com
Perceptron - Forward Propagation http://github.com/bmtgoncalves/Neural-Networks
return a
@bgoncalves
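The code on this slide is only partially captured above; a minimal sketch of a perceptron forward pass (function and variable names are assumptions):

import numpy as np

def forward(x, w, phi):
    x_ = np.concatenate(([1], x))   # prepend the bias input x0 = 1
    z = np.dot(w, x_)               # weighted sum z = w^T x
    a = phi(z)                      # apply the activation function
    return a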
Perceptron - Training
• Training Procedure (see the sketch below):
  • If correct, do nothing
@bgoncalves www.data4sci.com
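The rest of the training procedure is not captured on the slide; the classic perceptron rule is one way it can be completed (a sketch, not necessarily the exact rule shown originally):

import numpy as np

def perceptron_update(w, x, y, alpha=0.1):
    # if the prediction is correct, do nothing; otherwise nudge the weights
    x_ = np.concatenate(([1], x))
    y_hat = 1 if np.dot(w, x_) >= 0 else 0
    if y_hat != y:
        w = w + alpha * (y - y_hat) * x_
    return w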
Code - Forward Propagation
https://github.com/DataForScience/DeepLearning
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one (see the sketch below).
[Diagram: inputs x1 … xN feed a first layer of units aj = ϕ(wᵀx); their outputs a1 … aN feed a second layer ak = ϕ(wᵀa)]
• But how can we propagate back the errors and update the weights?
@bgoncalves www.data4sci.com
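A sketch of chaining two layers in this way (the weight matrices Theta1 and Theta2 are illustrative names):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_two_layers(x, Theta1, Theta2):
    # the output of layer 1 becomes the input of layer 2
    a1 = sigmoid(Theta1 @ np.concatenate(([1], x)))    # first layer (bias prepended)
    a2 = sigmoid(Theta2 @ np.concatenate(([1], a1)))   # second layer uses a1 as its input
    return a2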
Stepwise Regression https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
f̂(x) = Σi ci Bi(x)

ŷ(x) = 1.013
       + 1.198 max(0, x − 0.485)
       − 1.803 max(0, 0.485 − x)
       − 1.321 max(0, x − 0.283)
       − 1.609 max(0, x − 0.640)
       + 1.591 max(0, x − 0.907)

[Diagram: the same fit expressed as a network; the input x feeds several ReLU units whose outputs are combined by a single Linear output unit]
@bgoncalves www.data4sci.com
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this:
  • Mean Squared Error (as in Linear Regression)
  • Cross Entropy (as in Logistic Regression)
@bgoncalves www.data4sci.com
Regularization
• Helps keep weights relatively small by adding a penalization to the cost function.
• Lasso helps with feature selection by driving less important weights to zero
@bgoncalves www.data4sci.com
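A sketch of how such a penalty is added to an existing cost value (lambda, the helper name, and the choice not to penalize the bias weight are assumptions):

import numpy as np

def regularized_cost(J, w, lam=0.1, kind="l2"):
    # add an L1 (Lasso) or L2 (ridge) penalty on the weights to a base cost J
    if kind == "l1":
        return J + lam * np.sum(np.abs(w[1:]))   # Lasso: drives small weights to zero
    return J + lam * np.sum(w[1:] ** 2)          # ridge: keeps weights small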
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:
  • a forward pass that computes the activations of every layer
  • a backward pass that propagates the errors from the output back through the network
• The error at the output is a weighted average difference between the predicted output and the observed one.
@bgoncalves www.data4sci.com
BackProp
• Let δ(l) be the error at each of the total L layers:
• Then:
  δ(L) = hw(X) − y
• And for every other layer, in reverse order:
  δ(l) = (w(l))ᵀ δ(l+1) ⊙ ϕ′(z(l))
• And finally:
  Δ(l)ij = Δ(l)ij + a(l)j δ(l+1)i
  ∂/∂w(l)ij Jw(X, y⃗) = (1/m) Δ(l)ij + λ w(l)ij
@bgoncalves www.data4sci.com
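A compact, vectorized sketch of these updates for a single hidden layer (illustrative names; not the exact code from the notebooks):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_one_hidden(X, y, Theta1, Theta2, lam=0.0):
    # gradients of the cross-entropy cost for a 1-hidden-layer network
    # y: one-hot labels with shape (m, K)
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])             # input layer with bias
    Z2 = A1 @ Theta1.T
    A2 = np.hstack([np.ones((m, 1)), sigmoid(Z2)])   # hidden layer with bias
    H = sigmoid(A2 @ Theta2.T)                       # network output h_w(X)
    d3 = H - y                                       # delta at the output layer
    d2 = (d3 @ Theta2)[:, 1:] * sigmoid(Z2) * (1 - sigmoid(Z2))  # hidden-layer delta
    grad1 = d2.T @ A1 / m + lam * Theta1             # (1/m) Delta + lambda w
    grad2 = d3.T @ A2 / m + lam * Theta2
    return grad1, grad2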
A practical example - MNIST http://github.com/bmtgoncalves/Neural-Networks
http://yann.lecun.com/exdb/mnist/
[Figure: the MNIST data matrix; rows Sample 1 … Sample N, columns Feature 1 … Feature M, plus a Label column]
@bgoncalves www.data4sci.com
A practical example - MNIST http://github.com/bmtgoncalves/Neural-Networks
http://yann.lecun.com/exdb/mnist/
[Diagram: the input X passes through the weight layers Θ1 and Θ2 and an arg max to produce the predicted label, trained against the Label column of the data matrix]
@bgoncalves www.data4sci.com
A practical example - MNIST http://github.com/bmtgoncalves/Neural-Networks
[Diagram: X → Θ1 → Θ2 → arg max, shown next to the forward-propagation code; the visible fragments include z = np.dot(X_, Theta.T) and return np.argmax(h2, 1) (see the sketch below)]
Vectors: 784 → 50 → 10 → 1
Matrices, Forward Propagation: Θ1 is 50 × 785, Θ2 is 10 × 51
Matrices, Backward Propagation: 50 × 784 and 10 × 50
@bgoncalves www.data4sci.com
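A sketch of a forward pass with those dimensions, consistent with the fragments visible on the slide (Theta1, Theta2 and the helper name are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, Theta1, Theta2):
    # forward pass: 784 inputs -> 50 hidden units -> 10 outputs -> arg max
    X_ = np.hstack([np.ones((X.shape[0], 1)), X])    # add the bias column
    h1 = sigmoid(np.dot(X_, Theta1.T))               # multiply by the weights
    h1_ = np.hstack([np.ones((h1.shape[0], 1)), h1])
    h2 = sigmoid(np.dot(h1_, Theta2.T))
    return np.argmax(h2, 1)                          # predicted digit for each sample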
Code - Simple Network
https://github.com/DataForScience/DeepLearning
Bias-Variance Tradeoff
@bgoncalves www.data4sci.com
Learning Rate
wij = wij − α δ/δwij Jw(X, y⃗)
• α defines the size of the step in the direction of δ/δwij Jw(X, y⃗)
• Epoch: one full pass over the training dataset
@bgoncalves www.data4sci.com
Tips
• online learning - update weights after each case
  - might be useful to update the model as new data is obtained
  - subject to fluctuations
@bgoncalves www.data4sci.com
Generalization
• Neural Networks are extremely modular in their design
• Fortunately, we can write code that is also modular and can easily handle arbitrary numbers of layers
• Let's describe the structure of our network as a list of weight matrices and activation functions
• We also need to keep track of the gradients of the activation functions so let us define a simple class:

import numpy as np

class Activation(object):
    def f(z):
        pass

    def df(z):
        pass

class Linear(Activation):
    def f(z):
        return z

    def df(z):
        return np.ones(z.shape)

class Sigmoid(Activation):
    def f(z):
        return 1./(1+np.exp(-z))

    def df(z):
        h = Sigmoid.f(z)
        return h*(1-h)
@bgoncalves www.data4sci.com
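Following the same pattern, further activations can be defined; for example a ReLU class (not on the original slide, just an illustration of the interface, reusing Activation and np from above):

class ReLU(Activation):
    def f(z):
        return np.maximum(0, z)          # phi(z) = max(0, z)

    def df(z):
        return (z > 0).astype(float)     # gradient: 1 where z > 0, 0 elsewhere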
Generalization
• Now we can describe our simple MNIST model with:
Thetas = []
Thetas.append(init_weights(input_layer_size, hidden_layer_size))
Thetas.append(init_weights(hidden_layer_size, num_labels))
model = []
model.append(Thetas[0])
model.append(Sigmoid)
model.append(Thetas[1])
model.append(Sigmoid)
• Where Sigmoid is an object that contains both the sigmoid function and its gradient as was
defined in the previous slide.
@bgoncalves
Generalization - Forward propagation
return a
@bgoncalves forward.py
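The forward.py code is only partially captured above; a minimal sketch of a forward pass over the model list from the previous slide (the original function may differ in detail):

import numpy as np

def forward(model, x):
    # propagate an input through alternating [Theta, Activation] pairs
    Thetas = model[0::2]          # weight matrices
    activations = model[1::2]     # Activation classes (f and df)
    a = x
    for Theta, act in zip(Thetas, activations):
        a_ = np.concatenate(([1], a))   # prepend the bias term
        z = np.dot(Theta, a_)
        a = act.f(z)
    return a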
def backprop(model, X, y):
    M = X.shape[0]
    Thetas = model[0::2]           # weight matrices
    activations = model[1::2]      # activation classes (f and df)
    layers = len(Thetas)
    K = Thetas[-1].shape[0]        # number of output classes

    J = 0
    Deltas = []
    for i in range(layers):
        Deltas.append(np.zeros(Thetas[i].shape))
    deltas = [0, 0, 0, 0]

    for i in range(M):
        As = []
        Zs = [0]
        Hs = [X[i]]

        # Forward propagation, saving intermediate results
        As.append(np.concatenate(([1], Hs[0])))  # Input layer

        y0 = one_hot(K, y[i])

        # Cross entropy
        J -= np.dot(y0.T, np.log(Hs[2])) + np.dot((1-y0).T, np.log(1-Hs[2]))

    J /= M

    grads = []
    grads.append(Deltas[0]/M)
    grads.append(Deltas[1]/M)

    return [J, grads]
@bgoncalves
Code - Modular Network
https://github.com/DataForScience/DeepLearning
Neural Network Architectures
@bgoncalves www.data4sci.com
word2vec Mikolov 2013
[Diagram: word2vec architecture; one-hot word vectors wj, the word-embedding matrix Θ1 and the context-embedding matrix Θ2, followed by the activation function, connect each word to its context words wj±1]
@bgoncalves www.data4sci.com
“You shall know a word by the company it keeps”
(J. R. Firth)
Analogies
• The embedding of each word is a function of the context it appears
in:
embedding(red) = f(context(red))
@bgoncalves www.data4sci.com
Feed Forward Networks
[Diagram: Input xt → Output ht]
ht = f(xt)
@bgoncalves www.data4sci.com
Recurrent Neural Network (RNN)
[Diagram: the Output ht depends on the Input xt and on the Previous Output ht−1, which is fed back into the network]
ht = f(xt, ht−1)
@bgoncalves www.data4sci.com
Recurrent Neural Network (RNN)
• Each output depends (implicitly) on all previous outputs.
[Diagram: the network unrolled in time; hidden states …, ht−1, ht, ht+1 each receive the input xt and the previous hidden state]
@bgoncalves www.data4sci.com
Long-Short Term Memory (LSTM)
• What if we want to keep explicit information about previous states (memory)?
• How much information is kept can be controlled through gates.
[Diagram: the unrolled LSTM; a cell state …, ct−1, ct, ct+1 is carried through time alongside the hidden states ht]
@bgoncalves www.data4sci.com
Convolutional Neural Networks
@bgoncalves www.data4sci.com
@bgoncalves
Interpretability
@bgoncalves www.data4sci.com
“Deep” learning
@bgoncalves www.data4sci.com
Webinars www.data4sci.com/newsletter
@bgoncalves www.data4sci.com