
Part 01_Introduction to Deep Learning

You'll also learn about model evaluation and validation, important techniques for training
and assessing neural networks. We also have guest instructor Andrew Trask, author
of Grokking Deep Learning, who will develop a neural network for processing text and
predicting sentiment.
You'll also use convolutional networks to build an autoencoder, a network architecture used
for image compression and denoising. Then, you'll use a pretrained neural network (VGGnet)
to classify images of flowers the network has never seen before, a technique known
as transfer learning.
Then, you'll learn about word embeddings and implement the Word2Vec model, a network
that can learn about semantic relationships between words. These are used to increase the
efficiency of networks when you're processing text.
Applying Deep Learning

 01. Introduction
 02. Style Transfer
 03. DeepTraffic
 04. Flappy Bird
 05. Books to Read
02. Style Transfer
Style Transfer
As an example of the kind of things you'll be building with deep learning models, here is a
really fun project, fast style transfer. Style transfer allows you to take famous paintings, and
recreate your own images in their styles! The network learns the underlying techniques of
those paintings and figures out how to apply them on its own. This model was trained on the
styles of famous paintings and is able to transfer those styles to other images and even videos!

I used it to style my cat Chihiro in the style of Hokusai's The Great Wave Off Kanagawa.
DeepTraffic
Another great application of deep learning is in simulating traffic and making driving
decisions. You can find the DeepTraffic simulator here. The network here is attempting
to learn a driving strategy such that the car is moving as fast as possible
using reinforcement learning. The network is rewarded when the car chooses actions
that result in it moving fast. It's this feedback that allows the network to find a
strategy of actions for optimal speed.

To learn more about setting the parameters and training the network, read
the overview here.

Discuss how you built your network and your results with your fellow students in
Study Groups.
04. Flappy Bird
Flappy Bird

In this example, you'll get to see a deep learning agent playing Flappy Bird! You have the
option to train the agent yourself, but for now let's just start with the pre-trained network given
by the author. Note that the following agent is able to play without being told any information
about the structure of the game or its rules. It automatically discovers the rules of the game by
finding out how it did on each iteration.

We will be following this repository by Yenchen Lin.

Instructions

1. Install miniconda or anaconda if you have not already. You can follow our tutorial for help.
2. Create an environment for flappybird
o Mac/Linux:  conda create --name=flappybird python=2.7
o Windows:  conda create --name=flappybird python=3.5
3. Enter your conda environment
o Mac/Linux:  source activate flappybird
o Windows:  activate flappybird
4. conda install -c menpo opencv3
5. pip install pygame
6. pip install tensorflow
7. git clone https://github.com/yenchenlin/DeepLearningFlappyBird.git
8. cd DeepLearningFlappyBird
9. python deep_q_network.py

If all went correctly, you should be seeing a deep learning based agent play Flappy Bird! The
repository contains instructions for training your own agent if you're interested!

Books to read
We believe that you learn best when you are exposed to multiple perspectives on the
same idea. As such, we recommend checking out a few of the books below to get an
added perspective on Deep Learning.

 Grokking Deep Learning by Andrew Trask. Use our exclusive discount
code traskud17 for 40% off. This provides a very gentle introduction to Deep
Learning and covers the intuition more than the theory.

 Neural Networks And Deep Learning by Michael Nielsen. This book is more
rigorous than Grokking Deep Learning and includes a lot of fun, interactive
visualizations to play with.

 The Deep Learning Textbook from Ian Goodfellow, Yoshua Bengio, and Aaron
Courville. This online book contains a lot of material and is the most rigorous of
the three books suggested.

INSTRUCTOR NOTE:

Anaconda is a distribution of packages built for data science. It comes with conda, a
package and environment manager. You'll be using conda to create environments for
isolating your projects that use different versions of Python and/or different packages.
You'll also use it to install, uninstall, and update packages in your environments. Using
Anaconda has made my life working with data much more pleasant.
04. Installing Anaconda
Installation instructions

Installing Anaconda
Anaconda is available for Windows, Mac OS X, and Linux. You can find the installers
and installation instructions at https://www.anaconda.com/download/.

If you already have Python installed on your computer, this won't break anything.
Instead, the default Python used by your scripts and programs will be the one that
comes with Anaconda.

Choose the Python 3.6 version; you can install Python 2 versions later. Also, choose
the 64-bit installer if you have a 64-bit operating system; otherwise go with the 32-bit
installer. Go ahead and choose the appropriate version, then install it. Continue on
afterwards!

After installation, you’re automatically in the default conda environment with all
packages installed which you can see below. You can check out your own install by
entering  conda list  into your terminal.

On Windows
A bunch of applications are installed along with Anaconda:

 Anaconda Navigator, a GUI for managing your environments and packages
 Anaconda Prompt, a terminal where you can use the command line interface to
manage your environments and packages
 Spyder, an IDE geared toward scientific development

To avoid errors later, it's best to update all the packages in the default environment.
Open the Anaconda Prompt application. In the prompt, run the following
commands:

conda upgrade conda
conda upgrade --all
and answer yes when asked if you want to install the packages. The packages that
come with the initial install tend to be out of date, so updating them now will prevent
future errors from out of date software.

Note: In the previous step, running  conda upgrade conda  should not be necessary
because  --all  includes the conda package itself, but some users have encountered
errors without it.

In the rest of this lesson, I'll be asking you to use commands in your terminal. I highly
suggest you start working with Anaconda this way, then later use the GUI if you'd like.

Troubleshooting
If you are seeing a "conda command not found" error and are using ZShell, you
have to do the following:

 Add  export PATH="/Users/username/anaconda/bin:$PATH"  to your .zsh_config file.


05. Managing packages
Managing Packages
Once you have Anaconda installed, managing packages is fairly straightforward. To
install a package, type  conda install package_name  in your terminal. For example, to
install numpy, type  conda install numpy .

You can install multiple packages at the same time. Something like  conda install
numpy scipy pandas  will install all those packages simultaneously. It's also possible to
specify which version of a package you want by adding the version number such
as  conda install numpy=1.10 .

Conda also automatically installs dependencies for you. For example,  scipy  depends
on  numpy : it uses and requires it. If you install just  scipy  ( conda install scipy ),
Conda will also install  numpy  if it isn't already installed.

Most of the commands are pretty intuitive. To uninstall, use  conda remove
package_name . To update a package, use  conda update package_name . If you want to
update all packages in an environment, which is often useful, use  conda update
--all . And finally, to list installed packages, it's  conda list , which you've seen
before.

If you don't know the exact name of the package you're looking for, you can try
searching with  conda search *search_term* . For example, I know I want to
install Beautiful Soup, but I'm not sure of the exact package name. So, I try  conda
search *beautifulsoup* . Note that your shell might expand the wildcard  *  before
running the conda command. To fix this, wrap the search string in single or double
quotes like  conda search '*beautifulsoup*' .
It returns a list of the Beautiful Soup packages available with the appropriate package
name,  beautifulsoup4 .

Which of these commands would you use to install the
packages  numpy  and  pandas  with conda? (More than one might be correct - select all
that apply.)

 
conda install numpy

 
conda install pandas

 
conda install numpy pandas

SOLUTION:

 `conda install pandas`
 `conda install numpy pandas`

07. More environment actions
Saving and loading environments
A really useful feature is sharing environments so others can install all the packages
used in your code, with the correct versions. You can save the packages to a YAML file
with  conda env export > environment.yaml . The first part  conda env export  writes
out all the packages in the environment, including the Python version.

Exported environment printed to the terminal

Above you can see the name of the environment and all the dependencies (along with
versions) are listed. The second part of the export command,  >
environment.yaml  writes the exported text to a YAML file  environment.yaml . This file
can now be shared and others will be able to create the same environment you used
for the project.

To create an environment from an environment file use  conda env create -f
environment.yaml . This will create a new environment with the same name listed
in  environment.yaml .

Listing environments
If you forget what your environments are named (happens to me sometimes),
use  conda env list  to list out all the environments you've created. You should see a
list of environments; there will be an asterisk next to the environment you're currently
in. The default environment, the environment used when you aren't in one, is
called  root .

Removing environments
If there are environments you don't use anymore,  conda env remove -n env_name  will
remove the specified environment (here, named  env_name ).

https://docs.getpelican.com/en/stable/

Pelican 4.2.0
Pelican is a static site generator, written in Python. Highlights include:

 Write your content directly with your editor of choice
in reStructuredText or Markdown formats
 Includes a simple CLI tool to (re)generate your site
 Easy to interface with distributed version control systems and web hooks
 Completely static output is easy to host anywhere

Ready to get started? Check out the Quickstart guide.


08. Best practices
Best practices
Using environments
One thing that’s helped me tremendously is having separate environments for Python 2 and
Python 3. I used  conda create -n py2 python=2  and  conda create -n py3 python=3  to
create two separate environments,  py2  and  py3 . Now I have a general use environment for
each Python version. In each of those environments, I've installed most of the standard data
science packages (numpy, scipy, pandas, etc.). Remember that when you set up an
environment initially, you'll only start with the standard packages and whatever packages you
specify in your  conda create  statement.

I’ve also found it useful to create environments for each project I’m working on. It works great
for non-data related projects too like web apps with Flask. For example, I have an environment
for my personal blog using Pelican.

Sharing environments
When sharing your code on GitHub, it's good practice to make an environment file and include
it in the repository. This will make it easier for people to install all the dependencies for your
code. I also usually include a pip  requirements.txt  file using  pip freeze  (learn more
here) for people not using conda.

More to learn
To learn more about conda and how it fits in the Python ecosystem, check out this article by
Jake Vanderplas: Conda myths and misconceptions. And here's the conda documentation you
can reference later.
09. On Python versions at Udacity
Python versions at Udacity
Most Nanodegree programs at Udacity will be (or are already) using Python 3 almost
exclusively.

Why we're using Python 3

 Jupyter is switching to Python 3 only
 Python 2.7 is being retired
 Python 3 has been out for almost 10 years, and there are very few dependencies (and none in
this program) that are incompatible.

At this point, there are enough new features in Python 3 that it doesn't make much sense to
stick with Python 2 unless you're working with old code. All new Python code should be
written for version 3. Read more here.

The main breakage between Python 2 and 3


For the most part, Python 2 code will work with Python 3. Of course, most new features
introduced with Python 3 versions won't be backwards compatible. The place where your
Python 2 code will fail most often is the  print  statement.

For most of Python's history including Python 2, printing was done like so:

print "Hello", "world!"
> Hello world!
This was changed in Python 3 to a function.

print("Hello", "world!")
> Hello world!
The  print  function was back-ported to Python 2 in version 2.6 through
the  __future__  module:

# In Python 2.6+
from __future__ import print_function
print("Hello", "world!")
> Hello world!
The  print  statement doesn't work in Python 3. If you want to print something and have it
work in both Python versions, you'll need to import  print_function  in your Python 2 code.

Jupyter Notebooks
 01. Instructor
 02. What are Jupyter notebooks?
 03. Installing Jupyter Notebook
 04. Launching the notebook server
 05. Notebook interface
 06. Code cells
 07. Markdown cells
 08. Keyboard shortcuts
 09. Magic keywords
 10. Converting notebooks
 11. Creating a slideshow
 12. Finishing up
02. What are Jupyter notebooks?
Jupyter

What are Jupyter notebooks?
Welcome to this lesson on using Jupyter notebooks. The notebook is a web application that
allows you to combine explanatory text, math equations, code, and visualizations all in one
easily sharable document. For example, here's one of my favorite notebooks shared recently,
the analysis of gravitational waves from two colliding black holes detected by the LIGO
experiment. You could download the data, run the code in the notebook, and repeat the
analysis, in effect detecting the gravitational waves yourself!

Notebooks have quickly become an essential tool when working with data. You'll find them
being used for data cleaning and exploration, visualization, machine learning, and big data
analysis. Here's an example notebook I made for my personal blog that shows off many of the
features of notebooks. Typically you'd be doing this work in a terminal, either the normal
Python shell or with IPython. Your visualizations would be in separate windows, any
documentation would be in separate documents, along with various scripts for functions and
classes. However, with notebooks, all of these are in one place and easily read together.

Notebooks are also rendered automatically on GitHub. It’s a great feature that lets you easily
share your work. There is also http://nbviewer.jupyter.org/ that renders the notebooks from
your GitHub repo or from notebooks stored elsewhere.

Literate programming
Notebooks are a form of literate programming proposed by Donald Knuth in 1984. With
literate programming, the documentation is written as a narrative alongside the code instead of
sitting off by its own. In Donald Knuth's words,

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate
rather on explaining to human beings what we want a computer to do.

After all, code is written for humans, not for computers. Notebooks provide exactly this
capability. You are able to write documentation as narrative text, along with code. This is not
only useful for the people reading your notebooks, but for your future self coming back to the
analysis.

Just a small aside: recently, this idea of literate programming has been extended to a whole
programming language, Eve.

How notebooks work


Jupyter notebooks grew out of the IPython project started by Fernando Perez. IPython is an
interactive shell, similar to the normal Python shell but with great features like syntax
highlighting and code completion. Originally, notebooks worked by sending messages from
the web app (the notebook you see in the browser) to an IPython kernel (an IPython
application running in the background). The kernel executed the code, then sent it back to the
notebook. The current architecture is similar, drawn out below.

From Jupyter documentation

The central point is the notebook server. You connect to the server through your browser and
the notebook is rendered as a web app. Code you write in the web app is sent through the
server to the kernel. The kernel runs the code and sends it back to the server, then any output is
rendered back in the browser. When you save the notebook, it is written to the server as a
JSON file with a  .ipynb  file extension.

The great part of this architecture is that the kernel doesn't need to run Python. Since the
notebook and the kernel are separate, code in any language can be sent between them. For
example, two of the earlier non-Python kernels were for the R and Julia languages. With an R
kernel, code written in R will be sent to the R kernel where it is executed, exactly the same as
Python code running on a Python kernel. IPython notebooks were renamed because notebooks
became language agnostic. The new name Jupyter comes from the combination
of Julia, Python, and R. If you're interested, here's a list of available kernels.

Another benefit is that the server can be run anywhere and accessed via the internet. Typically
you'll be running the server on your own machine where all your data and notebook files are
stored. But, you could also set up a server on a remote machine or cloud instance like
Amazon's EC2. Then, you can access the notebooks in your browser from anywhere in the
world.
03. Installing Jupyter Notebook
Installing Jupyter Notebook
By far the easiest way to install Jupyter is with Anaconda. Jupyter notebooks automatically
come with the distribution. You'll be able to use notebooks from the default environment.

To install Jupyter notebooks in a conda environment, use  conda install jupyter
notebook .

Jupyter notebooks are also available through pip with  pip install jupyter notebook .
04. Launching the notebook server
Launching the notebook server
To start a notebook server, enter  jupyter notebook  in your terminal or console. This will
start the server in the directory you ran the command in. That means any notebook files will be
saved in that directory. Typically you'd want to start the server in the directory where your
notebooks live. However, you can navigate through your file system to where the notebooks
are.

When you run the command (try it yourself!), the server home should open in your browser.
By default, the notebook server runs at  http://localhost:8888 . If you aren't familiar with
this,  localhost  means your computer and  8888  is the port the server is communicating on.
As long as the server is still running, you can always come back to it by going to
http://localhost:8888 in your browser.

If you start another server, it'll try to use port  8888 , but since it is occupied, the new server
will run on port  8889 . Then, you'd connect to it at  http://localhost:8889 . Every
additional notebook server will increment the port number like this.

If you tried starting your own server, it should look something like this:
05. Notebook interface
Notebook interface
When you create a new notebook, you should see something like this:
06. Code cells
Code cells
Most of your work in notebooks will be done in code cells. This is where you write your code
and it gets executed. In code cells you can write any code, assigning variables, defining
functions and classes, importing packages, and more. Any code executed in one cell is
available in all other cells.

To give you some practice, I created a notebook you can work through. Download the
notebook Working With Code Cells below then run it from your own notebook server. (In
your terminal, change to the directory with the notebook file, then enter  jupyter notebook )
Your browser might try to open the notebook file without downloading it. If that happens,
right click on the link then choose "Save Link As…"

07. Markdown cells
Markdown cells
As mentioned before, cells can also be used for text written in Markdown. Markdown is a
formatting syntax that allows you to include links, style text as bold or italicized, and format
code. As with code cells, you press Shift + Enter or Control + Enter to run the Markdown
cell, where it will render the Markdown to formatted text. Including text allows you to write a
narrative alongside your code, as well as documenting your code and the thoughts that went
into it.

You can find the documentation here, but I'll provide a short primer.

Headers
You can write headers using the pound/hash/octothorpe symbol  #  placed before the text.
One  #  renders as an  h1  header, two  # s render as an  h2 , and so on. It looks like this:

# Header 1
## Header 2
### Header 3
renders as

Header 1
Header 2
Header 3

Links
Linking in Markdown is done by enclosing text in square brackets and the URL in
parentheses, like this  [Udacity's home page](https://www.udacity.com)  for a link
to Udacity's home page.

Emphasis
You can add emphasis through bold or italics with asterisks or underscores ( *  or  _ ). For
italics, wrap the text in one asterisk or underscore,  _gelato_  or  *gelato*  renders as gelato.

Bold text uses two symbols,  **aardvark**  or  __aardvark__  looks like aardvark.

Either asterisks or underscores are fine as long as you use the same symbol on both sides of
the text.

Code
There are two different ways to display code, inline with text and as a code block separated
from the text. To format inline code, wrap the text in backticks. For
example,  `string.punctuation`  renders as  string.punctuation .
To create a code block, start a new line and wrap the text in three backticks

```
import requests
response = requests.get('https://www.udacity.com')
```
or indent each line of the code block with four spaces.

import requests
response = requests.get('https://www.udacity.com')
Math expressions
You can create math expressions in Markdown cells using LaTeX symbols. Notebooks use
MathJax to render the LaTeX symbols as math symbols. To start math mode, wrap the LaTeX
in dollar signs  $y = mx + b$  for inline math. For a math block, use double dollar signs,

$$
y = \frac{a}{b+c}
$$
This is a really useful feature, so if you don't have experience with LaTeX please read this
primer on using it to create math expressions.

Wrapping up
Here's a cheatsheet you can use as a reference for writing Markdown. My advice is to make
use of the Markdown cells. Your notebooks will be much more readable compared to a bunch
of code blocks.
08. Keyboard shortcuts
Keyboard shortcuts
Notebooks come with a bunch of keyboard shortcuts that let you use your keyboard to interact
with the cells, instead of using the mouse and toolbars. They take a bit of time to get used to,
but when you're proficient with the shortcuts you'll be much faster at working in notebooks.
To learn more about the shortcuts and get practice using them, download the
notebook Keyboard Shortcuts below. Again, your browser might try to open it, but you want
to save it to your computer. Right click on the link, then choose "Save Link As…"
09. Magic keywords
Magic keywords
Magic keywords are special commands you can run in cells that let you control the notebook
itself or perform system calls such as changing directories. For example, you can set up
matplotlib to work interactively in the notebook with  %matplotlib .

Magic commands are preceded with one or two percent signs ( %  or  %% ) for line magics and
cell magics, respectively. Line magics apply only to the line the magic command is written on,
while cell magics apply to the whole cell.

NOTE: These magic keywords are specific to the normal Python kernel. If you are using other
kernels, these most likely won't work.

Timing code
At some point, you'll probably spend some effort optimizing code to run faster. Timing how
quickly your code runs is essential for this optimization. You can use the  timeit  magic
command to time how long it takes for a function to run, like so:
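The screenshot that followed this paragraph isn't reproduced here. As a rough sketch of what  %timeit  measures, here is a plain-Python equivalent using the standard-library  timeit  module that the magic wraps (the  fibo  function is a hypothetical example, not from the course):

```python
from timeit import timeit

def fibo(n):
    # Naive recursive Fibonacci -- deliberately slow, handy for timing demos
    return n if n < 2 else fibo(n - 1) + fibo(n - 2)

# Time 100 calls of fibo(20) and report the average time per call
total_seconds = timeit('fibo(20)', globals=globals(), number=100)
print('average per call: {:.6f} s'.format(total_seconds / 100))
```

In a notebook cell you would instead write the line magic directly, e.g.  %timeit fibo(20) , and it picks the number of runs for you.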
10. Converting notebooks
Converting notebooks
Notebooks are just big JSON files with the extension  .ipynb .

Notebook file opened in a text editor shows JSON data
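Because the format is just JSON, you can open and inspect a notebook with nothing but Python's  json  module. A minimal sketch (the notebook below is a hand-written, simplified nbformat-4 skeleton, not a real course file):

```python
import json

# A tiny, simplified nbformat-4 notebook written out as a JSON string
nb_json = '''{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Hello"]},
    {"cell_type": "code", "execution_count": null, "metadata": {},
     "outputs": [], "source": ["print('hi')"]}
  ]
}'''

nb = json.loads(nb_json)
print(nb['nbformat'])                               # 4
print([cell['cell_type'] for cell in nb['cells']])  # ['markdown', 'code']
```

The same  json.loads  call works on the contents of any real  .ipynb  file, which is essentially how tools like  nbconvert  read them.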

Since notebooks are JSON, it is simple to convert them to other formats. Jupyter comes with a
utility called  nbconvert  for converting to HTML, Markdown, slideshows, etc.

For example, to convert a notebook to an HTML file, in your terminal use

jupyter nbconvert --to html notebook.ipynb


Converting to HTML is useful for sharing your notebooks with others who aren't using
notebooks. Markdown is great for including a notebook in blogs and other text editors that
accept Markdown formatting.
As always, learn more about  nbconvert  from the documentation.
11. Creating a slideshow
Creating a slideshow
Creating slideshows from notebooks is one of my favorite features. You can see an example of a
slideshow here introducing pandas for working with data.

The slides are created in notebooks like normal, but you'll need to designate which cells are
slides and the type of slide the cell will be. In the menu bar, click View > Cell Toolbar >
Slideshow to bring up the slide cell menu on each cell.

Turning on Slideshow toolbars for cells

This will show a menu dropdown on each cell that lets you choose how the cell shows up in
the slideshow.
Choose slide type

Slides are full slides that you move through left to right. Sub-slides show up in the slideshow
by pressing up or down. Fragments are hidden at first, then appear with a button press. You
can skip cells in the slideshow with Skip and Notes leaves the cell as speaker notes.

Running the slideshow
To create the slideshow from the notebook file, you'll need to use  nbconvert :

jupyter nbconvert notebook.ipynb --to slides


This just converts the notebook to the necessary files for the slideshow, but you need to serve
it with an HTTP server to actually see the presentation.

To convert it and immediately see it, use

jupyter nbconvert notebook.ipynb --to slides --post serve


This will open up the slideshow in your browser so you can present it.
12. Finishing up
Congratulations!
You've made it to the end of this short course on tools in the Python data science workflow.
Making good use of Anaconda and Jupyter Notebooks will increase your productivity and
general well-being. There is a lot to learn to get the most out of these, Markdown and LaTeX
for instance, but after a bit you'll be wondering why data analysis is done any other way.

Again, congratulations and good luck!

Part 1 >> Lesson 5: Matrix Math and NumPy Refresher

 01. Introduction
 02. Data Dimensions
 03. Data in NumPy
 04. Element-wise Matrix Operations
 05. Element-wise Operations in NumPy
 06. Matrix Multiplication: Part 1
 07. Matrix Multiplication: Part 2
 08. NumPy Matrix Multiplication
 09. Matrix Transposes
 10. Transposes in NumPy
 11. NumPy Quiz
Part 02_Neural Networks

Introduction to Neural Networks

 01. Instructor
 02. Introduction
 03. Classification Problems 1
 04. Classification Problems 2
 05. Linear Boundaries
 06. Higher Dimensions
 07. Perceptrons
 08. Why "Neural Networks"?
 09. Perceptrons as Logical Operators
 10. Perceptron Trick
 11. Perceptron Algorithm
 12. Non-Linear Regions
 13. Error Functions
 14. Log-loss Error Function
 15. Discrete vs Continuous
 16. Softmax
 17. One-Hot Encoding
 18. Maximum Likelihood
 19. Maximizing Probabilities
 20. Cross-Entropy 1
 21. Cross-Entropy 2
 22. Multi-Class Cross Entropy
 23. Logistic Regression
 24. Gradient Descent
 25. Logistic Regression Algorithm
 26. Pre-Lab: Gradient Descent
 27. Notebook: Gradient Descent
 28. Perceptron vs Gradient Descent
 29. Continuous Perceptrons
 30. Non-linear Data
 31. Non-Linear Models
 32. Neural Network Architecture
 33. Feedforward
 34. Backpropagation
 35. Pre-Lab: Analyzing Student Data
 36. Notebook: Analyzing Student Data
 37. Outro
00:00:00.000 --> 00:00:04.040
So you may be wondering why these objects are called neural networks.

00:00:04.040 --> 00:00:06.059
Well, the reason why they're called neural networks is

00:00:06.059 --> 00:00:08.759
because perceptrons kind of look like neurons in the brain.

00:00:08.759 --> 00:00:11.525
On the left we have a perceptron with four inputs.

00:00:11.525 --> 00:00:12.809
The numbers are one, zero,

00:00:12.808 --> 00:00:14.689
four, and minus two.

00:00:14.689 --> 00:00:15.990
And what the perceptron does,

00:00:15.990 --> 00:00:20.504
it calculates some equations on the input and decides to return a one or a
zero.

00:00:20.504 --> 00:00:24.839
In a similar way, neurons in the brain take inputs coming from the dendrites.

00:00:24.838 --> 00:00:27.058
These inputs are nerve impulses.

00:00:27.059 --> 00:00:30.464
So what the neuron does is it does something with the nerve impulses

00:00:30.463 --> 00:00:35.054
and then it decides if it outputs a nerve impulse or not through the axon.

00:00:35.054 --> 00:00:38.070
The way we'll create neural networks later in this lesson

00:00:38.070 --> 00:00:40.649
is by concatenating these perceptrons, so we'll be mimicking

00:00:40.649 --> 00:00:43.200
the way the brain connects neurons by taking the output from

00:00:43.200 --> 00:00:46.130
one and turning it into the input for another one.
09. Perceptrons as Logical Operators
Perceptrons as Logical Operators
In this lesson, we'll see one of the many great applications of perceptrons. As logical
operators! You'll have the chance to create the perceptrons for the most common of these,
the AND, OR, and NOT operators. And then, we'll see what to do about the
elusive XOR operator. Let's dive in!

AND Perceptron
AND and OR Perceptrons
What are the weights and bias for the AND perceptron?
Set the weights ( weight1 ,  weight2 ) and bias ( bias ) to the correct values that
calculate the AND operation as shown above.

Start Quiz:

import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = 0.0
bias = 0.0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', ' Input 2', ' Linear Combination', ' Activation Output', ' Is Correct'])
if not num_wrong:
    print('Nice! You got it all correct.\n')
else:
    print('You got {} wrong. Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))
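If you want to check your answer, here is one possible weight setting (many others work just as well). With these values the linear combination is non-negative only when both inputs are 1:

```python
# One possible (not the only) solution to the AND quiz above: the perceptron
# fires only when weight1*x1 + weight2*x2 + bias >= 0, i.e. only for (1, 1).
weight1, weight2, bias = 1.0, 1.0, -2.0

and_results = {}
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    linear_combination = weight1 * x1 + weight2 * x2 + bias
    and_results[(x1, x2)] = int(linear_combination >= 0)

print(and_results)  # only (1, 1) maps to 1
```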

The OR perceptron is very similar to an AND perceptron. In the image below, the OR
perceptron has the same line as the AND perceptron, except the line is shifted down.
What can you do to the weights and/or bias to achieve this? Use the following AND
perceptron to create an OR Perceptron.
OR Perceptron Quiz

What are two ways to go from an AND perceptron to an OR perceptron?

 
Increase the weights
 
Decrease the weights

 
Increase a single weight

 
Decrease a single weight

 
Increase the magnitude of the bias

 
Decrease the magnitude of the bias
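Both correct adjustments can be checked numerically. Below is a minimal sketch, starting from an illustrative AND perceptron (weights 1, 1 and bias -2 — one valid choice, not the only one):

```python
def perceptron(x1, x2, w1, w2, b):
    # Fires when the linear combination is non-negative.
    return int(w1 * x1 + w2 * x2 + b >= 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND with illustrative weights 1, 1 and bias -2.
print([perceptron(x1, x2, 1.0, 1.0, -2.0) for x1, x2 in inputs])  # [0, 0, 0, 1]

# Decreasing the magnitude of the bias gives OR...
print([perceptron(x1, x2, 1.0, 1.0, -1.0) for x1, x2 in inputs])  # [0, 1, 1, 1]

# ...and so does increasing the weights.
print([perceptron(x1, x2, 2.0, 2.0, -2.0) for x1, x2 in inputs])  # [0, 1, 1, 1]
```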
NOT Perceptron
Unlike the other perceptrons we looked at, the NOT operation only cares about one
input. The operation returns a  0  if the input is  1  and a  1  if it's a  0 . The other inputs
to the perceptron are ignored.

In this quiz, you'll set the weights ( weight1 ,  weight2 ) and bias  bias  to the values
that calculate the NOT operation on the second input, ignoring the first input.

Start Quiz:
quiz.py
import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = 0.0
bias = 0.0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', ' Input 2', ' Linear Combination', ' Activation Output', ' Is Correct'])
if not num_wrong:
    print('Nice! You got it all correct.\n')
else:
    print('You got {} wrong. Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))
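Again, one possible solution (among many): a zero first weight makes the perceptron ignore the first input, and a negative second weight with a small positive bias makes it fire exactly when the second input is 0.

```python
# One possible (not unique) solution to the NOT quiz above.
weight1, weight2, bias = 0.0, -2.0, 1.0

not_results = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    not_results.append(int(weight1 * x1 + weight2 * x2 + bias >= 0))

print(not_results)  # [1, 0, 1, 0] -- NOT of the second input
```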
XOR Perceptron
Quiz: Build an XOR Multi-Layer Perceptron
Now, let's build a multi-layer perceptron from the AND, NOT, and OR perceptrons to
create XOR logic!

The neural network below contains 3 perceptrons, A, B, and C. The last one (AND) has
been given for you. The input to the neural network is from the first node. The output
comes out of the last node.

The multi-layer perceptron below calculates XOR. Each perceptron is a logic operation
of AND, OR, and NOT. However, the perceptrons A, B, and C don't indicate their
operation. In the following quiz, set the correct operations for the perceptrons to
calculate XOR.

QUIZ QUESTION::

Set the operations for the perceptrons in the XOR neural network.

ANSWER CHOICES:
A

 
B

 
C

Perceptron Operators
NOT
AND
OR
SOLUTION:
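Once you've chosen the operations, the composition can be verified in code. Below is a minimal sketch; the weight values are illustrative choices (not the only valid ones), and it uses the identity XOR(a, b) = AND(NOT(AND(a, b)), OR(a, b)):

```python
def and_p(x1, x2):
    # Illustrative AND perceptron: fires only for (1, 1).
    return int(1.0 * x1 + 1.0 * x2 - 2.0 >= 0)

def or_p(x1, x2):
    # Illustrative OR perceptron: fires for any input containing a 1.
    return int(1.0 * x1 + 1.0 * x2 - 1.0 >= 0)

def not_p(x):
    # Illustrative NOT perceptron on a single input.
    return int(-2.0 * x + 1.0 >= 0)

def xor(x1, x2):
    # XOR(a, b) = AND(NOT(AND(a, b)), OR(a, b))
    return and_p(not_p(and_p(x1, x2)), or_p(x1, x2))

print([xor(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```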
udacimak v1.1.3
10. Perceptron Trick
Perceptron Trick
In the last section you used your logic and your mathematical knowledge to create perceptrons
for some of the most common logical operators. In real life, though, we can't be building these
perceptrons ourselves. The idea is that we give them the result, and they build themselves. For
this, here's a pretty neat trick that will help us.

Perceptron Algorithm


Does the misclassified point want the line to be closer or farther?

 
Closer

 
Farther

SOLUTION:Closer
DL 10 S Perceptron Algorithm

Time for some math!

Now that we've learned that the misclassified points want the line to move closer to
them, let's do some math. The following video shows a mathematical trick that modifies the
equation of the line, so that it comes closer to a particular point.
11. Perceptron Algorithm
Perceptron Algorithm
And now, with the perceptron trick in our hands, we can fully develop the perceptron
algorithm! The following video will show you the pseudocode, and in the quiz below, you'll
have the chance to code it in Python.

Perceptron Algorithm Pseudocode


00:00:00.000 --> 00:00:04.995
Now, we finally have all the tools for describing the perceptron algorithm.

00:00:04.995 --> 00:00:06.410


We start with the random equation,

00:00:06.410 --> 00:00:07.895


which will determine some line,

00:00:07.894 --> 00:00:11.494


and two regions, the positive and the negative region.

00:00:11.494 --> 00:00:14.804


Now, we'll move this line around to get a better and better fit.

00:00:14.804 --> 00:00:17.265


So, we ask all the points how they're doing.

00:00:17.265 --> 00:00:20.964


The four correctly classified points say, "I'm good."

00:00:20.964 --> 00:00:25.890


And the two incorrectly classified points say, "Come closer."

00:00:25.890 --> 00:00:28.088


So, let's listen to the point on the right,

00:00:28.088 --> 00:00:31.484


and apply the trick to make the line closer to this point.

00:00:31.484 --> 00:00:34.704


So, here it is. Now, this point is good.

00:00:34.704 --> 00:00:36.869


Now, let's listen to the point on the left.

00:00:36.869 --> 00:00:38.349


The point says, "Come closer."

00:00:38.350 --> 00:00:39.770


We apply the trick,

00:00:39.770 --> 00:00:41.685


and now the line goes closer to it,

00:00:41.685 --> 00:00:45.094


and it actually goes over it classifying correctly.

00:00:45.094 --> 00:00:48.484


Now, every point is correctly classified and happy.

00:00:48.484 --> 00:00:52.670


So, let's actually write the pseudocode for this perceptron algorithm.

00:00:52.670 --> 00:00:53.780


We start with random weights,

00:00:53.780 --> 00:00:55.640


w1 up to wn and b.

00:00:55.640 --> 00:00:57.774


This gives us the equation wx plus b,

00:00:57.774 --> 00:01:02.004


the line, and the positive and negative areas.

00:01:02.005 --> 00:01:05.822


Now, for every misclassified point with coordinates x1 up to xn,

00:01:05.822 --> 00:01:07.740


we do the following.

00:01:07.739 --> 00:01:09.184


If the prediction was zero,

00:01:09.185 --> 00:01:12.879


which means the point is a positive point in the negative area,

00:01:12.879 --> 00:01:16.490


then we'll update the weights as follows: for i equals 1 to n,

00:01:16.489 --> 00:01:21.049


we change wi, to wi plus alpha times xi,

00:01:21.049 --> 00:01:23.664


where alpha is the learning rate.

00:01:23.665 --> 00:01:26.060


In this case, we're using 0.1.

00:01:26.060 --> 00:01:28.659


Sometimes, we use 0.01 etc.

00:01:28.659 --> 00:01:33.840


It depends. Then we also change the bias unit b to b plus alpha.

00:01:33.840 --> 00:01:38.024


That moves the line closer to the misclassified point.

00:01:38.024 --> 00:01:39.700


Now, if the prediction was one,

00:01:39.700 --> 00:01:42.415


which means a point is a negative point in the positive area,

00:01:42.415 --> 00:01:44.650


then we'll update the weights in a similar way,

00:01:44.650 --> 00:01:46.950


except we subtract instead of adding.

00:01:46.950 --> 00:01:50.545


This means for i equals 1 to n, change wi,

00:01:50.545 --> 00:01:53.299


to wi minus alpha xi,

00:01:53.299 --> 00:01:57.995


and change the bias unit b to b minus alpha.

00:01:57.995 --> 00:02:01.770


And now, the line moves closer to our misclassified point.

00:02:01.769 --> 00:02:05.024


And now, we just repeat this step until we get no errors,

00:02:05.025 --> 00:02:07.425


or until we have a number of errors that is small.

00:02:07.424 --> 00:02:08.564


Or simply we can just say,

00:02:08.564 --> 00:02:11.520


do the step a thousand times and stop.

00:02:11.520 --> 00:02:14.000


We'll see what are our options later in the class.
Coding the Perceptron Algorithm
Time to code! In this quiz, you'll have the chance to implement the perceptron
algorithm to separate the following data (given in the file data.csv).

Recall that the perceptron step works as follows. For a point with coordinates
(p, q), label y, and prediction given by the equation ŷ = step(w1x1 + w2x2 + b):

 If the point is correctly classified, do nothing.

 If the point is classified positive, but it has a negative label, subtract
αp, αq, and α from w1, w2, and b respectively.

 If the point is classified negative, but it has a positive label, add
αp, αq, and α to w1, w2, and b respectively.
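The three rules above can be sketched directly as a single update step (a minimal illustration, assuming labels y in {0, 1} and learning rate alpha):

```python
def perceptron_trick(w1, w2, b, p, q, y, alpha=0.1):
    # Prediction from the current line.
    y_hat = int(w1 * p + w2 * q + b >= 0)
    if y - y_hat == 1:     # positive point classified negative: add
        w1, w2, b = w1 + alpha * p, w2 + alpha * q, b + alpha
    elif y - y_hat == -1:  # negative point classified positive: subtract
        w1, w2, b = w1 - alpha * p, w2 - alpha * q, b - alpha
    return w1, w2, b

# A positive point at (1, 1) misclassified by the line x1 + x2 - 3 = 0:
print(perceptron_trick(1.0, 1.0, -3.0, 1.0, 1.0, 1))  # line moves toward the point
```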

Then click on  test run  to graph the solution that the perceptron algorithm gives you.
It'll actually draw a set of dotted lines that show how the algorithm approaches the
best solution, given by the black solid line.

Feel free to play with the parameters of the algorithm (number of epochs, learning
rate, and even the randomizing of the initial parameters) to see how your initial
conditions can affect the solution!
Start Quiz:
perceptron.py  data.csv  solution.py
import numpy as np

# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(42)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X, W) + b)[0])

# TODO: Fill in the code below to implement the perceptron trick.
# The function should receive as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# update the weights and bias W, b, according to the perceptron algorithm,
# and return W and b.
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    # Fill in code
    return W, b

# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 25):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2, 1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
        boundary_lines.append((-W[0]/W[1], -b/W[1]))
    return boundary_lines
solution.py
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y[i] - y_hat == 1:
            W[0] += X[i][0] * learn_rate
            W[1] += X[i][1] * learn_rate
            b += learn_rate
        elif y[i] - y_hat == -1:
            W[0] -= X[i][0] * learn_rate
            W[1] -= X[i][1] * learn_rate
            b -= learn_rate
    return W, b
14. Log-loss Error Function
Error Functions

We pick back up on log-loss error with the gradient descent concept.

Which of the following conditions should be met in order to apply gradient descent?
(Check all that apply.)

 
The error function should be discrete

 
The error function should contain only positive values

 
The error function should be differentiable

 
The error function should be normalized

 
The error function should be continuous

SOLUTION:

 The error function should be differentiable


 The error function should be continuous

15. Discrete vs Continuous
Discrete vs Continuous Predictions
In the last few videos, we learned that continuous error functions are better than discrete error
functions when it comes to optimizing. For this, we need to switch from discrete to
continuous predictions. The next two videos will guide us in doing that.
16. Softmax
Multi-Class Classification and Softmax

The Softmax Function


In the next video, we'll learn about the softmax function, which is the equivalent of the
sigmoid activation function, but when the problem has 3 or more classes.

Softmax Quiz

What function turns every number into a positive number?


 
sin

 
cos

 
log

 
exp

SOLUTION:exp
Quiz: Coding Softmax
And now, your time to shine! Let's code the formula for the Softmax function in
Python.

Start Quiz:
softmax.py solution.py
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    pass

import numpy as np

def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    result = []
    for i in expL:
        result.append(i * 1.0 / sumExpL)
    return result

# Note: The function np.divide can also be used here, as follows:
# def softmax(L):
#     expL = np.exp(L)
#     return np.divide(expL, expL.sum())
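A quick sanity check of the solution above: softmax outputs are all positive and sum to 1, with the largest score getting the largest probability (the input scores here are just example values):

```python
import numpy as np

def softmax(L):
    expL = np.exp(L)
    return [i * 1.0 / sum(expL) for i in expL]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0
```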
18. Maximum Likelihood
Maximum Likelihood
Probability will be one of our best friends as we go through Deep Learning. In this lesson,
we'll see how we can use probability to evaluate (and improve!) our models.

00:00:00.000 --> 00:00:02.859


So we're still in our quest for an algorithm that will help

00:00:02.859 --> 00:00:05.525


us pick the best model that separates our data.

00:00:05.525 --> 00:00:09.984


Well, since we're dealing with probabilities then let's use them in our
favor.

00:00:09.984 --> 00:00:12.894


Let's say I'm a student and I have two models.

00:00:12.894 --> 00:00:16.300


One that tells me that my probability of getting accepted is
00:00:16.300 --> 00:00:20.859
80% and one that tells me the probability is 55%.

00:00:20.859 --> 00:00:22.855


Which model looks more accurate?

00:00:22.855 --> 00:00:24.879


Well, if I got accepted then I'd say

00:00:24.879 --> 00:00:28.210


the better model is probably the one that says 80%.

00:00:28.210 --> 00:00:29.679


What if I didn't get accepted?

00:00:29.678 --> 00:00:33.983


Then the more accurate model is more likely the one that says 55 percent.

00:00:33.984 --> 00:00:37.469


But I'm just one person. What if it was me and a friend?

00:00:37.469 --> 00:00:40.789


Well, the best model would more likely be the one that

00:00:40.789 --> 00:00:44.506


gives the higher probabilities to the events that happened to us,

00:00:44.506 --> 00:00:47.085


whether it's acceptance or rejection.

00:00:47.085 --> 00:00:49.149


This sounds pretty intuitive.

00:00:49.149 --> 00:00:52.310


The method is called maximum likelihood.

00:00:52.310 --> 00:00:58.353


What we do is we pick the model that gives the existing labels the highest
probability.

00:00:58.353 --> 00:01:00.259


Thus, by maximizing the probability,

00:01:00.259 --> 00:01:02.000


we can pick the best possible model.
19. Maximizing Probabilities
Maximizing Probabilities
In this lesson and quiz, we will learn how to maximize a probability, using some math.
Nothing more than high school math, so get ready for a trip down memory lane!
20. Cross-Entropy 1
Cross Entropy 1

INSTRUCTOR NOTE:

Correction: At 2:18, the top right point should be labelled  -log(0.7)  instead of  -log(0.2) .
WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.250


Correct. The answer is logarithm,

00:00:02.250 --> 00:00:06.389


because logarithm has this very nice identity that says that the logarithm of
00:00:06.389 --> 00:00:11.929
the product A times B is the sum of the logarithms of A and B.

00:00:11.929 --> 00:00:13.294


So this is what we do.

00:00:13.294 --> 00:00:17.559


We take our products and we take the logarithms,

00:00:17.559 --> 00:00:21.854


so now we get a sum of the logarithms of the factors.

00:00:21.855 --> 00:00:28.219


So the ln(0.6*0.2*0.1*0.7) is equal to

00:00:28.219 --> 00:00:35.700


ln(0.6) + ln(0.2) + ln(0.1) + ln(0.7) etc. Now from now until the end of class,

00:00:35.700 --> 00:00:40.040


we'll be taking the natural logarithm which is base e instead of 10.

00:00:40.039 --> 00:00:41.759


Nothing different happens with base 10.

00:00:41.759 --> 00:00:44.945


Everything works the same as everything gets scaled by the same factor.

00:00:44.945 --> 00:00:46.770


So it's just more for convention.

00:00:46.770 --> 00:00:51.330


We can calculate those values and get minus 0.51, minus 1.61,

00:00:51.329 --> 00:00:58.164


minus 2.3 etc. Notice that they are all negative numbers and that actually makes sense.

00:00:58.164 --> 00:01:01.560


This is because the logarithm of a number between 0 and 1 is always

00:01:01.560 --> 00:01:05.594


a negative number since the logarithm of one is zero.

00:01:05.594 --> 00:01:07.789


So it actually makes sense to think of the negative of

00:01:07.790 --> 00:01:11.260


the logarithm of the probabilities and we'll get positive numbers.
00:01:11.260 --> 00:01:15.740
So that's what we'll do. We'll take the negative of the logarithm of the probabilities.

00:01:15.739 --> 00:01:18.905


The sum of the negatives of the logarithms of the probabilities,

00:01:18.905 --> 00:01:23.180


we'll call the cross-entropy, which is a very important concept in the class.

00:01:23.180 --> 00:01:25.385


If we calculate the cross entropies,

00:01:25.385 --> 00:01:30.255


we see that the bad model on left has a cross entropy 4.8 which is high.

00:01:30.254 --> 00:01:35.229


Whereas the good model on the right has a cross entropy of 1.2 which is low.

00:01:35.230 --> 00:01:37.454


This actually happens all the time.

00:01:37.454 --> 00:01:38.810


A good model will give us

00:01:38.810 --> 00:01:43.185


a low cross entropy and a bad model will give us a high cross entropy.

00:01:43.185 --> 00:01:44.629


The reason for this is simply that

00:01:44.629 --> 00:01:47.390


a good model gives us a high probability and the negative

00:01:47.390 --> 00:01:52.599


of the logarithm of a large number is a small number and vice versa.

00:01:52.599 --> 00:01:55.250


This method is actually much more powerful than we think.

00:01:55.250 --> 00:01:59.180


If we calculate the probabilities and pair the points with the corresponding logarithms,

00:01:59.180 --> 00:02:01.470


we actually get an error for each point.

00:02:01.469 --> 00:02:06.539


So again, here we have probabilities for both models and the products of them.
00:02:06.540 --> 00:02:09.944
Now, we take the negative of the logarithms which gives us sum of

00:02:09.944 --> 00:02:15.319


logarithms and if we pair each logarithm with the point where it came from,

00:02:15.319 --> 00:02:17.859


we actually get a value for each point.

00:02:17.860 --> 00:02:19.565


And if we calculate the values,

00:02:19.564 --> 00:02:22.185


we get this. Check it out.

00:02:22.185 --> 00:02:24.319


If we look carefully at the values we can see that

00:02:24.319 --> 00:02:26.430


the points that are misclassified have

00:02:26.430 --> 00:02:31.295


values like 2.3 for this point or 1.6 for this point,

00:02:31.294 --> 00:02:36.544


whereas the points that are correctly classified have small values.

00:02:36.544 --> 00:02:38.719


And the reason for this again is that

00:02:38.719 --> 00:02:42.604


a correctly classified point will have a probability that is close to 1,

00:02:42.604 --> 00:02:44.989


which when we take the negative of the logarithm,

00:02:44.990 --> 00:02:46.915


we'll get a small value.

00:02:46.914 --> 00:02:51.215


Thus we can think of the negatives of these logarithms as errors at each point.

00:02:51.215 --> 00:02:53.539


Points that are correctly classified will have

00:02:53.539 --> 00:02:57.594


small errors and points that are mis-classified will have large errors.
00:02:57.594 --> 00:03:02.530
And now we've concluded that our cross entropy will tell us if a model is good or bad.

00:03:02.530 --> 00:03:06.800


So now our goal has changed from maximizing a probability to minimizing

00:03:06.800 --> 00:03:12.580


a cross-entropy in order to get from the model on the left to the model on the right.

00:03:12.580 --> 00:03:14.655


And that error function that we're looking for,

00:03:14.655 --> 00:03:17.000


that was precisely the cross entropy.
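The per-point errors from the video can be reproduced directly. Using the probabilities the bad model assigns to the actual labels of its four points (0.6, 0.2, 0.1, 0.7, as in the transcript):

```python
import numpy as np

# Probabilities the bad model assigns to the actual label of each point.
probs = np.array([0.6, 0.2, 0.1, 0.7])

# The error at each point is the negative logarithm of its probability.
errors = -np.log(probs)
print(errors.round(2))         # misclassified points get large errors like 2.3
print(round(errors.sum(), 1))  # 4.8 -- the bad model's cross-entropy
```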
21. Cross-Entropy 2
Cross-Entropy
So we're getting somewhere, there's definitely a connection between probabilities and error
functions, and it's called Cross-Entropy. This concept is tremendously popular in many fields,
including Machine Learning. Let's dive more into the formula, and actually code it!

Formula For Cross 1



00:00:02.440 --> 00:00:07.140


Let's look a bit closer into Cross-Entropy by switching to a different example.

00:00:07.140 --> 00:00:08.935


Let's say we have three doors.

00:00:08.935 --> 00:00:11.330


And no this is not the Monty Hall problem.

00:00:11.330 --> 00:00:13.775


We have the green door, the red door,

00:00:13.775 --> 00:00:18.720


and the blue door, and behind each door we could have a gift or not have a gift.

00:00:18.720 --> 00:00:23.150


And the probabilities of there being a gift behind each door is 0.8 for the first one,

00:00:23.150 --> 00:00:24.935


0.7 for the second one,

00:00:24.935 --> 00:00:26.900


0.1 for the third one.

00:00:26.900 --> 00:00:29.805


So for example behind the green door

00:00:29.805 --> 00:00:33.155


there is an 80 percent probability of there being a gift,

00:00:33.155 --> 00:00:36.780


and a 20 percent probability of there not being a gift.

00:00:36.780 --> 00:00:39.610


So we can put the information in this table where

00:00:39.610 --> 00:00:42.970


the probabilities of there being a gift are given in the top row,

00:00:42.970 --> 00:00:46.630


and the probabilities of there not being a gift are given in the bottom row.

00:00:46.630 --> 00:00:49.180


So let's say we want to make a bet on the outcomes.

00:00:49.180 --> 00:00:53.375


So we want to try to figure out what is the most likely scenario here.

00:00:53.375 --> 00:00:56.880


And for that we'll assume they're independent events.

00:00:56.880 --> 00:00:59.870


In this case, the most likely scenario is just

00:00:59.870 --> 00:01:03.440


obtained by picking the largest probability in each column.

00:01:03.440 --> 00:01:06.875


So for the first door is more likely to have a gift than not have a gift.

00:01:06.875 --> 00:01:09.230


So we'll say there's a gift behind the first door.

00:01:09.230 --> 00:01:12.680


For the second door, it's also more likely that there's a gift.

00:01:12.680 --> 00:01:14.995


So we'll say there's a gift behind the second door.

00:01:14.995 --> 00:01:18.060


And for the third door it's much more likely that there's no gift,

00:01:18.060 --> 00:01:21.015


so we'll say there's no gift behind the third door.

00:01:21.015 --> 00:01:22.700


And as the events are independent,

00:01:22.700 --> 00:01:24.810


the probability for this whole arrangement is

00:01:24.810 --> 00:01:27.995


the product of the three probabilities which is 0.8,

00:01:27.995 --> 00:01:31.096


times 0.7, times 0.9,

00:01:31.096 --> 00:01:33.446


which ends up being 0.504,

00:01:33.446 --> 00:01:36.665


which is roughly 50 percent.

00:01:36.665 --> 00:01:39.680


So let's look at all the possible scenarios in the table.

00:01:39.680 --> 00:01:43.085


Here's a table with all the possible scenarios for each door

00:01:43.085 --> 00:01:46.940


and there are eight scenarios since each door gives us two possibilities each,

00:01:46.940 --> 00:01:48.815


and there are three doors.

00:01:48.815 --> 00:01:51.545


So we do as before to obtain the probability of

00:01:51.545 --> 00:01:57.245


each arrangement by multiplying the three independent probabilities to get these numbers.

00:01:57.245 --> 00:01:59.590


You can check that these numbers add to one.

00:01:59.590 --> 00:02:02.570


And from last video we learned that the negative

00:02:02.570 --> 00:02:05.955


of the logarithm of the probabilities is the cross-entropy.

00:02:05.955 --> 00:02:09.050


So let's go ahead and calculate the cross-entropy.

00:02:09.050 --> 00:02:12.080


And notice that the events with high probability have

00:02:12.080 --> 00:02:16.345


low cross-entropy and the events with low probability have high cross-entropy.

00:02:16.345 --> 00:02:19.130


For example, the second row which has probability of

00:02:19.130 --> 00:02:24.440


0.504 gives a small cross-entropy of 0.69,

00:02:24.440 --> 00:02:28.675


and the second to last row which is very very unlikely has a probability of

00:02:28.675 --> 00:02:34.441


0.006 gives a cross entropy a 5.12.

00:02:34.441 --> 00:02:37.573


So let's actually calculate a formula for the cross-entropy.

00:02:37.573 --> 00:02:39.215


Here we have our three doors,

00:02:39.215 --> 00:02:44.180


and our sample scenario said that there is a gift behind the first and second doors,

00:02:44.180 --> 00:02:46.445


and no gift behind the third door.

00:02:46.445 --> 00:02:49.370


Recall that the probabilities of these events happening

00:02:49.370 --> 00:02:52.189


are 0.8 for a gift behind the first door,

00:02:52.189 --> 00:02:54.665


0.7 for a gift behind the second door,

00:02:54.665 --> 00:02:57.915


and 0.9 for no gift behind the third door.

00:02:57.915 --> 00:02:59.510


So when we calculate the cross-entropy,

00:02:59.510 --> 00:03:03.622


we get the negative of the logarithm of the product,

00:03:03.622 --> 00:03:08.015


which is a sum of the negatives of the logarithms of the factors,

00:03:08.015 --> 00:03:14.070


which is negative logarithm of 0.8 minus logarithm of 0.7 minus logarithm 0.9.

00:03:14.070 --> 00:03:17.225


And in order to drive the formula we'll have some variables.

00:03:17.225 --> 00:03:20.885


So let's call P1 the probability that there's a gift behind the first door,

00:03:20.885 --> 00:03:24.110


P2 the probability there's a gift behind the second door,

00:03:24.110 --> 00:03:27.940


and P3 the probability there's a gift behind the third door.

00:03:27.940 --> 00:03:30.580


So this 0.8 here is P1,

00:03:30.580 --> 00:03:32.580


this 0.7 here is P2,

00:03:32.580 --> 00:03:35.370


and this 0.9 here is one minus P3.

00:03:35.370 --> 00:03:36.990


So the probability of there not being

00:03:36.990 --> 00:03:41.460


a gift is one minus the probability of there being a gift.

00:03:41.460 --> 00:03:43.785


Let's have another variable called Yi,

00:03:43.785 --> 00:03:46.980


which will be one if there's a present behind the ith door,

00:03:46.980 --> 00:03:49.750


and zero if there's no present.

00:03:49.750 --> 00:03:53.470


So Yi is technically a number of presents behind the ith door.

00:03:53.470 --> 00:03:55.442


In this case Y1 equals one,

00:03:55.442 --> 00:04:00.210


Y2 equals one, and Y3 equals zero.

00:04:00.210 --> 00:04:02.550


So we can put all this together and derive a formula

00:04:02.550 --> 00:04:05.355


for the cross-entropy and it's this sum.

00:04:05.355 --> 00:04:08.305


Now let's look at the formula inside the summation.

00:04:08.305 --> 00:04:12.155


Noted that if there is a present behind the ith door,

00:04:12.155 --> 00:04:14.300


then Yi equals one.

00:04:14.300 --> 00:04:17.180


So the first term is logarithm of the Pi.

00:04:17.180 --> 00:04:19.795


And the second term is zero.

00:04:19.795 --> 00:04:24.285


Likewise, if there is no present behind the ith door,

00:04:24.285 --> 00:04:26.355


then Yi is zero.

00:04:26.355 --> 00:04:28.355


So this first term is zero.

00:04:28.355 --> 00:04:32.655


And this term is precisely logarithm of one minus Pi.

00:04:32.655 --> 00:04:35.785


Therefore, this formula really encompasses the sums of the

00:04:35.785 --> 00:04:39.935


negative of logarithms which is precisely the cross-entropy.

00:04:39.935 --> 00:04:45.640


So the cross-entropy really tells us when two vectors are similar or different.

00:04:45.640 --> 00:04:52.170


For example, if you calculate the cross entropy of the pair one one zero,

00:04:52.170 --> 00:04:53.485


and 0.8, 0.7, 0.1, we get 0.69.

00:04:53.485 --> 00:05:00.500


And that is low because one one zero is a similar vector to 0.8, 0.7, 0.1.

00:05:00.500 --> 00:05:05.510


Which means that the arrangement of gifts given by the first set of

00:05:05.510 --> 00:05:08.270


numbers is likely to happen based

00:05:08.270 --> 00:05:11.715


on the probabilities given by the second set of numbers.

00:05:11.715 --> 00:05:17.107


But on the other hand if we calculate the cross-entropy of the pairs zero zero one,

00:05:17.107 --> 00:05:19.559


and 0.8, 0.7, 0.1,

00:05:19.559 --> 00:05:23.210


that is 5.12 which is very high.

00:05:23.210 --> 00:05:27.380


This is because the arrangement of gifts being given by the first set of numbers is

00:05:27.380 --> 00:05:32.030


very unlikely to happen from the probabilities given by the second set of numbers.
Start Quiz:
cross_entropy.py solution.py
import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    pass

import numpy as np

def cross_entropy(Y, P):
    Y = np.float_(Y)
    P = np.float_(P)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
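The solution above reproduces the numbers from the gift example in the video. (This sketch uses np.asarray for the float conversion, an equivalent alternative to np.float_, which is removed in newer NumPy versions.)

```python
import numpy as np

def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

# Similar vectors -> low cross-entropy; dissimilar -> high, as in the video.
print(round(cross_entropy([1, 1, 0], [0.8, 0.7, 0.1]), 2))  # 0.69
print(round(cross_entropy([0, 0, 1], [0.8, 0.7, 0.1]), 2))  # 5.12
```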

00:00:00.000 --> 00:00:05.220


Now that was when we had two classes namely receiving a gift or not receiving a gift.

00:00:05.220 --> 00:00:09.285


What happens if we have more classes? Let's take a look.

00:00:09.285 --> 00:00:10.790


So we have a similar problem.

00:00:10.790 --> 00:00:12.220


We still have three doors.

00:00:12.220 --> 00:00:14.940


And this problem is still not the Monty Hall problem.

00:00:14.940 --> 00:00:16.860


Behind each door there can be an animal,

00:00:16.860 --> 00:00:19.140


and the animal can be of three types.

00:00:19.140 --> 00:00:21.555


It can be a duck, it can be a beaver,

00:00:21.555 --> 00:00:23.575


or it can be a walrus.

00:00:23.575 --> 00:00:26.095


So let's look at this table of probabilities.

00:00:26.095 --> 00:00:28.485


According to the first column on the table,

00:00:28.485 --> 00:00:30.135


behind the first door,

00:00:30.135 --> 00:00:33.380


the probability of finding a duck is 0.7,

00:00:33.380 --> 00:00:35.800


the probability of finding a beaver is 0.2,

00:00:35.800 --> 00:00:39.080


and the probability of finding a walrus is 0.1.

00:00:39.080 --> 00:00:42.150


Notice that the numbers in each column need to add to

00:00:42.150 --> 00:00:45.745


one because there is some animal behind door one.

00:00:45.745 --> 00:00:50.030


The numbers in the rows do not need to add to one as you can see.

00:00:50.030 --> 00:00:53.825


It could easily be that we have a duck behind every door and that's okay.

00:00:53.825 --> 00:00:55.590


So let's look at a sample scenario.

00:00:55.590 --> 00:00:57.225


Let's say we have our three doors,

00:00:57.225 --> 00:00:59.775


and behind the first door, there's a duck,

00:00:59.775 --> 00:01:02.040


behind the second door there's a walrus,

00:01:02.040 --> 00:01:04.805


and behind the third door there's also a walrus.

00:01:04.805 --> 00:01:07.895


Recall that the probabilities are again by the table.

00:01:07.895 --> 00:01:11.555


So a duck behind the first door is 0.7 likely,

00:01:11.555 --> 00:01:14.870


a walrus behind the second door is 0.3 likely,

00:01:14.870 --> 00:01:18.925


and a walrus behind the third door is 0.4 likely.

00:01:18.925 --> 00:01:21.930


So the probability of obtaining this three animals is the product of

00:01:21.930 --> 00:01:25.470


the probabilities of the three events since they are independent events,

00:01:25.470 --> 00:01:27.900


which in this case it's 0.084.

00:01:27.900 --> 00:01:30.285


And as we learned,

00:01:30.285 --> 00:01:33.000


that cross entropy here is given by

00:01:33.000 --> 00:01:37.065


the sums of the negatives of the logarithms of the probabilities.

00:01:37.065 --> 00:01:40.720


So the first one is negative logarithm of 0.7.

00:01:40.720 --> 00:01:43.710


The second one is negative logarithm of 0.3.

00:01:43.710 --> 00:01:46.740


And the third one is negative logarithm of 0.4.

00:01:46.740 --> 00:01:52.255


The cross-entropy is the sum of these three, which is actually 2.48.

00:01:52.255 --> 00:01:55.490


But we want a formula, so let's put some variables here.

00:01:55.490 --> 00:02:00.187


So P11 is the probability of finding a duck behind door one.

00:02:00.187 --> 00:02:04.535


P12 is the probability of finding a duck behind door two etc.

00:02:04.535 --> 00:02:09.260


And let's have the indicator variables: Y1j be 1 if there's

00:02:09.260 --> 00:02:14.790


a duck behind door j, Y2j be 1 if there's a beaver behind door j,

00:02:14.790 --> 00:02:19.285


and Y3j be 1 if there's a walrus behind door j.

00:02:19.285 --> 00:02:21.935


And these variables are zero otherwise.

00:02:21.935 --> 00:02:24.210


And so, the formula for the cross entropy is

00:02:24.210 --> 00:02:27.445


simply the negative of the summation from i equals one to n,

00:02:27.445 --> 00:02:35.630


of the summation from j equals one to m of Yij times the logarithm of Pij.

00:02:35.630 --> 00:02:39.150


In this case, m is the number of classes.

00:02:39.150 --> 00:02:42.330


This formula works because Yij being zero one,

00:02:42.330 --> 00:02:45.135


makes sure that we're only adding the logarithms

00:02:45.135 --> 00:02:48.555


of the probabilities of the events that actually have occurred.

00:02:48.555 --> 00:02:53.760


And voila, this is the formula for the cross-entropy for multiple classes.

00:02:53.760 --> 00:02:55.080


Now I'm going to leave you with this question.

00:02:55.080 --> 00:03:00.085


Given that we have a formula for cross entropy for two classes and one for m classes.

00:03:00.085 --> 00:03:04.240


These formulas look different but are they the same for m equals two?

00:03:04.240 --> 00:03:05.565


Obviously the answer is yes,

00:03:05.565 --> 00:03:07.950


but it's a cool exercise to actually write them down and

00:03:07.950 --> 00:03:11.000


convince yourself that they are actually the same.
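As a sanity check, the multiclass cross-entropy above can be computed directly. In the sketch below, the first column of probabilities and the two walrus entries are the ones from the video; the remaining table entries are made up so that each column sums to one.

```python
import numpy as np

def cross_entropy(Y, P):
    """Multiclass cross-entropy: the negative sum over classes i and
    doors j of Y[i][j] * ln(P[i][j])."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(-np.sum(Y * np.log(P)))

# Rows are classes (duck, beaver, walrus), columns are doors.
P = [[0.7, 0.3, 0.1],
     [0.2, 0.4, 0.5],
     [0.1, 0.3, 0.4]]
# Outcome from the video: duck behind door 1, walrus behind doors 2 and 3.
Y = [[1, 0, 0],
     [0, 0, 0],
     [0, 1, 1]]
print(round(cross_entropy(Y, P), 2))  # 2.48
```

The one-hot matrix Y zeroes out every term except the logarithms of the probabilities of the events that actually occurred, exactly as the formula requires.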
23. Logistic Regression
Logistic Regression
Now, we're finally ready for one of the most popular and useful algorithms in Machine
Learning, and the building block of all that constitutes Deep Learning. The Logistic
Regression Algorithm. And it basically goes like this:

 Take your data


 Pick a random model
 Calculate the error
 Minimize the error, and obtain a better model
 Enjoy!

Calculating the Error Function

Let's dive into the details. The next video will show you how to calculate an error function.
Important
WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:03.632


So this is a good time for a quick recap of the last couple of lessons.

00:00:03.632 --> 00:00:05.640


Here we have two models.

00:00:05.639 --> 00:00:08.934


The bad model on the left and the good model on the right.

00:00:08.935 --> 00:00:13.440


And for each one of those we calculate the cross entropy which is the sum of

00:00:13.439 --> 00:00:19.259


the negatives of the logarithms of the probabilities of the points being their colors.

00:00:19.260 --> 00:00:22.170


And we conclude that the one on the right is better

00:00:22.170 --> 00:00:25.860


because its cross-entropy is much smaller.
00:00:25.859 --> 00:00:29.269
So let's actually calculate the formula for the error function.

00:00:29.269 --> 00:00:31.559


Let's split into two cases.

00:00:31.559 --> 00:00:34.269


The first case being when y=1.

00:00:34.270 --> 00:00:36.130


So when the point is blue to begin with,

00:00:36.130 --> 00:00:42.480


the model tells us that the probability of being blue is the prediction y_hat.

00:00:42.479 --> 00:00:47.849


So for these two points the probabilities are 0.6 and 0.2.

00:00:47.850 --> 00:00:50.910


As we can see the point in the blue area has

00:00:50.909 --> 00:00:55.000


more probability of being blue than the point in the red area.

00:00:55.000 --> 00:01:00.500


And our error is simply the negative logarithm of this probability.

00:01:00.500 --> 00:01:04.010


So it's precisely minus logarithm of y_hat.

00:01:04.010 --> 00:01:09.665


In the figure it's minus logarithm of 0.6. and minus logarithm of 0.2.

00:01:09.665 --> 00:01:13.745


Now if y=0, so when the point is red,

00:01:13.745 --> 00:01:17.585


then we need to calculate the probability of the point being red.

00:01:17.584 --> 00:01:22.339


The probability of the point being red is one minus the probability of the point being

00:01:22.340 --> 00:01:27.750


blue which is precisely 1 minus the prediction y_hat.

00:01:27.750 --> 00:01:30.890


So the error is precisely the negative logarithm of
00:01:30.890 --> 00:01:35.870
this probability which is negative logarithm of 1 - y_hat.

00:01:35.870 --> 00:01:42.040


In this case we get negative logarithm 0.1 and negative logarithm 0.7.

00:01:42.040 --> 00:01:46.605


So we conclude that the error is a negative logarithm of y_hat if the point is blue.

00:01:46.605 --> 00:01:50.635


And negative logarithm of 1 - y_hat if the point is red.

00:01:50.635 --> 00:01:53.625


We can summarize these two formulas into this one.

00:01:53.625 --> 00:02:02.159


Error = -(1 - y) ln(1 - y_hat) - y ln(y_hat).

00:02:02.159 --> 00:02:03.759


Why does this formula work?

00:02:03.760 --> 00:02:05.730


Well because if the point is blue,

00:02:05.730 --> 00:02:10.664


then y=1 which means 1-y=0 which makes the first term

00:02:10.664 --> 00:02:16.495


0, and the second term is simply the negative logarithm of y_hat.

00:02:16.495 --> 00:02:20.219


Similarly, if the point is red then y=0.

00:02:20.219 --> 00:02:27.680


So the second term of the formula is 0 and the first one is the negative logarithm of 1 - y_hat.

00:02:27.680 --> 00:02:31.145


Now the formula for the error function is simply the sum over

00:02:31.145 --> 00:02:35.510


all the error functions of points which is precisely the summation here.

00:02:35.509 --> 00:02:38.564


That's going to be this 4.8 we have over here.

00:02:38.564 --> 00:02:41.469


Now by convention we'll actually consider the average,
00:02:41.469 --> 00:02:45.330
not the sum which is where we are dividing by n over here.

00:02:45.330 --> 00:02:49.050


This will turn the 4.8 into a 1.2.

00:02:49.050 --> 00:02:53.330


From now on we'll use this formula as our error function.

00:02:53.330 --> 00:02:58.860


And now since y_hat is given by the sigmoid of the linear function wx + b,

00:02:58.860 --> 00:03:01.890


then the total formula for the error is actually in terms

00:03:01.889 --> 00:03:05.094


of w and b which are the weights of the model.

00:03:05.094 --> 00:03:08.219


And it's simply the summation we see here.

00:03:08.219 --> 00:03:14.449


In this case, y_i is just the label of the point x^i.

00:03:14.449 --> 00:03:17.364


So now that we've calculated it our goal is to minimize it.

00:03:17.365 --> 00:03:18.975


And that's what we'll do next.

00:03:18.974 --> 00:03:20.293


And just a small aside,

00:03:20.294 --> 00:03:23.210


what we did is for binary classification problems.

00:03:23.210 --> 00:03:25.670


If we have a multiclass classification problem then

00:03:25.669 --> 00:03:28.490


the error is now given by the multiclass entropy.

00:03:28.491 --> 00:03:33.380


This formula is given here where for every data point we take the product

00:03:33.379 --> 00:03:39.139


of the label times the logarithm of the prediction and then we average all these values.
00:03:39.139 --> 00:03:41.539
And again it's a nice exercise to convince yourself that

00:03:41.539 --> 00:03:45.000


the two are the same when there are just two classes.
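The error function derived in this video can be checked numerically; the probabilities below are the four points from the figure (blue points with y_hat = 0.6 and 0.2, red points with blue-probabilities 0.9 and 0.3, i.e. red-probabilities 0.1 and 0.7).

```python
import numpy as np

def error(y, y_hat):
    """Binary cross-entropy for one point:
    -y * ln(y_hat) - (1 - y) * ln(1 - y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

labels = [1, 1, 0, 0]               # two blue points, two red points
predictions = [0.6, 0.2, 0.9, 0.3]  # predicted probability of blue
errors = [error(y, p) for y, p in zip(labels, predictions)]
total = sum(errors)       # ~4.8, the sum from the video
average = total / 4       # ~1.2, the averaged error function
print(round(total, 1), round(average, 1))
```

Note how the formula switches automatically: for y = 1 it reduces to -ln(y_hat), for y = 0 to -ln(1 - y_hat).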
24. Gradient Descent
Gradient Descent
In this lesson, we'll learn the principles and the math behind the gradient descent algorithm
So, a small gradient means we'll change our coordinates by a little bit, and a large
gradient means we'll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We'll see
it in a bit.
00:00:00.000 --> 00:00:02.580
And now we finally have the tools to write

00:00:02.580 --> 00:00:05.120


the pseudocode for the gradient descent algorithm,

00:00:05.120 --> 00:00:06.830


and it goes like this.

00:00:06.830 --> 00:00:15.170


Step one, start with random weights w_one up to w_n and b which will give us a line,

00:00:15.170 --> 00:00:19.270


and not just a line, but the whole probability function given by sigmoid of w x plus b.

00:00:19.270 --> 00:00:22.820


Now for every point we'll calculate the error,

00:00:22.820 --> 00:00:25.150


and as we can see the error is high for

00:00:25.150 --> 00:00:29.230


misclassified points and small for correctly classified points.
00:00:29.230 --> 00:00:32.545
Now for every point with coordinates x_one up to x_n,

00:00:32.545 --> 00:00:36.845


we update w_i by adding the learning rate

00:00:36.845 --> 00:00:42.950


alpha times the partial derivative of the error function with respect to w_i.

00:00:42.950 --> 00:00:45.120


We also update b by adding alpha times

00:00:45.120 --> 00:00:48.440


the partial derivative of the error function with respect to b.

00:00:48.440 --> 00:00:49.920


This gives us new weights,

00:00:49.920 --> 00:00:52.610


w_i_prime and then new bias b_prime.

00:00:52.610 --> 00:00:55.330


Now we've already calculated these partial derivatives and we

00:00:55.330 --> 00:00:58.605


know that they are y_hat minus y times

00:00:58.605 --> 00:01:01.295


x_i for the derivative with respect to w_i

00:01:01.295 --> 00:01:05.215


and y_hat minus y for the derivative with respect to b.

00:01:05.215 --> 00:01:08.840


So that's how we'll update the weights.

00:01:08.840 --> 00:01:13.350


Now repeat this process until the error is small,

00:01:13.350 --> 00:01:15.765


or we can repeat it a fixed number of times.

00:01:15.765 --> 00:01:18.840


The number of times is called the number of epochs, and we'll learn about it later.

00:01:18.840 --> 00:01:20.100


Now this looks familiar,
00:01:20.100 --> 00:01:21.935
have we seen something like that before?

00:01:21.935 --> 00:01:24.300


Well, we look at the points and what each point is doing is

00:01:24.300 --> 00:01:26.640


it's adding a multiple of itself into the weights of

00:01:26.640 --> 00:01:31.640


the line in order to get the line to move closer towards it if it's misclassified.

00:01:31.640 --> 00:01:34.435


That's pretty much what the Perceptron algorithm is doing.

00:01:34.435 --> 00:01:36.000


So in the next video, we'll look at

00:01:36.000 --> 00:01:39.000


the similarities because it's a bit suspicious how similar they are.
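As a rough sketch, the pseudocode above might look like this in Python. The toy dataset, seed, and hyperparameters are made up for illustration; the update is written as w_i plus alpha times (y - y_hat) times x_i, i.e. a step that moves against the gradient of the error.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(X, y, learn_rate=0.1, epochs=100):
    """Gradient descent as in the pseudocode: start with random
    weights, then for every point update
    w_i <- w_i + alpha * (y - y_hat) * x_i and
    b <- b + alpha * (y - y_hat), for a fixed number of epochs."""
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.1, size=X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = sigmoid(np.dot(x_i, weights) + bias)
            weights = weights + learn_rate * (y_i - y_hat) * x_i
            bias = bias + learn_rate * (y_i - y_hat)
    return weights, bias

# Toy linearly separable data: class 1 sits at larger coordinates.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
w, b = train(X, y)
predictions = sigmoid(X @ w + b)  # low for class 0, high for class 1
```

After enough epochs the line settles so that points of each class end up on their own side of the boundary.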
26. Pre-Lab: Gradient Descent
Implementing Gradient Descent
In the following lab, you'll be able to implement the gradient descent algorithm on the
following sample dataset with two classes.
Workspace
To open this notebook, you have two options:

 Go to the next page in the classroom (recommended)


 Clone the repo from Github and open the notebook GradientDescent.ipynb in
the gradient-descent folder. You can either download the repository with  git clone
https://github.com/udacity/deep-learning.git , or download it as an archive file
from this link.

Instructions
In this notebook, you'll be implementing the functions that build the gradient descent
algorithm, namely:

 sigmoid : The sigmoid activation function.


 output_formula : The formula for the prediction.
 error_formula : The formula for the error at a point.
 update_weights : The function that updates the parameters with one gradient descent
step.

When you implement them, run the  train  function and this will graph several of
the lines that are drawn in successive gradient descent steps. It will also graph the
error function, and you can see it decreasing as the number of epochs grows.

This is a self-assessed lab. If you need any help or want to check your answers, feel
free to check out the solutions notebook in the same folder, or by clicking here.
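A minimal sketch of the four functions the lab asks for is below. The exact signatures in the notebook may differ; these are assumptions based on the descriptions above.

```python
import numpy as np

def sigmoid(x):
    """The sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def output_formula(features, weights, bias):
    """The prediction: y_hat = sigmoid(w . x + b)."""
    return sigmoid(np.dot(features, weights) + bias)

def error_formula(y, output):
    """Cross-entropy error at a single point."""
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

def update_weights(x, y, weights, bias, learnrate):
    """One gradient descent step on a single point."""
    output = output_formula(x, weights, bias)
    d_error = y - output
    weights = weights + learnrate * d_error * x
    bias = bias + learnrate * d_error
    return weights, bias

# Quick check on a single point:
x, w, b = np.array([0.5, -0.2]), np.array([1.0, 1.0]), 0.0
print(output_formula(x, w, b))  # sigmoid(0.3), about 0.574
```

Check your implementations against the solutions notebook before moving on.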
27. Notebook: Gradient Descent
Workspace

This section contains a workspace (a Jupyter Notebook workspace, an online code editor workspace, etc.), which cannot be automatically downloaded or reproduced here.
Please access the classroom with your account and manually download the workspace to your
local machine. Note that for some courses, Udacity uploads the workspace files
onto https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a

28. Perceptron vs Gradient Descent


Gradient Descent Vs Perceptron Algorithm
00:00:00.000 --> 00:00:03.990
So let's compare the Perceptron algorithm and the Gradient Descent algorithm.

00:00:03.990 --> 00:00:05.845


In the Gradient Descent algorithm,

00:00:05.845 --> 00:00:09.535


we take the weights and change them from Wi to

00:00:09.535 --> 00:00:13.915


Wi plus alpha times Y-hat minus Y times Xi.

00:00:13.915 --> 00:00:15.325


In the Perceptron algorithm,

00:00:15.325 --> 00:00:17.253


not every point changes weights,

00:00:17.253 --> 00:00:18.960


only the misclassified ones.

00:00:18.960 --> 00:00:21.385


Here, if X is misclassified,

00:00:21.385 --> 00:00:27.525


we'll change the weights by adding Xi to Wi if the point label is positive,
00:00:27.525 --> 00:00:29.785
and subtracting if negative.

00:00:29.785 --> 00:00:32.327


Now the question is, are these two things the same?

00:00:32.327 --> 00:00:34.920


Well, let's remember that in that Perceptron algorithm,

00:00:34.920 --> 00:00:37.350


the labels are one and zero.

00:00:37.350 --> 00:00:40.320


And the predictions Y-hat are also one and zero.

00:00:40.320 --> 00:00:43.060


So, if the point is correctly classified,

00:00:43.060 --> 00:00:48.440


then Y minus Y-hat is zero because Y is equal to Y-hat.

00:00:48.440 --> 00:00:50.205


Now, if the point is labeled blue,

00:00:50.205 --> 00:00:52.095


then Y equals one.

00:00:52.095 --> 00:00:53.220


And if it's misclassified,

00:00:53.220 --> 00:00:55.950


then the prediction must be Y-hat equals zero.

00:00:55.950 --> 00:00:59.265


So Y-hat minus Y is minus one.

00:00:59.265 --> 00:01:01.050


Similarly, with the points labeled red,

00:01:01.050 --> 00:01:04.105


then Y equals zero and Y-hat equals one.

00:01:04.105 --> 00:01:06.180


So, Y-hat minus Y equals one.

00:01:06.180 --> 00:01:08.300


This may not be super clear right away.
00:01:08.300 --> 00:01:10.035
But if you stare at the screen for long enough,

00:01:10.035 --> 00:01:13.620


you'll realize that the right and the left are exactly the same thing.

00:01:13.620 --> 00:01:15.175


The only difference is that in the left,

00:01:15.175 --> 00:01:17.776


Y-hat can take any number between zero and one,

00:01:17.776 --> 00:01:19.650


whereas in the right,

00:01:19.650 --> 00:01:23.305


Y-hat can take only the values zero or one.

00:01:23.305 --> 00:01:25.175


It's pretty fascinating, isn't it?

00:01:25.175 --> 00:01:28.055


But let's study Gradient Descent even more carefully.

00:01:28.055 --> 00:01:31.680


Both in the Perceptron algorithm and the Gradient Descent algorithm,

00:01:31.680 --> 00:01:36.570


a point that is misclassified tells a line to come closer because eventually,

00:01:36.570 --> 00:01:40.770


it wants the line to surpass it so it can be on the correct side.

00:01:40.770 --> 00:01:43.734


Now, what happens if the point is correctly classified?

00:01:43.734 --> 00:01:47.315


Well, the Perceptron algorithm says do absolutely nothing.

00:01:47.315 --> 00:01:49.575


In the Gradient Descent algorithm,

00:01:49.575 --> 00:01:51.195


you are changing the weights.

00:01:51.195 --> 00:01:52.830


But what is it doing?
00:01:52.830 --> 00:01:54.480
Well, if we look carefully,

00:01:54.480 --> 00:01:56.640


what the point is telling the line,

00:01:56.640 --> 00:01:58.875


is to go farther away.

00:01:58.875 --> 00:02:01.120


And this makes sense, right?

00:02:01.120 --> 00:02:03.180


Because if you're correctly classified,

00:02:03.180 --> 00:02:05.895


say, if you're a blue point in the blue region,

00:02:05.895 --> 00:02:08.385


you'd like to be even more into the blue region,

00:02:08.385 --> 00:02:10.740


so your prediction is even closer to one,

00:02:10.740 --> 00:02:13.060


and your error is even smaller.

00:02:13.060 --> 00:02:16.320


Similarly, for a red point in the red region.

00:02:16.320 --> 00:02:19.590


So it makes sense that the point tells the line to go farther away.

00:02:19.590 --> 00:02:22.925


And that's precisely what the Gradient Descent algorithm does.

00:02:22.925 --> 00:02:26.540


The misclassified points ask the line to come closer and

00:02:26.540 --> 00:02:30.315


the correctly classified points ask the line to go farther away.

00:02:30.315 --> 00:02:33.240


The line listens to all the points and takes steps in

00:02:33.240 --> 00:02:37.000


such a way that it eventually arrives at a pretty good solution.
In the video at the 0:12 mark, the instructor said y hat minus y. It should be y minus y hat instead, as stated on the slide.
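The equivalence can be checked with a tiny sketch (the function names are illustrative): when labels and hard predictions are both restricted to 0 and 1, the gradient descent update alpha times (y - y_hat) times x reduces exactly to the perceptron rule.

```python
import numpy as np

def gd_update(w, x, y, y_hat, alpha):
    """Gradient descent: w <- w + alpha * (y - y_hat) * x."""
    return w + alpha * (y - y_hat) * x

def perceptron_update(w, x, y, y_hat, alpha):
    """Perceptron: only misclassified points change the weights,
    adding alpha * x for positive labels, subtracting for negative."""
    if y_hat == y:
        return w  # correctly classified: do nothing
    return w + alpha * x if y == 1 else w - alpha * x

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.3])

# Misclassified blue point: y = 1, hard prediction y_hat = 0.
a = gd_update(w, x, y=1, y_hat=0, alpha=0.1)
b = perceptron_update(w, x, y=1, y_hat=0, alpha=0.1)
# Both add alpha * x, so a == b; for a correctly classified point
# (y == y_hat) both rules leave the weights unchanged.
```

The only real difference, as the video says, is that in gradient descent y_hat can be any number between zero and one, not just zero or one.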

29. Continuous Perceptrons


Continuous Perceptrons
00:00:00.000 --> 00:00:04.139

So, this is just a small recap video that will get us ready for what's coming.

00:00:04.139 --> 00:00:06.809

Recall that if we have our data in the form of these points over

00:00:06.809 --> 00:00:10.710

here and the linear model like this one, for example,

00:00:10.710 --> 00:00:14.865

with equation 2x1 + 7x2 - 4 = 0,

00:00:14.865 --> 00:00:19.495

this will give rise to a probability function that looks like this.

00:00:19.495 --> 00:00:23.760

Where the points on the blue or positive region have more chance of being

00:00:23.760 --> 00:00:29.570

blue and the points in the red or negative region have more chance of being red.

00:00:29.570 --> 00:00:32.609

And this will give rise to this perceptron where we label

00:00:32.609 --> 00:00:36.314

the edges by the weights and the node by the bias.


00:00:36.314 --> 00:00:37.664

So, what the perceptron does,

00:00:37.664 --> 00:00:39.655

it takes the point (x1, x2),

00:00:39.655 --> 00:00:44.689

plots it in the graph and then it returns a probability that the point is blue.

00:00:44.689 --> 00:00:47.378

In this case, it returns a 0.9

00:00:47.378 --> 00:00:51.200

and this mimics the neurons in the brain because they receive nervous impulses,

00:00:51.200 --> 00:00:54.000

do something inside and return a nervous impulse.

30. Non-linear Data


Non-Linear Data


00:00:00.000 --> 00:00:04.003


Now we've been dealing a lot with data sets that can be separated by a line,
00:00:04.003 --> 00:00:05.543
like this one over here.

00:00:05.543 --> 00:00:08.865


But as you can imagine the real world is much more complex than that.

00:00:08.865 --> 00:00:12.150


This is where neural networks can show their full potential.

00:00:12.150 --> 00:00:14.970


In the next few videos we'll see how to deal with

00:00:14.970 --> 00:00:17.350


more complicated data sets that require

00:00:17.350 --> 00:00:20.109


highly non-linear boundaries such as this one over here.

00:00:00.000 --> 00:00:02.459


So, let's go back to this example of where we saw

00:00:02.459 --> 00:00:04.769


some data that is not linearly separable.

00:00:04.769 --> 00:00:09.740


So a line can not divide these red and blue points and we looked at some solutions,

00:00:09.740 --> 00:00:14.185


and if you remember, the one we considered more seriously was this curve over here.

00:00:14.185 --> 00:00:18.664


So what I'll teach you now is to find this curve and it's very similar to before.
00:00:18.664 --> 00:00:20.519
We'll still use gradient descent.

00:00:20.518 --> 00:00:23.009


In a nutshell, what we're going to do is for

00:00:23.010 --> 00:00:25.769


this data, which is not separable by a line,

00:00:25.768 --> 00:00:30.599


we're going to create a probability function where the points in the blue region are more

00:00:30.600 --> 00:00:36.240


likely to be blue and the points in the red region are more likely to be red.

00:00:36.240 --> 00:00:39.798


And this curve here that separates them is

00:00:39.798 --> 00:00:44.329


a set of points which are equally likely to be blue or red.

00:00:44.329 --> 00:00:47.789


Everything will be the same as before except this equation

00:00:47.789 --> 00:00:52.000


won't be linear and that's where neural networks come into play.
32. Neural Network Architecture
Neural Network Architecture
Ok, so we're ready to put these building blocks together, and build great Neural Networks! (Or
Multi-Layer Perceptrons, however you prefer to call them.)

The first two videos will show us how to combine two perceptrons into a third, more
complicated one.

00:00:00.000 --> 00:00:03.464


Now I'm going to show you how to create these nonlinear models.

00:00:03.464 --> 00:00:06.058


What we're going to do is a very simple trick.

00:00:06.059 --> 00:00:12.060


We're going to combine two linear models into a nonlinear model as follows.

00:00:12.060 --> 00:00:13.769


Visually it looks like this.

00:00:13.769 --> 00:00:17.518


The two models superimposed, creating the model on the right.

00:00:17.518 --> 00:00:20.084


It's almost like we're doing arithmetic on models.

00:00:20.085 --> 00:00:24.160


It's like saying "This line plus this line equals that curve."

00:00:24.160 --> 00:00:26.824


Let me show you how to do this mathematically.
00:00:26.824 --> 00:00:30.750
So a linear model as we know is a whole probability space.

00:00:30.750 --> 00:00:36.478


This means that for every point it gives us the probability of the point being blue.

00:00:36.478 --> 00:00:39.179


So, for example, this point over here is in

00:00:39.179 --> 00:00:43.890


the blue region so its probability of being blue is 0.7.

00:00:43.890 --> 00:00:47.250


The same point given by the second probability space is

00:00:47.250 --> 00:00:52.170


also in the blue region so its probability of being blue is 0.8.

00:00:52.170 --> 00:00:53.353


Now the question is,

00:00:53.353 --> 00:00:55.890


how do we combine these two?

00:00:55.890 --> 00:01:00.225


Well, the simplest way to combine two numbers is to add them, right?

00:01:00.225 --> 00:01:05.409


So 0.8 plus 0.7 is 1.5.

00:01:05.409 --> 00:01:09.890


But now, this doesn't look like a probability anymore since it's bigger than one.

00:01:09.890 --> 00:01:15.915


And probabilities need to be between 0 and 1. So what can we do?

00:01:15.915 --> 00:01:20.980


How do we turn this number that is larger than 1 into something between 0 and 1?

00:01:20.980 --> 00:01:24.079


Well, we've been in this situation before and we have a pretty good tool that

00:01:24.078 --> 00:01:27.744


turns every number into something between 0 and 1.

00:01:27.745 --> 00:01:30.234


That's just a sigmoid function.
00:01:30.233 --> 00:01:32.780
So that's what we're going to do.

00:01:32.780 --> 00:01:36.858


We apply the sigmoid function to 1.5 to get the value

00:01:36.858 --> 00:01:40.188


0.82 and that's the probability of

00:01:40.188 --> 00:01:44.568


this point being blue in the resulting probability space.

00:01:44.569 --> 00:01:47.299


So now we've managed to create a probability function for

00:01:47.299 --> 00:01:51.243


every single point in the plane and that's how we combined two models.

00:01:51.243 --> 00:01:54.093


We calculate the probability for one of them,

00:01:54.093 --> 00:01:56.140


the probability for the other,

00:01:56.140 --> 00:01:59.334


then add them and then we apply the sigmoid function.

00:01:59.334 --> 00:02:01.340


Now, what if we wanted to weight this sum?

00:02:01.340 --> 00:02:04.370


What, if say, we wanted the model in the top to have

00:02:04.370 --> 00:02:07.849


more of a say in the resulting probability than the second?

00:02:07.849 --> 00:02:11.569


So something like this where the resulting model looks a lot more like the one in

00:02:11.568 --> 00:02:15.698


the top than like the one in the bottom. Well, we can add weights.

00:02:15.699 --> 00:02:22.355


For example, we can say "I want seven times the first model plus the second one."

00:02:22.354 --> 00:02:24.240


Actually, I can pick whatever weights I want.
00:02:24.241 --> 00:02:29.574
For example, I can say "Seven times the first one plus five times the second one."

00:02:29.574 --> 00:02:34.335


And the way I combine the models is I take the first probability,

00:02:34.335 --> 00:02:36.789


multiply it by seven,

00:02:36.788 --> 00:02:43.293


then take the second one and multiply it by five and I can even add a bias if I want.

00:02:43.294 --> 00:02:45.526


Say, the bias is minus 6,

00:02:45.526 --> 00:02:48.020


then we add it to the whole equation.

00:02:48.020 --> 00:02:52.735


So we'll have seven times this plus five times this minus six,

00:02:52.735 --> 00:02:54.914


which gives us 2.9.

00:02:54.913 --> 00:03:00.679


We then apply the sigmoid function and that gives us 0.95.

00:03:00.680 --> 00:03:02.680


So it's almost like we had before, isn't it?

00:03:02.680 --> 00:03:06.085


Before we had a line that is a linear combination

00:03:06.085 --> 00:03:10.240


of the input values times the weight plus a bias.

00:03:10.240 --> 00:03:13.300


Now we have that this model is a linear combination of

00:03:13.300 --> 00:03:17.650


the two previous models times the weights plus some bias.

00:03:17.650 --> 00:03:18.905


So it's almost the same thing.

00:03:18.905 --> 00:03:21.599


It's almost like this curved model in the right.
00:03:21.599 --> 00:03:25.818
It's a linear combination of the two linear models before

00:03:25.818 --> 00:03:30.573


or we can even think of it as the line between the two models.

00:03:30.574 --> 00:03:32.069


This is no coincidence.

00:03:32.068 --> 00:03:35.435


This is at the heart of how neural networks get built.

00:03:35.435 --> 00:03:38.628


Of course, we can imagine that we can keep doing this always obtaining

00:03:38.628 --> 00:03:43.228


more new complex models out of linear combinations of the existing ones.

00:03:43.229 --> 00:03:47.000


And this is what we're going to do to build our neural networks.
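The steps above — take each model's probability, form a weighted sum plus a bias, then apply the sigmoid — reproduce the numbers from the video:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

p1, p2 = 0.7, 0.8  # the point's blue-probability under each model

# A plain sum gives 1.5, which is not a probability, so squash it:
simple = sigmoid(p1 + p2)        # about 0.82

# Weighted combination: 7 * first + 5 * second - 6 (the bias).
combined = 7 * p1 + 5 * p2 - 6   # = 2.9
weighted = sigmoid(combined)     # about 0.95
print(round(simple, 2), round(weighted, 2))
```

The weighted version is exactly a perceptron whose inputs are the outputs of the two earlier models.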

29 Neural Network Architecture 2


Very important

00:00:00.000 --> 00:00:02.770


So in the previous session we learned that we can

00:00:02.770 --> 00:00:05.679


add two linear models to obtain a third model.

00:00:05.679 --> 00:00:07.734


As a matter of fact, we did even more.
00:00:07.735 --> 00:00:10.505
We can take a linear combination of two models.

00:00:10.505 --> 00:00:13.750


So, the first model times a constant plus the second model times a

00:00:13.750 --> 00:00:17.785


constant plus a bias and that gives us a non-linear model.

00:00:17.785 --> 00:00:22.240


That looks a lot like perceptrons where we can take a value times a constant plus

00:00:22.239 --> 00:00:26.784


another value times a constant plus a bias and get a new value.

00:00:26.785 --> 00:00:28.204


And that's no coincidence.

00:00:28.204 --> 00:00:31.649


That's actually the building block of Neural Networks.

00:00:31.649 --> 00:00:33.210


So, let's look at an example.

00:00:33.210 --> 00:00:40.304


Let's say, we have this linear model where the linear equation is 5x1 minus 2x2 plus 8.

00:00:40.304 --> 00:00:42.689


That's represented by this perceptron.

00:00:42.689 --> 00:00:46.169


And we have another linear model with equations 7x1 minus

00:00:46.170 --> 00:00:52.045


3x2 minus 1 which is represented by this perceptron over here.

00:00:52.045 --> 00:00:55.929


Let's draw them nicely in here and let's use another perceptron

00:00:55.929 --> 00:01:00.070


to combine these two models using the Linear Equation,

00:01:00.070 --> 00:01:06.420


seven times the first model plus five times the second model minus six.

00:01:06.420 --> 00:01:11.170


And now the magic happens when we join these together and we get a Neural Network.
00:01:11.170 --> 00:01:16.480
We clean it up a bit and we obtain this. All the weights are there.

00:01:16.480 --> 00:01:18.670


The weights on the left,

00:01:18.670 --> 00:01:22.445


tell us what equations the linear models have.

00:01:22.444 --> 00:01:25.024


And the weights on the right,

00:01:25.025 --> 00:01:27.160


tell us what the linear combination is of

00:01:27.159 --> 00:01:31.629


the two models to obtain the curve non-linear model in the right.

00:01:31.629 --> 00:01:35.319


So, whenever you see a Neural Network like the one on the left,

00:01:35.319 --> 00:01:40.204


think of what could be the nonlinear boundary defined by the Neural Network.

00:01:40.204 --> 00:01:45.444


Now, note that this was drawn using the notation that puts a bias inside the node.

00:01:45.444 --> 00:01:50.394


This can also be drawn using the notation that keeps the bias as a separate node.

00:01:50.394 --> 00:01:52.939


Here, what we do is, in every layer we have

00:01:52.939 --> 00:01:56.870


a bias unit coming from a node with a one on it.

00:01:56.870 --> 00:01:59.870


So for example, the minus eight on the top node

00:01:59.870 --> 00:02:04.160


becomes an edge labelled minus eight coming from the bias node.

00:02:04.159 --> 00:02:06.119


We can see that this Neural Network uses

00:02:06.120 --> 00:02:09.000


a Sigmoid Activation Function and the Perceptrons.
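Putting the two perceptrons and the combining perceptron together, the forward pass of the network in this video can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x1, x2):
    """Forward pass through the small network from the video."""
    # Hidden layer: the two linear models, each through a sigmoid.
    m1 = sigmoid(5 * x1 - 2 * x2 + 8)   # first model:  5x1 - 2x2 + 8
    m2 = sigmoid(7 * x1 - 3 * x2 - 1)   # second model: 7x1 - 3x2 - 1
    # Output layer: 7 * m1 + 5 * m2 - 6, then another sigmoid.
    return sigmoid(7 * m1 + 5 * m2 - 6)

# Any point (x1, x2) now gets a probability of being blue:
p = forward(1.0, 1.0)
```

The weights on the left edges define the two linear models; the weights on the right edges define their linear combination, just as the diagram shows.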
Multiple layers

Now, not all neural networks look like the one above. They can be way more
complicated! In particular, we can do the following things:

 Add more nodes to the input, hidden, and output layers.


 Add more layers.

We'll see the effects of these changes in the next video.



00:00:00.000 --> 00:00:04.495


Neural networks have a certain special architecture with layers.

00:00:04.495 --> 00:00:07.320


The first layer is called the input layer,

00:00:07.320 --> 00:00:08.934


which contains the inputs,

00:00:08.933 --> 00:00:11.931


in this case, x1 and x2.

00:00:11.932 --> 00:00:14.460


The next layer is called the hidden layer,

00:00:14.460 --> 00:00:18.855


which is a set of linear models created with this first input layer.

00:00:18.855 --> 00:00:21.940


And then the final layer is called the output layer,

00:00:21.940 --> 00:00:26.614


where the linear models get combined to obtain a nonlinear model.
00:00:26.614 --> 00:00:28.644
You can have different architectures.

00:00:28.643 --> 00:00:31.764


For example, here's one with a larger hidden layer.

00:00:31.765 --> 00:00:33.689


Now we're combining three linear models to

00:00:33.689 --> 00:00:36.600


obtain the triangular boundary in the output layer.

00:00:36.600 --> 00:00:39.649


Now what happens if the input layer has more nodes?

00:00:39.649 --> 00:00:43.460


For example, this neural network has three nodes in its input layer.

00:00:43.460 --> 00:00:46.435


Well, that just means we're not living in two-dimensional space anymore.

00:00:46.435 --> 00:00:48.755


We're living in three-dimensional space,

00:00:48.755 --> 00:00:50.045


and now our hidden layer,

00:00:50.045 --> 00:00:51.689


the one with the linear models,

00:00:51.689 --> 00:00:54.795


just gives us a bunch of planes in three space,

00:00:54.795 --> 00:00:59.820


and the output layer bounds a nonlinear region in three space.

00:00:59.820 --> 00:01:03.030


In general, if we have n nodes in our input layer,

00:01:03.030 --> 00:01:06.780


then we're thinking of data living in n-dimensional space.

00:01:06.780 --> 00:01:08.983


Now what if our output layer has more nodes?

00:01:08.983 --> 00:01:10.890


Then we just have more outputs.
00:01:10.890 --> 00:01:14.209
In that case, we just have a multiclass classification model.

00:01:14.209 --> 00:01:18.329


So if our model is telling us if an image is a cat or dog or a bird,

00:01:18.328 --> 00:01:20.309


then we simply have each node in

00:01:20.310 --> 00:01:25.140


the output layer output a score for each one of the classes: one for the cat,

00:01:25.140 --> 00:01:27.930


one for the dog, and one for the bird.

00:01:27.930 --> 00:01:31.189


And finally, and here's where things get pretty cool,

00:01:31.188 --> 00:01:33.274


what if we have more layers?

00:01:33.275 --> 00:01:36.090


Then we have what's called a deep neural network.

00:01:36.090 --> 00:01:39.435


Now what happens here is our linear models combine to create

00:01:39.435 --> 00:01:45.364


nonlinear models and then these combine to create even more nonlinear models.

00:01:45.364 --> 00:01:48.150


In general, we can do this many times and obtain

00:01:48.150 --> 00:01:51.329


highly complex models with lots of hidden layers.

00:01:51.328 --> 00:01:54.434


This is where the magic of neural networks happens.

00:01:54.435 --> 00:01:56.406


Many of the models in real life,

00:01:56.406 --> 00:01:59.054


for self-driving cars or for game-playing agents,

00:01:59.055 --> 00:02:01.049


have many, many hidden layers.
00:02:01.049 --> 00:02:02.879
That neural network will just split

00:02:02.879 --> 00:02:07.091


the n-dimensional space with a highly nonlinear boundary,

00:02:07.090 --> 00:02:08.370


such as maybe the one on the right.

Multi-Class Classification

Here we elaborate a bit more on what can be done if our neural network needs to model
data with more than one output.

Multiclass Classification
We briefly mentioned multi-class classification in the last video, but let me be more specific. It seems that neural networks work really well when the problem consists of classifying two classes. For example, if the model predicts the probability of receiving a gift or not, then the answer just comes as the output of the neural network. But what happens if we have more classes? Say we want the model to tell us if an image is a duck, a beaver, or a walrus.

Well, one thing we can do is create a neural network to predict if the image is a duck, then another neural network to predict if the image is a beaver, and a third neural network to predict if the image is a walrus. Then we can just pick the answer that gives us the highest probability. But this seems like overkill, right? The first layers of the neural network should be enough to tell us things about the image, and maybe just the last layer should tell us which animal it is. As a matter of fact, as you'll see in the CNN section, this is exactly the case.

So what we need here is to add more nodes in the output layer, and each one of the nodes will give us the probability that the image is each of the animals. Then we take the scores and apply the softmax function that was previously defined to obtain well-defined probabilities. This is how we get neural networks to do multi-class classification.
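The softmax step at the end can be sketched in NumPy. This is a minimal illustration; the three scores below are made up, standing in for the network's raw outputs for duck, beaver, and walrus:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw class scores into well-defined probabilities."""
    exp_scores = np.exp(scores - np.max(scores))  # shift by the max for numerical stability
    return exp_scores / exp_scores.sum()

# Made-up scores for duck, beaver, walrus
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The probabilities are positive and sum to one, and the class with the highest score also gets the highest probability.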
33. Feedforward
Feedforward
Feedforward is the process neural networks use to turn the input into an output. Let's study it
more carefully before we dive into how to train the networks.
So now that we have defined what neural networks are, we need to learn how to train them. Training them really means finding what parameters they should have on the edges in order to model our data well. So in order to learn how to train them, we need to look carefully at how they process the input to obtain an output.

Let's look at our simplest neural network, a perceptron. This perceptron receives a data point of the form (x1, x2), where the label is y = 1. This means that the point is blue. Now, the perceptron is defined by a linear equation, say w1x1 + w2x2 + b, where w1 and w2 are the weights on the edges and b is the bias in the node. Here, w1 is bigger than w2, so we'll denote that by drawing the edge labeled w1 much thicker than the edge labeled w2.

What the perceptron does is plot the point (x1, x2) and output the probability that the point is blue. Here the point is in the red area, so the output is a small number, since the point is not very likely to be blue. This process is known as feedforward. We can see that this is a bad model, because the point is actually blue, given that the label y is one.

If we have a more complicated neural network, then the process is the same. Here, we have thick edges corresponding to large weights and thin edges corresponding to small weights. The neural network plots the point in the top graph and also in the bottom graph. The output from the top model will be a small number, since the point lies in the red area, which means it has a small probability of being blue; the output from the second model will be a large number, since the point lies in the blue area, which means it has a large probability of being blue. Now the two models get combined into this nonlinear model, and the output layer just plots the point and tells us the probability that the point is blue. As you can see, this is a bad model, because it puts the point in the red area when the point is blue. Again, this process is called feedforward, and we'll look at it more carefully.

Here we have our neural network in our other notation, where the bias is on the outside. Now we have a matrix of weights: the matrix W superscript 1, denoting the first layer, whose entries are the weights w1,1 up to w3,2. Notice that the biases have now been written as w3,1 and w3,2; this is just for convenience. In the next layer, we also have a matrix, this one W superscript 2 for the second layer. This layer contains the weights that tell us how to combine the linear models in the first layer to obtain the nonlinear model in the second layer.

Now what happens is some math. We have the input in the form (x1, x2, 1), where the one comes from the bias unit. We multiply it by the matrix W1 to get some outputs. Then we apply the sigmoid function to turn the outputs into values between zero and one. Then the vector formed by these values gets a one attached for the bias unit and is multiplied by the second matrix. This returns an output that now gets passed through a sigmoid function to obtain the final output, which is y-hat. Y-hat is the prediction, or the probability that the point is labeled blue.

So this is what neural networks do. They take the input vector and apply a sequence of linear models and sigmoid functions. These maps, when combined, become a highly nonlinear map. And the final formula is simply y-hat equals sigmoid of W2 combined with sigmoid of W1 applied to x.

Just for redundancy, we do this again on a multi-layer perceptron, or neural network. To calculate our prediction y-hat, we start with the input vector x, then we apply the first matrix and a sigmoid function to get the values in the second layer. Then we apply the second matrix and another sigmoid function to get the values in the third layer, and so on and so forth, until we get our final prediction, y-hat. And this is the feedforward process that neural networks use to obtain the prediction from the input vector.
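The feedforward pass just described can be sketched in NumPy. The weight values below are made up for illustration; the shapes follow the network above, with two linear models in the hidden layer and the biases folded into the last row of each matrix:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, W1, W2):
    """Two-layer feedforward: y_hat = sigmoid(W2 . sigmoid(W1 . x)).

    x  -- input vector with the bias 1 already appended, shape (3,)
    W1 -- first-layer weights, shape (3, 2); the last row holds the biases
    W2 -- second-layer weights, shape (3, 1)
    """
    hidden = sigmoid(x @ W1)       # the two linear models, squashed to (0, 1)
    hidden = np.append(hidden, 1)  # attach a 1 for the bias unit
    return sigmoid(hidden @ W2)    # combine them into the final prediction

# Made-up weights, not from the lecture
W1 = np.array([[0.5, -0.2], [0.3, 0.8], [-0.1, 0.4]])
W2 = np.array([[0.6], [0.9], [-0.3]])
y_hat = feedforward(np.array([0.4, 0.7, 1.0]), W1, W2)
```

Whatever the weights, the output is a single number between zero and one: the probability that the point is blue.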

Error Function
Just as before, neural networks will produce an error function, which, in the end, is
what we'll be minimizing. The following video shows the error function for a neural
network.

DL 42 Neural Network Error Function (1)


So, our goal is to train our neural network. In order to do this, we have to define the error function. Let's look again at what the error function was for perceptrons.

Here's our perceptron. On the left, we have our input vector with entries x_1 up to x_n, and a one for the bias unit; and the edges with weights W_1 up to W_n, and b for the bias unit. Finally, we can see that this perceptron uses a sigmoid function, and the prediction is defined as y-hat equals sigmoid of Wx plus b. As we saw, this function gives us a measure of the error, of how badly each point is being classified. Roughly, it is a very small number if the point is correctly classified, and a measure of how far the point is from the line if the point is incorrectly classified.

So, what are we going to do to define the error function in a multilayer perceptron? Well, as we saw, our prediction is simply a combination of matrix multiplications and sigmoid functions. But the error function can be the exact same thing, right? It can be the exact same formula, except now y-hat is just a bit more complicated. And still, this function will tell us how badly a point gets misclassified, except now it's looking at a more complicated boundary.
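The binary log-loss from the previous section can be written out directly. This is a small sketch with made-up numbers; y is the label and y_hat is the network's prediction:

```python
import numpy as np

def log_loss(y, y_hat):
    """Binary log-loss (cross-entropy) for a single prediction.

    Small when y_hat is close to the label y, and large when it is far off,
    which is exactly the behavior described above.
    """
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# A blue point (y = 1): a confident correct prediction gives a low error
good = log_loss(1, 0.9)
bad = log_loss(1, 0.1)  # badly misclassified, so a much larger error
```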
34. Backpropagation
Backpropagation
Now, we're ready to get our hands into training a neural network. For this, we'll use the
method known as backpropagation. In a nutshell, backpropagation will consist of:

 Doing a feedforward operation.
 Comparing the output of the model with the desired output.
 Calculating the error.
 Running the feedforward operation backwards (backpropagation) to spread the error to each
of the weights.
 Using this to update the weights, and get a better model.
 Continuing this until we have a model that is good.

It sounds more complicated than it actually is. Let's take a look in the next few videos. The
first video will show us a conceptual interpretation of what backpropagation is.
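The steps above can be sketched for a single sigmoid perceptron. This is a minimal illustration with made-up data; the multi-layer version developed later follows the same pattern:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up training data: one point with its label
x = np.array([0.5, -0.2, 1.0])   # input with a 1 appended for the bias
y = 1.0                          # the point is blue
w = np.array([0.1, 0.4, -0.3])   # weights, with the bias as the last entry
learnrate = 0.5

for _ in range(100):
    y_hat = sigmoid(w @ x)              # 1. feedforward
    error = y - y_hat                   # 2-3. compare with the label, get the error
    error_term = error * y_hat * (1 - y_hat)  # 4. error scaled by the sigmoid slope
    w += learnrate * error_term * x     # 5. update the weights
```

After enough updates, the prediction for this point moves toward its label of 1, which is the "better model" the list promises.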
So now we're finally ready to get our hands into training a neural network. Let's quickly recall feedforward. We have our perceptron with a point coming in labeled positive, and our equation w1x1 + w2x2 + b, where w1 and w2 are the weights and b is the bias. What the perceptron does is plot the point and return a probability that the point is blue, which in this case is small, since the point is in the red area. Thus, this is a bad perceptron, since it predicts that the point is red when the point is really blue.

Now let's recall what we did in the gradient descent algorithm. We did this thing called backpropagation: we went in the opposite direction. We asked the point, "What do you want the model to do for you?" And the point says, "Well, I am misclassified, so I want this boundary to come closer to me." And we saw that the line got closer to it by updating the weights. Namely, in this case, let's say that it tells the weight w1 to go lower and the weight w2 to go higher. (This is just an illustration; it's not meant to be exact.) So we obtain new weights, w1' and w2', which define a new line which is now closer to the point.

So what we're doing is like descending from Mt. Errorest, right? The height is the error function E(W), and we calculate the gradient of the error function, which is exactly like asking the point what it wants the model to do. As we take a step in the direction of the negative of the gradient, we decrease the error to come down the mountain. This gives us a new error E(W') and a new model W' with a smaller error, which means we get a new line closer to the point. We continue doing this process in order to minimize the error.

So that was for a single perceptron. Now, what do we do for multi-layer perceptrons? Well, we still do the same process of reducing the error by descending from the mountain, except now, since the error function is more complicated, it's not Mt. Errorest; now it's Mt. Kilimanjerror. But it's the same thing: we calculate the error function and its gradient, then walk in the direction of the negative of the gradient in order to find a new model W' with a smaller error E(W'), which will give us a better prediction. And we continue doing this process in order to minimize the error.

So let's look again at what feedforward does in a multi-layer perceptron. The point comes in with coordinates (x1, x2) and label y = 1. It gets plotted in the linear models corresponding to the hidden layer. Then, as this layer gets combined, the point gets plotted in the resulting non-linear model in the output layer. And the probability that the point is blue is obtained by the position of this point in the final model.

Now, pay close attention, because this is the key for training neural networks: it's backpropagation. We'll do as before; we'll check the error. This model is not good, because it predicts that the point will be red when in reality the point is blue. So we'll ask the point, "What do you want this model to do in order for you to be better classified?" And the point says, "I kind of want this blue region to come closer to me."

Now, what does it mean for the region to come closer to it? Well, let's look at the two linear models in the hidden layer. Which one of these two models is doing better? It seems like the top one is badly misclassifying the point, whereas the bottom one is classifying it correctly. So we kind of want to listen to the bottom one more and to the top one less. What we want to do is reduce the weight coming from the top model and increase the weight coming from the bottom model, so that our final model will look a lot more like the bottom model than like the top model.

But we can do even more. We can actually go to the linear models and ask the point, "What can these models do to classify you better?" And the point will say, "Well, the top model is misclassifying me, so I kind of want this line to move closer to me. And the second model is correctly classifying me, so I want this line to move farther away from me." And so this change in the models will actually update the weights; let's say it'll increase these two and decrease these two. So now, after we update all the weights, we have better predictions at all the models in the hidden layer, and also a better prediction at the model in the output layer.

Notice that in this video we intentionally left the bias unit out for clarity. In reality, when we update the weights, we're also updating the bias unit. If you're the kind of person who likes formality, don't worry, we'll calculate these gradients in detail soon.
Backpropagation Math

And the next few videos will go deeper into the math. Feel free to tune out, since this
part gets handled by Keras pretty well. If you'd like to go start training networks right
away, go to the next section. But if you enjoy calculating lots of derivatives, let's dive
in!

In the video below at 1:24, the edges should be directed to the sigmoid function and
not the bias at that last layer; the edges of the last layer currently point to the bias,
which is incorrect.

Chain Rule
We'll need to recall the chain rule to help us calculate derivatives.

Chain Rule
So before we start calculating derivatives, let's do a refresher on the chain rule, which is the main technique we'll use to calculate them.

The chain rule says: if you have a variable x and a function f that you apply to x to get f of x, which we're going to call A, and then another function g, which you apply to f of x to get g of f of x, which we're going to call B, then if you want to find the partial derivative of B with respect to x, that's just the partial derivative of B with respect to A times the partial derivative of A with respect to x. So it literally says that when composing functions, the derivatives just multiply.

And that's going to be super useful for us, because feedforwarding is literally composing a bunch of functions, and backpropagation is literally taking the derivative at each piece. Since taking the derivative of a composition is the same as multiplying the partial derivatives, all we're going to do is multiply a bunch of partial derivatives to get what we want. Pretty simple, right?
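Here is the chain rule in action. The functions f and g below are made-up examples, not from the video; the point is only that the derivative of the composition equals the product of the two derivatives:

```python
# Chain rule sketch: with A = f(x) = x**2 and B = g(A) = 3*A + 1,
# dB/dx = (dB/dA) * (dA/dx) = 3 * 2x for these example functions.

def f(x):
    return x ** 2

def g(a):
    return 3 * a + 1

x = 2.0
analytic = 3 * (2 * x)  # the two derivatives just multiply

# Numerical check by central finite differences on the composition
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
```

The finite-difference estimate agrees with the product of partial derivatives, which is the whole content of the rule.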

So let us go back to our neural network with our weights and our input. Recall that the weights with superscript 1 belong to the first layer, and the weights with superscript 2 belong to the second layer. Also, recall that the bias is not called b anymore; now it is called W31, W32, etc., for convenience, so that we can have everything in matrix notation.

Now, what happens with the input? Let us do the feedforward process. In the first layer, we take the input and multiply it by the weights, and that gives us h1, which is a linear function of the input and the weights. Same thing with h2, given by this formula over here. In the second layer, we take h1 and h2 and the new bias, apply the sigmoid function, and then apply a linear function to them by multiplying them by the weights and adding them, to get a value h. And finally, in the third layer, we just take the sigmoid function of h to get our prediction, or probability between 0 and 1, which is ŷ. We can write this in more condensed notation by saying that the matrix corresponding to the first layer is W superscript 1, the matrix corresponding to the second layer is W superscript 2, and the prediction is just the sigmoid of W superscript 2 combined with the sigmoid of W superscript 1 applied to the input x. That is feedforward.

Now we are going to develop backpropagation, which is precisely the reverse of feedforward. We are going to calculate the derivative of the error function with respect to each of the weights, by using the chain rule. So let us recall that our error function is this formula over here, which is a function of the prediction ŷ. But since the prediction is a function of all the weights wij, the error function can be seen as a function of all the wij. Therefore, the gradient is simply the vector formed by all the partial derivatives of the error function E with respect to each of the weights.

So let us calculate one of these derivatives: the derivative of E with respect to W11 superscript 1. Since the prediction is simply a composition of functions, by the chain rule we know that this derivative is the product of the partial derivatives. In this case, the derivative of E with respect to W11 is the derivative of E with respect to ŷ, times the derivative of ŷ with respect to h, times the derivative of h with respect to h1, times the derivative of h1 with respect to W11. This may seem complicated, but the fact that we can calculate the derivative of such a complicated composition of functions by just multiplying four partial derivatives is remarkable.

Now, we have already calculated the first one, the derivative of E with respect to ŷ, and if you remember, we got ŷ minus y. So let us calculate the other ones. Let us zoom in a bit and look at just one piece of our multi-layer perceptron. The inputs are some values h1 and h2, which are values coming in from before. Once we apply the sigmoid and a linear function on h1 and h2 and the 1 corresponding to the bias unit, we get a result h. So now, what is the derivative of h with respect to h1? Well, h is a sum of three things, and only one of them contains h1. So the second and third summands just give us a derivative of 0, and the first summand gives us W11 superscript 2, because that is a constant, times the derivative of the sigmoid function with respect to h1. This is something that we calculate below in the instructor comments, namely that the sigmoid function has a beautiful derivative: the derivative of sigmoid of h is precisely sigmoid of h times 1 minus sigmoid of h. Again, you can see this development underneath in the instructor comments. You'll also have the chance to code this in the quiz, because at the end of the day, we just code these formulas and then use them forever, and that is it. That is how you train a neural network.

Calculation of the derivative of the sigmoid function

Recall that the sigmoid function has a beautiful derivative, which we can see in the
following calculation. This will make our backpropagation step much cleaner.
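The identity σ'(h) = σ(h)(1 − σ(h)) can also be checked numerically, as in this small sketch:

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    """The sigmoid's beautiful derivative: sigma(h) * (1 - sigma(h))."""
    s = sigmoid(h)
    return s * (1 - s)

# Central finite-difference check at h = 0.5
eps = 1e-6
numeric = (sigmoid(0.5 + eps) - sigmoid(0.5 - eps)) / (2 * eps)
```

At h = 0, the derivative is exactly 0.5 · 0.5 = 0.25, the sigmoid's steepest slope.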
35. Pre-Lab: Analyzing Student Data
Lab: Analyzing Student Data
Now, we're ready to put neural networks into practice. We'll analyze a dataset of student
admissions at UCLA.

To open this notebook, you have two options:

 Go to the next page in the classroom (recommended).


 Clone the repo from Github and open the notebook StudentAdmissions.ipynb in
the student_admissions folder. You can either download the repository with git clone
https://github.com/udacity/deep-learning.git, or download it as an archive file
from this link.

Instructions
In this notebook, you'll be implementing some of the steps in the training of the neural
network, namely:

 One-hot encoding the data


 Scaling the data
 Writing the backpropagation step

This is a self-assessed lab. If you need any help or want to check your answers, feel free to
check out the solutions notebook in the same folder, or by clicking here.
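As a preview of the first two steps, one-hot encoding and scaling can look like the sketch below. The arrays are hypothetical stand-ins; the notebook's actual column names and values may differ:

```python
import numpy as np

# Hypothetical "rank" column with categories 1-4, like the admissions data
rank = np.array([1, 3, 2, 4])

# One-hot encoding: one column per category, a single 1 per row
one_hot = np.eye(4)[rank - 1]

# Scaling: squash a feature such as GRE scores into [0, 1]
gre = np.array([320.0, 300.0, 340.0])
gre_scaled = gre / gre.max()
```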

36. Notebook: Analyzing Student Data


Great job!
You now know how neural networks work and how they get trained. In the next
lesson, Mat will guide you through implementing this training process in NumPy. See
you soon!

Implementing Gradient Descent

 Back to Home

 01. Mean Squared Error Function


 02. Gradient Descent
 03. Gradient Descent: The Math
 04. Gradient Descent: The Code
 05. Implementing Gradient Descent
 06. Multilayer Perceptrons
 07. Backpropagation
 08. Implementing Backpropagation
 09. Further Reading
Log-Loss vs Mean Squared Error
In the previous section, Luis taught you about the log-loss function. There are many
other error functions used for neural networks. Let me teach you another one, called
the mean squared error. As the name says, this one is the mean of the squares of the
differences between the predictions and the labels. In the following section I'll go over
it in detail, then we'll get to implement backpropagation with it on the same student
admissions dataset.

And as a bonus, we'll be implementing this in a very effective way using matrix
multiplication with NumPy!
02. Gradient Descent
Gradient Descent with Squared Errors
We want to find the weights for our neural networks. Let's start by thinking about the goal.
The network needs to make predictions as close as possible to the real values. To measure this,
we use a metric of how wrong the predictions are, the error. A common metric is the sum of
the squared errors (SSE):

E = \frac{1}{2}\sum_{\mu} \sum_j \left[ y^{\mu}_j - \hat{y}^{\mu}_j \right]^2

where \hat{y} is the prediction and y is the true value, and you take the sum over all output
units j and another sum over all data points \mu. This might seem like a really complicated
equation at first, but it's fairly simple once you understand the symbols and can say what's
going on in words.
First, the inside sum over j. This variable j represents the output units of the network. So this
inside sum is saying: for each output unit, find the difference between the true value y and the
predicted value from the network \hat{y}, then square the difference, then sum up all those
squares.
Then the other sum over \mu is a sum over all the data points. So, for each data point you
calculate the inner sum of the squared differences for each output unit. Then you sum up those
squared differences for each data point. That gives you the overall error for all the output
predictions for all the data points.
The SSE is a good choice for a few reasons. The square ensures the error is always positive
and larger errors are penalized more than smaller errors. Also, it makes the math nice, always
a plus.

Remember that the output of a neural network, the prediction, depends on the weights

\hat{y}^{\mu}_j = f \left( \sum_i{ w_{ij} x^{\mu}_i }\right)

and accordingly the error depends on the weights

E = \frac{1}{2}\sum_{\mu} \sum_j \left[ y^{\mu}_j - f \left( \sum_i{ w_{ij} x^{\mu}_i }\right) \right]^2

We want the network's prediction error to be as small as possible and the weights are the
knobs we can use to make that happen. Our goal is to find weights w_{ij} that minimize
the squared error E. To do this with a neural network, typically you'd use gradient descent.
Enter Gradient Descent
As Luis said, with gradient descent, we take multiple small steps towards our goal. In
this case, we want to change the weights in steps that reduce the error. Continuing
the analogy, the error is our mountain and we want to get to the bottom. Since the
fastest way down a mountain is in the steepest direction, the steps taken should be in
the direction that minimizes the error the most. We can find this direction by
calculating the gradient of the squared error.

Gradient is another term for rate of change or slope. If you need to brush up on this
concept, check out Khan Academy's great lectures on the topic.
The gradient is just a derivative generalized to functions with more than one variable.
We can use calculus to find the gradient at any point in our error function, which
depends on the input weights. You'll see how the gradient descent step is derived on
the next page.

Below I've plotted an example of the error of a neural network with two inputs, and
accordingly, two weights. You can read this like a topographical map where points on
a contour line have the same error and darker contour lines correspond to larger
errors.

At each step, you calculate the error and the gradient, then use those to determine
how much to change each weight. Repeating this process will eventually find weights
that are close to the minimum of the error function, the black dot in the middle.
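That loop can be sketched in a few lines of Python. This is just an illustration with a made-up one-dimensional error function E(w) = (w - 3)^2, not the network's error; the minimum at w = 3 and the learning rate 0.1 are arbitrary choices:

```python
# Toy error function E(w) = (w - 3)**2 and its gradient dE/dw = 2*(w - 3)
def gradient(w):
    return 2 * (w - 3)

w = 0.0            # initial weight
learnrate = 0.1    # step size

for step in range(100):
    # Step in the direction that decreases the error
    w -= learnrate * gradient(w)

print(w)  # close to 3.0, the minimum of the error function
```

Each iteration shrinks the distance to the minimum by a constant factor, so the weight settles near w = 3 after enough steps.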
Caveats
Since the weights will just go wherever the gradient takes them, they can end up
where the error is low, but not the lowest. These spots are called local minima. If the
weights are initialized with the wrong values, gradient descent could lead the weights
into a local minimum, illustrated below.

There are methods to avoid this, such as using momentum.
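As a rough sketch of the momentum idea (the coefficient 0.9 below is just a common default, not something prescribed by this lesson): each update keeps a fraction of the previous step as a velocity, which can carry the weights through shallow local minima instead of stopping in them. Again using the toy error E(w) = (w - 3)^2:

```python
def gradient(w):
    # Gradient of the toy error E(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0
learnrate = 0.1
momentum = 0.9   # fraction of the previous step to carry forward
velocity = 0.0

for step in range(200):
    # The velocity accumulates past gradients, giving the update inertia
    velocity = momentum * velocity - learnrate * gradient(w)
    w += velocity

print(w)  # settles near 3.0, after some overshooting along the way
```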


03. Gradient Descent: The Math
Gradient Descent-Math
INSTRUCTOR NOTE:

Notes

Check out Khan Academy's Multivariable calculus lessons if you are unfamiliar with the
subject.

import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

# The linear combination performed by the node (h in f(h) and f'(h))
h = x[0]*weights[0] + x[1]*weights[1]
# or h = np.dot(x, weights)

# The neural network output (y-hat)
nn_output = sigmoid(h)

# Output error (y - y-hat)
error = y - nn_output

# Output gradient (f'(h))
output_grad = sigmoid_prime(h)

# Error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step
del_w = [learnrate * error_term * x[0],
         learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
Start Quiz:
gradient.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
### fewer variable names than in the above sample code

# TODO: Calculate the node's linear combination of inputs and weights
h = None

# TODO: Calculate output of neural network
nn_output = None

# TODO: Calculate error of neural network
error = None

# TODO: Calculate the error term
# Remember, this requires the output gradient, which we haven't
# specifically added a variable for.
error_term = None

# TODO: Calculate change in weights
del_w = None

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)
Start Quiz:
gradient.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
### fewer variable names than in the above sample code

# Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)

# Calculate output of neural network
nn_output = sigmoid(h)

# Calculate error of neural network
error = y - nn_output

# Calculate the error term
# Remember, this requires the output gradient, which we haven't
# specifically added a variable for.
error_term = error * sigmoid_prime(h)
# Note: The sigmoid_prime function calculates sigmoid(h) twice,
# but you've already calculated it once. You can make this
# code more efficient by calculating the derivative directly
# rather than calling sigmoid_prime, like this:
# error_term = error * nn_output * (1 - nn_output)

# Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)
The goal here is to predict if a student will be admitted to a graduate program based
on these features. For this, we'll use a network with one output layer with one unit.
We'll use a sigmoid function for the output unit activation.

Data cleanup
You might think there will be three input units, but we actually need to transform the
data first. The  rank  feature is categorical: the numbers don't encode any sort of
relative values. Rank 2 is not twice as much as rank 1, and rank 3 is not 1.5 times rank
2. Instead, we need to use dummy variables to encode  rank , splitting the data into
four new columns encoded with ones or zeros. Rows with rank 1 have one in the rank
1 dummy column, and zeros in all other columns. Rows with rank 2 have one in the
rank 2 dummy column, and zeros in all other columns. And so on.
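A minimal sketch of that encoding with pandas, using a made-up rank column (the full preparation lives in the  data_prep.py  file below):

```python
import pandas as pd

# A made-up rank column standing in for the admissions data
df = pd.DataFrame({'rank': [1, 2, 3, 4, 2]})

# One dummy column per rank value, filled with ones and zeros
dummies = pd.get_dummies(df['rank'], prefix='rank')
print(dummies)
```

Each row has a one in exactly one of the four rank_1 through rank_4 columns, so no ordering between ranks is implied.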

We'll also need to standardize the GRE and GPA data, which means to scale the values
such that they have zero mean and a standard deviation of 1. This is necessary
because the sigmoid function squashes really small and really large inputs. The
gradient of really small and large inputs is zero, which means that the gradient
descent step will go to zero too. Since the GRE and GPA values are fairly large, we
have to be really careful about how we initialize the weights or the gradient descent
steps will die off and the network won't train. Instead, if we standardize the data, we
can initialize the weights easily and everyone is happy.

This is just a brief run-through, you'll learn more about preparing data later. If you're
interested in how I did this, check out the  data_prep.py  file in the programming
exercise below.

Now that the data is ready, we see that there are six input features:  gre ,  gpa , and the
four  rank  dummy variables.

Mean Square Error


We're going to make a small change to how we calculate the error here. Instead of the
SSE, we're going to use the mean of the squared errors (MSE). Now that we're using a
lot of data, summing up all the weight steps can lead to really large updates that
make the gradient descent diverge. To compensate for this, you'd need to use a quite
small learning rate. Instead, we can just divide by the number of records in our
data, m, to take the average. This way, no matter how much data we use, our
learning rates will typically be in the range of 0.01 to 0.001. Then, we can use the MSE
(shown below) to calculate the gradient and the result is the same as before, just
averaged instead of summed.
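As a small illustration with made-up predictions and targets, dividing the summed error by the number of records m gives the averaged version:

```python
import numpy as np

# Made-up predictions and targets for three records
predictions = np.array([0.2, 0.6, 0.9])
targets = np.array([0.0, 1.0, 1.0])

# Sum of squared errors: grows with the number of records
sse = 0.5 * np.sum((targets - predictions) ** 2)

# Mean squared error: divide by the number of records m,
# so the scale is independent of the dataset size
m = len(targets)
mse = sse / m

print(sse, mse)
```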
Programming exercise
Below, you'll implement gradient descent and train the network on the admissions
data. Your goal here is to train the network until you reach a minimum in the mean
square error (MSE) on the training set. You need to implement:

 The network output:  output .


 The output error:  error .
 The error term:  error_term .
 Update the weight step:  del_w += .
 Update the weights:  weights += .

After you've written these parts, run the training by pressing "Test Run". The MSE will
print out, as well as the accuracy on a test set, the fraction of correctly predicted
admissions.

Feel free to play with the hyperparameters and see how it changes the MSE.
Start Quiz:
gradient.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# TODO: We haven't provided the sigmoid_prime function like we did in
# the previous lesson to encourage you to come up with a more
# efficient solution. If you need a hint, check out the comments
# in solution.py from the previous lecture.

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Note: We haven't included the h variable from the previous
        # lesson. You can add it if you want, or you can calculate
        # the h together with the output

        # TODO: Calculate the output
        output = None

        # TODO: Calculate the error
        error = None

        # TODO: Calculate the error term
        error_term = None

        # TODO: Calculate the change in weights for this sample
        # and add it to the total weight change
        del_w += 0

    # TODO: Update weights using the learning rate and the average
    # change in weights
    weights += 0

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
Start Quiz:
gradient.py data_prep.py binary.csv solution.py
import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'],
                                             prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:, field] = (data[field] - mean) / std

# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9),
                          replace=False)
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
Start Quiz:
gradient.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Note: We haven't provided the sigmoid_prime function like we did in
# the previous lesson to encourage you to come up with a more
# efficient solution. If you need a hint, check out the comments
# in solution.py from the previous lecture.

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Activation of the output unit
        # Notice we multiply the inputs and the weights here
        # rather than storing h as a separate variable
        output = sigmoid(np.dot(x, weights))

        # The error, the target minus the network output
        error = y - output

        # The error term
        # Notice we calculate f'(h) here instead of defining a separate
        # sigmoid_prime function. This just makes it faster because we
        # can re-use the result of the sigmoid function stored in
        # the output variable
        error_term = error * output * (1 - output)

        # The gradient descent step, the error times the gradient
        # times the inputs
        del_w += error_term * x

    # Update the weights here. The learning rate times the
    # change in weights, divided by the number of records to average
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
06. Multilayer Perceptrons
Implementing the hidden layer
Prerequisites
Below, we are going to walk through the math of neural networks in a multilayer
perceptron. With multiple perceptrons, we are going to move to using vectors and
matrices. To brush up, be sure to view the following:

1. Khan Academy's introduction to vectors.


2. Khan Academy's introduction to matrices.

Derivation
Before, we were dealing with only one output node, which made the code
straightforward. However, now that we have multiple input units and multiple hidden
units, the weights between them will require two indices: w_{ij}, where i denotes
input units and j denotes hidden units.
For example, the following image shows our network, with its input units labeled x_1,
x_2, and x_3, and its hidden nodes labeled h_1 and h_2:

The lines indicating the weights leading to h_1 have been colored differently from
those leading to h_2 just to make it easier to read.
Now to index the weights, we take the input unit number for the i and the hidden
unit number for the j. That gives us

w_{11}

for the weight leading from x_1 to h_1, and

w_{12}

for the weight leading from x_1 to h_2.

The following image includes all of the weights between the input layer and the
hidden layer, labeled with their appropriate w_{ij} indices:
Start Quiz:
multilayer.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)

weights_input_to_hidden = np.random.normal(0, scale=0.1,
                                           size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1,
                                            size=(N_hidden, N_output))

# TODO: Make a forward pass through the network
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output:')
print(hidden_layer_out)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)

07. Backpropagation
Backpropagation
Now we've come to the problem of how to make a multilayer neural network learn.
Before, we saw how to update weights with gradient descent. The backpropagation
algorithm is just an extension of that, using the chain rule to find the error with
respect to the weights connecting the input layer to the hidden layer (for a two layer
network).

To update the weights to hidden layers using gradient descent, you need to know
how much error each of the hidden units contributed to the final output. Since the
output of a layer is determined by the weights between layers, the error resulting
from units is scaled by the weights going forward through the network. Since we
know the error at the output, we can use the weights to work backwards to hidden
layers.

For example, in the output layer, you have errors \delta^o_k attributed to each
output unit k. Then, the error attributed to hidden unit j is the output errors, scaled
by the weights between the output and hidden layers (and the gradient):

\delta^h_j = \sum_k \delta^o_k W_{jk} \, f'(h_j)

Then, the gradient descent step is the same as before, just with the new errors:

\Delta w_{ij} = \eta \, \delta^h_j x_i

where w_{ij} are the weights between the inputs and hidden layer and x_i are the
input unit values. This form holds for however many layers there are. The weight steps
are equal to the step size times the output error of the layer times the values of the
inputs to that layer:

\Delta w = \eta \, \delta_{output} V_{in}

Here, you get the output error, \delta_{output}, by propagating the errors
backwards from higher layers. And the input values, V_{in}, are the inputs to the
layer, the hidden layer activations to the output unit for example.
Working through an example

Let's walk through the steps of calculating the weight updates for a simple two layer
network. Suppose there are two input values, one hidden unit, and one output unit,
with sigmoid activations on the hidden and output units. The following image depicts
this network. (Note: the input values are shown as nodes at the bottom of the image,
while the network's output value is shown as \hat{y} at the top. The inputs
themselves do not count as a layer, which is why this is considered a two layer
network.)
Very Important
It turns out this is exactly how we want to calculate the weight update step. As before,
if you have your inputs as a 2D array with one row, you can also
do  hidden_error*inputs.T , but that won't work if  inputs  is a 1D array.
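A small sketch of that shape issue, with made-up numbers: for a 1D input array, adding a new axis with  x[:, None]  turns it into a column, so the multiplication broadcasts into a matrix with one entry per input-to-hidden weight:

```python
import numpy as np

x = np.array([0.5, 0.1, -0.2])             # 1D inputs, shape (3,)
hidden_error_term = np.array([0.1, -0.3])  # one value per hidden unit, shape (2,)

# x[:, None] has shape (3, 1), so broadcasting against the (2,) error
# terms gives a (3, 2) array: one weight step for every weight in a
# 3-input, 2-hidden-unit layer
del_w = x[:, None] * hidden_error_term

print(del_w.shape)  # (3, 2)
```

This is the same broadcasting pattern used for  delta_w_i_h  in the exercise solution, just without the learning rate.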

Backpropagation exercise
Below, you'll implement the code to calculate one backpropagation update step for
two sets of weights. I wrote the forward pass - your goal is to code the backward pass.

Things to do

 Calculate the network's output error.


 Calculate the output layer's error term.
 Use backpropagation to calculate the hidden layer's error term.
 Calculate the change in weights (the delta weights) that result from propagating the
errors back through the network.

Start Quiz:
backprop.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate output error
error = None

# TODO: Calculate error term for output layer
output_error_term = None

# TODO: Calculate error term for hidden layer
hidden_error_term = None

# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = None

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = None

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)
Start Quiz:
backprop.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
# Calculate output error
error = target - output

# Calculate error term for output layer
output_error_term = error * output * (1 - output)

# Calculate error term for hidden layer
hidden_error_term = np.dot(output_error_term, weights_hidden_output) * \
                    hidden_layer_output * (1 - hidden_layer_output)

# Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)
08. Implementing Backpropagation
Implementing backpropagation
Now we've seen that the error term for the output layer is

\delta_k = (y_k - \hat{y}_k) f'(a_k)

and the error term for the hidden layer is

\delta^h_j = \sum_k \delta_k W_{jk} \, f'(h_j)

For now we'll only consider a simple network with one hidden layer and one output unit.
Here's the general algorithm for updating the weights with backpropagation:

 Set the weight steps for each layer to zero:
o The input to hidden weights: \Delta w_{ij} = 0
o The hidden to output weights: \Delta W_j = 0
 For each record in the training data:
o Make a forward pass through the network, calculating the output \hat{y}
o Calculate the error gradient in the output unit, \delta^o = (y - \hat{y}) f'(z), where z = \sum_j W_j a_j is the input to the output unit.
o Propagate the errors to the hidden layer: \delta^h_j = \delta^o W_j f'(h_j)
o Update the weight steps:
 \Delta W_j = \Delta W_j + \delta^o a_j
 \Delta w_{ij} = \Delta w_{ij} + \delta^h_j a_i
 Update the weights, where \eta is the learning rate and m is the number of records:
o W_j = W_j + \eta \Delta W_j / m
o w_{ij} = w_{ij} + \eta \Delta w_{ij} / m
 Repeat for e epochs.
Backpropagation exercise
Now you're going to implement the backprop algorithm for a network trained on the graduate
school admission data. You should have everything you need from the previous exercises to
complete this one.

Your goals here:

 Implement the forward pass.


 Implement the backpropagation algorithm.
 Update the weights.

Start Quiz:
backprop.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = None
        hidden_output = None
        output = None

        ## Backward pass ##
        # TODO: Calculate the network's prediction error
        error = None

        # TODO: Calculate error term for the output unit
        output_error_term = None

        ## propagate errors to hidden layer

        # TODO: Calculate the hidden layer's contribution to the error
        hidden_error = None

        # TODO: Calculate the error term for the hidden layer
        hidden_error_term = None

        # TODO: Update the change in weights
        del_w_hidden_output += 0
        del_w_input_hidden += 0

    # TODO: Update weights (don't forget to divide by n_records or
    # number of samples)
    weights_input_hidden += 0
    weights_hidden_output += 0

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(features, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output, weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
INSTRUCTOR NOTE:

Note: This code takes a while to execute, so Udacity's servers sometimes return with an error
saying it took too long. If that happens, it usually works if you try again.
Start Quiz:
backprop.py data_prep.py binary.csv solution.py
import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'],
                                             prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:, field] = (data[field] - mean) / std

# Split off random 10% of the data for testing
np.random.seed(21)
sample = np.random.choice(data.index, size=int(len(data)*0.9),
                          replace=False)
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
Start Quiz:
backprop.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)
        output = sigmoid(np.dot(hidden_output, weights_hidden_output))

        ## Backward pass ##
        # Calculate the network's prediction error
        error = y - output

        # Calculate error term for the output unit
        output_error_term = error * output * (1 - output)

        ## propagate errors to hidden layer
        # Calculate the hidden layer's contribution to the error
        hidden_error = np.dot(output_error_term, weights_hidden_output)

        # Calculate the error term for the hidden layer
        hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)

        # Update the change in weights
        del_w_hidden_output += output_error_term * hidden_output
        del_w_input_hidden += hidden_error_term * x[:, None]

    # Update weights
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(features, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output, weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
09. Further Reading
Further reading
Backpropagation is fundamental to deep learning. TensorFlow and other libraries will
perform the backprop for you, but you should really really understand the algorithm.
We'll be going over backprop again, but here are some extra resources for you:

Very important

 From Andrej Karpathy: Yes, you should understand backprop

 Also from Andrej Karpathy, a lecture from Stanford's CS231n course


Part 02-Module 01-Lesson 04_GPU Workspaces Demo
01. Introduction to GPU Workspaces

## Introduction
Udacity Workspaces with GPU support are available for some projects as an alternative to
manually configuring your own remote server with GPU support. These workspaces provide a
Jupyter notebook server directly in your browser. This lesson will briefly introduce the
Workspaces interface.

Important Notes:

 Workspaces sessions are connections from your browser to a remote server. Each student has
a limited number of GPU hours allocated on the servers (the allocation is significantly more
than completing the projects is expected to take). There is currently no limit on the number of
Workspace hours when GPU mode is disabled.
 Workspace data stored in the user's home folder is preserved between sessions (and can be
reset as needed, e.g., to get project updates).
 Only 3 gigabytes of data can be stored in the home folder.
 Workspace sessions are preserved if your connection drops or your browser window is closed; simply return to the classroom and re-open the workspace page. However, workspace sessions are automatically terminated after a period of inactivity. This prevents you from leaving a session connection open and burning through your time allocation. (See the section on active connections below.)
 The kernel state is preserved as long as the notebook session remains open, but it
is not preserved if the session is closed. If you exit the notebook for more than half an hour
and the session is closed, you will need to re-run any previously-run cells before continuing.

## Overview
The default workspaces interface

When the workspace opens, you'll see the normal Jupyter file browser. From this
interface you can open a notebook file, start a remote terminal session, enable the
GPU, submit your project, or reset the workspace data, and more. Clicking the three
bars in the top left corner above the Jupyter logo will toggle hiding the classroom
lessons sidebar.

NOTE: You can always return to the file browser page from anywhere else in the
workspace by clicking the Jupyter logo in the top left corner.

## Opening a notebook
View of the project notebook

Clicking the name of a notebook (*.ipynb) file in the file list will open a standard
Jupyter notebook view of the project. The notebook session will remain open as long
as you are active, and will be automatically terminated after 30 minutes of inactivity.

You can exit a notebook by clicking on the Jupyter logo in the top left corner.

NOTE: Notebooks continue to run in the background unless they are stopped. IF
GPU MODE IS ACTIVE, IT WILL REMAIN ACTIVE AFTER CLOSING OR STOPPING A
NOTEBOOK. YOU CAN ONLY STOP GPU MODE WITH THE GPU TOGGLE BUTTON.
(See next section.)
## Enabling GPU Mode

The GPU Toggle Button

GPU Workspaces can also be run without time restrictions when the GPU mode is
disabled. The "Enable"/"Disable" button (circled in red in the image) can be used to
toggle GPU mode. NOTE: Toggling GPU support may switch the physical server
your session connects to, which can cause data loss UNLESS YOU CLICK THE
SAVE BUTTON BEFORE TOGGLING GPU SUPPORT.

ALWAYS SAVE YOUR CHANGES BEFORE TOGGLING GPU SUPPORT.

## Keeping Your Session Active


Workspaces automatically disconnect after 30 minutes of user inactivity—which
means that workspaces can disconnect during long-running tasks (like training neural
networks). We have provided a utility that can keep your workspace sessions active for
these tasks. However, keep the following guidelines in mind:

 Do not try to permanently hold the workspace session active when you do not have a
process running (e.g., do not try to hold the session open in the background)—the
limits are in place to preserve your GPU time allocation; there is no guarantee that
you'll receive additional time if you exceed the limit.
 Make sure that you save the results of the long running task to disk as soon as the
task ends (e.g., checkpoint your model parameters for deep learning networks);
otherwise the workspace will disconnect 30 minutes after the active process ends, and
the results will be lost.

The  workspace_utils.py  module (available here) includes an iterator wrapper called  keep_awake  and a context manager called  active_session  that can be used to maintain an active session during long-running processes. The two functions are equivalent, so use whichever fits better in your code. NOTE: The file may be incorrectly downloaded as  workspace-utils.py  (note the dash instead of an underscore in the filename). Make sure to correct the filename before uploading to your workspace; Python cannot import from file names that include hyphens.

Example using  keep_awake :

from workspace_utils import keep_awake

for i in keep_awake(range(5)):
    # anything that happens inside this loop will keep the workspace active
    # do iteration with lots of work here
Example using  active_session :

from workspace_utils import active_session

with active_session():
    # do long-running work here
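For intuition, here is a rough sketch of how an iterator wrapper like  keep_awake  could be structured. This is not Udacity's actual implementation — the real module sends keep-alive requests to the workspace server — so the  ping  callback below is a stand-in for that request:

```python
import time
from contextlib import contextmanager

def keep_awake(iterable, ping=lambda: None):
    """Yield items from iterable, calling ping() before each one.
    In the real module, ping would send a keep-alive request."""
    for item in iterable:
        ping()
        yield item

@contextmanager
def active_session(ping=lambda: None):
    """Context-manager counterpart; this sketch pings once on entry."""
    ping()
    yield

# The loop body runs as usual while the pings keep the session alive
pings = []
for i in keep_awake(range(5), ping=lambda: pings.append(time.time())):
    pass
```

The key design point is that the keep-alive work is interleaved with your own loop iterations, so you never have to manage a separate background process yourself.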
## Submitting a Project
The Submit Project Button

Some workspaces are able to directly submit projects on your behalf (i.e., you
do not need to manually submit the project in the classroom). To submit your project,
simply click the "Submit Project" button (circled in red in the above image).

If you do not see the "Submit Project" button, then project submission is not enabled
for that workspace. You will need to manually download your project files and submit
them in the classroom.

NOTE: YOU MUST ENSURE THAT YOUR SUBMISSION INCLUDES ALL REQUIRED
FILES BEFORE SUBMITTING -- INCLUDING ANY FILE CONVERSIONS (e.g., from
ipynb to HTML)

## Opening a Terminal
The "New" menu button

Jupyter workspaces support several views, including the file browser and notebook
view already covered, as well as shell terminals. To open a terminal shell, click the
"New" menu button at the top right of the file browser view and select "Terminal".

## Terminals
Jupyter terminal shell interface

Terminals provide a full Bash shell that you can use to install or update software
packages, fetch updates from github repositories, or run any other terminal
commands. As with the notebook view, you can return to the file browser view by
clicking on the Jupyter logo at the top left corner of the window.

NOTE: Your data & changes are persistent across workspace sessions. Any
changes you make will need to be repeated if you later reset your workspace
data.

## Resetting Data
The Menu Button

The "Menu" button in the bottom left corner provides support for resetting your
Workspaces. The "Refresh Workspace" button will refresh your session, which has no
effect on the changes you've made in the workspace.

The "Reset Data" button discards all changes and restores a clean copy of the
workspace. Clicking the button will open a dialog that requires you to type "Reset
data" in a confirmation dialog. ALL OF YOUR DATA WILL BE LOST.

Resetting should only be required if Udacity makes changes to the project and you
can't get them via  git pull , or if you destroy the contents of the workspace. If you
do need to reset your data, you are strongly encouraged to download a copy of your
work from the file interface before clicking Reset Data.

02. Workspace Playground


Try it out!
There is an empty workspace in the next module that you can use to explore the workspaces
interface. The GPU time allocation in this notebook is shared with all others throughout the
term, but you can use this playground to experiment with the interface.

THE PLAYGROUND MAY NOT SUPPORT ALL PROJECTS. FOLLOW THE


INSTRUCTIONS FOR EACH PROJECT TO COMPLETE AND SUBMIT THEM. In
other words, if you're working on a project that doesn't have an associated workspace, then
there is no expectation for this playground to support that project.

Project: Predicting Bike Sharing Data

 Back to Home

 01. Introduction to the Project


 02. Project Workspace
 Project Description - Your first neural network
 Project Rubric - Your first neural network
02. Project Workspace
Workspace

This section contains a workspace (such as a Jupyter Notebook workspace or an online code editor workspace) that cannot be automatically reproduced here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity uploads the workspace files to https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
Your first neural network
Your First Neural Network
Introduction

In this project, you'll get to build a neural network from scratch to carry out a prediction
problem on a real dataset! By building a neural network from the ground up, you'll have a
much better understanding of gradient descent, backpropagation, and other concepts that are important to know before we move to higher-level tools such as TensorFlow. You'll also get to see how to apply these networks to solve real prediction problems!

The data comes from the UCI Machine Learning Database.

Instructions

1. Download the project materials from our GitHub repository. You can download the
repository with  git clone https://github.com/udacity/deep-learning.git . Our
files in the GitHub repo are the most up to date, so it's the best place to get the project files.
2. cd into the  first-neural-network  directory.
3. Download anaconda or miniconda based on the instructions in the Anaconda lesson.
4. Create a new conda environment:

conda create --name dlnd python=3

5. Enter your new environment:


o Mac/Linux:  >> source activate dlnd
o Windows:  >> activate dlnd
6. Ensure you have  numpy ,  matplotlib ,  pandas , and  jupyter notebook  installed by
doing the following:

conda install numpy matplotlib pandas jupyter notebook

7. Run the following to open up the notebook server:

jupyter notebook

8. In your browser, open  Your_first_neural_network.ipynb


9. Follow the instructions in the notebook; they will lead you through the project. You'll
ultimately be editing the  my_answers.py  python file, whose components are imported into
the notebook at various places.
10. Ensure you've passed the unit tests in the notebook and have taken a look at the
rubric before you submit the project!

If you need help running the notebook file, check out the Jupyter notebook lesson.
Submission

Before submitting your solution to a reviewer, you are required to submit your project to
Udacity's Project Assistant, which will provide some initial feedback. It will give you
feedback within a minute or two on whether your project will meet all specifications.
It is possible to submit projects which do not pass all tests; you can expect to get feedback
from your Udacity reviewer on these within 3-4 days.

The setup for the project assistant is simple. If you have not installed the client tool from a
different Nanodegree program already, then you may do so with the command  pip install
udacity-pa .

To submit your code to the project assistant, run  udacity submit  from within the top-level
directory of the project. You will be prompted for a username and password. If you login using
google or facebook, visit this link for alternate login instructions.

This process will create a zipfile in your top-level directory named  first_neural_network-
result-.zip , where there will be a number between  result-  and  .zip . This is the file that
you should submit to the Udacity reviews system.

Upload that file into the system and hit Submit Project below!

If you run into any issues using the project assistant, please check this page to troubleshoot;
feel free to post your problem in Knowledge if it isn't covered by one of the displayed cases!

What to do afterwards

If you're waiting for new content or to get the review back, here's a great video from Frank
Chen about the history of deep learning. It's a 45 minute video, sort of a short documentary,
starting in the 1950s and bringing us to the current boom in deep learning and artificial
intelligence.
Your first neural network
Code Functionality

 All code works appropriately and passes all unit tests: all the code in the notebook runs without errors, and all unit tests pass.
 Sigmoid activation function: the sigmoid activation function is implemented correctly.

Forward Pass

 Forward Pass - Training: the forward pass is correctly implemented for the network's training.
 Forward Pass - Run: the run method correctly produces the desired regression output for the network.

Backward Pass

 Batch Weight Change: the network correctly implements the backward pass for each batch, correctly updating the weight change.
 Updating the weights: updates to both the input-to-hidden and hidden-to-output weights are implemented correctly.

Hyperparameters

 Number of epochs: the number of epochs is chosen such that the network is trained well enough to accurately model the data.
 Number of hidden units: the number of hidden units is chosen such that the network is able to accurately predict without overfitting.
 Learning rate: the learning rate is chosen such that the network successfully converges, but is still time-efficient.
 Output Nodes: the number of output nodes is properly selected to solve the desired problem.
 Final Results: the training loss is below 0.09 and the validation loss is below 0.18.
Part 02-Module 01-Lesson 06_Sentiment Analysis
03. Materials
Materials
As you follow along this lesson, it's extremely important that you open the Jupyter notebook
and attempt the exercises. Much of the value in this experience will come from seeing how
your solution is different from Andrew's and playing around with the code in your own way.
Make this lesson count!

Workspace
The best way to open the notebook is to click here, which will open it in a new window. We
recommend you to work on the notebook in that window, and watch the videos in this one.
You can also get to the notebook by clicking the "Next" button in the classroom.

If you want to download the notebooks yourself, you can clone them from our GitHub
repository. You can either download the repository with  git clone
https://github.com/udacity/deep-learning.git , or download it as an archive file
from this link.

This lesson uses the following files:

 Sentiment_Classification_Projects.ipynb  - a notebook you will use to follow along and work on the lesson mini projects.
 Sentiment_Classification_Solutions.ipynb  - a notebook that includes Andrew’s
solutions to the lesson projects, which you can use for reference
 A notebook for the solution for each mini project.
 reviews.txt  - a collection of 25 thousand movie reviews
 labels.txt  - positive/negative sentiment labels for the associated reviews in reviews.txt

Note: the notebooks for these lessons have been updated since the videos were recorded. In
most cases that just means your notebook will contain more hints and explanatory text than
what you see in the videos, but there may be some minor differences in the code as well. With
these changes, you still will be able to follow along with the lessons, and should have an easier
time understanding the project material.

Solutions
If you need help, feel free to look at the solutions in the same folder.
04. The Notebooks
Workspace

This section contains either a workspace (it can be a Jupyter Notebook workspace or an online
code editor work space, etc.) and it cannot be automatically downloaded to be generated here.
Please access the classroom with your account and manually download the workspace to your
local machine. Note that for some courses, Udacity upload the workspace files
onto https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
06. Mini Project 1
Instructions

In this project, you'll test your theory of what features of a review correlate with the label!
Here are your specific steps:

Mini Project 1

Task List:
 
Work in the  Project 1  section of  Sentiment_Classification_Projects.ipynb .

 
Follow the notebook’s instructions to test the correlation between review features and labels.

Task Feedback:
Nice work! In the next video, Andrew will explain his solution.
Important about project
09. Mini Project 2
Instructions

In the following mini project, you'll convert the inputs and outputs of the dataset into numbers.
Namely, you will convert each review string into a vector, and each label into a  0  or  1 .
You’ll need to make a few additions to the notebook, but the main work will be implementing
two functions, whose signatures are shown below:

def update_input_layer(review):
    """ Modify the global layer_0 to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
        review(string) - the string of the review
    Returns:
        None
    """
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    ## Your code here
    pass

def get_target_for_label(label):
    """Convert a label to `0` or `1`.
    Args:
        label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
        `0` or `1`.
    """
    pass
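For reference, here is one possible way these two stubs could be filled in, sketched with a tiny made-up vocabulary — in the notebook you would build  vocab  and  word2index  from the full review set, so the names and sizes below are illustrative only:

```python
import numpy as np

# Illustrative vocabulary; the notebook derives this from reviews.txt
vocab = ["the", "movie", "was", "great", "terrible"]
word2index = {word: i for i, word in enumerate(vocab)}
layer_0 = np.zeros((1, len(vocab)))

def update_input_layer(review):
    """Set layer_0[0][i] to the count of vocab word i in review."""
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        if word in word2index:
            layer_0[0][word2index[word]] += 1

def get_target_for_label(label):
    """Map "POSITIVE" to 1 and anything else to 0."""
    return 1 if label == "POSITIVE" else 0

update_input_layer("the movie was great great")
```

After the call above, the element of  layer_0  for "great" holds 2 and the elements for the other seen words hold 1, which is exactly the bag-of-words count representation the project asks for.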
Mini Project 2

Task List:
 
Work in the  Project 2  section of  Sentiment_Classification_Projects.ipynb .

 
Follow the notebook’s instructions to convert your inputs and outputs to numbers.

 
Create a global vocabulary set  vocab

 
Initialize a  global  layer_0 that is a vector of the size of the text vocabulary. All values should
be initialized to  0 .
 
Implement  update_input_layer .

 
Implement  get_target_for_label .

Task Feedback:
Nice work! In the next video, Andrew will share his solution.
Keras

 Back to Home

 01. Intro
 02. Keras
 03. Pre-Lab: Student Admissions in Keras
 04. Lab: Student Admissions in Keras
 05. Optimizers in Keras
 06. Mini Project Intro
 07. Pre-Lab: IMDB Data in Keras
 08. Lab: IMDB Data in Keras
Keras
Hi again! Now we know all there is to know about training and optimizing neural networks,
and we've actually trained a few of them in NumPy. But this is not what we normally
do in real life. There are many packages that will make our life much easier. The two
main ones that we'll learn in this course are Keras and TensorFlow. In this lesson, we'll
learn to use Keras.

The way we'll learn is by writing lots of code and building lots of models. We'll start by
building a simple neural network that will solve the XOR problem. Then, we'll build a
bigger neural network that will analyze the student data that we have analyzed in a
previous section.

And finally, we'll have a lab in which you'll be able to build a neural network yourself,
which will process text, and make predictions on the sentiment of movie reviews in
IMDB.
02. Keras
Neural Networks in Keras
Luckily, every time we need to use a neural network, we won't need to code the activation
function, gradient descent, etc. There are lots of packages for this, which we recommend you
to check out, including the following:

 Keras
 TensorFlow
 Caffe
 Theano
 Scikit-learn
 And many others!

In this course, we will learn Keras. Keras makes coding deep neural networks simpler. To
demonstrate just how easy it is, you're going to build a simple fully-connected network in a
few dozen lines of code.

We’ll be connecting the concepts that you’ve learned in the previous lessons to the methods
that Keras provides.

The general idea for this example is that you'll first load the data, then define the network, and
then finally train the network.

Building a Neural Network in Keras


Here are some core concepts you need to know for working with Keras.

Sequential Model
from keras.models import Sequential

#Create the Sequential model


model = Sequential()
The keras.models.Sequential class is a wrapper for the neural network model that treats the
network as a sequence of layers. It implements the Keras model interface with common
methods like  compile() ,  fit() , and  evaluate()  that are used to train and run the model.
We'll cover these functions soon, but first let's start looking at the layers of the model.

Layers
The Keras Layer class provides a common interface for a variety of standard neural network
layers. There are fully connected layers, max pool layers, activation layers, and more. You can
add a layer to a model using the model's  add()  method. For example, a simple model with a
single hidden layer might look like this:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
# X has shape (num_rows, num_cols), where the training data are stored
# as row vectors
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)

# y must have an output vector for each input vector


y = np.array([[0], [0], [0], [1]], dtype=np.float32)

# Create the Sequential model


model = Sequential()

# 1st Layer - Add an input layer of 32 nodes with the same input shape as
# the training samples in X
model.add(Dense(32, input_dim=X.shape[1]))

# Add a softmax activation layer


model.add(Activation('softmax'))

# 2nd Layer - Add a fully connected output layer


model.add(Dense(1))

# Add a sigmoid activation layer


model.add(Activation('sigmoid'))
Keras requires the input shape to be specified in the first layer, but it will automatically infer
the shape of all other layers. This means you only have to explicitly set the input dimensions
for the first layer.

The first (hidden) layer from above,  model.add(Dense(32, input_dim=X.shape[1])) ,


creates 32 nodes which each expect to receive 2-element vectors as inputs. Each layer takes
the outputs from the previous layer as inputs and pipes through to the next layer. This chain of
passing output to the next layer continues until the last layer, which is the output of the model.
We can see that the output has dimension 1.

The activation "layers" in Keras are equivalent to specifying an activation function in the
Dense layers (e.g.,  model.add(Dense(128)); model.add(Activation('softmax'))  is
computationally equivalent to  model.add(Dense(128, activation="softmax")) ), but it is
common to explicitly separate the activation layers because it allows direct access to the
outputs of each layer before the activation is applied (which is useful in some model
architectures).

Once we have our model built, we need to compile it before it can be run. Compiling the Keras
model calls the backend (TensorFlow, Theano, etc.) and binds the optimizer, loss function, and
other parameters required before the model can be run on any input data. We'll specify the loss
function to be  categorical_crossentropy , which can be used with one-hot encoded labels
(here there are two classes), and specify  adam  as the optimizer (which is a reasonable
default when speed is a priority). And finally, we can specify what metrics we want to
evaluate the model with. Here we'll use accuracy.

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
We can see the resulting model architecture with the following command:

model.summary()
The model is trained with the  fit()  method, through the following command that specifies
the number of training epochs and the message level (how much information we want
displayed on the screen during training).

model.fit(X, y, nb_epoch=1000, verbose=0)


Note: In Keras 1,  nb_epoch  sets the number of epochs, but in Keras 2 this changes to the
keyword  epochs .

Finally, we can use the following command to evaluate the model:

model.evaluate(X, y)
Pretty simple, right? Let's put it into practice.

Quiz
Let's start with the simplest example. In this quiz you will build a simple multi-layer
feedforward neural network to solve the XOR problem.

1. Set the first layer to a  Dense()  layer with an output width of 8 nodes and
the  input_dim  set to the size of the training samples (in this case 2).
2. Add a  tanh  activation function.
3. Set the output layer width to 1, since the output has only two classes. (We can use 0 for one
class and 1 for the other)
4. Use a  sigmoid  activation function after the output layer.
5. Run the model for 50 epochs.

This should give you an accuracy of 50%. That's ok, but certainly not great. Out of 4 input
points, we're correctly classifying only 2 of them. Let's try to change some parameters around
to improve. For example, you can increase the number of epochs. You'll pass this quiz if you
get 75% accuracy. Can you reach 100%?

To get started, review the Keras documentation about models and layers.
The Keras example of a Multi-Layer Perceptron network is similar to what you need to do
here. Use that as a guide, but keep in mind that there will be a number of differences.
Start Quiz:
network.py network_solution.py
import numpy as np
from keras.utils import np_utils
import tensorflow as tf
# Using TensorFlow 1.0.0; use tf.python_io in later versions
tf.python.control_flow_ops = tf

# Set random seed
np.random.seed(42)

# Our data
X = np.array([[0,0],[0,1],[1,0],[1,1]]).astype('float32')
y = np.array([[0],[1],[1],[0]]).astype('float32')

# Initial Setup for Keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# One-hot encoding the output
y = np_utils.to_categorical(y)

# Building the model
xor = Sequential()

# Add required layers
# xor.add()

# Specify loss as "binary_crossentropy", optimizer as "adam",
# and add the accuracy metric
# xor.compile()

# Uncomment this line to print the model architecture
# xor.summary()

# Fitting the model
history = xor.fit(X, y, nb_epoch=50, verbose=0)

# Scoring the model
score = xor.evaluate(X, y)
print("\nAccuracy: ", score[-1])

# Checking the predictions
print("\nPredictions:")
print(xor.predict_proba(X))
Start Quiz:
network.py network_solution.py
import numpy as np
from keras.utils import np_utils
import tensorflow as tf
tf.python.control_flow_ops = tf

# Set random seed
np.random.seed(42)

# Our data
X = np.array([[0,0],[0,1],[1,0],[1,1]]).astype('float32')
y = np.array([[0],[1],[1],[0]]).astype('float32')

# Initial Setup for Keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Flatten

# One-hot encoding the output
y = np_utils.to_categorical(y)

# Building the model
xor = Sequential()
xor.add(Dense(32, input_dim=2))
xor.add(Activation("tanh"))
xor.add(Dense(2))
xor.add(Activation("sigmoid"))

xor.compile(loss="categorical_crossentropy", optimizer="adam",
            metrics=['accuracy'])

# Uncomment this line to print the model architecture
# xor.summary()

# Fitting the model
history = xor.fit(X, y, nb_epoch=1000, verbose=0)

# Scoring the model
score = xor.evaluate(X, y)
print("\nAccuracy: ", score[-1])

# Checking the predictions
print("\nPredictions:")
print(xor.predict_proba(X))
03. Pre-Lab: Student Admissions in Keras
Mini Project: Student Admissions in Keras
So, now we're ready to use Keras with real data. We'll now build a neural network
which analyzes the dataset of student admissions at UCLA that we've previously
studied.

As you follow along with this lesson, you are encouraged to work in the referenced
Jupyter notebooks at the end of the page. We will present a solution to you, but
please try creating your own deep learning models! Much of the value in this
experience will come from playing around with the code in your own way.

Workspace
To open this notebook, you have two options:

 Go to the next page in the classroom (recommended)


 Clone the repo from Github and open the
notebook StudentAdmissionsKeras.ipynb in the student_admissions_keras folder.
You can either download the repository with  git clone
https://github.com/udacity/deep-learning.git , or download it as an archive file
from this link.

Instructions
This is more of a follow-along lab. We'll show you the steps to build the network.
However, at the end of the lab you'll be given the opportunity to improve the model,
and try to improve on its performance. Here are the main steps in this lab.

Studying the data

The dataset has the following columns:

 Student GPA (grades)


 Score on the GRE (test)
 Class rank (1-4)

First, let's start by looking at the data. For that, we'll use the read_csv function in
pandas.

import pandas as pd
data = pd.read_csv('student_data.csv')
print(data)
Here we can see that the first column is the label  y , which corresponds to
acceptance/rejection. Namely, a label of  1  means the student got accepted, and a
label of  0  means the student got rejected.

When we plot the data, we get the following graphs, which show that unfortunately,
the data is not as nicely separable as we'd hope:
So one thing we can do is make one graph for each of the 4 ranks. In that case, we get
this:
Pre-processing the data

Ok, there's a bit more hope here. It seems like the better grades and test scores the
student has, the more likely they are to be accepted. And the rank has something to
do with it. So what we'll do is, we'll one-hot encode the rank, and our 6 input variables
will be:

 Test (GRE)
 Grades (GPA)
 Rank 1
 Rank 2
 Rank 3
 Rank 4.

The last 4 inputs will be binary variables that have a value of 1 if the student has that
rank, or 0 otherwise.
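As a quick sketch of what one-hot encoding the rank does — the lesson's notebook uses pandas'  get_dummies  for this, and the rank values below are made up for illustration:

```python
import numpy as np

# Hypothetical rank column with values 1-4
ranks = np.array([1, 3, 4, 2])

# Compare each rank against the possible values 1..4;
# each row gets a single 1 in the column matching its rank
one_hot = (ranks[:, None] == np.arange(1, 5)).astype(int)
```

So a student with rank 1 becomes the row [1, 0, 0, 0], a student with rank 4 becomes [0, 0, 0, 1], and so on, giving the four binary rank inputs described above.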

So, first things first, let's notice that the test scores have a range of 800, while the
grades have a range of 4. This is a huge discrepancy, and it will affect our training.
Normally, the best thing to do is to normalize the scores so they are between 0 and 1.
We can do this as follows:

data["gre"] = data["gre"]/800
data["gpa"] = data["gpa"]/4.0
Now, we split our data into the input  X  and the labels  y , and one-hot encode the output
so it appears as two classes (accepted and not accepted).

X = np.array(data)[:,1:]
y = keras.utils.to_categorical(np.array(data["admit"]))
Building the model architecture

And finally, we define the model architecture. We can use different architectures, but
here's an example:

model = Sequential()
model.add(Dense(128, input_dim=6))
model.add(Activation('sigmoid'))
model.add(Dense(32))
model.add(Activation('sigmoid'))
model.add(Dense(2))
model.add(Activation('sigmoid'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',
metrics=['accuracy'])
model.summary()
The error function is given by  categorical_crossentropy , which is the one we've
been using, but there are other options. There are several optimizers which you can
choose from in order to improve your training. Here we use adam, but there are
others that are useful, such as rmsprop. These use a variety of techniques that we'll
outline in upcoming pages in this lesson.

The model summary will tell us the following:


Training the model
Now, we train the model, with 1000 epochs. Don't worry about the batch_size, we'll
learn about it soon.

model.fit(X_train, y_train, epochs=1000, batch_size=100, verbose=0)
Evaluating the model
And finally, we can evaluate our model.

score = model.evaluate(X_train, y_train)


Results may vary, but you should get somewhere over 70% accuracy.

And there you go, you've trained your first neural network to analyze a dataset. Now,
in the following pages, you'll learn many techniques to improve the training process.
04. Lab: Student Admissions in Keras
Workspace

This section contains a workspace (such as a Jupyter Notebook workspace or an online code editor workspace) that cannot be automatically reproduced here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity uploads the workspace files to https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a

05. Optimizers in Keras


Keras Optimizers
There are many optimizers in Keras that we encourage you to explore further, in this link or
in this excellent blog post. These optimizers use a combination of the tricks above, plus a few
others. Some of the most common are:

SGD
This is Stochastic Gradient Descent. It uses the following parameters:

 Learning rate.
 Momentum (This takes the weighted average of the previous steps, in order to get a bit of
momentum and go over bumps, as a way to not get stuck in local minima).
 Nesterov Momentum (This slows down the gradient when it's close to the solution).
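As a toy sketch (not Keras' implementation), classical momentum keeps a velocity that accumulates a decaying sum of past gradient steps:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update: the velocity carries a decaying
    memory of past steps, smoothing the descent over bumps."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Starting from zero velocity, the first step is just -lr * grad
w, v = np.array([1.0]), np.array([0.0])
w, v = sgd_momentum_step(w, np.array([0.5]), v)
```

On later steps the momentum term keeps the update moving in the direction of recent gradients, which is what helps the optimizer roll over small bumps instead of getting stuck in them.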

Adam
Adam (Adaptive Moment Estimation) uses a more complicated exponential decay that consists
of not just considering the average (first moment), but also the variance (second moment) of
the previous steps.

RMSProp
RMSProp (RMS stands for Root Mean Square) decreases the learning rate by dividing it by an
exponentially decaying average of squared gradients.
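To make these update rules concrete, here is a toy numpy sketch of one momentum step and one RMSProp step. This is an illustration under simplified assumptions — the function names and hyperparameter values are made up, and Keras's actual implementations differ in detail:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    # Momentum: keep a weighted average of previous steps so the update
    # can roll over small bumps instead of getting stuck in local minima.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def rmsprop_step(w, grad, avg_sq, lr=0.1, decay=0.9, eps=1e-8):
    # RMSProp: divide the learning rate by an exponentially decaying
    # average of squared gradients.
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, np.array([0.5, 0.5]), v)
print(w)  # [ 0.95 -2.05]
```

Adam combines both ideas, tracking decaying averages of the gradients (first moment) and of their squares (second moment).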
07. Pre-Lab: IMDB Data in Keras
Mini Project: Using Keras to analyze IMDB Movie Data
Now, you're ready to shine! In this project, we will analyze a dataset from IMDB and use it to
predict the sentiment of a review.

Workspace
To open this notebook, you have two options:

 Go to the next page in the classroom (recommended)


 Clone the repo from Github and open the notebook IMDB_in_Keras.ipynb in
the imdb_keras folder. You can either download the repository
with  git clone https://github.com/udacity/deep-learning.git , or download it as an
archive file from this link.

Instructions
In this lab, we will preprocess the data for you, and you'll be in charge of building and training
the model in Keras.

The dataset

This lab uses a dataset of 25,000 IMDB reviews. Each review comes with a label. A label of 0
is given to a negative review, and a label of 1 is given to a positive review. The goal of this lab
is to create a model that will predict the sentiment of a review, based on the words in the
review. You can see more information about this dataset in the Keras website.

Now, the input already comes preprocessed for us for convenience. Each review is encoded as
a sequence of indexes, corresponding to the words in the review. The words are ordered by
frequency, so the integer 1 corresponds to the most frequent word ("the"), the integer 2 to the
second most frequent word, etc. By convention, the integer 0 is reserved for padding, and
out-of-vocabulary words are mapped to the oov_char index (2 with the default arguments below).

Then, the sentence is turned into a vector by simply concatenating these integers. For instance,
if the sentence is "To be or not to be." and the indices of the words are as follows:

 "to": 5
 "be": 8
 "or": 21
 "not": 3

Then the sentence gets encoded as the vector  [5,8,21,3,5,8] .

Loading the data

The data comes preloaded in Keras, which means we don't need to open or read any files
manually. The command to load it is the following, which will actually split the words into
training and testing sets and labels!:

from keras.datasets import imdb


(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
The meanings of all of these arguments are here. But in a nutshell, the most important ones
are:

 num_words: Top most frequent words to consider. This is useful if you don't want to consider
very obscure words such as "Ultracrepidarian."
 skip_top: Top words to ignore. This is useful if you don't want to consider the most common
words. For example, the word "the" would add no information to the review, so we can skip it
by setting  skip_top  to 2 or higher.

Pre-processing the data

We first prepare the data by one-hot encoding it into (0,1)-vectors as follows: If, for example,
we have 10 words in our vocabulary, and the vector is (4,1,8), we'll turn it into the vector
(1,0,0,1,0,0,0,1,0,0).
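Following the example above (word indices starting at 1, vocabulary of 10), a rough numpy sketch of this encoding could look like this. The helper name one_hot_review is made up for illustration; the actual lab uses Keras utilities:

```python
import numpy as np

def one_hot_review(sequence, vocab_size):
    # Word index k (counting from 1) switches on position k-1 of the vector;
    # repeated words still produce a single 1.
    vec = np.zeros(vocab_size, dtype=int)
    for idx in sequence:
        vec[idx - 1] = 1
    return vec

print(one_hot_review([4, 1, 8], 10))  # [1 0 0 1 0 0 0 1 0 0]
```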

Building the model

Now it's your turn to use all you've learned! You can build a neural network using Keras, train
it, and evaluate it! Make sure you also use methods such as dropout or regularization, and
good Keras optimizers to do this. A good accuracy to aim for is 85%. Can your model achieve
this?

Help

This is a self-assessed lab. If you need any help or want to check your answers, feel free to
check out the solutions notebook in the same folder, or click here.
08. Lab: IMDB Data in Keras
Workspace

This section contains a workspace (it can be a Jupyter Notebook workspace or an online
code editor workspace, etc.) that cannot be automatically downloaded and reproduced here.
Please access the classroom with your account and manually download the workspace to your
local machine. Note that for some courses, Udacity uploads the workspace files
onto https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
Part 02-Module 01-Lesson 08_TensorFlow
01. Intro

Hi! It's Luis again!

Intro to TensorFlow
Now that you are an expert in Neural Networks with Keras, you're more than ready to learn
TensorFlow. In the following sections of this Nanodegree Program, you will be using Keras
and TensorFlow alternately. Keras is great for building neural networks quickly, but it
abstracts a lot of the details. TensorFlow is great for understanding how neural networks
operate on a lower level. This lesson will teach you what you need to know of TensorFlow,
and give you some exercises to practice.

This lesson builds on the knowledge from the Deep Neural Networks lesson. If you
need to refresh your memory on any of the topics, such as Linear Functions, Softmax, Cross
Entropy, Batching, or Epochs, feel free to go back and watch them again.

 Linear Functions
 Softmax
 Cross Entropy
 Batching and Epochs

Enjoy!
02. Installing TensorFlow

Throughout this lesson, you'll apply your knowledge of neural networks on real datasets
using TensorFlow (link for China), an open source Deep Learning library created by Google.

You’ll use TensorFlow to classify images from the notMNIST dataset - a dataset of images of
English letters from A to J. You can see a few example images below.
Your goal is to automatically detect the letter based on the image in the dataset. You’ll be
working on your own computer for this lab, so, first things first, install TensorFlow!

Install
As usual, we'll be using Conda to install TensorFlow. You might already have a TensorFlow
environment, but check to make sure you have all the necessary packages.

OS X or Linux
Run the following commands to setup your environment:

conda create -n tensorflow python=3.5


source activate tensorflow
conda install pandas matplotlib jupyter notebook scipy scikit-learn
pip install tensorflow
Windows
To install on Windows, run the following commands in your console or Anaconda shell:

conda create -n tensorflow python=3.5


activate tensorflow
conda install pandas matplotlib jupyter notebook scipy scikit-learn
pip install tensorflow
Hello, world!
Try running the following code in your Python console to make sure you have TensorFlow
properly installed. The console will print "Hello World!" if TensorFlow is installed. Don't
worry about understanding what it does. You'll learn about it in the next section.

import tensorflow as tf

# Create TensorFlow object called tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)
04. Quiz: TensorFlow Linear Function
Linear functions in TensorFlow
The most common operation in neural networks is calculating the linear combination of inputs,
weights, and biases. As a reminder, we can write the output of the linear operation as

y = xW + b

Here, W is a matrix of the weights connecting two layers. The output y, the input x, and the
biases b are all vectors.
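In plain numpy, the same operation looks like this (the numbers are made up for illustration):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])   # one sample with 3 features
W = np.full((3, 2), 0.5)          # weights connecting 3 features to 2 outputs
b = np.array([0.1, -0.1])         # one bias per output

y = x @ W + b   # xW + b: matrix multiply, then add the bias vector
print(y)        # [[ 3.1  2.9]]
```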
Weights and Bias in TensorFlow
The goal of training a neural network is to modify weights and biases to best predict the labels.
In order to use weights and bias, you'll need a Tensor that can be modified. This leaves
out  tf.placeholder()  and  tf.constant() , since those Tensors can't be modified. This is
where the  tf.Variable  class comes in.

tf.Variable()
x = tf.Variable(5)
The  tf.Variable  class creates a tensor with an initial value that can be modified, much like a
normal Python variable. This tensor stores its state in the session, so you must initialize the
state of the tensor manually. You'll use the  tf.global_variables_initializer()  function
to initialize the state of all the Variable tensors.

Initialization
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
The  tf.global_variables_initializer()  call returns an operation that will initialize all
TensorFlow variables from the graph. You call the operation using a session to initialize all the
variables as shown above. Using the  tf.Variable  class allows us to change the weights and
bias, but an initial value needs to be chosen.

Initializing the weights with random numbers from a normal distribution is good practice.
Randomizing the weights helps prevent the model from becoming stuck in the same place every
time you train it. You'll learn more about this in the next lesson, when you study gradient
descent.

Similarly, choosing weights from a normal distribution prevents any one weight from
overwhelming other weights. You'll use the  tf.truncated_normal()  function to generate
random numbers from a normal distribution.

tf.truncated_normal()
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
The  tf.truncated_normal()  function returns a tensor with random values from a normal
distribution whose magnitude is no more than 2 standard deviations from the mean.

Since the weights are already helping prevent the model from getting stuck, you don't need to
randomize the bias. Let's use the simplest solution, setting the bias to 0.

tf.zeros()
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
The  tf.zeros()  function returns a tensor with all zeros.
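For intuition, here is a rough numpy imitation of what  tf.truncated_normal()  produces — redraw any sample that lands more than 2 standard deviations from the mean. This is a sketch of the behavior, not TensorFlow's actual implementation:

```python
import numpy as np

def truncated_normal(shape, mean=0.0, stddev=1.0, seed=0):
    # Draw from a normal distribution, then redraw any value whose
    # distance from the mean exceeds 2 standard deviations.
    rng = np.random.default_rng(seed)
    vals = rng.normal(mean, stddev, size=shape)
    mask = np.abs(vals - mean) > 2 * stddev
    while mask.any():
        vals[mask] = rng.normal(mean, stddev, size=mask.sum())
        mask = np.abs(vals - mean) > 2 * stddev
    return vals

weights = truncated_normal((120, 5))  # same shape as the example above
```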

Linear Classifier Quiz

A subset of the MNIST dataset

You'll be classifying the handwritten numbers  0 ,  1 , and  2  from the MNIST dataset using
TensorFlow. The above is a small sample of the data you'll be training on. Notice how some of
the  1 s are written with a serif at the top and at different angles. The similarities and
differences will play a part in shaping the weights of the model.

Left: Weights for labeling 0. Middle: Weights for labeling 1. Right: Weights for labeling 2.
The images above are trained weights for each label ( 0 ,  1 , and  2 ). The weights display the
unique properties of each digit they have found. Complete this quiz to train your own weights
using the MNIST dataset.

Instructions

1. Open quiz.py.
   1. Implement  get_weights  to return a  tf.Variable  of weights
   2. Implement  get_biases  to return a  tf.Variable  of biases
   3. Implement  xW + b  in the  linear  function
2. Open sandbox.py
   1. Initialize all weights

Since  xW  in  xW + b  is matrix multiplication, you have to use the  tf.matmul()  function
instead of  tf.multiply() . Don't forget that order matters in matrix multiplication,
so  tf.matmul(a,b)  is not the same as  tf.matmul(b,a) .

Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Solution is available in the other "sandbox_solution.py" tab
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from quiz import get_weights, get_biases, linear


def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    # TODO: Initialize session variables

    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

    # Print loss
    print('Loss: {}'.format(l))
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Solution is available in the other "quiz_solution.py" tab
import tensorflow as tf


def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # TODO: Return weights
    pass


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # TODO: Return biases
    pass


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # TODO: Linear Function (xW + b)
    pass
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf


def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    return tf.Variable(tf.zeros(n_labels))


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # Linear Function (xW + b)
    return tf.add(tf.matmul(input, w), b)
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Sandbox Solution
# Note: You can't run code in this tab
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from quiz import get_weights, get_biases, linear


def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())

    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

    # Print loss
    print('Loss: {}'.format(l))
05. Quiz: TensorFlow Softmax
TensorFlow Softmax
The softmax function squashes its inputs, typically called logits or logit scores, to be between
0 and 1, and also normalizes the outputs so that they all sum to 1. This means the output of
the softmax function is equivalent to a categorical probability distribution. It's the perfect
function to use as the output activation for a network predicting multiple classes.

Example of the softmax function at work.

TensorFlow Softmax
We're using TensorFlow to build neural networks and, appropriately, there's a function for
calculating softmax.

x = tf.nn.softmax([2.0, 1.0, 0.2])


Easy as that!  tf.nn.softmax()  implements the softmax function for you. It takes in logits
and returns softmax activations.

Quiz
Use the softmax function in the quiz below to return the softmax of the logits.

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf


def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)

    # TODO: Calculate the softmax of the logits
    # softmax =

    with tf.Session() as sess:
        # TODO: Feed in the logit data
        # output = sess.run(softmax, )
        pass

    return output
Start Quiz:
quiz.py solution.py
# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf


def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)

    softmax = tf.nn.softmax(logits)

    with tf.Session() as sess:
        output = sess.run(softmax, feed_dict={logits: logit_data})

    return output
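For reference, the same computation in plain numpy (a sketch of the standard numerically stable formulation, not TensorFlow's internals):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max before exponentiating avoids overflow;
    # the shift cancels out in the ratio.
    exps = np.exp(np.array(logits) - np.max(logits))
    return exps / exps.sum()

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # roughly [0.659 0.242 0.099], summing to 1
```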
06. Quiz: TensorFlow Cross Entropy
Cross Entropy in TensorFlow
As with the softmax function, TensorFlow has a function to do the cross entropy calculations
for us.

Cross entropy loss function

Let's take what you learned from the video and create a cross entropy function in TensorFlow.
To create a cross entropy function in TensorFlow, you'll need to use two new functions:

 tf.reduce_sum()
 tf.log()

Reduce Sum
x = tf.reduce_sum([1, 2, 3, 4, 5]) # 15
The  tf.reduce_sum()  function takes an array of numbers and sums them together.

Natural Log
x = tf.log(100.0) # 4.60517
This function does exactly what you would expect it to do.  tf.log()  takes the natural log of
a number.

Quiz
Print the cross entropy using  softmax_data  and  one_hot_data .

(Alternative link for users in China.)

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# TODO: Print cross entropy from session


Start Quiz:
quiz.py solution.py
# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# TODO: Print cross entropy from session
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))

with tf.Session() as sess:
    print(sess.run(cross_entropy, feed_dict={softmax: softmax_data, one_hot: one_hot_data}))
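You can check the solution's arithmetic in plain numpy:

```python
import numpy as np

softmax_data = np.array([0.7, 0.2, 0.1])
one_hot_data = np.array([1.0, 0.0, 0.0])

# Only the "hot" class contributes, so this reduces to -ln(0.7).
cross_entropy = -np.sum(one_hot_data * np.log(softmax_data))
print(cross_entropy)  # ~0.3567
```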
07. Quiz: Mini-batch
Mini-batching
In this section, you'll go over what mini-batching is and how to apply it in TensorFlow.

Mini-batching is a technique for training on subsets of the dataset instead of all the data at one
time. This provides the ability to train a model, even if a computer lacks the memory to store
the entire dataset.

Mini-batching is computationally inefficient, since you can't calculate the loss simultaneously
across all samples. However, this is a small price to pay in order to be able to run the model at
all.

It's also quite useful combined with SGD. The idea is to randomly shuffle the data at the start
of each epoch, then create the mini-batches. For each mini-batch, you train the network
weights with gradient descent. Since these batches are random, you're performing SGD with
each batch.

Let's look at the MNIST dataset with weights and a bias to see if your machine can handle it.

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
Question 1

Calculate the memory size of  train_features ,  train_labels ,  weights , and  bias  in
bytes. Ignore memory for overhead, just calculate the memory required for the stored data.

You may have to look up how much memory a float32 requires, using this link.

train_features Shape: (55000, 784) Type: float32


train_labels Shape: (55000, 10) Type: float32

weights Shape: (784, 10) Type: float32

bias Shape: (10,) Type: float32

QUESTION:

How many bytes of memory does  train_features  need?

ANSWER:
SOLUTION:
NOTE: The solutions are expressed in RegEx pattern. Udacity uses these patterns to
check the given answer

QUESTION:

How many bytes of memory does  train_labels  need?

ANSWER:
SOLUTION:
QUESTION:

How many bytes of memory does  weights  need?

ANSWER:
SOLUTION:
QUESTION:

How many bytes of memory does  bias  need?


ANSWER:
SOLUTION:
The total memory space required for the inputs, weights and bias is around 174
megabytes, which isn't that much memory. You could train this whole dataset on most
CPUs and GPUs.

But larger datasets that you'll use in the future are measured in gigabytes or more. It's
possible to purchase more memory, but it's expensive. A Titan X GPU with 12 GB of
memory costs over $1,000.

Instead, in order to run large models on your machine, you'll learn how to use mini-
batching.

Let's look at how you implement mini-batching in TensorFlow.

TensorFlow Mini-batching
In order to use mini-batching, you must first divide your data into batches.

Unfortunately, it's sometimes impossible to divide the data into batches of exactly
equal size. For example, imagine you'd like to create batches of 128 samples each
from a dataset of 1000 samples. Since 128 does not evenly divide into 1000, you'd
wind up with 7 batches of 128 samples, and 1 batch of 104 samples. (7*128 + 1*104 =
1000)

In that case, the size of the batches would vary, so you need to take advantage of
TensorFlow's  tf.placeholder()  function to receive the varying batch sizes.

Continuing the example, if each sample had  n_input = 784  features and  n_classes
= 10  possible labels, the dimensions for  features  would be  [None,
n_input]  and  labels  would be  [None, n_classes] .

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])
What does  None  do here?

The  None  dimension is a placeholder for the batch size. At runtime, TensorFlow will
accept any batch size greater than 0.
Going back to our earlier example, this setup allows you to
feed  features  and  labels  into the model as either the batches of 128 samples or
the single batch of 104 samples.

Question 2

Using the parameters below, how many batches are there, and what is the last batch
size?

features is (50000, 400)

labels is (50000, 10)

batch_size is 128

QUESTION:

How many batches are there?

ANSWER:
SOLUTION:
QUESTION:

What is the last batch size?

ANSWER:
SOLUTION:
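The arithmetic behind these two answers can be sketched in a few lines of Python, plugging in the quiz parameters above:

```python
import math

n_samples = 50000
batch_size = 128

# The partial batch at the end still counts as a batch.
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (n_batches - 1) * batch_size
```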
Now that you know the basics, let's learn how to implement mini-batching.

Question 3

Implement the  batches  function to batch  features  and  labels . The function should
return each batch with a maximum size of  batch_size . To help you with the quiz, look
at the following example output of a working  batches  function.

# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

example_batches = batches(3, example_features, example_labels)


The  example_batches  variable would be the following:

[
    # 2 batches:
    #   First is a batch of size 3.
    #   Second is a batch of size 1
    [
        # First Batch is size 3
        [
            # 3 samples of features.
            # There are 4 features per sample.
            ['F11', 'F12', 'F13', 'F14'],
            ['F21', 'F22', 'F23', 'F24'],
            ['F31', 'F32', 'F33', 'F34']
        ], [
            # 3 samples of labels.
            # There are 2 labels per sample.
            ['L11', 'L12'],
            ['L21', 'L22'],
            ['L31', 'L32']
        ]
    ], [
        # Second Batch is size 1.
        # Since batch size is 3, there is only one sample left from the 4 samples.
        [
            # 1 sample of features.
            ['F41', 'F42', 'F43', 'F44']
        ], [
            # 1 sample of labels.
            ['L41', 'L42']
        ]
    ]
]
Implement the  batches  function in the "quiz.py" file below.
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py
from quiz import batches
from pprint import pprint

# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

# PPrint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))
Start Quiz:
sandbox.py quiz.py quiz_solution.py
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    # TODO: Implement batching
    pass

Start Quiz:
sandbox.py  quiz.py  quiz_solution.py
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
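As a quick sanity check, the same slicing logic runs in plain Python with made-up toy data (no TensorFlow needed):

```python
def batches(batch_size, features, labels):
    # Slice features and labels in lock-step, batch_size items at a time.
    assert len(features) == len(labels)
    output_batches = []
    for start_i in range(0, len(features), batch_size):
        end_i = start_i + batch_size
        output_batches.append([features[start_i:end_i], labels[start_i:end_i]])
    return output_batches

result = batches(3, [0, 1, 2, 3], ['a', 'b', 'c', 'd'])
print(result)  # [[[0, 1, 2], ['a', 'b', 'c']], [[3], ['d']]]
```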

Let's use mini-batching to feed batches of MNIST features and labels into a linear
model.

Set the batch size and run the optimizer over all the batches with
the  batches  function. The recommended batch size is 128. If you have memory
restrictions, feel free to make it smaller.
Start Quiz:
quiz.py  helper.py  quiz_solution.py
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# TODO: Set batch size
batch_size = None
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    # TODO: Train optimizer on all batches
    # for batch_features, batch_labels in ______
    sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

    print('Test Accuracy: {}'.format(test_accuracy))


Start Quiz:
quiz.py helper.py quiz_solution.py
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
Start Quiz:
quiz.py  helper.py  quiz_solution.py
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    # Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

    print('Test Accuracy: {}'.format(test_accuracy))

The accuracy is low, but you can improve it by training on the dataset multiple times.
You'll go over this subject in the next section, where we talk about "epochs".
08. Epochs
Epochs
An epoch is a single forward and backward pass of the whole dataset. Epochs are used to
increase the accuracy of the model without requiring more data. This section covers epochs in
TensorFlow and how to choose the right number of epochs.
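Stripped of the TensorFlow machinery, the idea is simple: one epoch is one full pass over all the mini-batches of the training set, and repeating epochs keeps improving the parameters on the same data. Here is a minimal NumPy sketch; the one-weight model and the data are made up purely for illustration:

```python
import numpy as np

# Toy data: learn y = 2x with a single weight (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X

w = 0.0             # single trainable parameter
learning_rate = 0.1
batch_size = 10
epochs = 10

for epoch in range(epochs):
    # One epoch = one full pass over all mini-batches
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = np.mean(2 * (w * xb - yb) * xb)   # d(MSE)/dw on this batch
        w -= learning_rate * grad

print('w after {} epochs: {:.3f}'.format(epochs, w))  # converges toward 2.0
```

Each additional epoch revisits the same 100 examples, yet the weight keeps moving toward the true value of 2.0: more epochs, better fit, no extra data.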

The following TensorFlow code trains a model using 10 epochs.

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches  # Helper function created in Mini-batching section

def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 10
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

    print('Test Accuracy: {}'.format(test_accuracy))


Running the code will output the following:

Epoch: 0 - Cost: 11.0 Valid Accuracy: 0.204
Epoch: 1 - Cost: 9.95 Valid Accuracy: 0.229
Epoch: 2 - Cost: 9.18 Valid Accuracy: 0.246
Epoch: 3 - Cost: 8.59 Valid Accuracy: 0.264
Epoch: 4 - Cost: 8.13 Valid Accuracy: 0.283
Epoch: 5 - Cost: 7.77 Valid Accuracy: 0.301
Epoch: 6 - Cost: 7.47 Valid Accuracy: 0.316
Epoch: 7 - Cost: 7.2 Valid Accuracy: 0.328
Epoch: 8 - Cost: 6.96 Valid Accuracy: 0.342
Epoch: 9 - Cost: 6.73 Valid Accuracy: 0.36
Test Accuracy: 0.3801000118255615
Each epoch attempts to move to a lower cost, leading to better accuracy.

This model continues to improve accuracy up to Epoch 9. Let's increase the number of epochs
to 100.

...
Epoch: 79 - Cost: 0.111 Valid Accuracy: 0.86
Epoch: 80 - Cost: 0.11 Valid Accuracy: 0.869
Epoch: 81 - Cost: 0.109 Valid Accuracy: 0.869
....
Epoch: 85 - Cost: 0.107 Valid Accuracy: 0.869
Epoch: 86 - Cost: 0.107 Valid Accuracy: 0.869
Epoch: 87 - Cost: 0.106 Valid Accuracy: 0.869
Epoch: 88 - Cost: 0.106 Valid Accuracy: 0.869
Epoch: 89 - Cost: 0.105 Valid Accuracy: 0.869
Epoch: 90 - Cost: 0.105 Valid Accuracy: 0.869
Epoch: 91 - Cost: 0.104 Valid Accuracy: 0.869
Epoch: 92 - Cost: 0.103 Valid Accuracy: 0.869
Epoch: 93 - Cost: 0.103 Valid Accuracy: 0.869
Epoch: 94 - Cost: 0.102 Valid Accuracy: 0.869
Epoch: 95 - Cost: 0.102 Valid Accuracy: 0.869
Epoch: 96 - Cost: 0.101 Valid Accuracy: 0.869
Epoch: 97 - Cost: 0.101 Valid Accuracy: 0.869
Epoch: 98 - Cost: 0.1 Valid Accuracy: 0.869
Epoch: 99 - Cost: 0.1 Valid Accuracy: 0.869
Test Accuracy: 0.8696000006198883
From looking at the output above, you can see the model doesn't increase the validation
accuracy after epoch 80. Let's see what happens when we increase the learning rate.

learn_rate = 0.1

Epoch: 76 - Cost: 0.214 Valid Accuracy: 0.752
Epoch: 77 - Cost: 0.21 Valid Accuracy: 0.756
Epoch: 78 - Cost: 0.21 Valid Accuracy: 0.756
...
Epoch: 85 - Cost: 0.207 Valid Accuracy: 0.756
Epoch: 86 - Cost: 0.209 Valid Accuracy: 0.756
Epoch: 87 - Cost: 0.205 Valid Accuracy: 0.756
Epoch: 88 - Cost: 0.208 Valid Accuracy: 0.756
Epoch: 89 - Cost: 0.205 Valid Accuracy: 0.756
Epoch: 90 - Cost: 0.202 Valid Accuracy: 0.756
Epoch: 91 - Cost: 0.207 Valid Accuracy: 0.756
Epoch: 92 - Cost: 0.204 Valid Accuracy: 0.756
Epoch: 93 - Cost: 0.206 Valid Accuracy: 0.756
Epoch: 94 - Cost: 0.202 Valid Accuracy: 0.756
Epoch: 95 - Cost: 0.2974 Valid Accuracy: 0.756
Epoch: 96 - Cost: 0.202 Valid Accuracy: 0.756
Epoch: 97 - Cost: 0.2996 Valid Accuracy: 0.756
Epoch: 98 - Cost: 0.203 Valid Accuracy: 0.756
Epoch: 99 - Cost: 0.2987 Valid Accuracy: 0.756
Test Accuracy: 0.7556000053882599
Looks like the learning rate was increased too much. The final accuracy was lower, and it
stopped improving earlier. Let's stick with the previous learning rate, but change the number of
epochs to 80.

Epoch: 65 - Cost: 0.122 Valid Accuracy: 0.868
Epoch: 66 - Cost: 0.121 Valid Accuracy: 0.868
Epoch: 67 - Cost: 0.12 Valid Accuracy: 0.868
Epoch: 68 - Cost: 0.119 Valid Accuracy: 0.868
Epoch: 69 - Cost: 0.118 Valid Accuracy: 0.868
Epoch: 70 - Cost: 0.118 Valid Accuracy: 0.868
Epoch: 71 - Cost: 0.117 Valid Accuracy: 0.868
Epoch: 72 - Cost: 0.116 Valid Accuracy: 0.868
Epoch: 73 - Cost: 0.115 Valid Accuracy: 0.868
Epoch: 74 - Cost: 0.115 Valid Accuracy: 0.868
Epoch: 75 - Cost: 0.114 Valid Accuracy: 0.868
Epoch: 76 - Cost: 0.113 Valid Accuracy: 0.868
Epoch: 77 - Cost: 0.113 Valid Accuracy: 0.868
Epoch: 78 - Cost: 0.112 Valid Accuracy: 0.868
Epoch: 79 - Cost: 0.111 Valid Accuracy: 0.868
Epoch: 80 - Cost: 0.111 Valid Accuracy: 0.869
Test Accuracy: 0.86909999418258667
The accuracy only reached 0.86, but that could be because the learning rate was too high.
Lowering the learning rate would require more epochs, but could ultimately achieve better
accuracy.
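The learning-rate trade-off seen above can be reproduced with plain gradient descent on a one-dimensional quadratic; this is a toy stand-in for a cost surface, not the MNIST model itself. A step size that is too large keeps overshooting the minimum, while a moderate one steadily converges:

```python
import numpy as np

def descend(learning_rate, steps, x0=10.0):
    """Gradient descent on f(x) = x**2 (gradient is 2x); returns final |x|."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return abs(x)

# A moderate rate shrinks the error quickly
print(descend(learning_rate=0.1, steps=100))

# A rate near 1.0 overshoots: x flips sign every step and barely shrinks
print(descend(learning_rate=0.99, steps=100))
```

With the moderate rate the error is multiplied by 0.8 each step; with the aggressive rate it is multiplied by -0.98, so the iterate oscillates around the minimum and after the same 100 steps is still far from it. That is the same pattern as the stalled 0.756 validation accuracy above.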

In the upcoming TensorFlow Lab, you'll get the opportunity to choose your own learning rate,
epoch count, and batch size to improve the model's accuracy.
09. Pre-Lab: NotMNIST in TensorFlow
TensorFlow Neural Network Lab

TensorFlow Lab
We've prepared a Jupyter notebook that will guide you through the process of creating a single
layer neural network in TensorFlow. You'll implement data normalization, then build and train
the network with TensorFlow.

Getting the notebook


The notebook and all related files are available from our GitHub repository. Either clone the
repository or download it as a Zip file.

Use Git to clone the repository.

git clone https://github.com/udacity/deep-learning.git


If you're unfamiliar with Git and GitHub, I highly recommend checking out our course. If
you'd rather not use Git, you can download the repository as a Zip archive. You can find the
repo here.

View The Notebook


In the directory with the notebook file, start your Jupyter notebook server

jupyter notebook
This should open a browser window for you. If it doesn't, go to http://localhost:8888/tree.
The port number might be different if you have other notebook servers running, so try 8889
instead of 8888 if you can't find the right server.
You should see the notebook  intro_to_tensorflow.ipynb ; this is the notebook you'll be
working on. The notebook has 3 problems for you to solve:

 Problem 1: Normalize the features
 Problem 2: Use TensorFlow operations to create features, labels, weight, and biases tensors
 Problem 3: Tune the learning rate, number of steps, and batch size for the best accuracy

This is a self-assessed lab. Compare your answers to the solutions here. If you have any
difficulty completing the lab, Udacity provides a few services to answer any questions you
might have.

Help
Remember that you can get assistance from your mentor, the Forums (click the link on the left
side of the classroom), or the Slack channel. You can also review the concepts from the
previous lessons.
10. Lab: NotMNIST in TensorFlow
Workspace

This section contains a workspace (a Jupyter Notebook workspace, an online code editor
workspace, etc.) that cannot be reproduced here. Access the classroom with your account and
download the workspace to your local machine. Note that for some courses, Udacity uploads the
workspace files to https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:
 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
11. Two-layer Neural Network

Multilayer Neural Networks


In the previous lessons and the lab, you learned how to build a neural network of one layer.
Now, you'll learn how to build multilayer neural networks with TensorFlow. Adding a hidden
layer to a network allows it to model more complex functions. Also, using a non-linear
activation function on the hidden layer lets it model non-linear functions.

The first thing we'll learn to implement in TensorFlow is a ReLU hidden layer. A ReLU
(rectified linear unit) is a non-linear function: it outputs 0 for negative inputs
and x for all inputs x > 0.
As before, the following nodes build on the knowledge from the Deep Neural
Networks lesson. If you need to refresh your mind, you can go back and watch them again.

 ReLU
 Feedforward
 Dropout
12. Quiz: TensorFlow ReLUs
TensorFlow ReLUs
TensorFlow provides the ReLU function as  tf.nn.relu() , as shown below.

# Hidden Layer with ReLU activation function


hidden_layer = tf.add(tf.matmul(features, hidden_weights), hidden_biases)
hidden_layer = tf.nn.relu(hidden_layer)

output = tf.add(tf.matmul(hidden_layer, output_weights), output_biases)


The above code applies the  tf.nn.relu()  function to the  hidden_layer , zeroing out any
negative activations and acting like an on/off switch for each unit. Adding additional layers,
like the  output  layer, after an activation function turns the model into a nonlinear function.
This nonlinearity allows the network to solve more complex problems.
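If it helps to see the function outside of TensorFlow, ReLU is just an element-wise maximum with zero. A plain NumPy sketch (illustrative values only):

```python
import numpy as np

def relu(x):
    """ReLU: 0 for negative inputs, x for all inputs x > 0."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # negative values become 0, positive values pass through unchanged
```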

Quiz
Below you'll use the ReLU function to turn a linear single layer network into a non-linear
multilayer network.

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model

# TODO: Print session results


# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print session results
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits))
13. Deep Neural Network in TensorFlow
Deep Neural Network in TensorFlow
You've seen how to build a logistic classifier using TensorFlow. Now you're going to see how
to use the logistic classifier to build a deep neural network.

Step by Step
In the following walkthrough, we'll step through TensorFlow code written to classify the
digits in the MNIST database. If you would like to run the network on your computer, the file
is provided here. You can find this and many more examples of TensorFlow at Aymeric
Damien's GitHub repository.

Code
TensorFlow MNIST
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)
You'll use the MNIST dataset provided by TensorFlow, which batches and One-Hot encodes
the data for you.

Learning Parameters
import tensorflow as tf

# Parameters
learning_rate = 0.001
training_epochs = 20
batch_size = 128 # Decrease batch size if you don't have enough memory
display_step = 1

n_input = 784 # MNIST data input (img shape: 28*28)


n_classes = 10 # MNIST total classes (0-9 digits)
The focus here is on the architecture of multilayer neural networks, not parameter tuning, so
here we'll just give you the learning parameters.

Hidden Layer Parameters


n_hidden_layer = 256 # layer number of features
The variable  n_hidden_layer  determines the size of the hidden layer in the neural network.
This is also known as the width of a layer.

Weights and Biases


# Store layers weight & bias
weights = {
    'hidden_layer': tf.Variable(tf.random_normal([n_input, n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_hidden_layer, n_classes]))
}
biases = {
    'hidden_layer': tf.Variable(tf.random_normal([n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
Deep neural networks use multiple layers, with each layer requiring its own weights and bias.
The  'hidden_layer'  weights and bias are for the hidden layer. The  'out'  weights and bias
are for the output layer. If the neural network were deeper, there would be weights and biases
for each additional layer.

Input
# tf Graph input
x = tf.placeholder("float", [None, 28, 28, 1])
y = tf.placeholder("float", [None, n_classes])

x_flat = tf.reshape(x, [-1, n_input])


The MNIST data is made up of 28px by 28px images with a single channel.
The  tf.reshape()  function above reshapes the 28px by 28px matrices in  x  into row vectors
of 784 values.

Multilayer Perceptron

# Hidden layer with RELU activation


layer_1 = tf.add(tf.matmul(x_flat, weights['hidden_layer']),\
biases['hidden_layer'])
layer_1 = tf.nn.relu(layer_1)
# Output layer with linear activation
logits = tf.add(tf.matmul(layer_1, weights['out']), biases['out'])
You've seen the linear function  tf.add(tf.matmul(x_flat, weights['hidden_layer']),
biases['hidden_layer'])  before, also known as  xW + b . Combining linear functions
with a ReLU between them gives you a two-layer network.
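To make the shapes concrete, the same two-layer forward pass can be written in plain NumPy; the weight values below are made up purely for illustration:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # one sample with 4 features
w_hidden = np.full((4, 3), 0.1)        # 4 inputs -> 3 hidden units
b_hidden = np.zeros(3)
w_out = np.full((3, 2), 0.1)           # 3 hidden units -> 2 outputs
b_out = np.zeros(2)

hidden = np.maximum(0, x @ w_hidden + b_hidden)  # linear layer xW + b, then ReLU
logits = hidden @ w_out + b_out                  # linear output layer

print(hidden.shape, logits.shape)  # (1, 3) (1, 2)
```

The hidden layer maps 4 features to 3 units, and the output layer maps those 3 units to 2 logits, exactly mirroring the  weights['hidden_layer']  and  weights['out']  shapes above.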

Optimizer
# Define loss and optimizer
cost = tf.reduce_mean(\
tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
.minimize(cost)
This is the same optimization technique used in the Intro to TensorFlow lab.

Session
# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
The MNIST library in TensorFlow provides the ability to receive the dataset in batches.
Calling the  mnist.train.next_batch()  function returns a subset of the training data.

Deeper Neural Network

That's it! Going from one layer to two is easy. Adding more layers to the network allows you
to solve more complicated problems.
14. Save and Restore TensorFlow Models
Save and Restore TensorFlow Models
Training a model can take hours. But once you close your TensorFlow session, you lose all the
trained weights and biases. If you were to reuse the model in the future, you would have to
train it all over again!

Fortunately, TensorFlow gives you the ability to save your progress using a class
called  tf.train.Saver . This class provides the functionality to save any  tf.Variable  to
your file system.

Saving Variables
Let's start with a simple example of saving  weights  and  bias  Tensors. For the first example
you'll just save two variables. Later examples will save all the weights in a practical model.

import tensorflow as tf

# The file path to save the data
save_file = './model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

# Class used to save and/or restore Tensor Variables
saver = tf.train.Saver()

with tf.Session() as sess:
    # Initialize all the Variables
    sess.run(tf.global_variables_initializer())

    # Show the values of weights and bias
    print('Weights:')
    print(sess.run(weights))
    print('Bias:')
    print(sess.run(bias))

    # Save the model
    saver.save(sess, save_file)
Weights:
[[-0.97990924 1.03016174 0.74119264]
 [-0.82581609 -0.07361362 -0.86653847]]
Bias:
[ 1.62978125 -0.37812829 0.64723819]


The Tensors  weights  and  bias  are set to random values using
the  tf.truncated_normal()  function. The values are then saved to the  save_file  location,
"model.ckpt", using the  tf.train.Saver.save()  function. (The ".ckpt" extension stands for
"checkpoint".)

If you're using TensorFlow 0.11.0RC1 or newer, a file called "model.ckpt.meta" will also be
created. This file contains the TensorFlow graph.

Loading Variables
Now that the Tensor Variables are saved, let's load them back into a new model.

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

# Class used to save and/or restore Tensor Variables
saver = tf.train.Saver()

with tf.Session() as sess:
    # Load the weights and bias
    saver.restore(sess, save_file)

    # Show the values of weights and bias
    print('Weights:')
    print(sess.run(weights))
    print('Bias:')
    print(sess.run(bias))
Weights:
[[-0.97990924 1.03016174 0.74119264]
 [-0.82581609 -0.07361362 -0.86653847]]
Bias:
[ 1.62978125 -0.37812829 0.64723819]

You'll notice you still need to create the  weights  and  bias  Tensors in Python.
The  tf.train.Saver.restore()  function loads the saved data into  weights  and  bias .

Since  tf.train.Saver.restore()  sets all the TensorFlow Variables, you don't need to
call  tf.global_variables_initializer() .
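Conceptually, a checkpoint is just the variables' values written to disk under their names and read back later. As an analogy only (NumPy's  .npz  format, not the actual  tf.train.Saver  checkpoint format), the round trip looks like this:

```python
import os
import tempfile
import numpy as np

# "Train": produce some weights and a bias
weights = np.random.randn(2, 3)
bias = np.random.randn(3)

# "Save": write the arrays to a file, keyed by name
path = os.path.join(tempfile.mkdtemp(), 'model.npz')
np.savez(path, weights=weights, bias=bias)

# "Restore": in a later session, load the values back by name
ckpt = np.load(path)
restored_weights = ckpt['weights']
restored_bias = ckpt['bias']

print(np.allclose(weights, restored_weights))  # True
```

The important shared idea is that values are stored and retrieved by name, which is exactly what makes the naming pitfalls in the Finetuning section below possible.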

Save a Trained Model


Let's see how to train a model and save its weights.

First start with a model:

# Remove previous Tensors and Operations
tf.reset_default_graph()

from tensorflow.examples.tutorials.mnist import input_data
import numpy as np

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('.', one_hot=True)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(\
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Let's train that model, then save the weights:

import math

save_file = './train_model.ckpt'
batch_size = 128
n_epochs = 100

saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training cycle
    for epoch in range(n_epochs):
        total_batch = math.ceil(mnist.train.num_examples / batch_size)

        # Loop over all batches
        for i in range(total_batch):
            batch_features, batch_labels = mnist.train.next_batch(batch_size)
            sess.run(
                optimizer,
                feed_dict={features: batch_features, labels: batch_labels})

        # Print status for every 10 epochs
        if epoch % 10 == 0:
            valid_accuracy = sess.run(
                accuracy,
                feed_dict={
                    features: mnist.validation.images,
                    labels: mnist.validation.labels})
            print('Epoch {:<3} - Validation Accuracy: {}'.format(
                epoch,
                valid_accuracy))

    # Save the model
    saver.save(sess, save_file)
    print('Trained Model Saved.')
Epoch 0 - Validation Accuracy: 0.06859999895095825
Epoch 10 - Validation Accuracy: 0.20239999890327454
Epoch 20 - Validation Accuracy: 0.36980000138282776
Epoch 30 - Validation Accuracy: 0.48820000886917114
Epoch 40 - Validation Accuracy: 0.5601999759674072
Epoch 50 - Validation Accuracy: 0.6097999811172485
Epoch 60 - Validation Accuracy: 0.6425999999046326
Epoch 70 - Validation Accuracy: 0.6733999848365784
Epoch 80 - Validation Accuracy: 0.6916000247001648
Epoch 90 - Validation Accuracy: 0.7113999724388123
Trained Model Saved.

Load a Trained Model


Let's load the weights and bias from memory, then check the test accuracy.

saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    saver.restore(sess, save_file)

    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: mnist.test.images, labels: mnist.test.labels})

print('Test Accuracy: {}'.format(test_accuracy))

Test Accuracy: 0.7229999899864197

That's it! You now know how to save and load a trained model in TensorFlow. Let's look at
loading weights and biases into modified models in the next section.
15. Finetuning
Loading the Weights and Biases into a New Model
Sometimes you might want to adjust, or "finetune" a model that you have already trained and
saved.

However, loading saved Variables directly into a modified model can generate errors. Let's go
over how to avoid these problems.

Naming Error
TensorFlow uses a string identifier for Tensors and Operations called  name . If a name is not
given, TensorFlow will create one automatically. TensorFlow will give the first node the
name  <Type> , and then give the name  <Type>_<number>  for the subsequent nodes. Let's see
how this can affect loading a model with a different order of  weights  and  bias :

import tensorflow as tf

# Remove the previous weights and bias
tf.reset_default_graph()

save_file = 'model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]))
weights = tf.Variable(tf.truncated_normal([2, 3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - ERROR
    saver.restore(sess, save_file)
The code above prints out the following:

Save Weights: Variable:0
Save Bias: Variable_1:0
Load Weights: Variable_1:0
Load Bias: Variable:0

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to
match.

You'll notice that the  name  properties for  weights  and  bias  are different than when you
saved the model. This is why the code produces the "Assign requires shapes of both tensors to
match" error. The code  saver.restore(sess, save_file)  is trying to load weight data
into  bias  and bias data into  weights .
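One way to picture the failure: a checkpoint behaves like a name-to-array mapping, and auto-generated names depend only on declaration order. A plain-Python sketch of the mismatch (an analogy for the behavior, not the Saver's actual implementation):

```python
import numpy as np

# Checkpoint written with auto-generated names: weights declared first, bias second
checkpoint = {
    'Variable:0': np.zeros((2, 3)),    # weights were saved under the first auto name
    'Variable_1:0': np.zeros(3),       # bias under the second
}

# The new graph declares bias FIRST, so the auto-generated names are swapped:
bias_name = 'Variable:0'       # bias now claims the first auto name
weights_name = 'Variable_1:0'  # weights get the second

# Restoring by name would try to assign the wrong shapes
print(checkpoint[bias_name].shape)     # (2, 3), but bias expects (3,)
print(checkpoint[weights_name].shape)  # (3,), but weights expect (2, 3)
```

The shapes no longer line up, which is exactly the "Assign requires shapes of both tensors to match" error above.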

Instead of letting TensorFlow set the  name  property, let's set it manually:

import tensorflow as tf

tf.reset_default_graph()

save_file = 'model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - No Error
    saver.restore(sess, save_file)
    print('Loaded Weights and Bias successfully.')

Save Weights: weights_0:0
Save Bias: bias_0:0
Load Weights: weights_0:0
Load Bias: bias_0:0
Loaded Weights and Bias successfully.

That worked! The Tensor names match and the data loaded correctly.
16. Quiz: TensorFlow Dropout
TensorFlow Dropout

Figure 1: Taken from the paper "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting" (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)

Dropout is a regularization technique for reducing overfitting. The technique temporarily
drops units (artificial neurons) from the network, along with all of those units' incoming and
outgoing connections. Figure 1 illustrates how dropout works.

TensorFlow provides the  tf.nn.dropout()  function, which you can use to implement
dropout.

Let's look at an example of how to use  tf.nn.dropout() .

keep_prob = tf.placeholder(tf.float32)  # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

The code above illustrates how to apply dropout to a neural network.

The  tf.nn.dropout()  function takes in two parameters:

1. hidden_layer : the tensor to which you would like to apply dropout
2. keep_prob : the probability of keeping (i.e. not dropping) any given unit

keep_prob  allows you to adjust the number of units to drop. In order to compensate for
dropped units,  tf.nn.dropout()  multiplies all units that are kept (i.e. not dropped)
by  1/keep_prob .

During training, a good starting value for  keep_prob  is  0.5 .

During testing, use a  keep_prob  value of  1.0  to keep all units and maximize the power of
the model.
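The 1/keep_prob scaling (often called "inverted dropout") is easy to verify in plain NumPy. This sketch illustrates the idea rather than TensorFlow's implementation:

```python
import numpy as np

def dropout(x, keep_prob, training=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob,
    then scale the kept units by 1/keep_prob."""
    if not training:  # at test time (keep_prob = 1.0), pass everything through
        return x
    mask = np.random.rand(*x.shape) < keep_prob
    return x * mask / keep_prob  # scaling keeps the expected output equal to x

np.random.seed(0)
x = np.ones((1000, 100))
out = dropout(x, keep_prob=0.5)

# Roughly half the units are zeroed, the rest are doubled,
# so the mean output stays close to the mean input
print(out.mean())
```

Because the kept units are scaled up during training, no rescaling is needed at test time; you simply keep every unit, which is why  keep_prob  is set to 1.0 for evaluation.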

Quiz 1
Take a look at the code snippet below. Do you see what's wrong?

There's nothing wrong with the syntax, however the test accuracy is extremely low.

...

keep_prob = tf.placeholder(tf.float32)  # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i in range(batches):
            ....

            sess.run(optimizer, feed_dict={
                features: batch_features,
                labels: batch_labels,
                keep_prob: 0.5})

    validation_accuracy = sess.run(accuracy, feed_dict={
        features: test_features,
        labels: test_labels,
        keep_prob: 0.5})
What's wrong with the above code?

 Dropout doesn't work with batching.
 The keep_prob value of 0.5 is too low.
 There shouldn't be a value passed to keep_prob when testing for accuracy.
 keep_prob should be set to 1.0 when evaluating validation accuracy.

SOLUTION:keep_prob should be set to 1.0 when evaluating validation accuracy.


Quiz 2
This quiz will be starting with the code from the ReLU Quiz and applying a dropout layer.
Build a model with a ReLU layer and dropout layer using the  keep_prob  placeholder to pass
in a probability of  0.5 . Print the logits from the model.

Note: Output will be different every time the code is run. This is caused by dropout
randomizing the units it drops.

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf

hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[0.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model with Dropout

# TODO: Print logits from a session


# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf

hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[0.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model with Dropout
keep_prob = tf.placeholder(tf.float32)
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print logits from a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits, feed_dict={keep_prob: 0.5}))
