
Part 01_Introduction to Deep Learning

You'll also learn about model evaluation and validation, important techniques for training
and assessing neural networks. We also have guest instructor Andrew Trask, author
of Grokking Deep Learning, who will develop a neural network for processing text and
predicting sentiment.
You'll also use convolutional networks to build an autoencoder, a network architecture used
for image compression and denoising. Then, you'll use a pretrained neural network (VGGnet)
to classify images of flowers the network has never seen before, a technique known
as transfer learning.
Then, you'll learn about word embeddings and implement the Word2Vec model, a network
that can learn about semantic relationships between words. These are used to increase the
efficiency of networks when you're processing text.
Applying Deep Learning

 01. Introduction
 02. Style Transfer
 03. DeepTraffic
 04. Flappy Bird
 05. Books to Read
02. Style Transfer
Style Transfer
As an example of the kind of things you'll be building with deep learning models, here is a
really fun project, fast style transfer. Style transfer allows you to take famous paintings, and
recreate your own images in their styles! The network learns the underlying techniques of
those paintings and figures out how to apply them on its own. This model was trained on the
styles of famous paintings and is able to transfer those styles to other images and even videos!

I used it to style my cat Chihiro in the style of Hokusai's The Great Wave Off Kanagawa.
DeepTraffic
Another great application of deep learning is in simulating traffic and making driving
decisions. You can find the DeepTraffic simulator here. The network here is attempting
to learn a driving strategy such that the car is moving as fast as possible
using reinforcement learning. The network is rewarded when the car chooses actions
that result in it moving fast. It's this feedback that allows the network to find a
strategy of actions for optimal speed.

To learn more about setting the parameters and training the network, read
the overview here.

Discuss how you built your network and your results with your fellow students in
Study Groups.
04. Flappy Bird
Flappy Bird

In this example, you'll get to see a deep learning agent playing Flappy Bird! You have the
option to train the agent yourself, but for now let's just start with the pre-trained network given
by the author. Note that the following agent is able to play without being told any information
about the structure of the game or its rules. It automatically discovers the rules of the game by
finding out how it did on each iteration.

We will be following this repository by Yenchen Lin.

Instructions

1. Install miniconda or anaconda if you have not already. You can follow our tutorial for help.
2. Create an environment for flappybird
o Mac/Linux:  conda create --name=flappybird python=2.7
o Windows:  conda create --name=flappybird python=3.5
3. Enter your conda environment
o Mac/Linux:  source activate flappybird
o Windows:  activate flappybird
4. conda install -c menpo opencv3
5. pip install pygame
6. pip install tensorflow
7. git clone https://github.com/yenchenlin/DeepLearningFlappyBird.git
8. cd DeepLearningFlappyBird
9. python deep_q_network.py

If all went correctly, you should be seeing a deep learning based agent play Flappy Bird! The
repository contains instructions for training your own agent if you're interested!

Books to read
We believe that you learn best when you are exposed to multiple perspectives on the
same idea. As such, we recommend checking out a few of the books below to get an
added perspective on Deep Learning.

 Grokking Deep Learning by Andrew Trask. Use our exclusive discount
code traskud17 for 40% off. This provides a very gentle introduction to Deep
Learning and covers the intuition more than the theory.

 Neural Networks And Deep Learning by Michael Nielsen. This book is more
rigorous than Grokking Deep Learning and includes a lot of fun, interactive
visualizations to play with.

 The Deep Learning Textbook from Ian Goodfellow, Yoshua Bengio, and Aaron
Courville. This online book contains a lot of material and is the most rigorous of
the three books suggested.

INSTRUCTOR NOTE:

Anaconda is a distribution of packages built for data science. It comes with conda, a
package and environment manager. You'll be using conda to create environments for
isolating your projects that use different versions of Python and/or different packages.
You'll also use it to install, uninstall, and update packages in your environments. Using
Anaconda has made my life working with data much more pleasant.
04. Installing Anaconda
Installation instructions

Installing Anaconda
Anaconda is available for Windows, Mac OS X, and Linux. You can find the installers
and installation instructions at https://www.anaconda.com/download/.

If you already have Python installed on your computer, this won't break anything.
Instead, the default Python used by your scripts and programs will be the one that
comes with Anaconda.

Choose the Python 3.6 version; you can install Python 2 versions later. Also, choose
the 64-bit installer if you have a 64-bit operating system; otherwise go with the 32-bit
installer. Go ahead and choose the appropriate version, then install it. Continue on
afterwards!

After installation, you’re automatically in the default conda environment with all
packages installed which you can see below. You can check out your own install by
entering  conda list  into your terminal.

On Windows
A bunch of applications are installed along with Anaconda:

 Anaconda Navigator, a GUI for managing your environments and packages
 Anaconda Prompt, a terminal where you can use the command line interface to
manage your environments and packages
 Spyder, an IDE geared toward scientific development

To avoid errors later, it's best to update all the packages in the default environment.
Open the Anaconda Prompt application. In the prompt, run the following
commands:

conda upgrade conda
conda upgrade --all
and answer yes when asked if you want to install the packages. The packages that
come with the initial install tend to be out of date, so updating them now will prevent
future errors from out of date software.

Note: In the previous step, running  conda upgrade conda  should not be necessary
because  --all  includes the conda package itself, but some users have encountered
errors without it.

In the rest of this lesson, I'll be asking you to use commands in your terminal. I highly
suggest you start working with Anaconda this way, then later use the GUI if you'd like.

Troubleshooting
If you are seeing a "conda command not found" error and are using ZShell, you
have to do the following:

 Add  export PATH="/Users/username/anaconda/bin:$PATH"  to your .zsh_config file.


05. Managing packages
Managing Packages
Once you have Anaconda installed, managing packages is fairly straightforward. To
install a package, type  conda install package_name  in your terminal. For example, to
install numpy, type  conda install numpy .

You can install multiple packages at the same time. Something like  conda install
numpy scipy pandas  will install all those packages simultaneously. It's also possible to
specify which version of a package you want by adding the version number such
as  conda install numpy=1.10 .

Conda also automatically installs dependencies for you. For example,  scipy  depends
on  numpy : it uses and requires it. If you install just  scipy  ( conda install scipy ),
Conda will also install  numpy  if it isn't already installed.

Most of the commands are pretty intuitive. To uninstall, use  conda remove
package_name . To update a package, use  conda update package_name . If you want to
update all packages in an environment, which is often useful, use  conda update
--all . And finally, to list installed packages, it's  conda list , which you've seen
before.

If you don't know the exact name of the package you're looking for, you can try
searching with  conda search *search_term* . For example, I know I want to
install Beautiful Soup, but I'm not sure of the exact package name. So, I try  conda
search *beautifulsoup* . Note that your shell might expand the wildcard  *  before
running the conda command. To fix this, wrap the search string in single or double
quotes like  conda search '*beautifulsoup*' .
It returns a list of the Beautiful Soup packages available with the appropriate package
name,  beautifulsoup4 .

Which of these commands would you use to install the
packages  numpy  and  pandas  with conda? (More than one might be correct - select all
that apply.)

 
conda install numpy

 
conda install pandas

 
conda install numpy pandas

SOLUTION:

 `conda install pandas`
 `conda install numpy pandas`

07. More environment actions
Saving and loading environments
A really useful feature is sharing environments so others can install all the packages
used in your code, with the correct versions. You can save the packages to a YAML file
with  conda env export > environment.yaml . The first part  conda env export  writes
out all the packages in the environment, including the Python version.

Exported environment printed to the terminal

Above you can see the name of the environment and all the dependencies (along with
versions) are listed. The second part of the export command,  >
environment.yaml  writes the exported text to a YAML file  environment.yaml . This file
can now be shared and others will be able to create the same environment you used
for the project.

To create an environment from an environment file use  conda env create -f
environment.yaml . This will create a new environment with the same name listed
in  environment.yaml .

Listing environments
If you forget what your environments are named (happens to me sometimes),
use  conda env list  to list out all the environments you've created. You should see a
list of environments; there will be an asterisk next to the environment you're currently
in. The default environment, the environment used when you aren't in one, is
called  root .

Removing environments
If there are environments you don't use anymore,  conda env remove -n env_name  will
remove the specified environment (here, named  env_name ).

https://docs.getpelican.com/en/stable/

Pelican 4.2.0
Pelican is a static site generator, written in Python. Highlights include:

 Write your content directly with your editor of choice
in reStructuredText or Markdown formats
 Includes a simple CLI tool to (re)generate your site
 Easy to interface with distributed version control systems and web hooks
 Completely static output is easy to host anywhere

Ready to get started? Check out the Quickstart guide.


08. Best practices
Best practices
Using environments
One thing that’s helped me tremendously is having separate environments for Python 2 and
Python 3. I used  conda create -n py2 python=2  and  conda create -n py3 python=3  to
create two separate environments,  py2  and  py3 . Now I have a general use environment for
each Python version. In each of those environments, I've installed most of the standard data
science packages (numpy, scipy, pandas, etc.). Remember that when you set up an
environment initially, you'll only start with the standard packages and whatever packages you
specify in your  conda create  statement.

I’ve also found it useful to create environments for each project I’m working on. It works great
for non-data related projects too like web apps with Flask. For example, I have an environment
for my personal blog using Pelican.

Sharing environments
When sharing your code on GitHub, it's good practice to make an environment file and include
it in the repository. This will make it easier for people to install all the dependencies for your
code. I also usually include a pip  requirements.txt  file using  pip freeze  (learn more
here) for people not using conda.

More to learn
To learn more about conda and how it fits in the Python ecosystem, check out this article by
Jake Vanderplas: Conda myths and misconceptions. And here's the conda documentation you
can reference later.
09. On Python versions at Udacity
Python versions at Udacity
Most Nanodegree programs at Udacity will be (or are already) using Python 3 almost
exclusively.

Why we're using Python 3

 Jupyter is switching to Python 3 only
 Python 2.7 is being retired
 Python 3 has been out for almost 10 years, and there are very few dependencies (and none in
this program) that are incompatible.

At this point, there are enough new features in Python 3 that it doesn't make much sense to
stick with Python 2 unless you're working with old code. All new Python code should be
written for version 3. Read more here.

The main breakage between Python 2 and 3


For the most part, Python 2 code will work with Python 3. Of course, most new features
introduced with Python 3 versions won't be backwards compatible. The place where your
Python 2 code will fail most often is the  print  statement.

For most of Python's history including Python 2, printing was done like so:

print "Hello", "world!"
> Hello world!
This was changed in Python 3 to a function.

print("Hello", "world!")
> Hello world!
The  print  function was back-ported to Python 2 in version 2.6 through
the  __future__  module:

# In Python 2.6+
from __future__ import print_function
print("Hello", "world!")
> Hello world!
The  print  statement doesn't work in Python 3. If you want to print something and have it
work in both Python versions, you'll need to import  print_function  in your Python 2 code.

Jupyter Notebooks
 01. Instructor
 02. What are Jupyter notebooks?
 03. Installing Jupyter Notebook
 04. Launching the notebook server
 05. Notebook interface
 06. Code cells
 07. Markdown cells
 08. Keyboard shortcuts
 09. Magic keywords
 10. Converting notebooks
 11. Creating a slideshow
 12. Finishing up
02. What are Jupyter notebooks?
Jupyter

What are Jupyter notebooks?
Welcome to this lesson on using Jupyter notebooks. The notebook is a web application that
allows you to combine explanatory text, math equations, code, and visualizations all in one
easily sharable document. For example, here's one of my favorite notebooks shared recently,
the analysis of gravitational waves from two colliding black holes detected by the LIGO
experiment. You could download the data, run the code in the notebook, and repeat the
analysis, in effect detecting the gravitational waves yourself!

Notebooks have quickly become an essential tool when working with data. You'll find them
being used for data cleaning and exploration, visualization, machine learning, and big data
analysis. Here's an example notebook I made for my personal blog that shows off many of the
features of notebooks. Typically you'd be doing this work in a terminal, either the normal
Python shell or with IPython. Your visualizations would be in separate windows, any
documentation would be in separate documents, along with various scripts for functions and
classes. However, with notebooks, all of these are in one place and easily read together.

Notebooks are also rendered automatically on GitHub. It’s a great feature that lets you easily
share your work. There is also http://nbviewer.jupyter.org/ that renders the notebooks from
your GitHub repo or from notebooks stored elsewhere.

Literate programming
Notebooks are a form of literate programming proposed by Donald Knuth in 1984. With
literate programming, the documentation is written as a narrative alongside the code instead of
sitting off by its own. In Donald Knuth's words,

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate
rather on explaining to human beings what we want a computer to do.

After all, code is written for humans, not for computers. Notebooks provide exactly this
capability. You are able to write documentation as narrative text, along with code. This is not
only useful for the people reading your notebooks, but for your future self coming back to the
analysis.

Just a small aside: recently, this idea of literate programming has been extended to a whole
programming language, Eve.

How notebooks work


Jupyter notebooks grew out of the IPython project started by Fernando Perez. IPython is an
interactive shell, similar to the normal Python shell but with great features like syntax
highlighting and code completion. Originally, notebooks worked by sending messages from
the web app (the notebook you see in the browser) to an IPython kernel (an IPython
application running in the background). The kernel executed the code, then sent it back to the
notebook. The current architecture is similar, drawn out below.

From Jupyter documentation

The central point is the notebook server. You connect to the server through your browser and
the notebook is rendered as a web app. Code you write in the web app is sent through the
server to the kernel. The kernel runs the code and sends it back to the server, then any output is
rendered back in the browser. When you save the notebook, it is written to the server as a
JSON file with a  .ipynb  file extension.

The great part of this architecture is that the kernel doesn't need to run Python. Since the
notebook and the kernel are separate, code in any language can be sent between them. For
example, two of the earlier non-Python kernels were for the R and Julia languages. With an R
kernel, code written in R will be sent to the R kernel where it is executed, exactly the same as
Python code running on a Python kernel. IPython notebooks were renamed because notebooks
became language agnostic. The new name Jupyter comes from the combination
of Julia, Python, and R. If you're interested, here's a list of available kernels.

Another benefit is that the server can be run anywhere and accessed via the internet. Typically
you'll be running the server on your own machine where all your data and notebook files are
stored. But, you could also set up a server on a remote machine or cloud instance like
Amazon's EC2. Then, you can access the notebooks in your browser from anywhere in the
world.
03. Installing Jupyter Notebook
Installing Jupyter Notebook
By far the easiest way to install Jupyter is with Anaconda. Jupyter notebooks automatically
come with the distribution. You'll be able to use notebooks from the default environment.

To install Jupyter notebooks in a conda environment, use  conda install jupyter
notebook .

Jupyter notebooks are also available through pip with  pip install jupyter notebook .
04. Launching the notebook server
Launching the notebook server
To start a notebook server, enter  jupyter notebook  in your terminal or console. This will
start the server in the directory you ran the command in. That means any notebook files will be
saved in that directory. Typically you'd want to start the server in the directory where your
notebooks live. However, you can navigate through your file system to where the notebooks
are.

When you run the command (try it yourself!), the server home should open in your browser.
By default, the notebook server runs at  http://localhost:8888 . If you aren't familiar with
this,  localhost  means your computer and  8888  is the port the server is communicating on.
As long as the server is still running, you can always come back to it by going to
http://localhost:8888 in your browser.

If you start another server, it'll try to use port  8888 , but since it is occupied, the new server
will run on port  8889 . Then, you'd connect to it at  http://localhost:8889 . Every
additional notebook server will increment the port number like this.

If you tried starting your own server, it should look something like this:
05. Notebook interface
Notebook interface
When you create a new notebook, you should see something like this:
06. Code cells
Code cells
Most of your work in notebooks will be done in code cells. This is where you write your code
and it gets executed. In code cells you can write any code, assigning variables, defining
functions and classes, importing packages, and more. Any code executed in one cell is
available in all other cells.

To give you some practice, I created a notebook you can work through. Download the
notebook Working With Code Cells below then run it from your own notebook server. (In
your terminal, change to the directory with the notebook file, then enter  jupyter notebook )
Your browser might try to open the notebook file without downloading it. If that happens,
right click on the link then choose "Save Link As…"

07. Markdown cells
Markdown cells
As mentioned before, cells can also be used for text written in Markdown. Markdown is a
formatting syntax that allows you to include links, style text as bold or italicized, and format
code. As with code cells, you press Shift + Enter or Control + Enter to run the Markdown
cell, where it will render the Markdown to formatted text. Including text allows you to write a
narrative alongside your code, as well as documenting your code and the thoughts that went
into it.

You can find the documentation here, but I'll provide a short primer.

Headers
You can write headers using the pound/hash/octothorpe symbol  #  placed before the text.
One  #  renders as an  h1  header, two  # s render as an  h2 , and so on. It looks like this:

# Header 1
## Header 2
### Header 3
renders as

Header 1
Header 2
Header 3

Links
Linking in Markdown is done by enclosing text in square brackets and the URL in
parentheses, like this  [Udacity's home page](https://www.udacity.com)  for a link
to Udacity's home page.

Emphasis
You can add emphasis through bold or italics with asterisks or underscores ( *  or  _ ). For
italics, wrap the text in one asterisk or underscore,  _gelato_  or  *gelato*  renders as gelato.

Bold text uses two symbols,  **aardvark**  or  __aardvark__  looks like aardvark.

Either asterisks or underscores are fine as long as you use the same symbol on both sides of
the text.

Code
There are two different ways to display code, inline with text and as a code block separated
from the text. To format inline code, wrap the text in backticks. For
example,  `string.punctuation`  renders as  string.punctuation .
To create a code block, start a new line and wrap the text in three backticks

```
import requests
response = requests.get('https://www.udacity.com')
```
or indent each line of the code block with four spaces.

import requests
response = requests.get('https://www.udacity.com')
Math expressions
You can create math expressions in Markdown cells using LaTeX symbols. Notebooks use
MathJax to render the LaTeX symbols as math symbols. To start math mode, wrap the LaTeX
in dollar signs  $y = mx + b$  for inline math. For a math block, use double dollar signs,

$$
y = \frac{a}{b+c}
$$
This is a really useful feature, so if you don't have experience with LaTeX please read this
primer on using it to create math expressions.

Wrapping up
Here's a cheatsheet you can use as a reference for writing Markdown. My advice is to make
use of the Markdown cells. Your notebooks will be much more readable compared to a bunch
of code blocks.
08. Keyboard shortcuts
Keyboard shortcuts
Notebooks come with a bunch of keyboard shortcuts that let you use your keyboard to interact
with the cells, instead of using the mouse and toolbars. They take a bit of time to get used to,
but when you're proficient with the shortcuts you'll be much faster at working in notebooks.
To learn more about the shortcuts and get practice using them, download the
notebook Keyboard Shortcuts below. Again, your browser might try to open it, but you want
to save it to your computer. Right click on the link, then choose "Save Link As…"
09. Magic keywords
Magic keywords
Magic keywords are special commands you can run in cells that let you control the notebook
itself or perform system calls such as changing directories. For example, you can set up
matplotlib to work interactively in the notebook with  %matplotlib .

Magic commands are preceded with one or two percent signs ( %  or  %% ) for line magics and
cell magics, respectively. Line magics apply only to the line the magic command is written on,
while cell magics apply to the whole cell.

NOTE: These magic keywords are specific to the normal Python kernel. If you are using other
kernels, these most likely won't work.

Timing code
At some point, you'll probably spend some effort optimizing code to run faster. Timing how
quickly your code runs is essential for this optimization. You can use the  timeit  magic
command to time how long it takes for a function to run, like so:
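The screenshot that followed this paragraph isn't reproduced here. As a rough sketch of what  %timeit  measures, here is a plain-Python equivalent using the standard-library  timeit  module that the magic wraps (the  fibo  function is a hypothetical example, not from the course):

```python
from timeit import timeit

def fibo(n):
    # Naive recursive Fibonacci -- deliberately slow, handy for timing demos
    return n if n < 2 else fibo(n - 1) + fibo(n - 2)

# Time 100 calls of fibo(20) and report the average time per call
total_seconds = timeit('fibo(20)', globals=globals(), number=100)
print('average per call: {:.6f} s'.format(total_seconds / 100))
```

In a notebook cell you would instead write the line magic directly, e.g.  %timeit fibo(20) , and it picks the number of runs for you.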
10. Converting notebooks
Converting notebooks
Notebooks are just big JSON files with the extension  .ipynb .

Notebook file opened in a text editor shows JSON data
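Because the format is just JSON, you can open and inspect a notebook with nothing but Python's  json  module. A minimal sketch (the notebook below is a hand-written, simplified nbformat-4 skeleton, not a real course file):

```python
import json

# A tiny, simplified nbformat-4 notebook written out as a JSON string
nb_json = '''{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Hello"]},
    {"cell_type": "code", "execution_count": null, "metadata": {},
     "outputs": [], "source": ["print('hi')"]}
  ]
}'''

nb = json.loads(nb_json)
print(nb['nbformat'])                               # 4
print([cell['cell_type'] for cell in nb['cells']])  # ['markdown', 'code']
```

The same  json.loads  call works on the contents of any real  .ipynb  file, which is essentially how tools like  nbconvert  read them.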

Since notebooks are JSON, it is simple to convert them to other formats. Jupyter comes with a
utility called  nbconvert  for converting to HTML, Markdown, slideshows, etc.

For example, to convert a notebook to an HTML file, in your terminal use

jupyter nbconvert --to html notebook.ipynb


Converting to HTML is useful for sharing your notebooks with others who aren't using
notebooks. Markdown is great for including a notebook in blogs and other text editors that
accept Markdown formatting.
As always, learn more about  nbconvert  from the documentation.
11. Creating a slideshow
Creating a slideshow
Creating slideshows from notebooks is one of my favorite features. You can see an example of a
slideshow here introducing pandas for working with data.

The slides are created in notebooks like normal, but you'll need to designate which cells are
slides and the type of slide the cell will be. In the menu bar, click View > Cell Toolbar >
Slideshow to bring up the slide cell menu on each cell.

Turning on Slideshow toolbars for cells

This will show a menu dropdown on each cell that lets you choose how the cell shows up in
the slideshow.
Choose slide type

Slides are full slides that you move through left to right. Sub-slides show up in the slideshow
by pressing up or down. Fragments are hidden at first, then appear with a button press. You
can skip cells in the slideshow with Skip and Notes leaves the cell as speaker notes.

Running the slideshow
To create the slideshow from the notebook file, you'll need to use  nbconvert :

jupyter nbconvert notebook.ipynb --to slides


This just converts the notebook to the necessary files for the slideshow, but you need to serve
it with an HTTP server to actually see the presentation.

To convert it and immediately see it, use

jupyter nbconvert notebook.ipynb --to slides --post serve


This will open up the slideshow in your browser so you can present it.
12. Finishing up
Congratulations!
You've made it to the end of this short course on tools in the Python data science workflow.
Making good use of Anaconda and Jupyter Notebooks will increase your productivity and
general well-being. There is a lot to learn to get the most out of these, Markdown and LaTeX
for instance, but after a bit you'll be wondering why data analysis is done any other way.

Again, congratulations and good luck!

Part 1 >> Lesson 5: Matrix Math and NumPy Refresher

 01. Introduction
 02. Data Dimensions
 03. Data in NumPy
 04. Element-wise Matrix Operations
 05. Element-wise Operations in NumPy
 06. Matrix Multiplication: Part 1
 07. Matrix Multiplication: Part 2
 08. NumPy Matrix Multiplication
 09. Matrix Transposes
 10. Transposes in NumPy
 11. NumPy Quiz
Part 02_Neural Networks

Introduction to Neural Networks

 01. Instructor
 02. Introduction
 03. Classification Problems 1
 04. Classification Problems 2
 05. Linear Boundaries
 06. Higher Dimensions
 07. Perceptrons
 08. Why "Neural Networks"?
 09. Perceptrons as Logical Operators
 10. Perceptron Trick
 11. Perceptron Algorithm
 12. Non-Linear Regions
 13. Error Functions
 14. Log-loss Error Function
 15. Discrete vs Continuous
 16. Softmax
 17. One-Hot Encoding
 18. Maximum Likelihood
 19. Maximizing Probabilities
 20. Cross-Entropy 1
 21. Cross-Entropy 2
 22. Multi-Class Cross Entropy
 23. Logistic Regression
 24. Gradient Descent
 25. Logistic Regression Algorithm
 26. Pre-Lab: Gradient Descent
 27. Notebook: Gradient Descent
 28. Perceptron vs Gradient Descent
 29. Continuous Perceptrons
 30. Non-linear Data
 31. Non-Linear Models
 32. Neural Network Architecture
 33. Feedforward
 34. Backpropagation
 35. Pre-Lab: Analyzing Student Data
 36. Notebook: Analyzing Student Data
 37. Outro
00:00:00.000 --> 00:00:04.040
So you may be wondering why these objects are called neural networks.

00:00:04.040 --> 00:00:06.059
Well, the reason why they're called neural networks is

00:00:06.059 --> 00:00:08.759
because perceptrons kind of look like neurons in the brain.

00:00:08.759 --> 00:00:11.525
On the left we have a perceptron with four inputs.

00:00:11.525 --> 00:00:12.809
The numbers are one, zero,

00:00:12.808 --> 00:00:14.689
four, and minus two.

00:00:14.689 --> 00:00:15.990
And what the perceptron does,

00:00:15.990 --> 00:00:20.504
it calculates some equations on the input and decides to return a one or a
zero.

00:00:20.504 --> 00:00:24.839
In a similar way, neurons in the brain take inputs coming from the dendrites.

00:00:24.838 --> 00:00:27.058
These inputs are nerve impulses.

00:00:27.059 --> 00:00:30.464
So what the neuron does is it does something with the nerve impulses

00:00:30.463 --> 00:00:35.054
and then it decides if it outputs a nerve impulse or not through the axon.

00:00:35.054 --> 00:00:38.070
The way we'll create neural networks later in this lesson

00:00:38.070 --> 00:00:40.649
is by concatenating these perceptrons, so we'll be mimicking

00:00:40.649 --> 00:00:43.200
the way the brain connects neurons by taking the output from

00:00:43.200 --> 00:00:46.130
one and turning it into the input for another one.
09. Perceptrons as Logical Operators
Perceptrons as Logical Operators
In this lesson, we'll see one of the many great applications of perceptrons. As logical
operators! You'll have the chance to create the perceptrons for the most common of these,
the AND, OR, and NOT operators. And then, we'll see what to do about the
elusive XOR operator. Let's dive in!

AND Perceptron
AND and OR Perceptrons
What are the weights and bias for the AND perceptron?
Set the weights ( weight1 ,  weight2 ) and bias ( bias ) to the correct values that
calculate the AND operation as shown above.

Start Quiz:

import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = 0.0
bias = 0.0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', ' Input 2', ' Linear Combination', ' Activation Output', ' Is Correct'])
if not num_wrong:
    print('Nice! You got it all correct.\n')
else:
    print('You got {} wrong. Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))
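If you want to check your answer, here is one possible weight setting (many others work just as well). With these values the linear combination is non-negative only when both inputs are 1:

```python
# One possible (not the only) solution to the AND quiz above: the perceptron
# fires only when weight1*x1 + weight2*x2 + bias >= 0, i.e. only for (1, 1).
weight1, weight2, bias = 1.0, 1.0, -2.0

and_results = {}
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    linear_combination = weight1 * x1 + weight2 * x2 + bias
    and_results[(x1, x2)] = int(linear_combination >= 0)

print(and_results)  # only (1, 1) maps to 1
```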

The OR perceptron is very similar to an AND perceptron. In the image below, the OR
perceptron has the same line as the AND perceptron, except the line is shifted down.
What can you do to the weights and/or bias to achieve this? Use the following AND
perceptron to create an OR Perceptron.
OR Perceptron Quiz

What are two ways to go from an AND perceptron to an OR perceptron?

 
Increase the weights
 
Decrease the weights

 
Increase a single weight

 
Decrease a single weight

 
Increase the magnitude of the bias

 
Decrease the magnitude of the bias
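Both correct adjustments can be checked numerically. Below is a minimal sketch, starting from an illustrative AND perceptron (weights 1, 1 and bias -2 — one valid choice, not the only one):

```python
def perceptron(x1, x2, w1, w2, b):
    # Fires when the linear combination is non-negative.
    return int(w1 * x1 + w2 * x2 + b >= 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND with illustrative weights 1, 1 and bias -2.
print([perceptron(x1, x2, 1.0, 1.0, -2.0) for x1, x2 in inputs])  # [0, 0, 0, 1]

# Decreasing the magnitude of the bias gives OR...
print([perceptron(x1, x2, 1.0, 1.0, -1.0) for x1, x2 in inputs])  # [0, 1, 1, 1]

# ...and so does increasing the weights.
print([perceptron(x1, x2, 2.0, 2.0, -2.0) for x1, x2 in inputs])  # [0, 1, 1, 1]
```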
NOT Perceptron
Unlike the other perceptrons we looked at, the NOT operation only cares about one
input. The operation returns a  0  if the input is  1  and a  1  if it's a  0 . The other inputs
to the perceptron are ignored.

In this quiz, you'll set the weights ( weight1 ,  weight2 ) and bias  bias  to the values
that calculate the NOT operation on the second input, ignoring the first input.

Start Quiz:
quiz.py
import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = 0.0
bias = 0.0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', ' Input 2', ' Linear Combination', ' Activation Output', ' Is Correct'])
if not num_wrong:
    print('Nice! You got it all correct.\n')
else:
    print('You got {} wrong. Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))
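Again, one possible solution (among many): a zero first weight makes the perceptron ignore the first input, and a negative second weight with a small positive bias makes it fire exactly when the second input is 0.

```python
# One possible (not unique) solution to the NOT quiz above.
weight1, weight2, bias = 0.0, -2.0, 1.0

not_results = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    not_results.append(int(weight1 * x1 + weight2 * x2 + bias >= 0))

print(not_results)  # [1, 0, 1, 0] -- NOT of the second input
```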
XOR Perceptron
Quiz: Build an XOR Multi-Layer Perceptron
Now, let's build a multi-layer perceptron from the AND, NOT, and OR perceptrons to
create XOR logic!

The neural network below contains 3 perceptrons, A, B, and C. The last one (AND) has
been given for you. The input to the neural network is from the first node. The output
comes out of the last node.

The multi-layer perceptron below calculates XOR. Each perceptron is a logic operation
of AND, OR, and NOT. However, the perceptrons A, B, and C don't indicate their
operation. In the following quiz, set the correct operations for the perceptrons to
calculate XOR.

QUIZ QUESTION::

Set the operations for the perceptrons in the XOR neural network.

ANSWER CHOICES:
A

 
B

 
C

Perceptron Operators
NOT
AND
OR
SOLUTION:
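Once you've chosen the operations, the composition can be verified in code. Below is a minimal sketch; the weight values are illustrative choices (not the only valid ones), and it uses the identity XOR(a, b) = AND(NOT(AND(a, b)), OR(a, b)):

```python
def and_p(x1, x2):
    # Illustrative AND perceptron: fires only for (1, 1).
    return int(1.0 * x1 + 1.0 * x2 - 2.0 >= 0)

def or_p(x1, x2):
    # Illustrative OR perceptron: fires for any input containing a 1.
    return int(1.0 * x1 + 1.0 * x2 - 1.0 >= 0)

def not_p(x):
    # Illustrative NOT perceptron on a single input.
    return int(-2.0 * x + 1.0 >= 0)

def xor(x1, x2):
    # XOR(a, b) = AND(NOT(AND(a, b)), OR(a, b))
    return and_p(not_p(and_p(x1, x2)), or_p(x1, x2))

print([xor(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```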
udacimak v1.1.3
10. Perceptron Trick
Perceptron Trick
In the last section you used your logic and your mathematical knowledge to create perceptrons
for some of the most common logical operators. In real life, though, we can't be building these
perceptrons ourselves. The idea is that we give them the result, and they build themselves. For
this, here's a pretty neat trick that will help us.

Perceptron Algorithm


Does the misclassified point want the line to be closer or farther?

 
Closer

 
Farther

SOLUTION:Closer
DL 10 S Perceptron Algorithm

Time for some math!

Now that we've learned that the misclassified points want the line to move closer to
them, let's do some math. The following video shows a mathematical trick that modifies the
equation of the line, so that it comes closer to a particular point.
11. Perceptron Algorithm
Perceptron Algorithm
And now, with the perceptron trick in our hands, we can fully develop the perceptron
algorithm! The following video will show you the pseudocode, and in the quiz below, you'll
have the chance to code it in Python.

Perceptron Algorithm Pseudocode


00:00:00.000 --> 00:00:04.995
Now, we finally have all the tools for describing the perceptron algorithm.

00:00:04.995 --> 00:00:06.410


We start with the random equation,

00:00:06.410 --> 00:00:07.895


which will determine some line,

00:00:07.894 --> 00:00:11.494


and two regions, the positive and the negative region.

00:00:11.494 --> 00:00:14.804


Now, we'll move this line around to get a better and better fit.

00:00:14.804 --> 00:00:17.265


So, we ask all the points how they're doing.

00:00:17.265 --> 00:00:20.964


The four correctly classified points say, "I'm good."

00:00:20.964 --> 00:00:25.890


And the two incorrectly classified points say, "Come closer."

00:00:25.890 --> 00:00:28.088


So, let's listen to the point on the right,

00:00:28.088 --> 00:00:31.484


and apply the trick to make the line closer to this point.

00:00:31.484 --> 00:00:34.704


So, here it is. Now, this point is good.

00:00:34.704 --> 00:00:36.869


Now, let's listen to the point on the left.

00:00:36.869 --> 00:00:38.349


The point says, "Come closer."

00:00:38.350 --> 00:00:39.770


We apply the trick,

00:00:39.770 --> 00:00:41.685


and now the line goes closer to it,

00:00:41.685 --> 00:00:45.094


and it actually goes over it classifying correctly.

00:00:45.094 --> 00:00:48.484


Now, every point is correctly classified and happy.

00:00:48.484 --> 00:00:52.670


So, let's actually write the pseudocode for this perceptron algorithm.

00:00:52.670 --> 00:00:53.780


We start with random weights,

00:00:53.780 --> 00:00:55.640


w1 up to wn and b.

00:00:55.640 --> 00:00:57.774


This gives us the equation wx plus b,

00:00:57.774 --> 00:01:02.004


the line, and the positive and negative areas.

00:01:02.005 --> 00:01:05.822


Now, for every misclassified point with coordinates x1 up to xn,

00:01:05.822 --> 00:01:07.740


we do the following.

00:01:07.739 --> 00:01:09.184


If the prediction was zero,

00:01:09.185 --> 00:01:12.879


which means the point is a positive point in the negative area,

00:01:12.879 --> 00:01:16.490


then we'll update the weights as follows: for i equals 1 to n,

00:01:16.489 --> 00:01:21.049


we change wi, to wi plus alpha times xi,

00:01:21.049 --> 00:01:23.664


where alpha is the learning rate.

00:01:23.665 --> 00:01:26.060


In this case, we're using 0.1.

00:01:26.060 --> 00:01:28.659


Sometimes, we use 0.01 etc.

00:01:28.659 --> 00:01:33.840


It depends. Then we also change the bias unit b to b plus alpha.

00:01:33.840 --> 00:01:38.024


That moves the line closer to the misclassified point.

00:01:38.024 --> 00:01:39.700


Now, if the prediction was one,

00:01:39.700 --> 00:01:42.415


which means a point is a negative point in the positive area,

00:01:42.415 --> 00:01:44.650


then we'll update the weights in a similar way,

00:01:44.650 --> 00:01:46.950


except we subtract instead of adding.

00:01:46.950 --> 00:01:50.545


This means for i equals 1 to n, change wi,

00:01:50.545 --> 00:01:53.299


to wi minus alpha xi,

00:01:53.299 --> 00:01:57.995


and change the bias unit b to b minus alpha.

00:01:57.995 --> 00:02:01.770


And now, the line moves closer to our misclassified point.

00:02:01.769 --> 00:02:05.024


And now, we just repeat this step until we get no errors,

00:02:05.025 --> 00:02:07.425


or until we have a number of errors that is small.

00:02:07.424 --> 00:02:08.564


Or simply we can just say,

00:02:08.564 --> 00:02:11.520


do the step a thousand times and stop.

00:02:11.520 --> 00:02:14.000


We'll see what are our options later in the class.
Coding the Perceptron Algorithm
Time to code! In this quiz, you'll have the chance to implement the perceptron
algorithm to separate the following data (given in the file data.csv).

Recall that the perceptron step works as follows. For a point with coordinates
(p, q), label y, and prediction given by the equation ŷ = step(w1x1 + w2x2 + b):

 If the point is correctly classified, do nothing.

 If the point is classified positive, but it has a negative label, subtract
αp, αq, and α from w1, w2, and b respectively.

 If the point is classified negative, but it has a positive label, add
αp, αq, and α to w1, w2, and b respectively.
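The three rules above can be sketched directly as a single update step (a minimal illustration, assuming labels y in {0, 1} and learning rate alpha):

```python
def perceptron_trick(w1, w2, b, p, q, y, alpha=0.1):
    # Prediction from the current line.
    y_hat = int(w1 * p + w2 * q + b >= 0)
    if y - y_hat == 1:     # positive point classified negative: add
        w1, w2, b = w1 + alpha * p, w2 + alpha * q, b + alpha
    elif y - y_hat == -1:  # negative point classified positive: subtract
        w1, w2, b = w1 - alpha * p, w2 - alpha * q, b - alpha
    return w1, w2, b

# A positive point at (1, 1) misclassified by the line x1 + x2 - 3 = 0:
print(perceptron_trick(1.0, 1.0, -3.0, 1.0, 1.0, 1))  # line moves toward the point
```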

Then click on  test run  to graph the solution that the perceptron algorithm gives you.
It'll actually draw a set of dotted lines that show how the algorithm approaches the
best solution, given by the black solid line.

Feel free to play with the parameters of the algorithm (number of epochs, learning
rate, and even the randomizing of the initial parameters) to see how your initial
conditions can affect the solution!
Start Quiz:
perceptron.py  data.csv  solution.py
import numpy as np

# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(42)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X, W) + b)[0])

# TODO: Fill in the code below to implement the perceptron trick.
# The function should receive as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# update the weights and bias W, b, according to the perceptron algorithm,
# and return W and b.
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    # Fill in code
    return W, b

# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 25):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2, 1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
        boundary_lines.append((-W[0]/W[1], -b/W[1]))
    return boundary_lines
solution.py
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y[i] - y_hat == 1:
            W[0] += X[i][0] * learn_rate
            W[1] += X[i][1] * learn_rate
            b += learn_rate
        elif y[i] - y_hat == -1:
            W[0] -= X[i][0] * learn_rate
            W[1] -= X[i][1] * learn_rate
            b -= learn_rate
    return W, b
14. Log-loss Error Function
Error Functions

We pick back up on log-loss error with the gradient descent concept.

Which of the following conditions should be met in order to apply gradient descent?
(Check all that apply.)

 
The error function should be discrete

 
The error function should contain only positive values

 
The error function should be differentiable

 
The error function should be normalized

 
The error function should be continuous

SOLUTION:

 The error function should be differentiable


 The error function should be continuous

15. Discrete vs Continuous
Discrete vs Continuous Predictions
In the last few videos, we learned that continuous error functions are better than discrete error
functions when it comes to optimizing. For this, we need to switch from discrete to
continuous predictions. The next two videos will guide us in doing that.
16. Softmax
Multi-Class Classification and Softmax

The Softmax Function


In the next video, we'll learn about the softmax function, which is the equivalent of the
sigmoid activation function, but when the problem has 3 or more classes.

Softmax Quiz

What function turns every number into a positive number?


 
sin

 
cos

 
log

 
exp

SOLUTION:exp
Quiz: Coding Softmax
And now, your time to shine! Let's code the formula for the Softmax function in
Python.

Start Quiz:
softmax.py solution.py
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    pass

import numpy as np

def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    result = []
    for i in expL:
        result.append(i * 1.0 / sumExpL)
    return result

# Note: The function np.divide can also be used here, as follows:
# def softmax(L):
#     expL = np.exp(L)
#     return np.divide(expL, expL.sum())
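A quick sanity check of the solution above: softmax outputs are all positive and sum to 1, with the largest score getting the largest probability (the input scores here are just example values):

```python
import numpy as np

def softmax(L):
    expL = np.exp(L)
    return [i * 1.0 / sum(expL) for i in expL]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0
```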
18. Maximum Likelihood
Maximum Likelihood
Probability will be one of our best friends as we go through Deep Learning. In this lesson,
we'll see how we can use probability to evaluate (and improve!) our models.

00:00:00.000 --> 00:00:02.859


So we're still in our quest for an algorithm that will help

00:00:02.859 --> 00:00:05.525


us pick the best model that separates our data.

00:00:05.525 --> 00:00:09.984


Well, since we're dealing with probabilities then let's use them in our
favor.

00:00:09.984 --> 00:00:12.894


Let's say I'm a student and I have two models.

00:00:12.894 --> 00:00:16.300


One that tells me that my probability of getting accepted is
00:00:16.300 --> 00:00:20.859
80% and one that tells me the probability is 55%.

00:00:20.859 --> 00:00:22.855


Which model looks more accurate?

00:00:22.855 --> 00:00:24.879


Well, if I got accepted then I'd say

00:00:24.879 --> 00:00:28.210


the better model is probably the one that says 80%.

00:00:28.210 --> 00:00:29.679


What if I didn't get accepted?

00:00:29.678 --> 00:00:33.983


Then the more accurate model is more likely the one that says 55 percent.

00:00:33.984 --> 00:00:37.469


But I'm just one person. What if it was me and a friend?

00:00:37.469 --> 00:00:40.789


Well, the best model would more likely be the one that

00:00:40.789 --> 00:00:44.506


gives the higher probabilities to the events that happened to us,

00:00:44.506 --> 00:00:47.085


whether it's acceptance or rejection.

00:00:47.085 --> 00:00:49.149


This sounds pretty intuitive.

00:00:49.149 --> 00:00:52.310


The method is called maximum likelihood.

00:00:52.310 --> 00:00:58.353


What we do is we pick the model that gives the existing labels the highest
probability.

00:00:58.353 --> 00:01:00.259


Thus, by maximizing the probability,

00:01:00.259 --> 00:01:02.000


we can pick the best possible model.
19. Maximizing Probabilities
Maximizing Probabilities
In this lesson and quiz, we will learn how to maximize a probability, using some math.
Nothing more than high school math, so get ready for a trip down memory lane!
20. Cross-Entropy 1
Cross Entropy 1

INSTRUCTOR NOTE:

Correction: At 2:18, the top right point should be labelled  -log(0.7)  instead of  -log(0.2) .
WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.250


Correct. The answer is logarithm,

00:00:02.250 --> 00:00:06.389


because logarithm has this very nice identity that says that the logarithm of
00:00:06.389 --> 00:00:11.929
the product A times B is the sum of the logarithms of A and B.

00:00:11.929 --> 00:00:13.294


So this is what we do.

00:00:13.294 --> 00:00:17.559


We take our products and we take the logarithms,

00:00:17.559 --> 00:00:21.854


so now we get a sum of the logarithms of the factors.

00:00:21.855 --> 00:00:28.219


So the ln(0.6*0.2*0.1*0.7) is equal to

00:00:28.219 --> 00:00:35.700


ln(0.6) + ln(0.2) + ln(0.1) + ln(0.7) etc. Now from now until the end of class,

00:00:35.700 --> 00:00:40.040


we'll be taking the natural logarithm which is base e instead of 10.

00:00:40.039 --> 00:00:41.759


Nothing different happens with base 10.

00:00:41.759 --> 00:00:44.945


Everything works the same as everything gets scaled by the same factor.

00:00:44.945 --> 00:00:46.770


So it's just more for convention.

00:00:46.770 --> 00:00:51.330


We can calculate those values and get minus 0.51, minus 1.61,

00:00:51.329 --> 00:00:58.164


minus 2.3 etc. Notice that they are all negative numbers and that actually makes sense.

00:00:58.164 --> 00:01:01.560


This is because the logarithm of a number between 0 and 1 is always

00:01:01.560 --> 00:01:05.594


a negative number since the logarithm of one is zero.

00:01:05.594 --> 00:01:07.789


So it actually makes sense to think of the negative of

00:01:07.790 --> 00:01:11.260


the logarithm of the probabilities and we'll get positive numbers.
00:01:11.260 --> 00:01:15.740
So that's what we'll do. We'll take the negative of the logarithm of the probabilities.

00:01:15.739 --> 00:01:18.905


The sum of the negatives of the logarithms of the probabilities,

00:01:18.905 --> 00:01:23.180


we'll call the cross-entropy, which is a very important concept in the class.

00:01:23.180 --> 00:01:25.385


If we calculate the cross entropies,

00:01:25.385 --> 00:01:30.255


we see that the bad model on left has a cross entropy 4.8 which is high.

00:01:30.254 --> 00:01:35.229


Whereas the good model on the right has a cross entropy of 1.2 which is low.

00:01:35.230 --> 00:01:37.454


This actually happens all the time.

00:01:37.454 --> 00:01:38.810


A good model will give us

00:01:38.810 --> 00:01:43.185


a low cross entropy and a bad model will give us a high cross entropy.

00:01:43.185 --> 00:01:44.629


The reason for this is simply that

00:01:44.629 --> 00:01:47.390


a good model gives us a high probability and the negative

00:01:47.390 --> 00:01:52.599


of the logarithm of a large number is a small number and vice versa.

00:01:52.599 --> 00:01:55.250


This method is actually much more powerful than we think.

00:01:55.250 --> 00:01:59.180


If we calculate the probabilities and pair the points with the corresponding logarithms,

00:01:59.180 --> 00:02:01.470


we actually get an error for each point.

00:02:01.469 --> 00:02:06.539


So again, here we have probabilities for both models and the products of them.
00:02:06.540 --> 00:02:09.944
Now, we take the negative of the logarithms which gives us sum of

00:02:09.944 --> 00:02:15.319


logarithms and if we pair each logarithm with the point where it came from,

00:02:15.319 --> 00:02:17.859


we actually get a value for each point.

00:02:17.860 --> 00:02:19.565


And if we calculate the values,

00:02:19.564 --> 00:02:22.185


we get this. Check it out.

00:02:22.185 --> 00:02:24.319


If we look carefully at the values we can see that

00:02:24.319 --> 00:02:26.430


the points that are misclassified have

00:02:26.430 --> 00:02:31.295


values like 2.3 for this point or 1.6 for this point,

00:02:31.294 --> 00:02:36.544


whereas the points that are correctly classified have small values.

00:02:36.544 --> 00:02:38.719


And the reason for this again is that

00:02:38.719 --> 00:02:42.604


a correctly classified point will have a probability that is close to 1,

00:02:42.604 --> 00:02:44.989


which when we take the negative of the logarithm,

00:02:44.990 --> 00:02:46.915


we'll get a small value.

00:02:46.914 --> 00:02:51.215


Thus we can think of the negatives of these logarithms as errors at each point.

00:02:51.215 --> 00:02:53.539


Points that are correctly classified will have

00:02:53.539 --> 00:02:57.594


small errors and points that are mis-classified will have large errors.
00:02:57.594 --> 00:03:02.530
And now we've concluded that our cross entropy will tell us if a model is good or bad.

00:03:02.530 --> 00:03:06.800


So now our goal has changed from maximizing a probability to minimizing

00:03:06.800 --> 00:03:12.580


a cross-entropy in order to get from the model on the left to the model on the right.

00:03:12.580 --> 00:03:14.655


And that error function that we're looking for,

00:03:14.655 --> 00:03:17.000


that was precisely the cross entropy.
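The per-point errors from the video can be reproduced directly. Using the probabilities the bad model assigns to the actual labels of its four points (0.6, 0.2, 0.1, 0.7, as in the transcript):

```python
import numpy as np

# Probabilities the bad model assigns to the actual label of each point.
probs = np.array([0.6, 0.2, 0.1, 0.7])

# The error at each point is the negative logarithm of its probability.
errors = -np.log(probs)
print(errors.round(2))         # misclassified points get large errors like 2.3
print(round(errors.sum(), 1))  # 4.8 -- the bad model's cross-entropy
```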
21. Cross-Entropy 2
Cross-Entropy
So we're getting somewhere, there's definitely a connection between probabilities and error
functions, and it's called Cross-Entropy. This concept is tremendously popular in many fields,
including Machine Learning. Let's dive more into the formula, and actually code it!

Formula For Cross 1



00:00:02.440 --> 00:00:07.140


Let's look a bit closer into Cross-Entropy by switching to a different example.

00:00:07.140 --> 00:00:08.935


Let's say we have three doors.

00:00:08.935 --> 00:00:11.330


And no this is not the Monty Hall problem.

00:00:11.330 --> 00:00:13.775


We have the green door, the red door,

00:00:13.775 --> 00:00:18.720


and the blue door, and behind each door we could have a gift or not have a gift.

00:00:18.720 --> 00:00:23.150


And the probabilities of there being a gift behind each door is 0.8 for the first one,

00:00:23.150 --> 00:00:24.935


0.7 for the second one,

00:00:24.935 --> 00:00:26.900


0.1 for the third one.

00:00:26.900 --> 00:00:29.805


So for example behind the green door

00:00:29.805 --> 00:00:33.155


there is an 80 percent probability of there being a gift,

00:00:33.155 --> 00:00:36.780


and a 20 percent probability of there not being a gift.

00:00:36.780 --> 00:00:39.610


So we can put the information in this table where

00:00:39.610 --> 00:00:42.970


the probabilities of there being a gift are given in the top row,

00:00:42.970 --> 00:00:46.630


and the probabilities of there not being a gift are given in the bottom row.

00:00:46.630 --> 00:00:49.180


So let's say we want to make a bet on the outcomes.

00:00:49.180 --> 00:00:53.375


So we want to try to figure out what is the most likely scenario here.

00:00:53.375 --> 00:00:56.880


And for that we'll assume they're independent events.

00:00:56.880 --> 00:00:59.870


In this case, the most likely scenario is just

00:00:59.870 --> 00:01:03.440


obtained by picking the largest probability in each column.

00:01:03.440 --> 00:01:06.875


So for the first door is more likely to have a gift than not have a gift.

00:01:06.875 --> 00:01:09.230


So we'll say there's a gift behind the first door.

00:01:09.230 --> 00:01:12.680


For the second door, it's also more likely that there's a gift.

00:01:12.680 --> 00:01:14.995


So we'll say there's a gift behind the second door.

00:01:14.995 --> 00:01:18.060


And for the third door it's much more likely that there's no gift,

00:01:18.060 --> 00:01:21.015


so we'll say there's no gift behind the third door.

00:01:21.015 --> 00:01:22.700


And as the events are independent,

00:01:22.700 --> 00:01:24.810


the probability for this whole arrangement is

00:01:24.810 --> 00:01:27.995


the product of the three probabilities which is 0.8,

00:01:27.995 --> 00:01:31.096


times 0.7, times 0.9,

00:01:31.096 --> 00:01:33.446


which ends up being 0.504,

00:01:33.446 --> 00:01:36.665


which is roughly 50 percent.

00:01:36.665 --> 00:01:39.680


So let's look at all the possible scenarios in the table.

00:01:39.680 --> 00:01:43.085


Here's a table with all the possible scenarios for each door

00:01:43.085 --> 00:01:46.940


and there are eight scenarios since each door gives us two possibilities each,

00:01:46.940 --> 00:01:48.815


and there are three doors.

00:01:48.815 --> 00:01:51.545


So we do as before to obtain the probability of

00:01:51.545 --> 00:01:57.245


each arrangement by multiplying the three independent probabilities to get these numbers.

00:01:57.245 --> 00:01:59.590


You can check that these numbers add to one.

00:01:59.590 --> 00:02:02.570


And from last video we learned that the negative

00:02:02.570 --> 00:02:05.955


of the logarithm of the probabilities is the cross-entropy.

00:02:05.955 --> 00:02:09.050


So let's go ahead and calculate the cross-entropy.

00:02:09.050 --> 00:02:12.080


And notice that the events with high probability have

00:02:12.080 --> 00:02:16.345


low cross-entropy and the events with low probability have high cross-entropy.

00:02:16.345 --> 00:02:19.130


For example, the second row which has probability of

00:02:19.130 --> 00:02:24.440


0.504 gives a small cross-entropy of 0.69,

00:02:24.440 --> 00:02:28.675


and the second to last row which is very very unlikely has a probability of

00:02:28.675 --> 00:02:34.441


0.006 gives a cross entropy a 5.12.

00:02:34.441 --> 00:02:37.573


So let's actually calculate a formula for the cross-entropy.

00:02:37.573 --> 00:02:39.215


Here we have our three doors,

00:02:39.215 --> 00:02:44.180


and our sample scenario said that there is a gift behind the first and second doors,

00:02:44.180 --> 00:02:46.445


and no gift behind the third door.

00:02:46.445 --> 00:02:49.370


Recall that the probabilities of these events happening

00:02:49.370 --> 00:02:52.189


are 0.8 for a gift behind the first door,

00:02:52.189 --> 00:02:54.665


0.7 for a gift behind the second door,

00:02:54.665 --> 00:02:57.915


and 0.9 for no gift behind the third door.

00:02:57.915 --> 00:02:59.510


So when we calculate the cross-entropy,

00:02:59.510 --> 00:03:03.622


we get the negative of the logarithm of the product,

00:03:03.622 --> 00:03:08.015


which is a sum of the negatives of the logarithms of the factors,

00:03:08.015 --> 00:03:14.070


which is negative logarithm of 0.8 minus logarithm of 0.7 minus logarithm 0.9.

00:03:14.070 --> 00:03:17.225


And in order to drive the formula we'll have some variables.

00:03:17.225 --> 00:03:20.885


So let's call P1 the probability that there's a gift behind the first door,

00:03:20.885 --> 00:03:24.110


P2 the probability there's a gift behind the second door,

00:03:24.110 --> 00:03:27.940


and P3 the probability there's a gift behind the third door.

00:03:27.940 --> 00:03:30.580


So this 0.8 here is P1,

00:03:30.580 --> 00:03:32.580


this 0.7 here is P2,

00:03:32.580 --> 00:03:35.370


and this 0.9 here is one minus P3.

00:03:35.370 --> 00:03:36.990


So the probability of there not being

00:03:36.990 --> 00:03:41.460


a gift is one minus the probability of there being a gift.

00:03:41.460 --> 00:03:43.785


Let's have another variable called Yi,

00:03:43.785 --> 00:03:46.980


which will be one if there's a present behind the ith door,

00:03:46.980 --> 00:03:49.750


and zero if there's no present.

00:03:49.750 --> 00:03:53.470


So Yi is technically a number of presents behind the ith door.

00:03:53.470 --> 00:03:55.442


In this case Y1 equals one,

00:03:55.442 --> 00:04:00.210


Y2 equals one, and Y3 equals zero.

00:04:00.210 --> 00:04:02.550


So we can put all this together and derive a formula

00:04:02.550 --> 00:04:05.355


for the cross-entropy and it's this sum.

00:04:05.355 --> 00:04:08.305


Now let's look at the formula inside the summation.

00:04:08.305 --> 00:04:12.155


Noted that if there is a present behind the ith door,

00:04:12.155 --> 00:04:14.300


then Yi equals one.

00:04:14.300 --> 00:04:17.180


So the first term is logarithm of the Pi.

00:04:17.180 --> 00:04:19.795


And the second term is zero.

00:04:19.795 --> 00:04:24.285


Likewise, if there is no present behind the ith door,

00:04:24.285 --> 00:04:26.355


then Yi is zero.

00:04:26.355 --> 00:04:28.355


So this first term is zero.

00:04:28.355 --> 00:04:32.655


And this term is precisely logarithm of one minus Pi.

00:04:32.655 --> 00:04:35.785


Therefore, this formula really encompasses the sums of the

00:04:35.785 --> 00:04:39.935


negative of logarithms which is precisely the cross-entropy.

00:04:39.935 --> 00:04:45.640


So the cross-entropy really tells us when two vectors are similar or different.

00:04:45.640 --> 00:04:52.170


For example, if you calculate the cross entropy of the pair one one zero,

00:04:52.170 --> 00:04:53.485


and 0.8, 0.7, 0.1, we get 0.69.

00:04:53.485 --> 00:05:00.500


And that is low because one one zero is a similar vector to 0.8, 0.7, 0.1.

00:05:00.500 --> 00:05:05.510


Which means that the arrangement of gifts given by the first set of

00:05:05.510 --> 00:05:08.270


numbers is likely to happen based

00:05:08.270 --> 00:05:11.715


on the probabilities given by the second set of numbers.

00:05:11.715 --> 00:05:17.107


But on the other hand if we calculate the cross-entropy of the pairs zero zero one,

00:05:17.107 --> 00:05:19.559


and 0.8, 0.7, 0.1,

00:05:19.559 --> 00:05:23.210


that is 5.12 which is very high.

00:05:23.210 --> 00:05:27.380


This is because the arrangement of gifts being given by the first set of numbers is

00:05:27.380 --> 00:05:32.030


very unlikely to happen from the probabilities given by the second set of numbers.
Start Quiz:
cross_entropy.py solution.py
import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    pass

import numpy as np

def cross_entropy(Y, P):
    Y = np.float_(Y)
    P = np.float_(P)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
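The solution above reproduces the numbers from the gift example in the video. (This sketch uses np.asarray for the float conversion, an equivalent alternative to np.float_, which is removed in newer NumPy versions.)

```python
import numpy as np

def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

# Similar vectors -> low cross-entropy; dissimilar -> high, as in the video.
print(round(cross_entropy([1, 1, 0], [0.8, 0.7, 0.1]), 2))  # 0.69
print(round(cross_entropy([0, 0, 1], [0.8, 0.7, 0.1]), 2))  # 5.12
```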

00:00:00.000 --> 00:00:05.220


Now that was when we had two classes namely receiving a gift or not receiving a gift.

00:00:05.220 --> 00:00:09.285


What happens if we have more classes? Let's take a look.

00:00:09.285 --> 00:00:10.790


So we have a similar problem.

00:00:10.790 --> 00:00:12.220


We still have three doors.

00:00:12.220 --> 00:00:14.940


And this problem is still not the Monty Hall problem.

00:00:14.940 --> 00:00:16.860


Behind each door there can be an animal,

00:00:16.860 --> 00:00:19.140


and the animal can be of three types.

00:00:19.140 --> 00:00:21.555


It can be a duck, it can be a beaver,

00:00:21.555 --> 00:00:23.575


or it can be a walrus.

00:00:23.575 --> 00:00:26.095


So let's look at this table of probabilities.

00:00:26.095 --> 00:00:28.485


According to the first column on the table,

00:00:28.485 --> 00:00:30.135


behind the first door,

00:00:30.135 --> 00:00:33.380


the probability of finding a duck is 0.7,

00:00:33.380 --> 00:00:35.800


the probability of finding a beaver is 0.2,

00:00:35.800 --> 00:00:39.080


and the probability of finding a walrus is 0.1.

00:00:39.080 --> 00:00:42.150


Notice that the numbers in each column need to add to

00:00:42.150 --> 00:00:45.745


one because there is some animal behind door one.

00:00:45.745 --> 00:00:50.030


The numbers in the rows do not need to add to one as you can see.

00:00:50.030 --> 00:00:53.825


It could easily be that we have a duck behind every door and that's okay.

00:00:53.825 --> 00:00:55.590


So let's look at a sample scenario.

00:00:55.590 --> 00:00:57.225


Let's say we have our three doors,

00:00:57.225 --> 00:00:59.775


and behind the first door, there's a duck,

00:00:59.775 --> 00:01:02.040


behind the second door there's a walrus,

00:01:02.040 --> 00:01:04.805


and behind the third door there's also a walrus.

00:01:04.805 --> 00:01:07.895


Recall that the probabilities are again by the table.

00:01:07.895 --> 00:01:11.555


So a duck behind the first door is 0.7 likely,

00:01:11.555 --> 00:01:14.870


a walrus behind the second door is 0.3 likely,

00:01:14.870 --> 00:01:18.925


and a walrus behind the third door is 0.4 likely.

00:01:18.925 --> 00:01:21.930


So the probability of obtaining this three animals is the product of

00:01:21.930 --> 00:01:25.470


the probabilities of the three events since they are independent events,

00:01:25.470 --> 00:01:27.900


which in this case it's 0.084.

00:01:27.900 --> 00:01:30.285


And as we learned,

00:01:30.285 --> 00:01:33.000


that cross entropy here is given by

00:01:33.000 --> 00:01:37.065


the sums of the negatives of the logarithms of the probabilities.

00:01:37.065 --> 00:01:40.720


So the first one is negative logarithm of 0.7.

00:01:40.720 --> 00:01:43.710


The second one is negative logarithm of 0.3.

00:01:43.710 --> 00:01:46.740


And the third one is negative logarithm of 0.4.

00:01:46.740 --> 00:01:52.255


The cross-entropy is the sum of these three, which is actually 2.48.

00:01:52.255 --> 00:01:55.490


But we want a formula, so let's put some variables here.

00:01:55.490 --> 00:02:00.187


So P11 is the probability of finding a duck behind door one.

00:02:00.187 --> 00:02:04.535


P12 is the probability of finding a duck behind door two etc.

00:02:04.535 --> 00:02:09.260


And let's have the indicator variables: Y1j be 1 if there's

00:02:09.260 --> 00:02:14.790


a duck behind door j, Y2j be 1 if there's a beaver behind door j,

00:02:14.790 --> 00:02:19.285


and Y3j be 1 if there's a walrus behind door j.

00:02:19.285 --> 00:02:21.935


And these variables are zero otherwise.

00:02:21.935 --> 00:02:24.210


And so, the formula for the cross entropy is

00:02:24.210 --> 00:02:27.445


simply the negative of the summation from i equals one to n,

00:02:27.445 --> 00:02:35.630


of the summation from j equals one to m of Yij times the logarithm of Pij.

00:02:35.630 --> 00:02:39.150


In this case, m is the number of classes.

00:02:39.150 --> 00:02:42.330


This formula works because Yij being zero one,

00:02:42.330 --> 00:02:45.135


makes sure that we're only adding the logarithms

00:02:45.135 --> 00:02:48.555


of the probabilities of the events that actually have occurred.

00:02:48.555 --> 00:02:53.760


And voila, this is the formula for the cross-entropy for multiple classes.

00:02:53.760 --> 00:02:55.080


Now I'm going to leave you with this question.

00:02:55.080 --> 00:03:00.085


Given that we have a formula for cross entropy for two classes and one for m classes.

00:03:00.085 --> 00:03:04.240


These formulas look different but are they the same for m equals two?

00:03:04.240 --> 00:03:05.565


Obviously the answer is yes,

00:03:05.565 --> 00:03:07.950


but it's a cool exercise to actually write them down and

00:03:07.950 --> 00:03:11.000


convince yourself that they are actually the same.
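As a sanity check, the multiclass cross-entropy above can be computed directly. In the sketch below, the first column of probabilities and the two walrus entries are the ones from the video; the remaining table entries are made up so that each column sums to one.

```python
import numpy as np

def cross_entropy(Y, P):
    """Multiclass cross-entropy: the negative sum over classes i and
    doors j of Y[i][j] * ln(P[i][j])."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(-np.sum(Y * np.log(P)))

# Rows are classes (duck, beaver, walrus), columns are doors.
P = [[0.7, 0.3, 0.1],
     [0.2, 0.4, 0.5],
     [0.1, 0.3, 0.4]]
# Outcome from the video: duck behind door 1, walrus behind doors 2 and 3.
Y = [[1, 0, 0],
     [0, 0, 0],
     [0, 1, 1]]
print(round(cross_entropy(Y, P), 2))  # 2.48
```

The one-hot matrix Y zeroes out every term except the logarithms of the probabilities of the events that actually occurred, exactly as the formula requires.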
23. Logistic Regression
Logistic Regression
Now, we're finally ready for one of the most popular and useful algorithms in Machine
Learning, and the building block of all that constitutes Deep Learning. The Logistic
Regression Algorithm. And it basically goes like this:

 Take your data


 Pick a random model
 Calculate the error
 Minimize the error, and obtain a better model
 Enjoy!

Calculating the Error Function

Let's dive into the details. The next video will show you how to calculate an error function.
Important
WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:03.632


So this is a good time for a quick recap of the last couple of lessons.

00:00:03.632 --> 00:00:05.640


Here we have two models.

00:00:05.639 --> 00:00:08.934


The bad model on the left and the good model on the right.

00:00:08.935 --> 00:00:13.440


And for each one of those we calculate the cross entropy which is the sum of

00:00:13.439 --> 00:00:19.259


the negatives of the logarithms of the probabilities of the points being their colors.

00:00:19.260 --> 00:00:22.170


And we conclude that the one on the right is better

00:00:22.170 --> 00:00:25.860


because its cross-entropy is much smaller.
00:00:25.859 --> 00:00:29.269
So let's actually calculate the formula for the error function.

00:00:29.269 --> 00:00:31.559


Let's split into two cases.

00:00:31.559 --> 00:00:34.269


The first case being when y=1.

00:00:34.270 --> 00:00:36.130


So when the point is blue to begin with,

00:00:36.130 --> 00:00:42.480


the model tells us that the probability of being blue is the prediction y_hat.

00:00:42.479 --> 00:00:47.849


So for these two points the probabilities are 0.6 and 0.2.

00:00:47.850 --> 00:00:50.910


As we can see the point in the blue area has

00:00:50.909 --> 00:00:55.000


more probability of being blue than the point in the red area.

00:00:55.000 --> 00:01:00.500


And our error is simply the negative logarithm of this probability.

00:01:00.500 --> 00:01:04.010


So it's precisely minus logarithm of y_hat.

00:01:04.010 --> 00:01:09.665


In the figure it's minus logarithm of 0.6. and minus logarithm of 0.2.

00:01:09.665 --> 00:01:13.745


Now if y=0, so when the point is red,

00:01:13.745 --> 00:01:17.585


then we need to calculate the probability of the point being red.

00:01:17.584 --> 00:01:22.339


The probability of the point being red is one minus the probability of the point being

00:01:22.340 --> 00:01:27.750


blue which is precisely 1 minus the prediction y_hat.

00:01:27.750 --> 00:01:30.890


So the error is precisely the negative logarithm of
00:01:30.890 --> 00:01:35.870
this probability which is negative logarithm of 1 - y_hat.

00:01:35.870 --> 00:01:42.040


In this case we get negative logarithm 0.1 and negative logarithm 0.7.

00:01:42.040 --> 00:01:46.605


So we conclude that the error is a negative logarithm of y_hat if the point is blue.

00:01:46.605 --> 00:01:50.635


And negative logarithm of 1 - y_hat if the point is red.

00:01:50.635 --> 00:01:53.625


We can summarize these two formulas into this one.

00:01:53.625 --> 00:02:02.159


Error = -(1 - y) ln(1 - y_hat) - y ln(y_hat).

00:02:02.159 --> 00:02:03.759


Why does this formula work?

00:02:03.760 --> 00:02:05.730


Well because if the point is blue,

00:02:05.730 --> 00:02:10.664


then y=1 which means 1-y=0 which makes the first term

00:02:10.664 --> 00:02:16.495


0, and the second term is simply the negative logarithm of y_hat.

00:02:16.495 --> 00:02:20.219


Similarly, if the point is red then y=0.

00:02:20.219 --> 00:02:27.680


So the second term of the formula is 0 and the first one is the negative logarithm of 1 - y_hat.

00:02:27.680 --> 00:02:31.145


Now the formula for the error function is simply the sum over

00:02:31.145 --> 00:02:35.510


all the error functions of points which is precisely the summation here.

00:02:35.509 --> 00:02:38.564


That's going to be this 4.8 we have over here.

00:02:38.564 --> 00:02:41.469


Now by convention we'll actually consider the average,
00:02:41.469 --> 00:02:45.330
not the sum which is where we are dividing by n over here.

00:02:45.330 --> 00:02:49.050


This will turn the 4.8 into a 1.2.

00:02:49.050 --> 00:02:53.330


From now on we'll use this formula as our error function.

00:02:53.330 --> 00:02:58.860


And now since y_hat is given by the sigmoid of the linear function wx + b,

00:02:58.860 --> 00:03:01.890


then the total formula for the error is actually in terms

00:03:01.889 --> 00:03:05.094


of w and b which are the weights of the model.

00:03:05.094 --> 00:03:08.219


And it's simply the summation we see here.

00:03:08.219 --> 00:03:14.449


In this case, y_i is just the label of the point x^i.

00:03:14.449 --> 00:03:17.364


So now that we've calculated it our goal is to minimize it.

00:03:17.365 --> 00:03:18.975


And that's what we'll do next.

00:03:18.974 --> 00:03:20.293


And just a small aside,

00:03:20.294 --> 00:03:23.210


what we did is for binary classification problems.

00:03:23.210 --> 00:03:25.670


If we have a multiclass classification problem then

00:03:25.669 --> 00:03:28.490


the error is now given by the multiclass entropy.

00:03:28.491 --> 00:03:33.380


This formula is given here where for every data point we take the product

00:03:33.379 --> 00:03:39.139


of the label times the logarithm of the prediction and then we average all these values.
00:03:39.139 --> 00:03:41.539
And again it's a nice exercise to convince yourself that

00:03:41.539 --> 00:03:45.000


the two are the same when there are just two classes.
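The error function derived in this video can be checked numerically; the probabilities below are the four points from the figure (blue points with y_hat = 0.6 and 0.2, red points with blue-probabilities 0.9 and 0.3, i.e. red-probabilities 0.1 and 0.7).

```python
import numpy as np

def error(y, y_hat):
    """Binary cross-entropy for one point:
    -y * ln(y_hat) - (1 - y) * ln(1 - y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

labels = [1, 1, 0, 0]               # two blue points, two red points
predictions = [0.6, 0.2, 0.9, 0.3]  # predicted probability of blue
errors = [error(y, p) for y, p in zip(labels, predictions)]
total = sum(errors)       # ~4.8, the sum from the video
average = total / 4       # ~1.2, the averaged error function
print(round(total, 1), round(average, 1))
```

Note how the formula switches automatically: for y = 1 it reduces to -ln(y_hat), for y = 0 to -ln(1 - y_hat).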
24. Gradient Descent
Gradient Descent
In this lesson, we'll learn the principles and the math behind the gradient descent algorithm
So, a small gradient means we'll change our coordinates by a little bit, and a large
gradient means we'll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We'll see
it in a bit.
00:00:00.000 --> 00:00:02.580
And now we finally have the tools to write

00:00:02.580 --> 00:00:05.120


the pseudocode for the gradient descent algorithm,

00:00:05.120 --> 00:00:06.830


and it goes like this.

00:00:06.830 --> 00:00:15.170


Step one, start with random weights w_one up to w_n and b which will give us a line,

00:00:15.170 --> 00:00:19.270


and not just a line, but the whole probability function given by sigmoid of w x plus b.

00:00:19.270 --> 00:00:22.820


Now for every point we'll calculate the error,

00:00:22.820 --> 00:00:25.150


and as we can see the error is high for

00:00:25.150 --> 00:00:29.230


misclassified points and small for correctly classified points.
00:00:29.230 --> 00:00:32.545
Now for every point with coordinates x_one up to x_n,

00:00:32.545 --> 00:00:36.845


we update w_i by adding the learning rate

00:00:36.845 --> 00:00:42.950


alpha times the partial derivative of the error function with respect to w_i.

00:00:42.950 --> 00:00:45.120


We also update b by adding alpha times

00:00:45.120 --> 00:00:48.440


the partial derivative of the error function with respect to b.

00:00:48.440 --> 00:00:49.920


This gives us new weights,

00:00:49.920 --> 00:00:52.610


w_i_prime and then new bias b_prime.

00:00:52.610 --> 00:00:55.330


Now we've already calculated these partial derivatives and we

00:00:55.330 --> 00:00:58.605


know that they are y_hat minus y times

00:00:58.605 --> 00:01:01.295


x_i for the derivative with respect to w_i

00:01:01.295 --> 00:01:05.215


and y_hat minus y for the derivative with respect to b.

00:01:05.215 --> 00:01:08.840


So that's how we'll update the weights.

00:01:08.840 --> 00:01:13.350


Now repeat this process until the error is small,

00:01:13.350 --> 00:01:15.765


or we can repeat it a fixed number of times.

00:01:15.765 --> 00:01:18.840


The number of times is called the number of epochs, and we'll learn about it later.

00:01:18.840 --> 00:01:20.100


Now this looks familiar,
00:01:20.100 --> 00:01:21.935
have we seen something like that before?

00:01:21.935 --> 00:01:24.300


Well, we look at the points and what each point is doing is

00:01:24.300 --> 00:01:26.640


it's adding a multiple of itself into the weights of

00:01:26.640 --> 00:01:31.640


the line in order to get the line to move closer towards it if it's misclassified.

00:01:31.640 --> 00:01:34.435


That's pretty much what the Perceptron algorithm is doing.

00:01:34.435 --> 00:01:36.000


So in the next video, we'll look at

00:01:36.000 --> 00:01:39.000


the similarities because it's a bit suspicious how similar they are.
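As a rough sketch, the pseudocode above might look like this in Python. The toy dataset, seed, and hyperparameters are made up for illustration; the update is written as w_i plus alpha times (y - y_hat) times x_i, i.e. a step that moves against the gradient of the error.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(X, y, learn_rate=0.1, epochs=100):
    """Gradient descent as in the pseudocode: start with random
    weights, then for every point update
    w_i <- w_i + alpha * (y - y_hat) * x_i and
    b <- b + alpha * (y - y_hat), for a fixed number of epochs."""
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.1, size=X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = sigmoid(np.dot(x_i, weights) + bias)
            weights = weights + learn_rate * (y_i - y_hat) * x_i
            bias = bias + learn_rate * (y_i - y_hat)
    return weights, bias

# Toy linearly separable data: class 1 sits at larger coordinates.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
w, b = train(X, y)
predictions = sigmoid(X @ w + b)  # low for class 0, high for class 1
```

After enough epochs the line settles so that points of each class end up on their own side of the boundary.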
26. Pre-Lab: Gradient Descent
Implementing Gradient Descent
In the following lab, you'll be able to implement the gradient descent algorithm on the
following sample dataset with two classes.
Workspace
To open this notebook, you have two options:

 Go to the next page in the classroom (recommended)


 Clone the repo from Github and open the notebook GradientDescent.ipynb in
the gradient-descent folder. You can either download the repository with  git clone
https://github.com/udacity/deep-learning.git , or download it as an archive file
from this link.

Instructions
In this notebook, you'll be implementing the functions that build the gradient descent
algorithm, namely:

 sigmoid : The sigmoid activation function.


 output_formula : The formula for the prediction.
 error_formula : The formula for the error at a point.
 update_weights : The function that updates the parameters with one gradient descent
step.

When you implement them, run the  train  function and this will graph several of
the lines that are drawn in successive gradient descent steps. It will also graph the
error function, and you can see it decreasing as the number of epochs grows.

This is a self-assessed lab. If you need any help or want to check your answers, feel
free to check out the solutions notebook in the same folder, or by clicking here.
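A minimal sketch of the four functions the lab asks for is below. The exact signatures in the notebook may differ; these are assumptions based on the descriptions above.

```python
import numpy as np

def sigmoid(x):
    """The sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def output_formula(features, weights, bias):
    """The prediction: y_hat = sigmoid(w . x + b)."""
    return sigmoid(np.dot(features, weights) + bias)

def error_formula(y, output):
    """Cross-entropy error at a single point."""
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

def update_weights(x, y, weights, bias, learnrate):
    """One gradient descent step on a single point."""
    output = output_formula(x, weights, bias)
    d_error = y - output
    weights = weights + learnrate * d_error * x
    bias = bias + learnrate * d_error
    return weights, bias

# Quick check on a single point:
x, w, b = np.array([0.5, -0.2]), np.array([1.0, 1.0]), 0.0
print(output_formula(x, w, b))  # sigmoid(0.3), about 0.574
```

Check your implementations against the solutions notebook before moving on.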
27. Notebook: Gradient Descent
Workspace

This section contains a workspace (a Jupyter Notebook workspace, an online code editor workspace, etc.), which cannot be automatically downloaded or reproduced here.
Please access the classroom with your account and manually download the workspace to your
local machine. Note that for some courses, Udacity uploads the workspace files
onto https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a

28. Perceptron vs Gradient Descent


Gradient Descent Vs Perceptron Algorithm
00:00:00.000 --> 00:00:03.990
So let's compare the Perceptron algorithm and the Gradient Descent algorithm.

00:00:03.990 --> 00:00:05.845


In the Gradient Descent algorithm,

00:00:05.845 --> 00:00:09.535


we take the weights and change them from Wi to

00:00:09.535 --> 00:00:13.915


Wi plus alpha times Y-hat minus Y times Xi.

00:00:13.915 --> 00:00:15.325


In the Perceptron algorithm,

00:00:15.325 --> 00:00:17.253


not every point changes weights,

00:00:17.253 --> 00:00:18.960


only the misclassified ones.

00:00:18.960 --> 00:00:21.385


Here, if X is misclassified,

00:00:21.385 --> 00:00:27.525


we'll change the weights by adding Xi to Wi if the point label is positive,
00:00:27.525 --> 00:00:29.785
and subtracting if negative.

00:00:29.785 --> 00:00:32.327


Now the question is, are these two things the same?

00:00:32.327 --> 00:00:34.920


Well, let's remember that in that Perceptron algorithm,

00:00:34.920 --> 00:00:37.350


the labels are one and zero.

00:00:37.350 --> 00:00:40.320


And the predictions Y-hat are also one and zero.

00:00:40.320 --> 00:00:43.060


So, if the point is correctly classified,

00:00:43.060 --> 00:00:48.440


then Y minus Y-hat is zero because Y is equal to Y-hat.

00:00:48.440 --> 00:00:50.205


Now, if the point is labeled blue,

00:00:50.205 --> 00:00:52.095


then Y equals one.

00:00:52.095 --> 00:00:53.220


And if it's misclassified,

00:00:53.220 --> 00:00:55.950


then the prediction must be Y-hat equals zero.

00:00:55.950 --> 00:00:59.265


So Y-hat minus Y is minus one.

00:00:59.265 --> 00:01:01.050


Similarly, with the points labeled red,

00:01:01.050 --> 00:01:04.105


then Y equals zero and Y-hat equals one.

00:01:04.105 --> 00:01:06.180


So, Y-hat minus Y equals one.

00:01:06.180 --> 00:01:08.300


This may not be super clear right away.
00:01:08.300 --> 00:01:10.035
But if you stare at the screen for long enough,

00:01:10.035 --> 00:01:13.620


you'll realize that the right and the left are exactly the same thing.

00:01:13.620 --> 00:01:15.175


The only difference is that in the left,

00:01:15.175 --> 00:01:17.776


Y-hat can take any number between zero and one,

00:01:17.776 --> 00:01:19.650


whereas in the right,

00:01:19.650 --> 00:01:23.305


Y-hat can take only the values zero or one.

00:01:23.305 --> 00:01:25.175


It's pretty fascinating, isn't it?

00:01:25.175 --> 00:01:28.055


But let's study Gradient Descent even more carefully.

00:01:28.055 --> 00:01:31.680


Both in the Perceptron algorithm and the Gradient Descent algorithm,

00:01:31.680 --> 00:01:36.570


a point that is misclassified tells a line to come closer because eventually,

00:01:36.570 --> 00:01:40.770


it wants the line to surpass it so it can be on the correct side.

00:01:40.770 --> 00:01:43.734


Now, what happens if the point is correctly classified?

00:01:43.734 --> 00:01:47.315


Well, the Perceptron algorithm says do absolutely nothing.

00:01:47.315 --> 00:01:49.575


In the Gradient Descent algorithm,

00:01:49.575 --> 00:01:51.195


you are changing the weights.

00:01:51.195 --> 00:01:52.830


But what is it doing?
00:01:52.830 --> 00:01:54.480
Well, if we look carefully,

00:01:54.480 --> 00:01:56.640


what the point is telling the line,

00:01:56.640 --> 00:01:58.875


is to go farther away.

00:01:58.875 --> 00:02:01.120


And this makes sense, right?

00:02:01.120 --> 00:02:03.180


Because if you're correctly classified,

00:02:03.180 --> 00:02:05.895


say, if you're a blue point in the blue region,

00:02:05.895 --> 00:02:08.385


you'd like to be even more into the blue region,

00:02:08.385 --> 00:02:10.740


so your prediction is even closer to one,

00:02:10.740 --> 00:02:13.060


and your error is even smaller.

00:02:13.060 --> 00:02:16.320


Similarly, for a red point in the red region.

00:02:16.320 --> 00:02:19.590


So it makes sense that the point tells the line to go farther away.

00:02:19.590 --> 00:02:22.925


And that's precisely what the Gradient Descent algorithm does.

00:02:22.925 --> 00:02:26.540


The misclassified points ask the line to come closer and

00:02:26.540 --> 00:02:30.315


the correctly classified points ask the line to go farther away.

00:02:30.315 --> 00:02:33.240


The line listens to all the points and takes steps in

00:02:33.240 --> 00:02:37.000


such a way that it eventually arrives at a pretty good solution.
In the video at the 0:12 mark, the instructor said y hat minus y. It should be y minus y hat instead, as stated on the slide.
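The equivalence can be checked with a tiny sketch (the function names are illustrative): when labels and hard predictions are both restricted to 0 and 1, the gradient descent update alpha times (y - y_hat) times x reduces exactly to the perceptron rule.

```python
import numpy as np

def gd_update(w, x, y, y_hat, alpha):
    """Gradient descent: w <- w + alpha * (y - y_hat) * x."""
    return w + alpha * (y - y_hat) * x

def perceptron_update(w, x, y, y_hat, alpha):
    """Perceptron: only misclassified points change the weights,
    adding alpha * x for positive labels, subtracting for negative."""
    if y_hat == y:
        return w  # correctly classified: do nothing
    return w + alpha * x if y == 1 else w - alpha * x

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.3])

# Misclassified blue point: y = 1, hard prediction y_hat = 0.
a = gd_update(w, x, y=1, y_hat=0, alpha=0.1)
b = perceptron_update(w, x, y=1, y_hat=0, alpha=0.1)
# Both add alpha * x, so a == b; for a correctly classified point
# (y == y_hat) both rules leave the weights unchanged.
```

The only real difference, as the video says, is that in gradient descent y_hat can be any number between zero and one, not just zero or one.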

29. Continuous Perceptrons


Continuous Perceptrons
00:00:00.000 --> 00:00:04.139

So, this is just a small recap video that will get us ready for what's coming.

00:00:04.139 --> 00:00:06.809

Recall that if we have our data in the form of these points over

00:00:06.809 --> 00:00:10.710

here and the linear model like this one, for example,

00:00:10.710 --> 00:00:14.865

with equation 2x1 + 7x2 - 4 = 0,

00:00:14.865 --> 00:00:19.495

this will give rise to a probability function that looks like this.

00:00:19.495 --> 00:00:23.760

Where the points on the blue or positive region have more chance of being

00:00:23.760 --> 00:00:29.570

blue and the points in the red or negative region have more chance of being red.

00:00:29.570 --> 00:00:32.609

And this will give rise to this perceptron where we label

00:00:32.609 --> 00:00:36.314

the edges by the weights and the node by the bias.


00:00:36.314 --> 00:00:37.664

So, what the perceptron does,

00:00:37.664 --> 00:00:39.655

it takes the point (x1, x2),

00:00:39.655 --> 00:00:44.689

plots it in the graph and then it returns a probability that the point is blue.

00:00:44.689 --> 00:00:47.378

In this case, it returns a 0.9

00:00:47.378 --> 00:00:51.200

and this mimics the neurons in the brain because they receive nervous impulses,

00:00:51.200 --> 00:00:54.000

do something inside and return a nervous impulse.

30. Non-linear Data


Non-Linear Data


00:00:00.000 --> 00:00:04.003


Now we've been dealing a lot with data sets that can be separated by a line,
00:00:04.003 --> 00:00:05.543
like this one over here.

00:00:05.543 --> 00:00:08.865


But as you can imagine the real world is much more complex than that.

00:00:08.865 --> 00:00:12.150


This is where neural networks can show their full potential.

00:00:12.150 --> 00:00:14.970


In the next few videos we'll see how to deal with

00:00:14.970 --> 00:00:17.350


more complicated data sets that require

00:00:17.350 --> 00:00:20.109


highly non-linear boundaries such as this one over here.

00:00:00.000 --> 00:00:02.459


So, let's go back to this example of where we saw

00:00:02.459 --> 00:00:04.769


some data that is not linearly separable.

00:00:04.769 --> 00:00:09.740


So a line can not divide these red and blue points and we looked at some solutions,

00:00:09.740 --> 00:00:14.185


and if you remember, the one we considered more seriously was this curve over here.

00:00:14.185 --> 00:00:18.664


So what I'll teach you now is to find this curve and it's very similar to before.
00:00:18.664 --> 00:00:20.519
We'll still use gradient descent.

00:00:20.518 --> 00:00:23.009


In a nutshell, what we're going to do is for

00:00:23.010 --> 00:00:25.769


this data, which is not separable by a line,

00:00:25.768 --> 00:00:30.599


we're going to create a probability function where the points in the blue region are more

00:00:30.600 --> 00:00:36.240


likely to be blue and the points in the red region are more likely to be red.

00:00:36.240 --> 00:00:39.798


And this curve here that separates them is

00:00:39.798 --> 00:00:44.329


a set of points which are equally likely to be blue or red.

00:00:44.329 --> 00:00:47.789


Everything will be the same as before except this equation

00:00:47.789 --> 00:00:52.000


won't be linear and that's where neural networks come into play.
32. Neural Network Architecture
Neural Network Architecture
Ok, so we're ready to put these building blocks together, and build great Neural Networks! (Or
Multi-Layer Perceptrons, however you prefer to call them.)

The first two videos will show us how to combine two perceptrons into a third, more
complicated one.

00:00:00.000 --> 00:00:03.464


Now I'm going to show you how to create these nonlinear models.

00:00:03.464 --> 00:00:06.058


What we're going to do is a very simple trick.

00:00:06.059 --> 00:00:12.060


We're going to combine two linear models into a nonlinear model as follows.

00:00:12.060 --> 00:00:13.769


Visually it looks like this.

00:00:13.769 --> 00:00:17.518


The two models superimposed, creating the model on the right.

00:00:17.518 --> 00:00:20.084


It's almost like we're doing arithmetic on models.

00:00:20.085 --> 00:00:24.160


It's like saying "This line plus this line equals that curve."

00:00:24.160 --> 00:00:26.824


Let me show you how to do this mathematically.
00:00:26.824 --> 00:00:30.750
So a linear model as we know is a whole probability space.

00:00:30.750 --> 00:00:36.478


This means that for every point it gives us the probability of the point being blue.

00:00:36.478 --> 00:00:39.179


So, for example, this point over here is in

00:00:39.179 --> 00:00:43.890


the blue region so its probability of being blue is 0.7.

00:00:43.890 --> 00:00:47.250


The same point given by the second probability space is

00:00:47.250 --> 00:00:52.170


also in the blue region so its probability of being blue is 0.8.

00:00:52.170 --> 00:00:53.353


Now the question is,

00:00:53.353 --> 00:00:55.890


how do we combine these two?

00:00:55.890 --> 00:01:00.225


Well, the simplest way to combine two numbers is to add them, right?

00:01:00.225 --> 00:01:05.409


So 0.8 plus 0.7 is 1.5.

00:01:05.409 --> 00:01:09.890


But now, this doesn't look like a probability anymore since it's bigger than one.

00:01:09.890 --> 00:01:15.915


And probabilities need to be between 0 and 1. So what can we do?

00:01:15.915 --> 00:01:20.980


How do we turn this number that is larger than 1 into something between 0 and 1?

00:01:20.980 --> 00:01:24.079


Well, we've been in this situation before and we have a pretty good tool that

00:01:24.078 --> 00:01:27.744


turns every number into something between 0 and 1.

00:01:27.745 --> 00:01:30.234


That's just a sigmoid function.
00:01:30.233 --> 00:01:32.780
So that's what we're going to do.

00:01:32.780 --> 00:01:36.858


We apply the sigmoid function to 1.5 to get the value

00:01:36.858 --> 00:01:40.188


0.82 and that's the probability of

00:01:40.188 --> 00:01:44.568


this point being blue in the resulting probability space.

00:01:44.569 --> 00:01:47.299


So now we've managed to create a probability function for

00:01:47.299 --> 00:01:51.243


every single point in the plane and that's how we combined two models.

00:01:51.243 --> 00:01:54.093


We calculate the probability for one of them,

00:01:54.093 --> 00:01:56.140


the probability for the other,

00:01:56.140 --> 00:01:59.334


then add them and then we apply the sigmoid function.

00:01:59.334 --> 00:02:01.340


Now, what if we wanted to weight this sum?

00:02:01.340 --> 00:02:04.370


What, if say, we wanted the model in the top to have

00:02:04.370 --> 00:02:07.849


more of a say in the resulting probability than the second?

00:02:07.849 --> 00:02:11.569


So something like this where the resulting model looks a lot more like the one in

00:02:11.568 --> 00:02:15.698


the top than like the one in the bottom. Well, we can add weights.

00:02:15.699 --> 00:02:22.355


For example, we can say "I want seven times the first model plus the second one."

00:02:22.354 --> 00:02:24.240


Actually, I can pick whatever weights I want.
00:02:24.241 --> 00:02:29.574
For example, I can say "Seven times the first one plus five times the second one."

00:02:29.574 --> 00:02:34.335


And the way I combine the models is I take the first probability,

00:02:34.335 --> 00:02:36.789


multiply it by seven,

00:02:36.788 --> 00:02:43.293


then take the second one and multiply it by five and I can even add a bias if I want.

00:02:43.294 --> 00:02:45.526


Say, the bias is minus 6,

00:02:45.526 --> 00:02:48.020


then we add it to the whole equation.

00:02:48.020 --> 00:02:52.735


So we'll have seven times this plus five times this minus six,

00:02:52.735 --> 00:02:54.914


which gives us 2.9.

00:02:54.913 --> 00:03:00.679


We then apply the sigmoid function and that gives us 0.95.

00:03:00.680 --> 00:03:02.680


So it's almost like we had before, isn't it?

00:03:02.680 --> 00:03:06.085


Before we had a line that is a linear combination

00:03:06.085 --> 00:03:10.240


of the input values times the weight plus a bias.

00:03:10.240 --> 00:03:13.300


Now we have that this model is a linear combination of

00:03:13.300 --> 00:03:17.650


the two previous models times the weights plus some bias.

00:03:17.650 --> 00:03:18.905


So it's almost the same thing.

00:03:18.905 --> 00:03:21.599


It's almost like this curved model in the right.
00:03:21.599 --> 00:03:25.818
It's a linear combination of the two linear models before

00:03:25.818 --> 00:03:30.573


or we can even think of it as the line between the two models.

00:03:30.574 --> 00:03:32.069


This is no coincidence.

00:03:32.068 --> 00:03:35.435


This is at the heart of how neural networks get built.

00:03:35.435 --> 00:03:38.628


Of course, we can imagine that we can keep doing this always obtaining

00:03:38.628 --> 00:03:43.228


more new complex models out of linear combinations of the existing ones.

00:03:43.229 --> 00:03:47.000


And this is what we're going to do to build our neural networks.
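The steps above — take each model's probability, form a weighted sum plus a bias, then apply the sigmoid — reproduce the numbers from the video:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

p1, p2 = 0.7, 0.8  # the point's blue-probability under each model

# A plain sum gives 1.5, which is not a probability, so squash it:
simple = sigmoid(p1 + p2)        # about 0.82

# Weighted combination: 7 * first + 5 * second - 6 (the bias).
combined = 7 * p1 + 5 * p2 - 6   # = 2.9
weighted = sigmoid(combined)     # about 0.95
print(round(simple, 2), round(weighted, 2))
```

The weighted version is exactly a perceptron whose inputs are the outputs of the two earlier models.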

29 Neural Network Architecture 2


Very important

00:00:00.000 --> 00:00:02.770


So in the previous session we learned that we can

00:00:02.770 --> 00:00:05.679


add two linear models to obtain a third model.

00:00:05.679 --> 00:00:07.734


As a matter of fact, we did even more.
00:00:07.735 --> 00:00:10.505
We can take a linear combination of two models.

00:00:10.505 --> 00:00:13.750


So, the first model times a constant plus the second model times a

00:00:13.750 --> 00:00:17.785


constant plus a bias and that gives us a non-linear model.

00:00:17.785 --> 00:00:22.240


That looks a lot like perceptrons where we can take a value times a constant plus

00:00:22.239 --> 00:00:26.784


another value times a constant plus a bias and get a new value.

00:00:26.785 --> 00:00:28.204


And that's no coincidence.

00:00:28.204 --> 00:00:31.649


That's actually the building block of Neural Networks.

00:00:31.649 --> 00:00:33.210


So, let's look at an example.

00:00:33.210 --> 00:00:40.304


Let's say, we have this linear model where the linear equation is 5x1 minus 2x2 plus 8.

00:00:40.304 --> 00:00:42.689


That's represented by this perceptron.

00:00:42.689 --> 00:00:46.169


And we have another linear model with equations 7x1 minus

00:00:46.170 --> 00:00:52.045


3x2 minus 1 which is represented by this perceptron over here.

00:00:52.045 --> 00:00:55.929


Let's draw them nicely in here and let's use another perceptron

00:00:55.929 --> 00:01:00.070


to combine these two models using the Linear Equation,

00:01:00.070 --> 00:01:06.420


seven times the first model plus five times the second model minus six.

00:01:06.420 --> 00:01:11.170


And now the magic happens when we join these together and we get a Neural Network.
00:01:11.170 --> 00:01:16.480
We clean it up a bit and we obtain this. All the weights are there.

00:01:16.480 --> 00:01:18.670


The weights on the left,

00:01:18.670 --> 00:01:22.445


tell us what equations the linear models have.

00:01:22.444 --> 00:01:25.024


And the weights on the right,

00:01:25.025 --> 00:01:27.160


tell us what the linear combination is of

00:01:27.159 --> 00:01:31.629


the two models to obtain the curve non-linear model in the right.

00:01:31.629 --> 00:01:35.319


So, whenever you see a Neural Network like the one on the left,

00:01:35.319 --> 00:01:40.204


think of what could be the nonlinear boundary defined by the Neural Network.

00:01:40.204 --> 00:01:45.444


Now, note that this was drawn using the notation that puts a bias inside the node.

00:01:45.444 --> 00:01:50.394


This can also be drawn using the notation that keeps the bias as a separate node.

00:01:50.394 --> 00:01:52.939


Here, what we do is, in every layer we have

00:01:52.939 --> 00:01:56.870


a bias unit coming from a node with a one on it.

00:01:56.870 --> 00:01:59.870


So for example, the minus eight on the top node

00:01:59.870 --> 00:02:04.160


becomes an edge labelled minus eight coming from the bias node.

00:02:04.159 --> 00:02:06.119


We can see that this Neural Network uses

00:02:06.120 --> 00:02:09.000


a Sigmoid Activation Function and the Perceptrons.
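Putting the two perceptrons and the combining perceptron together, the forward pass of the network in this video can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x1, x2):
    """Forward pass through the small network from the video."""
    # Hidden layer: the two linear models, each through a sigmoid.
    m1 = sigmoid(5 * x1 - 2 * x2 + 8)   # first model:  5x1 - 2x2 + 8
    m2 = sigmoid(7 * x1 - 3 * x2 - 1)   # second model: 7x1 - 3x2 - 1
    # Output layer: 7 * m1 + 5 * m2 - 6, then another sigmoid.
    return sigmoid(7 * m1 + 5 * m2 - 6)

# Any point (x1, x2) now gets a probability of being blue:
p = forward(1.0, 1.0)
```

The weights on the left edges define the two linear models; the weights on the right edges define their linear combination, just as the diagram shows.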
Multiple layers

Now, not all neural networks look like the one above. They can be way more
complicated! In particular, we can do the following things:

 Add more nodes to the input, hidden, and output layers.


 Add more layers.

We'll see the effects of these changes in the next video.



00:00:00.000 --> 00:00:04.495


Neural networks have a certain special architecture with layers.

00:00:04.495 --> 00:00:07.320


The first layer is called the input layer,

00:00:07.320 --> 00:00:08.934


which contains the inputs,

00:00:08.933 --> 00:00:11.931


in this case, x1 and x2.

00:00:11.932 --> 00:00:14.460


The next layer is called the hidden layer,

00:00:14.460 --> 00:00:18.855


which is a set of linear models created with this first input layer.

00:00:18.855 --> 00:00:21.940


And then the final layer is called the output layer,

00:00:21.940 --> 00:00:26.614


where the linear models get combined to obtain a nonlinear model.
00:00:26.614 --> 00:00:28.644
You can have different architectures.

00:00:28.643 --> 00:00:31.764


For example, here's one with a larger hidden layer.

00:00:31.765 --> 00:00:33.689


Now we're combining three linear models to

00:00:33.689 --> 00:00:36.600


obtain the triangular boundary in the output layer.

00:00:36.600 --> 00:00:39.649


Now what happens if the input layer has more nodes?

00:00:39.649 --> 00:00:43.460


For example, this neural network has three nodes in its input layer.

00:00:43.460 --> 00:00:46.435


Well, that just means we're not living in two-dimensional space anymore.

00:00:46.435 --> 00:00:48.755


We're living in three-dimensional space,

00:00:48.755 --> 00:00:50.045


and now our hidden layer,

00:00:50.045 --> 00:00:51.689


the one with the linear models,

00:00:51.689 --> 00:00:54.795


just gives us a bunch of planes in three space,

00:00:54.795 --> 00:00:59.820


and the output layer bounds a nonlinear region in three space.

00:00:59.820 --> 00:01:03.030


In general, if we have n nodes in our input layer,

00:01:03.030 --> 00:01:06.780


then we're thinking of data living in n-dimensional space.

00:01:06.780 --> 00:01:08.983


Now what if our output layer has more nodes?

00:01:08.983 --> 00:01:10.890


Then we just have more outputs.
00:01:10.890 --> 00:01:14.209
In that case, we just have a multiclass classification model.

00:01:14.209 --> 00:01:18.329


So if our model is telling us if an image is a cat or dog or a bird,

00:01:18.328 --> 00:01:20.309


then we simply have each node in

00:01:20.310 --> 00:01:25.140


the output layer output a score for each one of the classes: one for the cat,

00:01:25.140 --> 00:01:27.930


one for the dog, and one for the bird.

00:01:27.930 --> 00:01:31.189


And finally, and here's where things get pretty cool,

00:01:31.188 --> 00:01:33.274


what if we have more layers?

00:01:33.275 --> 00:01:36.090


Then we have what's called a deep neural network.

00:01:36.090 --> 00:01:39.435


Now what happens here is our linear models combine to create

00:01:39.435 --> 00:01:45.364


nonlinear models and then these combine to create even more nonlinear models.

00:01:45.364 --> 00:01:48.150


In general, we can do this many times and obtain

00:01:48.150 --> 00:01:51.329


highly complex models with lots of hidden layers.

00:01:51.328 --> 00:01:54.434


This is where the magic of neural networks happens.

00:01:54.435 --> 00:01:56.406


Many of the models in real life,

00:01:56.406 --> 00:01:59.054


for self-driving cars or for game-playing agents,

00:01:59.055 --> 00:02:01.049


have many, many hidden layers.
00:02:01.049 --> 00:02:02.879
That neural network will just split

00:02:02.879 --> 00:02:07.091


the n-dimensional space with a highly nonlinear boundary,

00:02:07.090 --> 00:02:08.370


such as maybe the one on the right.

Multi-Class Classification

Here we elaborate a bit more on what can be done if our neural network needs to model
data with more than one output.

Multiclass Classification
We briefly mentioned multi-class classification in the last video, but let me be more specific. It seems that neural networks work really well when the problem consists of classifying two classes. For example, if the model predicts the probability of receiving a gift or not, then the answer just comes as the output of the neural network. But what happens if we have more classes? Say we want the model to tell us if an image is a duck, a beaver, or a walrus.

Well, one thing we can do is create a neural network to predict if the image is a duck, then another neural network to predict if the image is a beaver, and a third neural network to predict if the image is a walrus. Then we can just pick the answer that gives us the highest probability. But this seems like overkill, right? The first layers of the neural network should be enough to tell us things about the image, and maybe just the last layer should tell us which animal it is. As a matter of fact, as you'll see in the CNN section, this is exactly the case.

So what we need here is to add more nodes in the output layer, and each one of the nodes will give us the probability that the image is each of the animals. Then we take the scores and apply the softmax function that was previously defined to obtain well-defined probabilities. This is how we get neural networks to do multi-class classification.
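The softmax step at the end can be sketched in NumPy. This is a minimal illustration; the three scores below are made up, standing in for the network's raw outputs for duck, beaver, and walrus:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw class scores into well-defined probabilities."""
    exp_scores = np.exp(scores - np.max(scores))  # shift by the max for numerical stability
    return exp_scores / exp_scores.sum()

# Made-up scores for duck, beaver, walrus
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The probabilities are positive and sum to one, and the class with the highest score also gets the highest probability.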
33. Feedforward
Feedforward
Feedforward is the process neural networks use to turn the input into an output. Let's study it
more carefully before we dive into how to train the networks.
So now that we have defined what neural networks are, we need to learn how to train them. Training them really means finding what parameters they should have on the edges in order to model our data well. So in order to learn how to train them, we need to look carefully at how they process the input to obtain an output.

Let's look at our simplest neural network, a perceptron. This perceptron receives a data point of the form (x1, x2), where the label is y = 1. This means that the point is blue. Now, the perceptron is defined by a linear equation, say w1x1 + w2x2 + b, where w1 and w2 are the weights on the edges and b is the bias in the node. Here, w1 is bigger than w2, so we'll denote that by drawing the edge labeled w1 much thicker than the edge labeled w2.

What the perceptron does is plot the point (x1, x2) and output the probability that the point is blue. Here the point is in the red area, so the output is a small number, since the point is not very likely to be blue. This process is known as feedforward. We can see that this is a bad model, because the point is actually blue, given that the label y is one.

If we have a more complicated neural network, then the process is the same. Here, we have thick edges corresponding to large weights and thin edges corresponding to small weights. The neural network plots the point in the top graph and also in the bottom graph. The output from the top model will be a small number, since the point lies in the red area, which means it has a small probability of being blue; the output from the second model will be a large number, since the point lies in the blue area, which means it has a large probability of being blue. Now the two models get combined into this nonlinear model, and the output layer just plots the point and tells us the probability that the point is blue. As you can see, this is a bad model, because it puts the point in the red area when the point is blue. Again, this process is called feedforward, and we'll look at it more carefully.

Here we have our neural network in our other notation, where the bias is on the outside. Now we have a matrix of weights: the matrix W superscript 1, denoting the first layer, whose entries are the weights w1,1 up to w3,2. Notice that the biases have now been written as w3,1 and w3,2; this is just for convenience. In the next layer, we also have a matrix, this one W superscript 2 for the second layer. This layer contains the weights that tell us how to combine the linear models in the first layer to obtain the nonlinear model in the second layer.

Now what happens is some math. We have the input in the form (x1, x2, 1), where the one comes from the bias unit. We multiply it by the matrix W1 to get some outputs. Then we apply the sigmoid function to turn the outputs into values between zero and one. Then the vector formed by these values gets a one attached for the bias unit and is multiplied by the second matrix. This returns an output that now gets passed through a sigmoid function to obtain the final output, which is y-hat. Y-hat is the prediction, or the probability that the point is labeled blue.

So this is what neural networks do. They take the input vector and apply a sequence of linear models and sigmoid functions. These maps, when combined, become a highly nonlinear map. And the final formula is simply y-hat equals sigmoid of W2 combined with sigmoid of W1 applied to x.

Just for redundancy, we do this again on a multi-layer perceptron, or neural network. To calculate our prediction y-hat, we start with the input vector x, then we apply the first matrix and a sigmoid function to get the values in the second layer. Then we apply the second matrix and another sigmoid function to get the values in the third layer, and so on and so forth, until we get our final prediction, y-hat. And this is the feedforward process that neural networks use to obtain the prediction from the input vector.
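The feedforward pass just described can be sketched in NumPy. The weight values below are made up for illustration; the shapes follow the network above, with two linear models in the hidden layer and the biases folded into the last row of each matrix:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, W1, W2):
    """Two-layer feedforward: y_hat = sigmoid(W2 . sigmoid(W1 . x)).

    x  -- input vector with the bias 1 already appended, shape (3,)
    W1 -- first-layer weights, shape (3, 2); the last row holds the biases
    W2 -- second-layer weights, shape (3, 1)
    """
    hidden = sigmoid(x @ W1)       # the two linear models, squashed to (0, 1)
    hidden = np.append(hidden, 1)  # attach a 1 for the bias unit
    return sigmoid(hidden @ W2)    # combine them into the final prediction

# Made-up weights, not from the lecture
W1 = np.array([[0.5, -0.2], [0.3, 0.8], [-0.1, 0.4]])
W2 = np.array([[0.6], [0.9], [-0.3]])
y_hat = feedforward(np.array([0.4, 0.7, 1.0]), W1, W2)
```

Whatever the weights, the output is a single number between zero and one: the probability that the point is blue.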

Error Function
Just as before, neural networks will produce an error function, which, in the end, is
what we'll be minimizing. The following video shows the error function for a neural
network.

DL 42 Neural Network Error Function (1)


So, our goal is to train our neural network. In order to do this, we have to define the error function. Let's look again at what the error function was for perceptrons.

Here's our perceptron. On the left, we have our input vector with entries x_1 up to x_n, and a one for the bias unit; and the edges with weights W_1 up to W_n, and b for the bias unit. Finally, we can see that this perceptron uses a sigmoid function, and the prediction is defined as y-hat equals sigmoid of Wx plus b. As we saw, this function gives us a measure of the error, of how badly each point is being classified. Roughly, it is a very small number if the point is correctly classified, and a measure of how far the point is from the line if the point is incorrectly classified.

So, what are we going to do to define the error function in a multilayer perceptron? Well, as we saw, our prediction is simply a combination of matrix multiplications and sigmoid functions. But the error function can be the exact same thing, right? It can be the exact same formula, except now y-hat is just a bit more complicated. And still, this function will tell us how badly a point gets misclassified, except now it's looking at a more complicated boundary.
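The binary log-loss from the previous section can be written out directly. This is a small sketch with made-up numbers; y is the label and y_hat is the network's prediction:

```python
import numpy as np

def log_loss(y, y_hat):
    """Binary log-loss (cross-entropy) for a single prediction.

    Small when y_hat is close to the label y, and large when it is far off,
    which is exactly the behavior described above.
    """
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# A blue point (y = 1): a confident correct prediction gives a low error
good = log_loss(1, 0.9)
bad = log_loss(1, 0.1)  # badly misclassified, so a much larger error
```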
34. Backpropagation
Backpropagation
Now, we're ready to get our hands into training a neural network. For this, we'll use the
method known as backpropagation. In a nutshell, backpropagation will consist of:

 Doing a feedforward operation.
 Comparing the output of the model with the desired output.
 Calculating the error.
 Running the feedforward operation backwards (backpropagation) to spread the error to each
of the weights.
 Using this to update the weights, and get a better model.
 Continuing this until we have a model that is good.

It sounds more complicated than it actually is. Let's take a look in the next few videos. The
first video will show us a conceptual interpretation of what backpropagation is.
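The steps above can be sketched for a single sigmoid perceptron. This is a minimal illustration with made-up data; the multi-layer version developed later follows the same pattern:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up training data: one point with its label
x = np.array([0.5, -0.2, 1.0])   # input with a 1 appended for the bias
y = 1.0                          # the point is blue
w = np.array([0.1, 0.4, -0.3])   # weights, with the bias as the last entry
learnrate = 0.5

for _ in range(100):
    y_hat = sigmoid(w @ x)              # 1. feedforward
    error = y - y_hat                   # 2-3. compare with the label, get the error
    error_term = error * y_hat * (1 - y_hat)  # 4. error scaled by the sigmoid slope
    w += learnrate * error_term * x     # 5. update the weights
```

After enough updates, the prediction for this point moves toward its label of 1, which is the "better model" the list promises.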
So now we're finally ready to get our hands into training a neural network. Let's quickly recall feedforward. We have our perceptron with a point coming in labeled positive, and our equation w1x1 + w2x2 + b, where w1 and w2 are the weights and b is the bias. What the perceptron does is plot the point and return a probability that the point is blue, which in this case is small, since the point is in the red area. Thus, this is a bad perceptron, since it predicts that the point is red when the point is really blue.

Now let's recall what we did in the gradient descent algorithm. We did this thing called backpropagation: we went in the opposite direction. We asked the point, "What do you want the model to do for you?" And the point says, "Well, I am misclassified, so I want this boundary to come closer to me." And we saw that the line got closer to it by updating the weights. Namely, in this case, let's say that it tells the weight w1 to go lower and the weight w2 to go higher. (This is just an illustration; it's not meant to be exact.) So we obtain new weights, w1' and w2', which define a new line which is now closer to the point.

So what we're doing is like descending from Mt. Errorest, right? The height is the error function E(W), and we calculate the gradient of the error function, which is exactly like asking the point what it wants the model to do. As we take a step in the direction of the negative of the gradient, we decrease the error to come down the mountain. This gives us a new error E(W') and a new model W' with a smaller error, which means we get a new line closer to the point. We continue doing this process in order to minimize the error.

So that was for a single perceptron. Now, what do we do for multi-layer perceptrons? Well, we still do the same process of reducing the error by descending from the mountain, except now, since the error function is more complicated, it's not Mt. Errorest; now it's Mt. Kilimanjerror. But it's the same thing: we calculate the error function and its gradient, then walk in the direction of the negative of the gradient in order to find a new model W' with a smaller error E(W'), which will give us a better prediction. And we continue doing this process in order to minimize the error.

So let's look again at what feedforward does in a multi-layer perceptron. The point comes in with coordinates (x1, x2) and label y = 1. It gets plotted in the linear models corresponding to the hidden layer. Then, as this layer gets combined, the point gets plotted in the resulting non-linear model in the output layer. And the probability that the point is blue is obtained by the position of this point in the final model.

Now, pay close attention, because this is the key for training neural networks: it's backpropagation. We'll do as before; we'll check the error. This model is not good, because it predicts that the point will be red when in reality the point is blue. So we'll ask the point, "What do you want this model to do in order for you to be better classified?" And the point says, "I kind of want this blue region to come closer to me."

Now, what does it mean for the region to come closer to it? Well, let's look at the two linear models in the hidden layer. Which one of these two models is doing better? It seems like the top one is badly misclassifying the point, whereas the bottom one is classifying it correctly. So we kind of want to listen to the bottom one more and to the top one less. What we want to do is reduce the weight coming from the top model and increase the weight coming from the bottom model, so that our final model will look a lot more like the bottom model than like the top model.

But we can do even more. We can actually go to the linear models and ask the point, "What can these models do to classify you better?" And the point will say, "Well, the top model is misclassifying me, so I kind of want this line to move closer to me. And the second model is correctly classifying me, so I want this line to move farther away from me." And so this change in the models will actually update the weights; let's say it'll increase these two and decrease these two. So now, after we update all the weights, we have better predictions at all the models in the hidden layer, and also a better prediction at the model in the output layer.

Notice that in this video we intentionally left the bias unit out for clarity. In reality, when we update the weights, we're also updating the bias unit. If you're the kind of person who likes formality, don't worry, we'll calculate these gradients in detail soon.
Backpropagation Math

And the next few videos will go deeper into the math. Feel free to tune out, since this
part gets handled by Keras pretty well. If you'd like to go start training networks right
away, go to the next section. But if you enjoy calculating lots of derivatives, let's dive
in!

In the video below at 1:24, the edges should be directed to the sigmoid function and
not the bias at that last layer; the edges of the last layer currently point to the bias,
which is incorrect.

Chain Rule
We'll need to recall the chain rule to help us calculate derivatives.

Chain Rule
So before we start calculating derivatives, let's do a refresher on the chain rule, which is the main technique we'll use to calculate them.

The chain rule says: if you have a variable x and a function f that you apply to x to get f of x, which we're going to call A, and then another function g, which you apply to f of x to get g of f of x, which we're going to call B, then if you want to find the partial derivative of B with respect to x, that's just the partial derivative of B with respect to A times the partial derivative of A with respect to x. So it literally says that when composing functions, the derivatives just multiply.

And that's going to be super useful for us, because feedforwarding is literally composing a bunch of functions, and backpropagation is literally taking the derivative at each piece. Since taking the derivative of a composition is the same as multiplying the partial derivatives, all we're going to do is multiply a bunch of partial derivatives to get what we want. Pretty simple, right?
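Here is the chain rule in action. The functions f and g below are made-up examples, not from the video; the point is only that the derivative of the composition equals the product of the two derivatives:

```python
# Chain rule sketch: with A = f(x) = x**2 and B = g(A) = 3*A + 1,
# dB/dx = (dB/dA) * (dA/dx) = 3 * 2x for these example functions.

def f(x):
    return x ** 2

def g(a):
    return 3 * a + 1

x = 2.0
analytic = 3 * (2 * x)  # the two derivatives just multiply

# Numerical check by central finite differences on the composition
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
```

The finite-difference estimate agrees with the product of partial derivatives, which is the whole content of the rule.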

So let us go back to our neural network with our weights and our input. Recall that the weights with superscript 1 belong to the first layer, and the weights with superscript 2 belong to the second layer. Also, recall that the bias is not called b anymore; now it is called W31, W32, etc., for convenience, so that we can have everything in matrix notation.

Now, what happens with the input? Let us do the feedforward process. In the first layer, we take the input and multiply it by the weights, and that gives us h1, which is a linear function of the input and the weights. Same thing with h2, given by this formula over here. In the second layer, we take h1 and h2 and the new bias, apply the sigmoid function, and then apply a linear function to them by multiplying them by the weights and adding them, to get a value h. And finally, in the third layer, we just take the sigmoid function of h to get our prediction, or probability between 0 and 1, which is ŷ. We can write this in more condensed notation by saying that the matrix corresponding to the first layer is W superscript 1, the matrix corresponding to the second layer is W superscript 2, and the prediction is just the sigmoid of W superscript 2 combined with the sigmoid of W superscript 1 applied to the input x. That is feedforward.

Now we are going to develop backpropagation, which is precisely the reverse of feedforward. We are going to calculate the derivative of the error function with respect to each of the weights, by using the chain rule. So let us recall that our error function is this formula over here, which is a function of the prediction ŷ. But since the prediction is a function of all the weights wij, the error function can be seen as a function of all the wij. Therefore, the gradient is simply the vector formed by all the partial derivatives of the error function E with respect to each of the weights.

So let us calculate one of these derivatives: the derivative of E with respect to W11 superscript 1. Since the prediction is simply a composition of functions, by the chain rule we know that this derivative is the product of the partial derivatives. In this case, the derivative of E with respect to W11 is the derivative of E with respect to ŷ, times the derivative of ŷ with respect to h, times the derivative of h with respect to h1, times the derivative of h1 with respect to W11. This may seem complicated, but the fact that we can calculate the derivative of such a complicated composition of functions by just multiplying four partial derivatives is remarkable.

Now, we have already calculated the first one, the derivative of E with respect to ŷ, and if you remember, we got ŷ minus y. So let us calculate the other ones. Let us zoom in a bit and look at just one piece of our multi-layer perceptron. The inputs are some values h1 and h2, which are values coming in from before. Once we apply the sigmoid and a linear function on h1 and h2 and the 1 corresponding to the bias unit, we get a result h. So now, what is the derivative of h with respect to h1? Well, h is a sum of three things, and only one of them contains h1. So the second and third summands just give us a derivative of 0, and the first summand gives us W11 superscript 2, because that is a constant, times the derivative of the sigmoid function with respect to h1. This is something that we calculate below in the instructor comments, namely that the sigmoid function has a beautiful derivative: the derivative of sigmoid of h is precisely sigmoid of h times 1 minus sigmoid of h. Again, you can see this development underneath in the instructor comments. You'll also have the chance to code this in the quiz, because at the end of the day, we just code these formulas and then use them forever, and that is it. That is how you train a neural network.

Calculation of the derivative of the sigmoid function

Recall that the sigmoid function has a beautiful derivative, which we can see in the
following calculation. This will make our backpropagation step much cleaner.
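The identity σ'(h) = σ(h)(1 − σ(h)) can also be checked numerically, as in this small sketch:

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    """The sigmoid's beautiful derivative: sigma(h) * (1 - sigma(h))."""
    s = sigmoid(h)
    return s * (1 - s)

# Central finite-difference check at h = 0.5
eps = 1e-6
numeric = (sigmoid(0.5 + eps) - sigmoid(0.5 - eps)) / (2 * eps)
```

At h = 0, the derivative is exactly 0.5 · 0.5 = 0.25, the sigmoid's steepest slope.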
35. Pre-Lab: Analyzing Student Data
Lab: Analyzing Student Data
Now, we're ready to put neural networks into practice. We'll analyze a dataset of student
admissions at UCLA.

To open this notebook, you have two options:

 Go to the next page in the classroom (recommended).


 Clone the repo from Github and open the notebook StudentAdmissions.ipynb in
the student_admissions folder. You can either download the repository with git clone
https://github.com/udacity/deep-learning.git, or download it as an archive file
from this link.

Instructions
In this notebook, you'll be implementing some of the steps in the training of the neural
network, namely:

 One-hot encoding the data


 Scaling the data
 Writing the backpropagation step

This is a self-assessed lab. If you need any help or want to check your answers, feel free to
check out the solutions notebook in the same folder, or by clicking here.
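As a preview of the first two steps, one-hot encoding and scaling can look like the sketch below. The arrays are hypothetical stand-ins; the notebook's actual column names and values may differ:

```python
import numpy as np

# Hypothetical "rank" column with categories 1-4, like the admissions data
rank = np.array([1, 3, 2, 4])

# One-hot encoding: one column per category, a single 1 per row
one_hot = np.eye(4)[rank - 1]

# Scaling: squash a feature such as GRE scores into [0, 1]
gre = np.array([320.0, 300.0, 340.0])
gre_scaled = gre / gre.max()
```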

36. Notebook: Analyzing Student Data


Great job!
You now know how neural networks work and how they get trained. In the next
lesson, Mat will guide you through implementing this training process in NumPy. See
you soon!

Implementing Gradient Descent

 Back to Home

 01. Mean Squared Error Function


 02. Gradient Descent
 03. Gradient Descent: The Math
 04. Gradient Descent: The Code
 05. Implementing Gradient Descent
 06. Multilayer Perceptrons
 07. Backpropagation
 08. Implementing Backpropagation
 09. Further Reading
Log-Loss vs Mean Squared Error
In the previous section, Luis taught you about the log-loss function. There are many
other error functions used for neural networks. Let me teach you another one, called
the mean squared error. As the name says, this one is the mean of the squares of the
differences between the predictions and the labels. In the following section I'll go over
it in detail, then we'll get to implement backpropagation with it on the same student
admissions dataset.

And as a bonus, we'll be implementing this in a very effective way using matrix
multiplication with NumPy!
02. Gradient Descent
Gradient Descent with Squared Errors
We want to find the weights for our neural networks. Let's start by thinking about the goal.
The network needs to make predictions as close as possible to the real values. To measure this,
we use a metric of how wrong the predictions are, the error. A common metric is the sum of
the squared errors (SSE):

E = \frac{1}{2}\sum_{\mu} \sum_j \left[ y^{\mu}_j - \hat{y}^{\mu}_j \right]^2

where \hat{y} is the prediction and y is the true value, and you take the sum over all output
units j and another sum over all data points \mu. This might seem like a really complicated
equation at first, but it's fairly simple once you understand the symbols and can say what's
going on in words.
First, the inside sum over j. This variable j represents the output units of the network. So this
inside sum is saying: for each output unit, find the difference between the true value y and the
predicted value from the network \hat{y}, then square the difference, then sum up all those
squares.
Then the other sum over \mu is a sum over all the data points. So, for each data point you
calculate the inner sum of the squared differences for each output unit. Then you sum up those
squared differences for each data point. That gives you the overall error for all the output
predictions for all the data points.
The SSE is a good choice for a few reasons. The square ensures the error is always positive
and larger errors are penalized more than smaller errors. Also, it makes the math nice, always
a plus.

Remember that the output of a neural network, the prediction, depends on the weights

\hat{y}^{\mu}_j = f \left( \sum_i{ w_{ij} x^{\mu}_i }\right)

and accordingly the error depends on the weights

E = \frac{1}{2}\sum_{\mu} \sum_j \left[ y^{\mu}_j - f \left( \sum_i{ w_{ij} x^{\mu}_i }\right) \right]^2

We want the network's prediction error to be as small as possible and the weights are the
knobs we can use to make that happen. Our goal is to find weights w_{ij} that minimize
the squared error E. To do this with a neural network, typically you'd use gradient descent.
Enter Gradient Descent
As Luis said, with gradient descent, we take multiple small steps towards our goal. In
this case, we want to change the weights in steps that reduce the error. Continuing
the analogy, the error is our mountain and we want to get to the bottom. Since the
fastest way down a mountain is in the steepest direction, the steps taken should be in
the direction that minimizes the error the most. We can find this direction by
calculating the gradient of the squared error.

Gradient is another term for rate of change or slope. If you need to brush up on this
concept, check out Khan Academy's great lectures on the topic.
The gradient is just a derivative generalized to functions with more than one variable.
We can use calculus to find the gradient at any point in our error function, which
depends on the input weights. You'll see how the gradient descent step is derived on
the next page.

Below I've plotted an example of the error of a neural network with two inputs, and
accordingly, two weights. You can read this like a topographical map where points on
a contour line have the same error and darker contour lines correspond to larger
errors.

At each step, you calculate the error and the gradient, then use those to determine
how much to change each weight. Repeating this process will eventually find weights
that are close to the minimum of the error function, the black dot in the middle.
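That loop can be sketched in a few lines of Python. This is just an illustration with a made-up one-dimensional error function E(w) = (w - 3)^2, not the network's error; the minimum at w = 3 and the learning rate 0.1 are arbitrary choices:

```python
# Toy error function E(w) = (w - 3)**2 and its gradient dE/dw = 2*(w - 3)
def gradient(w):
    return 2 * (w - 3)

w = 0.0            # initial weight
learnrate = 0.1    # step size

for step in range(100):
    # Step in the direction that decreases the error
    w -= learnrate * gradient(w)

print(w)  # close to 3.0, the minimum of the error function
```

Each iteration shrinks the distance to the minimum by a constant factor, so the weight settles near w = 3 after enough steps.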
Caveats
Since the weights will just go wherever the gradient takes them, they can end up
where the error is low, but not the lowest. These spots are called local minima. If the
weights are initialized with the wrong values, gradient descent could lead the weights
into a local minimum, illustrated below.

There are methods to avoid this, such as using momentum.
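As a rough sketch of the momentum idea (the coefficient 0.9 below is just a common default, not something prescribed by this lesson): each update keeps a fraction of the previous step as a velocity, which can carry the weights through shallow local minima instead of stopping in them. Again using the toy error E(w) = (w - 3)^2:

```python
def gradient(w):
    # Gradient of the toy error E(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0
learnrate = 0.1
momentum = 0.9   # fraction of the previous step to carry forward
velocity = 0.0

for step in range(200):
    # The velocity accumulates past gradients, giving the update inertia
    velocity = momentum * velocity - learnrate * gradient(w)
    w += velocity

print(w)  # settles near 3.0, after some overshooting along the way
```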


03. Gradient Descent: The Math
Gradient Descent-Math
INSTRUCTOR NOTE:

Notes

Check out Khan Academy's Multivariable calculus lessons if you are unfamiliar with the
subject.

import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

# The linear combination performed by the node (h in f(h) and f'(h))
h = x[0]*weights[0] + x[1]*weights[1]
# or h = np.dot(x, weights)

# The neural network output (y-hat)
nn_output = sigmoid(h)

# Output error (y - y-hat)
error = y - nn_output

# Output gradient (f'(h))
output_grad = sigmoid_prime(h)

# Error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step
del_w = [learnrate * error_term * x[0],
         learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
Start Quiz:
gradient.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
### fewer variable names than in the above sample code

# TODO: Calculate the node's linear combination of inputs and weights
h = None

# TODO: Calculate output of neural network
nn_output = None

# TODO: Calculate error of neural network
error = None

# TODO: Calculate the error term
# Remember, this requires the output gradient, which we haven't
# specifically added a variable for.
error_term = None

# TODO: Calculate change in weights
del_w = None

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)
Start Quiz:
gradient.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
### fewer variable names than in the above sample code

# Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)

# Calculate output of neural network
nn_output = sigmoid(h)

# Calculate error of neural network
error = y - nn_output

# Calculate the error term
# Remember, this requires the output gradient, which we haven't
# specifically added a variable for.
error_term = error * sigmoid_prime(h)
# Note: The sigmoid_prime function calculates sigmoid(h) twice,
# but you've already calculated it once. You can make this
# code more efficient by calculating the derivative directly
# rather than calling sigmoid_prime, like this:
# error_term = error * nn_output * (1 - nn_output)

# Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)
The goal here is to predict if a student will be admitted to a graduate program based
on these features. For this, we'll use a network with one output layer with one unit.
We'll use a sigmoid function for the output unit activation.

Data cleanup
You might think there will be three input units, but we actually need to transform the
data first. The  rank  feature is categorical: the numbers don't encode any sort of
relative values. Rank 2 is not twice as much as rank 1, and rank 3 is not 1.5 times rank
2. Instead, we need to use dummy variables to encode  rank , splitting the data into
four new columns encoded with ones or zeros. Rows with rank 1 have one in the rank
1 dummy column, and zeros in all other columns. Rows with rank 2 have one in the
rank 2 dummy column, and zeros in all other columns. And so on.
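A minimal sketch of that encoding with pandas, using a made-up rank column (the full preparation lives in the  data_prep.py  file below):

```python
import pandas as pd

# A made-up rank column standing in for the admissions data
df = pd.DataFrame({'rank': [1, 2, 3, 4, 2]})

# One dummy column per rank value, filled with ones and zeros
dummies = pd.get_dummies(df['rank'], prefix='rank')
print(dummies)
```

Each row has a one in exactly one of the four rank_1 through rank_4 columns, so no ordering between ranks is implied.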

We'll also need to standardize the GRE and GPA data, which means to scale the values
such that they have zero mean and a standard deviation of 1. This is necessary
because the sigmoid function squashes really small and really large inputs. The
gradient of really small and large inputs is zero, which means that the gradient
descent step will go to zero too. Since the GRE and GPA values are fairly large, we
have to be really careful about how we initialize the weights or the gradient descent
steps will die off and the network won't train. Instead, if we standardize the data, we
can initialize the weights easily and everyone is happy.

This is just a brief run-through, you'll learn more about preparing data later. If you're
interested in how I did this, check out the  data_prep.py  file in the programming
exercise below.

Now that the data is ready, we see that there are six input features:  gre ,  gpa , and the
four  rank  dummy variables.

Mean Square Error


We're going to make a small change to how we calculate the error here. Instead of the
SSE, we're going to use the mean of the squared errors (MSE). Now that we're using a
lot of data, summing up all the weight steps can lead to really large updates that
make the gradient descent diverge. To compensate for this, you'd need to use a quite
small learning rate. Instead, we can just divide by the number of records in our
data, m, to take the average. This way, no matter how much data we use, our
learning rates will typically be in the range of 0.01 to 0.001. Then, we can use the MSE
(shown below) to calculate the gradient and the result is the same as before, just
averaged instead of summed.
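As a small illustration with made-up predictions and targets, dividing the summed error by the number of records m gives the averaged version:

```python
import numpy as np

# Made-up predictions and targets for three records
predictions = np.array([0.2, 0.6, 0.9])
targets = np.array([0.0, 1.0, 1.0])

# Sum of squared errors: grows with the number of records
sse = 0.5 * np.sum((targets - predictions) ** 2)

# Mean squared error: divide by the number of records m,
# so the scale is independent of the dataset size
m = len(targets)
mse = sse / m

print(sse, mse)
```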
Programming exercise
Below, you'll implement gradient descent and train the network on the admissions
data. Your goal here is to train the network until you reach a minimum in the mean
square error (MSE) on the training set. You need to implement:

 The network output:  output .


 The output error:  error .
 The error term:  error_term .
 Update the weight step:  del_w += .
 Update the weights:  weights += .

After you've written these parts, run the training by pressing "Test Run". The MSE will
print out, as well as the accuracy on a test set, the fraction of correctly predicted
admissions.

Feel free to play with the hyperparameters and see how it changes the MSE.
Start Quiz:
gradient.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# TODO: We haven't provided the sigmoid_prime function like we did in
# the previous lesson to encourage you to come up with a more
# efficient solution. If you need a hint, check out the comments
# in solution.py from the previous lecture.

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Note: We haven't included the h variable from the previous
        # lesson. You can add it if you want, or you can calculate
        # the h together with the output

        # TODO: Calculate the output
        output = None

        # TODO: Calculate the error
        error = None

        # TODO: Calculate the error term
        error_term = None

        # TODO: Calculate the change in weights for this sample
        # and add it to the total weight change
        del_w += 0

    # TODO: Update weights using the learning rate and the average
    # change in weights
    weights += 0

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
Start Quiz:
gradient.py data_prep.py binary.csv solution.py
import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'],
                                             prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:, field] = (data[field] - mean) / std

# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9),
                          replace=False)
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
Start Quiz:
gradient.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Note: We haven't provided the sigmoid_prime function like we did in
# the previous lesson to encourage you to come up with a more
# efficient solution. If you need a hint, check out the comments
# in solution.py from the previous lecture.

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Activation of the output unit
        # Notice we multiply the inputs and the weights here
        # rather than storing h as a separate variable
        output = sigmoid(np.dot(x, weights))

        # The error, the target minus the network output
        error = y - output

        # The error term
        # Notice we calculate f'(h) here instead of defining a separate
        # sigmoid_prime function. This just makes it faster because we
        # can re-use the result of the sigmoid function stored in
        # the output variable
        error_term = error * output * (1 - output)

        # The gradient descent step, the error times the gradient
        # times the inputs
        del_w += error_term * x

    # Update the weights here. The learning rate times the
    # change in weights, divided by the number of records to average
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
06. Multilayer Perceptrons
Implementing the hidden layer
Prerequisites
Below, we are going to walk through the math of neural networks in a multilayer
perceptron. With multiple perceptrons, we are going to move to using vectors and
matrices. To brush up, be sure to view the following:

1. Khan Academy's introduction to vectors.


2. Khan Academy's introduction to matrices.

Derivation
Before, we were dealing with only one output node, which made the code
straightforward. However, now that we have multiple input units and multiple hidden
units, the weights between them will require two indices: w_{ij}, where i denotes
input units and j denotes hidden units.
For example, the following image shows our network, with its input units labeled x_1,
x_2, and x_3, and its hidden nodes labeled h_1 and h_2:

The lines indicating the weights leading to h_1 have been colored differently from
those leading to h_2 just to make it easier to read.
Now to index the weights, we take the input unit number for the i and the hidden
unit number for the j. That gives us

w_{11}

for the weight leading from x_1 to h_1, and

w_{12}

for the weight leading from x_1 to h_2.

The following image includes all of the weights between the input layer and the
hidden layer, labeled with their appropriate w_{ij} indices:
Start Quiz:
multilayer.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)

weights_input_to_hidden = np.random.normal(0, scale=0.1,
                                           size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1,
                                            size=(N_hidden, N_output))

# TODO: Make a forward pass through the network
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output:')
print(hidden_layer_out)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)

07. Backpropagation
Backpropagation
Now we've come to the problem of how to make a multilayer neural network learn.
Before, we saw how to update weights with gradient descent. The backpropagation
algorithm is just an extension of that, using the chain rule to find the error with
respect to the weights connecting the input layer to the hidden layer (for a two layer
network).

To update the weights to hidden layers using gradient descent, you need to know
how much error each of the hidden units contributed to the final output. Since the
output of a layer is determined by the weights between layers, the error resulting
from units is scaled by the weights going forward through the network. Since we
know the error at the output, we can use the weights to work backwards to hidden
layers.

For example, in the output layer, you have errors \delta^o_k attributed to each
output unit k. Then, the error attributed to hidden unit j is the output errors, scaled
by the weights between the output and hidden layers (and the gradient):

\delta^h_j = \sum_k \delta^o_k W_{jk} \, f'(h_j)

Then, the gradient descent step is the same as before, just with the new errors:

\Delta w_{ij} = \eta \, \delta^h_j x_i

where w_{ij} are the weights between the inputs and hidden layer and x_i are the
input unit values. This form holds for however many layers there are. The weight steps
are equal to the step size times the output error of the layer times the values of the
inputs to that layer:

\Delta w = \eta \, \delta_{output} V_{in}

Here, you get the output error, \delta_{output}, by propagating the errors
backwards from higher layers. And the input values, V_{in}, are the inputs to the
layer, the hidden layer activations to the output unit for example.
Working through an example

Let's walk through the steps of calculating the weight updates for a simple two layer
network. Suppose there are two input values, one hidden unit, and one output unit,
with sigmoid activations on the hidden and output units. The following image depicts
this network. (Note: the input values are shown as nodes at the bottom of the image,
while the network's output value is shown as \hat{y} at the top. The inputs
themselves do not count as a layer, which is why this is considered a two layer
network.)
Very Important
It turns out this is exactly how we want to calculate the weight update step. As before,
if you have your inputs as a 2D array with one row, you can also
do  hidden_error*inputs.T , but that won't work if  inputs  is a 1D array.
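A small sketch of that shape issue, with made-up numbers: for a 1D input array, adding a new axis with  x[:, None]  turns it into a column, so the multiplication broadcasts into a matrix with one entry per input-to-hidden weight:

```python
import numpy as np

x = np.array([0.5, 0.1, -0.2])             # 1D inputs, shape (3,)
hidden_error_term = np.array([0.1, -0.3])  # one value per hidden unit, shape (2,)

# x[:, None] has shape (3, 1), so broadcasting against the (2,) error
# terms gives a (3, 2) array: one weight step for every weight in a
# 3-input, 2-hidden-unit layer
del_w = x[:, None] * hidden_error_term

print(del_w.shape)  # (3, 2)
```

This is the same broadcasting pattern used for  delta_w_i_h  in the exercise solution, just without the learning rate.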

Backpropagation exercise
Below, you'll implement the code to calculate one backpropagation update step for
two sets of weights. I wrote the forward pass - your goal is to code the backward pass.

Things to do

 Calculate the network's output error.


 Calculate the output layer's error term.
 Use backpropagation to calculate the hidden layer's error term.
 Calculate the change in weights (the delta weights) that result from propagating the
errors back through the network.

Start Quiz:
backprop.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate output error
error = None

# TODO: Calculate error term for output layer
output_error_term = None

# TODO: Calculate error term for hidden layer
hidden_error_term = None

# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = None

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = None

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)
Start Quiz:
backprop.py solution.py
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
# Calculate output error
error = target - output

# Calculate error term for output layer
output_error_term = error * output * (1 - output)

# Calculate error term for hidden layer
hidden_error_term = np.dot(output_error_term, weights_hidden_output) * \
                    hidden_layer_output * (1 - hidden_layer_output)

# Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)
08. Implementing Backpropagation
Implementing backpropagation
Now we've seen that the error term for the output layer is

\delta_k = (y_k - \hat{y}_k) f'(a_k)

and the error term for the hidden layer is

\delta^h_j = \sum_k \delta_k W_{jk} \, f'(h_j)

For now we'll only consider a simple network with one hidden layer and one output unit.
Here's the general algorithm for updating the weights with backpropagation:

 Set the weight steps for each layer to zero:
o The input to hidden weights: \Delta w_{ij} = 0
o The hidden to output weights: \Delta W_j = 0
 For each record in the training data:
o Make a forward pass through the network, calculating the output \hat{y}
o Calculate the error gradient in the output unit, \delta^o = (y - \hat{y}) f'(z), where z = \sum_j W_j a_j is the input to the output unit.
o Propagate the errors to the hidden layer: \delta^h_j = \delta^o W_j f'(h_j)
o Update the weight steps:
 \Delta W_j = \Delta W_j + \delta^o a_j
 \Delta w_{ij} = \Delta w_{ij} + \delta^h_j a_i
 Update the weights, where \eta is the learning rate and m is the number of records:
o W_j = W_j + \eta \Delta W_j / m
o w_{ij} = w_{ij} + \eta \Delta w_{ij} / m
 Repeat for e epochs.
Backpropagation exercise
Now you're going to implement the backprop algorithm for a network trained on the graduate
school admission data. You should have everything you need from the previous exercises to
complete this one.

Your goals here:

 Implement the forward pass.


 Implement the backpropagation algorithm.
 Update the weights.

Start Quiz:
backprop.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = None
        hidden_output = None
        output = None

        ## Backward pass ##
        # TODO: Calculate the network's prediction error
        error = None

        # TODO: Calculate error term for the output unit
        output_error_term = None

        ## propagate errors to hidden layer

        # TODO: Calculate the hidden layer's contribution to the error
        hidden_error = None

        # TODO: Calculate the error term for the hidden layer
        hidden_error_term = None

        # TODO: Update the change in weights
        del_w_hidden_output += 0
        del_w_input_hidden += 0

    # TODO: Update weights (don't forget to divide by n_records or
    # number of samples)
    weights_input_hidden += 0
    weights_hidden_output += 0

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(features, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output, weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
INSTRUCTOR NOTE:

Note: This code takes a while to execute, so Udacity's servers sometimes return with an error
saying it took too long. If that happens, it usually works if you try again.
Start Quiz:
backprop.py data_prep.py binary.csv solution.py
import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'],
                                             prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:, field] = (data[field] - mean) / std

# Split off random 10% of the data for testing
np.random.seed(21)
sample = np.random.choice(data.index, size=int(len(data)*0.9),
                          replace=False)
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
Start Quiz:
backprop.py data_prep.py binary.csv solution.py
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)
        output = sigmoid(np.dot(hidden_output, weights_hidden_output))

        ## Backward pass ##
        # Calculate the network's prediction error
        error = y - output

        # Calculate error term for the output unit
        output_error_term = error * output * (1 - output)

        ## propagate errors to hidden layer
        # Calculate the hidden layer's contribution to the error
        hidden_error = np.dot(output_error_term, weights_hidden_output)

        # Calculate the error term for the hidden layer
        hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)

        # Update the change in weights
        del_w_hidden_output += output_error_term * hidden_output
        del_w_input_hidden += hidden_error_term * x[:, None]

    # Update weights
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(features, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output, weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
09. Further Reading
Further reading
Backpropagation is fundamental to deep learning. TensorFlow and other libraries will
perform the backprop for you, but you should really really understand the algorithm.
We'll be going over backprop again, but here are some extra resources for you:

Very important

 From Andrej Karpathy: Yes, you should understand backprop

 Also from Andrej Karpathy, a lecture from Stanford's CS231n course


Part 02-Module 01-Lesson 04_GPU Workspaces Demo
01. Introduction to GPU Workspaces

## Introduction
Udacity Workspaces with GPU support are available for some projects as an alternative to
manually configuring your own remote server with GPU support. These workspaces provide a
Jupyter notebook server directly in your browser. This lesson will briefly introduce the
Workspaces interface.

Important Notes:

 Workspaces sessions are connections from your browser to a remote server. Each student has
a limited number of GPU hours allocated on the servers (the allocation is significantly more
than completing the projects is expected to take). There is currently no limit on the number of
Workspace hours when GPU mode is disabled.
 Workspace data stored in the user's home folder is preserved between sessions (and can be
reset as needed, e.g., to get project updates).
 Only 3 gigabytes of data can be stored in the home folder.
 Workspace sessions are preserved if your connection drops or your browser window is closed; simply return to the classroom and re-open the workspace page. However, workspace sessions are automatically terminated after a period of inactivity. This prevents you from leaving a session connection open and burning through your time allocation. (See the section on active connections below.)
 The kernel state is preserved as long as the notebook session remains open, but it
is not preserved if the session is closed. If you exit the notebook for more than half an hour
and the session is closed, you will need to re-run any previously-run cells before continuing.

## Overview
The default workspaces interface

When the workspace opens, you'll see the normal Jupyter file browser. From this
interface you can open a notebook file, start a remote terminal session, enable the
GPU, submit your project, or reset the workspace data, and more. Clicking the three
bars in the top left corner above the Jupyter logo will toggle hiding the classroom
lessons sidebar.

NOTE: You can always return to the file browser page from anywhere else in the
workspace by clicking the Jupyter logo in the top left corner.

## Opening a notebook
View of the project notebook

Clicking the name of a notebook (*.ipynb) file in the file list will open a standard
Jupyter notebook view of the project. The notebook session will remain open as long
as you are active, and will be automatically terminated after 30 minutes of inactivity.

You can exit a notebook by clicking on the Jupyter logo in the top left corner.

NOTE: Notebooks continue to run in the background unless they are stopped. IF
GPU MODE IS ACTIVE, IT WILL REMAIN ACTIVE AFTER CLOSING OR STOPPING A
NOTEBOOK. YOU CAN ONLY STOP GPU MODE WITH THE GPU TOGGLE BUTTON.
(See next section.)
## Enabling GPU Mode

The GPU Toggle Button

GPU Workspaces can also be run without time restrictions when the GPU mode is
disabled. The "Enable"/"Disable" button (circled in red in the image) can be used to
toggle GPU mode. NOTE: Toggling GPU support may switch the physical server
your session connects to, which can cause data loss UNLESS YOU CLICK THE
SAVE BUTTON BEFORE TOGGLING GPU SUPPORT.

ALWAYS SAVE YOUR CHANGES BEFORE TOGGLING GPU SUPPORT.

## Keeping Your Session Active


Workspaces automatically disconnect after 30 minutes of user inactivity—which
means that workspaces can disconnect during long-running tasks (like training neural
networks). We have provided a utility that can keep your workspace sessions active for
these tasks. However, keep the following guidelines in mind:

 Do not try to permanently hold the workspace session active when you do not have a
process running (e.g., do not try to hold the session open in the background)—the
limits are in place to preserve your GPU time allocation; there is no guarantee that
you'll receive additional time if you exceed the limit.
 Make sure that you save the results of the long running task to disk as soon as the
task ends (e.g., checkpoint your model parameters for deep learning networks);
otherwise the workspace will disconnect 30 minutes after the active process ends, and
the results will be lost.

The  workspace_utils.py  module (available here) includes an iterator wrapper called  keep_awake  and a context manager called  active_session  that can be used to maintain an active session during long-running processes. The two functions are equivalent, so use whichever fits better in your code. NOTE: The file may be incorrectly downloaded as  workspace-utils.py  (note the dash instead of an underscore in the filename). Make sure to correct the filename before uploading to your workspace; Python cannot import from file names that include hyphens.

Example using  keep_awake :

from workspace_utils import keep_awake

for i in keep_awake(range(5)):
    # anything that happens inside this loop will keep the workspace active
    # do iteration with lots of work here
Example using  active_session :

from workspace_utils import active_session

with active_session():
    # do long-running work here
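For intuition, here is a rough sketch of how an iterator wrapper like  keep_awake  could be structured. This is not Udacity's actual implementation — the real module sends keep-alive requests to the workspace server — so the  ping  callback below is a stand-in for that request:

```python
import time
from contextlib import contextmanager

def keep_awake(iterable, ping=lambda: None):
    """Yield items from iterable, calling ping() before each one.
    In the real module, ping would send a keep-alive request."""
    for item in iterable:
        ping()
        yield item

@contextmanager
def active_session(ping=lambda: None):
    """Context-manager counterpart; this sketch pings once on entry."""
    ping()
    yield

# The loop body runs as usual while the pings keep the session alive
pings = []
for i in keep_awake(range(5), ping=lambda: pings.append(time.time())):
    pass
```

The key design point is that the keep-alive work is interleaved with your own loop iterations, so you never have to manage a separate background process yourself.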
## Submitting a Project
The Submit Project Button

Some workspaces are able to directly submit projects on your behalf (i.e., you
do not need to manually submit the project in the classroom). To submit your project,
simply click the "Submit Project" button (circled in red in the above image).

If you do not see the "Submit Project" button, then project submission is not enabled
for that workspace. You will need to manually download your project files and submit
them in the classroom.

NOTE: YOU MUST ENSURE THAT YOUR SUBMISSION INCLUDES ALL REQUIRED
FILES BEFORE SUBMITTING -- INCLUDING ANY FILE CONVERSIONS (e.g., from
ipynb to HTML)

## Opening a Terminal
The "New" menu button

Jupyter workspaces support several views, including the file browser and notebook
view already covered, as well as shell terminals. To open a terminal shell, click the
"New" menu button at the top right of the file browser view and select "Terminal".

## Terminals
Jupyter terminal shell interface

Terminals provide a full Bash shell that you can use to install or update software
packages, fetch updates from github repositories, or run any other terminal
commands. As with the notebook view, you can return to the file browser view by
clicking on the Jupyter logo at the top left corner of the window.

NOTE: Your data & changes are persistent across workspace sessions. Any
changes you make will need to be repeated if you later reset your workspace
data.

## Resetting Data
The Menu Button

The "Menu" button in the bottom left corner provides support for resetting your
Workspaces. The "Refresh Workspace" button will refresh your session, which has no
effect on the changes you've made in the workspace.

The "Reset Data" button discards all changes and restores a clean copy of the
workspace. Clicking the button will open a dialog that requires you to type "Reset
data" in a confirmation dialog. ALL OF YOUR DATA WILL BE LOST.

Resetting should only be required if Udacity makes changes to the project and you
can't get them via  git pull , or if you destroy the contents of the workspace. If you
do need to reset your data, you are strongly encouraged to download a copy of your
work from the file interface before clicking Reset Data.

02. Workspace Playground


Try it out!
There is an empty workspace in the next module that you can use to explore the workspaces
interface. The GPU time allocation in this notebook is shared with all others throughout the
term, but you can use this playground to experiment with the interface.

THE PLAYGROUND MAY NOT SUPPORT ALL PROJECTS. FOLLOW THE


INSTRUCTIONS FOR EACH PROJECT TO COMPLETE AND SUBMIT THEM. In
other words, if you're working on a project that doesn't have an associated workspace, then
there is no expectation for this playground to support that project.

Project: Predicting Bike Sharing Data

 Back to Home

 01. Introduction to the Project


 02. Project Workspace
 Project Description - Your first neural network
 Project Rubric - Your first neural network
02. Project Workspace
Workspace

This section contains a workspace (such as a Jupyter Notebook workspace or an online code editor workspace) that cannot be automatically reproduced here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity uploads the workspace files to https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
Your first neural network
Your First Neural Network
Introduction

In this project, you'll get to build a neural network from scratch to carry out a prediction
problem on a real dataset! By building a neural network from the ground up, you'll have a
much better understanding of gradient descent, backpropagation, and other concepts that are important to know before we move to higher-level tools such as TensorFlow. You'll also get to see how to apply these networks to solve real prediction problems!

The data comes from the UCI Machine Learning Database.

Instructions

1. Download the project materials from our GitHub repository. You can download the
repository with  git clone https://github.com/udacity/deep-learning.git . Our
files in the GitHub repo are the most up to date, so it's the best place to get the project files.
2. cd into the  first-neural-network  directory.
3. Download anaconda or miniconda based on the instructions in the Anaconda lesson.
4. Create a new conda environment:

conda create --name dlnd python=3

5. Enter your new environment:


o Mac/Linux:  >> source activate dlnd
o Windows:  >> activate dlnd
6. Ensure you have  numpy ,  matplotlib ,  pandas , and  jupyter notebook  installed by
doing the following:

conda install numpy matplotlib pandas jupyter notebook

7. Run the following to open up the notebook server:

jupyter notebook

8. In your browser, open  Your_first_neural_network.ipynb


9. Follow the instructions in the notebook; they will lead you through the project. You'll
ultimately be editing the  my_answers.py  python file, whose components are imported into
the notebook at various places.
10. Ensure you've passed the unit tests in the notebook and have taken a look at the
rubric before you submit the project!

If you need help running the notebook file, check out the Jupyter notebook lesson.
Submission

Before submitting your solution to a reviewer, you are required to submit your project to
Udacity's Project Assistant, which will provide some initial feedback. It will give you
feedback within a minute or two on whether your project will meet all specifications.
It is possible to submit projects which do not pass all tests; you can expect to get feedback
from your Udacity reviewer on these within 3-4 days.

The setup for the project assistant is simple. If you have not installed the client tool from a
different Nanodegree program already, then you may do so with the command  pip install
udacity-pa .

To submit your code to the project assistant, run  udacity submit  from within the top-level
directory of the project. You will be prompted for a username and password. If you login using
google or facebook, visit this link for alternate login instructions.

This process will create a zipfile in your top-level directory named  first_neural_network-
result-.zip , where there will be a number between  result-  and  .zip . This is the file that
you should submit to the Udacity reviews system.

Upload that file into the system and hit Submit Project below!

If you run into any issues using the project assistant, please check this page to troubleshoot;
feel free to post your problem in Knowledge if it isn't covered by one of the displayed cases!

What to do afterwards

If you're waiting for new content or to get the review back, here's a great video from Frank
Chen about the history of deep learning. It's a 45 minute video, sort of a short documentary,
starting in the 1950s and bringing us to the current boom in deep learning and artificial
intelligence.
Your first neural network
Code Functionality

 All code works appropriately and passes all unit tests: all the code in the notebook runs without errors, and all unit tests pass.
 Sigmoid activation function: the sigmoid activation function is implemented correctly.

Forward Pass

 Forward Pass - Training: the forward pass is correctly implemented for the network's training.
 Forward Pass - Run: the run method correctly produces the desired regression output for the network.

Backward Pass

 Batch Weight Change: the network correctly implements the backward pass for each batch, correctly updating the weight change.
 Updating the weights: updates to both the input-to-hidden and hidden-to-output weights are implemented correctly.

Hyperparameters

 Number of epochs: the number of epochs is chosen such that the network is trained well enough to accurately model the data.
 Number of hidden units: the number of hidden units is chosen such that the network is able to accurately predict without overfitting.
 Learning rate: the learning rate is chosen such that the network successfully converges, but is still time-efficient.
 Output Nodes: the number of output nodes is properly selected to solve the desired problem.
 Final Results: the training loss is below 0.09 and the validation loss is below 0.18.
Part 02-Module 01-Lesson 06_Sentiment Analysis
03. Materials
Materials
As you follow along this lesson, it's extremely important that you open the Jupyter notebook
and attempt the exercises. Much of the value in this experience will come from seeing how
your solution is different from Andrew's and playing around with the code in your own way.
Make this lesson count!

Workspace
The best way to open the notebook is to click here, which will open it in a new window. We
recommend you to work on the notebook in that window, and watch the videos in this one.
You can also get to the notebook by clicking the "Next" button in the classroom.

If you want to download the notebooks yourself, you can clone them from our GitHub
repository. You can either download the repository with  git clone
https://github.com/udacity/deep-learning.git , or download it as an archive file
from this link.

This lesson uses the following files:

 Sentiment_Classification_Projects.ipynb  - a notebook you will use to follow along and work on the lesson mini projects.
 Sentiment_Classification_Solutions.ipynb  - a notebook that includes Andrew’s
solutions to the lesson projects, which you can use for reference
 A notebook for the solution for each mini project.
 reviews.txt  - a collection of 25 thousand movie reviews
 labels.txt  - positive/negative sentiment labels for the associated reviews in reviews.txt

Note: the notebooks for these lessons have been updated since the videos were recorded. In
most cases that just means your notebook will contain more hints and explanatory text than
what you see in the videos, but there may be some minor differences in the code as well. With
these changes, you still will be able to follow along with the lessons, and should have an easier
time understanding the project material.

Solutions
If you need help, feel free to look at the solutions in the same folder.
04. The Notebooks
Workspace

This section contains either a workspace (it can be a Jupyter Notebook workspace or an online
code editor work space, etc.) and it cannot be automatically downloaded to be generated here.
Please access the classroom with your account and manually download the workspace to your
local machine. Note that for some courses, Udacity upload the workspace files
onto https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
06. Mini Project 1
Instructions

In this project, you'll test your theory of what features of a review correlate with the label!
Here are your specific steps:

Mini Project 1

Task List:
 
Work in the  Project 1  section of  Sentiment_Classification_Projects.ipynb .

 
Follow the notebook’s instructions to test the correlation between review features and labels.

Task Feedback:
Nice work! In the next video, Andrew will explain his solution.
Important about project
09. Mini Project 2
Instructions

In the following mini project, you'll convert the inputs and outputs of the dataset into numbers.
Namely, you will convert each review string into a vector, and each label into a  0  or  1 .
You’ll need to make a few additions to the notebook, but the main work will be implementing
two functions, whose signatures are shown below:

def update_input_layer(review):
    """ Modify the global layer_0 to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
        review(string) - the string of the review
    Returns:
        None
    """
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    ## Your code here
    pass

def get_target_for_label(label):
    """Convert a label to `0` or `1`.
    Args:
        label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
        `0` or `1`.
    """
    pass
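For reference, here is one possible way these two stubs could be filled in, sketched with a tiny made-up vocabulary — in the notebook you would build  vocab  and  word2index  from the full review set, so the names and sizes below are illustrative only:

```python
import numpy as np

# Illustrative vocabulary; the notebook derives this from reviews.txt
vocab = ["the", "movie", "was", "great", "terrible"]
word2index = {word: i for i, word in enumerate(vocab)}
layer_0 = np.zeros((1, len(vocab)))

def update_input_layer(review):
    """Set layer_0[0][i] to the count of vocab word i in review."""
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        if word in word2index:
            layer_0[0][word2index[word]] += 1

def get_target_for_label(label):
    """Map "POSITIVE" to 1 and anything else to 0."""
    return 1 if label == "POSITIVE" else 0

update_input_layer("the movie was great great")
```

After the call above, the element of  layer_0  for "great" holds 2 and the elements for the other seen words hold 1, which is exactly the bag-of-words count representation the project asks for.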
Mini Project 2

Task List:
 
Work in the  Project 2  section of  Sentiment_Classification_Projects.ipynb .

 
Follow the notebook’s instructions to convert your inputs and outputs to numbers.

 
Create a global vocabulary set  vocab

 
Initialize a  global  layer_0 that is a vector of the size of the text vocabulary. All values should
be initialized to  0 .
 
Implement  update_input_layer .

 
Implement  get_target_for_label .

Task Feedback:
Nice work! In the next video, Andrew will share his solution.
Keras

 Back to Home

 01. Intro
 02. Keras
 03. Pre-Lab: Student Admissions in Keras
 04. Lab: Student Admissions in Keras
 05. Optimizers in Keras
 06. Mini Project Intro
 07. Pre-Lab: IMDB Data in Keras
 08. Lab: IMDB Data in Keras
Keras
Hi again! Now we know all there is to know about training and optimizing neural networks,
and we've actually trained a few of them in NumPy. But this is not what we normally
do in real life. There are many packages that will make our life much easier. The two
main ones that we'll learn in this course are Keras and TensorFlow. In this lesson, we'll
learn to use Keras.

The way we'll learn is by writing lots of code and building lots of models. We'll start by
building a simple neural network that will solve the XOR problem. Then, we'll build a
bigger neural network that will analyze the student data that we have analyzed in a
previous section.

And finally, we'll have a lab in which you'll be able to build a neural network yourself,
which will process text, and make predictions on the sentiment of movie reviews in
IMDB.
02. Keras
Neural Networks in Keras
Luckily, every time we need to use a neural network, we won't need to code the activation
function, gradient descent, etc. There are lots of packages for this, which we recommend you
to check out, including the following:

 Keras
 TensorFlow
 Caffe
 Theano
 Scikit-learn
 And many others!

In this course, we will learn Keras. Keras makes coding deep neural networks simpler. To
demonstrate just how easy it is, you're going to build a simple fully-connected network in a
few dozen lines of code.

We’ll be connecting the concepts that you’ve learned in the previous lessons to the methods
that Keras provides.

The general idea for this example is that you'll first load the data, then define the network, and
then finally train the network.

Building a Neural Network in Keras


Here are some core concepts you need to know for working with Keras.

Sequential Model
from keras.models import Sequential

#Create the Sequential model


model = Sequential()
The keras.models.Sequential class is a wrapper for the neural network model that treats the
network as a sequence of layers. It implements the Keras model interface with common
methods like  compile() ,  fit() , and  evaluate()  that are used to train and run the model.
We'll cover these functions soon, but first let's start looking at the layers of the model.

Layers
The Keras Layer class provides a common interface for a variety of standard neural network
layers. There are fully connected layers, max pool layers, activation layers, and more. You can
add a layer to a model using the model's  add()  method. For example, a simple model with a
single hidden layer might look like this:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
# X has shape (num_rows, num_cols), where the training data are stored
# as row vectors
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)

# y must have an output vector for each input vector


y = np.array([[0], [0], [0], [1]], dtype=np.float32)

# Create the Sequential model


model = Sequential()

# 1st Layer - Add an input layer of 32 nodes with the same input shape as
# the training samples in X
model.add(Dense(32, input_dim=X.shape[1]))

# Add a softmax activation layer


model.add(Activation('softmax'))

# 2nd Layer - Add a fully connected output layer


model.add(Dense(1))

# Add a sigmoid activation layer


model.add(Activation('sigmoid'))
Keras requires the input shape to be specified in the first layer, but it will automatically infer
the shape of all other layers. This means you only have to explicitly set the input dimensions
for the first layer.

The first (hidden) layer from above,  model.add(Dense(32, input_dim=X.shape[1])) ,


creates 32 nodes which each expect to receive 2-element vectors as inputs. Each layer takes
the outputs from the previous layer as inputs and pipes through to the next layer. This chain of
passing output to the next layer continues until the last layer, which is the output of the model.
We can see that the output has dimension 1.

The activation "layers" in Keras are equivalent to specifying an activation function in the
Dense layers (e.g.,  model.add(Dense(128)); model.add(Activation('softmax'))  is
computationally equivalent to  model.add(Dense(128, activation="softmax")) ), but it is
common to explicitly separate the activation layers because it allows direct access to the
outputs of each layer before the activation is applied (which is useful in some model
architectures).

Once we have our model built, we need to compile it before it can be run. Compiling the Keras
model calls the backend (TensorFlow, Theano, etc.) and binds the optimizer, loss function, and
other parameters required before the model can be run on any input data. We'll specify the loss
function to be  categorical_crossentropy , which can be used with one-hot encoded labels
(here there are two classes), and specify  adam  as the optimizer (which is a reasonable
default when speed is a priority). And finally, we can specify what metrics we want to
evaluate the model with. Here we'll use accuracy.

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
We can see the resulting model architecture with the following command:

model.summary()
The model is trained with the  fit()  method, through the following command that specifies
the number of training epochs and the message level (how much information we want
displayed on the screen during training).

model.fit(X, y, nb_epoch=1000, verbose=0)


Note: In Keras 1,  nb_epoch  sets the number of epochs, but in Keras 2 this changes to the
keyword  epochs .

Finally, we can use the following command to evaluate the model:

model.evaluate(X, y)
Pretty simple, right? Let's put it into practice.

Quiz
Let's start with the simplest example. In this quiz you will build a simple multi-layer
feedforward neural network to solve the XOR problem.

1. Set the first layer to a  Dense()  layer with an output width of 8 nodes and
the  input_dim  set to the size of the training samples (in this case 2).
2. Add a  tanh  activation function.
3. Set the output layer width to 1, since the output has only two classes. (We can use 0 for one
class and 1 for the other)
4. Use a  sigmoid  activation function after the output layer.
5. Run the model for 50 epochs.

This should give you an accuracy of 50%. That's ok, but certainly not great. Out of 4 input
points, we're correctly classifying only 2 of them. Let's try to change some parameters around
to improve. For example, you can increase the number of epochs. You'll pass this quiz if you
get 75% accuracy. Can you reach 100%?

To get started, review the Keras documentation about models and layers.
The Keras example of a Multi-Layer Perceptron network is similar to what you need to do
here. Use that as a guide, but keep in mind that there will be a number of differences.
Start Quiz:
network.py network_solution.py
import numpy as np
from keras.utils import np_utils
import tensorflow as tf
# Using TensorFlow 1.0.0; use tf.python_io in later versions
tf.python.control_flow_ops = tf

# Set random seed
np.random.seed(42)

# Our data
X = np.array([[0,0],[0,1],[1,0],[1,1]]).astype('float32')
y = np.array([[0],[1],[1],[0]]).astype('float32')

# Initial Setup for Keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# One-hot encoding the output
y = np_utils.to_categorical(y)

# Building the model
xor = Sequential()

# Add required layers
# xor.add()

# Specify loss as "binary_crossentropy", optimizer as "adam",
# and add the accuracy metric
# xor.compile()

# Uncomment this line to print the model architecture
# xor.summary()

# Fitting the model
history = xor.fit(X, y, nb_epoch=50, verbose=0)

# Scoring the model
score = xor.evaluate(X, y)
print("\nAccuracy: ", score[-1])

# Checking the predictions
print("\nPredictions:")
print(xor.predict_proba(X))
Start Quiz:
network.py network_solution.py
import numpy as np
from keras.utils import np_utils
import tensorflow as tf
tf.python.control_flow_ops = tf

# Set random seed
np.random.seed(42)

# Our data
X = np.array([[0,0],[0,1],[1,0],[1,1]]).astype('float32')
y = np.array([[0],[1],[1],[0]]).astype('float32')

# Initial Setup for Keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Flatten

# One-hot encoding the output
y = np_utils.to_categorical(y)

# Building the model
xor = Sequential()
xor.add(Dense(32, input_dim=2))
xor.add(Activation("tanh"))
xor.add(Dense(2))
xor.add(Activation("sigmoid"))

xor.compile(loss="categorical_crossentropy", optimizer="adam",
            metrics=['accuracy'])

# Uncomment this line to print the model architecture
# xor.summary()

# Fitting the model
history = xor.fit(X, y, nb_epoch=1000, verbose=0)

# Scoring the model
score = xor.evaluate(X, y)
print("\nAccuracy: ", score[-1])

# Checking the predictions
print("\nPredictions:")
print(xor.predict_proba(X))
03. Pre-Lab: Student Admissions in Keras
Mini Project: Student Admissions in Keras
So, now we're ready to use Keras with real data. We'll now build a neural network
which analyzes the dataset of student admissions at UCLA that we've previously
studied.

As you follow along with this lesson, you are encouraged to work in the referenced
Jupyter notebooks at the end of the page. We will present a solution to you, but
please try creating your own deep learning models! Much of the value in this
experience will come from playing around with the code in your own way.

Workspace
To open this notebook, you have two options:

 Go to the next page in the classroom (recommended)


 Clone the repo from Github and open the
notebook StudentAdmissionsKeras.ipynb in the student_admissions_keras folder.
You can either download the repository with  git clone
https://github.com/udacity/deep-learning.git , or download it as an archive file
from this link.

Instructions
This is more of a follow-along lab. We'll show you the steps to build the network.
However, at the end of the lab you'll be given the opportunity to improve the model,
and try to improve on its performance. Here are the main steps in this lab.

Studying the data

The dataset has the following columns:

 Student GPA (grades)


 Score on the GRE (test)
 Class rank (1-4)

First, let's start by looking at the data. For that, we'll use the read_csv function in
pandas.

import pandas as pd
data = pd.read_csv('student_data.csv')
print(data)
Here we can see that the first column is the label  y , which corresponds to
acceptance/rejection. Namely, a label of  1  means the student got accepted, and a
label of  0  means the student got rejected.

When we plot the data, we get the following graphs, which show that unfortunately,
the data is not as nicely separable as we'd hope:
So one thing we can do is make one graph for each of the 4 ranks. In that case, we get
this:
Pre-processing the data

Ok, there's a bit more hope here. It seems like the better grades and test scores the
student has, the more likely they are to be accepted. And the rank has something to
do with it. So what we'll do is, we'll one-hot encode the rank, and our 6 input variables
will be:

 Test (GRE)
 Grades (GPA)
 Rank 1
 Rank 2
 Rank 3
 Rank 4.

The last 4 inputs will be binary variables that have a value of 1 if the student has that
rank, or 0 otherwise.
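As a quick sketch of what one-hot encoding the rank does — the lesson's notebook uses pandas'  get_dummies  for this, and the rank values below are made up for illustration:

```python
import numpy as np

# Hypothetical rank column with values 1-4
ranks = np.array([1, 3, 4, 2])

# Compare each rank against the possible values 1..4;
# each row gets a single 1 in the column matching its rank
one_hot = (ranks[:, None] == np.arange(1, 5)).astype(int)
```

So a student with rank 1 becomes the row [1, 0, 0, 0], a student with rank 4 becomes [0, 0, 0, 1], and so on, giving the four binary rank inputs described above.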

So, first things first, let's notice that the test scores have a range of 800, while the
grades have a range of 4. This is a huge discrepancy, and it will affect our training.
Normally, the best thing to do is to normalize the scores so they are between 0 and 1.
We can do this as follows:

data["gre"] = data["gre"]/800
data["gpa"] = data["gpa"]/4.0
Now, we split our data into the input  X  and the labels  y , and one-hot encode the output
so it appears as two classes (accepted and not accepted).

X = np.array(data)[:,1:]
y = keras.utils.to_categorical(np.array(data["admit"]))
Building the model architecture

And finally, we define the model architecture. We can use different architectures, but
here's an example:

model = Sequential()
model.add(Dense(128, input_dim=6))
model.add(Activation('sigmoid'))
model.add(Dense(32))
model.add(Activation('sigmoid'))
model.add(Dense(2))
model.add(Activation('sigmoid'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',
metrics=['accuracy'])
model.summary()
The error function is given by  categorical_crossentropy , which is the one we've
been using, but there are other options. There are several optimizers which you can
choose from in order to improve your training. Here we use adam, but there are
others that are useful, such as rmsprop. These use a variety of techniques that we'll
outline in upcoming pages in this lesson.

The model summary will tell us the following:


Training the model
Now, we train the model, with 1000 epochs. Don't worry about the batch_size, we'll
learn about it soon.

model.fit(X_train, y_train, epochs=1000, batch_size=100, verbose=0)
Evaluating the model
And finally, we can evaluate our model.

score = model.evaluate(X_train, y_train)


Results may vary, but you should get somewhere over 70% accuracy.

And there you go, you've trained your first neural network to analyze a dataset. Now,
in the following pages, you'll learn many techniques to improve the training process.
04. Lab: Student Admissions in Keras
Workspace

This section contains a workspace (such as a Jupyter Notebook workspace or an online code editor workspace) that cannot be automatically reproduced here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity uploads the workspace files to https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a

05. Optimizers in Keras


Keras Optimizers
There are many optimizers in Keras that we encourage you to explore further, in this link or
in this excellent blog post. These optimizers use a combination of the tricks above, plus a few
others. Some of the most common are:

SGD
This is Stochastic Gradient Descent. It uses the following parameters:

 Learning rate.
 Momentum (This takes the weighted average of the previous steps, in order to get a bit of
momentum and go over bumps, as a way to not get stuck in local minima).
 Nesterov Momentum (This slows down the gradient when it's close to the solution).
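As a toy sketch (not Keras' implementation), classical momentum keeps a velocity that accumulates a decaying sum of past gradient steps:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update: the velocity carries a decaying
    memory of past steps, smoothing the descent over bumps."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Starting from zero velocity, the first step is just -lr * grad
w, v = np.array([1.0]), np.array([0.0])
w, v = sgd_momentum_step(w, np.array([0.5]), v)
```

On later steps the momentum term keeps the update moving in the direction of recent gradients, which is what helps the optimizer roll over small bumps instead of getting stuck in them.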

Adam
Adam (Adaptive Moment Estimation) uses a more complicated exponential decay that consists
of not just considering the average (first moment), but also the variance (second moment) of
the previous steps.

RMSProp
RMSProp (RMS stands for Root Mean Square) decreases the learning rate by dividing it by an
exponentially decaying average of squared gradients.
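To make these update rules concrete, here is a toy numpy sketch of one momentum step and one RMSProp step. This is an illustration under simplified assumptions — the function names and hyperparameter values are made up, and Keras's actual implementations differ in detail:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    # Momentum: keep a weighted average of previous steps so the update
    # can roll over small bumps instead of getting stuck in local minima.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def rmsprop_step(w, grad, avg_sq, lr=0.1, decay=0.9, eps=1e-8):
    # RMSProp: divide the learning rate by an exponentially decaying
    # average of squared gradients.
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, np.array([0.5, 0.5]), v)
print(w)  # [ 0.95 -2.05]
```

Adam combines both ideas, tracking decaying averages of the gradients (first moment) and of their squares (second moment).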
07. Pre-Lab: IMDB Data in Keras
Mini Project: Using Keras to analyze IMDB Movie Data
Now, you're ready to shine! In this project, we will analyze a dataset from IMDB and use it to
predict the sentiment of a review.

Workspace
To open this notebook, you have two options:

 Go to the next page in the classroom (recommended)


 Clone the repo from Github and open the notebook IMDB_in_Keras.ipynb in
the imdb_keras folder. You can either download the repository
with  git clone https://github.com/udacity/deep-learning.git , or download it as an
archive file from this link.

Instructions
In this lab, we will preprocess the data for you, and you'll be in charge of building and training
the model in Keras.

The dataset

This lab uses a dataset of 25,000 IMDB reviews. Each review comes with a label. A label of 0
is given to a negative review, and a label of 1 is given to a positive review. The goal of this lab
is to create a model that will predict the sentiment of a review, based on the words in the
review. You can see more information about this dataset in the Keras website.

Now, the input already comes preprocessed for us for convenience. Each review is encoded as
a sequence of indexes, corresponding to the words in the review. The words are ordered by
frequency, so the integer 1 corresponds to the most frequent word ("the"), the integer 2 to the
second most frequent word, etc. By convention, the integer 0 is reserved for padding, and
out-of-vocabulary words are mapped to the oov_char index (2 with the default arguments below).

Then, the sentence is turned into a vector by simply concatenating these integers. For instance,
if the sentence is "To be or not to be." and the indices of the words are as follows:

 "to": 5
 "be": 8
 "or": 21
 "not": 3

Then the sentence gets encoded as the vector  [5,8,21,3,5,8] .

Loading the data

The data comes preloaded in Keras, which means we don't need to open or read any files
manually. The command to load it is the following, which will actually split the words into
training and testing sets and labels!:

from keras.datasets import imdb


(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
The meanings of all of these arguments are here. But in a nutshell, the most important ones
are:

 num_words: Top most frequent words to consider. This is useful if you don't want to consider
very obscure words such as "Ultracrepidarian."
 skip_top: Top words to ignore. This is useful if you don't want to consider the most common
words. For example, the word "the" would add no information to the review, so we can skip it
by setting  skip_top  to 2 or higher.

Pre-processing the data

We first prepare the data by one-hot encoding it into (0,1)-vectors as follows: If, for example,
we have 10 words in our vocabulary, and the vector is (4,1,8), we'll turn it into the vector
(1,0,0,1,0,0,0,1,0,0).
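Following the example above (word indices starting at 1, vocabulary of 10), a rough numpy sketch of this encoding could look like this. The helper name one_hot_review is made up for illustration; the actual lab uses Keras utilities:

```python
import numpy as np

def one_hot_review(sequence, vocab_size):
    # Word index k (counting from 1) switches on position k-1 of the vector;
    # repeated words still produce a single 1.
    vec = np.zeros(vocab_size, dtype=int)
    for idx in sequence:
        vec[idx - 1] = 1
    return vec

print(one_hot_review([4, 1, 8], 10))  # [1 0 0 1 0 0 0 1 0 0]
```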

Building the model

Now it's your turn to use all you've learned! You can build a neural network using Keras, train
it, and evaluate it! Make sure you also use methods such as dropout or regularization, and
good Keras optimizers to do this. A good accuracy to aim for is 85%. Can your model achieve
this?

Help

This is a self-assessed lab. If you need any help or want to check your answers, feel free to
check out the solutions notebook in the same folder, or click here.
08. Lab: IMDB Data in Keras
Workspace

This section contains a workspace (it can be a Jupyter Notebook workspace or an online
code editor workspace, etc.) that cannot be automatically downloaded and reproduced here.
Please access the classroom with your account and manually download the workspace to your
local machine. Note that for some courses, Udacity uploads the workspace files
onto https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:


 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
Part 02-Module 01-Lesson 08_TensorFlow
01. Intro

Hi! It's Luis again!

Intro to TensorFlow
Now that you are an expert in Neural Networks with Keras, you're more than ready to learn
TensorFlow. In the following sections of this Nanodegree Program, you will be using Keras
and TensorFlow alternately. Keras is great for building neural networks quickly, but it
abstracts a lot of the details. TensorFlow is great for understanding how neural networks
operate on a lower level. This lesson will teach you what you need to know of TensorFlow,
and give you some exercises to practice.

This lesson builds on the knowledge from the Deep Neural Networks lesson. If you
need to refresh your memory on any of the topics, such as Linear Functions, Softmax, Cross
Entropy, Batching, or Epochs, feel free to go back and watch them again.

 Linear Functions
 Softmax
 Cross Entropy
 Batching and Epochs

Enjoy!
02. Installing TensorFlow

Throughout this lesson, you'll apply your knowledge of neural networks on real datasets
using TensorFlow (link for China), an open source Deep Learning library created by Google.

You’ll use TensorFlow to classify images from the notMNIST dataset - a dataset of images of
English letters from A to J. You can see a few example images below.
Your goal is to automatically detect the letter based on the image in the dataset. You’ll be
working on your own computer for this lab, so, first things first, install TensorFlow!

Install
As usual, we'll be using Conda to install TensorFlow. You might already have a TensorFlow
environment, but check to make sure you have all the necessary packages.

OS X or Linux
Run the following commands to setup your environment:

conda create -n tensorflow python=3.5


source activate tensorflow
conda install pandas matplotlib jupyter notebook scipy scikit-learn
pip install tensorflow
Windows
To install on Windows, run the following commands in your console or Anaconda shell:

conda create -n tensorflow python=3.5


activate tensorflow
conda install pandas matplotlib jupyter notebook scipy scikit-learn
pip install tensorflow
Hello, world!
Try running the following code in your Python console to make sure you have TensorFlow
properly installed. The console will print "Hello World!" if TensorFlow is installed. Don't
worry about understanding what it does. You'll learn about it in the next section.

import tensorflow as tf

# Create TensorFlow object called tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)
04. Quiz: TensorFlow Linear Function
Linear functions in TensorFlow
The most common operation in neural networks is calculating the linear combination of inputs,
weights, and biases. As a reminder, we can write the output of the linear operation as

y = xW + b

Here, W is a matrix of the weights connecting two layers. The output y, the input x, and the
biases b are all vectors.
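In plain numpy, the same operation looks like this (the numbers are made up for illustration):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])   # one sample with 3 features
W = np.full((3, 2), 0.5)          # weights connecting 3 features to 2 outputs
b = np.array([0.1, -0.1])         # one bias per output

y = x @ W + b   # xW + b: matrix multiply, then add the bias vector
print(y)        # [[ 3.1  2.9]]
```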
Weights and Bias in TensorFlow
The goal of training a neural network is to modify weights and biases to best predict the labels.
In order to use weights and bias, you'll need a Tensor that can be modified. This leaves
out  tf.placeholder()  and  tf.constant() , since those Tensors can't be modified. This is
where the  tf.Variable  class comes in.

tf.Variable()
x = tf.Variable(5)
The  tf.Variable  class creates a tensor with an initial value that can be modified, much like a
normal Python variable. This tensor stores its state in the session, so you must initialize the
state of the tensor manually. You'll use the  tf.global_variables_initializer()  function
to initialize the state of all the Variable tensors.

Initialization
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
The  tf.global_variables_initializer()  call returns an operation that will initialize all
TensorFlow variables from the graph. You call the operation using a session to initialize all the
variables as shown above. Using the  tf.Variable  class allows us to change the weights and
bias, but an initial value needs to be chosen.

Initializing the weights with random numbers from a normal distribution is good practice.
Randomizing the weights helps prevent the model from becoming stuck in the same place every
time you train it. You'll learn more about this in the next lesson, when you study gradient
descent.

Similarly, choosing weights from a normal distribution prevents any one weight from
overwhelming other weights. You'll use the  tf.truncated_normal()  function to generate
random numbers from a normal distribution.

tf.truncated_normal()
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
The  tf.truncated_normal()  function returns a tensor with random values from a normal
distribution whose magnitude is no more than 2 standard deviations from the mean.

Since the weights are already helping prevent the model from getting stuck, you don't need to
randomize the bias. Let's use the simplest solution, setting the bias to 0.

tf.zeros()
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
The  tf.zeros()  function returns a tensor with all zeros.
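For intuition, here is a rough numpy imitation of what  tf.truncated_normal()  produces — redraw any sample that lands more than 2 standard deviations from the mean. This is a sketch of the behavior, not TensorFlow's actual implementation:

```python
import numpy as np

def truncated_normal(shape, mean=0.0, stddev=1.0, seed=0):
    # Draw from a normal distribution, then redraw any value whose
    # distance from the mean exceeds 2 standard deviations.
    rng = np.random.default_rng(seed)
    vals = rng.normal(mean, stddev, size=shape)
    mask = np.abs(vals - mean) > 2 * stddev
    while mask.any():
        vals[mask] = rng.normal(mean, stddev, size=mask.sum())
        mask = np.abs(vals - mean) > 2 * stddev
    return vals

weights = truncated_normal((120, 5))  # same shape as the example above
```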

Linear Classifier Quiz

A subset of the MNIST dataset

You'll be classifying the handwritten numbers  0 ,  1 , and  2  from the MNIST dataset using
TensorFlow. The above is a small sample of the data you'll be training on. Notice how some of
the  1 s are written with a serif at the top and at different angles. The similarities and
differences will play a part in shaping the weights of the model.

Left: Weights for labeling 0. Middle: Weights for labeling 1. Right: Weights for labeling 2.
The images above are trained weights for each label ( 0 ,  1 , and  2 ). The weights display the
unique properties of each digit they have found. Complete this quiz to train your own weights
using the MNIST dataset.

Instructions

1. Open quiz.py.
   1. Implement  get_weights  to return a  tf.Variable  of weights
   2. Implement  get_biases  to return a  tf.Variable  of biases
   3. Implement  xW + b  in the  linear  function
2. Open sandbox.py
   1. Initialize all weights

Since  xW  in  xW + b  is matrix multiplication, you have to use the  tf.matmul()  function
instead of  tf.multiply() . Don't forget that order matters in matrix multiplication,
so  tf.matmul(a,b)  is not the same as  tf.matmul(b,a) .

Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Solution is available in the other "sandbox_solution.py" tab
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from quiz import get_weights, get_biases, linear


def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    # TODO: Initialize session variables

    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

    # Print loss
    print('Loss: {}'.format(l))
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Solution is available in the other "quiz_solution.py" tab
import tensorflow as tf


def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # TODO: Return weights
    pass


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # TODO: Return biases
    pass


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # TODO: Linear Function (xW + b)
    pass
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf


def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    return tf.Variable(tf.zeros(n_labels))


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # Linear Function (xW + b)
    return tf.add(tf.matmul(input, w), b)
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py  sandbox_solution.py
# Sandbox Solution
# Note: You can't run code in this tab
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from quiz import get_weights, get_biases, linear


def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())

    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

    # Print loss
    print('Loss: {}'.format(l))
05. Quiz: TensorFlow Softmax
TensorFlow Softmax
The softmax function squashes its inputs, typically called logits or logit scores, to be between
0 and 1, and also normalizes the outputs so that they all sum to 1. This means the output of
the softmax function is equivalent to a categorical probability distribution. It's the perfect
function to use as the output activation for a network predicting multiple classes.

Example of the softmax function at work.

TensorFlow Softmax
We're using TensorFlow to build neural networks and, appropriately, there's a function for
calculating softmax.

x = tf.nn.softmax([2.0, 1.0, 0.2])


Easy as that!  tf.nn.softmax()  implements the softmax function for you. It takes in logits
and returns softmax activations.

Quiz
Use the softmax function in the quiz below to return the softmax of the logits.

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf


def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)

    # TODO: Calculate the softmax of the logits
    # softmax =

    with tf.Session() as sess:
        # TODO: Feed in the logit data
        # output = sess.run(softmax, )
        pass

    return output
Start Quiz:
quiz.py solution.py
# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf


def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)

    softmax = tf.nn.softmax(logits)

    with tf.Session() as sess:
        output = sess.run(softmax, feed_dict={logits: logit_data})

    return output
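For reference, the same computation in plain numpy (a sketch of the standard numerically stable formulation, not TensorFlow's internals):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max before exponentiating avoids overflow;
    # the shift cancels out in the ratio.
    exps = np.exp(np.array(logits) - np.max(logits))
    return exps / exps.sum()

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # roughly [0.659 0.242 0.099], summing to 1
```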
06. Quiz: TensorFlow Cross Entropy
Cross Entropy in TensorFlow
As with the softmax function, TensorFlow has a function to do the cross entropy calculations
for us.

Cross entropy loss function

Let's take what you learned from the video and create a cross entropy function in TensorFlow.
To create a cross entropy function in TensorFlow, you'll need to use two new functions:

 tf.reduce_sum()
 tf.log()

Reduce Sum
x = tf.reduce_sum([1, 2, 3, 4, 5]) # 15
The  tf.reduce_sum()  function takes an array of numbers and sums them together.

Natural Log
x = tf.log(100.0) # 4.60517
This function does exactly what you would expect it to do.  tf.log()  takes the natural log of
a number.

Quiz
Print the cross entropy using  softmax_data  and  one_hot_data .

(Alternative link for users in China.)

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# TODO: Print cross entropy from session


Start Quiz:
quiz.py solution.py
# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# TODO: Print cross entropy from session
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))

with tf.Session() as sess:
    print(sess.run(cross_entropy, feed_dict={softmax: softmax_data, one_hot: one_hot_data}))
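You can check the solution's arithmetic in plain numpy:

```python
import numpy as np

softmax_data = np.array([0.7, 0.2, 0.1])
one_hot_data = np.array([1.0, 0.0, 0.0])

# Only the "hot" class contributes, so this reduces to -ln(0.7).
cross_entropy = -np.sum(one_hot_data * np.log(softmax_data))
print(cross_entropy)  # ~0.3567
```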
07. Quiz: Mini-batch
Mini-batching
In this section, you'll go over what mini-batching is and how to apply it in TensorFlow.

Mini-batching is a technique for training on subsets of the dataset instead of all the data at one
time. This provides the ability to train a model, even if a computer lacks the memory to store
the entire dataset.

Mini-batching is computationally inefficient, since you can't calculate the loss simultaneously
across all samples. However, this is a small price to pay in order to be able to run the model at
all.

It's also quite useful combined with SGD. The idea is to randomly shuffle the data at the start
of each epoch, then create the mini-batches. For each mini-batch, you train the network
weights with gradient descent. Since these batches are random, you're performing SGD with
each batch.

Let's look at the MNIST dataset with weights and a bias to see if your machine can handle it.

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
Question 1

Calculate the memory size of  train_features ,  train_labels ,  weights , and  bias  in
bytes. Ignore memory for overhead, just calculate the memory required for the stored data.

You may have to look up how much memory a float32 requires, using this link.

train_features Shape: (55000, 784) Type: float32


train_labels Shape: (55000, 10) Type: float32

weights Shape: (784, 10) Type: float32

bias Shape: (10,) Type: float32

QUESTION:

How many bytes of memory does  train_features  need?

ANSWER:
SOLUTION:
NOTE: The solutions are expressed in RegEx pattern. Udacity uses these patterns to
check the given answer

QUESTION:

How many bytes of memory does  train_labels  need?

ANSWER:
SOLUTION:
QUESTION:

How many bytes of memory does  weights  need?

ANSWER:
SOLUTION:
QUESTION:

How many bytes of memory does  bias  need?


ANSWER:
SOLUTION:
The total memory space required for the inputs, weights and bias is around 174
megabytes, which isn't that much memory. You could train this whole dataset on most
CPUs and GPUs.

But larger datasets that you'll use in the future are measured in gigabytes or more. It's
possible to purchase more memory, but it's expensive. A Titan X GPU with 12 GB of
memory costs over $1,000.

Instead, in order to run large models on your machine, you'll learn how to use mini-
batching.

Let's look at how you implement mini-batching in TensorFlow.

TensorFlow Mini-batching
In order to use mini-batching, you must first divide your data into batches.

Unfortunately, it's sometimes impossible to divide the data into batches of exactly
equal size. For example, imagine you'd like to create batches of 128 samples each
from a dataset of 1000 samples. Since 128 does not evenly divide into 1000, you'd
wind up with 7 batches of 128 samples, and 1 batch of 104 samples. (7*128 + 1*104 =
1000)

In that case, the size of the batches would vary, so you need to take advantage of
TensorFlow's  tf.placeholder()  function to receive the varying batch sizes.

Continuing the example, if each sample had  n_input = 784  features and  n_classes
= 10  possible labels, the dimensions for  features  would be  [None,
n_input]  and  labels  would be  [None, n_classes] .

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])
What does  None  do here?

The  None  dimension is a placeholder for the batch size. At runtime, TensorFlow will
accept any batch size greater than 0.
Going back to our earlier example, this setup allows you to
feed  features  and  labels  into the model as either the batches of 128 samples or
the single batch of 104 samples.

Question 2

Using the parameters below, how many batches are there, and what is the last batch
size?

features is (50000, 400)

labels is (50000, 10)

batch_size is 128

QUESTION:

How many batches are there?

ANSWER:
SOLUTION:
QUESTION:

What is the last batch size?

ANSWER:
SOLUTION:
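The arithmetic behind these two answers can be sketched in a few lines of Python, plugging in the quiz parameters above:

```python
import math

n_samples = 50000
batch_size = 128

# The partial batch at the end still counts as a batch.
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (n_batches - 1) * batch_size
```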
Now that you know the basics, let's learn how to implement mini-batching.

Question 3

Implement the  batches  function to batch  features  and  labels . The function should
return each batch with a maximum size of  batch_size . To help you with the quiz, look
at the following example output of a working  batches  function.

# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

example_batches = batches(3, example_features, example_labels)


The  example_batches  variable would be the following:

[
    # 2 batches:
    #   First is a batch of size 3.
    #   Second is a batch of size 1
    [
        # First Batch is size 3
        [
            # 3 samples of features.
            # There are 4 features per sample.
            ['F11', 'F12', 'F13', 'F14'],
            ['F21', 'F22', 'F23', 'F24'],
            ['F31', 'F32', 'F33', 'F34']
        ], [
            # 3 samples of labels.
            # There are 2 labels per sample.
            ['L11', 'L12'],
            ['L21', 'L22'],
            ['L31', 'L32']
        ]
    ], [
        # Second Batch is size 1.
        # Since batch size is 3, there is only one sample left from the 4 samples.
        [
            # 1 sample of features.
            ['F41', 'F42', 'F43', 'F44']
        ], [
            # 1 sample of labels.
            ['L41', 'L42']
        ]
    ]
]
Implement the  batches  function in the "quiz.py" file below.
Start Quiz:
sandbox.py  quiz.py  quiz_solution.py
from quiz import batches
from pprint import pprint

# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

# PPrint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))
Start Quiz:
sandbox.py quiz.py quiz_solution.py
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    # TODO: Implement batching
    pass

Start Quiz:
sandbox.py  quiz.py  quiz_solution.py
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
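As a quick sanity check, the same slicing logic runs in plain Python with made-up toy data (no TensorFlow needed):

```python
def batches(batch_size, features, labels):
    # Slice features and labels in lock-step, batch_size items at a time.
    assert len(features) == len(labels)
    output_batches = []
    for start_i in range(0, len(features), batch_size):
        end_i = start_i + batch_size
        output_batches.append([features[start_i:end_i], labels[start_i:end_i]])
    return output_batches

result = batches(3, [0, 1, 2, 3], ['a', 'b', 'c', 'd'])
print(result)  # [[[0, 1, 2], ['a', 'b', 'c']], [[3], ['d']]]
```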

Let's use mini-batching to feed batches of MNIST features and labels into a linear
model.

Set the batch size and run the optimizer over all the batches with
the  batches  function. The recommended batch size is 128. If you have memory
restrictions, feel free to make it smaller.
Start Quiz:
quiz.py  helper.py  quiz_solution.py
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# TODO: Set batch size
batch_size = None
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    # TODO: Train optimizer on all batches
    # for batch_features, batch_labels in ______
    sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

    print('Test Accuracy: {}'.format(test_accuracy))


Start Quiz:
quiz.py helper.py quiz_solution.py
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
Start Quiz:
quiz.py  helper.py  quiz_solution.py
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    # Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

    print('Test Accuracy: {}'.format(test_accuracy))

The accuracy is low, but you can improve it by training on the dataset multiple times.
You'll go over this subject in the next section, where we talk about "epochs".
08. Epochs
Epochs
An epoch is a single forward and backward pass of the whole dataset. Epochs are used to
increase the accuracy of the model without requiring more data. This section covers epochs in
TensorFlow and how to choose the right number of epochs.
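Stripped of the TensorFlow machinery, the idea is simple: one epoch is one full pass over all the mini-batches of the training set, and repeating epochs keeps improving the parameters on the same data. Here is a minimal NumPy sketch; the one-weight model and the data are made up purely for illustration:

```python
import numpy as np

# Toy data: learn y = 2x with a single weight (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X

w = 0.0             # single trainable parameter
learning_rate = 0.1
batch_size = 10
epochs = 10

for epoch in range(epochs):
    # One epoch = one full pass over all mini-batches
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = np.mean(2 * (w * xb - yb) * xb)   # d(MSE)/dw on this batch
        w -= learning_rate * grad

print('w after {} epochs: {:.3f}'.format(epochs, w))  # converges toward 2.0
```

Each additional epoch revisits the same 100 examples, yet the weight keeps moving toward the true value of 2.0: more epochs, better fit, no extra data.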

The following TensorFlow code trains a model using 10 epochs.

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches  # Helper function created in Mini-batching section

def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 10
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

    print('Test Accuracy: {}'.format(test_accuracy))


Running the code will output the following:

Epoch: 0 - Cost: 11.0 Valid Accuracy: 0.204
Epoch: 1 - Cost: 9.95 Valid Accuracy: 0.229
Epoch: 2 - Cost: 9.18 Valid Accuracy: 0.246
Epoch: 3 - Cost: 8.59 Valid Accuracy: 0.264
Epoch: 4 - Cost: 8.13 Valid Accuracy: 0.283
Epoch: 5 - Cost: 7.77 Valid Accuracy: 0.301
Epoch: 6 - Cost: 7.47 Valid Accuracy: 0.316
Epoch: 7 - Cost: 7.2 Valid Accuracy: 0.328
Epoch: 8 - Cost: 6.96 Valid Accuracy: 0.342
Epoch: 9 - Cost: 6.73 Valid Accuracy: 0.36
Test Accuracy: 0.3801000118255615
Each epoch attempts to move to a lower cost, leading to better accuracy.

This model continues to improve accuracy up to Epoch 9. Let's increase the number of epochs
to 100.

...
Epoch: 79 - Cost: 0.111 Valid Accuracy: 0.86
Epoch: 80 - Cost: 0.11 Valid Accuracy: 0.869
Epoch: 81 - Cost: 0.109 Valid Accuracy: 0.869
....
Epoch: 85 - Cost: 0.107 Valid Accuracy: 0.869
Epoch: 86 - Cost: 0.107 Valid Accuracy: 0.869
Epoch: 87 - Cost: 0.106 Valid Accuracy: 0.869
Epoch: 88 - Cost: 0.106 Valid Accuracy: 0.869
Epoch: 89 - Cost: 0.105 Valid Accuracy: 0.869
Epoch: 90 - Cost: 0.105 Valid Accuracy: 0.869
Epoch: 91 - Cost: 0.104 Valid Accuracy: 0.869
Epoch: 92 - Cost: 0.103 Valid Accuracy: 0.869
Epoch: 93 - Cost: 0.103 Valid Accuracy: 0.869
Epoch: 94 - Cost: 0.102 Valid Accuracy: 0.869
Epoch: 95 - Cost: 0.102 Valid Accuracy: 0.869
Epoch: 96 - Cost: 0.101 Valid Accuracy: 0.869
Epoch: 97 - Cost: 0.101 Valid Accuracy: 0.869
Epoch: 98 - Cost: 0.1 Valid Accuracy: 0.869
Epoch: 99 - Cost: 0.1 Valid Accuracy: 0.869
Test Accuracy: 0.8696000006198883
From looking at the output above, you can see the model doesn't increase the validation
accuracy after epoch 80. Let's see what happens when we increase the learning rate.

learn_rate = 0.1

Epoch: 76 - Cost: 0.214 Valid Accuracy: 0.752
Epoch: 77 - Cost: 0.21 Valid Accuracy: 0.756
Epoch: 78 - Cost: 0.21 Valid Accuracy: 0.756
...
Epoch: 85 - Cost: 0.207 Valid Accuracy: 0.756
Epoch: 86 - Cost: 0.209 Valid Accuracy: 0.756
Epoch: 87 - Cost: 0.205 Valid Accuracy: 0.756
Epoch: 88 - Cost: 0.208 Valid Accuracy: 0.756
Epoch: 89 - Cost: 0.205 Valid Accuracy: 0.756
Epoch: 90 - Cost: 0.202 Valid Accuracy: 0.756
Epoch: 91 - Cost: 0.207 Valid Accuracy: 0.756
Epoch: 92 - Cost: 0.204 Valid Accuracy: 0.756
Epoch: 93 - Cost: 0.206 Valid Accuracy: 0.756
Epoch: 94 - Cost: 0.202 Valid Accuracy: 0.756
Epoch: 95 - Cost: 0.2974 Valid Accuracy: 0.756
Epoch: 96 - Cost: 0.202 Valid Accuracy: 0.756
Epoch: 97 - Cost: 0.2996 Valid Accuracy: 0.756
Epoch: 98 - Cost: 0.203 Valid Accuracy: 0.756
Epoch: 99 - Cost: 0.2987 Valid Accuracy: 0.756
Test Accuracy: 0.7556000053882599
Looks like the learning rate was increased too much. The final accuracy was lower, and it
stopped improving earlier. Let's stick with the previous learning rate, but change the number of
epochs to 80.

Epoch: 65 - Cost: 0.122 Valid Accuracy: 0.868
Epoch: 66 - Cost: 0.121 Valid Accuracy: 0.868
Epoch: 67 - Cost: 0.12 Valid Accuracy: 0.868
Epoch: 68 - Cost: 0.119 Valid Accuracy: 0.868
Epoch: 69 - Cost: 0.118 Valid Accuracy: 0.868
Epoch: 70 - Cost: 0.118 Valid Accuracy: 0.868
Epoch: 71 - Cost: 0.117 Valid Accuracy: 0.868
Epoch: 72 - Cost: 0.116 Valid Accuracy: 0.868
Epoch: 73 - Cost: 0.115 Valid Accuracy: 0.868
Epoch: 74 - Cost: 0.115 Valid Accuracy: 0.868
Epoch: 75 - Cost: 0.114 Valid Accuracy: 0.868
Epoch: 76 - Cost: 0.113 Valid Accuracy: 0.868
Epoch: 77 - Cost: 0.113 Valid Accuracy: 0.868
Epoch: 78 - Cost: 0.112 Valid Accuracy: 0.868
Epoch: 79 - Cost: 0.111 Valid Accuracy: 0.868
Epoch: 80 - Cost: 0.111 Valid Accuracy: 0.869
Test Accuracy: 0.86909999418258667
The accuracy only reached 0.86, but that could be because the learning rate was too high.
Lowering the learning rate would require more epochs, but could ultimately achieve better
accuracy.
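The learning-rate trade-off seen above can be reproduced with plain gradient descent on a one-dimensional quadratic; this is a toy stand-in for a cost surface, not the MNIST model itself. A step size that is too large keeps overshooting the minimum, while a moderate one steadily converges:

```python
import numpy as np

def descend(learning_rate, steps, x0=10.0):
    """Gradient descent on f(x) = x**2 (gradient is 2x); returns final |x|."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return abs(x)

# A moderate rate shrinks the error quickly
print(descend(learning_rate=0.1, steps=100))

# A rate near 1.0 overshoots: x flips sign every step and barely shrinks
print(descend(learning_rate=0.99, steps=100))
```

With the moderate rate the error is multiplied by 0.8 each step; with the aggressive rate it is multiplied by -0.98, so the iterate oscillates around the minimum and after the same 100 steps is still far from it. That is the same pattern as the stalled 0.756 validation accuracy above.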

In the upcoming TensorFlow Lab, you'll get the opportunity to choose your own learning rate,
epoch count, and batch size to improve the model's accuracy.
09. Pre-Lab: NotMNIST in TensorFlow
TensorFlow Neural Network Lab

TensorFlow Lab
We've prepared a Jupyter notebook that will guide you through the process of creating a single
layer neural network in TensorFlow. You'll implement data normalization, then build and train
the network with TensorFlow.

Getting the notebook


The notebook and all related files are available from our GitHub repository. Either clone the
repository or download it as a Zip file.

Use Git to clone the repository.

git clone https://github.com/udacity/deep-learning.git


If you're unfamiliar with Git and GitHub, I highly recommend checking out our course. If
you'd rather not use Git, you can download the repository as a Zip archive. You can find the
repo here.

View The Notebook


In the directory with the notebook file, start your Jupyter notebook server

jupyter notebook
This should open a browser window for you. If it doesn't, go to http://localhost:8888/tree.
The port number might be different if you have other notebook servers running, so try 8889
instead of 8888 if you can't find the right server.
You should see the notebook  intro_to_tensorflow.ipynb ; this is the notebook you'll be
working on. The notebook has 3 problems for you to solve:

 Problem 1: Normalize the features
 Problem 2: Use TensorFlow operations to create features, labels, weight, and biases tensors
 Problem 3: Tune the learning rate, number of steps, and batch size for the best accuracy

This is a self-assessed lab. Compare your answers to the solutions here. If you have any
difficulty completing the lab, Udacity provides a few services to answer any questions you
might have.

Help
Remember that you can get assistance from your mentor, the Forums (click the link on the left
side of the classroom), or the Slack channel. You can also review the concepts from the
previous lessons.
10. Lab: NotMNIST in TensorFlow
Workspace

This section contains a workspace (a Jupyter Notebook workspace, an online code editor
workspace, etc.) that cannot be reproduced here. Access the classroom with your account and
download the workspace to your local machine. Note that for some courses, Udacity uploads the
workspace files to https://github.com/udacity, so you may be able to download them there.

Workspace Information:

 Default file path:
 Workspace type: jupyter
 Opened files (when workspace is loaded): n/a
11. Two-layer Neural Network

Multilayer Neural Networks


In the previous lessons and the lab, you learned how to build a neural network of one layer.
Now, you'll learn how to build multilayer neural networks with TensorFlow. Adding a hidden
layer to a network allows it to model more complex functions. Also, using a non-linear
activation function on the hidden layer lets it model non-linear functions.

The first thing we'll learn to implement in TensorFlow is a ReLU hidden layer. A ReLU
(rectified linear unit) is a non-linear function: it outputs 0 for negative inputs
and x for all inputs x > 0.
As before, the following nodes build on the knowledge from the Deep Neural
Networks lesson. If you need to refresh your mind, you can go back and watch them again.

 ReLU
 Feedforward
 Dropout
12. Quiz: TensorFlow ReLUs
TensorFlow ReLUs
TensorFlow provides the ReLU function as  tf.nn.relu() , as shown below.

# Hidden Layer with ReLU activation function


hidden_layer = tf.add(tf.matmul(features, hidden_weights), hidden_biases)
hidden_layer = tf.nn.relu(hidden_layer)

output = tf.add(tf.matmul(hidden_layer, output_weights), output_biases)


The above code applies the  tf.nn.relu()  function to the  hidden_layer , zeroing out any
negative activations and acting like an on/off switch for each unit. Adding additional layers,
like the  output  layer, after an activation function turns the model into a nonlinear function.
This nonlinearity allows the network to solve more complex problems.
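If it helps to see the function outside of TensorFlow, ReLU is just an element-wise maximum with zero. A plain NumPy sketch (illustrative values only):

```python
import numpy as np

def relu(x):
    """ReLU: 0 for negative inputs, x for all inputs x > 0."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # negative values become 0, positive values pass through unchanged
```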

Quiz
Below you'll use the ReLU function to turn a linear single layer network into a non-linear
multilayer network.

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model

# TODO: Print session results


# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print session results
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits))
13. Deep Neural Network in TensorFlow
Deep Neural Network in TensorFlow
You've seen how to build a logistic classifier using TensorFlow. Now you're going to see how
to use the logistic classifier to build a deep neural network.

Step by Step
In the following walkthrough, we'll step through TensorFlow code written to classify the
digits in the MNIST database. If you would like to run the network on your computer, the file
is provided here. You can find this and many more examples of TensorFlow at Aymeric
Damien's GitHub repository.

Code
TensorFlow MNIST
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)
You'll use the MNIST dataset provided by TensorFlow, which batches and One-Hot encodes
the data for you.

Learning Parameters
import tensorflow as tf

# Parameters
learning_rate = 0.001
training_epochs = 20
batch_size = 128 # Decrease batch size if you don't have enough memory
display_step = 1

n_input = 784 # MNIST data input (img shape: 28*28)


n_classes = 10 # MNIST total classes (0-9 digits)
The focus here is on the architecture of multilayer neural networks, not parameter tuning, so
here we'll just give you the learning parameters.

Hidden Layer Parameters


n_hidden_layer = 256 # layer number of features
The variable  n_hidden_layer  determines the size of the hidden layer in the neural network.
This is also known as the width of a layer.

Weights and Biases


# Store layers weight & bias
weights = {
    'hidden_layer': tf.Variable(tf.random_normal([n_input, n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_hidden_layer, n_classes]))
}
biases = {
    'hidden_layer': tf.Variable(tf.random_normal([n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
Deep neural networks use multiple layers, with each layer requiring its own weights and bias.
The  'hidden_layer'  weights and bias are for the hidden layer. The  'out'  weights and bias
are for the output layer. If the neural network were deeper, there would be weights and biases
for each additional layer.

Input
# tf Graph input
x = tf.placeholder("float", [None, 28, 28, 1])
y = tf.placeholder("float", [None, n_classes])

x_flat = tf.reshape(x, [-1, n_input])


The MNIST data is made up of 28px by 28px images with a single channel.
The  tf.reshape()  function above reshapes the 28px by 28px matrices in  x  into row vectors
of 784 values.

Multilayer Perceptron

# Hidden layer with RELU activation


layer_1 = tf.add(tf.matmul(x_flat, weights['hidden_layer']),\
biases['hidden_layer'])
layer_1 = tf.nn.relu(layer_1)
# Output layer with linear activation
logits = tf.add(tf.matmul(layer_1, weights['out']), biases['out'])
You've seen the linear function  tf.add(tf.matmul(x_flat, weights['hidden_layer']),
biases['hidden_layer'])  before, also known as  xW + b . Combining linear functions
with a ReLU between them gives you a two-layer network.
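To make the shapes concrete, the same two-layer forward pass can be written in plain NumPy; the weight values below are made up purely for illustration:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # one sample with 4 features
w_hidden = np.full((4, 3), 0.1)        # 4 inputs -> 3 hidden units
b_hidden = np.zeros(3)
w_out = np.full((3, 2), 0.1)           # 3 hidden units -> 2 outputs
b_out = np.zeros(2)

hidden = np.maximum(0, x @ w_hidden + b_hidden)  # linear layer xW + b, then ReLU
logits = hidden @ w_out + b_out                  # linear output layer

print(hidden.shape, logits.shape)  # (1, 3) (1, 2)
```

The hidden layer maps 4 features to 3 units, and the output layer maps those 3 units to 2 logits, exactly mirroring the  weights['hidden_layer']  and  weights['out']  shapes above.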

Optimizer
# Define loss and optimizer
cost = tf.reduce_mean(\
tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
.minimize(cost)
This is the same optimization technique used in the Intro to TensorFlow lab.

Session
# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
The MNIST library in TensorFlow provides the ability to receive the dataset in batches.
Calling the  mnist.train.next_batch()  function returns a subset of the training data.

Deeper Neural Network

That's it! Going from one layer to two is easy. Adding more layers to the network allows you
to solve more complicated problems.
14. Save and Restore TensorFlow Models
Save and Restore TensorFlow Models
Training a model can take hours. But once you close your TensorFlow session, you lose all the
trained weights and biases. If you were to reuse the model in the future, you would have to
train it all over again!

Fortunately, TensorFlow gives you the ability to save your progress using a class
called  tf.train.Saver . This class provides the functionality to save any  tf.Variable  to
your file system.

Saving Variables
Let's start with a simple example of saving  weights  and  bias  Tensors. For the first example
you'll just save two variables. Later examples will save all the weights in a practical model.

import tensorflow as tf

# The file path to save the data
save_file = './model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

# Class used to save and/or restore Tensor Variables
saver = tf.train.Saver()

with tf.Session() as sess:
    # Initialize all the Variables
    sess.run(tf.global_variables_initializer())

    # Show the values of weights and bias
    print('Weights:')
    print(sess.run(weights))
    print('Bias:')
    print(sess.run(bias))

    # Save the model
    saver.save(sess, save_file)
Weights:
[[-0.97990924 1.03016174 0.74119264]
 [-0.82581609 -0.07361362 -0.86653847]]
Bias:
[ 1.62978125 -0.37812829 0.64723819]


The Tensors  weights  and  bias  are set to random values using
the  tf.truncated_normal()  function. The values are then saved to the  save_file  location,
"model.ckpt", using the  tf.train.Saver.save()  function. (The ".ckpt" extension stands for
"checkpoint".)

If you're using TensorFlow 0.11.0RC1 or newer, a file called "model.ckpt.meta" will also be
created. This file contains the TensorFlow graph.

Loading Variables
Now that the Tensor Variables are saved, let's load them back into a new model.

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

# Class used to save and/or restore Tensor Variables
saver = tf.train.Saver()

with tf.Session() as sess:
    # Load the weights and bias
    saver.restore(sess, save_file)

    # Show the values of weights and bias
    print('Weights:')
    print(sess.run(weights))
    print('Bias:')
    print(sess.run(bias))
Weights:
[[-0.97990924 1.03016174 0.74119264]
 [-0.82581609 -0.07361362 -0.86653847]]
Bias:
[ 1.62978125 -0.37812829 0.64723819]

You'll notice you still need to create the  weights  and  bias  Tensors in Python.
The  tf.train.Saver.restore()  function loads the saved data into  weights  and  bias .

Since  tf.train.Saver.restore()  sets all the TensorFlow Variables, you don't need to
call  tf.global_variables_initializer() .
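Conceptually, a checkpoint is just the variables' values written to disk under their names and read back later. As an analogy only (NumPy's  .npz  format, not the actual  tf.train.Saver  checkpoint format), the round trip looks like this:

```python
import os
import tempfile
import numpy as np

# "Train": produce some weights and a bias
weights = np.random.randn(2, 3)
bias = np.random.randn(3)

# "Save": write the arrays to a file, keyed by name
path = os.path.join(tempfile.mkdtemp(), 'model.npz')
np.savez(path, weights=weights, bias=bias)

# "Restore": in a later session, load the values back by name
ckpt = np.load(path)
restored_weights = ckpt['weights']
restored_bias = ckpt['bias']

print(np.allclose(weights, restored_weights))  # True
```

The important shared idea is that values are stored and retrieved by name, which is exactly what makes the naming pitfalls in the Finetuning section below possible.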

Save a Trained Model


Let's see how to train a model and save its weights.

First start with a model:

# Remove previous Tensors and Operations
tf.reset_default_graph()

from tensorflow.examples.tutorials.mnist import input_data
import numpy as np

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('.', one_hot=True)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(\
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Let's train that model, then save the weights:

import math

save_file = './train_model.ckpt'
batch_size = 128
n_epochs = 100

saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training cycle
    for epoch in range(n_epochs):
        total_batch = math.ceil(mnist.train.num_examples / batch_size)

        # Loop over all batches
        for i in range(total_batch):
            batch_features, batch_labels = mnist.train.next_batch(batch_size)
            sess.run(
                optimizer,
                feed_dict={features: batch_features, labels: batch_labels})

        # Print status for every 10 epochs
        if epoch % 10 == 0:
            valid_accuracy = sess.run(
                accuracy,
                feed_dict={
                    features: mnist.validation.images,
                    labels: mnist.validation.labels})
            print('Epoch {:<3} - Validation Accuracy: {}'.format(
                epoch,
                valid_accuracy))

    # Save the model
    saver.save(sess, save_file)
    print('Trained Model Saved.')
Epoch 0 - Validation Accuracy: 0.06859999895095825
Epoch 10 - Validation Accuracy: 0.20239999890327454
Epoch 20 - Validation Accuracy: 0.36980000138282776
Epoch 30 - Validation Accuracy: 0.48820000886917114
Epoch 40 - Validation Accuracy: 0.5601999759674072
Epoch 50 - Validation Accuracy: 0.6097999811172485
Epoch 60 - Validation Accuracy: 0.6425999999046326
Epoch 70 - Validation Accuracy: 0.6733999848365784
Epoch 80 - Validation Accuracy: 0.6916000247001648
Epoch 90 - Validation Accuracy: 0.7113999724388123
Trained Model Saved.

Load a Trained Model


Let's load the weights and bias from memory, then check the test accuracy.

saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    saver.restore(sess, save_file)

    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: mnist.test.images, labels: mnist.test.labels})

print('Test Accuracy: {}'.format(test_accuracy))

Test Accuracy: 0.7229999899864197

That's it! You now know how to save and load a trained model in TensorFlow. Let's look at
loading weights and biases into modified models in the next section.
15. Finetuning
Loading the Weights and Biases into a New Model
Sometimes you might want to adjust, or "finetune" a model that you have already trained and
saved.

However, loading saved Variables directly into a modified model can generate errors. Let's go
over how to avoid these problems.

Naming Error
TensorFlow uses a string identifier for Tensors and Operations called  name . If a name is not
given, TensorFlow will create one automatically. TensorFlow will give the first node the
name  <Type> , and then give the name  <Type>_<number>  for the subsequent nodes. Let's see
how this can affect loading a model with a different order of  weights  and  bias :

import tensorflow as tf

# Remove the previous weights and bias
tf.reset_default_graph()

save_file = 'model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]))
weights = tf.Variable(tf.truncated_normal([2, 3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - ERROR
    saver.restore(sess, save_file)
The code above prints out the following:

Save Weights: Variable:0
Save Bias: Variable_1:0
Load Weights: Variable_1:0
Load Bias: Variable:0

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to
match.

You'll notice that the  name  properties for  weights  and  bias  are different than when you
saved the model. This is why the code produces the "Assign requires shapes of both tensors to
match" error. The code  saver.restore(sess, save_file)  is trying to load weight data
into  bias  and bias data into  weights .
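One way to picture the failure: a checkpoint behaves like a name-to-array mapping, and auto-generated names depend only on declaration order. A plain-Python sketch of the mismatch (an analogy for the behavior, not the Saver's actual implementation):

```python
import numpy as np

# Checkpoint written with auto-generated names: weights declared first, bias second
checkpoint = {
    'Variable:0': np.zeros((2, 3)),    # weights were saved under the first auto name
    'Variable_1:0': np.zeros(3),       # bias under the second
}

# The new graph declares bias FIRST, so the auto-generated names are swapped:
bias_name = 'Variable:0'       # bias now claims the first auto name
weights_name = 'Variable_1:0'  # weights get the second

# Restoring by name would try to assign the wrong shapes
print(checkpoint[bias_name].shape)     # (2, 3), but bias expects (3,)
print(checkpoint[weights_name].shape)  # (3,), but weights expect (2, 3)
```

The shapes no longer line up, which is exactly the "Assign requires shapes of both tensors to match" error above.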

Instead of letting TensorFlow set the  name  property, let's set it manually:

import tensorflow as tf

tf.reset_default_graph()

save_file = 'model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - No Error
    saver.restore(sess, save_file)
    print('Loaded Weights and Bias successfully.')

Save Weights: weights_0:0
Save Bias: bias_0:0
Load Weights: weights_0:0
Load Bias: bias_0:0
Loaded Weights and Bias successfully.

That worked! The Tensor names match and the data loaded correctly.
16. Quiz: TensorFlow Dropout
TensorFlow Dropout

Figure 1: Taken from the paper "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting" (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)

Dropout is a regularization technique for reducing overfitting. The technique temporarily
drops units (artificial neurons) from the network, along with all of those units' incoming and
outgoing connections. Figure 1 illustrates how dropout works.

TensorFlow provides the  tf.nn.dropout()  function, which you can use to implement
dropout.

Let's look at an example of how to use  tf.nn.dropout() .

keep_prob = tf.placeholder(tf.float32)  # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

The code above illustrates how to apply dropout to a neural network.

The  tf.nn.dropout()  function takes in two parameters:

1. hidden_layer : the tensor to which you would like to apply dropout
2. keep_prob : the probability of keeping (i.e. not dropping) any given unit

keep_prob  allows you to adjust the number of units to drop. In order to compensate for
dropped units,  tf.nn.dropout()  multiplies all units that are kept (i.e. not dropped)
by  1/keep_prob .

During training, a good starting value for  keep_prob  is  0.5 .

During testing, use a  keep_prob  value of  1.0  to keep all units and maximize the power of
the model.
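The 1/keep_prob scaling (often called "inverted dropout") is easy to verify in plain NumPy. This sketch illustrates the idea rather than TensorFlow's implementation:

```python
import numpy as np

def dropout(x, keep_prob, training=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob,
    then scale the kept units by 1/keep_prob."""
    if not training:  # at test time (keep_prob = 1.0), pass everything through
        return x
    mask = np.random.rand(*x.shape) < keep_prob
    return x * mask / keep_prob  # scaling keeps the expected output equal to x

np.random.seed(0)
x = np.ones((1000, 100))
out = dropout(x, keep_prob=0.5)

# Roughly half the units are zeroed, the rest are doubled,
# so the mean output stays close to the mean input
print(out.mean())
```

Because the kept units are scaled up during training, no rescaling is needed at test time; you simply keep every unit, which is why  keep_prob  is set to 1.0 for evaluation.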

Quiz 1
Take a look at the code snippet below. Do you see what's wrong?

There's nothing wrong with the syntax, however the test accuracy is extremely low.

...

keep_prob = tf.placeholder(tf.float32)  # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i in range(batches):
            ....

            sess.run(optimizer, feed_dict={
                features: batch_features,
                labels: batch_labels,
                keep_prob: 0.5})

    validation_accuracy = sess.run(accuracy, feed_dict={
        features: test_features,
        labels: test_labels,
        keep_prob: 0.5})
What's wrong with the above code?

 Dropout doesn't work with batching.
 The keep_prob value of 0.5 is too low.
 There shouldn't be a value passed to keep_prob when testing for accuracy.
 keep_prob should be set to 1.0 when evaluating validation accuracy.

SOLUTION:keep_prob should be set to 1.0 when evaluating validation accuracy.


Quiz 2
This quiz will be starting with the code from the ReLU Quiz and applying a dropout layer.
Build a model with a ReLU layer and dropout layer using the  keep_prob  placeholder to pass
in a probability of  0.5 . Print the logits from the model.

Note: Output will be different every time the code is run. This is caused by dropout
randomizing the units it drops.

Start Quiz:
quiz.py solution.py
# Solution is available in the other "solution.py" tab
import tensorflow as tf

hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[0.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model with Dropout

# TODO: Print logits from a session


# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf

hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[0.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model with Dropout
keep_prob = tf.placeholder(tf.float32)
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print logits from a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits, feed_dict={keep_prob: 0.5}))
