
Arrangement:

I. Introduction

1. Define NN: Neural networks (NNs) learn the relationship between cause and effect, or organize large
volumes of data into orderly and informative patterns.

Neural network: information processing paradigm inspired by biological nervous systems, such as our brain. Structure: large number of highly interconnected processing elements (neurons) working together. Like people, they learn from experience (by example).

“Data processing system consisting of a large number of simple, highly interconnected processing elements (artificial neurons) in an architecture inspired by the structure of the cerebral cortex of the brain” (Tsoukalas & Uhrig, 1997).
2. ANNs have been developed as generalizations of mathematical models of neural
biology, based on the following assumptions:
(1) Information processing occurs at many simple elements called neurons.
(2) Signals are passed between neurons over connection links.
(3) Each connection link has an associated weight, which, in a typical neural net, multiplies the signal transmitted.
(4) Each neuron applies an activation function to its net input to determine its output signal.

3. Building Blocks: Neurons

First, we have to talk about neurons, the basic unit of a neural network. A neuron takes
inputs, does some math with them, and produces one output. Here’s what a 2-input
neuron looks like:
Definition:
A neural network looks at data and tries to figure out the function, or set of calculations, that turns the input (variables) into the output.

That output could be a number, a probability, or even something a bit more complicated.
Neural networks are analogous to robots that can learn to make things, like a toy car, not by
following step-by-step instructions from humans, but by looking at a bunch of toy cars and
figuring out for themselves how to turn inputs (like metal and plastic) into outputs (the toy cars).

If we want to work with data instead of toy cars we can use a neural network to predict future
salary based on a number of variables such as degree, field, age, years of experience, gender,
number of promotions, and university.

We feed these variables to the neural network. These circles are called Nodes, and they just
hold a value like degree or field.

Eventually we want the Neural Network to output its prediction for future salary. So we know
there will be one output node at the end of our network that tells us what it predicts the salary
will be.

It’s pretty clear what the input nodes are, since they’re values we understand. And the output
node is a salary, so we get that too. But it can be harder to grasp exactly what the middle
layers represent. You can think of all the calculations that happen between the input nodes
and output nodes as something called “feature generation”. “Feature” is just a fancy word for
a variable that can be made up of a combination of other variables. For example we could use
your grades, attendance, and test scores to create a “Feature” called Academic Performance.
Essentially the neural network is taking the variables we give it, and performing
combinations and calculations to create new values, or “features”. Then, it combines those
“features” to create an output. When we have large amounts of complex data, the neural
network saves us a LOT of time by combining variables and figuring out which ones are
important. Neural Networks allow us to make use of data that might seem too big and
overwhelming for us to try to use on its own. They can find patterns that humans might never
be able to see.
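To make “feature generation” concrete, here is a tiny sketch in Python. The variable names and weights below (academic_performance and friends) are made up purely for illustration; a real network learns its own combinations and weights rather than being handed them:

import numpy as np

# Raw input variables for one student (made-up numbers for illustration)
grades = 0.85        # normalized grade average
attendance = 0.92    # fraction of classes attended
test_scores = 0.78   # normalized test score

# A hand-crafted "feature": a weighted combination of the raw inputs.
# A neural network's hidden layer learns weights like these automatically.
weights = np.array([0.5, 0.2, 0.3])
inputs = np.array([grades, attendance, test_scores])
academic_performance = np.dot(weights, inputs)

print(academic_performance)  # 0.843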

A layer can be one of three types. The first is the input layer. The input layer accepts the data and passes it to the rest of the network.

The second type of layer is the hidden layer. A network has one or more hidden layers; the neural network in this example has one.

Hidden layers are the part actually responsible for the excellent performance and complexity of neural networks. They perform multiple functions at the same time, such as data transformation and automatic feature creation.

The last type of layer is the output layer. The output layer holds the result, or output, of the problem. Raw images get passed to the input layer, and in the output layer we receive a prediction of whether the vehicle is an emergency or non-emergency vehicle.
3 things are happening here. First, each input is multiplied by a weight:

x1 → x1 * w1
x2 → x2 * w2

Next, all the weighted inputs are added together with a bias b:

(x1 * w1) + (x2 * w2) + b

Finally, the sum is passed through an activation function:

y = f((x1 * w1) + (x2 * w2) + b)

The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function:

f(x) = 1 / (1 + e^(−x))

The sigmoid function only outputs numbers in the range (0,1). You can think of it as compressing (−∞,+∞) to (0,1) — big negative numbers become ~0, and big positive numbers become ~1.
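To see that squashing in action, here is a quick standalone NumPy check (a sketch, not part of the network above):

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

print(sigmoid(-10))  # ~0.000045  (big negative -> ~0)
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ~0.999955  (big positive -> ~1)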

A Simple Example

Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:

w = [0, 1]
b = 4

w = [0, 1] is just a way of writing w1 = 0, w2 = 1 in vector form. Now, let’s give the neuron an input of x = [2, 3]. We’ll use the dot product to write things more concisely:

(w · x) + b = (w1 * x1) + (w2 * x2) + b = (0 * 2) + (1 * 3) + 4 = 7
y = f((w · x) + b) = f(7) = 0.999

The neuron outputs 0.999 given the inputs x=[2,3]. That’s it! This
process of passing inputs forward to get an output is known
as feedforward.

Coding a Neuron

Time to implement a neuron! We’ll use NumPy, a popular and powerful computing library for Python, to help us do math:
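The snippet below is a minimal sketch of such a neuron (the Neuron class and sigmoid helper are illustrative names, and the parameters are the ones from the worked example above):

import numpy as np

def sigmoid(x):
    # Sigmoid activation: f(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weight the inputs, add the bias, then apply the activation function
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)

x = np.array([2, 3])        # x1 = 2, x2 = 3
print(n.feedforward(x))     # 0.9990889488055994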
Recognize those numbers? That’s the example we just did! We get
the same answer of 0.999.

2. Combining Neurons into a Neural Network

A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:

This network has 2 inputs, a hidden layer with 2 neurons (h1 and h2
), and an output layer with 1 neuron (o1). Notice that the inputs
for o1 are the outputs from h1 and h2 — that’s what makes this a
network.

A hidden layer is any layer between the input (first) layer and
output (last) layer. There can be multiple hidden layers!

An Example: Feedforward

Let’s use the network pictured above and assume all neurons have
the same weights w=[0,1], the same bias b=0, and the same sigmoid
activation function. Let h1, h2, o1 denote the outputs of the neurons
they represent.

What happens if we pass in the input x=[2, 3]?

h1 = h2 = f((w · x) + b) = f((0 * 2) + (1 * 3) + 0) = f(3) = 0.9526
o1 = f((w · [h1, h2]) + b) = f((0 * 0.9526) + (1 * 0.9526) + 0) = f(0.9526) = 0.7216

The output of the neural network for input x = [2, 3] is 0.7216. Pretty simple, right?

A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this post.
pictured above for the rest of this post.

Coding a Neural Network: Feedforward

Let’s implement feedforward for our neural network. Here’s the image of the network again for reference:
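Here is a minimal sketch of that feedforward pass, reusing the Neuron class from the earlier sketch (again, the class and variable names are illustrative):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        return sigmoid(np.dot(self.weights, inputs) + self.bias)

class OurNeuralNetwork:
    # 2 inputs, a hidden layer with 2 neurons (h1, h2), and an output neuron (o1).
    # Each neuron has the same weights w = [0, 1] and bias b = 0, as assumed above.
    def __init__(self):
        weights = np.array([0, 1])
        bias = 0
        self.h1 = Neuron(weights, bias)
        self.h2 = Neuron(weights, bias)
        self.o1 = Neuron(weights, bias)

    def feedforward(self, x):
        out_h1 = self.h1.feedforward(x)
        out_h2 = self.h2.feedforward(x)
        # The inputs for o1 are the outputs from h1 and h2
        return self.o1.feedforward(np.array([out_h1, out_h2]))

network = OurNeuralNetwork()
x = np.array([2, 3])
print(network.feedforward(x))  # ~0.7216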
We got 0.7216 again! Looks like it works.

3. Training a Neural Network, Part 1

Say we have the following measurements:

Let’s train our network to predict someone’s gender given their weight and height:
We’ll represent Male with a 0 and Female with a 1, and we’ll also
shift the data to make it easier to use:

Loss

Before we train our network, we first need a way to quantify how “good” it’s doing so that it can try to do “better”. That’s what the loss is.

We’ll use the mean squared error (MSE) loss:

MSE = (1/n) * Σ (y_true − y_pred)²

where the sum runs over all n samples.

Let’s break this down:

• n is the number of samples, which is 4 (Alice, Bob, Charlie, Diana).
• y represents the variable being predicted, which is Gender.
• y_true is the true value of the variable (the “correct answer”). For example, y_true for Alice would be 1 (Female).
• y_pred is the predicted value of the variable. It’s whatever our network outputs.

(y_true−y_pred)² is known as the squared error. Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be!

Better predictions = Lower loss.

Training a network = trying to minimize its loss.

An Example Loss Calculation

Let’s say our network always outputs 0 — in other words, it’s confident all humans are Male 🤔. What would our loss be? Since each squared error is (y_true − 0)² = y_true, the loss is just the fraction of samples labeled 1 (Female): with Alice and Diana labeled 1 and Bob and Charlie labeled 0, the squared errors are 1, 0, 0, 1, and the MSE is 0.5.
Code: MSE Loss

Here’s some code to calculate loss for us:
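A minimal sketch of that calculation in NumPy, assuming (as in the example above) that Alice and Diana are labeled 1 and Bob and Charlie are labeled 0:

import numpy as np

def mse_loss(y_true, y_pred):
    # y_true and y_pred are numpy arrays of the same length
    return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])   # Alice, Bob, Charlie, Diana
y_pred = np.array([0, 0, 0, 0])   # the network always outputs 0

print(mse_loss(y_true, y_pred))   # 0.5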


If you don’t understand why this code works, read the NumPy quickstart on array operations.

Nice. Onwards!


4. Training a Neural Network, Part 2

We now have a clear goal: minimize the loss of the neural network. We know we can change the network’s weights and biases to influence its predictions, but how do we do so in a way that decreases loss?

This section uses a bit of multivariable calculus. If you’re not comfortable with calculus, feel free to skip over the math parts.

For simplicity, let’s pretend we only have Alice in our dataset:

Then the mean squared error loss is just Alice’s squared error:

L = (1 − y_pred)²

Another way to think about loss is as a function of weights and biases. Let’s label each weight and bias in our network: w1 and w2 are the weights into h1 (with bias b1), w3 and w4 are the weights into h2 (with bias b2), and w5 and w6 are the weights into o1 (with bias b3). Then, we can write loss as a multivariable function:

L(w1, w2, w3, w4, w5, w6, b1, b2, b3)

Imagine we wanted to tweak w1. How would loss L change if we changed w1? That’s a question the partial derivative ∂L/∂w1 can answer. How do we calculate it?

Here’s where the math starts to get more complex. Don’t be discouraged! I recommend getting a pen and paper to follow along — it’ll help you understand.

If you have trouble reading this: the formatting for the math
below looks better in the original post on victorzhou.com.

To start, let’s rewrite the partial derivative in terms of ∂y_pred/∂w1 instead:

∂L/∂w1 = (∂L/∂y_pred) * (∂y_pred/∂w1)

This works because of the Chain Rule.

We can calculate ∂L/∂y_pred because we computed L = (1 − y_pred)² above:

∂L/∂y_pred = −2 * (1 − y_pred)

Now, let’s figure out what to do with ∂y_pred/∂w1. Just like before, let h1, h2, o1 be the outputs of the neurons they represent. Then

y_pred = o1 = f(w5 * h1 + w6 * h2 + b3)

f is the sigmoid activation function, remember?

Since w1 only affects h1 (not h2), we can write

∂y_pred/∂w1 = (∂y_pred/∂h1) * (∂h1/∂w1)
∂y_pred/∂h1 = w5 * f′(w5 * h1 + w6 * h2 + b3)

More Chain Rule.

We do the same thing for ∂h1/∂w1:

h1 = f(w1 * x1 + w2 * x2 + b1)
∂h1/∂w1 = x1 * f′(w1 * x1 + w2 * x2 + b1)

You guessed it, Chain Rule.


x1 here is weight, and x2 is height. This is the second time we’ve seen f′(x) (the derivative of the sigmoid function) now! Let’s derive it:

f(x) = 1 / (1 + e^(−x))
f′(x) = e^(−x) / (1 + e^(−x))² = f(x) * (1 − f(x))

We’ll use this nice form for f′(x) later.
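As a quick sanity check on that formula, we can compare it against a numerical (finite-difference) estimate of the derivative. This is just a standalone sketch, not something needed later:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    # f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1 - fx)

x, eps = 0.5, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(deriv_sigmoid(x), numeric)  # both ~0.2350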

We’re done! We’ve managed to break down ∂L/∂w1 into several parts we can calculate:

∂L/∂w1 = (∂L/∂y_pred) * (∂y_pred/∂h1) * (∂h1/∂w1)

This system of calculating partial derivatives by working backwards is known as backpropagation, or “backprop”.

Phew. That was a lot of symbols — it’s alright if you’re still a bit
confused. Let’s do an example to see this in action!

Example: Calculating the Partial Derivative

We’re going to continue pretending only Alice is in our dataset.

Let’s initialize all the weights to 1 and all the biases to 0. If we do a feedforward pass through the network, we get:

h1 = f(w1 * x1 + w2 * x2 + b1) = f(x1 + x2)
h2 = f(w3 * x1 + w4 * x2 + b2) = f(x1 + x2) = h1
o1 = f(w5 * h1 + w6 * h2 + b3) = f(h1 + h2) = 0.524

The network outputs y_pred = 0.524, which doesn’t strongly favor Male (0) or Female (1). Let’s calculate ∂L/∂w1 using the breakdown from before:

∂L/∂w1 = (∂L/∂y_pred) * (∂y_pred/∂h1) * (∂h1/∂w1)
Reminder: we derived f′(x)=f(x)∗(1−f(x)) for our sigmoid
activation function earlier.

We did it! This tells us that if we were to increase w1, L would increase a tiiiny bit as a result.

Training: Stochastic Gradient Descent

We have all the tools we need to train a neural network now! We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation:

w1 ← w1 − η * (∂L/∂w1)

η is a constant called the learning rate that controls how fast we train. All we’re doing is subtracting η * (∂L/∂w1) from w1:

• If ∂L/∂w1 is positive, w1 will decrease, which makes L decrease.
• If ∂L/∂w1 is negative, w1 will increase, which makes L decrease.

If we do this for every weight and bias in the network, the loss will
slowly decrease and our network will improve.
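In code, each update is a single line per parameter. A sketch, assuming learn_rate holds η and d_L_d_w1 holds the value of ∂L/∂w1 computed by backprop:

w1 = w1 - learn_rate * d_L_d_w1  # and likewise for every other weight and bias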

Our training process will look like this:


1. Choose one sample from our dataset. This is what makes
it stochastic gradient descent — we only operate on one
sample at a time.
2. Calculate all the partial derivatives of loss with respect to
weights or biases (e.g. ∂L/∂w1, ∂L/∂w2, etc).
3. Use the update equation to update each weight and bias.
4. Go back to step 1.

Let’s see it in action!

Code: A Complete Neural Network

It’s finally time to implement a complete neural network:
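The sketch below follows the structure described above (neurons h1, h2, o1 with weights w1–w6 and biases b1–b3) and the per-parameter SGD updates from the derivation. The dataset numbers are stand-ins for the shifted weight/height values (assumed here just so the sketch runs), and the learning rate, epoch count, and all names are illustrative choices rather than anything prescribed by this text:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    # f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1 - fx)

def mse_loss(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean()

class OurNeuralNetwork:
    # 2 inputs, hidden neurons h1 and h2, output neuron o1
    def __init__(self):
        # Random initial values for weights w1..w6 and biases b1..b3
        (self.w1, self.w2, self.w3, self.w4, self.w5, self.w6) = np.random.normal(size=6)
        (self.b1, self.b2, self.b3) = np.random.normal(size=3)

    def feedforward(self, x):
        h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
        h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
        return sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)

    def train(self, data, all_y_trues, learn_rate=0.1, epochs=1000):
        # learn_rate and epochs are arbitrary choices for this sketch
        for epoch in range(epochs):
            for x, y_true in zip(data, all_y_trues):
                # --- Feedforward, keeping the pre-activation sums for the derivatives
                sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
                h1 = sigmoid(sum_h1)
                sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
                h2 = sigmoid(sum_h2)
                sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
                y_pred = sigmoid(sum_o1)

                # --- Backprop: partial derivatives via the Chain Rule
                d_L_d_ypred = -2 * (y_true - y_pred)
                d_ypred_d_w5 = h1 * deriv_sigmoid(sum_o1)
                d_ypred_d_w6 = h2 * deriv_sigmoid(sum_o1)
                d_ypred_d_b3 = deriv_sigmoid(sum_o1)
                d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
                d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)
                d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
                d_h1_d_w2 = x[1] * deriv_sigmoid(sum_h1)
                d_h1_d_b1 = deriv_sigmoid(sum_h1)
                d_h2_d_w3 = x[0] * deriv_sigmoid(sum_h2)
                d_h2_d_w4 = x[1] * deriv_sigmoid(sum_h2)
                d_h2_d_b2 = deriv_sigmoid(sum_h2)

                # --- SGD update: parameter -= learn_rate * dL/dparameter
                self.w1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
                self.w2 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
                self.b1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1
                self.w3 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
                self.w4 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
                self.b2 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2
                self.w5 -= learn_rate * d_L_d_ypred * d_ypred_d_w5
                self.w6 -= learn_rate * d_L_d_ypred * d_ypred_d_w6
                self.b3 -= learn_rate * d_L_d_ypred * d_ypred_d_b3

            if epoch % 100 == 0:
                y_preds = np.apply_along_axis(self.feedforward, 1, data)
                print("Epoch %d loss: %.3f" % (epoch, mse_loss(all_y_trues, y_preds)))

# Dataset: (weight - average, height - average) per person -- illustrative numbers
data = np.array([
    [-2, -1],   # Alice
    [25, 6],    # Bob
    [17, 4],    # Charlie
    [-15, -6],  # Diana
])
all_y_trues = np.array([1, 0, 0, 1])  # 1 = Female, 0 = Male

network = OurNeuralNetwork()
network.train(data, all_y_trues)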

You can run / play with this code yourself. It’s also available
on Github.
Our loss steadily decreases as the network learns:

We can now use the network to predict genders:
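For instance, feeding the training samples from the sketch above back through the trained network (this assumes the network and data variables defined there) should give outputs near the true labels:

print(network.feedforward(data[0]))  # Alice: should be close to 1 (Female)
print(network.feedforward(data[1]))  # Bob: should be close to 0 (Male)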

Now What?

You made it! A quick recap of what we did:

• Introduced neurons, the building blocks of neural networks.
• Used the sigmoid activation function in our neurons.
• Saw that neural networks are just neurons connected together.
• Created a dataset with Weight and Height as inputs (or features) and Gender as the output (or label).
• Learned about loss functions and the mean squared error (MSE) loss.
• Realized that training a network is just minimizing its loss.
• Used backpropagation to calculate partial derivatives.
• Used stochastic gradient descent (SGD) to train our network.

There’s still much more to do:

• Experiment with bigger / better neural networks using proper machine learning libraries like Tensorflow, Keras, and PyTorch.
• Build your first neural network with Keras.
• Tinker with a neural network in your browser.
• Discover other activation functions besides sigmoid, like Softmax.
• Discover other optimizers besides SGD.
• Read my introduction to Convolutional Neural Networks (CNNs). CNNs revolutionized the field of Computer Vision and can be extremely powerful.
• Read my introduction to Recurrent Neural Networks (RNNs), which are often used for Natural Language Processing (NLP).

4. Neural networks are better than other methods for certain tasks, like image
recognition. The secret to their success is their hidden layers, and they’re
mathematically very elegant. Both of these reasons are why neural networks are one
of the most dominant machine learning technologies used today.
Not that long ago, a big challenge in AI was real-world image recognition, like
recognizing a dog from a cat, and a car from a plane from a boat.
Even though we do it every day, it’s really hard for computers.
That’s because computers are good at literal comparisons, like matching 0s and 1s,
one at a time. It’s easy for a computer to tell that these images are the same by
matching the pixels. But before AI, a computer couldn’t tell that these images are of
the same dog, and had no hope of telling that all of these different images are dogs.
So, a professor named Fei-Fei Li and a group of other machine learning and
computer vision researchers wanted to help the research community develop AI that
could recognize images. The first step was to create a huge public dataset of labeled
real-world photos. That way, computer scientists around the world could come up
with and test different algorithms. They called this dataset ImageNet.
It has 3.2 million labeled images, sorted into 5,247 nested categories of nouns.
Like for example, the “dog” label is nested under “domestic animal,” which is nested
under “animal.” Humans are the best at reliably labeling data.
But if one person did all this labeling, taking 10 seconds per label, without any sleep
or snack breaks, it would take them over a year!
So ImageNet used crowd-sourcing and leveraged the power of the Internet to
cheaply spread the work between thousands of people.
Once the data was in place, the researchers started an annual competition in 2010 to
get people to contribute their best solutions to image recognition.
Enter Alex Krizhevsky, who was a graduate student at the University of Toronto.
In 2012, he decided to apply a neural network to ImageNet, even though similar
solutions hadn’t been successful in the past.
His neural network, called AlexNet, had a couple of innovations that set it apart.
He used a lot of hidden layers, which we’ll get to in a minute.
He also used faster computation hardware to handle all the math that neural
networks do. AlexNet outperformed the next best approaches by over 10%.
It only got 3 out of every 20 images wrong.
In grade terms, it was getting a solid B while other techniques were scraping by with
a low C. Since 2012, neural network solutions have taken over the annual
competition, and the results keep getting better and better.
Plus, AlexNet sparked an explosion of research into neural networks, which we
started to apply to lots of things beyond image recognition.
To understand how neural networks can be used for these classification problems,
we have to understand their architecture first.
All neural networks are made up of an input layer, an output layer, and any number of
hidden layers in between. There are many different arrangements but we’ll use the
classic multi-layer perceptron as an example. The input layer is where the neural
network receives data represented as numbers. Each input neuron represents a
single feature, which is some characteristic of the data. Features are straightforward
if you’re talking about something that’s already a number, like grams of sugar in a
donut. But, really, just about anything can be converted to a number.
Sounds can be represented as the amplitudes of the sound wave. So each feature
would have a number that represents the amplitude at a moment in time.
Words in a paragraph can be represented by how many times each word appears.
So each feature would have the frequency of one word. Or, if we’re trying to label an
image of a dog, each feature would represent information about a pixel.
So for a grayscale image, each feature would have a number representing how bright
a pixel is. But for a color image, we can represent each pixel with three numbers: the
amount of red, green, and blue, which can be combined to make any color on your
computer screen.

Once the features have data, each one sends its number to every neuron in the next layer, called the hidden layer. Then, each hidden layer neuron
mathematically combines all the numbers it gets. The goal is to measure whether
the input data has certain components. For an image recognition problem, these
components may be a certain color in the center, a curve near the top, or even
whether the image contains eyes, ears, or fur. Instead of answering yes or no, like the
simple Perceptron from the previous episode, each neuron in the hidden layer does
some slightly more complicated math and outputs a number. And then, each neuron
sends its number to every neuron in the next layer, which could be another hidden
layer or the output layer. The output layer is where the final hidden layer outputs are
mathematically combined to answer the problem. So, let’s say we’re just trying to
label an image as a dog. We might have a single output neuron representing a single
answer - that the image is of a dog or not. But if there are many answers, like for
example if we’re labeling a bunch of images, we’ll need a lot of output neurons. Each
output neuron will correspond to the probability for each label -- like for example,
dog, car, spaghetti, and more. And then we can pick the answer with the highest
probability. The key to neural networks -- and really all of AI -- is math. And I get it. A
neural network kind of seems like a black box that does math and spits out an
answer. I mean, those middle layers are even called hidden layers! But we can
understand the gist of what’s happening by working through an example.

Oh, John Green-bot? Let’s give John Green-bot a program with a neural network that’s been
trained to recognize a dog in a grayscale photo. When we show him this photo first,
every feature will contain a number between 0 and 1 corresponding to the brightness
of one pixel. And it’ll pass this information to the hidden layer. Now, let’s focus on
one hidden layer neuron. Since the neural network is already trained, this neuron has
a mathematical formula to look for a particular component in the image, like a
specific curve in the center: the curve at the top of a dog’s nose. If this neuron is
focused on this specific shape and spot, it may not really care what’s happening
everywhere else. So it would multiply or weigh the pixel values from most of those
features by 0 or close to 0. Because it’s looking for bright pixels here, it would
multiply these pixel values by a positive weight. But this curve is also defined by a
darker part below. So the neuron would multiply these pixel values by a negative
weight. This hidden neuron will add all the weighted pixel values from the input
neurons and squish the result so that it’s between 0 and 1. The final number
basically represents the guess of this neuron thinking that a specific curve, aka a dog
nose, appeared in the image. Other hidden neurons are looking for other
components, like for example, a different curve in another part of the image , or a
fuzzy texture. When all of these neurons pass their estimates onto the next hidden
layer, those neurons may be trained to look for more complex components. Like, one
hidden neuron may check whether there’s a shape that might be a dog nose. It
probably doesn’t care about data from previous layers that looked for furry textures,
so it weights those by 0 or close to 0. But it may really care about neurons that
looked for the “top of the nose” and “bottom of the nose” and “nostrils”. It weights
those by large positive numbers. Again, it would add up all the weighted values from
the previous layer neurons, squish the value to be between 0 and 1, and pass this to
the next layer. That’s the gist of the math, but we’re simplifying a bit.
It’s important to know that neural networks don’t actually understand ideas like
“nose” or “eyelid.” Each neuron is doing a calculation on the data it’s given and just
flagging specific patterns of light and dark. After a few more hidden layers, we reach
the output layer with one neuron! So after one more weighted addition of the
previous layer’s data, which happens in the output neuron, the network should have a
good estimate if this image is a dog. Which means, John Green-bot should have a
decision. John Green-bot: Output neuron value: 0.93. Probability that this is a dog:
93%! Hey John Green-bot, nice job! Thinking about how a neural network would
process just one image makes it clearer why AI needs fast computers. Like I
mentioned before, each pixel in a color image will be represented by 3 numbers ---
how much red, green, and blue it has. So to process a 1000 by 1000 pixel image,
which in comparison is a small 3 by 3 inch photo, a neural network needs to look at 3
million features! AlexNet needed more than 60 million parameters (learned weights) to achieve this, which
is a ton of math and could take a lot of time to compute. Which is something we
should keep in mind when designing neural networks to solve problems.

People are
really excited about using deeper neural networks, which are networks with more
hidden layers, to do deep learning. Deep networks can combine input data in more
complex ways to look for more complex components, and solve trickier problems.
But we can’t make all networks a billion layers deep, because more hidden layers
means more math, which again means we need faster computers. Plus, as
a network gets deeper, it gets harder for us to make sense of why it’s giving
the answers it does. Each neuron in the first hidden layer is looking for some specific
component of the input data. But in deeper layers, those components get more
abstract from how humans would describe the same data. Now, this may not seem
like a big deal, but if a neural network was used to deny our loan request for example,
we’d want to know why. Which features made the difference? How were they
weighted toward the final answer? In many countries, we have the legal right to
understand why these kinds of decisions were made. And neural networks are being
used to make more and more decisions about our lives. Most banks for example use
neural networks to detect and prevent fraud. Many cancer tests, like the Pap test for
cervical cancer, use a neural network to look at an image of cells under a
microscope, and decide whether there’s a risk of cancer. And neural networks are
how Alexa understands what song you’re asking her to play and how Facebook
suggests tags for our photos. Understanding how all this happens is really important
to being a human in the world right now, whether or not you want to build your own
neural network. So this was a lot of big-picture stuff, but the program we gave John
Green-bot had already been trained to recognize dogs. The neurons already had
algorithms that weighted inputs.
