
AI: NN
FTMBA – Trim 6
03 Jan-21
ANN

 At the core of deep learning


 Versatile, adaptive, and scalable: appropriate for
– Tackling large datasets, highly complex Machine Learning tasks
• Image classification (Google Images)
• Speech recognition (Apple’s Siri)
• Video recommendation (YouTube)
• Analyzing sentiments among customers (Twitter Sentiment Analyzer)

ANN

 ANN: computational model


– Inspired by the way the human brain processes information
• Using biological neural networks
– First introduced in 1943
• Neurophysiologist Warren McCulloch
• Mathematician Walter Pitts
• Simplified computational model
– How biological neurons might work together in animal brains
» To perform complex computations using propositional logic
– Frequently outperform other ML techniques
• Cases of very large and complex problems
– Huge quantity of data available to train neural networks
– Increase in computing power: train large neural networks
• In a reasonable amount of time: Moore’s Law, Gaming industry
– Improvement in training algorithms
– Possible theoretical limitations have been overcome
ANN

 Artificial neuron: simple model of the biological neuron


– Has one or more binary (on/off) inputs, one binary output.
• Activates output when more than a certain number of inputs are active
– It is possible to build a network of artificial neurons
» That computes any logical proposition you want (McCulloch and Pitts)

ANN

 Assumption: a neuron is activated when at least two of its inputs are active
– Identity function: if neuron A is activated, neuron C gets activated as well
• If neuron A is off, neuron C is off as well
– Logical AND: neuron C activated only when
• Both neurons A and B are activated
– A single input signal is not enough to activate neuron C
– Logical OR: neuron C activated if
• Either neuron A or neuron B is activated (or both)
– Neuron C activated only if neuron A is active & neuron B is off
• If neuron A is active all the time → logical NOT
– Neuron C is active when neuron B is off, and vice versa
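
These logic networks can be sketched in a few lines of Python. This is a minimal sketch: the threshold of two active inputs follows the slide's assumption, the duplicated input wiring mirrors the figures, and the inhibitory A-and-not-B case is written directly rather than as a pure threshold sum.

```python
# Minimal sketch of McCulloch-Pitts style neurons with binary inputs/outputs.
# Assumption (from the slide): a neuron fires when at least two of its inputs are active.

def mcp_neuron(*inputs, threshold=2):
    """Fires (returns 1) when the number of active inputs reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0

def identity(a):        # C = A      (A's signal is fed in twice)
    return mcp_neuron(a, a)

def logical_and(a, b):  # C = A AND B
    return mcp_neuron(a, b)

def logical_or(a, b):   # C = A OR B (each input is wired in twice)
    return mcp_neuron(a, a, b, b)

def a_and_not_b(a, b):  # C = A AND (NOT B): an active B inhibits C
    return 1 if (a == 1 and b == 0) else 0

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```
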

ANN

 Perceptron: one of the simplest ANN architectures


– Invented in 1957 (Frank Rosenblatt)
• Based on a slightly different artificial neuron: the linear threshold unit (LTU)
– Numerical inputs, output
– Each input connection is associated with a weight
» LTU computes a weighted sum of its inputs
» z = w1x1 + w2x2 + ⋯ + wnxn = wᵀ · x
» Applies a step function to that sum, outputs the result
» hw(x) = step(z) = step(wᵀ · x)

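Below is a minimal NumPy sketch of this LTU computation; the weight and input values are illustrative, not from the slide.

```python
import numpy as np

def step(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def ltu_output(x, w):
    """LTU: weighted sum of the inputs followed by a step function."""
    z = np.dot(w, x)              # z = w1*x1 + w2*x2 + ... + wn*xn = w^T . x
    return step(z)

w = np.array([0.5, -0.2, 0.1])    # hypothetical weights
x = np.array([1.0, 2.0, 3.0])     # one input instance
print(ltu_output(x, w))           # weighted sum is 0.4 >= 0, so the LTU outputs 1
```
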
ANN

 Most common step functions used in Perceptrons:


– Heaviside function, sign function
 Single LTU: can be used for simple linear binary classification
– Computes a linear combination of the inputs
• If result exceeds a threshold: outputs the positive class
– Else outputs the negative class
• Classify iris flowers based on the petal length and width
– Can add an extra bias feature x0 = 1
• Training the LTU → finding the right values for w0, w1, and w2
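
As a concrete illustration of this classifier, here is one way to fit such a model to the iris petal features using scikit-learn's Perceptron; the setosa-vs-rest target is an illustrative choice, and the library handles the bias term w0 internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]            # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0 (binary classification)

clf = Perceptron(random_state=42)
clf.fit(X, y)                       # learns w0 (intercept) and w1, w2 (coefficients)

print(clf.intercept_, clf.coef_)    # bias and input weights
print(clf.predict([[2.0, 0.5]]))    # predict the class of a new flower
```
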

ANN

 Perceptron: composed of a single layer of LTUs


– each neuron connected to all the inputs
• Input connections: represented using special passthrough neurons
– Input neurons: output whatever input they are fed
• Bias feature: typically represented using a special type of neuron
– Bias neuron: outputs 1 all the time

ANN

 Perceptron training algorithm: inspired by Hebb’s rule


– When a biological neuron often triggers another neuron
• Connection between these two neurons grows stronger
– The Organization of Behavior (1949), Donald Hebb
– Cells that fire together, wire together (Siegrid Löwel)
• Later became known as Hebb’s rule (or Hebbian learning)
– Connection weight between two neurons is increased
» whenever they have the same output

 Perceptrons: trained using a variant of this rule


– Takes into account the error made by the network
• Does not reinforce connections that lead to the wrong output
– Perceptron is fed one training instance at a time
» for each instance predictions are made
– For every output neuron that produced a wrong prediction
» Reinforces the connection weights from the inputs
» That would have contributed to the correct prediction

ANN

wi,j: connection weight between the ith input neuron and the jth output neuron
xi: ith input value of the current training instance
ŷj: output of the jth output neuron for the current training instance
yj: target output of the jth output neuron for the current training instance
η: learning rate
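
Together, these symbols describe the standard Perceptron learning rule, applied after each prediction:
wi,j (next step) = wi,j + η (yj − ŷj) xi
A weight is changed only when the output ŷj is wrong, and the change nudges the output toward the target yj.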

 Perceptrons: incapable of learning complex patterns


– Decision boundary of each output neuron is linear
ANN

 Neuron: basic unit of computation in an ANN


– A.k.a. node/unit
– Receives input from some other nodes, or from an external source
• Each input: has an associated weight (w)
– Assigned on the basis of its relative importance to other inputs
– Computes an output
• Node applies a function to the weighted sum of the inputs

Activation function

 Purpose: introduce non-linearity into the output of a neuron


– Most real world data is non-linear
• Neurons are required to learn non-linear representations
 Every activation function
– Takes a single number
– Performs a certain fixed mathematical operation on it
 Several Types:
– Sigmoid: takes an input, squashes it to the range [0, 1]
• σ(x) = 1 / (1 + exp(−x))
– tanh: takes an input, squashes it to the range [−1, 1]
• tanh(x) = 2σ(2x) − 1
– ReLU (Rectified Linear Unit): takes an input, thresholds it at zero
• Replaces negative values with zero
• f(x) = max(0, x)
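
A minimal NumPy sketch of these three activation functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # output in (-1, 1); equals 2*sigmoid(2x) - 1

def relu(x):
    return np.maximum(0, x)           # negative values replaced with zero

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```
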
Activation Function

 ReLU: some gradients are fragile during training and can die
– A weight update can leave a neuron that never activates again on any data point: dead neurons
– Fix: Leaky ReLU: introduces a small slope to keep the updates alive
• Output ranges from −∞ to +∞
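
A one-line NumPy sketch of Leaky ReLU; the slope of 0.01 for negative inputs is an illustrative choice.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # the small slope alpha keeps a non-zero gradient for negative inputs
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-10.0, 0.0, 10.0])))   # -0.1, 0.0, 10.0
```
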

ANN

 Feedforward Neural Network: simplest type of ANN


– Contains multiple neurons (nodes) arranged in layers
– Nodes from adjacent layers have connections between them
• Connections have weights associated with them

ANN

 Feedforward neural network: connections do not form cycles (unlike recurrent NNs)


– Information moves in only one direction – forward
• From the input nodes, through the hidden nodes (if any), to output nodes
 Consist of three types of nodes
– Input Nodes
• Provide information from the outside world to the network
– Together referred to as the "Input Layer"
• No computation is performed in any of the Input nodes
– They only pass the information on to the hidden nodes
• Single input layer
– Hidden Nodes: no direct connection with the outside world
• Perform computations, transfer information
– From the input nodes to the output nodes
• Hidden Layer : formed by a collection of hidden nodes
• Zero / multiple Hidden Layers possible
– Output Nodes: collectively referred to as the "Output Layer"
• Responsible for computations, transferring information
– From the network to the outside world
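
A minimal sketch of one forward pass through such a network, with a single hidden layer; the layer sizes, random weights, and sigmoid activation are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input layer (3 nodes) -> hidden layer (4 nodes)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden layer (4 nodes) -> output layer (2 nodes)

x = np.array([0.5, -1.0, 2.0])                  # values supplied by the input nodes

h = sigmoid(W1 @ x + b1)                        # hidden nodes: weighted sum + activation
y = sigmoid(W2 @ h + b2)                        # output nodes: the network's prediction
print(y)
```
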
Back-Propagation

 Backward Propagation of Errors (BackProp)


– Process by which a Multi-Layer Perceptron learns
• One of the several ways in which an ANN can be trained
– Supervised training scheme: learns from labeled training data
 BackProp: "learning from mistakes"
– Goal of learning: assign correct weights for the connections
• Given an input vector, weights determine the output vector
 Supervised learning: labeled training set
 BackProp Algorithm:
– Initial random assignment of edge weights
• For every input in the training dataset, ANN is activated
– Output is observed: compared with the desired output that is known
– Error is "propagated" back to the previous layer
– Error is noted, weights are "adjusted" accordingly
• Process repeated until the output error falls below a predetermined threshold
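
A compact sketch of these steps for a tiny one-hidden-layer network; the XOR data, sigmoid activations, squared-error cost, and hyperparameters are illustrative choices, not taken from the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative labeled training set: the XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # initial random assignment of weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
eta = 0.5                                       # learning rate

for epoch in range(5000):
    # Forward pass: activate the network on every training input
    h = sigmoid(X @ W1 + b1)                    # hidden layer
    out = sigmoid(h @ W2 + b2)                  # observed output

    # Compare with the desired (known) output and note the error
    error = out - y                             # derivative of 0.5*(out - y)^2 w.r.t. out

    # Propagate the error back and adjust the weights
    d_out = error * out * (1 - out)             # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)          # gradient propagated to the hidden layer
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_h;    b1 -= eta * d_h.sum(axis=0)

print(out.round(2).ravel())   # should be close to the targets 0, 1, 1, 0 once training has converged
```
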
ANN Architecture

 Components of ANN
– Input Layer: input variables, bias term
– Hidden Layer
• Neurons where all mathematical calculations are done
• ANN can have more than one neuron in a hidden layer
– Multiple hidden layers also possible
– The Activation Function: a mathematical function
• Transforms the output of a given layer
– Before passing on the information to consecutive layers
– Determine the output of an ANN
– Part of each neuron in the hidden layers
» Determines output relevant for prediction
– The Output Layer
• Final "output prediction" of the network

ANN Architecture

 Components of ANN (Contd.)


– Forward Propagation
• Calculation of the output of each iteration
– From the input layer to the output layer
– Backward Propagation (learning)
• Calculation of revised weights after each forward propagation
– Analysis of the derivative of the cost function
– Learning Rate:
• Scales how much each weight and bias term is adjusted
– After every backward propagation
• Controls the speed at which the model learns from the data

ANN

 Learning
– Cost Function: one half of the squared difference between the actual value and the network's output
• For each layer of the network, the cost function is analyzed
– Used to adjust the threshold and weights for the next input
• Aim: minimize the cost function
– The lower the cost function, the closer the predicted value is to the actual value
» The error becomes marginally smaller in each run
» As the network learns how to analyze values
• Resulting data fed back through the entire neural network
– Weighted synapses connecting input variables to the neurons
» The only things the training process can adjust
• Adjustment of weights: until there is no disparity between the actual and the predicted value
– Tweak values, run the neural network again:
» New cost function produced
– Repeat the process: until the cost function is as small as possible
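
In symbols, the cost function described above is, for a single training instance,
C = ½ (y − ŷ)²
where y is the actual value and ŷ the network's output; the total cost sums this over the training instances.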
ANN

 Two basic mechanisms of back-propagation


– Brute-force method
• Best suited for the single-layer feed-forward network
– Take a number of possible weights
– Eliminate all the other weights
» Except the one at the bottom of the u-shaped curve
• Optimal weight found using simple elimination techniques
– Process of elimination works only if there is a single weight to optimize
– In the case of a complex NN with a large number of weights
» This method fails: Curse of Dimensionality

ANN

 Batch-Gradient Descent
– Iterative optimization algorithm
• Responsibility: to find the minimum cost value (loss)
– In the process of training the model with different weights
– Rather than evaluating every possible weight value, evaluate the slope
» The gradient (angle) of the cost curve at the current weight
– If the slope is negative: increase the weight, moving down the curve toward lower cost
» If the slope is positive: decrease the weight, again moving toward lower cost
• Gradient Descent works fine in case of a convex curve
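
This slope-following behaviour is the usual gradient-descent weight update,
w(new) = w(old) − η · ∂C/∂w
where η is the learning rate: a negative slope increases the weight and a positive slope decreases it, so both cases move toward lower cost.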
ANN

 Backpropagation: involves gradient descent within the solution's vector space


– Towards a 'global minimum' along the steepest vector of the error surface
– Global Minimum: theoretical solution with the lowest possible error
• Error surface: ideally a smooth hyperparaboloid, but seldom 'smooth' in practice
– In most problems, the solution space is quite irregular
» Numerous 'pits' and 'hills' which may cause the network to settle down in a 'local minimum'
» Not the best overall solution
ANN

 Stochastic Gradient Descent (SGD)


– 'Stochastic': system/process
• Linked with random probability

 SGD: a few samples are selected randomly for each iteration


– Instead of the whole data set
– Helps to avoid the problem of local minima
– Much faster than Gradient Descent
• Not required to load the whole data in memory during computations
– Generally noisier than typical Gradient Descent
• Usually takes a higher number of iterations to reach the minimum
– Due to the randomness in the descent
• Still computationally less expensive than Gradient Descent
– Preferred over Batch Gradient Descent for optimizing a learning algorithm
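
A minimal sketch of an SGD-style loop on an illustrative linear-regression problem; the mini-batch of 32 randomly selected rows replaces the full data set in each update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                  # illustrative data set
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=10_000)

w = np.zeros(5)
eta = 0.01                                        # learning rate

for step in range(1000):
    i = rng.integers(0, len(X), size=32)          # SGD: a few randomly selected samples...
    Xb, yb = X[i], y[i]                           # ...instead of the whole data set
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)     # gradient of the mean squared error on the mini-batch
    w -= eta * grad                               # noisy but cheap update

print(w.round(2))                                 # close to the true weights [1, -2, 0.5, 0, 3]
```
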
ANN

 Nature of the error space cannot be known a priori


– NN analysis: requires a large number of runs to determine the best solution
 Most learning rules: have built-in mathematical terms to assist in this process
– Control the 'speed' (beta coefficient) and 'momentum' of learning
• Speed of learning: rate of convergence between the current solution and the global minimum
• Momentum: helps network to overcome obstacles (local minima) in the error surface
– Settle down at or near the global minimum
• Learning rate: how much the current situation affects the next step
– Momentum: how much past steps affect the next step
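
One common way to write the two terms (using μ for the momentum coefficient, to avoid confusion with the slide's beta/speed term, and η for the learning rate):
v(new) = μ · v(old) − η · ∂C/∂w
w(new) = w(old) + v(new)
The η term is the contribution of the current step; the μ·v term carries over past steps and can push the weights through shallow local minima.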

ANN

 Deep: use of multiple non-linear hidden layers


– Deep learning: not limited to neural networks
• Broader concept: constructing multiple levels of representation
– Learning a hierarchy of features
• A name for hierarchical representation-learning algorithms
– Deep models can also be based on Hidden Markov Models, Conditional Random Fields, Support Vector Machines, etc.
» Feature engineering: identify the set of features best suited for solving a specific classification problem
• Common aspect: deep models work out their own representation from raw data
– Applied to image recognition (raw images) they produce a multi-level representation:
» Pixels; lines; face features (if we are working with faces) like noses, eyes, etc.; generalized faces
– Natural Language Processing
» Construction of language model
» Connects words into chunks, chunks into sentences etc.
ANN

 Addition of multiple hidden layers to an MLP: "deep"


– Problem: difficult to learn "good" weights for this network
• Start Training: assign random values as initial weights
– Can be off from the "optimal" solution
• During training: use backpropagation algorithm
– To propagate the "errors" from right to left
– Take a step in the opposite direction of the cost (or "error") gradient
» Problem of "vanishing/exploding gradient": the more layers are added, the harder it is to "update" the weights
» The signal becomes weaker/stronger: difficult to control
» The network's weights can be very much off in the beginning (random initialization)
» It can become almost impossible to parameterize a "deep" neural network with backpropagation
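
A small numeric illustration of the vanishing side of this problem: the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) never exceeds 0.25, so the back-propagated signal shrinks roughly geometrically with depth. The layer counts below are illustrative, and the bound ignores the weights themselves.

```python
# Rough upper bound on how much of the error signal survives n sigmoid layers:
# each layer multiplies the gradient by at most max sigma'(x) = 0.25.
for n_layers in (2, 5, 10, 20):
    print(n_layers, 0.25 ** n_layers)
# 2 layers -> 0.0625, 10 layers -> ~1e-6, 20 layers -> ~1e-12: early layers receive a vanishing signal
```
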

ANN

 Deep learning: algorithms that can help us with the training of "deep" neural network structures
– Proposes a new initialization strategy: use a series of single-layer networks to find the initial parameters
– Called pre-training: the initialization now generates values that are not quite random: more suitable for the data
 Learning to read: recognize individual letters
– Combine letters into words; words into sentences
• Get better: easy to recognize words directly
– Without thinking about letters
– In fcat, possible to eaisly raed jmubled wrods
– Deep Neural Networks: designed to do similar stuff
• Logistic Regression: can look at the basic attributes fed into it
• Neural Network: can have several intermediary steps
– Combining the basic attributes into higher-level concepts
29
Confidential |
ANN

 Deep neural network: feedforward network with many hidden layers


– No. of hidden layers required in order to qualify as deep: 2 or more
• No definite answer; a shallow network has 1 hidden layer
 Benefits of having multiple hidden layers:
– Not known: still not quite sure why it works so well
– A shallow neural network can approximate any function
• Can in principle learn anything
– Deep networks work better
• Shallow networks need more neurons than the deep one
– No. of units in a shallow network grows exponentially with task complexity
• Shallow network: more difficult to train with current algorithms
– Difficult to get to global/local minima, convergence rate is slower, etc.
• A shallow architecture does not fit the kind of problems we need to solve
– Object recognition is a quintessential "deep", hierarchical process

ANN

 Concept of spatial/temporal invariance in recognition


– "dog"/"car" can appear anywhere in an image
– Learning independent weights at each spatial or temporal location is impractical
• Neurons receiving inputs from one corner of the image
– Would have to learn to represent "dog" independently
» From neurons connected to other parts of the image
» Would require enough images of dogs that the network had experienced several examples of dogs at each possible image location separately
• Reduce neighboring features into single units (see the sketch after this list)
– By taking the max / averaging
– Done over many rounds: eventually arrive at an almost scale-invariant representation of the image: "equivariant"
– Now possible to detect objects in an image
» No matter where they are located
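
A minimal sketch of the reduce-by-max idea on a tiny 4×4 feature map, using non-overlapping 2×2 windows; the values are illustrative.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Reduce neighboring features into single units by taking the max of each 2x2 block."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])
print(max_pool_2x2(fm))   # [[4 2] [2 8]]: the strongest response in each neighborhood survives
```
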

Thank you
