
+

Lectures 24 and 25 – Large-Scale


Learning and Neural Networks
IEOR 242 – Applications in Data Analysis
Fall 2019 – Paul Grigas
+ 2

Today’s Agenda

n Regularized Loss Function Minimization and


(Stochastic) Gradient Descent

n Intro to Neural Networks

n Training a Neural Network model

n Deep and Convolutional Networks

n Computational Examples

n Some “Big Data” Issues

IEOR 242, Fall 2019 - Lecture 24


+
Regularized Loss Function
Minimization

IEOR 242, Fall 2019 - Lecture 24 3


+ 4

Test Set Performance Summary of


Last Week’s Models
Model                               R2 (w.r.t.          MAE (w.r.t.    Number of Non-zero
                                    Log(Sale Price))    Sale Price)    Coefficients
"Naïve" Linear Regression Model     0.7484              $26,037        529
"Common Sense" Linear Reg. Model    0.8324              $20,003        25
Random Forest                       0.8304              $18,906        -
Principal Components Regression     0.8604              $17,828        - (529, sort of)
Ridge Regression                    0.8870              $14,931        529
LASSO                               0.8598              $18,437        126

(Note: Random Forests did not use cross validation)


IEOR 242, Fall 2019 - Lecture 24
+ 5

Summary of Last Week

n PCR, Ridge Regression, and LASSO can improve


prediction accuracy over ordinary linear
regression, especially when p ≈ n

n LASSO performs variable selection and informs us


about which independent variables are the most
useful for predicting the dependent variable

n Linear models are interpretable and competitive


with nonlinear models (random forests, boosting,
etc.)

IEOR 242, Fall 2019 - Lecture 24


+ 6

What about classification?

n PCR, Ridge Regression, and LASSO can all easily
be adapted to logistic regression, and the glmnet
package in R handles this as well (see the sketch below)

n In fact, regularization is a very general concept that
can be applied in many different contexts…
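
A minimal glmnet sketch of these classification versions, assuming a numeric feature matrix x and a 0/1 outcome vector y (both placeholders, not objects from the lecture code):

```r
library(glmnet)

# x: numeric feature matrix, y: binary outcome coded 0/1 (placeholder objects)
# family = "binomial" gives regularized logistic regression;
# alpha = 0 is the ridge penalty, alpha = 1 is the LASSO penalty
ridge_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
lasso_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients and predicted probabilities at the cross-validated lambda
coef(lasso_fit, s = "lambda.min")
predict(lasso_fit, newx = x, s = "lambda.min", type = "response")
```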

IEOR 242, Fall 2019 - Lecture 24


+ 7

Machine Learning as Loss Function


Minimization
n A common format for many machine learning
methods is to set up the “learning algorithm” as an
optimization problem where we want to minimize a
loss function of the training data plus a
regularization penalty term

n Let’s examine this more closely in the case where


the predictions generated by the method are
based on linear functions of the features (as in
linear regression and logistic regression)

IEOR 242, Fall 2019 - Lecture 24


+ 8

Machine Learning as Loss Function


Minimization, cont.
n The setup consists of:

n Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where each
$x_i \in \mathbb{R}^p$ is a feature vector and $y_i$ is the response

n Predictions are constructed from a linear function of
the features, $\beta^\top x$
n (E.g., in linear regression the prediction is $\beta^\top x$ itself; what about logistic regression?)

n Two ways to handle the intercept term: (i) construct a
feature column equal to all 1s or (ii) center the features

n $\beta = (\beta_1, \ldots, \beta_p)$ are the model coefficients that we need to fit

n $\ell(y, \beta^\top x)$ is a loss function that measures the error or loss
incurred when the truth is actually $y$ but our prediction was
based on $\beta^\top x$
Reminder: $\beta^\top x = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
IEOR 242, Fall 2019 - Lecture 24
+ 9

Examples of Loss Functions

n $\ell(y, \beta^\top x)$ is a loss function that measures the error
or loss incurred when the truth is actually $y$ but
our prediction was based on $\beta^\top x$

n In regression, the standard loss function is the
least-squares (RSS) loss: $\ell(y, \beta^\top x) = (y - \beta^\top x)^2$

n In classification, there are a few options

IEOR 242, Fall 2019 - Lecture 24


+ 10

Loss Functions for Classification

n Suppose that we've set up the problem so that the
training data consists of features $x_1, \ldots, x_n$ and
class labels $y_1, \ldots, y_n$ that are either -1 or +1

n This format is convenient because most loss
functions for classification are designed so that it
makes sense to predict according to the rule:
predict $+1$ if $\beta^\top x > 0$ and $-1$ if $\beta^\top x < 0$

IEOR 242, Fall 2019 - Lecture 24


+ 11

Prediction Rule for Classification

n This prediction rule can be compactly summarized
as: $\hat{y} = \mathrm{sign}(\beta^\top x)$,

where $\mathrm{sign}(\cdot)$ is the function that returns $+1$ for
positive arguments and $-1$ for negative arguments

IEOR 242, Fall 2019 - Lecture 24


+ 12

Loss Functions for Classification,


cont.
n 0-1 loss function: $\ell(y, \beta^\top x) = \mathbf{1}\{y \cdot \beta^\top x \le 0\}$
(i.e., the loss is 1 if the prediction is wrong and 0 otherwise)

n Logistic loss function: $\ell(y, \beta^\top x) = \log\!\left(1 + e^{-y \beta^\top x}\right)$

n (The logistic loss is continuous, which makes it much
easier to minimize than the 0-1 loss)

IEOR 242, Fall 2019 - Lecture 24


+ 13

Loss Functions for Classification,


cont.
n Hinge loss function: $\ell(y, \beta^\top x) = \max\{0,\, 1 - y \beta^\top x\}$

n Hinge loss arises in the context of support vector


machines (SVMs)

n In fact (after some geometric arguments), hinge
loss + ridge penalty is exactly the SVM method, i.e.,
minimizing the hinge loss in this regularized framework
is equivalent to fitting an SVM (see the sketch below)
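
As a small illustration, the three classification losses can be written as functions of the margin $y \cdot \beta^\top x$ (my own vectorized R versions of the formulas above):

```r
# Margin m = y * (x %*% beta), with labels y coded as -1 / +1
zero_one_loss <- function(m) as.numeric(m <= 0)   # 1 if the sign-based prediction is wrong
logistic_loss <- function(m) log(1 + exp(-m))     # smooth surrogate used by logistic regression
hinge_loss    <- function(m) pmax(0, 1 - m)       # the SVM (hinge) loss

# Evaluating these on a grid of margins reproduces the comparison plotted two slides ahead
m <- seq(-3, 3, by = 0.01)
losses <- data.frame(m, zero_one = zero_one_loss(m),
                     logistic = logistic_loss(m), hinge = hinge_loss(m))
```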

IEOR 242, Fall 2019 - Lecture 24


+ 14

Loss Functions for Classification,


cont.
n The 0-1 loss function is the “ideal” loss function for
achieving high accuracy, because minimizing the
average of the 0-1 loss on the training data is the
same as maximizing accuracy on the training data
n However, there are computational difficulties associated
with minimizing the 0-1 loss

n The logistic loss function seems weird, but it arises


from algebraic manipulations of the maximum
likelihood estimation formulation for logistic
regression

IEOR 242, Fall 2019 - Lecture 24


+ 15

Loss Functions for Classification,


cont.
[Plot: the 0-1 loss, logistic loss, and hinge loss drawn on the same axes
(horizontal axis from -3 to 3, vertical axis from 0 to 4)]

IEOR 242, Fall 2019 - Lecture 24


+ 16

Machine Learning as Loss Function


Minimization
n Learning algorithms are often formatted as:
minimize loss function on training data +
regularization penalty

n For example, $\lambda \sum_{j=1}^p \beta_j^2$ is the ridge penalty, and
$\lambda \sum_{j=1}^p |\beta_j|$ is the LASSO penalty
n There is another penalty called “elastic net” which
is a combination of ridge and LASSO – see glmnet
docs
IEOR 242, Fall 2019 - Lecture 24
+ 17

Regularized Loss Function


Minimization

n The methods studied last week correspond to the least-


squares loss function with ridge and LASSO penalties

n If we use the logistic loss for a classification problem,


we get “ridge and LASSO logistic regression”
n Both of these extensions inherit many of the properties of their
respective linear regression versions

n We will now use the above problem format to study


large-scale learning methods

IEOR 242, Fall 2019 - Lecture 24


+ Regularized Loss Function
Minimization and (Stochastic)
Gradient Descent

IEOR 242, Fall 2019 - Lecture 24 18


+ 19

Machine Learning as Loss Function


Minimization
n Learning algorithms are often formatted as:
minimize loss function on training data +
regularization penalty

n For example:
n $\lambda \sum_{j=1}^p \beta_j^2$ is the ridge penalty

n $\lambda \sum_{j=1}^p |\beta_j|$ is the LASSO penalty

IEOR 242, Fall 2019 - Lecture 24


+ 20

Machine Learning as Loss Function


Minimization, cont.
n Let's define the objective function as:
$L(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \beta^\top x_i) + \lambda R(\beta)$

n …so that our learning problem is: $\min_{\beta}\; L(\beta)$

n (We assume that $L(\beta)$ is differentiable globally)


n How should we solve this problem?

IEOR 242, Fall 2019 - Lecture 24


+ 21

Gradients
n Recall the partial derivative of $L$ with respect
to $\beta_j$, denoted $\frac{\partial L(\beta)}{\partial \beta_j}$

n I can consider a p-dimensional vector of these
quantities, called the gradient of $L$ at $\beta$
n This is denoted $\nabla L(\beta)$, and its jth component is $\frac{\partial L(\beta)}{\partial \beta_j}$

n The vector $\nabla L(\beta)$ points in the direction of
steepest increase of the function $L$ at $\beta$

IEOR 242, Fall 2019 - Lecture 24


+ 22

Optimality Conditions
n Recall our optimization problem of interest: $\min_{\beta}\; L(\beta)$

n Under some regularity conditions (e.g., the
function $L$ is convex), $\beta^*$ solves this
problem if and only if $\nabla L(\beta^*) = 0$

n More generally, all local minimizers satisfy $\nabla L(\beta) = 0$

n This result suggests that we should look for points


where the gradient is equal to 0, but there are a few
issues…
n Most notably, how can we find such points efficiently?
IEOR 242, Fall 2019 - Lecture 24
+ 23

Gradient Descent

n Basic idea: at any point in


the domain, take a slight
step in the direction of
steepest descent, i.e., the
direction where the function
has the maximum rate of
decrease, i.e., the direction
of minus the gradient

IEOR 242, Fall 2019 - Lecture 24


+ 24

Gradient Descent, cont.

n Problem of Interest: $\min_{\beta}\; L(\beta)$

n Step 1.) Initialize at some point $\beta^0$ and set the step-
size parameter $\alpha > 0$
n (The initial point $\beta^0$ and the step size $\alpha$ are additional hyperparameters)

n Step 2.) For t = 0, 1, …, T perform the update:
$\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$

n Two issues: how do we select T and how do we


select the step-size parameter? Ideas?
n (One idea: choose them by cross-validation, or stop once the norm of the gradient,
or the change in the objective between iterations, falls below a threshold; see the sketch below)
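
A toy illustration of the method for ridge-regularized least squares (my own sketch; the step size, iteration count, and regularization level below are arbitrary placeholder values):

```r
# Gradient descent for L(beta) = (1/n) * sum((y - X %*% beta)^2) + lambda * sum(beta^2)
gradient_descent <- function(X, y, lambda = 0.1, alpha = 0.01, n_iter = 1000) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))                        # initialize (any point works here: L is convex)
  for (t in 1:n_iter) {
    resid <- as.vector(X %*% beta) - y
    grad  <- (2 / n) * as.vector(t(X) %*% resid) + 2 * lambda * beta
    beta  <- beta - alpha * grad                 # step in the direction of minus the gradient
  }
  beta
}
# Usage: beta_hat <- gradient_descent(X, y) for a numeric matrix X and response vector y
```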

IEOR 242, Fall 2019 - Lecture 24


+ 25

Gradients in Machine Learning

n Recall that our objective function is
$L(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \beta^\top x_i) + \lambda R(\beta)$

n How do I compute the gradient at $\beta$? Simple!
Differentiate term by term:
$\nabla L(\beta) = \frac{1}{n}\sum_{i=1}^n \nabla_\beta\, \ell(y_i, \beta^\top x_i) + \lambda \nabla R(\beta)$

IEOR 242, Fall 2019 - Lecture 24


+ 26

Gradients in Machine Learning,


cont.

n It turns out that $\nabla R(\beta)$ is typically quite simple to
compute, e.g., $\nabla R(\beta) = 2\beta$ for the ridge penalty $R(\beta) = \sum_j \beta_j^2$

n Recall that the loss function $\ell(y_i, \beta^\top x_i)$ takes the value of the
prediction $\beta^\top x_i$ as input, hence its gradient
depends on how $\ell$ changes when $\beta^\top x_i$ changes
and how $\beta^\top x_i$ changes when $\beta$ changes.
By the chain rule, we have:
$\nabla_\beta\, \ell(y_i, \beta^\top x_i) = \ell'(y_i, \beta^\top x_i)\, x_i$,
where $\ell'$ denotes the derivative of the loss with respect to its second argument

IEOR 242, Fall 2019 - Lecture 24


+ 27

Gradients in Machine Learning,


cont.
n This formula is extremely convenient – it says that
the gradient of the loss of the ith data point is
proportional to the ith feature vector and the
proportionality constant is equal to the number $\ell'(y_i, \beta^\top x_i)$

n Then, recalling the formula
$L(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \beta^\top x_i) + \lambda R(\beta)$:

n …what is the procedure for computing $\nabla L(\beta)$? (See the sketch below)
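
To spell out the procedure (a sketch in the notation reconstructed above, writing $\ell'$ for the derivative of the loss in its second argument): compute the $n$ scalars $\ell'(y_i, \beta^\top x_i)$, take the corresponding weighted average of the feature vectors, and add the gradient of the penalty:

$$ \nabla L(\beta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell'\big(y_i, \beta^\top x_i\big)\, x_i \;+\; \lambda\, \nabla R(\beta) $$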

IEOR 242, Fall 2019 - Lecture 24


+ 28

Gradient Descent, cont.

n For t = 0, 1, …, T perform the update:
$\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$

n Given the procedure for computing the gradient of
the training set loss $L(\beta)$, each iteration of the
method requires a full pass over the training set

n Consider a situation where p = 10 and n =


1,000,000
n This seems to be quite an inefficient use of the training
data…

IEOR 242, Fall 2019 - Lecture 24


+ 29

Stochastic Gradient Descent

n Key idea: take the gradient descent update rule:
$\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$

n …and replace the computation of $\nabla L(\beta^t)$ with the
gradient at a randomly selected single data point

n That is, sample an index i uniformly at random
from 1, 2, …, n and set
$\beta^{t+1} = \beta^t - \alpha\left[\nabla_\beta\, \ell(y_i, \beta^\top x_i)\big|_{\beta = \beta^t} + \lambda \nabla R(\beta^t)\right]$
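
A matching toy sketch of SGD for the same ridge-regularized least-squares problem as before (again my own code with arbitrary placeholder values):

```r
# Stochastic gradient descent: one randomly chosen observation per update
sgd <- function(X, y, lambda = 0.1, alpha = 0.01, n_iter = 10000) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (t in 1:n_iter) {
    i <- sample(n, 1)                             # sample an index uniformly at random
    resid_i <- sum(X[i, ] * beta) - y[i]
    grad_i  <- 2 * resid_i * X[i, ] + 2 * lambda * beta   # unbiased estimate of the full gradient
    beta <- beta - alpha * grad_i
  }
  beta
}
```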

IEOR 242, Fall 2019 - Lecture 24


+ 30

Stochastic Gradient Descent, cont.

n This update rule has the property that the
expected value of the term in brackets is equal to
the deterministic quantity $\nabla L(\beta^t)$

n The path, however, will be dramatically different…

IEOR 242, Fall 2019 - Lecture 24


+ 31

Stochastic Gradient Descent, cont.


n Gradient descent and stochastic gradient descent provide the
basic foundation for many large-scale machine learning
methods
n Linear and logistic regression, as well as SVMs, neural networks, …

n SGD is most directly applicable when n is gigantic and p is


moderate
n When p is also large, adding randomization to coordinate updates also
helps

n Sampling a single data point i at each iteration can easily be


replaced with a small batch of randomly selected data points
(say 5, or 10)

n There are a whole host of other issues that have received recent
attention: accelerating the rate of convergence, persevering
“structure”, asynchronous and/or parallel updates, …

n These methods also form the basis of the machine learning
library (MLlib) in Spark
IEOR 242, Fall 2019 - Lecture 24
+
Neural Networks

IEOR 242, Fall 2019 - Lecture 24 32


+ 33

What is a Neural Network?

n “Mysterious” Question: What is a neural network?

n Straightforward Answer: A neural network is a


type of nonlinear statistical model

IEOR 242, Fall 2019 - Lecture 24


+ 34

Neural Network Applications

n Deep learning and neural networks have transformed


certain areas of data science in the last decade:
n Image recognition
n Speech recognition
n Machine translation
n Autonomous driving

n Neural network ideas have been around since the


1980s, but there has been a recent surge in popularity
n Availability of huge datasets
n Graphics processing units (GPUs) speeding up computation
(among other general improvements in computation)

IEOR 242, Fall 2019 - Lecture 24


+ 35

“Deep” Neural Networks


n Deep neural networks and other deep learning
architectures have been used to produce state-of-the-art
results in domains like computer vision, speech
recognition, natural language processing, and audio
recognition

n In these domains, the complicated functions fit by deep


networks are reasonable surrogates for reality

n Additionally, the deep networks are able to “automatically”


perform feature engineering, which used to be an onerous
task in these domains

n We will start with a basic single hidden layer network, and


then we will easily extend this to a deep feedforward
network
IEOR 242, Fall 2019 - Lecture 24
+ 36

Neural Networks: Pros and Cons

Pros:

n Delivers state-of-the-art predictions in multiple
domains

n Scales to massive datasets (both in the number of
observations and the number of variables)

n "Hot area" in machine learning

Cons:

n Neural Network modeling involves many parameters
and structural design decisions

n Fitting a (deep) neural network model can be an
engineering challenge

n NNs are often uninterpretable

n NNs often require huge datasets in order to work
really well

IEOR 242, Fall 2019 - Lecture 24


+ 37

Neural Network Basics


n A neural network is a type of nonlinear prediction
function $f(\cdot)$ that maps p-dimensional
feature vectors $x$ to K-dimensional outputs $f(x)$

n In a regression problem, the output is one-dimensional and
the prediction is the same as the output: $\hat{y} = f(x)$
n In a binary classification problem, the prediction is given
by the rule: $\hat{y} = \mathrm{sign}(f(x))$
n In a multiclass classification problem, the output is typically
interpreted as a probability vector:
$f(x) = (f_1(x), \ldots, f_K(x))$,

where $f_k(x)$ is interpreted as an estimate of the conditional
probability that the dependent variable is in class k

IEOR 242, Fall 2019 - Lecture 24


+ 38

Neural Network Basics, cont.

n A neural network is a nonlinear function $f(\cdot)$
defined by a directed graph organized in layers

n The first layer is the input layer, which takes in $x$

n The last layer is the output layer, which returns $f(x)$

n The inputs are "fed forward" through the network
in order to compute the output
IEOR 242, Fall 2019 - Lecture 24
+
Single Hidden Layer Neural
Nets

IEOR 242, Fall 2019 - Lecture 24 39


+ 40

A Single Hidden Layer Neural


Network
[Diagram: a network with an input layer, one hidden layer, and an output layer]

IEOR 242, Fall 2019 - Lecture 24


+ 41

Key Ingredients

n Each node has an input and an output

n Each edge has a weight value associated with it

n Each node also has a “bias term” associated with it

n The nodes in the input layer take in the feature


values

IEOR 242, Fall 2019 - Lecture 24


+ 42

Key Ingredients, cont.

n Each node has an input and an output

n Each edge has a weight value associated with it

n Each node also has a “bias term” associated with it

n The function value is computed by feeding


values through the graph via weighted linear
combinations and nonlinear “activation” functions

n Let’s see how this works at each of the three layers…

IEOR 242, Fall 2019 - Lecture 24


+ 43

Input Layer

n The nodes in the input layer take in the feature


values

n If the input to a node in layer 1 is equal to x, then the


output is simply x

IEOR 242, Fall 2019 - Lecture 24


+ 44

Hidden Layer

n Let M be the total number of nodes in the hidden layer

n Notation:
n Let $w_{ij}$ be the weight from the ith node in the input layer to the
jth node in the hidden layer
n Let $z_j$ be the input to the jth node in the hidden layer
n Let $b_j$ be the bias term of the jth node in the hidden layer

n The input to each node in the hidden layer is determined as
a weighted combination of the outputs of the input layer
plus the bias term: $z_j = b_j + \sum_{i=1}^p w_{ij} x_i$

IEOR 242, Fall 2019 - Lecture 24


+ 45

Hidden Layer, cont.

n The output of each node in the hidden layer is


determined by applying a scalar nonlinear
“activation” function to its input

n Let $a_j$ be the output of the jth node in the hidden
layer; then $a_j = g(z_j)$, where $g(\cdot)$ is the activation function

IEOR 242, Fall 2019 - Lecture 24


+ 46

Example Activation Functions

n Sigmoid: $g(z) = \dfrac{1}{1 + e^{-z}}$

[Plot: the sigmoid function for $-5 \le z \le 5$, increasing from 0 to 1 with value 0.5 at $z = 0$]

IEOR 242, Fall 2019 - Lecture 24


+ 47

Example Activation Functions

n ReLU: $g(z) = \max\{0, z\}$

[Plot: the ReLU function for $-5 \le z \le 5$, equal to 0 for negative inputs and to $z$ for positive inputs]

IEOR 242, Fall 2019 - Lecture 24


+ 48

Example Activation Functions

n Threshold/Step: $g(z) = 1$ if $z > 0$, and $g(z) = 0$ otherwise

[Plot: the step function for $-5 \le z \le 5$, jumping from 0 to 1 at $z = 0$]

IEOR 242, Fall 2019 - Lecture 24


+ 49

Activation Function

n The activation function is a crucial part of the


neural nets methodology because it introduces
nonlinearities

n Without the activation function, neural nets would


just reduce to an excessively complicated version
of linear regression

IEOR 242, Fall 2019 - Lecture 24


+ 50

Output Layer

n In general the output layer will have K nodes

n Notation:
n Let $v_{jk}$ be the weight from the jth node in the hidden layer
to the kth node in the output layer
n Let $u_k$ be the input to the kth node in the output layer
n Let $c_k$ be the bias term of the kth node in the output layer

n Again, the input to each node in the output layer is
determined as a weighted combination of the outputs of
the hidden layer plus a bias term: $u_k = c_k + \sum_{j=1}^M v_{jk} a_j$

IEOR 242, Fall 2019 - Lecture 24


+ 51

Output Layer, cont.

n The output of each node in the output layer (i.e.,


the final output of the model) is determined again
by applying an activation function

n Let $f_k(x)$ be the kth final output of the neural net
model; then $f_k(x) = g_{\text{out}}(u_k)$ for some output-layer
activation function $g_{\text{out}}(\cdot)$ (a forward-pass sketch appears below)
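
To make the three layers concrete, here is a sketch of the forward pass for a single hidden layer network in the (reconstructed) notation above; the argument names W, b_hidden, V, b_out are my own and not from the lecture code:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass for one observation x (a length-p feature vector)
# W: p x M input-to-hidden weights,  b_hidden: length-M hidden-layer biases
# V: M x K hidden-to-output weights, b_out: length-K output-layer biases
forward_pass <- function(x, W, b_hidden, V, b_out,
                         activation = sigmoid, output_activation = identity) {
  z <- as.vector(t(W) %*% x) + b_hidden    # inputs to the hidden layer
  a <- activation(z)                       # outputs of the hidden layer
  u <- as.vector(t(V) %*% a) + b_out       # inputs to the output layer
  output_activation(u)                     # final K-dimensional output f(x)
}
```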

IEOR 242, Fall 2019 - Lecture 24


+ 52

Activation Function in the Output


Layer
n In regression problems, usually we have K = 1 and
the output layer activation function is simply the
identity: $g_{\text{out}}(u) = u$, so that $f(x) = u_1$

n This is also the case for binary classification if one
uses the prediction rule $\hat{y} = \mathrm{sign}(f(x))$

IEOR 242, Fall 2019 - Lecture 24


+ 53

Activation Function in the Output


Layer, cont.
n In multiclass classification problems, we actually allow
the activation function in the last layer to depend on the
entire vector of inputs to the last layer

n The output of each node is determined by the softmax
activation function:
$f_k(x) = \dfrac{e^{u_k}}{\sum_{l=1}^K e^{u_l}}$

n Softmax is well suited for multiclass classification since
the output $(f_1(x), \ldots, f_K(x))$ is a valid probability
distribution (see the sketch below)
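
A small sketch of the softmax computation (the max subtraction is a standard numerical-stability trick, not something from the slides):

```r
# Softmax: maps the vector of output-layer inputs u to a probability vector
softmax <- function(u) {
  e <- exp(u - max(u))   # subtracting max(u) avoids overflow and leaves the result unchanged
  e / sum(e)
}

softmax(c(2, 1, 0.1))    # returns three probabilities that sum to 1
```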
IEOR 242, Fall 2019 - Lecture 24
+
Training a Neural Network
Model

IEOR 242, Fall 2019 - Lecture 24 54


+ 55

Training a Neural Network Model

n We have now fully described how a single hidden layer


neural network model computes its function values
n The model relies on specifying:
n The number of nodes in the hidden layer
n The activation functions
n Edge weight values
n Node bias terms

n For now, let’s suppose that the number of nodes in the


hidden layer and the activation functions have already
been decided upon
n We will now determine the edge weights and bias
terms based on training data by solving an optimization
problem…
IEOR 242, Fall 2019 - Lecture 24
+ 56

Training a Neural Network Model


via Loss Function Minimization
n We will appeal to the same loss function minimization
framework that we previously considered for linear
models

n Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

n Predictions are constructed from a neural network
model $f(\cdot)$ of the features

n The edge weights and bias terms are the model parameters that we need to fit

n $\ell(y, f(x))$ is a loss function that measures the error or
loss incurred when the truth is actually $y$ but our
prediction was based on $f(x)$

IEOR 242, Fall 2019 - Lecture 24


+ 57

Examples of Loss Functions

n In regression, as before the standard loss function
is the least-squares (RSS) loss: $\ell(y, f(x)) = (y - f(x))^2$

n In binary classification, we might again consider
the logistic loss function: $\ell(y, f(x)) = \log\!\left(1 + e^{-y f(x)}\right)$

IEOR 242, Fall 2019 - Lecture 24


+ 58

Examples of Loss Functions, cont.

n Consider multiclass classification with K classes: $y_i \in \{1, 2, \ldots, K\}$

n Presuming that we use the softmax activation in the
output layer (hence $f(x) = (f_1(x), \ldots, f_K(x))$ is a vector of
conditional probabilities), it makes sense to use the
cross-entropy loss function:
$\ell(y, f(x)) = -\sum_{k=1}^K \mathbf{1}\{y = k\}\, \log f_k(x) = -\log f_y(x)$

IEOR 242, Fall 2019 - Lecture 24


+ 59

Training a Neural Network Model


via Loss Function Minimization
n Let's define the objective function as (writing $\theta$ for the full
collection of edge weights and bias terms):
$L(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda R(\theta)$

n Note that the structure of the function $f(\cdot)$
implicitly depends on the parameters $\theta$

n $R(\theta)$ is a regularization (e.g., ridge) term like
before

IEOR 242, Fall 2019 - Lecture 24


+ 60

Training a Neural Network Model


via Loss Function Minimization
n Our learning problem is now the optimization
problem: $\min_{\theta}\; L(\theta)$

n How should we solve this problem?

n Stochastic gradient descent!

IEOR 242, Fall 2019 - Lecture 24


+ 61

Computing the Gradient

n We need to be able to compute the gradient of the


loss at any given data point with respect to the
parameters
n The gradient computation is not as straightforward
as with a linear model
n Luckily, there is a clever algorithm called back
propagation that can be used to efficiently
compute the gradient
n Key idea is a forward pass to compute the prediction
values, then a backwards pass to compute the rate of
change

IEOR 242, Fall 2019 - Lecture 24


+ 62

Back Propagation Illustrated

n Let's consider the case of regression with the
squared error loss $\ell(y, f(x)) = (y - f(x))^2$

IEOR 242, Fall 2019 - Lecture 24


+ 63

Back Propagation Illustrated

n More notation:

n Therefore:

IEOR 242, Fall 2019 - Lecture 24


+ 64

Back Propagation Illustrated

n Our Task:

Given values of the edge weights and bias terms, compute
the partial derivatives of the loss with respect to each of
those parameters

IEOR 242, Fall 2019 - Lecture 24


+ 65

Forward Pass

n Given the specified values of the parameters and
the data observation, we can run a forward pass
through the network to compute the input and output of
every node, and hence the prediction

IEOR 242, Fall 2019 - Lecture 24


+ 66

Partial Derivatives: Output Layer

n Forward pass yields:

n Chain rule yields:

IEOR 242, Fall 2019 - Lecture 24


+ 67

Partial Derivatives: Hidden Layer

n Forward pass yields:

n We can pass back to each node in the hidden layer


the value:

IEOR 242, Fall 2019 - Lecture 24


+ 68

Partial Derivatives: Hidden Layer

n From the backward pass, each hidden node has received
the rate of change of the loss with respect to its output

n Chain rule (again) yields the partial derivatives with respect
to the hidden-layer weights and bias terms (a worked version
of this calculation, in generic notation, is sketched below)
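
Since the slide formulas did not survive the export, here is a sketch of the whole calculation in the (reconstructed) notation used earlier, for a single observation $(x, y)$, a single output node with input $u = c + \sum_j v_j a_j$ and prediction $f(x) = g_{\text{out}}(u)$, hidden nodes with inputs $z_j$ and outputs $a_j = g(z_j)$, and squared-error loss $\ell = (y - f(x))^2$:

$$ \delta := \frac{\partial \ell}{\partial u} = -2\big(y - f(x)\big)\, g_{\text{out}}'(u), \qquad \frac{\partial \ell}{\partial v_j} = \delta\, a_j, \qquad \frac{\partial \ell}{\partial c} = \delta $$

$$ \delta_j := \frac{\partial \ell}{\partial z_j} = \delta\, v_j\, g'(z_j), \qquad \frac{\partial \ell}{\partial w_{ij}} = \delta_j\, x_i, \qquad \frac{\partial \ell}{\partial b_j} = \delta_j $$

The forward pass supplies $u$, the $z_j$, the $a_j$, and $f(x)$; the backward pass then sends $\delta$ (scaled by $v_j$) back to each hidden node, which is all that is needed to assemble every partial derivative.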

IEOR 242, Fall 2019 - Lecture 24


+ 69

Back Propagation

n The main ideas of back propagation easily extend


to networks with more layers as well

n Key idea is a forward pass to compute the


prediction values, then a backwards pass to
compute the derivative

IEOR 242, Fall 2019 - Lecture 24


+ 70

Other Optimization Issues

n 1.) Non-convexity: Unlike the case of linear


models, neural nets usually produce non-convex
optimization problems, which are more difficult to
solve in theory and in practice. The punchline is
that SGD may not always converge to the global
minimizer.

n 2.) Initialization: The initial values for the


parameters are usually selected as random values
near zero.

IEOR 242, Fall 2019 - Lecture 24


+ 71

Other Optimization Issues, cont.

n 3.) Regularization: A regularization penalty is not


always included with neural nets. The reason is
that the SGD algorithm does some sort of “implicit
regularization” on its own and the number of
iterations can also serve as a tuning parameter.

n 4.) Other Algorithms: There are other options for


algorithms besides SGD, and many bells and
whistles might be considered. (In the code, we use
“RMSProp”)

IEOR 242, Fall 2019 - Lecture 24


+
Deep Networks and Other
Extensions

IEOR 242, Fall 2019 - Lecture 24 72


+ 73

Deep Networks
n Deep neural networks are simply obtained by
adding more layers
[Diagram: a network with an input layer, several hidden layers, and an output layer]

IEOR 242, Fall 2019 - Lecture 24


+ 74

Deep Networks, cont.

n The mathematical description for computing the


function value of a fully connected deep network is
exactly analogous to what we saw before in the
single hidden layer case

n Training the network also follows the same general


principles described previously

n It is also easy to extend the layered model to an


arbitrary directed acyclic graph

IEOR 242, Fall 2019 - Lecture 24


+ 75

Some Advanced Architectures

n Recurrent neural networks and LSTM (long short-
term memory) models introduce time and memory
considerations

n Convolutional layers are useful for processing


image data and simplifying it down to something
that can then be fed into several fully connected
layers

n…

IEOR 242, Fall 2019 - Lecture 24


+ 76

Convolutional Networks

n Convolutional networks are especially useful for


dealing with image data
n Suppose that our input is an image of size h x w
pixels
n So the dimension of the feature vector is p = h x w
n Then, if there are M nodes in the hidden layer, there will be
h x w x M weight values that need to be trained
n This can quickly create computational and statistical issues

n Convolutional networks overcome this challenge


by exploiting sparsity, local structure, and
parameter sharing

IEOR 242, Fall 2019 - Lecture 24


+ 77

Convolutional Networks, cont.

n Simplest case: let’s transform an h x w pixel image


(our input) to another layer that also has h x w
nodes

IEOR 242, Fall 2019 - Lecture 24


+ 78

Convolutional Networks, cont.

n We will operate with a 3 x 3 window that scans over


the original image and computes a dot product of
the neighboring pixels with a fixed 3 x 3 matrix of
weights:

$a_{ij} = \sum_{k=1}^{3}\sum_{l=1}^{3} w_{kl}\, x_{i+k-2,\; j+l-2}$

where $[w_{kl}]$ is the fixed 3 x 3 weight matrix
$\begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix}$
and $x$ denotes the pixel values of the original image

n This operation is called a convolution

IEOR 242, Fall 2019 - Lecture 24


+ 79

Convolutional Networks, cont.


n Each pixel value in the new image is computed by
taking a convolution with the weight matrix and
then applying an activation function (note that we
pad the borders with zeros)
[Figure: the 3 x 3 weight matrix $[w_{kl}]$ sliding over the zero-padded image to produce each pixel of the new layer]
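
A toy version of this operation in R (my own sketch, not the lecture's code), combining the 3 x 3 dot product, the zero padding, and the activation described above:

```r
# img: h x w numeric matrix; W: 3 x 3 weight matrix; activation applied elementwise
conv2d_same <- function(img, W, activation = function(z) pmax(0, z)) {   # ReLU by default
  h <- nrow(img); w <- ncol(img)
  padded <- matrix(0, h + 2, w + 2)                 # "same" zero padding around the border
  padded[2:(h + 1), 2:(w + 1)] <- img
  out <- matrix(0, h, w)
  for (i in 1:h) {
    for (j in 1:w) {
      patch <- padded[i:(i + 2), j:(j + 2)]         # 3 x 3 neighborhood of pixel (i, j)
      out[i, j] <- sum(patch * W)                   # the convolution (dot product with the weights)
    }
  }
  activation(out)
}
```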

IEOR 242, Fall 2019 - Lecture 24


+ 80

Max Pooling

n A convolutional layer is usually followed by a


pooling step to reduce the size of the image

n The same window based approach is used but now


we take the maximum of all values in each window

n (The bigger the pooling window, the more the image is summarized, but the less local structure is preserved; see the sketch below)
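
And a matching toy sketch of max pooling; the 2 x 2 window and stride of 2 are typical defaults, not values specified on the slide:

```r
# img: h x w matrix with h and w even; returns an (h/2) x (w/2) matrix
max_pool_2x2 <- function(img) {
  h <- nrow(img); w <- ncol(img)
  out <- matrix(0, h / 2, w / 2)
  for (i in 1:(h / 2)) {
    for (j in 1:(w / 2)) {
      out[i, j] <- max(img[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)])   # max over each 2 x 2 window
    }
  }
  out
}
```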
IEOR 242, Fall 2019 - Lecture 24
+ 81

Convolutional Networks: Key Ideas

n 1.) Each layer of the network is an image

n 2.) Convolutional “layers” link the images together by


the previously described process

n 3.) A convolutional layer is usually followed by a max


pooling layer that reduces the size of the image

n 4.) Sparsity and local structure through the


properties of convolution

n 5.) Parameter sharing (and invariance) since there is


only a single 3 x 3 weight matrix

IEOR 242, Fall 2019 - Lecture 24


+ 82

Convolutional Networks:
Extensions
n Of course the window size can be set however you
like
n Also different options for how to deal with the
border cases (here we used “same zero padding”)
n Most importantly, this can be extended to height x
width x channels
n The number of channels can vary throughout the network
and is also referred to as the number of filters

n These ideas can also be applied to 1-D data (e.g.,


time series)

IEOR 242, Fall 2019 - Lecture 24


+ 83

Convolutional Networks: Final


Steps
n After several convolutional/max pooling layers, it is
important to have at least one (and typically a few) dense layers
n First you need to flatten the height x width x channels 3-
D object into a 1-D vector
n Then add dense layers as discussed before
n Finally, another popular idea is to use pre-trained
values for the weights in the earlier layers instead of
training them from scratch
n These pre-trained values are usually based on other datasets
and so this is a form of transfer learning
n See Chapter 5 of the “Deep Learning with R” book for more
details

IEOR 242, Fall 2019 - Lecture 24


+
Computational Examples

IEOR 242, Fall 2019 - Lecture 24 84


+ 85

Keras

n We will use the Keras package in R

n Keras is a “high-level neural networks API


developed with a focus on enabling fast
experimentation”

n Easy to implement feed forward, convolutional, or


recurrent networks (or any combination)

n Easy to run on CPU or GPU

n Uses a backend such as TensorFlow

IEOR 242, Fall 2019 - Lecture 24


+ 86

Classifying Poker Hands

n We will train a neural network model in Keras to


predict the best poker hand achievable based on
the cards that were dealt

n What kind of learning problem is this?

n How do you expect a neural network to perform?

IEOR 242, Fall 2019 - Lecture 24


+ 87

Poker Data Description

n Dataset retrieved from Kaggle

n Dependent Variable: one of 10 possible categories


coded between 0-9:
0: Nothing in hand; not a recognized poker hand
1: One pair; one pair of equal ranks within five cards
2: Two pairs; two pairs of equal ranks within five cards
3: Three of a kind; three equal ranks within five cards
4: Straight; five cards, sequentially ranked with no gaps
5: Flush; five cards with the same suit
6: Full house; pair + different rank three of a kind
7: Four of a kind; four equal ranks within five cards
8: Straight flush; straight + flush
9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush

IEOR 242, Fall 2019 - Lecture 24


+ 88

Poker Data Description

n Independent Variables (10 nominally):


n S1 “Suit of card #1”
n Coded in {1, 2, 3, 4}, representing {Hearts, Spades, Diamonds,
Clubs}

n C1 “Rank of card #1”


n Coded in {1, 2, …, 12, 13}, representing {Ace, 2, 3, ... , Queen,
King}

n ...
S5 “Suit of card #5”
C5 “Rank of card #5”

IEOR 242, Fall 2019 - Lecture 24


+ 89

Training/Test Split and Validation


Approach
n 25,010 total observations
n (Note that there are 2,598,960 total possible poker hands)

n We did a random train/test split, with 70% in


training and 30% in test

n Instead of doing cross validation, I will allow Keras


to reserve 20% of the training data as a validation
set to track performance during the SGD (or SGD-
like) algorithm

IEOR 242, Fall 2019 - Lecture 24


+ 90

On Selecting the Network


Properties
n Generally speaking, it is not advisable to use cross-
validation, etc. to select the network properties such as
the number of hidden units and the number of layers
n This has some risk of overfitting and is also just generally a
waste of computation/time

n It’s okay to simply experiment with a few different


values for these parameters (often, number of hidden
units in each layer is on the order of the number of
features and number of hidden layers is between 1 and
10)
n In real life applications, a great deal of engineering
effort and domain expertise goes into determining the
network architecture

IEOR 242, Fall 2019 - Lecture 24


+ 91

Baseline Model

n Baseline model always predicts 0 – not a


recognizable poker hand (i.e., a high card hand)

n Test set accuracy of 0.495

Model Type              Test Set Accuracy
Baseline                0.495

IEOR 242, Fall 2019 - Lecture 24


+ 92

LDA Model

n Multiclass classification is simple with LDA, so let’s


try it
n Computation time: a split second
n Test set accuracy: 0.487

Model Type              Test Set Accuracy
Baseline                0.495
LDA                     0.487

IEOR 242, Fall 2019 - Lecture 24


+ 93

Random Forests Model

n Default parameters (no cross validation)

n Computation time: 527.82 sec

n Test set accuracy: 0.655

Model Type              Test Set Accuracy
Baseline                0.495
LDA                     0.487
Random Forests          0.655

IEOR 242, Fall 2019 - Lecture 24


+ 94

Single Hidden Layer Neural Net

n Single hidden layer model with sigmoid activation
function and 100 hidden units (a Keras sketch of this
model appears below)

n Accuracy history during the course of the


algorithm:
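
A sketch of how such a model might be specified with the Keras R interface. The lecture's actual code is not shown here, so the input dimension, epoch count, and batch size below are illustrative assumptions; x_train and y_train are placeholders (with y_train one-hot encoded, e.g., via to_categorical()).

```r
library(keras)

# Single hidden layer with 100 sigmoid units and a 10-class softmax output layer
# input_shape = 10 assumes the 10 raw card features are fed in directly (an assumption)
model <- keras_model_sequential() %>%
  layer_dense(units = 100, activation = "sigmoid", input_shape = c(10)) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = "rmsprop",                # the SGD-like algorithm mentioned earlier
  loss = "categorical_crossentropy",    # cross-entropy loss for multiclass classification
  metrics = c("accuracy")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 40, batch_size = 128,
  validation_split = 0.2                # Keras reserves 20% of the training data for validation
)
```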

IEOR 242, Fall 2019 - Lecture 24


+ 95

Single Hidden Layer Neural Net

n Computation time: 63.22 sec

n Test set accuracy: 0.909

Model Type                  Test Set Accuracy
Baseline                    0.495
LDA                         0.487
Random Forests              0.655
Single Hidden Layer Net     0.909

IEOR 242, Fall 2019 - Lecture 24


+ 96

Single Hidden Layer Neural Net


n Switching sigmoid to ReLU improves things slightly

n Computation time: 65.22 sec

n Test set accuracy: 0.919

Model Type                  Test Set Accuracy
Baseline                    0.495
LDA                         0.487
Random Forests              0.655
Single Hidden Layer Net     0.919

IEOR 242, Fall 2019 - Lecture 24


+ 97

Three Hidden Layer Neural Net

n Model with three hidden layers and ReLU


activation functions

n Accuracy history during the course of the


algorithm:

IEOR 242, Fall 2019 - Lecture 24


+ 98

Three Hidden Layer Neural Net


n Computation time: 67.82 sec

n Test set accuracy: 0.943

Model Type                  Test Set Accuracy
Baseline                    0.495
LDA                         0.487
Random Forests              0.655
Single Hidden Layer Net     0.919
Three Hidden Layer Net      0.943

IEOR 242, Fall 2019 - Lecture 24


+ 99

MNIST: Image Classification

n MNIST is a very famous dataset in image


classification and is popular in the neural network
literature

n The task is to classify images of handwritten digits

n There are 10 classes: 0, 1, 2, … , 9

n 60,000 training images, 10,000 test images

IEOR 242, Fall 2019 - Lecture 24


+ 100

MNIST: Dense Network

n Single hidden layer network with 512 hidden units


and ReLU activation function
n Number parameters = 28*28*512 + 512 + 512*10 + 10 =
407,050

n Accuracy history:

IEOR 242, Fall 2019 - Lecture 24


+ 101

MNIST: Dense Network

n Computation time: 24.77 sec

n Test set accuracy: 0.9797

Model Type                     Test Set Accuracy
Baseline                       0.1135
Single Hidden Layer Network    0.9797

IEOR 242, Fall 2019 - Lecture 24


+ 102

MNIST: Convolutional Network

n 3 convolutional layers, 2 dense layers (ReLU)

n Number parameters = 93,322 (a Keras sketch of a
comparable architecture appears below)

n Accuracy history:
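
A sketch of a comparable architecture in the Keras R interface (an assumption about the exact layer sizes, although with the filter and unit counts below the parameter total does come out to 93,322):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_flatten() %>%                                  # flatten height x width x channels to a vector
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(optimizer = "rmsprop",
                  loss = "categorical_crossentropy",
                  metrics = c("accuracy"))
```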

IEOR 242, Fall 2019 - Lecture 24


+ 103

MNIST: Convolutional Network

n Computation time: 305.99 sec

n Test set accuracy: 0.9915

Model Type                     Test Set Accuracy
Baseline                       0.1135
Single Hidden Layer Network    0.9797
Convolutional Network          0.9915

IEOR 242, Fall 2019 - Lecture 24


+ 104

CTR Prediction

n Let’s try a different example – the CTR prediction


dataset from Week 5 (regression problem)
n Recall that Random Forests and Boosting both had
OSR2 = 0.588
n I tried several different neural network
architectures/ideas and could not get an OSR2 value
better than 0.455 (single hidden layer model)
n It’s not worth the time or effort to continue trying to
get a good neural network model

IEOR 242, Fall 2019 - Lecture 24


+ 105

Neural Nets vs. Other Models


n What are some intuitive differences between the
poker/MNIST examples and the advertising example?

n Generally speaking, neural nets tend to work very well
on problems with high "signal-to-noise" ratios and
where a complicated relationship seems plausible
n This often is the case in vision, speech and language
modeling, etc.

n When there is a relatively larger degree of noise in the


problem and when feature engineering does not seem
to be a critical issue, boosting, random forests, and
other methods tend to dominate
n This often is the case in “business analytics” problems
n It may be possible to get a good neural network model, but it
will usually require an onerous amount of work
IEOR 242, Fall 2019 - Lecture 24
+ 106

Other Issues and Extensions

n There are many other issues that we haven’t


discussed, especially when dealing with
complicated data types like images

n There are also many different extensions to the


basic feedforward model that people consider

n A good practical reference is "Deep Learning with
R" by François Chollet with J. J. Allaire
n Also the same author wrote “Deep Learning with Python”

IEOR 242, Fall 2019 - Lecture 24


+
Some “Big Data” Issues

IEOR 242, Fall 2019 - Lecture 24 107


+ 108

Data in Theory….

y x1 x2 x3 x4 …
y1 x11 x12 x13 x14 …
y2 x21 x22 x23 x24 …
y3 x31 x32 x33 x34 …
y4 x41 x42 x43 x44 …
y5 x51 x52 x53 x54 …


IEOR 242, Fall 2019 - Lecture 24
+ 109

Data in Practice…

IEOR 242, Fall 2019 - Lecture 24


+ 110

Data in Practice

n Data in practice is often unstructured, messy, and


distributed across many different machines and/or
locations

n How do we know if we have a “small data” problem


or a “big data” problem?

IEOR 242, Fall 2019 - Lecture 24


+ 111

The Data Science Lifecycle

n We might say that a “small data” problem is a


problem such that all aspects of the above work
flow may be completed on a single laptop
machine, in memory, and in a reasonable
amount of time

IEOR 242, Fall 2019 - Lecture 24


+ 112

The Data Science Lifecycle, cont.

n Question: What if we have a 50 Gb data file but


after tidying and transforming, we produce a 500
Mb data frame for training our model(s) on?

IEOR 242, Fall 2019 - Lecture 24


+ 113

“Medium Data” Problems

n We might say that a “medium data” problem is a


problem such that all aspects of the above work
flow may be completed on a single laptop machine
in a reasonable amount of time, but perhaps we
won’t be able to keep all of the data in RAM at the
same time

IEOR 242, Fall 2019 - Lecture 24


+ 114

Some “Medium Data” Tools in R


n data.table is a high-performance extension of base R's
data.frame (see the sketch after this list)
n Effective for data around 10 GB or so
n As compared to dplyr, data.table provides similar functionality
but is less natural and less "linguistically friendly"

n ff package allows you to work with data on disk (almost) as if it


were in RAM

n biglm and bigglm support batch updating of linear and logistic


regression models based on new observations

n ranger for more efficient training of Random Forests models

n Perhaps a small subset, subsample, or summary of your medium


dataset is all you need to feed into the training of your ML
model…

n Other solutions: buy more RAM and/or processing power,


integrate closely with a database
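
A small sketch of the kind of workflow these packages enable; the file name, column names, and chunk sizes are hypothetical:

```r
library(data.table)

# fread() reads large delimited files far faster than read.csv(); file name is hypothetical
dt <- fread("transactions.csv")

# Aggregate in place: total amount and row count per customer (column names are hypothetical)
summary_dt <- dt[, .(total = sum(amount), n = .N), by = customer_id]

library(biglm)
# biglm fits a linear model on one chunk of rows; update() folds in further chunks
fit <- biglm(amount ~ n_items + region, data = dt[1:1e6])
fit <- update(fit, dt[(1e6 + 1):2e6])
```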
IEOR 242, Fall 2019 - Lecture 24
+ 115

“Big Data” Problems

n We might say that a “big data” problem is a problem


such that all (or some) aspects of the above work flow
must involve working with data that is distributed
across multiple sources, machines, locations, etc.

n Some aspect of the problem is simply too big or too


complicated to fit on your laptop

IEOR 242, Fall 2019 - Lecture 24


+ 116

Some “Big Data” Tools in R

n Apache Spark™ is a fast and general engine for large-


scale data processing
n A high-level interface for programming clusters, using
MapReduce style ideas, and with implicit data
parallelism and fault tolerance
n Spark is typically run on the cloud
n Includes libraries for SQL, streaming data, machine
learning (stochastic gradient descent), and graph
processing
n You can also write your own applications on top of Spark (see the sketch below)

n Initially developed by the AMPLab at UC Berkeley
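
One common way to drive Spark from R is the sparklyr package (an assumption on my part; the lecture demo may use a different interface). A minimal local-mode sketch, with a hypothetical taxi_df data frame and hypothetical column names:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # "local mode": Spark running on a single machine

# Copy an R data frame into Spark (taxi_df and its columns are placeholders)
taxi_tbl <- copy_to(sc, taxi_df, "taxi", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark
taxi_tbl %>%
  group_by(pickup_hour) %>%
  summarise(avg_fare = mean(fare, na.rm = TRUE))

# MLlib models are exposed through ml_* functions
fit <- ml_linear_regression(taxi_tbl, fare ~ trip_miles + trip_seconds)

spark_disconnect(sc)
```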

IEOR 242, Fall 2019 - Lecture 24


+ 117

Predicting Taxi Fares in Chicago

n We will demo Spark on a problem concerning


predicting taxi cab fares in the city of Chicago
n Dataset obtained from
https://www.kaggle.com/chicago/chicago-taxi-rides-2016

n This dataset is on the cusp of what can be done in R on
my 16 GB RAM laptop
n Hence more of a “medium data” problem
n Parts of the analysis are too computationally intensive to be
done using the packages we have seen so far
n However using Spark, even in “local mode” (i.e., not on a
cluster), enables us to do more

n This may be covered in discussion section in two weeks

IEOR 242, Fall 2019 - Lecture 24


+ 118

Conclusions

n Real life big data problems are messy and


necessarily involve working in a distributed
computing environment
n Spark is a useful tool for machine learning and other tasks
in this setting

n Stochastic gradient descent and other efficient


optimization algorithms enable large-scale
learning
n Neural networks are a powerful modeling
framework that yield state-of-the-art results in
domains such as vision, natural language
processing, and others

IEOR 242, Fall 2019 - Lecture 24
