
+

Lectures 24 and 25 – Large-Scale


Learning and Neural Networks
IEOR 242 – Applications in Data Analysis
Fall 2019 – Paul Grigas
+ 2

Today’s Agenda

n Regularized Loss Function Minimization and


(Stochastic) Gradient Descent

n Intro to Neural Networks

n Training a Neural Network model

n Deep and Convolutional Networks

n Computational Examples

n Some “Big Data” Issues

IEOR 242, Fall 2019 - Lecture 24


+
Regularized Loss Function
Minimization

IEOR 242, Fall 2019 - Lecture 24 3


+ 4

Test Set Performance Summary of


Last Week’s Models
Model                               R2 (w.r.t.          MAE (w.r.t.    Number of Non-zero
                                    Log(Sale Price))    Sale Price)    Coefficients
"Naïve" Linear Regression Model     0.7484              $26,037        529
"Common Sense" Linear Reg. Model    0.8324              $20,003        25
Random Forest                       0.8304              $18,906        -
Principal Components Regression     0.8604              $17,828        - (529, sort of)
Ridge Regression                    0.8870              $14,931        529
LASSO                               0.8598              $18,437        126

(Note: Random Forests did not use cross validation)


IEOR 242, Fall 2019 - Lecture 24
+ 5

Summary of Last Week

n PCR, Ridge Regression, and LASSO can improve


prediction accuracy over ordinary linear
regression, especially when p ≈ n

n LASSO performs variable selection and informs us


about which independent variables are the most
useful for predicting the dependent variable

n Linear models are interpretable and competitive


with nonlinear models (random forests, boosting,
etc.)

IEOR 242, Fall 2019 - Lecture 24


+ 6

What about classification?

n PCR, Ridge Regression, and LASSO can all easily
be adapted to logistic regression, and the glmnet
package in R handles this as well (see the sketch below)

n In fact, regularization is a very general concept that
can be applied in many different contexts…
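
A minimal glmnet sketch of these classification versions, assuming a numeric feature matrix x and a 0/1 outcome vector y (both placeholders, not objects from the lecture code):

```r
library(glmnet)

# x: numeric feature matrix, y: binary outcome coded 0/1 (placeholder objects)
# family = "binomial" gives regularized logistic regression;
# alpha = 0 is the ridge penalty, alpha = 1 is the LASSO penalty
ridge_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
lasso_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients and predicted probabilities at the cross-validated lambda
coef(lasso_fit, s = "lambda.min")
predict(lasso_fit, newx = x, s = "lambda.min", type = "response")
```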

IEOR 242, Fall 2019 - Lecture 24


+ 7

Machine Learning as Loss Function


Minimization
n A common format for many machine learning
methods is to set up the “learning algorithm” as an
optimization problem where we want to minimize a
loss function of the training data plus a
regularization penalty term

n Let’s examine this more closely in the case where


the predictions generated by the method are
based on linear functions of the features (as in
linear regression and logistic regression)

IEOR 242, Fall 2019 - Lecture 24


+ 8

Machine Learning as Loss Function


Minimization, cont.
n The setup consists of:

n Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where each
$x_i \in \mathbb{R}^p$ is a feature vector and $y_i$ is the response

n Predictions are constructed from a linear function of
the features, $\beta^\top x$
n (E.g., in linear regression the prediction is $\beta^\top x$ itself; what about logistic regression?)

n Two ways to handle the intercept term: (i) construct a
feature column equal to all 1s or (ii) center the features

n $\beta = (\beta_1, \ldots, \beta_p)$ are the model coefficients that we need to fit

n $\ell(y, \beta^\top x)$ is a loss function that measures the error or loss
incurred when the truth is actually $y$ but our prediction was
based on $\beta^\top x$
Reminder: $\beta^\top x = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
IEOR 242, Fall 2019 - Lecture 24
+ 9

Examples of Loss Functions

n $\ell(y, \beta^\top x)$ is a loss function that measures the error
or loss incurred when the truth is actually $y$ but
our prediction was based on $\beta^\top x$

n In regression, the standard loss function is the
least-squares (RSS) loss: $\ell(y, \beta^\top x) = (y - \beta^\top x)^2$

n In classification, there are a few options

IEOR 242, Fall 2019 - Lecture 24


+ 10

Loss Functions for Classification

n Suppose that we've set up the problem so that the
training data consists of features $x_1, \ldots, x_n$ and
class labels $y_1, \ldots, y_n$ that are either -1 or +1

n This format is convenient because most loss
functions for classification are designed so that it
makes sense to predict according to the rule:
predict $+1$ if $\beta^\top x > 0$ and $-1$ if $\beta^\top x < 0$

IEOR 242, Fall 2019 - Lecture 24


+ 11

Prediction Rule for Classification

n This prediction rule can be compactly summarized
as: $\hat{y} = \mathrm{sign}(\beta^\top x)$,

where $\mathrm{sign}(\cdot)$ is the function that returns $+1$ for
positive arguments and $-1$ for negative arguments

IEOR 242, Fall 2019 - Lecture 24


+ 12

Loss Functions for Classification,


cont.
n 0-1 loss function: $\ell(y, \beta^\top x) = \mathbf{1}\{y \cdot \beta^\top x \le 0\}$
(i.e., the loss is 1 if the prediction is wrong and 0 otherwise)

n Logistic loss function: $\ell(y, \beta^\top x) = \log\!\left(1 + e^{-y \beta^\top x}\right)$

n (The logistic loss is continuous, which makes it much
easier to minimize than the 0-1 loss)

IEOR 242, Fall 2019 - Lecture 24


+ 13

Loss Functions for Classification,


cont.
n Hinge loss function: $\ell(y, \beta^\top x) = \max\{0,\, 1 - y \beta^\top x\}$

n Hinge loss arises in the context of support vector


machines (SVMs)

n In fact (after some geometric arguments), hinge
loss + ridge penalty is exactly the SVM method, i.e.,
minimizing the hinge loss in this regularized framework
is equivalent to fitting an SVM (see the sketch below)
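
As a small illustration, the three classification losses can be written as functions of the margin $y \cdot \beta^\top x$ (my own vectorized R versions of the formulas above):

```r
# Margin m = y * (x %*% beta), with labels y coded as -1 / +1
zero_one_loss <- function(m) as.numeric(m <= 0)   # 1 if the sign-based prediction is wrong
logistic_loss <- function(m) log(1 + exp(-m))     # smooth surrogate used by logistic regression
hinge_loss    <- function(m) pmax(0, 1 - m)       # the SVM (hinge) loss

# Evaluating these on a grid of margins reproduces the comparison plotted two slides ahead
m <- seq(-3, 3, by = 0.01)
losses <- data.frame(m, zero_one = zero_one_loss(m),
                     logistic = logistic_loss(m), hinge = hinge_loss(m))
```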

IEOR 242, Fall 2019 - Lecture 24


+ 14

Loss Functions for Classification,


cont.
n The 0-1 loss function is the “ideal” loss function for
achieving high accuracy, because minimizing the
average of the 0-1 loss on the training data is the
same as maximizing accuracy on the training data
n However, there are computational difficulties associated
with minimizing the 0-1 loss

n The logistic loss function seems weird, but it arises


from algebraic manipulations of the maximum
likelihood estimation formulation for logistic
regression

IEOR 242, Fall 2019 - Lecture 24


+ 15

Loss Functions for Classification,


cont.
[Plot: the 0-1 loss, logistic loss, and hinge loss drawn on the same axes
(horizontal axis from -3 to 3, vertical axis from 0 to 4)]

IEOR 242, Fall 2019 - Lecture 24


+ 16

Machine Learning as Loss Function


Minimization
n Learning algorithms are often formatted as:
minimize loss function on training data +
regularization penalty

n For example, $\lambda \sum_{j=1}^p \beta_j^2$ is the ridge penalty, and
$\lambda \sum_{j=1}^p |\beta_j|$ is the LASSO penalty
n There is another penalty called “elastic net” which
is a combination of ridge and LASSO – see glmnet
docs
IEOR 242, Fall 2019 - Lecture 24
+ 17

Regularized Loss Function


Minimization

n The methods studied last week correspond to the least-


squares loss function with ridge and LASSO penalties

n If we use the logistic loss for a classification problem,


we get “ridge and LASSO logistic regression”
n Both of these extensions inherit many of the properties of their
respective linear regression versions

n We will now use the above problem format to study


large-scale learning methods

IEOR 242, Fall 2019 - Lecture 24


+ Regularized Loss Function
Minimization and (Stochastic)
Gradient Descent

IEOR 242, Fall 2019 - Lecture 24 18


+ 19

Machine Learning as Loss Function


Minimization
n Learning algorithms are often formatted as:
minimize loss function on training data +
regularization penalty

n For example:
n $\lambda \sum_{j=1}^p \beta_j^2$ is the ridge penalty

n $\lambda \sum_{j=1}^p |\beta_j|$ is the LASSO penalty

IEOR 242, Fall 2019 - Lecture 24


+ 20

Machine Learning as Loss Function


Minimization, cont.
n Let's define the objective function as:
$L(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \beta^\top x_i) + \lambda R(\beta)$

n …so that our learning problem is: $\min_{\beta}\; L(\beta)$

n (We assume that $L(\beta)$ is differentiable globally)


n How should we solve this problem?

IEOR 242, Fall 2019 - Lecture 24


+ 21

Gradients
n Recall the partial derivative of $L$ with respect
to $\beta_j$, denoted $\frac{\partial L(\beta)}{\partial \beta_j}$

n I can consider a p-dimensional vector of these
quantities, called the gradient of $L$ at $\beta$
n This is denoted $\nabla L(\beta)$, and its jth component is $\frac{\partial L(\beta)}{\partial \beta_j}$

n The vector $\nabla L(\beta)$ points in the direction of
steepest increase of the function $L$ at $\beta$

IEOR 242, Fall 2019 - Lecture 24


+ 22

Optimality Conditions
n Recall our optimization problem of interest: $\min_{\beta}\; L(\beta)$

n Under some regularity conditions (e.g., the
function $L$ is convex), $\beta^*$ solves this
problem if and only if $\nabla L(\beta^*) = 0$

n More generally, all local minimizers satisfy $\nabla L(\beta) = 0$

n This result suggests that we should look for points


where the gradient is equal to 0, but there are a few
issues…
n Most notably, how can we find such points efficiently?
IEOR 242, Fall 2019 - Lecture 24
+ 23

Gradient Descent

n Basic idea: at any point in


the domain, take a slight
step in the direction of
steepest descent, i.e., the
direction where the function
has the maximum rate of
decrease, i.e., the direction
of minus the gradient

IEOR 242, Fall 2019 - Lecture 24


+ 24

Gradient Descent, cont.

n Problem of Interest: $\min_{\beta}\; L(\beta)$

n Step 1.) Initialize at some point $\beta^0$ and set the step-
size parameter $\alpha > 0$
n (The initial point $\beta^0$ and the step size $\alpha$ are additional hyperparameters)

n Step 2.) For t = 0, 1, …, T perform the update:
$\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$

n Two issues: how do we select T and how do we


select the step-size parameter? Ideas?
n (One idea: choose them by cross-validation, or stop once the norm of the gradient,
or the change in the objective between iterations, falls below a threshold; see the sketch below)
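
A toy illustration of the method for ridge-regularized least squares (my own sketch; the step size, iteration count, and regularization level below are arbitrary placeholder values):

```r
# Gradient descent for L(beta) = (1/n) * sum((y - X %*% beta)^2) + lambda * sum(beta^2)
gradient_descent <- function(X, y, lambda = 0.1, alpha = 0.01, n_iter = 1000) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))                        # initialize (any point works here: L is convex)
  for (t in 1:n_iter) {
    resid <- as.vector(X %*% beta) - y
    grad  <- (2 / n) * as.vector(t(X) %*% resid) + 2 * lambda * beta
    beta  <- beta - alpha * grad                 # step in the direction of minus the gradient
  }
  beta
}
# Usage: beta_hat <- gradient_descent(X, y) for a numeric matrix X and response vector y
```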

IEOR 242, Fall 2019 - Lecture 24


+ 25

Gradients in Machine Learning

n Recall that our objective function is
$L(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \beta^\top x_i) + \lambda R(\beta)$

n How do I compute the gradient at $\beta$? Simple!
Differentiate term by term:
$\nabla L(\beta) = \frac{1}{n}\sum_{i=1}^n \nabla_\beta\, \ell(y_i, \beta^\top x_i) + \lambda \nabla R(\beta)$

IEOR 242, Fall 2019 - Lecture 24


+ 26

Gradients in Machine Learning,


cont.

n It turns out that $\nabla R(\beta)$ is typically quite simple to
compute, e.g., $\nabla R(\beta) = 2\beta$ for the ridge penalty $R(\beta) = \sum_j \beta_j^2$

n Recall that the loss function $\ell(y_i, \beta^\top x_i)$ takes the value of the
prediction $\beta^\top x_i$ as input, hence its gradient
depends on how $\ell$ changes when $\beta^\top x_i$ changes
and how $\beta^\top x_i$ changes when $\beta$ changes.
By the chain rule, we have:
$\nabla_\beta\, \ell(y_i, \beta^\top x_i) = \ell'(y_i, \beta^\top x_i)\, x_i$,
where $\ell'$ denotes the derivative of the loss with respect to its second argument

IEOR 242, Fall 2019 - Lecture 24


+ 27

Gradients in Machine Learning,


cont.
n This formula is extremely convenient – it says that
the gradient of the loss of the ith data point is
proportional to the ith feature vector and the
proportionality constant is equal to the number $\ell'(y_i, \beta^\top x_i)$

n Then, recalling the formula
$L(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \beta^\top x_i) + \lambda R(\beta)$:

n …what is the procedure for computing $\nabla L(\beta)$? (See the sketch below)
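
To spell out the procedure (a sketch in the notation reconstructed above, writing $\ell'$ for the derivative of the loss in its second argument): compute the $n$ scalars $\ell'(y_i, \beta^\top x_i)$, take the corresponding weighted average of the feature vectors, and add the gradient of the penalty:

$$ \nabla L(\beta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell'\big(y_i, \beta^\top x_i\big)\, x_i \;+\; \lambda\, \nabla R(\beta) $$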

IEOR 242, Fall 2019 - Lecture 24


+ 28

Gradient Descent, cont.

n For t = 0, 1, …, T perform the update:
$\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$

n Given the procedure for computing the gradient of
the training set loss $L(\beta)$, each iteration of the
method requires a full pass over the training set

n Consider a situation where p = 10 and n =


1,000,000
n This seems to be quite an inefficient use of the training
data…

IEOR 242, Fall 2019 - Lecture 24


+ 29

Stochastic Gradient Descent

n Key idea: take the gradient descent update rule:
$\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$

n …and replace the computation of $\nabla L(\beta^t)$ with the
gradient at a randomly selected single data point

n That is, sample an index i uniformly at random
from 1, 2, …, n and set
$\beta^{t+1} = \beta^t - \alpha\left[\nabla_\beta\, \ell(y_i, \beta^\top x_i)\big|_{\beta = \beta^t} + \lambda \nabla R(\beta^t)\right]$
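
A matching toy sketch of SGD for the same ridge-regularized least-squares problem as before (again my own code with arbitrary placeholder values):

```r
# Stochastic gradient descent: one randomly chosen observation per update
sgd <- function(X, y, lambda = 0.1, alpha = 0.01, n_iter = 10000) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (t in 1:n_iter) {
    i <- sample(n, 1)                             # sample an index uniformly at random
    resid_i <- sum(X[i, ] * beta) - y[i]
    grad_i  <- 2 * resid_i * X[i, ] + 2 * lambda * beta   # unbiased estimate of the full gradient
    beta <- beta - alpha * grad_i
  }
  beta
}
```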

IEOR 242, Fall 2019 - Lecture 24


+ 30

Stochastic Gradient Descent, cont.

n This update rule has the property that the
expected value of the term in brackets is equal to
the deterministic quantity $\nabla L(\beta^t)$

n The path, however, will be dramatically different…

IEOR 242, Fall 2019 - Lecture 24


+ 31

Stochastic Gradient Descent, cont.


n Gradient descent and stochastic gradient descent provide the
basic foundation for many large-scale machine learning
methods
n Linear and logistic regression, as well as SVMs, neural networks, …

n SGD is most directly applicable when n is gigantic and p is


moderate
n When p is also large, adding randomization to coordinate updates also
helps

n Sampling a single data point i at each iteration can easily be


replaced with a small batch of randomly selected data points
(say 5, or 10)

n There are a whole host of other issues that have received recent
attention: accelerating the rate of convergence, persevering
“structure”, asynchronous and/or parallel updates, …

n These methods also form the basis of the machine learning
library (MLlib) in Spark
IEOR 242, Fall 2019 - Lecture 24
+
Neural Networks

IEOR 242, Fall 2019 - Lecture 24 32


+ 33

What is a Neural Network?

n “Mysterious” Question: What is a neural network?

n Straightforward Answer: A neural network is a


type of nonlinear statistical model

IEOR 242, Fall 2019 - Lecture 24


+ 34

Neural Network Applications

n Deep learning and neural networks have transformed


certain areas of data science in the last decade:
n Image recognition
n Speech recognition
n Machine translation
n Autonomous driving

n Neural network ideas have been around since the


1980s, but there has been a recent surge in popularity
n Availability of huge datasets
n Graphics processing units (GPUs) speeding up computation
(among other general improvements in computation)

IEOR 242, Fall 2019 - Lecture 24


+ 35

“Deep” Neural Networks


n Deep neural networks and other deep learning
architectures have been used to produce state-of-the-art
results in domains like computer vision, speech
recognition, natural language processing, and audio
recognition

n In these domains, the complicated functions fit by deep


networks are reasonable surrogates for reality

n Additionally, the deep networks are able to “automatically”


perform feature engineering, which used to be an onerous
task in these domains

n We will start with a basic single hidden layer network, and


then we will easily extend this to a deep feedforward
network
IEOR 242, Fall 2019 - Lecture 24
+ 36

Neural Networks: Pros and Cons

Pros:

n Delivers state-of-the-art predictions in multiple
domains

n Scales to massive datasets (both in the number of
observations and the number of variables)

n "Hot area" in machine learning

Cons:

n Neural Network modeling involves many parameters
and structural design decisions

n Fitting a (deep) neural network model can be an
engineering challenge

n NNs are often uninterpretable

n NNs often require huge datasets in order to work
really well

IEOR 242, Fall 2019 - Lecture 24


+ 37

Neural Network Basics


n A neural network is a type of nonlinear prediction
function $f(\cdot)$ that maps p-dimensional
feature vectors $x$ to K-dimensional outputs $f(x)$

n In a regression problem, the output is one-dimensional and
the prediction is the same as the output: $\hat{y} = f(x)$
n In a binary classification problem, the prediction is given
by the rule: $\hat{y} = \mathrm{sign}(f(x))$
n In a multiclass classification problem, the output is typically
interpreted as a probability vector:
$f(x) = (f_1(x), \ldots, f_K(x))$,

where $f_k(x)$ is interpreted as an estimate of the conditional
probability that the dependent variable is in class k

IEOR 242, Fall 2019 - Lecture 24


+ 38

Neural Network Basics, cont.

n A neural network is a nonlinear function $f(\cdot)$
defined by a directed graph organized in layers

n The first layer is the input layer, which takes in $x$

n The last layer is the output layer, which returns $f(x)$

n The inputs are "fed forward" through the network
in order to compute the output
IEOR 242, Fall 2019 - Lecture 24
+
Single Hidden Layer Neural
Nets

IEOR 242, Fall 2019 - Lecture 24 39


+ 40

A Single Hidden Layer Neural


Network
[Diagram: a network with an input layer, one hidden layer, and an output layer]

IEOR 242, Fall 2019 - Lecture 24


+ 41

Key Ingredients

n Each node has an input and an output

n Each edge has a weight value associated with it

n Each node also has a “bias term” associated with it

n The nodes in the input layer take in the feature


values

IEOR 242, Fall 2019 - Lecture 24


+ 42

Key Ingredients, cont.

n Each node has an input and an output

n Each edge has a weight value associated with it

n Each node also has a “bias term” associated with it

n The function value is computed by feeding


values through the graph via weighted linear
combinations and nonlinear “activation” functions

n Let’s see how this works at each of the three layers…

IEOR 242, Fall 2019 - Lecture 24


+ 43

Input Layer

n The nodes in the input layer take in the feature


values

n If the input to a node in layer 1 is equal to x, then the


output is simply x

IEOR 242, Fall 2019 - Lecture 24


+ 44

Hidden Layer

n Let M be the total number of nodes in the hidden layer

n Notation:
n Let $w_{ij}$ be the weight from the ith node in the input layer to the
jth node in the hidden layer
n Let $z_j$ be the input to the jth node in the hidden layer
n Let $b_j$ be the bias term of the jth node in the hidden layer

n The input to each node in the hidden layer is determined as
a weighted combination of the outputs of the input layer
plus the bias term: $z_j = b_j + \sum_{i=1}^p w_{ij} x_i$

IEOR 242, Fall 2019 - Lecture 24


+ 45

Hidden Layer, cont.

n The output of each node in the hidden layer is


determined by applying a scalar nonlinear
“activation” function to its input

n Let $a_j$ be the output of the jth node in the hidden
layer; then $a_j = g(z_j)$, where $g(\cdot)$ is the activation function

IEOR 242, Fall 2019 - Lecture 24


+ 46

Example Activation Functions

n Sigmoid: $g(z) = \dfrac{1}{1 + e^{-z}}$

[Plot: the sigmoid function for $-5 \le z \le 5$, increasing from 0 to 1 with value 0.5 at $z = 0$]

IEOR 242, Fall 2019 - Lecture 24


+ 47

Example Activation Functions

n ReLU: $g(z) = \max\{0, z\}$

[Plot: the ReLU function for $-5 \le z \le 5$, equal to 0 for negative inputs and to $z$ for positive inputs]

IEOR 242, Fall 2019 - Lecture 24


+ 48

Example Activation Functions

n Threshold/Step: $g(z) = 1$ if $z > 0$, and $g(z) = 0$ otherwise

[Plot: the step function for $-5 \le z \le 5$, jumping from 0 to 1 at $z = 0$]

IEOR 242, Fall 2019 - Lecture 24


+ 49

Activation Function

n The activation function is a crucial part of the


neural nets methodology because it introduces
nonlinearities

n Without the activation function, neural nets would


just reduce to an excessively complicated version
of linear regression

IEOR 242, Fall 2019 - Lecture 24


+ 50

Output Layer

n In general the output layer will have K nodes

n Notation:
n Let $v_{jk}$ be the weight from the jth node in the hidden layer
to the kth node in the output layer
n Let $u_k$ be the input to the kth node in the output layer
n Let $c_k$ be the bias term of the kth node in the output layer

n Again, the input to each node in the output layer is
determined as a weighted combination of the outputs of
the hidden layer plus a bias term: $u_k = c_k + \sum_{j=1}^M v_{jk} a_j$

IEOR 242, Fall 2019 - Lecture 24


+ 51

Output Layer, cont.

n The output of each node in the output layer (i.e.,


the final output of the model) is determined again
by applying an activation function

n Let $f_k(x)$ be the kth final output of the neural net
model; then $f_k(x) = g_{\text{out}}(u_k)$ for some output-layer
activation function $g_{\text{out}}(\cdot)$ (a forward-pass sketch appears below)
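
To make the three layers concrete, here is a sketch of the forward pass for a single hidden layer network in the (reconstructed) notation above; the argument names W, b_hidden, V, b_out are my own and not from the lecture code:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass for one observation x (a length-p feature vector)
# W: p x M input-to-hidden weights,  b_hidden: length-M hidden-layer biases
# V: M x K hidden-to-output weights, b_out: length-K output-layer biases
forward_pass <- function(x, W, b_hidden, V, b_out,
                         activation = sigmoid, output_activation = identity) {
  z <- as.vector(t(W) %*% x) + b_hidden    # inputs to the hidden layer
  a <- activation(z)                       # outputs of the hidden layer
  u <- as.vector(t(V) %*% a) + b_out       # inputs to the output layer
  output_activation(u)                     # final K-dimensional output f(x)
}
```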

IEOR 242, Fall 2019 - Lecture 24


+ 52

Activation Function in the Output


Layer
n In regression problems, usually we have K = 1 and
the output layer activation function is simply the
identity: $g_{\text{out}}(u) = u$, so that $f(x) = u_1$

n This is also the case for binary classification if one
uses the prediction rule $\hat{y} = \mathrm{sign}(f(x))$

IEOR 242, Fall 2019 - Lecture 24


+ 53

Activation Function in the Output


Layer, cont.
n In multiclass classification problems, we actually allow
the activation function in the last layer to depend on the
entire vector of inputs to the last layer

n The output of each node is determined by the softmax
activation function:
$f_k(x) = \dfrac{e^{u_k}}{\sum_{l=1}^K e^{u_l}}$

n Softmax is well suited for multiclass classification since
the output $(f_1(x), \ldots, f_K(x))$ is a valid probability
distribution (see the sketch below)
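
A small sketch of the softmax computation (the max subtraction is a standard numerical-stability trick, not something from the slides):

```r
# Softmax: maps the vector of output-layer inputs u to a probability vector
softmax <- function(u) {
  e <- exp(u - max(u))   # subtracting max(u) avoids overflow and leaves the result unchanged
  e / sum(e)
}

softmax(c(2, 1, 0.1))    # returns three probabilities that sum to 1
```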
IEOR 242, Fall 2019 - Lecture 24
+
Training a Neural Network
Model

IEOR 242, Fall 2019 - Lecture 24 54


+ 55

Training a Neural Network Model

n We have now fully described how a single hidden layer


neural network model computes its function values
n The model relies on specifying:
n The number of nodes in the hidden layer
n The activation functions
n Edge weight values
n Node bias terms

n For now, let’s suppose that the number of nodes in the


hidden layer and the activation functions have already
been decided upon
n We will now determine the edge weights and bias
terms based on training data by solving an optimization
problem…
IEOR 242, Fall 2019 - Lecture 24
+ 56

Training a Neural Network Model


via Loss Function Minimization
n We will appeal to the same loss function minimization
framework that we previously considered for linear
models

n Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

n Predictions are constructed from a neural network
model $f(\cdot)$ of the features

n The edge weights and bias terms are the model parameters that we need to fit

n $\ell(y, f(x))$ is a loss function that measures the error or
loss incurred when the truth is actually $y$ but our
prediction was based on $f(x)$

IEOR 242, Fall 2019 - Lecture 24


+ 57

Examples of Loss Functions

n In regression, as before the standard loss function
is the least-squares (RSS) loss: $\ell(y, f(x)) = (y - f(x))^2$

n In binary classification, we might again consider
the logistic loss function: $\ell(y, f(x)) = \log\!\left(1 + e^{-y f(x)}\right)$

IEOR 242, Fall 2019 - Lecture 24


+ 58

Examples of Loss Functions, cont.

n Consider multiclass classification with K classes: $y_i \in \{1, 2, \ldots, K\}$

n Presuming that we use the softmax activation in the
output layer (hence $f(x) = (f_1(x), \ldots, f_K(x))$ is a vector of
conditional probabilities), it makes sense to use the
cross-entropy loss function:
$\ell(y, f(x)) = -\sum_{k=1}^K \mathbf{1}\{y = k\}\, \log f_k(x) = -\log f_y(x)$

IEOR 242, Fall 2019 - Lecture 24


+ 59

Training a Neural Network Model


via Loss Function Minimization
n Let's define the objective function as (writing $\theta$ for the full
collection of edge weights and bias terms):
$L(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda R(\theta)$

n Note that the structure of the function $f(\cdot)$
implicitly depends on the parameters $\theta$

n $R(\theta)$ is a regularization (e.g., ridge) term like
before

IEOR 242, Fall 2019 - Lecture 24


+ 60

Training a Neural Network Model


via Loss Function Minimization
n Our learning problem is now the optimization
problem: $\min_{\theta}\; L(\theta)$

n How should we solve this problem?

n Stochastic gradient descent!

IEOR 242, Fall 2019 - Lecture 24


+ 61

Computing the Gradient

n We need to be able to compute the gradient of the


loss at any given data point with respect to the
parameters
n The gradient computation is not as straightforward
as with a linear model
n Luckily, there is a clever algorithm called back
propagation that can be used to efficiently
compute the gradient
n Key idea is a forward pass to compute the prediction
values, then a backwards pass to compute the rate of
change

IEOR 242, Fall 2019 - Lecture 24


+ 62

Back Propagation Illustrated

n Let's consider the case of regression with the
squared error loss $\ell(y, f(x)) = (y - f(x))^2$

IEOR 242, Fall 2019 - Lecture 24


+ 63

Back Propagation Illustrated

n More notation:

n Therefore:

IEOR 242, Fall 2019 - Lecture 24


+ 64

Back Propagation Illustrated

n Our Task:

Given values of the edge weights and bias terms, compute
the partial derivatives of the loss with respect to each of
those parameters

IEOR 242, Fall 2019 - Lecture 24


+ 65

Forward Pass

n Given the specified values of the parameters and
the data observation, we can run a forward pass
through the network to compute the input and output of
every node, and hence the prediction

IEOR 242, Fall 2019 - Lecture 24


+ 66

Partial Derivatives: Output Layer

n Forward pass yields:

n Chain rule yields:

IEOR 242, Fall 2019 - Lecture 24


+ 67

Partial Derivatives: Hidden Layer

n Forward pass yields:

n We can pass back to each node in the hidden layer


the value:

IEOR 242, Fall 2019 - Lecture 24


+ 68

Partial Derivatives: Hidden Layer

n From the backward pass, each hidden node has received
the rate of change of the loss with respect to its output

n Chain rule (again) yields the partial derivatives with respect
to the hidden-layer weights and bias terms (a worked version
of this calculation, in generic notation, is sketched below)
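
Since the slide formulas did not survive the export, here is a sketch of the whole calculation in the (reconstructed) notation used earlier, for a single observation $(x, y)$, a single output node with input $u = c + \sum_j v_j a_j$ and prediction $f(x) = g_{\text{out}}(u)$, hidden nodes with inputs $z_j$ and outputs $a_j = g(z_j)$, and squared-error loss $\ell = (y - f(x))^2$:

$$ \delta := \frac{\partial \ell}{\partial u} = -2\big(y - f(x)\big)\, g_{\text{out}}'(u), \qquad \frac{\partial \ell}{\partial v_j} = \delta\, a_j, \qquad \frac{\partial \ell}{\partial c} = \delta $$

$$ \delta_j := \frac{\partial \ell}{\partial z_j} = \delta\, v_j\, g'(z_j), \qquad \frac{\partial \ell}{\partial w_{ij}} = \delta_j\, x_i, \qquad \frac{\partial \ell}{\partial b_j} = \delta_j $$

The forward pass supplies $u$, the $z_j$, the $a_j$, and $f(x)$; the backward pass then sends $\delta$ (scaled by $v_j$) back to each hidden node, which is all that is needed to assemble every partial derivative.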

IEOR 242, Fall 2019 - Lecture 24


+ 69

Back Propagation

n The main ideas of back propagation easily extend


to networks with more layers as well

n Key idea is a forward pass to compute the


prediction values, then a backwards pass to
compute the derivative

IEOR 242, Fall 2019 - Lecture 24


+ 70

Other Optimization Issues

n 1.) Non-convexity: Unlike the case of linear


models, neural nets usually produce non-convex
optimization problems, which are more difficult to
solve in theory and in practice. The punchline is
that SGD may not always converge to the global
minimizer.

n 2.) Initialization: The initial values for the


parameters are usually selected as random values
near zero.

IEOR 242, Fall 2019 - Lecture 24


+ 71

Other Optimization Issues, cont.

n 3.) Regularization: A regularization penalty is not


always included with neural nets. The reason is
that the SGD algorithm does some sort of “implicit
regularization” on its own and the number of
iterations can also serve as a tuning parameter.

n 4.) Other Algorithms: There are other options for


algorithms besides SGD, and many bells and
whistles might be considered. (In the code, we use
“RMSProp”)

IEOR 242, Fall 2019 - Lecture 24


+
Deep Networks and Other
Extensions

IEOR 242, Fall 2019 - Lecture 24 72


+ 73

Deep Networks
n Deep neural networks are simply obtained by
adding more layers
[Diagram: a network with an input layer, several hidden layers, and an output layer]

IEOR 242, Fall 2019 - Lecture 24


+ 74

Deep Networks, cont.

n The mathematical description for computing the


function value of a fully connected deep network is
exactly analogous to what we saw before in the
single hidden layer case

n Training the network also follows the same general


principles described previously

n It is also easy to extend the layered model to an


arbitrary directed acyclic graph

IEOR 242, Fall 2019 - Lecture 24


+ 75

Some Advanced Architectures

n Recurrent neural networks and LSTM (long short-
term memory) models introduce time and memory
considerations

n Convolutional layers are useful for processing


image data and simplifying it down to something
that can then be fed into several fully connected
layers

n…

IEOR 242, Fall 2019 - Lecture 24


+ 76

Convolutional Networks

n Convolutional networks are especially useful for


dealing with image data
n Suppose that our input is an image of size h x w
pixels
n So the dimension of the feature vector is p = h x w
n Then, if there are M nodes in the hidden layer, there will be
h x w x M weight values that need to be trained
n This can quickly create computational and statistical issues

n Convolutional networks overcome this challenge


by exploiting sparsity, local structure, and
parameter sharing

IEOR 242, Fall 2019 - Lecture 24


+ 77

Convolutional Networks, cont.

n Simplest case: let’s transform an h x w pixel image


(our input) to another layer that also has h x w
nodes

IEOR 242, Fall 2019 - Lecture 24


+ 78

Convolutional Networks, cont.

n We will operate with a 3 x 3 window that scans over


the original image and computes a dot product of
the neighboring pixels with a fixed 3 x 3 matrix of
weights:

$a_{ij} = \sum_{k=1}^{3}\sum_{l=1}^{3} w_{kl}\, x_{i+k-2,\; j+l-2}$

where $[w_{kl}]$ is the fixed 3 x 3 weight matrix
$\begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix}$
and $x$ denotes the pixel values of the original image

n This operation is called a convolution

IEOR 242, Fall 2019 - Lecture 24


+ 79

Convolutional Networks, cont.


n Each pixel value in the new image is computed by
taking a convolution with the weight matrix and
then applying an activation function (note that we
pad the borders with zeros)
[Figure: the 3 x 3 weight matrix $[w_{kl}]$ sliding over the zero-padded image to produce each pixel of the new layer]
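
A toy version of this operation in R (my own sketch, not the lecture's code), combining the 3 x 3 dot product, the zero padding, and the activation described above:

```r
# img: h x w numeric matrix; W: 3 x 3 weight matrix; activation applied elementwise
conv2d_same <- function(img, W, activation = function(z) pmax(0, z)) {   # ReLU by default
  h <- nrow(img); w <- ncol(img)
  padded <- matrix(0, h + 2, w + 2)                 # "same" zero padding around the border
  padded[2:(h + 1), 2:(w + 1)] <- img
  out <- matrix(0, h, w)
  for (i in 1:h) {
    for (j in 1:w) {
      patch <- padded[i:(i + 2), j:(j + 2)]         # 3 x 3 neighborhood of pixel (i, j)
      out[i, j] <- sum(patch * W)                   # the convolution (dot product with the weights)
    }
  }
  activation(out)
}
```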

IEOR 242, Fall 2019 - Lecture 24


+ 80

Max Pooling

n A convolutional layer is usually followed by a


pooling step to reduce the size of the image

n The same window based approach is used but now


we take the maximum of all values in each window

n (The bigger the pooling window, the more the image is summarized, but the less local structure is preserved; see the sketch below)
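
And a matching toy sketch of max pooling; the 2 x 2 window and stride of 2 are typical defaults, not values specified on the slide:

```r
# img: h x w matrix with h and w even; returns an (h/2) x (w/2) matrix
max_pool_2x2 <- function(img) {
  h <- nrow(img); w <- ncol(img)
  out <- matrix(0, h / 2, w / 2)
  for (i in 1:(h / 2)) {
    for (j in 1:(w / 2)) {
      out[i, j] <- max(img[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)])   # max over each 2 x 2 window
    }
  }
  out
}
```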
IEOR 242, Fall 2019 - Lecture 24
+ 81

Convolutional Networks: Key Ideas

n 1.) Each layer of the network is an image

n 2.) Convolutional “layers” link the images together by


the previously described process

n 3.) A convolutional layer is usually followed by a max


pooling layer that reduces the size of the image

n 4.) Sparsity and local structure through the


properties of convolution

n 5.) Parameter sharing (and invariance) since there is


only a single 3 x 3 weight matrix

IEOR 242, Fall 2019 - Lecture 24


+ 82

Convolutional Networks:
Extensions
n Of course the window size can be set however you
like
n Also different options for how to deal with the
border cases (here we used “same zero padding”)
n Most importantly, this can be extended to height x
width x channels
n The number of channels can vary throughout the network
and is also referred to as the number of filters

n These ideas can also be applied to 1-D data (e.g.,


time series)

IEOR 242, Fall 2019 - Lecture 24


+ 83

Convolutional Networks: Final


Steps
n After several convolutional/max pooling layers, it is
important to have at least one (and typically a few) dense layers
n First you need to flatten the height x width x channels 3-
D object into a 1-D vector
n Then add dense layers as discussed before
n Finally, another popular idea is to use pre-trained
values for the weights in the earlier layers instead of
training them from scratch
n These pre-trained values are usually based on other datasets
and so this is a form of transfer learning
n See Chapter 5 of the “Deep Learning with R” book for more
details

IEOR 242, Fall 2019 - Lecture 24


+
Computational Examples

IEOR 242, Fall 2019 - Lecture 24 84


+ 85

Keras

n We will use the Keras package in R

n Keras is a “high-level neural networks API


developed with a focus on enabling fast
experimentation”

n Easy to implement feed forward, convolutional, or


recurrent networks (or any combination)

n Easy to run on CPU or GPU

n Uses a backend such as TensorFlow

IEOR 242, Fall 2019 - Lecture 24


+ 86

Classifying Poker Hands

n We will train a neural network model in Keras to


predict the best poker hand achievable based on
the cards that were dealt

n What kind of learning problem is this?

n How do you expect a neural network to perform?

IEOR 242, Fall 2019 - Lecture 24


+ 87

Poker Data Description

n Dataset retrieved from Kaggle

n Dependent Variable: one of 10 possible categories


coded between 0-9:
0: Nothing in hand; not a recognized poker hand
1: One pair; one pair of equal ranks within five cards
2: Two pairs; two pairs of equal ranks within five cards
3: Three of a kind; three equal ranks within five cards
4: Straight; five cards, sequentially ranked with no gaps
5: Flush; five cards with the same suit
6: Full house; pair + different rank three of a kind
7: Four of a kind; four equal ranks within five cards
8: Straight flush; straight + flush
9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush

IEOR 242, Fall 2019 - Lecture 24


+ 88

Poker Data Description

n Independent Variables (10 nominally):


n S1 “Suit of card #1”
n Coded in {1, 2, 3, 4}, representing {Hearts, Spades, Diamonds,
Clubs}

n C1 “Rank of card #1”


n Coded in {1, 2, …, 12, 13}, representing {Ace, 2, 3, ... , Queen,
King}

n ...
S5 “Suit of card #5”
C5 “Rank of card #5”

IEOR 242, Fall 2019 - Lecture 24


+ 89

Training/Test Split and Validation


Approach
n 25,010 total observations
n (Note that there are 2,598,960 total possible poker hands)

n We did a random train/test split, with 70% in


training and 30% in test

n Instead of doing cross validation, I will allow Keras


to reserve 20% of the training data as a validation
set to track performance during the SGD (or SGD-
like) algorithm

IEOR 242, Fall 2019 - Lecture 24


+ 90

On Selecting the Network


Properties
n Generally speaking, it is not advisable to use cross-
validation, etc. to select the network properties such as
the number of hidden units and the number of layers
n This has some risk of overfitting and is also just generally a
waste of computation/time

n It’s okay to simply experiment with a few different


values for these parameters (often, number of hidden
units in each layer is on the order of the number of
features and number of hidden layers is between 1 and
10)
n In real life applications, a great deal of engineering
effort and domain expertise goes into determining the
network architecture

IEOR 242, Fall 2019 - Lecture 24


+ 91

Baseline Model

n Baseline model always predicts 0 – not a


recognizable poker hand (i.e., a high card hand)

n Test set accuracy of 0.495

Model Type              Test Set Accuracy
Baseline                0.495

IEOR 242, Fall 2019 - Lecture 24


+ 92

LDA Model

n Multiclass classification is simple with LDA, so let’s


try it
n Computation time: a split second
n Test set accuracy: 0.487

Model Type              Test Set Accuracy
Baseline                0.495
LDA                     0.487

IEOR 242, Fall 2019 - Lecture 24


+ 93

Random Forests Model

n Default parameters (no cross validation)

n Computation time: 527.82 sec

n Test set accuracy: 0.655

Model Type              Test Set Accuracy
Baseline                0.495
LDA                     0.487
Random Forests          0.655

IEOR 242, Fall 2019 - Lecture 24


+ 94

Single Hidden Layer Neural Net

n Single hidden layer model with sigmoid activation
function and 100 hidden units (a Keras sketch of this
model appears below)

n Accuracy history during the course of the


algorithm:
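
A sketch of how such a model might be specified with the Keras R interface. The lecture's actual code is not shown here, so the input dimension, epoch count, and batch size below are illustrative assumptions; x_train and y_train are placeholders (with y_train one-hot encoded, e.g., via to_categorical()).

```r
library(keras)

# Single hidden layer with 100 sigmoid units and a 10-class softmax output layer
# input_shape = 10 assumes the 10 raw card features are fed in directly (an assumption)
model <- keras_model_sequential() %>%
  layer_dense(units = 100, activation = "sigmoid", input_shape = c(10)) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = "rmsprop",                # the SGD-like algorithm mentioned earlier
  loss = "categorical_crossentropy",    # cross-entropy loss for multiclass classification
  metrics = c("accuracy")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 40, batch_size = 128,
  validation_split = 0.2                # Keras reserves 20% of the training data for validation
)
```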

IEOR 242, Fall 2019 - Lecture 24


+ 95

Single Hidden Layer Neural Net

n Computation time: 63.22 sec

n Test set accuracy: 0.909

Model Type                  Test Set Accuracy
Baseline                    0.495
LDA                         0.487
Random Forests              0.655
Single Hidden Layer Net     0.909

IEOR 242, Fall 2019 - Lecture 24


+ 96

Single Hidden Layer Neural Net


n Switching sigmoid to ReLU improves things slightly

n Computation time: 65.22 sec

n Test set accuracy: 0.919

Model Type                  Test Set Accuracy
Baseline                    0.495
LDA                         0.487
Random Forests              0.655
Single Hidden Layer Net     0.919

IEOR 242, Fall 2019 - Lecture 24


+ 97

Three Hidden Layer Neural Net

n Model with three hidden layers and ReLU


activation functions

n Accuracy history during the course of the


algorithm:

IEOR 242, Fall 2019 - Lecture 24


+ 98

Three Hidden Layer Neural Net


n Computation time: 67.82 sec

n Test set accuracy: 0.943

Model Type                  Test Set Accuracy
Baseline                    0.495
LDA                         0.487
Random Forests              0.655
Single Hidden Layer Net     0.919
Three Hidden Layer Net      0.943

IEOR 242, Fall 2019 - Lecture 24


+ 99

MNIST: Image Classification

n MNIST is a very famous dataset in image


classification and is popular in the neural network
literature

n The task is to classify images of handwritten digits

n There are 10 classes: 0, 1, 2, … , 9

n 60,000 training images, 10,000 test images

IEOR 242, Fall 2019 - Lecture 24


+ 100

MNIST: Dense Network

n Single hidden layer network with 512 hidden units


and ReLU activation function
n Number parameters = 28*28*512 + 512 + 512*10 + 10 =
407,050

n Accuracy history:

IEOR 242, Fall 2019 - Lecture 24


+ 101

MNIST: Dense Network

n Computation time: 24.77 sec

n Test set accuracy: 0.9797

Model Type                     Test Set Accuracy
Baseline                       0.1135
Single Hidden Layer Network    0.9797

IEOR 242, Fall 2019 - Lecture 24


+ 102

MNIST: Convolutional Network

n 3 convolutional layers, 2 dense layers (ReLU)

n Number parameters = 93,322 (a Keras sketch of a
comparable architecture appears below)

n Accuracy history:
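
A sketch of a comparable architecture in the Keras R interface (an assumption about the exact layer sizes, although with the filter and unit counts below the parameter total does come out to 93,322):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_flatten() %>%                                  # flatten height x width x channels to a vector
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(optimizer = "rmsprop",
                  loss = "categorical_crossentropy",
                  metrics = c("accuracy"))
```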

IEOR 242, Fall 2019 - Lecture 24


+ 103

MNIST: Convolutional Network

n Computation time: 305.99 sec

n Test set accuracy: 0.9915

Model Type                     Test Set Accuracy
Baseline                       0.1135
Single Hidden Layer Network    0.9797
Convolutional Network          0.9915

IEOR 242, Fall 2019 - Lecture 24


+ 104

CTR Prediction

n Let’s try a different example – the CTR prediction


dataset from Week 5 (regression problem)
n Recall that Random Forests and Boosting both had
OSR2 = 0.588
n I tried several different neural network
architectures/ideas and could not get an OSR2 value
better than 0.455 (single hidden layer model)
n It’s not worth the time or effort to continue trying to
get a good neural network model

IEOR 242, Fall 2019 - Lecture 24


+ 105

Neural Nets vs. Other Models


n What are some intuitive differences between the
poker/MNIST examples and the advertising example?

n Generally speaking, neural nets tend to work very well
on problems with high "signal-to-noise" ratios and
where a complicated relationship seems plausible
n This often is the case in vision, speech and language
modeling, etc.

n When there is a relatively larger degree of noise in the


problem and when feature engineering does not seem
to be a critical issue, boosting, random forests, and
other methods tend to dominate
n This often is the case in “business analytics” problems
n It may be possible to get a good neural network model, but it
will usually require an onerous amount of work
IEOR 242, Fall 2019 - Lecture 24
+ 106

Other Issues and Extensions

n There are many other issues that we haven’t


discussed, especially when dealing with
complicated data types like images

n There are also many different extensions to the


basic feedforward model that people consider

n A good practical reference is "Deep Learning with
R" by François Chollet with J. J. Allaire
n Also the same author wrote “Deep Learning with Python”

IEOR 242, Fall 2019 - Lecture 24


+
Some “Big Data” Issues

IEOR 242, Fall 2019 - Lecture 24 107


+ 108

Data in Theory….

y x1 x2 x3 x4 …
y1 x11 x12 x13 x14 …
y2 x21 x22 x23 x24 …
y3 x31 x32 x33 x34 …
y4 x41 x42 x43 x44 …
y5 x51 x52 x53 x54 …


IEOR 242, Fall 2019 - Lecture 24
+ 109

Data in Practice…

IEOR 242, Fall 2019 - Lecture 24


+ 110

Data in Practice

n Data in practice is often unstructured, messy, and


distributed across many different machines and/or
locations

n How do we know if we have a “small data” problem


or a “big data” problem?

IEOR 242, Fall 2019 - Lecture 24


+ 111

The Data Science Lifecycle

n We might say that a “small data” problem is a


problem such that all aspects of the above work
flow may be completed on a single laptop
machine, in memory, and in a reasonable
amount of time

IEOR 242, Fall 2019 - Lecture 24


+ 112

The Data Science Lifecycle, cont.

n Question: What if we have a 50 Gb data file but


after tidying and transforming, we produce a 500
Mb data frame for training our model(s) on?

IEOR 242, Fall 2019 - Lecture 24


+ 113

“Medium Data” Problems

n We might say that a “medium data” problem is a


problem such that all aspects of the above work
flow may be completed on a single laptop machine
in a reasonable amount of time, but perhaps we
won’t be able to keep all of the data in RAM at the
same time

IEOR 242, Fall 2019 - Lecture 24


+ 114

Some “Medium Data” Tools in R


n data.table is a high-performance extension of base R's
data.frame (see the sketch after this list)
n Effective for data around 10 GB or so
n As compared to dplyr, data.table provides similar functionality
but is less natural and less "linguistically friendly"

n ff package allows you to work with data on disk (almost) as if it


were in RAM

n biglm and bigglm support batch updating of linear and logistic


regression models based on new observations

n ranger for more efficient training of Random Forests models

n Perhaps a small subset, subsample, or summary of your medium


dataset is all you need to feed into the training of your ML
model…

n Other solutions: buy more RAM and/or processing power,


integrate closely with a database
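
A small sketch of the kind of workflow these packages enable; the file name, column names, and chunk sizes are hypothetical:

```r
library(data.table)

# fread() reads large delimited files far faster than read.csv(); file name is hypothetical
dt <- fread("transactions.csv")

# Aggregate in place: total amount and row count per customer (column names are hypothetical)
summary_dt <- dt[, .(total = sum(amount), n = .N), by = customer_id]

library(biglm)
# biglm fits a linear model on one chunk of rows; update() folds in further chunks
fit <- biglm(amount ~ n_items + region, data = dt[1:1e6])
fit <- update(fit, dt[(1e6 + 1):2e6])
```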
IEOR 242, Fall 2019 - Lecture 24
+ 115

“Big Data” Problems

n We might say that a “big data” problem is a problem


such that all (or some) aspects of the above work flow
must involve working with data that is distributed
across multiple sources, machines, locations, etc.

n Some aspect of the problem is simply too big or too


complicated to fit on your laptop

IEOR 242, Fall 2019 - Lecture 24


+ 116

Some “Big Data” Tools in R

n Apache Spark™ is a fast and general engine for large-


scale data processing
n A high-level interface for programming clusters, using
MapReduce style ideas, and with implicit data
parallelism and fault tolerance
n Spark is typically run on the cloud
n Includes libraries for SQL, streaming data, machine
learning (stochastic gradient descent), and graph
processing
n You can also write your own applications on top of Spark (see the sketch below)

n Initially developed by the AMPLab at UC Berkeley
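
One common way to drive Spark from R is the sparklyr package (an assumption on my part; the lecture demo may use a different interface). A minimal local-mode sketch, with a hypothetical taxi_df data frame and hypothetical column names:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # "local mode": Spark running on a single machine

# Copy an R data frame into Spark (taxi_df and its columns are placeholders)
taxi_tbl <- copy_to(sc, taxi_df, "taxi", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark
taxi_tbl %>%
  group_by(pickup_hour) %>%
  summarise(avg_fare = mean(fare, na.rm = TRUE))

# MLlib models are exposed through ml_* functions
fit <- ml_linear_regression(taxi_tbl, fare ~ trip_miles + trip_seconds)

spark_disconnect(sc)
```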

IEOR 242, Fall 2019 - Lecture 24


+ 117

Predicting Taxi Fares in Chicago

n We will demo Spark on a problem concerning


predicting taxi cab fares in the city of Chicago
n Dataset obtained from
https://www.kaggle.com/chicago/chicago-taxi-rides-2016

n This dataset is on the cusp of what can be done in R on
my 16 GB RAM laptop
n Hence more of a “medium data” problem
n Parts of the analysis are too computationally intensive to be
done using the packages we have seen so far
n However using Spark, even in “local mode” (i.e., not on a
cluster), enables us to do more

n This may be covered in discussion section in two weeks

IEOR 242, Fall 2019 - Lecture 24


+ 118

Conclusions

n Real life big data problems are messy and


necessarily involve working in a distributed
computing environment
n Spark is a useful tool for machine learning and other tasks
in this setting

n Stochastic gradient descent and other efficient


optimization algorithms enable large-scale
learning
n Neural networks are a powerful modeling
framework that yield state-of-the-art results in
domains such as vision, natural language
processing, and others

IEOR 242, Fall 2019 - Lecture 24
