
Alamein University
Faculty of Computer Science & Engineering
Neural Networks (Course Code: AIE231)

Neural Network Lab 5


Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is
used for optimizing machine learning models. It addresses the
computational inefficiency of traditional Gradient Descent methods when
dealing with large datasets in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single
random training example (or a small batch) is selected to calculate the
gradient and update the model parameters. This random selection is what makes the
optimization process "stochastic", hence the name Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when
dealing with large datasets. By using a single example or a small batch, the
computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire
dataset.
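In other words, at each step the parameters are nudged using the gradient of the loss on one randomly chosen example (or batch). A minimal sketch of that update, where alpha is the learning rate and grad_J is some function returning the gradient of the loss on a single example (both names are illustrative, not part of the lab code below):

i = np.random.randint(n)                            # pick one training example at random
theta = theta - alpha * grad_J(theta, X[i], y[i])   # step against that example's gradient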

Stochastic Gradient Descent Algorithm

1. Initialization: Randomly initialize the parameters of the model.
2. Set Parameters: Determine the number of iterations and the learning rate (alpha) for updating the parameters.
3. Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum number of iterations (a compact code sketch of these steps follows the list):
   a. Shuffle the training dataset to introduce randomness.
   b. Iterate over each training example (or a small batch) in the shuffled order.
   c. Compute the gradient of the cost function with respect to the model parameters using the current training example (or batch).
   d. Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.
   e. Evaluate the convergence criteria, such as the change in the cost function between successive iterations or the magnitude of the gradient.


4. Return Optimized Parameters: Once the convergence criteria are met or the maximum number of iterations is reached, return the optimized model parameters.
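Putting the steps together, here is a compact sketch of the loop for a linear model with squared-error loss, the same setting as the lab code further below. The names X, y, alpha, max_iter and tol are assumed to be defined beforehand; this is an illustration of the algorithm, not the lab's own implementation.

n, d = X.shape
theta = np.random.randn(d)                 # 1. random initialization
for it in range(max_iter):                 # 2./3. fixed learning rate alpha, at most max_iter passes
    indices = np.random.permutation(n)     # a. shuffle the training data
    for i in indices:                      # b. visit one example at a time
        error = X[i] @ theta - y[i]
        grad = error * X[i]                # c. gradient of 0.5 * (x_i . theta - y_i)^2
        theta -= alpha * grad              # d. step in the direction of the negative gradient
    if np.linalg.norm(grad) < tol:         # e. simple convergence check on the last gradient
        break
# 4. theta now holds the optimized parameters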
In SGD, because only one randomly chosen sample (or a small batch) is used for each
update, the path taken by the algorithm towards the minimum is usually much noisier
than in typical Gradient Descent. In practice this is rarely a problem: the exact path
does not matter as long as we reach the minimum, and SGD typically gets there with a
significantly shorter training time.

The path taken by Batch Gradient Descent is shown below:


The path taken by Stochastic Gradient Descent, by contrast, looks as follows:

One thing to note is that, because SGD is generally noisier than typical Gradient
Descent, it usually takes a higher number of iterations to reach the minima due to the
randomness of its descent. Even though it requires more iterations than typical
Gradient Descent, each iteration is computationally much cheaper, so SGD is still much
less expensive overall. Hence, in most scenarios, SGD is preferred over Batch Gradient
Descent for optimizing a learning algorithm.


Python Code For Stochastic Gradient Descent

We will create an SGD class with methods that we will use while updating the
parameters, fitting the training dataset, and predicting on new test data. The methods
we will be using are as follows:
Gradient – This method is used when updating the parameters of the model. For every
mini-batch, it calculates the error between the predicted values and the actual values
and returns the corresponding gradient.
Fit – This method fits the training dataset to the model. It shuffles the data indices,
computes the gradient for each mini-batch, and updates the parameter theta.
Predict – This method predicts values for new data points; the prediction is just the
dot product of the parameters and the input features.
import numpy as np

class SGD:
    def __init__(self, lr=0.01, max_iter=1000, batch_size=32, tol=1e-3):
        # learning rate of the SGD optimizer
        self.learning_rate = lr
        # maximum number of passes over the data
        self.max_iteration = max_iter
        # mini-batch size
        self.batch_size = batch_size
        # tolerance used for the convergence check on the gradient
        self.tolerance_convergence = tol
        # model parameters (initialized in fit)
        self.theta = None

    def fit(self, X, y):
        # store the dimensions of the input matrix
        n, d = X.shape
        # initialize a random theta for every feature
        self.theta = np.random.randn(d)
        for _ in range(self.max_iteration):
            # shuffle the data
            indices = np.random.permutation(n)
            X = X[indices]
            y = y[indices]
            # iterate over mini-batches
            for start in range(0, n, self.batch_size):
                X_batch = X[start:start + self.batch_size]
                y_batch = y[start:start + self.batch_size]
                grad = self.gradient(X_batch, y_batch)
                self.theta -= self.learning_rate * grad
            # check for convergence using the last gradient of this pass
            if np.linalg.norm(grad) < self.tolerance_convergence:
                break

    def gradient(self, X, y):
        # gradient of the mean squared error on this batch
        n = len(y)
        # predict target values as the dot product of the inputs and theta
        y_pred = np.dot(X, self.theta)
        # error between the predicted and actual values
        error = y_pred - y
        grad = np.dot(X.T, error) / n
        return grad

    def predict(self, X):
        # predict y values using the learned theta
        y_pred = np.dot(X, self.theta)
        return y_pred

SGD Implementation
We will create a random dataset with 100 rows and 5 columns and fit our Stochastic
Gradient Descent class to this data. We will also use the predict method of the SGD class.
# Create a random dataset with 100 rows and 5 columns
X = np.random.randn(100, 5)
# Create the corresponding target values from a known linear
# combination of the features, plus a small amount of random noise
y = np.dot(X, np.array([1, 2, 3, 4, 5])) + np.random.randn(100) * 0.1

# Create an instance of the SGD class and fit it to the data
model = SGD(lr=0.01, max_iter=1000, batch_size=32, tol=1e-3)
model.fit(X, y)


# Predict using the model's predict method
y_pred = model.predict(X)
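As a quick sanity check (an illustrative addition, not part of the original lab code), the learned parameters should end up close to the true coefficients [1, 2, 3, 4, 5] used to generate y, and the training error should be small:

# Compare the learned parameters with the true coefficients
print("learned theta:", np.round(model.theta, 2))
# Mean squared error of the predictions on the training data
print("training MSE:", np.mean((y - y_pred) ** 2))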

This cycle of computing predictions, measuring the error, and adjusting the parameters
to reduce the loss function is what training amounts to; in neural networks, the
gradients that drive these updates are computed by back-propagation.

Tuning your learning rate

If we plot the loss achieved against the learning rate used, the resulting curve consists of
three distinct parts: learning rates that are too small don't learn fast enough and don't take
the model anywhere; an area of steepest descent, eventually leading into an optimal or
near-optimal learning rate; and, past the edge of that range, noise and eventually
divergence (this is where the learning rate is too large).
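One rough way to see this curve for yourself (an illustrative sketch reusing the SGD class and the synthetic X, y defined earlier; not part of the original lab code) is to train the same model with several learning rates and compare the resulting training error:

# Train with a range of learning rates and record the final training MSE
for lr in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    m = SGD(lr=lr, max_iter=100, batch_size=32, tol=1e-3)
    m.fit(X, y)
    print(f"lr={lr:g}  training MSE={np.mean((y - m.predict(X)) ** 2):.4f}")

Very small learning rates should leave the error high, intermediate ones should drive it down quickly, and overly large ones should make the error noisy or let it blow up.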


import numpy as np
import pandas as pd

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

# Generate dummy data: the class label is the index of the largest feature
X_train = np.random.random((1000, 3))
y_train = pd.get_dummies(np.argmax(X_train, axis=1)).values
X_test = np.random.random((100, 3))
y_test = pd.get_dummies(np.argmax(X_test, axis=1)).values

# Learning rate schedule: start at 1e-4 and decay by a factor of 0.75 every 2 epochs
lr_sched = LearningRateScheduler(lambda epoch: 1e-4 * (0.75 ** np.floor(epoch / 2)))

# Build the model
clf = Sequential()
clf.add(Dense(9, activation='relu', input_dim=3))
clf.add(Dense(9, activation='relu'))
clf.add(Dense(3, activation='softmax'))

# Train with a fixed, deliberately small learning rate to show how the loss evolves
optimizer = keras.optimizers.SGD(lr=0.0001)
clf.compile(loss='categorical_crossentropy', optimizer=optimizer)
clf.fit(X_train, y_train, epochs=10, batch_size=500)

# Re-compile and train again, this time letting the callback schedule the learning rate
clf.compile(loss='categorical_crossentropy', optimizer=SGD())
clf.fit(X_train, y_train, epochs=10, batch_size=500, callbacks=[lr_sched])
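The dummy test split (X_test, y_test) created above is never used in the snippet; as a quick follow-up (an illustrative addition, not part of the original code), one could check the cross-entropy loss on it after training:

# Evaluate the cross-entropy loss on the held-out dummy data
test_loss = clf.evaluate(X_test, y_test, verbose=0)
print("test loss:", test_loss)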
