
1. Define the principle of the gradient descent algorithm. Accompany your explanation with a diagram. Explain the use of all the terms and constants that you introduce and comment on the range of values that they can take.

Solution:

Gradient Descent:

Gradient descent is an optimization algorithm used to minimize a function by moving, over a number of iterations, in the direction of steepest descent, which is given by the negative of the gradient. In machine learning we use gradient descent to update the model parameters; the parameters are, for example, the coefficients of a linear regression or the weights of a neural network. Only the first derivative of the objective function is used when updating the parameters. In each iteration we update the parameters in the direction opposite to the gradient of the objective function J(θ) with respect to those parameters, since the gradient points in the direction of steepest ascent. The size of the step taken in each iteration is determined by the learning rate α, and it controls how quickly we approach the local minimum. We follow the slope downhill until we reach a local minimum.

Given the cost function J(θ), gradient descent determines a value of θ that (approximately) minimizes the cost function. It is an iterative process: at any iteration k, the current value θ(k) is updated to a new value θ(k + 1) such that J(θ(k + 1)) < J(θ(k)). The natural way to find a θ(k + 1) that decreases the cost is to compute the gradient of J at the current point θ(k) and then choose θ(k + 1) so as to move in the downhill direction. If the step is too small, the algorithm may take a long time to converge; if the step is too large, the algorithm may overshoot the minimizing value of θ, which leads to instability or oscillation.

Pseudo-code:

repeat
    for i = 1 to n do
        temp_i = θ_i − α ∂J(θ)/∂θ_i
    end for
    θ = temp
until J(θ) stops decreasing (i.e. it changes only by a small amount)

Here α is the learning rate, which controls how far down the surface of J(θ) we move in every iteration.
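
As an illustration, here is a minimal Python sketch of this update rule. The quadratic cost used at the end, and the names gradient_descent, alpha, tol and max_iter, are illustrative choices rather than part of the question.

import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Minimize J by repeatedly stepping against its gradient.

    grad_J   : function returning the gradient vector of J at theta
    theta0   : initial parameter vector
    alpha    : learning rate (step size), typically a small value in (0, 1]
    tol      : stop when the update becomes smaller than this threshold
    max_iter : safety cap on the number of iterations
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad_J(theta)      # move against the gradient
        theta = theta - step
        if np.linalg.norm(step) < tol:    # J(theta) changes only slightly
            break
    return theta

# Example: minimize J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
grad = lambda t: 2 * (t - np.array([3.0, -1.0]))
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approximately [3, -1]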

2. Derive the gradient descent training rule assuming that the target function representation is:
od = w0 + w1x1 + … + wnxn. Define explicitly the cost/error function E, assuming that a set
of training examples D is provided, where each training example d ∈ D is associated with the
target output td.
Done on Paper
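
For reference, a sketch of the standard least-squares derivation (the full working was handed in on paper). With o_d = w0 + w1x1 + … + wnxn and target output t_d for each training example d ∈ D:

E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}\bigl(t_d - o_d\bigr)^2

\frac{\partial E}{\partial w_i}
  = \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d)
  = -\sum_{d \in D}(t_d - o_d)\,x_{id}

\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}
           = \eta \sum_{d \in D}(t_d - o_d)\,x_{id},
\qquad \text{with } x_{0d} = 1 \text{ so the same rule covers } w_0,

where η is the learning rate.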

3. What is the difference between supervised and unsupervised learning? Explain with real-life examples.

4. Explain the feature engineering techniques covered in class (textual data and numerical data).

5. Explain the four techniques for feature selection.

6. What are the six steps of the machine learning cycle? Explain the six steps of the machine learning cycle with the help of an example (minimum 200 words).

7. Write a short note on the following (minimum 100 words each)


a) Feature selection
b) Parameter Tuning

a. SVM
b. Decision Tree
c. Variance
d. Bias-Variance Trade off
e. RMSE
f. Types of Machine learning
13. Describe the two dimensionality reduction techniques.

i) Feature selection

In feature selection, we are interested in finding the k out of the n features that give us the most information, and we discard the other (n − k) dimensions. Two popular feature selection approaches are:

a) Forward Selection

In forward selection, we start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error (or decreases it only slightly).

b) Backward Selection

In backward selection, we start with all n variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it the least), until any further removal increases the error significantly.
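
As an illustration of the forward-selection procedure described above, here is a small Python sketch; the diabetes dataset, LinearRegression and 5-fold cross-validated R² are illustrative choices made for the example.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

def forward_selection(X, y, k):
    """Greedily add the feature that most improves cross-validated R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        # score each candidate feature added to the current subset
        scores = {j: cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        if selected and scores[best] <= cross_val_score(
                LinearRegression(), X[:, selected], y, cv=5).mean():
            break                      # no further addition helps
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(X, y, k=5))    # indices of the chosen features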
a) Solution: 1- A neural network with a shared hidden layer can capture dependencies between diseases. It can be shown that, in some cases, when there is a dependency between the output nodes, having shared nodes in the hidden layer improves the accuracy. 2- If there is no dependency between the diseases (output neurons), then we would prefer to have a separate neural network for each disease.
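
As a sketch of point 1, a network whose hidden layer is shared by several disease outputs could look like this in PyTorch; the layer sizes, class name and tensor shapes below are illustrative assumptions, not taken from the question.

import torch
import torch.nn as nn

class SharedHiddenNet(nn.Module):
    """One hidden layer shared by all disease outputs, so correlations
    between the diseases can be captured in the shared representation."""
    def __init__(self, n_features, n_hidden, n_diseases):
        super().__init__()
        self.shared = nn.Linear(n_features, n_hidden)   # shared hidden layer
        self.heads = nn.Linear(n_hidden, n_diseases)    # one output per disease
    def forward(self, x):
        h = torch.relu(self.shared(x))
        return torch.sigmoid(self.heads(h))             # probability per disease

model = SharedHiddenNet(n_features=20, n_hidden=16, n_diseases=3)
probs = model(torch.randn(4, 20))   # 4 patients, 3 disease probabilities each
print(probs.shape)                  # torch.Size([4, 3])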
b) We expect students to explain how each of these learning techniques can be used to output a confidence value (any of these techniques can be modified to provide one). In addition, Naive Bayes is preferable to the other techniques since we can still use it for classification when the values of some of the features are unknown. We gave partial credit to those who mentioned the neural network because of its non-linear decision boundary, or the decision tree because it gives us an interpretable answer.

c) Yes. k-means assigns each data point to a unique cluster based on its distance to the cluster centers, whereas Gaussian mixture clustering gives a soft (probabilistic) assignment to each data point. Therefore, even if the cluster centers are identical in both methods, if the Gaussian mixture components have large variances (the components are spread widely around their centers), points on the boundary between clusters may be given different assignments in the Gaussian mixture solution.
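
A small Python sketch of this difference, assuming scikit-learn; the two synthetic overlapping blobs and the 0.5 ± 0.1 "ambiguous" threshold are illustrative choices.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two overlapping blobs; the points between them are the interesting ones
X = np.vstack([rng.normal(0, 1.5, size=(200, 2)),
               rng.normal(4, 1.5, size=(200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard = km.predict(X)          # one cluster index per point (hard assignment)
soft = gm.predict_proba(X)    # probability of each component per point (soft)

# points near the boundary get ambiguous (close to 0.5/0.5) soft assignments
boundary = np.abs(soft[:, 0] - 0.5) < 0.1
print(f"{boundary.sum()} points have ambiguous Gaussian-mixture membership")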

Business Understanding: In this phase we clearly state the business requirements and the problem statement. The business problem is converted into a machine learning question; for example, "What segment of customers should we focus on?" is restated as "Will this target customer buy the product or not?". In this phase we also identify the resources required, the risk contingencies and the cost-benefit analysis. Finally, a suitable machine learning project plan is prepared.

Data Understanding: Data understanding consists of data collection and assessing data quality. The data collection step includes listing the data sources and what data to extract from them, analysing the data against further requirements and determining whether additional data sources are needed. The data quality stage consists of deciding whether records with missing data elements are usable, and whether the missing values can be extracted or substituted.
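
As an illustration of this data-quality step, a minimal pandas sketch; the customer columns (age, income, bought) are invented for the example.

import numpy as np
import pandas as pd

# toy customer table standing in for the collected raw data
df = pd.DataFrame({"age": [34, np.nan, 51, 29],
                   "income": [42_000, 55_000, np.nan, 38_000],
                   "bought": [1, 0, 1, 0]})

print(df.isna().sum())                            # missing values per column
df["age"] = df["age"].fillna(df["age"].median())  # substitute a plausible value
df = df.dropna(subset=["income"])                 # or drop rows where substitution is unsafe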

Data Preparation: The data preparation stage includes all the activities required to process the raw data and transform it into feature representations before they are fed into the machine learning models. This task is accomplished with feature engineering and feature selection, as in the sketch below.
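
A sketch of such feature preparation, assuming scikit-learn and pandas; the review/price columns are invented, and TF-IDF plus standardisation stand in for the textual and numerical techniques covered in class.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"review": ["great product", "poor quality", "great value"],
                   "price": [19.9, 5.0, 12.5]})

# the textual column becomes TF-IDF features, the numerical column is standardised
prep = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),
    ("num", StandardScaler(), ["price"]),
])
features = prep.fit_transform(df)
print(features.shape)   # rows x (TF-IDF vocabulary + 1 scaled numeric column)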
Modeling: Modeling consists of three main phases: model creation, model selection and parameter tuning. If the data does not suit the chosen type of modelling algorithm, then phase 3 (data preparation) needs to be revisited. The first step, of course, is to pick a modeling algorithm and the tools to implement it. For model testing, the data is normally split into a training set and a testing set; depending on the data set and the algorithm the split can vary, but usually we divide the data into 80% training and 20% testing. An evaluation criterion should also be selected at this time. The actual training involves tuning hyperparameters and checking the resulting accuracy.
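
A minimal Python sketch of this modeling step, assuming scikit-learn; the breast-cancer dataset, the decision tree and the depths tried are illustrative choices, not prescribed by the cycle itself.

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# the usual 80% / 20% split between training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# tune a hyperparameter (tree depth) by comparing held-out accuracy
for depth in (2, 4, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, accuracy_score(y_test, model.predict(X_test)))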

Evaluation: This step checks to what degree the model meets the business goals or objectives, and decides whether those objectives are adequately addressed. In this stage we validate all the models by selecting hyperparameters and model characteristics. The selected learning models are measured and their accuracies compared. Ultimately, the optimal or best-performing machine learning model is selected for implementation.

Deployment: In this phase, we use the selected best model to create a web application or a recommendation system. The final application is deployed on an authorized cloud platform.
