
Programming Test: Learning Activations in Neural

Networks
Monk AI

ML, AI, DNN


Bangalore, Goa, Singapore
Send your response to: bhavish.agarwal@gmail.com, hr@happymonk.co

Abstract—The choice of Activation Function (AF) has proven to be an important factor affecting the performance of an Artificial Neural Network (ANN). Build a 1-hidden-layer neural network model that adapts to the most suitable activation function for the given data set. The ANN model can learn the best AF for itself by exploiting a flexible functional form, k0 + k1 ∗ x, with the parameters k0, k1 learned over multiple runs. You may use this code base for implementation guidelines and help: https://github.com/sahamath/MultiLayerPerceptron

I. BACKGROUND

Selecting the best-performing AF for a classification task is essentially a naive (brute-force) procedure: a popularly used AF is picked and used in the network to approximate the optimal function. If this fails, the process is repeated with a different AF until the network learns to approximate the ideal function. It is interesting to inquire whether a framework can be built that uses the inherent clues and insights in the data to bring about the most suitable AF. Such an approach could not only save significant time and effort in tuning the model, but could also open up new ways of discovering essential features of not-so-popular AFs.

II. PROBLEM STATEMENT

Given the activation function

g(x) = k0 + k1 x    (1)

and the categorical cross-entropy loss, design a neural network on the Banknote, MNIST, or Iris data, where the activation-function parameters k0, k1 are learned from whichever of the above-mentioned data sets you choose. Your solution must include the learned parameter values (i.e., the final k0, k1 values at the end of training), a plot depicting the changes in k0, k1 at each epoch, train vs. test loss, train vs. test accuracy, and a plot of the loss function.
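As a concrete starting point, any of the named data sets can be loaded in a few lines. The sketch below (Python/NumPy, illustrative rather than prescribed) uses Iris via scikit-learn; the Banknote data is distributed as a CSV by the UCI repository and can be read with numpy.loadtxt instead.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and integer labels, then one-hot encode the labels
# so they match the Softmax/cross-entropy output layer described below.
X, y = load_iris(return_X_y=True)
Y = np.eye(y.max() + 1)[y]                       # (150, 4) features, (150, 3) targets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)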
III. MATHEMATICAL FRAMEWORK

A. Compact Representation

Let the proposed Ada-Act activation function be mathematically defined as:

g(x) = k0 + k1 x    (2)

where the coefficients k0, k1 have to be learned during training via back-propagation of error gradients, on a particular data set specified in the problem statement.
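As an illustrative sketch (not part of the problem statement), the activation and its derivative take only a few lines of NumPy. The three-coefficient form g(x) = k0 + k1 x + k2 x^2 is implemented here because the framework below carries a 3 × 1 coefficient matrix K = [k0, k1, k2]^T; equation (2) is the special case k2 = 0. All names are illustrative.

import numpy as np

def g(x, K):
    # Ada-Act activation g(x) = k0 + k1*x + k2*x**2, applied element-wise;
    # equation (2) corresponds to K[2] = 0.
    k0, k1, k2 = K
    return k0 + k1 * x + k2 * x ** 2

def g_prime(x, K):
    # Derivative of the activation with respect to its input x.
    _, k1, k2 = K
    return k1 + 2.0 * k2 * x

Note that with k2 = 0 the derivative reduces to the constant k1, which is exactly what g'(z1) and g'(z2) evaluate to in the back-propagation equations further below.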
For the purpose of demonstration, consider a feed-forward neural network consisting of an input layer L0 of m nodes for the m features, two hidden layers L1 and L2 of n and p nodes respectively, and an output layer L3 of k nodes for the k classes. Let zi and ai denote the inputs to and the activations of the nodes in layer Li respectively. Let wi and bi denote the weights and the biases applied to the nodes of layer Li−1, and let the activations of layer L0 be the input features of the training examples. Finally, let K denote the column matrix containing the equation coefficients, K = [k0, k1, k2]^T, and let t denote the number of training examples taken in one batch. Then the forward-propagation equations will be:

z1 = a0 × w1 + b1
a1 = g(z1)
z2 = a1 × w2 + b2
a2 = g(z2)
z3 = a2 × w3 + b3
a3 = Softmax(z3)

where × denotes the matrix multiplication operation and Softmax() denotes the Softmax activation function.
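Transcribed directly into NumPy (an illustrative sketch reusing g from the block above; a0 is a t × m batch, and the weight shapes follow the equations):

import numpy as np

def softmax(z):
    # Row-wise Softmax with max-subtraction for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(a0, params, K):
    # Forward pass for the L0 -> L1 -> L2 -> L3 network described above.
    w1, b1, w2, b2, w3, b3 = params
    z1 = a0 @ w1 + b1          # z1 = a0 x w1 + b1
    a1 = g(z1, K)              # a1 = g(z1)
    z2 = a1 @ w2 + b2          # z2 = a1 x w2 + b2
    a2 = g(z2, K)              # a2 = g(z2)
    z3 = a2 @ w3 + b3          # z3 = a2 x w3 + b3
    a3 = softmax(z3)           # a3 = Softmax(z3)
    # Cache the intermediates needed by back-propagation.
    return a3, (a0, z1, a1, z2, a2)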
 
For back-propagation, let the loss function used in this model be the categorical cross-entropy loss, let dfi denote the gradient matrix of the loss with respect to the matrix fi, where f can be substituted with z, a, b, or w, and let dK1 and dK2 be matrices of dimension 3 × 1. Then the back-propagation equations will be:

dz3 = a3 − y
dw3 = (1/t) a2^T × dz3
db3 = avg_col(dz3)
da2 = dz3 × w3^T
dz2 = g'(z2) ∗ da2
dw2 = (1/t) a1^T × dz2
db2 = avg_col(dz2)
da1 = dz2 × w2^T
dz1 = g'(z1) ∗ da1
dw1 = (1/t) a0^T × dz1
db1 = avg_col(dz1)

dK1 = [avg_e(da1), avg_e(da1 ∗ z1), avg_e(da1 ∗ z1^2)]^T    (3)
dK2 = [avg_e(da2), avg_e(da2 ∗ z2), avg_e(da2 ∗ z2^2)]^T
dK = dK1 + dK2    (4)

where ∗ is the element-wise multiplication operation, ^T is the matrix transpose operation, avg_col(x) is the function which returns the column-wise average of the elements of the matrix x, and avg_e(x) is the function which returns the average of all the elements of the matrix x. The contribution dK2 from the second hidden layer mirrors dK1, since the activation g (and hence the coefficient matrix K) is applied at both hidden layers.
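The gradient equations translate just as directly. The sketch below (illustrative; it reuses g_prime and the cache returned by forward above) also includes the categorical cross-entropy loss used for the train vs. test loss curves:

import numpy as np

def cross_entropy(a3, y):
    # Mean categorical cross-entropy; the epsilon guards against log(0).
    return -np.mean(np.sum(y * np.log(a3 + 1e-12), axis=1))

def backward(a3, y, cache, params, K, t):
    # Back-propagation for one batch of t examples; y is one-hot, shape (t, k).
    a0, z1, a1, z2, a2 = cache
    w1, b1, w2, b2, w3, b3 = params
    dz3 = a3 - y
    dw3 = (a2.T @ dz3) / t
    db3 = dz3.mean(axis=0)                    # avg_col(dz3)
    da2 = dz3 @ w3.T
    dz2 = g_prime(z2, K) * da2                # element-wise product
    dw2 = (a1.T @ dz2) / t
    db2 = dz2.mean(axis=0)
    da1 = dz2 @ w2.T
    dz1 = g_prime(z1, K) * da1
    dw1 = (a0.T @ dz1) / t
    db1 = dz1.mean(axis=0)
    # Equations (3)-(4): gradient with respect to the coefficient matrix K.
    dK1 = np.array([da1.mean(), (da1 * z1).mean(), (da1 * z1 ** 2).mean()])
    dK2 = np.array([da2.mean(), (da2 * z2).mean(), (da2 * z2 ** 2).mean()])
    dK = dK1 + dK2
    return (dw1, db1, dw2, db2, dw3, db3), dK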
Consider the learning rate of the model to be α. The update
equations for the model parameters will be:
w1 = w1 − α · dw1
b1 = b1 − α · db1
w2 = w2 − α · dw2
b2 = b2 − α · db2
w3 = w3 − α · dw3
b3 = b3 − α · db3
K = K − α · dK
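Tying the sketches together, a training run is then a loop of forward pass, backward pass, and the update equations above. The initialization below (sampling K from a normal distribution) is one concrete choice of the "sampling the parameters from some distribution" asked for in the Expected Results; everything here remains an illustrative sketch, not a prescribed implementation.

import numpy as np

def train(X, Y, params, K, lr=0.01, epochs=100):
    # Full-batch gradient descent; returns the history of K for the
    # per-epoch k0, k1 plot required in the report.
    t = X.shape[0]
    history = []
    for _ in range(epochs):
        a3, cache = forward(X, params, K)
        grads, dK = backward(a3, Y, cache, params, K, t)
        params = tuple(p - lr * dp for p, dp in zip(params, grads))
        K = K - lr * dK                       # K = K - alpha * dK
        history.append(K.copy())
    return params, K, history

rng = np.random.default_rng(0)
K_init = rng.normal(0.0, 0.1, size=3)         # sample k0, k1, k2 (fix k2 = 0 for eq. (2))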
IV. EXPECTED RESULTS
• A technical report containing the implementation details (algorithm, initial settings such as sampling the parameters k0, k1 from some distribution, parameter updates over epochs, final parameter values at the end of training, train vs. test loss, train and test accuracy, F1-score, a plot of the loss function vs. epochs, and the code base hosted on GitHub and linked in the report) – maximum 3 pages
• Grid search / brute force is NOT allowed
• Data sets for experiments and results: Banknote (Internship Candidates)
• Data sets for experiments and results: Banknote, Iris, MNIST (Junior Data Scientist Positions)
