
Programming Test: Learning Activations in Neural Networks
Monk AI

ML, AI, DNN


Bangalore, Goa, Singapore
hr@happymonk.co
Send your response to: bhavish.agarwal@gmail.com

Abstract—The choice of Activation Functions (AF) has proven to be an important factor that affects the performance of an Artificial Neural Network (ANN). Use a 1-hidden layer neural network model that adapts to the most suitable activation function according to the data-set. The ANN model can learn for itself the best AF to use by exploiting a flexible functional form, k0 + k1 ∗ x, with the parameters k0, k1 being learned from multiple runs. You can use this code-base for implementation guidelines and help: https://github.com/sahamath/MultiLayerPerceptron
I. BACKGROUND

Selection of the best-performing AF for a classification task is essentially a naive (or brute-force) procedure wherein a popularly used AF is picked and used in the network for approximating the optimal function. If this function fails, the process is repeated with a different AF until the network learns to approximate the ideal function. It is interesting to inquire and inspect whether there exists a possibility of building a framework which uses the inherent clues and insights from the data to bring about the most suitable AF. Such an approach could not only save significant time and effort in tuning the model, but could also open up new ways of discovering essential features of not-so-popular AFs.
II. MATHEMATICAL FRAMEWORK

A. Compact Representation

Let the proposed Ada-Act activation function be mathematically defined as:

g(x) = k0 + k1 x    (1)

where the coefficients k0, k1 and k2 are learned during training via back-propagation of error gradients.
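As a concrete reference, below is a minimal NumPy sketch of the Ada-Act activation and its derivative. Because the K matrix defined later is 3 × 1 and the gradients in (2) and (3) contain z² terms, the sketch assumes the quadratic form g(x) = k0 + k1 x + k2 x²; if only the linear form of equation (1) is intended, the k2 term can simply be held at zero. The function names g and g_prime are illustrative choices, not part of the brief.

import numpy as np

def g(z, K):
    """Ada-Act activation. K = [k0, k1, k2]; set k2 = 0 for the purely
    linear form g(z) = k0 + k1*z of equation (1)."""
    k0, k1, k2 = K
    return k0 + k1 * z + k2 * z ** 2

def g_prime(z, K):
    """Derivative of the activation with respect to its input z."""
    _, k1, k2 = K
    return k1 + 2 * k2 * z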
For the purpose of demonstration, consider a feed-forward neural network consisting of an input layer L0 consisting of m nodes for m features, two hidden layers L1 and L2 consisting of n and p nodes respectively, and an output layer L3 consisting of k nodes for k classes. Let zi and ai denote the inputs to and the activations of the nodes in layer Li respectively. Let wi and bi denote the weights and the biases applied to the nodes of layer Li−1, and let the activations of layer L0 be the input features of the training examples. Finally, let K = [k0; k1; k2] denote the column matrix containing the equation coefficients, and let t denote the number of training examples being taken in one batch.
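Before writing out the forward pass, here is a minimal sketch of how these quantities could be set up in NumPy. The Gaussian weight initialisation, the zero biases, and the concrete values chosen for m, n, p, k and t are illustrative assumptions only, not requirements of the brief.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: m input features, n and p hidden nodes,
# k output classes, and t examples per batch.
m, n, p, k, t = 4, 16, 16, 3, 32

# w_i maps layer L_{i-1} to layer L_i, so with a0 of shape (t, m):
w1, b1 = rng.normal(0.0, 0.1, (m, n)), np.zeros((1, n))
w2, b2 = rng.normal(0.0, 0.1, (n, p)), np.zeros((1, p))
w3, b3 = rng.normal(0.0, 0.1, (p, k)), np.zeros((1, k))
params = (w1, b1, w2, b2, w3, b3)

# K holds the Ada-Act coefficients [k0, k1, k2].
K = rng.normal(0.0, 1.0, 3)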
Then the forward-propagation equations will be:

z1 = a0 × w1 + b1
a1 = g(z1)
z2 = a1 × w2 + b2
a2 = g(z2)
z3 = a2 × w3 + b3
a3 = Softmax(z3)

where × denotes the matrix multiplication operation and Softmax() denotes the Softmax activation function.
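A direct NumPy transcription of these forward-propagation equations might look as follows. This is a sketch only: the parameter tuple ordering and the row-wise softmax over a (t, k) output are assumptions of this illustration, and g is the activation sketched earlier.

def softmax(z):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(a0, params, K):
    """Forward pass following the equations above; a0 is the (t, m) input batch."""
    w1, b1, w2, b2, w3, b3 = params
    z1 = a0 @ w1 + b1
    a1 = g(z1, K)
    z2 = a1 @ w2 + b2
    a2 = g(z2, K)
    z3 = a2 @ w3 + b3
    a3 = softmax(z3)
    return z1, a1, z2, a2, z3, a3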
 
For back-propagation, let the loss function used in this model be the Categorical Cross-Entropy Loss, let dfi denote the gradient matrix of the loss with respect to the matrix fi (where f can be substituted with z, a, b, or w), let y denote the one-hot encoded matrix of target labels for the batch, and let dK2 and dK1 be matrices of dimension 3 × 1. Then the back-propagation equations will be:

dz3 = a3 − y
dw3 = (1/t) · a2ᵀ × dz3
db3 = avg_col(dz3)

da2 = dz3 × w3ᵀ
dz2 = g′(z2) ∗ da2
dw2 = (1/t) · a1ᵀ × dz2
db2 = avg_col(dz2)
dK2 = [avg_e(da2); avg_e(da2 ∗ z2); avg_e(da2 ∗ z2²)]    (2)

da1 = dz2 × w2ᵀ
dz1 = g′(z1) ∗ da1
dw1 = (1/t) · a0ᵀ × dz1
db1 = avg_col(dz1)
dK1 = [avg_e(da1); avg_e(da1 ∗ z1); avg_e(da1 ∗ z1²)]    (3)

dK = dK2 + dK1    (4)
where ∗ is the element-wise multiplication operation, ᵀ is the matrix transpose operation, avg_col(x) is the function which returns the column-wise average of the elements present in the matrix x, and avg_e(x) is the function which returns the average of all the elements present in the matrix x.
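Put together, these back-propagation equations could be transcribed into NumPy roughly as below. Again a sketch: y is assumed to be the (t, k) one-hot label matrix, cache holds the tensors returned by the forward sketch, and avg_col and avg_e are realised with mean calls.

def backward(a0, y, cache, params, K):
    """Gradients of the categorical cross-entropy loss, following (2)-(4)."""
    z1, a1, z2, a2, z3, a3 = cache
    w1, b1, w2, b2, w3, b3 = params
    t = a0.shape[0]

    dz3 = a3 - y
    dw3 = a2.T @ dz3 / t
    db3 = dz3.mean(axis=0, keepdims=True)              # avg_col

    da2 = dz3 @ w3.T
    dz2 = g_prime(z2, K) * da2
    dw2 = a1.T @ dz2 / t
    db2 = dz2.mean(axis=0, keepdims=True)
    dK2 = np.array([da2.mean(), (da2 * z2).mean(), (da2 * z2 ** 2).mean()])   # avg_e terms

    da1 = dz2 @ w2.T
    dz1 = g_prime(z1, K) * da1
    dw1 = a0.T @ dz1 / t
    db1 = dz1.mean(axis=0, keepdims=True)
    dK1 = np.array([da1.mean(), (da1 * z1).mean(), (da1 * z1 ** 2).mean()])

    dK = dK2 + dK1
    return dw1, db1, dw2, db2, dw3, db3, dK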
Consider the learning rate of the model to be α. The update
equations for the model parameters will be:
w1 = w1 − α · dw1
b1 = b1 − α · db1
w2 = w2 − α · dw2
b2 = b2 − α · db2
w3 = w3 − α · dw3
b3 = b3 − α · db3
K = K − α · dK
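In code, one gradient-descent step over these update equations could be sketched as follows; the tuple layout matches the forward and backward sketches above and is just one possible arrangement, with alpha standing in for the learning rate α.

def sgd_step(params, K, grads, alpha):
    """Apply the update equations above with learning rate alpha."""
    w1, b1, w2, b2, w3, b3 = params
    dw1, db1, dw2, db2, dw3, db3, dK = grads
    params = (w1 - alpha * dw1, b1 - alpha * db1,
              w2 - alpha * dw2, b2 - alpha * db2,
              w3 - alpha * dw3, b3 - alpha * db3)
    K = K - alpha * dK
    return params, K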
III. EXPECTED RESULTS
• A technical report containing implementation details (algorithm, initial settings such as sampling the parameters k0, k1 from some distribution, parameter updates on epochs, as sketched after this list, final parameter values at the end of training, train vs. test loss, train and test accuracy, F1-Score, plot of the loss function vs. epochs, code base hosted on GitHub and linked in the report) – Maximum 3 pages
• Grid search/brute force NOT allowed
• Data Sets for experiments and results: Bank-Note (Internship Candidates)
• Data Sets for experiments and results: Bank-Note, Iris,
Wisconsin Breast Cancer, MNIST (Junior Data Scientist
Positions)
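For orientation only, here is the minimal training-loop sketch referenced in the first item, tying the earlier sketches together. The N(0, 1) sampling of the Ada-Act coefficients, the epoch count, the learning rate and the one-hot encoding of the labels are illustrative assumptions, not prescriptions of this brief; data loading, the train/test split and the metric reporting (accuracy, F1-Score, loss curves) are left to the candidate.

def train(X, Y, params, epochs=100, alpha=0.01, seed=0):
    """X: (t, m) features, Y: (t, k) one-hot labels. Returns the trained
    parameters, the learned coefficients K, and the per-epoch loss history."""
    rng = np.random.default_rng(seed)
    K = rng.normal(0.0, 1.0, 3)            # sample k0, k1 (and k2) from N(0, 1)
    history = []
    for _ in range(epochs):
        cache = forward(X, params, K)
        a3 = cache[-1]
        # categorical cross-entropy, averaged over the batch
        loss = -np.mean(np.sum(Y * np.log(a3 + 1e-12), axis=1))
        grads = backward(X, Y, cache, params, K)
        params, K = sgd_step(params, K, grads, alpha)
        history.append(loss)
    return params, K, history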
