Happymonk Assignment
Abstract—The choice of Activation Function (AF) has proven to be an important factor affecting the performance of an Artificial Neural Network (ANN). Use a 1-hidden-layer neural network model that adapts to the most suitable activation function according to the data set. The ANN model can learn for itself the best AF to use by exploiting a flexible functional form, $k_0 + k_1 x$, with the parameters $k_0, k_1$ being learned from multiple runs. You can use this code base for implementation guidelines and help: https://github.com/sahamath/MultiLayerPerceptron
I. BACKGROUND

Selecting the best-performing AF for a classification task is essentially a naive (brute-force) procedure: a popularly used AF is picked and used in the network to approximate the optimal function. If this fails, the process is repeated with a different AF until the network learns to approximate the ideal function. It is interesting to inquire whether a framework can be built that uses the inherent clues and insights in the data to bring about the most suitable AF. Such an approach could not only save significant time and effort in tuning the model, but could also open up new ways of discovering essential features of less popular AFs.
II. MATHEMATICAL FRAMEWORK

A. Compact Representation

Let the proposed Ada-Act activation function be mathematically defined as

\[ g(x) = k_0 + k_1 x + k_2 x^2 \tag{1} \]

where the coefficients $k_0$, $k_1$ and $k_2$ are learned during training via back-propagation of error gradients.
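As a minimal NumPy sketch (assuming the quadratic form of Eq. (1), with the coefficient vector K treated as a learnable parameter like any weight), the activation and its input derivative can be written as:

```python
import numpy as np

def g(z, K):
    """Ada-Act activation g(z) = k0 + k1*z + k2*z^2, applied element-wise."""
    k0, k1, k2 = K
    return k0 + k1 * z + k2 * z**2

def g_prime(z, K):
    """Derivative with respect to the input: g'(z) = k1 + 2*k2*z."""
    _, k1, k2 = K
    return k1 + 2 * k2 * z

# K is initialised like any other parameter and updated by gradient descent.
K = np.array([0.0, 1.0, 0.1])
```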
For the purpose of demonstration, consider a feed-forward neural network consisting of an input layer $L_0$ of $m$ nodes for the $m$ features, two hidden layers $L_1$ and $L_2$ of $n$ and $p$ nodes respectively, and an output layer $L_3$ of $k$ nodes for the $k$ classes. Let $z_i$ and $a_i$ denote the inputs to and the activations of the nodes in layer $L_i$ respectively, and let $w_i$ and $b_i$ denote the weights and biases applied to the nodes of layer $L_{i-1}$. The activations of layer $L_0$ are the input features of the training examples. Finally, let

\[ K = \begin{bmatrix} k_0 \\ k_1 \\ k_2 \end{bmatrix} \]

denote the column matrix containing the equation coefficients, and let $t$ denote the number of training examples taken in one batch. Then the forward-propagation equations will be:

\[
\begin{aligned}
z_1 &= a_0 \times w_1 + b_1 \\
a_1 &= g(z_1) \\
z_2 &= a_1 \times w_2 + b_2 \\
a_2 &= g(z_2) \\
z_3 &= a_2 \times w_3 + b_3 \\
a_3 &= \mathrm{Softmax}(z_3)
\end{aligned}
\]

where $\times$ denotes the matrix multiplication operation and $\mathrm{Softmax}(\cdot)$ denotes the Softmax activation function.
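A minimal sketch of this forward pass in NumPy (reusing g from the sketch above; assumed shapes: a0 is t × m, w1 is m × n, w2 is n × p, w3 is p × k, with biases as row vectors):

```python
def softmax(z):
    """Row-wise Softmax; subtracting the row max keeps exp() numerically stable."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(a0, params, K):
    """Forward pass through the three layers; returns all pre-activations and activations."""
    w1, b1, w2, b2, w3, b3 = params
    z1 = a0 @ w1 + b1
    a1 = g(z1, K)
    z2 = a1 @ w2 + b2
    a2 = g(z2, K)
    z3 = a2 @ w3 + b3
    a3 = softmax(z3)
    return z1, a1, z2, a2, z3, a3
```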
For back-propagation, let the loss function used in this model be the Categorical Cross-Entropy Loss, let $df_i$ denote the gradient matrix of the loss with respect to the matrix $f_i$ (where $f$ can be substituted with $z$, $a$, $b$, or $w$), and let $dK_2$ and $dK_1$ be matrices of dimension $3 \times 1$. Then the back-propagation equations will be:

\[
\begin{aligned}
dz_3 &= a_3 - y \\
dw_3 &= \tfrac{1}{t}\, a_2^T \times dz_3 \\
db_3 &= \mathrm{avg_{col}}(dz_3) \\
da_2 &= dz_3 \times w_3^T \\
dz_2 &= g'(z_2) * da_2 \\
dw_2 &= \tfrac{1}{t}\, a_1^T \times dz_2 \\
db_2 &= \mathrm{avg_{col}}(dz_2)
\end{aligned}
\]

\[
dK_2 = \begin{bmatrix} \mathrm{avg}_e(da_2) \\ \mathrm{avg}_e(da_2 * z_2) \\ \mathrm{avg}_e(da_2 * z_2^2) \end{bmatrix} \tag{2}
\]

\[
\begin{aligned}
da_1 &= dz_2 \times w_2^T \\
dz_1 &= g'(z_1) * da_1 \\
dw_1 &= \tfrac{1}{t}\, a_0^T \times dz_1 \\
db_1 &= \mathrm{avg_{col}}(dz_1)
\end{aligned}
\]

\[
dK_1 = \begin{bmatrix} \mathrm{avg}_e(da_1) \\ \mathrm{avg}_e(da_1 * z_1) \\ \mathrm{avg}_e(da_1 * z_1^2) \end{bmatrix} \tag{3}
\]

Here $*$ denotes element-wise multiplication, $\mathrm{avg_{col}}(\cdot)$ averages each column over the $t$ examples of the batch, and $\mathrm{avg}_e(\cdot)$ averages over all entries of its matrix argument.
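The backward pass follows directly; the sketch below implements the equations above, including Eqs. (2) and (3), again reusing g_prime from the earlier sketch. How $dK_1$ and $dK_2$ are combined is not fixed by the equations; summing them, since K is shared by both hidden layers, is one natural choice and is an assumption here.

```python
def backward(a0, y, params, K, cache):
    """Backward pass; y is the one-hot label matrix of shape (t, k)."""
    w1, b1, w2, b2, w3, b3 = params
    z1, a1, z2, a2, z3, a3 = cache
    t = a0.shape[0]

    dz3 = a3 - y
    dw3 = a2.T @ dz3 / t
    db3 = dz3.mean(axis=0)            # avg_col over the batch

    da2 = dz3 @ w3.T
    dz2 = g_prime(z2, K) * da2
    dw2 = a1.T @ dz2 / t
    db2 = dz2.mean(axis=0)
    dK2 = np.array([da2.mean(), (da2 * z2).mean(), (da2 * z2**2).mean()])  # Eq. (2)

    da1 = dz2 @ w2.T
    dz1 = g_prime(z1, K) * da1
    dw1 = a0.T @ dz1 / t
    db1 = dz1.mean(axis=0)
    dK1 = np.array([da1.mean(), (da1 * z1).mean(), (da1 * z1**2).mean()])  # Eq. (3)

    dK = dK1 + dK2    # assumption: gradients for the shared coefficients accumulate
    return dw1, db1, dw2, db2, dw3, db3, dK
```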