
Image classification with Feed-Forward Neural Networks

Bartłomiej Meller, Kamil Matula and Paweł Chłąd


Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice

Abstract
Artificial neural networks, such as feed-forward networks (FFN), convolutional neural networks (CNN) and recursive neural networks (RNN), are becoming powerful tools that are starting to replace many classical algorithms. It is known that, for image recognition, CNNs are often the best choice in terms of accuracy. In this paper we show that feed-forward networks are capable of achieving comparable performance with a less complicated architecture than CNNs. After presenting the underlying theory of feed-forward networks, we describe the methods that allowed us to get past local minima of the network, and then we report the experiments that were carried out and the conclusions that followed.

Keywords
Neural networks, Activation function, Images, Classification

SYSTEM 2020: Symposium for Young Scientists in Technology, Engineering and Mathematics, Online, May 20 2020
bartmel655@student.polsl.pl (B. Meller); kamimat133@student.polsl.pl (K. Matula); pawechl893@student.polsl.pl (P. Chłąd)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Artificial neural networks, such as feed-forward networks (FFN), convolutional neural networks (CNN) and recursive neural networks (RNN), are becoming powerful tools that are starting to replace many classical algorithms. It is known that, for image recognition, CNNs are often the best choice in terms of accuracy. We chose FFNs because they are simpler in structure and, given a sufficient amount of data and a heuristic approach, they can achieve comparable performance while being easier to implement and manage.

One of the most effective learning methods is the backpropagation (BP) algorithm. BP uses gradient calculation to determine the "direction" in which the network should "go". The downside of this strictly mathematical approach is that gradient-based methods often get stuck in local minima of the loss function. Another common problem in many kinds of networks is knowledge generalization; we often train our models on large data sets to avoid fixation of a network on a small set of examples.

We can find many applications of neural networks in image processing. In [1] a CNN was adopted to recognize archaeological sites. We can also find many applications in medicine, where such systems classify bacteria [2] or detect respiratory malfunctions and other pathologies [3, 4, 5]. In the field of movies and advertisements there are also many applications of such ideas. Movie scenes can be segmented using background information [6], and predictions about such content can be made using neural networks [7]. We can also find applications of vision support in virtual reality entertainment systems [8], where neural networks are used to improve perception. There are also many examples in which neural networks are used to understand context and emotions in movies. In [9] an adaptive attention model was used to recognize patterns, while emotions in clips were detected by neural networks [10] or complex neuro-fuzzy systems [11].

This paper addresses both of those issues. We show that combining a heuristic with the backpropagation algorithm allows "minima traps" to be overcome efficiently, and we describe methods that help networks generalize their knowledge. First we introduce the basic theory of feed-forward networks, alongside an explanation of the backpropagation algorithm. Then we describe our example model and the series of experiments that were performed on it. At the end of the paper, we present the conclusions that we have gathered and give performance metrics of our model.
2. Feed-Forward Networks Theory

An Artificial Neural Network (ANN) is a mathematical model of the biological neural networks that compose a human brain; similarly to them, an ANN is built of neurons which are grouped into layers. Each neuron of a layer is connected by synapses to all neurons of the previous layer and to all neurons of the next layer. These connections have randomly initialized weights, which are modified during the learning process.

The first layer, responsible for receiving input data, is called the input layer. Similarly, the last one, which returns output data, is called the output layer. There can be zero, one or more hidden layers between them. The goal of the neural network architect is to find optimal layer sizes that make the learning process as efficient as possible. The number of input neurons depends on the number of features of the analyzed object, and the number of output neurons on how many classes it can be classified into.

Every neuron receives a value on its input, transforms it using an activation function and sends the output signal to the next layer. The input signal of the $i$-th neuron of the $k$-th layer is:

s_i^k = \sum_{j=1}^{n} w_{ij}^k y_j^{k-1},

where $w_{ij}^k$ is the weight of the connection between the $i$-th neuron of the $k$-th layer and the $j$-th neuron of the previous layer, and $y_j^{k-1}$ is the output signal value of the $j$-th neuron of the previous layer. The output signal of the $i$-th neuron is:

y_i^k = f(s_i^k) = f\left( \sum_{j=1}^{n} w_{ij}^k y_j^{k-1} \right).

There are many activation functions (also known as transfer functions or threshold functions). One of the most commonly used is the bipolar linear function. Its equation is:

f(s_i^k) = \frac{2}{1 + e^{-\alpha s_i^k}} - 1 = \frac{1 - e^{-\alpha s_i^k}}{1 + e^{-\alpha s_i^k}},

where $\alpha$ is a coefficient that influences the "width" (convergence rate) of the activation function.

Figure 1: Bipolar linear activation function.

After the artificial neural network has been properly built, it is time to teach it object recognition. An ANN learns by adjusting its synaptic weights in a strictly defined way. There are many methods of learning, but the most popular training technique used in feed-forward neural networks is back-propagation, in which the weights of the connections are modified sequentially from the output layer to the input layer. This direction is opposite to the way the inserted information moves in all FFNs. The goal of the training process is to minimize the value of the loss function for all elements of the training set ($T$). The training set consists of input vectors, which are inserted into the first layer through the input synapses, and expected output vectors, which are compared with the obtained outputs every time the neural network is fed. The loss function shows the distance between the predicted value and the actual one. To calculate it we can use the Mean Square Error Loss or the Cross-Entropy Loss (also known as Log Loss). Using MSE, the total error over the training set can be described by the equation:

E = \sum_{T} \sum_{i=1}^{n} (d_i - y_i)^2,    (1)

where $n$ is the dimension of the output vector (and also the number of neurons in the output layer), $d_i$ is the expected (desired) value at the $i$-th position of the output vector and $y_i$ is the value actually produced by the network at that position. Correction of the synaptic weights starts in the last layer and goes backwards through all hidden layers until it reaches the input layer. A weight is changed according to the equation:

w_{ij}^k = w_{ij}^k + \eta \nabla w_{ij}^k,    (2)

where $\eta$ is a correcting coefficient commonly called the 'Learning Rate' and $\nabla w_{ij}^k$ is the value of the gradient of the synapse weight's error:

\nabla w_{ij}^k = \frac{\partial E}{\partial w_{ij}^k} = \frac{1}{2} \cdot \frac{\partial E}{\partial s_i^k} \cdot 2 \cdot \frac{\partial s_i^k}{\partial w_{ij}^k} = 2 \delta_i^k y_j^{k-1},    (3)

where $\delta_i^k$ is the value of the change of the error function with respect to the $k$-th layer's $i$-th neuron's input signal and $y_j^{k-1}$ is the previous layer's $j$-th neuron's output signal value. On the last, $K$-th, layer $\delta$ equals:

\delta_i^K = \frac{1}{2} \cdot \frac{\partial E}{\partial s_i^K} = \frac{1}{2} \cdot \frac{\partial (d_i^K - y_i^K)^2}{\partial s_i^K} = f'(s_i^K) \cdot (d_i^K - y_i^K),    (4)

where $f'(s_i^K)$ is the value of the activation function's derivative at the $i$-th output neuron's input signal. On this layer the value depends mainly on the distance between the predicted and actual values, that is, on the error.
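To make the notation above concrete, the following short Python sketch (our illustration, not code from the paper; the use of NumPy, the constant ALPHA and the function names are our own assumptions) computes a layer's forward signal, the per-example MSE contribution from equation (1) and the output-layer delta from equation (4):

```python
import numpy as np

ALPHA = 1.0  # assumed steepness coefficient of the bipolar activation

def activation(s):
    # f(s) = (1 - e^{-alpha*s}) / (1 + e^{-alpha*s})
    return (1.0 - np.exp(-ALPHA * s)) / (1.0 + np.exp(-ALPHA * s))

def activation_derivative(s):
    # derivative of the bipolar activation
    e = np.exp(-ALPHA * s)
    return 2.0 * ALPHA * e / (1.0 + e) ** 2

def layer_forward(weights, prev_outputs):
    # s_i^k = sum_j w_ij^k * y_j^{k-1};  y_i^k = f(s_i^k)
    s = weights @ prev_outputs
    return s, activation(s)

def mse_error(expected, actual):
    # contribution of one training example to E in equation (1)
    return np.sum((expected - actual) ** 2)

def output_delta(expected, s_out, y_out):
    # delta_i^K = f'(s_i^K) * (d_i^K - y_i^K), equation (4)
    return activation_derivative(s_out) * (expected - y_out)
```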
The $\delta$ values of the other layers use the numbers calculated in the previous steps:

\delta_i^k = \frac{1}{2} \cdot \frac{\partial E}{\partial s_i^k} = \frac{1}{2} \cdot \sum_{j=1}^{N_{k+1}} \frac{\partial E}{\partial s_j^{k+1}} \cdot \frac{\partial s_j^{k+1}}{\partial s_i^k} = f'(s_i^k) \sum_{j=1}^{N_{k+1}} \delta_j^{k+1} w_{ji}^{k+1},    (5)

where $N_{k+1}$ is the number of neurons in the $(k+1)$-th layer.

Algorithm 1 shows the full process of training a feed-forward neural network with the back-propagation algorithm. It uses an array called $D$, which holds the $\delta$ values. This jagged array has $L$ rows, and each row has as many columns as the layer it corresponds to has neurons. All array elements are written as $A_{i,j}$, which is just an alternative form of writing $A[i][j]$. Moreover, the numerical intervals used in the for-loops are half-open (the numbers after "to" do not belong to the intervals). There are also the symbols $LR$, $s_i^k$, $y_i^k$ and $w_{ij}^k$. They denote, respectively: the Learning Rate ($\eta$), the $k$-th layer's $i$-th neuron's input and output values, and the weight of the synapse between the $i$-th neuron of the $k$-th layer and the $j$-th neuron of the previous layer. Finally, the symbol $f'(\cdot)$ signifies the derivative of the activation function; for the bipolar linear function

f(x) = \frac{1 - e^{-\alpha x}}{1 + e^{-\alpha x}},    (6)

the derivative is:

f'(x) = \frac{2 \alpha e^{-\alpha x}}{(1 + e^{-\alpha x})^2}.    (7)

Data: EpochsCount, TrainingInputs, ExpectedOutputs (EOUT)
Result: Higher precision of the neural network
L := count of the network's layers;
D := empty jagged array for the δ values;
for i = 0 to EpochsCount do
    for j = 0 to Training Set's Length do
        Insert the j-th vector from TrainingInputs into the first layer's input synapses;
        for k = 0 to L do
            Calculate s^k for all neurons of the k-th layer by summing the products of the synaptic weights and the output values of the previous layer's neurons (or of the input synapses if it is the input layer);
            Calculate y^k for all neurons of the k-th layer by applying the activation function;
        end
        Output (OUT) := vector made of the output values of the output neurons;
        for n = 0 to output neurons count do
            D_{L-1,n} = (EOUT_{j,n} - OUT_n) · f'(s_n^{L-1});
        end
        for k = L - 2 to 0 step -1 do
            for n = 0 to k-th layer's neurons count do
                D_{k,n} = 0;
                for m = 0 to (k+1)-th layer's neurons count do
                    D_{k,n} = D_{k,n} + D_{k+1,m} · w_{mn}^{k+1};
                end
                D_{k,n} = D_{k,n} · f'(s_n^k);
            end
        end
        for k = L - 1 to 0 step -1 do
            for n = 0 to k-th layer's neurons count do
                for m = 0 to (k-1)-th layer's neurons count do
                    w_{nm}^k = w_{nm}^k + 2 · LR · D_{k,n} · y_m^{k-1};
                end
            end
        end
    end
end
Algorithm 1: FFN training algorithm.
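For readers who prefer code to pseudocode, a compact Python/NumPy rendering of Algorithm 1 could look as follows. This is only a sketch under our own assumptions (the class name, the vectorized inner loops and the uniform initial weights in [-1, 1] are not taken from the paper):

```python
import numpy as np

class FFN:
    """Minimal sketch of a feed-forward network trained as in Algorithm 1."""

    def __init__(self, layer_sizes, alpha=1.0, learning_rate=0.01, rng=None):
        self.alpha = alpha
        self.lr = learning_rate
        rng = rng or np.random.default_rng()
        # weights[k] holds the incoming synapses of layer k; the [-1, 1] range is an assumption
        self.weights = [rng.uniform(-1.0, 1.0, (n_out, n_in))
                        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

    def _f(self, s):
        # bipolar activation, equation (6)
        return (1.0 - np.exp(-self.alpha * s)) / (1.0 + np.exp(-self.alpha * s))

    def _df(self, s):
        # its derivative, equation (7)
        e = np.exp(-self.alpha * s)
        return 2.0 * self.alpha * e / (1.0 + e) ** 2

    def forward(self, x):
        # returns pre-activations s^k and outputs y^k of every layer (y[0] is the input vector)
        signals, outputs = [], [np.asarray(x, dtype=float)]
        for w in self.weights:
            s = w @ outputs[-1]
            signals.append(s)
            outputs.append(self._f(s))
        return signals, outputs

    def train(self, inputs, expected, epochs):
        for _ in range(epochs):
            for x, d in zip(inputs, expected):
                s, y = self.forward(x)
                deltas = [None] * len(self.weights)
                deltas[-1] = self._df(s[-1]) * (np.asarray(d) - y[-1])    # equation (4)
                for k in range(len(self.weights) - 2, -1, -1):            # equation (5)
                    deltas[k] = self._df(s[k]) * (self.weights[k + 1].T @ deltas[k + 1])
                for k, w in enumerate(self.weights):                      # equations (2)-(3)
                    w += 2.0 * self.lr * np.outer(deltas[k], y[k])

    def predict(self, x):
        return self.forward(x)[1][-1]

# Hypothetical usage matching the model described in Section 3:
# 50x25x3 = 3750 inputs, hidden layers of 800, 200 and 50 neurons, six output classes.
# net = FFN([3750, 800, 200, 50, 6], learning_rate=0.01)
```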

3. Example system

Our experimental FFN takes in a 50x25x3 image and outputs a six-dimensional vector of decision values. Each decision value represents one movie that is associated with the given frame.

3.1. Data preparation

3.1.1. Image preparation

After loading the data set into memory, we cut each image to a 2:1 aspect ratio; this value was chosen because our input vector is an image with a 2:1 aspect ratio. If we took an image with an aspect ratio that does not match the input aspect ratio, we would need to stretch or shrink it; that would in turn add some pixels and thus make the results less accurate. After cropping to the target aspect ratio, we crop another 1/12 of the width and height from each side to eliminate possible letterboxes. Finally, we shrink the image to 50x25 pixels. Such dimensions have been chosen to cut down on training times. Obviously, to eliminate possible artifacts introduced by the preprocessing steps described above, the image will be filtered [12].

3.1.2. Data labeling

Each image is given a label signifying the movie it belongs to. Then 1/10 of the image samples are redirected into the testing set.

3.2. Training the network

After preparation of the training and testing datasets, all images (225 per movie), converted to vectors, are inserted into the neural network. As stated in the "Feed-Forward Networks Theory" section, the information goes through all layers to the output layer, with the activation function applied in every encountered neuron, and then goes back in the back-propagation process, modifying the synaptic weights. This sequence is repeated multiple times and results in an improvement of the network's ability to recognize objects.

Table 1: Highest precision with 1 hidden layer of 50 neurons.

Movies count    Highest precision
4               76.0 %
5               71.2 %
6               52.7 %

Table 2: Highest precision with 3 hidden layers of 800, 200 and 50 neurons.

Movies count    Highest precision
4               82.0 %
5               88.8 %
6               84.6 %

4. Experiments

The main problem dealt with, in order to provide optimal learning accuracy and time, was selecting an appropriate architecture that grants relatively good accuracy but, on the other hand, contains as few neurons as required. A minimal neuron count results in minimal computational time.

Initially, when tests were carried out on 4 classes of data, a network that contained only one hidden layer consisting of as few as 50 neurons seemed to be the most efficient and accurate (76%). Nevertheless, it was not capacious enough to accommodate the amount of knowledge needed to recognize more than four movies. Eventually, a model that contained 3 hidden layers with the following neuron counts: 800, 200 and 50 came out to be the optimal solution.

The back-propagation algorithm has one significant defect: it does not use any heuristics that could help it deal with local minima. The solution to this issue is quite simple and easy to implement. In this case, adding a random number from the range [-0.002; 0.002] to the weights of all connections turned out to be extraordinarily efficient. However, it is hard to give exact figures, because the method was combined with a gradual increase of the class count. Surprisingly, increasing the number of classes resulted in a leap in the quality of the decisions made by the network.

This phenomenon was observed after a few epochs of learning following the addition of a class. Sometimes the network needed the randomization mentioned above for the phenomenon to occur.

At the time of writing this article, the absolute accuracy of predictions made by the network for six classes exceeded the maximal accuracy obtained on four classes. We suspect that this can be explained by the growing generality of the classifier contained in the model as the number of known classes increases.

The methodology of learning was the following. The model was trained until its accuracy had not increased for a few epochs. In the next stage the network's weights were randomized to obtain three different child networks. All four instances were trained in parallel. At the end, the best of them was picked, reproduced by randomizing its weights, and the whole process was repeated.

When the network had stalled in its advancement, one class was added to its possible outputs and learning was continued in the same way. One of the most interesting issues that we experienced in the initial part of the research was a significant drop in accuracy when the class set contained the "Shrek 2" movie. After preliminary learning on a dataset that did not contain this movie, the problem disappeared.

We were also experimenting with different weight initialization techniques, described in [4, 13]. We tried the following methods (a short sketch of all three is given after this list):

• He weight initialization - multiplies a random value from [-1.0, 1.0] by $\sqrt{2/size}$, where $size$ is the size of the previous layer;

• Xavier's weight initialization - similar to He initialization, but with $\sqrt{1/size}$;

• We also used the following technique: multiplying the random value by $\sqrt{2/(size_{l-1} + size_l)}$, where $size_n$ is the size of the $n$-th layer.
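A minimal sketch of the three initializers listed above, assuming (as described) a base weight drawn uniformly from [-1.0, 1.0]; the function names and the use of NumPy are our own:

```python
import numpy as np

def scaled_uniform(fan_in, fan_out, scale, rng=None):
    # base weights drawn uniformly from [-1, 1], then multiplied by the given scale
    rng = rng or np.random.default_rng()
    return rng.uniform(-1.0, 1.0, (fan_out, fan_in)) * scale

def he_init(fan_in, fan_out, rng=None):
    # scale sqrt(2 / size of previous layer)
    return scaled_uniform(fan_in, fan_out, np.sqrt(2.0 / fan_in), rng)

def xavier_init(fan_in, fan_out, rng=None):
    # scale sqrt(1 / size of previous layer)
    return scaled_uniform(fan_in, fan_out, np.sqrt(1.0 / fan_in), rng)

def averaged_init(fan_in, fan_out, rng=None):
    # scale sqrt(2 / (size_{l-1} + size_l)), the third technique from the list above
    return scaled_uniform(fan_in, fan_out, np.sqrt(2.0 / (fan_in + fan_out)), rng)
```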


Unfortunately, none of those methods worked correctly. In testing, we used networks with a significant number of neurons in nearly every layer (3750 neurons in the input layer), which in turn set the initial weight values to minute values.

The results of our research are shown in Tables 1 and 2 above. Table 1 presents the most satisfactory accuracies of the neural network (with 1 hidden layer consisting of 50 neurons) before the experiment. Table 2 shows the effects of randomization. The percentages in the right columns are calculated by dividing the number of frames whose movie was correctly recognized by the size of the testing dataset (which is 25 frames per movie). As they show, the results are surprisingly high.

5. Example Predictions

Table 3, at the end of the paper, contains example predictions made by the network. The provided frames are screenshots from the Netflix platform, taken independently of the learning and testing datasets. As can be seen, the accuracy is satisfactorily high. The network makes a few mistakes when it comes to dark and distant frames. Fixing this issue will be our next goal.

6. Conclusions

After all the experiments done with this network, we can state the following conclusions:

1. Combining back-propagation with a heuristic approach gave an unprecedented leap in network accuracy. After a number of tests with weight randomization, we suspect that by giving a "nudge" to the weights, we push the network out of a local minimum, hence allowing further learning.

2. Changing the model in-flight allows for more generality. By changing the model topology (e.g. adding an additional output dimension), we have seen an increase in the generality of the model.

3. Training on diversified samples first results in increased generality.

References

[1] M. Woźniak, D. Połap, Soft trees with neural components as image-processing technique for archeological excavations, Personal and Ubiquitous Computing 24 (2020) 363–375.
[2] D. Połap, M. Woźniak, Bacteria shape classification by the use of region covariance and convolutional neural network, in: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–7.
[3] M. Woźniak, D. Połap, R. K. Nowicki, C. Napoli, G. Pappalardo, E. Tramontana, Novel approach toward medical signals classifier, in: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, 2015, pp. 1–7.
[4] G. Capizzi, G. L. Sciuto, C. Napoli, D. Połap, M. Woźniak, Small lung nodules detection based on fuzzy-logic and probabilistic neural network with bio-inspired reinforcement learning, IEEE Transactions on Fuzzy Systems 28 (2019) 1178–1189.
[5] F. Beritelli, G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, A novel training method to preserve generalization of RBPNN classifiers applied to ECG signals diagnosis, Neural Networks 108 (2018) 331–338.
[6] L.-H. Chen, Y.-C. Lai, H.-Y. M. Liao, Movie scene segmentation using background information, Pattern Recognition 41 (2008) 1056–1065.
[7] Y. Zhou, L. Zhang, Z. Yi, Predicting movie box-office revenues using deep neural networks, Neural Computing and Applications 31 (2019) 1855–1865.
[8] D. Połap, K. Kęsik, A. Winnicka, M. Woźniak, Strengthening the perception of the virtual worlds in a virtual reality environment, ISA Transactions 102 (2020) 397–406.
[9] J. Chen, J. Shao, C. He, Movie fill in the blank by joint learning from video and text with adaptive temporal attention, Pattern Recognition Letters 132 (2020) 62–68.
[10] S. Alghowinem, R. Goecke, M. Wagner, A. Alwabil, Evaluating and validating emotion elicitation using English and Arabic movie clips on a Saudi sample, Sensors 19 (2019) 2218.
[11] T.-L. Nguyen, S. Kavuri, M. Lee, A multimodal convolutional neuro-fuzzy network for emotion understanding of movie clips, Neural Networks 118 (2019) 208–219.
[12] G. Capizzi, S. Coco, G. Lo Sciuto, C. Napoli, A new iterative FIR filter design approach using a Gaussian approximation, IEEE Signal Processing Letters 25 (2018) 1615–1619.
[13] M. Matta, G. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, A. Nannarelli, M. Re, S. Spanò, A reinforcement learning-based QAM/PSK symbol synchronizer, IEEE Access 7 (2019) 124147–124157.

Table 3: Example predictions of the network (the frame thumbnails from the original "Picture" column are omitted).

Origin           Prediction       Matches   2nd prediction
Indiana Jones    Indiana Jones    yes       Shark Tale
Indiana Jones    Indiana Jones    yes       Shrek 2
Indiana Jones    Indiana Jones    yes       Shrek 2
Indiana Jones    Shrek 2          no        The Lego Movie
The Lego Movie   The Lego Movie   yes       Shark Tale
The Lego Movie   The Lego Movie   yes       Shark Tale
The Lego Movie   Shark Tale       no        The Lego Movie
The Lego Movie   The Lego Movie   yes       Shrek 2
Madagascar       Madagascar       yes       Shark Tale
Madagascar       Shrek 2          no        The Lego Movie
Madagascar       Madagascar       yes       Shrek 2
Madagascar       Madagascar       yes       Shrek 2

Table 3 (continued)

Origin           Prediction       Matches   2nd prediction
Shark Tale       Shark Tale       yes       The Lego Movie
Shark Tale       Shark Tale       yes       Indiana Jones
Shark Tale       Shark Tale       yes       The Lego Movie
Shark Tale       Shark Tale       yes       The Lego Movie
Shrek 2          Shrek 2          yes       Shark Tale
Shrek 2          Shrek 2          yes       Indiana Jones
Shrek 2          Shrek 2          yes       Madagascar
Shrek 2          Shrek 2          yes       Shark Tale
Loving Vincent   Loving Vincent   yes       Shrek 2
Loving Vincent   Loving Vincent   yes       The Lego Movie
Loving Vincent   Loving Vincent   yes       Shark Tale
Loving Vincent   Loving Vincent   yes       Shrek 2

