
August 12, 2006 18:30 WSPC/157-IJCIA

00179

International Journal of Computational Intelligence and Applications


Vol. 6, No. 1 (2006) 45–59
© Imperial College Press


APPLICATION OF SICoNNETS TO HANDWRITTEN DIGIT RECOGNITION

FOK HING CHI TIVIVE and ABDESSELAM BOUZERDOUM


School of Electrical, Computer and Telecommunications Engineering
University of Wollongong, Northfields Avenue
Wollongong, NSW 2522, Australia
tivive@uow.edu.au
a.bouzerdoum@ieee.org
Received 6 September 2005
Revised 16 February 2006
In this paper, we apply a new neural network model, namely shunting inhibitory convolutional neural networks, or SICoNNets for short, to the problem of handwritten digit
recognition. This type of network has a generic and flexible architecture, where the
processing is based on the physiologically plausible mechanism of shunting inhibition.
A hybrid first-order training method, called QRProp, is developed based on the three
training algorithms Rprop, Quickprop, and SuperSAB. The MNIST database is used to
train and evaluate the performance of SICoNNets in handwritten digit recognition. A
network with 24 feature maps and 2722 free parameters achieves a recognition accuracy
of 97.3%.
Keywords: Convolutional neural networks; shunting inhibitory neurons; handwritten
digit recognition; systematic connection schemes.

1. Introduction
Evolving from our understanding of neuro-biological systems, artificial neural networks give computers an amazing capacity to learn complex tasks from examples.
They have become an alternative computational approach for problems that do
not have algorithmic solutions, or for which the algorithmic solutions are too difficult to express analytically. Their success can be attributed in part to their fault
tolerance, parallel processing, and generalization ability. The most popular neural
network architecture that is in use today, and discussed in almost every neural
network textbook, is the multilayer perceptron (MLP). MLPs have proven to be
a powerful computational tool for many problems in pattern recognition, function
approximation, and data analysis, to name a few. However, MLPs have some drawbacks when applied directly, without any processing, to high-dimensional data such
as in image analysis, image understanding and machine vision. The main problem
is that the size of the network grows with the size of the input image, which makes
the network training a much harder task. Moreover, over-fitting may occur and

the generalization ability of the network suffers when there are not enough training
samples. The common approach to circumvent these problems is to use some preprocessing techniques to extract lower-dimensional features from the input data.
Feature extraction, however, is a computationally expensive process and requires
prior knowledge about the data to design the feature extractor.
In the past 20 years, researchers have focused not only on the development of
training algorithms for MLPs, but also on the identification of significant network
structures and weight constraints that can reduce the number of trainable parameters. Inspired by Hubel and Wiesel's hierarchical vision model of the cortex,
Fukushima et al.1 developed the neocognitron, a two-dimensional (2D) neural network
architecture for visual pattern recognition. LeCun et al.,2 on the other hand, proposed a series of convolutional neural network (CoNN) architectures, based upon the
three structural concepts of local receptive fields, weight sharing and sub-sampling.
These networks can easily deal with variability in 2D shapes and possess a certain
degree of local invariance to distortion and translation. Consequently, they have
attracted considerable interest and gained popularity for solving visual pattern
recognition problems such as face detection,3 face recognition,4 facial expression analysis,5 and medical image pattern recognition.6
In Ref. 7, LeCun et al. reported their latest CoNN, widely known as
LeNet-5, for handwritten digit recognition. The network consists of seven processing layers, where the first four layers are two successive pairs of convolutional
and sub-sampling layers with a total of 44 feature maps for feature extraction. The
fifth and sixth layers are a convolutional layer and a fully-connected layer
with 120 and 84 neurons, respectively, and the output layer has ten neurons to represent the
ten digit classes. Overall, the network has 60,000 trainable parameters, and was
trained and tested on the MNIST database8 with an error rate of 0.8%. Calderón
et al.9 developed a CoNN structure similar to LeNet-5 which uses Gabor filters as receptive fields for the first convolutional layer; at the output layer, 84
perceptrons are used to represent the output as a grayscale image of size 12 × 7.
To improve the performance of their handwritten digit recognition system, they
applied a boosting method to their networks and achieved an error rate of
0.68%. Simard and his colleagues,10 on the other hand, used a much simpler CoNN
structure for handwritten digit recognition with four processing layers and a network retina of size 29 × 29. The first layer has five feature maps of size 13 × 13,
and the second layer has 50 feature maps. In each layer, the size of the feature
map is reduced from n to (n − 3)/2, where n is the original size, and the receptive
field size used throughout the network is 5 × 5. The last two layers are equivalent to a two-layer fully-connected MLP with 100 hidden neurons and ten output
neurons. Expanding the training set through elastic distortions and using a cross-entropy error function, they achieved an error rate of 0.4%.
Gorgevik and Cakmakov11 proposed another approach by combining two neural
networks and a support vector machine to implement a three-stage classifier for
handwritten digit recognition. First, the digit images are preprocessed for slant


correction. Then 292 features are extracted from the image as inputs for the cascade of classifiers. Based on the MNIST database, their three-stage classifier has an
error rate of 0.83%. The experimental results of these neural-based approaches show
that neural networks can yield state-of-the-art performance. Nevertheless, these
networks are still plagued by the problem of a huge number of trainable parameters.
Recently, we have proposed a new class of convolutional neural networks, known
as shunting inhibitory convolutional neural networks (SICoNNets), which can be
easily tailored to the user's specifications.12 The key characteristics of these networks are the processing element used for feature extraction and the systematic
interconnection schemes between the different hidden layers. The processing elements in the hidden layers are based on the shunting inhibition mechanism, which
plays an important role in visual information processing in the cortex.13–15 The
reason for using this type of processing element is that shunting inhibitory neurons have been shown to be more computationally powerful than the traditional
sigmoid-type neurons. In contrast to a sigmoid neuron, a single shunting inhibitory
neuron can solve linearly nonseparable classification problems by forming nonlinear
decision boundaries.16,17 In Ref. 18, the shunting inhibitory convolutional neural
network was applied to a two-class pattern classification task, discriminating segmented images of faces from non-faces, and was subsequently developed into a face
detection system that can detect and localize faces in complex background scenes.
In this paper, we apply SICoNNets to handwritten digit recognition. The next
section gives a detailed description of the shunting inhibitory convolutional neural
network architecture. Section 3 describes the training algorithms that have been
developed for these networks, followed by the description of the handwritten digit
recognition system in Sec. 4. The experimental results and performance analysis
are presented in Sec. 5, and final concluding remarks are given in Sec. 6.
2. Description of SICoNNet Architecture
The proposed convolutional neural networks, SICoNNets, have a flexible do-it-yourself network architecture in which the following network parameters can be
specified: the input size, the receptive field size, number of layers and/or number of
feature maps, number of outputs, and connection scheme between layers. The input
layer is a 2D array used by the network to receive images from the environment.
The input layer is succeeded by several processing layers, or hidden layers, and
each hidden layer is made up of planes of shunting inhibitory neurons, known as
feature maps. Each neuron in the feature map receives inputs from a small local
neighborhood in the previous layer, its receptive field. However, all the neurons in a
feature map share the same set of connection weights [Fig. 1(a)], and each hidden
layer has a fixed receptive field size. Since all neurons in a feature map share the
same set of weights, the same operation is performed on different parts of the
input plane. Hence, the same elementary visual feature is extracted from different
positions in the input image. Other feature maps of the same layer operate with
different sets of weights to extract different types of local features. In higher layers,
the feature maps extract higher-order features by taking their inputs from one or
more feature maps in the preceding layer. In each hidden layer, another structural
process, namely sub-sampling, is performed to reduce the spatial resolution of the
2D input by shifting the centers of receptive fields of adjacent neurons by two
positions in both directions [see Fig. 1(b)]; as a result, the size of the feature maps
is reduced to one quarter in each hidden layer. This introduces a certain degree
of invariance to translation and input distortion, as the absolute location of the
extracted feature becomes less important in higher layers so long as its approximate
position relative to other features is preserved.

Fig. 1. Schematic diagrams illustrating (a) the application of local receptive fields, where all neurons of a feature map use the same set of weights, and (b) the
movement of a receptive field in the input image, shifted horizontally and vertically by two positions.
The computation performed by the shunting inhibitory neuron at location $(i, j)$
in the $k$th feature map of the $L$th layer is given by

$$Z_{L,k}(i,j) = \frac{X_{L,k}(i,j)}{a_{L,k}(i,j) + Y_{L,k}(i,j)}, \qquad i, j = 1, \ldots, F_L, \tag{1}$$

where

$$X_{L,k}(i,j) = g_L\!\left(\sum_{m=1}^{S_{L-1}} \left[C_{L,k} \ast Z_{L-1,m}\right]_{(2i)(2j)} + b_{L,k}(i,j)\right),$$

and

$$Y_{L,k}(i,j) = f_L\!\left(\sum_{m=1}^{S_{L-1}} \left[D_{L,k} \ast Z_{L-1,m}\right]_{(2i)(2j)} + d_{L,k}(i,j)\right).$$
The parameters $C_{L,k}$ and $D_{L,k}$ are the sets of excitatory and inhibitory weights,
respectively, $b_{L,k}$ and $d_{L,k}$ are scalar parameters called the biases, $a_{L,k}$ is the passive

decay rate of the neuron, $g_L$ and $f_L$ are the activation functions, $S_{L-1}$ is the number
of feature maps at the $(L-1)$th layer, and $F_L$ is the size of the feature map at the
$L$th layer. In a feature map, all the neurons share the same sets of weights, $C_{L,k}$ and
$D_{L,k}$, as well as the biases and the passive decay rate parameter. In order to avoid
division by zero in (1), $a_{L,k}$ is constrained so that the denominator stays positive:

$$a_{L,k}(i,j) + Y_{L,k}(i,j) \geq \varepsilon, \tag{2}$$

where $\varepsilon$ is a small positive constant.
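To make (1) and (2) concrete, the following is a minimal pure-Python sketch of one shunting inhibitory feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, borders are zero-padded (the text does not specify border handling), the weighted sum is written in cross-correlation form without flipping the kernel, and condition (2) is enforced by clamping the denominator rather than by constraining the decay rate during training.

```python
import math

def window_sum(z, kernel, r, c):
    """Weighted sum of `kernel` over the window of `z` centred at (r, c).

    Assumption: positions outside the map contribute zero (zero padding).
    """
    k = len(kernel)
    half = k // 2
    rows, cols = len(z), len(z[0])
    total = 0.0
    for u in range(k):
        for v in range(k):
            rr, cc = r + u - half, c + v - half
            if 0 <= rr < rows and 0 <= cc < cols:
                total += kernel[u][v] * z[rr][cc]
    return total

def shunting_feature_map(prev_maps, C, D, b, d, a,
                         g=math.tanh, f=math.exp, eps=1e-6):
    """Compute Z = X / (a + Y) as in (1), with receptive fields shifted by two."""
    size = len(prev_maps[0]) // 2          # sub-sampling halves each dimension
    out = []
    for i in range(size):
        row = []
        for j in range(size):
            r, c = 2 * i + 1, 2 * j + 1    # 0-based centre for 1-based (2i, 2j)
            x = g(sum(window_sum(z, C, r, c) for z in prev_maps) + b)
            y = f(sum(window_sum(z, D, r, c) for z in prev_maps) + d)
            # Simplified stand-in for constraint (2): keep the denominator >= eps.
            row.append(x / max(a + y, eps))
        out.append(row)
    return out

# Toy usage: a constant 24 x 24 input and 5 x 5 kernels give a 12 x 12 map.
image = [[1.0] * 24 for _ in range(24)]
C = [[0.04] * 5 for _ in range(5)]
D = [[0.0] * 5 for _ in range(5)]
fmap = shunting_feature_map([image], C, D, b=0.0, d=0.0, a=1.0)
print(len(fmap), len(fmap[0]))
```

Because all neurons of the map reuse the same `C`, `D`, `b`, `d` and `a`, the number of trainable parameters does not depend on the size of the input.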


In contrast to some existing CoNNs3,4,19 in which the connection strategy is
non-trivial and manually chosen, the proposed CoNNs were developed with three
systematic connection schemes: full-, binary-, and toeplitz-connection. In the full-connection scheme, each hidden layer contains an arbitrary number of feature maps,
which are fully connected to the feature maps in the succeeding layer. This scheme
is similar to the MLP, where the number of hidden layers and hidden neurons
(equivalent to feature maps) can be changed arbitrarily. The binary-connection and
toeplitz-connection are partial-connection schemes where the first hidden layer can
have an arbitrary number of feature maps so long as the subsequent layer has twice
the number of feature maps. In the binary-connection, each feature map branches
out to two feature maps in the succeeding layer, as shown in Fig. 2(a), whereas in
the toeplitz-connection each feature map may have one-to-one or one-to-many links
with feature maps of the preceding layer. As an example, Table 1 illustrates the
connections between the first (L1) and second (L2) hidden layers. Suppose that L1 has
four feature maps, labeled A–D, and L2 has eight feature maps, labeled 1 to 8 (first
column). Feature maps 1 and 8 have one-to-one connections with feature maps A
and D, respectively. Feature map 2 makes connections with feature maps A and B.
Feature map 3 is connected to feature maps A–C. The rest of the connections form
a Toeplitz matrix, hence the name [see Fig. 2(b)]. In other words, each feature map
of L1 connects to the same number of feature maps of L2 (in this case five), and
its connections appear along a diagonal of the connection matrix. There are two
advantages to partial-connection schemes:
first, they reduce the number of connections within the network, which may increase
the generalization ability;
second, they diversify the extraction of high-order features by taking inputs from
different sets of feature maps rather than from all feature maps in the previous
layer.
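The two partial-connection schemes described above can be generated programmatically. The sketch below is a hedged illustration: the function names are hypothetical, the ordering of the children in the binary scheme is an assumption (Fig. 2(a) fixes only that each map has two children), and the Toeplitz rule is inferred from the four-to-eight example of Table 1, so its extension to other layer sizes is also an assumption.

```python
def binary_connections(n_prev):
    """Each previous-layer map feeds exactly two next-layer maps.

    Returns {next_map_index: [source_map_index]} with 0-based indices.
    Assumption: children 2p and 2p + 1 belong to parent p.
    """
    return {child: [child // 2] for child in range(2 * n_prev)}

def toeplitz_connections(n_prev):
    """Next-layer map r takes input from previous map c when 0 <= r - c <= n_prev.

    For n_prev = 4 this reproduces the example in the text: map 1 <- A,
    map 2 <- A, B, map 3 <- A, B, C, ..., map 8 <- D.
    """
    n_next = 2 * n_prev
    return {r: [c for c in range(n_prev) if 0 <= r - c <= n_prev]
            for r in range(n_next)}

labels = "ABCD"
table = toeplitz_connections(4)
for r in sorted(table):
    print(r + 1, ", ".join(labels[c] for c in table[r]))

# Fan-out of each L1 map: how many L2 maps it feeds.
fan_out = [sum(c in srcs for srcs in table.values()) for c in range(4)]
print(fan_out)
```

In the four-to-eight case, every L1 map feeds five L2 maps, which matches the statement that each feature map of L1 connects to the same number of feature maps of L2.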
At the output layer, sigmoid neurons are used as processing elements to classify
the features extracted at the last hidden layer. To reduce the number of weights,
a local averaging operation is applied to all feature maps in the last hidden layer;
that is, a 2 × 2 non-overlapping receptive field is used across each feature map to
average four outputs into a single signal, which is fed to the neurons at the output
layer. The activity of the output neuron is given by

$$y = h\!\left(\sum_{i=1}^{S_N} w_i z_i + b\right), \tag{3}$$

where $y$ is the neural response of the sigmoid neuron, $h$ is the output activation
function, the $w_i$ are the connection weights, the $z_i$ are the input signals, $S_N$ is the
number of input signals, and $b$ is the bias term.

Fig. 2. The partial-connection schemes: (a) binary-connection and (b) toeplitz-connection.

Table 1. Toeplitz connections between the feature maps of L1 and L2.

L2 feature map    Connections from L1
1                 A
2                 A, B
3                 A, B, C
4                 A, B, C, D
5                 A, B, C, D
6                 B, C, D
7                 C, D
8                 D
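The local averaging step and the output neuron of (3) can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and the toy feature map and weights are made-up values chosen so that the result is easy to check by hand.

```python
import math

def average_2x2(fmap):
    """Average each non-overlapping 2 x 2 block of a square feature map."""
    n = len(fmap) // 2
    return [[(fmap[2 * i][2 * j] + fmap[2 * i][2 * j + 1] +
              fmap[2 * i + 1][2 * j] + fmap[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(n)]
            for i in range(n)]

def output_neuron(signals, weights, bias, h=math.tanh):
    """Eq. (3): y = h(sum_i w_i z_i + b)."""
    return h(sum(w * z for w, z in zip(weights, signals)) + bias)

# Toy 6 x 6 feature map whose entries are 0, 1, ..., 35 in row-major order.
fmap = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pooled = average_2x2(fmap)                        # 3 x 3 after averaging
signals = [v for row in pooled for v in row]      # 9 inputs to the output layer
y = output_neuron(signals, [0.0] * len(signals), bias=0.5)
print(pooled[0][0], len(signals), y)
```

Averaging a 6 × 6 map down to 3 × 3 cuts the number of output-layer weights per feature map from 36 to 9.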


3. Training Algorithm
To train the SICoNNets, a batch training algorithm based on the combination of
Rprop,20 Quickprop,21 and SuperSAB22 has been developed and named QRProp.
It is a local adaptation technique, in which the temporal behavior of the partial
derivative of the error with respect to each weight is used in the computation of the weight update. For comparison, the Levenberg–Marquardt (LM) algorithm is also implemented, where the
Jacobian matrix is computed using a modified error-backpropagation rule similar
to the one developed by Hagan23 (see Ref. 12 for more details).
The weight update rule of the QRProp method is given by

$$\vec{W}(k+1) = \vec{W}(k) + \Delta\vec{W}(k) + \vec{\mu}(k) \odot \Delta\vec{W}(k-1), \tag{4}$$

where $\odot$ is the element-by-element product of two column vectors, and the
parameter $\vec{W}(k)$ is the weight vector obtained by reshaping all the weights
in the receptive fields and biases of the neurons in the feature maps, where elements are taken column-wise from the first hidden layer to the last layer of the
network, forming a large column vector. The weight update $\Delta\vec{W}(k)$ is computed
using the same principle as the Rprop algorithm, i.e., each local weight $w_i(k)$ in
the weight vector $\vec{W}(k)$ has its own step size, $\gamma_i(k)$, which is adjusted according
to the behavior of the local gradient $g_i(k)$ during two successive
iterations:

$$\gamma_i(k) = \begin{cases} \min(1.2\,\gamma_i(k-1),\ \gamma_{\max}), & \text{if } g_i(k)\,g_i(k-1) > 0 \\ \max(0.5\,\gamma_i(k-1),\ \gamma_{\min}), & \text{if } g_i(k)\,g_i(k-1) < 0 \\ \gamma_i(k-1), & \text{otherwise} \end{cases} \qquad i = 1, \ldots, n, \tag{5}$$

where $n$ is the number of trainable weights, and $\gamma_{\max}$ and $\gamma_{\min}$ are the upper and lower
limits of the step size, respectively; the initial value $\gamma_i(0)$ is set to 0.001 and the
respective limits $\gamma_{\max}$ and $\gamma_{\min}$ are 10 and $10^{-10}$. The local weight update of
the $i$th weight is then determined by

$$\Delta w_i(k) = -\operatorname{sgn}(g_i(k))\,\gamma_i(k), \tag{6}$$

where sgn denotes the signum function. When the current local gradient has a
change of sign with respect to the previous local gradient of the same weight, the
stored local gradient is set to zero so as to avoid an update in that weight in the
next iteration. Furthermore, when the product of the current and previous local
gradients is less than zero and there is an increase in the network error $E$, the $i$th
weight update is reverted back to the previous weight update and multiplied by an
adaptive momentum rate:

$$\text{if } g_i(k)\,g_i(k-1) < 0 \text{ and } E(k) > E(k-1), \text{ then } \Delta w_i(k) = -\mu_i(k)\,\Delta w_i(k-1). \tag{7}$$


The adaptive momentum rate $\mu_i(k)$ of the $i$th weight used in (4) and (7) is computed
as the magnitude of the Quickprop step, bounded within the range [0.5, 1.5]:

$$\tilde{\mu}_i(k) = \left|\frac{g_i(k)}{g_i(k-1) - g_i(k)}\right|, \tag{8}$$

$$\mu_i(k) = \min(\tilde{\mu}_i(k),\ 1.5), \tag{9}$$

$$\mu_i(k) = \begin{cases} \max(\mu_i(k),\ 0.5), & \text{if } g_i(k)\,g_i(k-1) < 0 \\ 0, & \text{if } g_i(k)\,g_i(k-1) = 0. \end{cases} \tag{10}$$

Moreover, when there is a decrease in the current network error with respect to
the previous error, a small percentage of the negative gradient is added to the
weight:

$$\vec{W}(k+1) = \vec{W}(k+1) - \vec{\eta}(k) \odot \vec{g}(k), \tag{11}$$

where $\vec{\eta}(k)$ is a vector of learning rates, which are adapted using a principle similar
to that of the SuperSAB method and bounded above by (13):

$$\tilde{\eta}_i(k) = \begin{cases} 1.2\,\eta_i(k-1), & \text{if } g_i(k)\,g_i(k-1) > 0 \\ 0.5\,\eta_i(k-1), & \text{if } g_i(k)\,g_i(k-1) < 0 \\ \eta_i(k-1), & \text{otherwise,} \end{cases} \tag{12}$$

$$\eta_i(k) = \min(\tilde{\eta}_i(k),\ 0.9). \tag{13}$$

To summarize, a pseudo-code of the QRProp training algorithm is given below.


Input: Initialize $\gamma_i \leftarrow 0.001$, $\eta_i \leftarrow 0.01$, $\mu_i \leftarrow 0.1$. Calculate the local gradient.
while stopping criterion is not met do
    Calculate the adaptive momentum rate $\tilde{\mu}_i(k)$ according to (8) and bound it above by (9).
    if $g_i(k)\,g_i(k-1) > 0$ then
        $\gamma_i(k) \leftarrow \min(1.2\,\gamma_i(k-1),\ \gamma_{\max})$,
        $\tilde{\eta}_i(k) \leftarrow 1.2\,\eta_i(k-1)$.
    else if $g_i(k)\,g_i(k-1) < 0$ then
        $\gamma_i(k) \leftarrow \max(0.5\,\gamma_i(k-1),\ \gamma_{\min})$,
        $\tilde{\eta}_i(k) \leftarrow 0.5\,\eta_i(k-1)$,
        $\mu_i(k) \leftarrow \max(\mu_i(k),\ 0.5)$,
        $g_i(k) \leftarrow 0$.
    else if $g_i(k)\,g_i(k-1) = 0$ then
        $\mu_i(k) \leftarrow 0$,
        $\gamma_i(k) \leftarrow \gamma_i(k-1)$,
        $\tilde{\eta}_i(k) \leftarrow \eta_i(k-1)$.
    end if
    $\eta_i(k) \leftarrow \min(\tilde{\eta}_i(k),\ 0.9)$.
    $\Delta w_i(k) \leftarrow -\operatorname{sgn}(g_i(k))\,\gamma_i(k)$.
    if $g_i(k)\,g_i(k-1) < 0$ and $E(k) > E(k-1)$ then
        $\Delta w_i(k) \leftarrow -\mu_i(k)\,\Delta w_i(k-1)$.
    end if
    $w_i(k+1) \leftarrow w_i(k) + \Delta w_i(k) + \mu_i(k)\,\Delta w_i(k-1)$.
    if $E(k) < E(k-1)$ then
        $w_i(k+1) \leftarrow w_i(k+1) - \eta_i(k)\,g_i(k)$.
    end if
end while
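As a rough illustration of how one pass of this loop might look in code, the sketch below applies a single QRProp iteration to plain Python lists. It is not the authors' MATLAB implementation. Several points are assumptions made here and should be read as such: the function and argument names are hypothetical; the sign test in the error-increase branch uses the gradient product computed before the stored gradient is zeroed; the momentum factor is set to its upper bound of 1.5 when the Quickprop denominator is zero; and the minus signs follow the usual Rprop and gradient-descent conventions.

```python
def sgn(x):
    """Signum function returning -1, 0 or 1."""
    return (x > 0) - (x < 0)

def qrprop_step(w, g, g_prev, dw_prev, gamma, eta, err, err_prev,
                gamma_max=10.0, gamma_min=1e-10):
    """One QRProp iteration; returns (w_new, g_stored, dw, gamma_new, eta_new)."""
    n = len(w)
    w_new, dw_new = [0.0] * n, [0.0] * n
    g_new, gamma_new, eta_new = list(g), list(gamma), list(eta)
    for i in range(n):
        prod = g[i] * g_prev[i]            # sign agreement of successive gradients
        denom = g_prev[i] - g[i]
        # Eqs. (8)-(9): magnitude of the Quickprop factor, bounded above by 1.5.
        mu = min(abs(g[i] / denom), 1.5) if denom != 0.0 else 1.5
        if prod > 0:                       # same sign: grow step size and learning rate
            gamma_new[i] = min(1.2 * gamma[i], gamma_max)
            eta_t = 1.2 * eta[i]
        elif prod < 0:                     # sign change: shrink, bound mu below, zero gradient
            gamma_new[i] = max(0.5 * gamma[i], gamma_min)
            eta_t = 0.5 * eta[i]
            mu = max(mu, 0.5)
            g_new[i] = 0.0
        else:                              # zero product: no momentum, keep rates
            mu = 0.0
            eta_t = eta[i]
        eta_new[i] = min(eta_t, 0.9)                       # Eq. (13)
        dw = -sgn(g_new[i]) * gamma_new[i]                 # Eq. (6)
        if prod < 0 and err > err_prev:
            dw = -mu * dw_prev[i]                          # Eq. (7)
        w_new[i] = w[i] + dw + mu * dw_prev[i]             # Eq. (4)
        if err < err_prev:
            w_new[i] -= eta_new[i] * g_new[i]              # Eq. (11)
        dw_new[i] = dw
    return w_new, g_new, dw_new, gamma_new, eta_new

# Toy usage with one weight whose gradient kept its sign while the error fell.
w1, g1, dw1, gamma1, eta1 = qrprop_step(
    w=[1.0], g=[2.0], g_prev=[1.0], dw_prev=[-0.001],
    gamma=[0.001], eta=[0.01], err=0.5, err_prev=1.0)
print(w1, gamma1, eta1)
```

In a full training loop the returned `g_stored`, `dw`, `gamma_new` and `eta_new` would be fed back as `g_prev`, `dw_prev`, `gamma` and `eta` for the next iteration, with the gradient and error recomputed over the whole batch.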


4. Handwritten Digit Recognition System


The SICoNNet used for digit recognition is a three-layer network which has eight
feature maps in the first hidden layer and 16 feature maps in the second hidden
layer. At the output layer, there are 10 sigmoid neurons, one for each digit. The
receptive field size used throughout the network is 5 × 5 pixels, and the input layer
is 24 × 24 pixels. The reason for using this input size is to ensure that the feature
maps in the first and second hidden layers have even size after the sub-sampling
operation. The activation functions, $g_L$ and $f_L$, chosen for the first hidden layer are
the hyperbolic tangent, $f(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$, and the exponential function,
$f(x) = e^{x}$, respectively; whereas in the second layer, $g_L$ is the logarithmic sigmoid
function, $f(x) = 1/(1 + e^{-x})$. At the output layer, the activation function, $h$,
applied to the sigmoid neurons is the hyperbolic tangent function. The desired
outputs are ten-element column vectors whose elements are set to −1, except the
element corresponding to the input digit, which is set to 1. Overall, the network has
2722 free parameters that need to be adapted during the training process. Before
the training commences, the network parameters are initialized with random values
using a uniform distribution. The weights of the receptive field are initialized in the
range $[-1/w, 1/w]$, where $w$ is the width of the receptive field. The bias parameters
are initialized between −1 and 1, whereas the passive decay rate term is initialized
in the range (0, 1], subject to the condition in (2).
To train and test the SICoNNet for digit recognition, we used the MNIST database.
This database contains real-world samples of handwritten digits;
it is publicly available for evaluating machine learning and pattern recognition systems on handwritten digit recognition. It contains two disjoint sets of handwritten
digit patterns of size 28 × 28 pixels: one set is used for training and contains 60,000
samples, and the other set has 10,000 samples for testing. As the input size of the
network is 24 × 24 pixels, all the patterns in the database are resized using a nearest
neighbor interpolation technique. At the output layer, each neuron gives an output, and the neuron with the maximum response is considered the winning neuron,
which determines the class of the input pattern.
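The figure of 2722 free parameters can be checked with a short calculation. The sketch below is an illustrative count under the assumptions stated in its comments: the function name is hypothetical, each hidden feature map is taken to own one excitatory and one inhibitory 5 × 5 kernel plus two biases and one decay rate (as the shared-weight description of (1) suggests), and each output neuron is taken to have one weight per averaged signal plus a bias.

```python
def siconnet_parameter_count(maps_per_layer, field=5, input_size=24, n_outputs=10):
    """Count trainable parameters of a SICoNNet with stride-2 hidden layers."""
    # Per feature map: kernels C and D (field * field each), biases b and d, decay a.
    per_map = 2 * field * field + 3
    hidden = sum(maps_per_layer) * per_map
    # Each hidden layer halves the map width; 2 x 2 averaging halves it once more.
    final_size = input_size // (2 ** len(maps_per_layer))
    pooled_signals = (final_size // 2) ** 2
    inputs_to_output = maps_per_layer[-1] * pooled_signals
    output = n_outputs * (inputs_to_output + 1)        # weights plus one bias each
    return hidden + output

print(siconnet_parameter_count([8, 16]))   # network used for digit recognition
print(siconnet_parameter_count([4, 8]))    # smaller network with four and eight maps
```

With eight and 16 feature maps the count is 24 × 53 + 10 × 145 = 2722, and with four and eight feature maps it is 12 × 53 + 10 × 73 = 1366, which agree with the parameter counts quoted in the text for the two networks.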
5. Experimental Results
In this section, we present the experimental results. First, two preliminary experiments are conducted to analyze the training algorithms and determine the most
suitable connection scheme for the network. Then, the chosen network structure,
with the selected connection scheme, is trained and evaluated on the MNIST
database,8 where the digit patterns have been converted into binary images.
5.1. Analysis of the training methods
To analyze the QRProp training method and the LM algorithm, a small network
with four feature maps in the first hidden layer and eight feature maps in the second


hidden layer was trained on a set of 5000 handwritten digit patterns, where 500
samples were taken from each digit class of the MNIST training set, based on a
five-fold cross-validation procedure. In each fold, 4000 patterns were used for
training and 1000 patterns for testing. For analysis purposes, the training mean
square error (MSE), training time and number of training epochs were recorded
in each fold and averaged across the five folds. As the training time depends
on the machine used, we express the training time in terms of the gradient descent epoch time unit, or gdeu. One gdeu is defined as the average time
taken by the network to perform one gradient descent training epoch on a fixed
training set and a fixed-size network, and it remains constant throughout the gradient descent training process. On a PC with a 3 GHz CPU and 2 GB RAM, using
MATLAB as the programming language, one gdeu is approximately
42.5 seconds, based on a network with 1366 trainable parameters and a training set
of 4000 samples.
Figure 3 shows that the two training methods converge at different speeds. In
terms of the mean square error (MSE) as a function of the number of epochs,
Fig. 3(a) shows that the LM algorithm converges faster than QRProp;
however, in terms of training time, Fig. 3(b) shows that the MSE of the LM
algorithm decreases more slowly than that of QRProp. Moreover, after a certain number of gdeus, the MSE of the LM algorithm remains constant, indicating that the
training algorithm has reached a local minimum. On the other hand, the MSE of
the QRProp method gradually decreases and becomes smaller than that of the LM
algorithm. Another test was conducted to analyze the classification performance of
the training algorithms. The results, based on five-fold cross-validation, are shown
in Fig. 4. Since the LM algorithm converges in fewer epochs, the trained network yields higher
classification accuracy after a few epochs; for instance, at 20 iterations, the trained
network achieves a classification accuracy of 96.8% on the 4000 training patterns
and 94.9% on the 1000 test patterns. However, the LM algorithm is known to have
some shortcomings such as the computation of the Hessian matrix and its storage.
On a large training set of 60,000 samples, it is not possible to train a network
with 2722 trainable parameters using the LM algorithm due to the huge amount of
memory required to store the Jacobian and Hessian matrices. In contrast, the
QRProp method requires only a few gradient and function evaluations to update
the weights. Furthermore, when training for a longer period of time, say 250 gdeus,
the classification performance achieved by QRProp on the test set is similar to
that of the LM algorithm. Therefore, QRProp is chosen to train the SICoNNets for
handwritten digit recognition.

Fig. 3. The convergence speed of the training algorithms (LM and QRProp), plotted as log(training MSE), as a function of (a) number of training
epochs and (b) number of gdeus.

Fig. 4. The classification accuracy of the training algorithms (LM and QRProp) versus the number of training epochs
and the training time (gdeu) based on (a) the training set and (b) the test set.
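A back-of-the-envelope estimate shows why the memory argument against LM holds at this scale. The sketch below is an estimate, not a measurement: the function name is hypothetical, and it assumes a dense Jacobian with one row per (sample, output neuron) pair, a dense approximate Hessian of size n × n, and 8-byte double-precision entries, all held in memory at once.

```python
def lm_memory_gib(n_samples, n_outputs, n_params, bytes_per_value=8):
    """Return (Jacobian, Hessian) storage in GiB under dense double-precision storage."""
    gib = 1024 ** 3
    jacobian_bytes = n_samples * n_outputs * n_params * bytes_per_value
    hessian_bytes = n_params * n_params * bytes_per_value
    return jacobian_bytes / gib, hessian_bytes / gib

full_jac, full_hes = lm_memory_gib(60000, 10, 2722)    # full MNIST training set
small_jac, small_hes = lm_memory_gib(4000, 10, 1366)   # one cross-validation fold
print(round(full_jac, 2), round(full_hes, 3))
print(round(small_jac, 2), round(small_hes, 3))
```

Under these assumptions the Jacobian for the full training set needs roughly 12 GiB, several times the 2 GB of RAM on the PC described above, whereas the Jacobian for the 4000-sample fold needs about 0.4 GiB. QRProp, by contrast, stores only a handful of vectors of length n.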
5.2. Classification performance of three SICoNNet architectures
In this experiment, we train and evaluate the classification performance of three
different SICoNNet architectures: fully-connected, binary-connected and toeplitz-connected. Each SICoNNet was trained on a set of 10,000 patterns and tested
on the entire test set of the MNIST database. The classification rates of the different network architectures are presented in Table 2. All three networks
achieve overall classification rates higher than 90%. The best classification rate, 94.1%, is
achieved with the binary-connected network, followed by the toeplitz-connected
network with 93.6% accuracy; the fully-connected network achieves an accuracy of
90.2%. Among the three networks, the partially-connected networks perform better
than the fully-connected network. This may be due to the fact that the partially-connected networks have fewer connections within the network structure.

Table 2. Classification performances of the three proposed CoNNs.

             Classification Rate for Each Digit Class (%)
SICoNNets    0     1     2     3     4     5     6     7     8     9     Accuracy (%)
Binary       98.3  97.9  95.1  92.2  93.0  91.3  95.9  93.4  93.0  91.8  94.1
Toeplitz     97.1  96.7  96.6  95.0  92.8  92.2  95.5  91.0  89.3  90.1  93.6
Full         95.9  96.5  92.4  86.6  89.1  85.2  93.7  89.0  84.7  88.2  90.2
5.3. Performance of the handwritten digit recognition system
After analyzing the experiment results of the previous two sections, a binaryconnection scheme is chosen to build the handwritten digit recognition system.
The binary-connected network has 24 feature maps (or 2722 trainable weights) and
is trained on the entire training set (containing 60,000 samples) of the MNIST
database, using the QRProp. The trained network is then evaluated on the entire
test set. The classication performance of this network is presented as a confusion
matrix, Table 3. For each digit, the network has a recognition accuracy greater than
97%, except for the digits 5 and 9. The digit 9 has the worst recognition accuracy
of 95.0%. Many of the nine digits are classied as a four, and vice versa. This is due
to the fact that some patterns in the test set are written with heavy strokes, which
caused the digit four to appear as a nine and nine as a four (see samples in Fig. 5).
Nevertheless, the overall classication accuracy of the system is over 97.3%.
Table 4 shows the recognition error rates of dierent neural-based classiers
tested on the MNIST database together with their network sizes. To the best of
Table 3.

Classication performance of the handwritten digit recognition system.


Network Predicted Class

Actual class
0
1
2
3
4
5
6
7
8
9

970
0
7
0
1
3
8
1
4
4

0
1120
1
0
0
0
3
2
2
4

1
2
1001
5
3
2
4
8
1
0

0
2
6
982
0
11
0
6
6
10

0
0
1
0
954
1
2
2
4
11

0
0
0
7
0
862
2
0
5
7

6
2
0
0
4
5
935
0
1
1

2
1
3
6
1
1
2
999
3
5

1
8
5
10
0
4
2
0
945
8

0
0
0
0
19
3
0
6
3
959

Classication accuracy

Classication
Rate (%)

99.0
98.7
97.8
97.2
97.1
96.6
97.6
97.6
97.0
95.0
97.3

August 12, 2006 18:30 WSPC/157-IJCIA

00179

Application of SICoNNets to Handwritten Digit Recognition

57

(a) Digit 4

(b) Digit 9
Fig. 5. Examples of digit patterns in the test set that were misclassied (a) digit four predicted
as nine, and (b) digit nine predicted as four.

Table 4. Classication results of dierent neural-based classiers tested on the MNIST


database. The second and third column present the number of feature maps (F. maps) or
neurons and the number of trainable weights (T. weights), respectively, in the networks.
NN Classier
MLP7

3-layer
LeNet-57
Boosted GCNN9
CoNN with cross entropy10
CoNN24
SICoNNet

No. of F. Maps/Neurons

No. of T. Weights

Error Rate (%)

1160
164
176
55

24

936,660
60,000
63,156
127,540
18,370
2,722

2.95
0.80
0.68
0.40
1.20
2.70

our knowledge and from the list,8 the most successful classier reported to date
was developed by Simard et al.10 with an error rate of 0.4%. However, their CoNN
has the most trainable network parameters, apart from the three layer MLP, with
127,540 trainable weights. This amount of weights is computed from their given network structure and assumed that the 100 hidden neurons in the third hidden layer
of the network is fully connected to the 50 feature maps of size 5 5 in the second
hidden layer, and each feature map has a single receptive eld. Most of the classiers based on convolutional neural networks have error rates of less than 1%, at the
expense of having more than 10,000 trainable weights. Even though the same test set
from the MNIST database is used to evaluate the performances of these networks,
the size of the training set and the preprocessing applied to the training patterns
are dierent; for example, LeNet-5 and the network implemented by Simard et al.
were both trained on an augmented training set with articially distorted versions
of the original digit patterns so as to accommodate all form of ane transformations. The proposed CoNN, on the other hand, was trained and tested on binary
images, and its recognition error rate is lower than that of the MLP, but higher
than those of the existing CoNNs. However, it has the least number of trainable
weights with 24 feature maps in the hidden layers behaving as feature detectors.
To improve the performance of the proposed CoNN for this pattern recognition
task, a larger training set with distorted digit patterns can be used, and the network
structure can be modified so that another classification layer is added between the
last feature extraction layer and the output layer, since a better recognition rate
can be achieved with two classification layers. Another approach is to apply a two-layer
MLP as a classifier to the features extracted at the last hidden layer of
the proposed CoNN.
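
The 127,540-weight figure quoted above for the network of Simard et al. can be checked with a short calculation. The sketch below is only an illustration: the section states the 100 fully connected hidden neurons, the 50 feature maps of size 5 × 5, and the single receptive field per feature map, whereas the 5 feature maps in the first layer, the 5 × 5 kernel size, the 10 output neurons, and one bias per feature map or neuron are assumptions made here. The helper names are hypothetical.

```python
def conv_layer_params(num_maps, kernel_size, bias=True):
    """Weights of a layer in which each feature map owns one shared
    kernel_size x kernel_size receptive field (plus an optional bias)."""
    return num_maps * (kernel_size * kernel_size + (1 if bias else 0))


def dense_layer_params(num_inputs, num_units, bias=True):
    """Weights of a fully connected layer (plus an optional bias per unit)."""
    return num_units * (num_inputs + (1 if bias else 0))


# Assumed structure (not fully specified in the text above):
conv1 = conv_layer_params(num_maps=5, kernel_size=5)    # assumed first layer
conv2 = conv_layer_params(num_maps=50, kernel_size=5)   # 50 maps, one kernel each
hidden = dense_layer_params(num_inputs=50 * 5 * 5, num_units=100)  # 50 maps of 5 x 5
output = dense_layer_params(num_inputs=100, num_units=10)          # assumed 10 classes

total = conv1 + conv2 + hidden + output
print(conv1, conv2, hidden, output, total)
```

Under these assumptions the fully connected layer between the 50 feature maps and the 100 hidden neurons accounts for almost all of the weights, which is why the total is so much larger than that of the proposed network.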

6. Conclusion
In this paper, we proposed the use of a new class of convolutional neural networks
for handwritten digit recognition. These networks, known as shunting inhibitory
convolutional neural networks, have a flexible network structure with three connection
schemes: fully-connected, binary-connected and Toeplitz-connected. A hybrid
training method (QRProp), derived from existing first-order training algorithms,
was used to train the networks for handwritten digit recognition. The performance
of QRProp was compared to that of the Levenberg-Marquardt (LM) algorithm. Experimental
results show that QRProp converges faster than
the LM algorithm in terms of training time, and achieves similar classification
accuracy. Among the three SICoNNet architectures (binary-, Toeplitz-, and
fully-connected networks), the binary-connected network has the best recognition
rate. Evaluated on the MNIST database, a binary-connected network with 2722
trainable weights achieves a correct classification rate of 97.3%.
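
QRProp itself is not specified in this section, so the following is not the authors' algorithm. As an illustration of one of the first-order methods it draws on, here is a minimal sketch of plain Rprop in the variant without weight backtracking: each weight keeps its own step size, which grows while the gradient sign is stable and shrinks when the sign flips. The function names, the quadratic test problem, and the parameter values (commonly quoted Rprop defaults) are assumptions for illustration.

```python
def rprop_minimize(grad, w, steps=100, eta_plus=1.2, eta_minus=0.5,
                   delta_init=0.1, delta_min=1e-6, delta_max=50.0):
    """Minimise a function given its gradient, using sign-based Rprop
    updates without weight backtracking."""
    w = list(w)
    delta = [delta_init] * len(w)   # per-weight step sizes
    prev_g = [0.0] * len(w)         # gradient from the previous iteration
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            sign_product = g[i] * prev_g[i]
            if sign_product > 0:      # same sign: accelerate
                delta[i] = min(delta[i] * eta_plus, delta_max)
            elif sign_product < 0:    # sign flip: the minimum was overshot
                delta[i] = max(delta[i] * eta_minus, delta_min)
            # Only the sign of the gradient is used, not its magnitude.
            if g[i] > 0:
                w[i] -= delta[i]
            elif g[i] < 0:
                w[i] += delta[i]
            prev_g[i] = g[i]
    return w


def quadratic_grad(w):
    """Gradient of f(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2 (hypothetical test problem)."""
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]


w_final = rprop_minimize(quadratic_grad, [0.0, 0.0])
print(w_final)
```

Because the update ignores gradient magnitudes, the badly scaled second coordinate converges about as quickly as the first, which is the property that makes Rprop-style rules attractive as a building block.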

References
1. K. Fukushima, S. Miyake and T. Ito, Neocognitron: A neural network model for a
mechanism of visual pattern recognition, IEEE Trans. Syst. Man Cybernet. SMC-13(5) (1983) 826-834.
2. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and
L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural
Comput. 1(4) (1989) 541-551.
3. C. Garcia and M. Delakis, A neural architecture for fast and robust face detection, in
Proc. Sixteenth Int. Conf. Pattern Recogn., Quebec, Canada 2 (2002) 44-47.
4. S. Lawrence, C. L. Giles, A. C. Tsoi and A. D. Back, Face recognition: a convolutional
neural network approach, IEEE Trans. Neural Networks 8(1) (1997) 98-113.
5. B. Fasel, Multiscale facial expression recognition using convolutional neural networks,
in Proc. Third Indian Conf. Comput. Vision, Graphics Image Process, Ahmedabad,
India (2002).
6. S.-C. B. Lo, J.-S. J. Lin, M. T. Freedman and S. K. Mun, Application of artificial
neural networks to medical image pattern recognition: Detection of clustered
microcalcifications on mammograms and lung cancer on chest radiographs, J. VLSI Signal
Process. Syst. 18(3) (1996) 263-274.
7. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to
document recognition, Proc. IEEE 86(11) (1998) 2278-2324.
8. Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist.
9. A. Calderón, S. Roa and J. Victorino, Handwritten digit recognition using convolutional
neural networks and Gabor filter, in Proc. Int. Congr. Comput. Intell., Medellin,
Colombia (2003).

10. P. Y. Simard, D. Steinkraus and J. C. Platt, Best practices for convolutional neural
networks applied to visual documents analysis, Proc. Seventh Int. Conf. Document
Anal. Recogn. 2 (2003) 958-962.
11. D. Gorgevik and D. Cakmakov, An efficient three-stage classifier for handwritten digit
recognition, Proc. 17th Int. Conf. Pattern Recogn. 4 (2004) 507-510.
12. F. H. C. Tivive and A. Bouzerdoum, Efficient training algorithms for a class of shunting
inhibitory convolutional neural networks, IEEE Trans. Neural Networks 16(3)
(2005) 541-556.
13. L. J. Borg-Graham, C. Monier and Y. Fregnac, Visual input evokes transient and
strong shunting inhibition in visual cortical neurons, Nature 393(6683) (1998) 369-373.
14. J. S. Anderson, M. Carandini and D. Ferster, Orientation tuning of input conductance,
excitation, and inhibition in cat primary visual cortex, J. Neurophysiol. 84 (2000)
909-926.
15. Y. Fregnac, C. Monier, F. Chavane, P. Baudot and L. Graham, Shunting inhibition,
a silent step in visual computation, J. Physiol. 97 (2003) 441-451.
16. A. Bouzerdoum, A new class of high-order neural networks with nonlinear decision
boundaries, in Proc. Sixth Int. Conf. Neural Inf. Process., Perth 3 (1999) 1004-1009.
17. A. Bouzerdoum, Classification and function approximation using feed-forward shunting
inhibitory artificial neural networks, in Proc. IEEE-INNS-ENNS Int. Joint Conf.
Neural Networks (2000) 613-618.
18. F. H. C. Tivive and A. Bouzerdoum, A face detection system using shunting inhibitory
convolutional neural networks, in Proc. Int. Joint Conf. Neural Networks 4 (2004)
2571-2575.
19. B. Fasel, Robust face analysis using convolutional neural networks, in Proc. Sixteenth
Int. Conf. Pattern Recogn., Quebec, Canada 2 (2002) 40-43.
20. M. Riedmiller and H. Braun, A direct adaptive method for faster backpropagation
learning: The RPROP algorithm, Proc. IEEE Int. Conf. Neural Networks (1993)
586-591.
21. S. Fahlman, An empirical study of learning speed in back-propagation networks,
Carnegie Mellon University, Technical Report CMU-CS 88-162 (1988).
22. T. Tollenaere, SuperSAB: Fast adaptive BP with good scaling properties, Neural
Networks 3 (1990) 561-573.
23. M. T. Hagan and M. Menhaj, Training feedforward networks with the Marquardt
algorithm, IEEE Trans. Neural Networks 5 (1994) 989-993.
24. E. Poisson, C. V. Gaudin and P.-M. Lallican, Multi-modular architecture based on
convolutional neural networks for online handwritten character recognition, Proc. 9th
Int. Conf. Neural Inf. Process. 5 (2002) 2444-2448.