K. STEINHÖFEL
GMD–National Research Center for Information Technology,
Kekuléstr. 7, 12489 Berlin, Germany
M. TAUPITZ
Faculty of Medicine, Institute of Radiology, Humboldt University of Berlin,
Schumannstraße 20/21, 10117 Berlin, Germany
We present a pattern classification method that combines the classical Perceptron algorithm
with simulated annealing. For a sample set S of n-dimensional patterns labeled as
positive and negative, our algorithm computes threshold circuits of small depth where
the linear threshold functions of the first layer are calculated by simulated annealing with
the logarithmic cooling schedule c(k) = Γ(k)/ln(k + 2). The parameter Γ depends on
the sample set and changes in time, and the neighborhood relation is determined by the
Perceptron algorithm. We apply the approach to the recognition of focal liver tumours.
From 400 positive (focal liver tumour) and 400 negative (normal liver tissue) examples a
depth-six threshold circuit is calculated. The examples are of size n = 14 161 = 119×119
and they are presented in the DICOM format. On test sets of 100+100 examples (disjoint
from the learning set) we obtain a correct classification of more than 98%.
1. Introduction
The paper describes a method of computing small depth threshold circuits for
pattern recognition purposes. The approach is applied to focal liver tumour recognition,
where the CT images are classified by the threshold circuit without any
preprocessing. From a general point of view, the threshold circuits are designed for
binary classifications of points from an n-dimensional space. This problem has been
studied for a long time and is closely related to algorithms solving systems of linear
inequalities.
August 12, 2002 17:26 WSPC/115-IJPRAI 00184
(by adding one extra dimension to the space, the threshold can be made equal to
zero). Then the Perceptron algorithm converges in at most $1/\sigma^2$ iterations, where
$\sigma := \min_{[\vec{x},\eta] \in S} |\vec{w}^{*} \cdot \vec{x}|$, $\eta \in \{+, -\}$.
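A minimal Python sketch of the classical Perceptron update may help to make the bound concrete. It assumes unit-length inputs, a zero threshold and a unit-length separating vector (here `w_star`, chosen by hand for the toy data). All function and variable names are illustrative and not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def perceptron(samples, max_passes=1000):
    """Classical Perceptron with zero threshold.

    samples: list of (x, label) with label +1 or -1.
    Returns the weight vector and the number of updates performed.
    """
    w = [0.0] * len(samples[0][0])
    updates = 0
    for _ in range(max_passes):
        changed = False
        for x, label in samples:
            if label * dot(w, x) <= 0:          # x is on the wrong side
                w = [wi + label * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:                          # one clean pass: converged
            break
    return w, updates

# Toy data (assumption): four unit vectors separated by w_star = (1, 0).
samples = [((0.6, 0.8), 1), ((0.8, -0.6), 1),
           ((-0.6, 0.8), -1), ((-0.8, -0.6), -1)]
w_star = (1.0, 0.0)
sigma = min(abs(dot(w_star, x)) for x, _ in samples)
w, updates = perceptron(samples)
```

On this toy set the number of updates can be compared directly with `1 / sigma**2`.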
For our problem of CT-image classification, one can hardly assume that positive
and negative examples are separable by a single linear threshold function. In order to
reduce the classification error, we try to compute a bounded-depth circuit consisting
of linear threshold functions. The threshold functions, in particular the gates of
the first level, are determined by a learning procedure from positive and negative
examples of the classification problem. The aim of the learning procedure is to
minimize the error on the set of examples.
To our knowledge, the first paper on learning-based methods applied to X-ray
diagnosis was published by Asada et al.7 Since then, research has
concentrated on commercially available neural network tools or hardware
systems for medical image classification.11,13,19,20,30,32,33
Usually, feature extraction is used in learning-based classification methods. For
example, Ref. 27 introduces the assignment of fractal dimensions to tumour structures.
The fractal dimensions are assigned to contours which have been extracted
by commonly used filtering operations. In fact, these contours represent polygonal
structures within a binary image. For example, the fractal dimensions D1 = 1.13
and D2 = 1.40 are assigned to the boundary and the interior, respectively, of a
glioblastoma.
A high classification rate of nearly 98% is reported in Ref. 23, where the
Wisconsin breast cancer diagnosis (WBCD) database of 683 cases is taken for learning
and testing. The approach is based on feature extraction from FNA data and
uses nine visually assessed characteristics for learning and testing. Among the characteristics
are the uniformity of cell size, the uniformity of cell shape, and the clump
thickness.
In our approach, the only input to the algorithm is the image data, but the
set of training examples is partitioned into classes according to the average value of
the gray scale of pixels. Since focal liver tumour detection is not part of screening
procedures like the detection of microcalcifications,14,16,20,23,25 a certain effort is
required to collect the image material. To our knowledge, results on neural network
applications to focal liver tumour detection are not available in the literature.
Therefore, we could not include comparisons to related, previous work in our paper.
First results of our method are presented in Ref. 4 for a small number of 220+220
examples and without the partition into gray scale classes. In the present paper, we
describe the computation of threshold circuits from 400 + 400 positive and negative
examples. Moreover, the parameter setting changes for the underlying simulated
annealing procedure as the average value of misclassified examples becomes smaller,
i.e. we perform an adaptive parameter control. The adaptive approach as well as the
Int. J. Patt. Recogn. Artif. Intell. 2002.16:573-588. Downloaded from www.worldscientific.com
8 bit gray scale in DICOM standard format.17 Therefore, the input size is n = 14 161
and the input values range from 0 to 255. For the learning procedure, we used 400
positive (focal liver tumours) and 400 negative (normal liver tissue) examples. The
threshold circuits of depth six were tested on 100 + 100 examples (different from
the learning set), and we obtained a correct classification of about 98%. Further
improvements can be expected when a larger depth of circuits is considered.
2. Basic Definitions
We take into account that computers have a limited register length d to represent
real numbers. This limits the range of threshold functions one can consider and
therefore we take the following set of functions:
$$F := \bigcup_{n \ge 1} F_n ,$$
where
$$F_n = \Big\{ f(\vec{x}) : f(\vec{x}) = \sum_{i=1}^{n} w_i \cdot x_i \ge \vartheta_f ,\ w_i \in (\pm 1) \cdot \{0, 1\}^d \times \{0, 1\}^d \Big\} .$$
output 1 or 0 depending on whether or not $\sum_{i=1}^{n} w_i \cdot x_i \ge \vartheta_f$. In the same way, the
gates at higher levels have Boolean outputs only. Therefore, when all paths from
input nodes to $v_{out}$ are of the same length, the gates at level 2, 3, ... do compute
Boolean threshold functions. Thus, we have $F(C) : \{0, 1\}^{n \cdot d} \to \{0, 1\}$. The width of
C is the maximum number of gates that have the same distance (maximum number
of edges) to input nodes.
In the present paper, the maximum depth of C is six; see Fig. 1. There are two
types A and B of gates at the first level: In type A gates, the sum of all 14 161
gray scale values from the input is calculated and compared to the minimum and
used for the selection of intermediate values at larger depths with respect to the
class $Cl_i$ to which a particular input belongs.
The first level gates of type B are part of subcircuits that are calculated from a
random choice of training examples of the particular classes $Cl_i$ by the combination
of logarithmic simulated annealing and the Perceptron algorithm. For each $Cl_i$
(subcircuit), the k output values from the corresponding type B gates are taken
[Fig. 1. Labels: CT image; Depth 3; OR gate, Depth 5; Depth 6.]
number of training examples in $Cl_1$ and $Cl_m$ is relatively small compared to $Cl_i$
when $i \ne 1, m$.
by NORTHEASTERN UNIVERSITY on 02/09/15. For personal use only.
At depth four, each output from subcircuits assigned to a particular $Cl_i$ is taken
together m times with two outputs from the type A gates from the first level by a
three-input AND gate, i.e. we have $m^2$ gates of the type $u_1 \wedge u_2 \wedge u_3$ at depth four.
The $m^2$ outputs are collected together by an OR gate at depth five.
Finally, at depth six we have a simple majority function of the type $v_1 + v_2 + v_3 \ge 2$.
In Fig. 1, the comparison to three copies of circuits means that the circuits do
have the same structure. Since the threshold functions from the first level (type B
gates) are calculated by a stochastic procedure, it is unlikely that the “copies of
circuits” represent the same function.
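The gates at depths four to six are themselves linear threshold functions with unit weights, which the following Python sketch illustrates. It covers only the three gate types named above; the wiring between subcircuits and type A gates is not modelled, and the function names are illustrative assumptions.

```python
def threshold_gate(weights, inputs, theta):
    """Return 1 if the weighted sum of the inputs reaches theta, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

def and3(u1, u2, u3):
    # u1 AND u2 AND u3 (depth four): all three Boolean inputs must be 1.
    return threshold_gate((1, 1, 1), (u1, u2, u3), 3)

def or_gate(bits):
    # OR over any number of Boolean inputs (depth five): at least one is 1.
    return threshold_gate([1] * len(bits), bits, 1)

def majority3(v1, v2, v3):
    # v1 + v2 + v3 >= 2 (depth six): at least two of three inputs are 1.
    return threshold_gate((1, 1, 1), (v1, v2, v3), 2)
```

For instance, `majority3` applied to the outputs of three independently trained copies yields the final classification bit.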
Thus, the most time-consuming part is the calculation of first level gates of type
B and we pay particular attention to the time complexity analysis of computing
the weights and threshold values of these linear threshold functions.
$\vartheta_{f'}$ is equal to $\vartheta_f + y_j / \sqrt{\sum_{i=1}^{n} w_i^2}$.
The recursive application of (6) defines a Markov chain of probabilities $a_f(k)$, where
$f \in F$ and $k = 1, 2, \ldots$. If the parameter c is a constant, the chain is said to be
These difficulties can be avoided when inhomogeneous Markov chains are applied.
The general framework of inhomogeneous Markov chains has been studied,
e.g. by Hajek15 and Catoni.9,10 Let $a_f(k)$ denote the probability to obtain the linear
threshold function $f \in F$ (i.e. one threshold gate from the first level of the threshold
circuit) after k steps of an inhomogeneous Markov chain. We have to minimize
$|\{\vec{x} : [\vec{x}, \eta] \in S \ \&\ f(\vec{x}) \ne \eta\}|$, and the problem is to find a lower bound for k such
that $\sum_{f \in F_{\min}} a_f(k) > 1 - \delta$ for $f \in F_{\min}$.
In the present paper we are focusing on a special type of inhomogeneous Markov
chains where the value c(k) changes in accordance with
$$c(k) = \frac{\Gamma}{\ln(k + 2)} , \quad k = 0, 1, \ldots \tag{7}$$
The value Γ is a parameter that depends both on the configuration space F and
the neighborhood relation $N_f$.
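The logarithmic schedule can be sketched in a few lines of Python. The acceptance rule shown is the standard Metropolis-type criterion with c(k) in place of a fixed temperature; the function names and the sample value of Γ are assumptions made for illustration.

```python
import math

def cooling(k, gamma):
    """Logarithmic cooling schedule c(k) = gamma / ln(k + 2), k = 0, 1, ..."""
    return gamma / math.log(k + 2)

def accept_probability(delta_z, k, gamma):
    """Probability of accepting a move that changes the objective by delta_z.

    Moves that do not increase the objective are always accepted.
    """
    if delta_z <= 0:
        return 1.0
    return math.exp(-delta_z / cooling(k, gamma))

# With gamma = 1 and k = 0, a move that adds one misclassification
# is accepted with probability exp(-ln 2) = 1/2.
p0 = accept_probability(1, 0, 1.0)
```

Because c(k) decreases only logarithmically, the acceptance probability for a fixed increase falls slowly as k grows, and a larger Γ keeps it higher for longer.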
The conditions that determine the choice of Γ are derived from Hajek’s
Theorem15 on logarithmic cooling schedules for inhomogeneous Markov chains:
3. Computational Experiments
As already mentioned, simulated annealing-based heuristics are designed in most
applications for homogeneous Markov chains where the convergence to the
Boltzmann distribution at fixed temperatures is important for the performance
of the algorithm. In our approach, we utilize the general framework of logarithmic
simulated annealing described in Sec. 2 for the design of a pattern classification
heuristic. We paid particular attention to the choice of the parameter Γ which is
crucial to the quality of solutions as well as to the run-time of our heuristic.
Since the cooling schedule is defined by (7) and the objective function is given by
(1), it remains to choose a suitable neighborhood relation only. To speed up the
local search for minimum error solutions, we employ the neighborhood $N_f$ from (2)
together with a nonuniform generation probability where the transitions are forced
into the direction of the maximum deviation. The approach has been described in
Ref. 5 and is motivated by the computational experiments on equilibrium computations
from Ref. 3. For the implementation of simulated annealing-based heuristics
a significant speed-up was obtained when transitions were performed into the
direction of maximum local forces.
The nonuniform generation probability is derived from the Perceptron
algorithm: When f is the current hypothesis, we set
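$$U(\vec{x}) := \begin{cases} -f(\vec{x}), & \text{if } f(\vec{x}) < \vartheta_f \text{ and } \eta(\vec{x}) = + , \\ f(\vec{x}), & \text{if } f(\vec{x}) \ge \vartheta_f \text{ and } \eta(\vec{x}) = - , \\ 0, & \text{otherwise} . \end{cases} \tag{8}$$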
The $f'$ in (2) are related to the $\vec{x} \in S\Delta f$ and therefore it is justified to define
$$G[f, f'] := \frac{U(\vec{x})}{\sum_{\vec{x} \in S\Delta f} U(\vec{x})} . \tag{9}$$
Thus, preference is given to the neighbors that maximize the deviation. Now, our
heuristic can be summarized in the following way:
1. The initial hypothesis is defined by $w_i = 1$, $i = 1, 2, \ldots, n$, and $\vartheta = 0$.
2. For the current hypothesis, the probabilities $U(\vec{x})$ are calculated; see (8).
3. To determine the next hypothesis $f_k$, a random choice is made among the elements
of $N_{f_{k-1}}$ according to (9).
4. When $Z(f_k) \le Z(f_{k-1})$, we proceed with the new hypothesis $f_k$.
5. In case of $Z(f_k) > Z(f_{k-1})$, a random number $\rho \in [0, 1]$ is drawn uniformly.
6. If $e^{-(Z(f_k) - Z(f_{k-1}))/c(k)} \ge \rho$, the function $f_k$ is the new hypothesis. Otherwise,
we return to step 3 with $f_{k-1}$.
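The six steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the experiments: the threshold is fixed at zero, Z counts misclassified samples, the neighbor is produced by a Perceptron-style step normalized by the length of the weight vector (the exact transition (2) is not reproduced here), Γ is held constant, and a small `eps` keeps the selection weights positive. All names are hypothetical.

```python
import math
import random

def weighted_sum(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def is_wrong(w, x, label):
    # Zero-threshold gate: predict "+" exactly when w.x >= 0.
    return (weighted_sum(w, x) >= 0) != (label > 0)

def error_count(w, samples):
    """Z(f): number of misclassified samples."""
    return sum(is_wrong(w, x, label) for x, label in samples)

def anneal(samples, gamma=1.0, steps=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    w = [1.0] * len(samples[0][0])                  # step 1: all weights 1
    z = error_count(w, samples)
    best_w, best_z = w, z
    for k in range(steps):
        if z == 0:
            break
        wrong = [(x, lab) for x, lab in samples if is_wrong(w, x, lab)]
        # step 2: deviation U(x) = |w.x| for misclassified x, see (8)
        u = [abs(weighted_sum(w, x)) + eps for x, _ in wrong]
        # step 3: pick a misclassified sample with probability ~ U(x), see (9)
        x, lab = rng.choices(wrong, weights=u)[0]
        norm = math.sqrt(sum(wi * wi for wi in w)) or 1.0
        cand = [wi + lab * xi / norm for wi, xi in zip(w, x)]  # assumed update
        z_new = error_count(cand, samples)
        c = gamma / math.log(k + 2)                 # cooling schedule (7)
        # steps 4-6: accept improvements, otherwise accept with exp(-dZ/c)
        if z_new <= z or math.exp(-(z_new - z) / c) >= rng.random():
            w, z = cand, z_new
            if z < best_z:
                best_w, best_z = w, z
    return best_w, best_z
```

Returning the best hypothesis seen, rather than the last one, is a design choice of this sketch; it guards against ending a run on a temporarily accepted worse hypothesis.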
Here, Γ from c(k) changes in the following adaptive way: For a constant number
of steps L, the maximum value of $|S\Delta f_k|$ is evaluated after $k = L \cdot l$ steps for
The heuristic has been applied to the recognition of focal liver tumours from
fragments of CT images. The input to the heuristic was derived from the DICOM
standard representation of CT images.17 In Figs. 2 and 3, examples of input instances
are shown in the DICOM format. From these 128×128 images we calculated
8-bit gray scale representations of size 119×119 in order to avoid interferences from
the border of images.
From a total number of 500 positive (abnormal findings) and 500 negative examples
(normal liver tissue) we separated 100 positive and 100 negative examples
for test purposes (images no. 0, ..., 99 for both types, respectively). The sets of
training examples are denoted by POS and NEG, the test sets by T_POS and
T_NEG. All four sets were analyzed with respect to their average gray scale value
$(\sum_{j=1}^{14161} x_j)/14\,161$, $x_j \in \{0, \ldots, 255\}$, and then subdivided into m = 3, 5 classes
where each class covers an interval of unit length (max_av_gray − min_av_gray)/m.
In the following, T_POS and T_NEG are classified with respect to max_av_gray
and min_av_gray from POS and NEG, respectively. Table 1 shows the result of
the classification for both values of m, i.e. the entries are the number of elements
of the corresponding class. The minimum and maximum values were calculated as
min_av_gray = 132.1 and max_av_gray = 210.4 over POS ∪ NEG.
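The assignment of an image to a gray scale class can be sketched in Python as follows. The values 132.1 and 210.4 are the ones reported above; the function names are illustrative, and clamping averages that fall outside the training range to the first or last class is an assumption of this sketch.

```python
def average_gray(pixels):
    """Mean gray value of a flattened image (values in 0..255)."""
    return sum(pixels) / len(pixels)

def gray_class(avg, min_av_gray, max_av_gray, m):
    """Return the class index 1..m of an image with average gray value avg.

    Each class covers an interval of length (max_av_gray - min_av_gray) / m.
    Values outside [min_av_gray, max_av_gray] are clamped (assumption).
    """
    width = (max_av_gray - min_av_gray) / m
    index = int((avg - min_av_gray) // width) + 1
    return min(max(index, 1), m)

# Example with the reported range and m = 3 classes.
cls = gray_class(170.0, 132.1, 210.4, 3)
```

With m = 3 each class spans about 26.1 gray levels, so an average of 170.0 falls into the second class.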
We computed threshold circuits for m = 3, l = 7, 13 and m = 5, l = 7. Let
$Num_p(i)$ and $Num_n(i)$ denote the number of positive and negative examples in
$Cl_{m_i}$ as shown in Table 1. Each particular function $f = \sum_{j=1}^{n} w_j \cdot x_j \ge \vartheta$ was
trained on a random choice of $Num_p(i)/2 + Num_n(i)/2$ examples out of $Num_p(i) + Num_n(i)$
examples. When $Num_p(i)$ and $Num_n(i)$ differ by a larger margin as, e.g.
in $Cl_{3_1}$ and $Cl_{5_4}$, all examples from the smaller set were taken for the training
procedure.
The computations were performed for two cases:
The procedure is repeated three times, i.e. three subcircuits of depth five
(according to Fig. 1) are calculated independently. Each of the three subcircuits
consists of m subcircuits for classes $Cl_{m_j}$, $j = 1, \ldots, m$. With an additional gate
$x_1 + x_2 + x_3 \ge 2$, the subcircuits result in a depth-six threshold circuit. For m = 3
and l = 7, the run-time to compute a depth-six type α circuit is about 22 hours,
and for type β circuits the run-time is about 28 hours.
In Tables 2–4, we summarize the results of our computational experiments,
including intermediate results from outputs at depth five. Since the learning phase
is performed for each class separately, the computation of threshold functions from
level three (see Fig. 1) is much shorter compared to the run-times presented in
Ref. 4 and therefore it was possible to complete a large number of runs, at least
three for each pair [l, m] and depths five and six. The results in Tables 2–4 are
representatives of an average outcome for the given [l, m]; the results from different
runs for m = 3 and the same settings are stable and differ only marginally. The test
on a single image from the 200 test examples is performed within a few seconds.
Table 2 shows the results for l = 7, m = 3, and the two types of experiments α
and β. There is only a marginal improvement for the adaptive annealing (type β)
and the run-time becomes longer in this case because it is more difficult to escape
from local minima when Γ is small. On the other hand, the increase in computation
time usually leads to better results. The best results we obtained for type α and
type β calculations were the same and equal to 3% ([4,2] and [3,3] errors,
respectively).
In Table 3, the run-time is much shorter since the number of learning examples
in each of the classes $Cl_{5_j}$, $j = 1, \ldots, 5$, is relatively small (see Table 1). Due to
the small number of samples, the classification rate becomes worse and therefore
we consider for l = 13 only the case m = 3. For the settings from Table 3, the
results for type β circuits were always better than type α results, with a particular
improvement for depth-six circuits.
Table 2. Results on test samples for l = 7 and three gray scale classes (l = 7, m = 3).

                     Type α                               Type β
Circuit   Time     Errors on        Total     Time     Errors on        Total
Depth     (Min.)   T_POS   T_NEG    Errors    (Min.)   T_POS   T_NEG    Errors
5          428       6       4      5.0%       529       3       7      5.0%
6         1315       5       3      4.0%      1665       4       3      3.5%
Table 3. Results on test samples for l = 7 and five gray scale classes.
l = 7, m = 5 Type α Type β
Table 4. Results on test samples for l = 13 and three gray scale classes.
The results in Table 4 remain almost the same for values of l larger than 13.
classification with 2 errors on positive and negative test examples, respectively. For
type β circuits, some runs (depth six) finished with only a single misclassification
on the set of 200 test examples.
To our knowledge, results on neural network applications to focal liver tumour
detection are not reported in the literature. Therefore, we cannot include a direct
comparison to related, previous work in our paper. In Ref. 23, a database of 683
cases is taken for learning and testing. The approach is based on feature extraction
from image data and uses nine visually assessed characteristics for learning and
testing. Consequently, the classification rate is much higher and lies between 97.1%
and 97.8%.
Usually, in neural network research only a small depth of networks is considered
(one or two “hidden layers” in back-propagation networks) and the input information
is down-sized by preprocessing steps (feature extraction). In principle, neural
networks can be used to map the pixel images directly onto the target output values,
i.e. the entire pixel information is directly passed to the neural network. However,
such an approach will typically generate poor results as discussed in Ref. 22. With
respect to our specific problem setting, the computational effort of back-propagation
applied to n = 14 161 input values and a large number of “hidden units” seems to
be difficult to manage. For a discussion of the power of small-depth neural networks,
see Ref. 31.
Recently, support vector machines have been widely studied as a new learning tool;
see Ref. 8. In this approach, higher dimensional separations are modeled by introducing
auxiliary variables representing components of the separation functions, e.g.
$y_1 = x_1^2$, $y_2 = 2 \cdot x_1 \cdot x_2$, etc. When we try to keep the entire image information
as an input to the learning procedure, the introduction of a multiple number of
variables would significantly increase the time complexity as well as the memory
space complexity of the learning process.
In our experiments, we obtain a correct classification on 200 test examples of
more than 98%. Thus, we think that the approach presented in the paper is particularly
suited to large scale learning problems where the original information is
retained during the training phase. The main features are (i) the combination of logarithmic
simulated annealing with the Perceptron algorithm, (ii) the random choice
of samples out of the entire set of training samples, and (iii) the specific structure of
the bounded-depth classification circuit, including the partition of training exam-
ples by maintaining the entire image information. We expect further improvements
of the classification rate when circuits of depth seven are considered.
4. Concluding Remarks
We applied a combination of simulated annealing and the Perceptron algorithm
to the recognition of focal liver tumours. From 400 positive (focal liver tumour)
and 400 negative (normal liver tissue) examples we calculated threshold circuits of
depth six. The functions of the first layer are determined by a simulated annealing
Acknowledgment
The authors would like to thank the referees for their careful reading of the
manuscript and helpful suggestions that resulted in an improved presentation.
The authors are grateful to Eike Hein and Daniela Melzer (HUB, Institute
of Radiology) for preparing the image material.
References
1. E.H.L. Aarts, Local Search in Combinatorial Optimization, Wiley, NY, 1998.
2. S. Agmon, “The relaxation method for linear inequalities,” Canadian J. Math. 6, 3
(1954) 382–392.
3. A. Albrecht, S. K. Cheung, K. S. Leung and C. K. Wong, “Stochastic simulations of
two-dimensional composite packings,” J. Comput. Phys. 136, 2 (1997) 559–579.
4. A. Albrecht, M. J. Loomes, K. Steinhöfel and M. Taupitz, “A modified perceptron
algorithm for computer-assisted diagnosis,” Research and Development in Intelligent
Systems XVII, eds. M. Bramer, A. Preece and F. Coenen, BCS Series, Springer-Verlag,
2000, pp. 199–211.
5. A. Albrecht and C. K. Wong, “On logarithmic simulated annealing,” Theoretical
Computer Science: Exploring New Frontiers of Theoretical Informatics, eds. J. van
Leeuwen, O. Watanabe, M. Hagiya, P. D. Mosses and T. Ito, Lecture Notes in Computer
Science, Vol. 1872, 2000, pp. 301–314.
6. A. Albrecht and C. K. Wong, “Combining the perceptron algorithm with logarithmic
simulated annealing,” Neural Process. Lett. 14, 1 (2001) 75–83.
neural networks for image compression, enhancement and detection,” Proc. Meeting
of the Radiological Society of North America, 1994, p. 225.
12. E. Cohen, “Learning noisy perceptrons by a perceptron in polynomial time,”
Proc. 38th IEEE Symp. Foundations of Computer Science, 1997, pp. 514–523.
13. K. Doi, M. L. Giger, R. M. Nishikawa, H. McMahon and R. A. Schmidt, “Artificial
intelligence and neural networks in radiology: application to computer-aided diagnostic
schemes,” Digital Imaging, eds. W. Hendee and J. H. Trueblood, 1993, pp. 301–322.
14. D. B. Fogel, E. C. Wasson III, E. M. Boughton and V. W. Porto, “Evolving artificial
neural networks for screening features from mammograms,” Artif. Intell. Med. 14, 3
(1998) 317.
15. B. Hajek, “Cooling schedules for optimal annealing,” Math. Operat. Res. 13 (1988)
311–329.
16. H. Handels, Th. Roß, J. Kreusch, H. H. Wolff and S. J. Pöppl, “Feature selection for
optimized skin tumour recognition using genetic algorithms,” Artif. Intell. Med. 16,
3 (1999) 283–297.
17. R. Hindel, Implementation of the DICOM 3.0 Standard, RSNA Handbook, 1994.
18. S. Kirkpatrick, C. D. Gelatt, Jr. and M. P. Vecchi, “Optimization by simulated
annealing,” Science 220 (1983) 671–680.
19. X. Li, S. Bhide and M. R. Kabuka, “Labeling of MR brain images using Boolean
neural network,” IEEE Trans. Med. Imag. 15, 5 (1997) 628–638.
20. S. B. Lo, Y. C. Wu, M. T. Freedman, S. K. Mun and A. Hasegawa, “Detection
of microcalcifications by using adaptive-sized neural networks,” Proc. Meeting of the
Radiological Society of North America, 1994, p. 171.
21. M. L. Minsky and S. A. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.
22. Z. Pan, A. G. Rust and H. Bolouri, “Image redundancy reduction for neural network
classification using discrete cosine transforms,” Proc. Int. Joint Conf. Neural
Networks, Vol. 3, Como, 2000, pp. 149–154.
23. C. A. Peña-Reyes and M. Sipper, “A fuzzy-genetic approach to breast cancer
diagnosis,” Artif. Intell. Med. 17, 2 (1999) 131–155.
24. F. Romeo and A. Sangiovanni-Vincentelli, “A theoretical framework for simulated
annealing,” Algorithmica 6, 3 (1991) 302–345.
25. A. L. Ronco, “Use of artificial neural networks in modeling associations of discriminant
factors: towards an intelligent selective breast cancer screening,” Artif. Intell. Med.
16, 3 (1999) 299–309.
26. F. Rosenblatt, Principles of Neurodynamics, Spartan Books, NY, 1962.
27. C. Roßmanith, H. Handels, S. J. Pöppl, E. Rinast and H. D. Weiss, “Computer-assisted
diagnosis of brain tumors using fractals, texture and morphological
image analysis,” Proc. Computer-Assisted Radiology, ed. H. U. Lemke, 1995,
pp. 375–380.
28. E. Seneta, Non-Negative Matrices and Markov Chains, Springer-Verlag, NY, 1981.
29. K. Steinhöfel, A. Albrecht and C. K. Wong, “On various cooling schedules for simulated
annealing applied to the job shop problem,” Randomization and Approximation
Techniques in Computer Science, eds. M. Luby, J. Rolim and M. Serna,
Lecture Notes in Computer Science, Vol. 1518, Springer-Verlag, Barcelona, 1998,
pp. 260–279.
30. R. Tawel, T. Dong, B. Zheng, W. Qian and L. P. Clarke, “Neuroprocessor hardware
card for real-time microcalcification detection at digital mammography,” Proc. Meeting
of the Radiological Society of North America, 1994, p. 172.
31. P. E. Utgoff, D. J. Stracuzzi and R. P. Cochran, “Many-layered versus few-layered
learning,” TR-01-14, Department of Computer Science, University of Massachusetts,
January 9, 2001.
32. Y. C. Wu, K. Doi and M. L. Giger, “Detection of lung nodules in digital chest
radiographs using artificial neural networks: a pilot study,” J. Digit. Imag. 8 (1995)
88–94.
33. Y. Zhu and H. Yan, “Computerized tumour boundary detection using a Hopfield
neural network,” IEEE Trans. Med. Imag. 16, 1 (1997) 55–67.
and the Habilitation (Dr.sc.nat.) degrees in mathematics from Humboldt University at
Berlin.
His research interests include complexity

degree (summa cum laude) from the Technical University, Berlin.
Her research interests are in stochastic algorithms, and learning theory.