

International Journal of Pattern Recognition and Artificial Intelligence


Vol. 16, No. 5 (2002) 573–588
© World Scientific Publishing Company

ADAPTIVE SIMULATED ANNEALING FOR CT IMAGE CLASSIFICATION
A. A. ALBRECHT∗ and M. LOOMES


Department of Computer Science, University of Hertfordshire,
Hatfield, Herts AL10 9AB, UK


∗ A.Albrecht@herts.ac.uk

K. STEINHÖFEL
GMD–National Research Center for Information Technology,
Kekuléstr. 7, 12489 Berlin, Germany

M. TAUPITZ
Faculty of Medicine, Institute of Radiology, Humboldt University of Berlin,
Schumannstraße 20/21, 10117 Berlin, Germany

We present a pattern classification method that combines the classical Perceptron algo-
rithm with simulated annealing. For a sample set S of n-dimensional patterns labeled as
positive and negative, our algorithm computes threshold circuits of small depth where
the linear threshold functions of the first layer are calculated by simulated annealing with
the logarithmic cooling schedule c(k) = Γ(k)/ ln (k + 2). The parameter Γ depends on
the sample set and changes in time, and the neighborhood relation is determined by the
Perceptron algorithm. We apply the approach to the recognition of focal liver tumours.
From 400 positive (focal liver tumour) and 400 negative (normal liver tissue) examples a
depth-six threshold circuit is calculated. The examples are of size n = 14 161 = 119×119
and they are presented in the DICOM format. On test sets of 100+100 examples (disjoint
from the learning set) we obtain a correct classification of more than 98%.

Keywords: Markov chains; simulated annealing; threshold circuits; CT images.

1. Introduction
The paper describes a method of computing small depth threshold circuits for
pattern recognition purposes. The approach is applied to focal liver tumour recog-
nition, where the CT images are classified by the threshold circuit without any
preprocessing. From a general point of view, the threshold circuits are designed for
binary classifications of points from an n-dimensional space. This problem has been
studied for a long time and is closely related to algorithms solving systems of linear
inequalities.

∗ Author for correspondence.


In 1954, Agmon$^2$ proposed a simple iteration procedure to find solutions of linear inequalities $l_j(\vec{z}) = \vec{a}_j \cdot \vec{z} + b_j \geq 0$, $j = 1, \dots, m$. The procedure starts with an arbitrary initial vector $\vec{z}_0$. When $\vec{z}_i$ does not represent a solution of the system, then $\vec{z}_{i+1}$ is taken as the orthogonal projection onto the farthest hyperplane corresponding to a violated linear inequality: $\vec{z}_{i+1} := \vec{z}_i + t \cdot \vec{a}_{j_0}$, where $t = -l_{j_0}(\vec{z}_i)/|\vec{a}_{j_0}|^2$ and $\vec{a}_{j_0}$ maximizes $-l_j(\vec{z}_i)/|\vec{a}_j|^2$ among the violated $l_j(\vec{z}_i)$.
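For illustration, one relaxation step can be sketched as follows; this is a minimal C++ sketch in our notation, and the data layout with coefficient vectors $\vec{a}_j$ and offsets $b_j$ is an assumption made for the example:

```cpp
#include <cstddef>
#include <vector>

// One relaxation step of Agmon's method (Ref. 2), sketched in our notation.
// Each inequality is l_j(z) = a_j . z + b_j >= 0. If some inequality is
// violated, z is projected onto the hyperplane of the most violated one.
// Returns false when z already satisfies all inequalities.
bool relaxation_step(std::vector<double>& z,
                     const std::vector<std::vector<double>>& a,
                     const std::vector<double>& b) {
    std::size_t j0 = 0;
    double best = 0.0;                        // largest -l_j(z)/|a_j|^2 so far
    bool violated = false;
    for (std::size_t j = 0; j < a.size(); ++j) {
        double lj = b[j], norm2 = 0.0;
        for (std::size_t i = 0; i < z.size(); ++i) {
            lj += a[j][i] * z[i];
            norm2 += a[j][i] * a[j][i];
        }
        if (lj < 0.0) {                       // inequality j is violated
            double t = -lj / norm2;           // step length toward hyperplane
            if (!violated || t > best) { best = t; j0 = j; violated = true; }
        }
    }
    if (!violated) return false;              // z solves the whole system
    for (std::size_t i = 0; i < z.size(); ++i)
        z[i] += best * a[j0][i];              // orthogonal projection step
    return true;
}
```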
In pattern recognition, Agmon’s method became popular as the classical
Perceptron algorithm.26 For sets of points S that can be separated by a linear
function into "positive" and "negative" examples, Minsky and Papert$^{21}$ proved the following convergence property: Let $\vec{w}^*$ be a unit vector solution to the separation problem, i.e. $\vec{w}^* \cdot \vec{x} > 0$ for all $[\vec{x}, +] \in S$ and $\vec{w}^* \cdot \vec{x} < 0$ for all $[\vec{x}, -] \in S$ (by adding one extra dimension to the space, the threshold can be made equal to zero). Then the Perceptron algorithm converges in at most $1/\sigma^2$ iterations, where $\sigma := \min_{[\vec{x},\eta] \in S} |\vec{w}^* \cdot \vec{x}|$, $\eta \in \{+, -\}$.
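For linearly separable samples, the classical Perceptron algorithm referred to above can be sketched as follows; the sample type, the standard additive update rule, and the zero-threshold convention via an appended coordinate are our assumptions for the example:

```cpp
#include <cstddef>
#include <vector>

// A labeled sample [x, eta] with eta in {+1, -1}; the extra dimension that
// absorbs the threshold is assumed to be appended to x already.
struct Sample { std::vector<double> x; int eta; };

// Classical Perceptron updates: add/subtract misclassified samples until all
// of S satisfies eta * (w . x) > 0. For separable S, Ref. 21 bounds the
// number of updates by 1/sigma^2.
std::vector<double> perceptron(const std::vector<Sample>& S, std::size_t dim) {
    std::vector<double> w(dim, 0.0);
    bool updated = true;
    while (updated) {
        updated = false;
        for (const Sample& s : S) {
            double dot = 0.0;
            for (std::size_t i = 0; i < dim; ++i) dot += w[i] * s.x[i];
            if (s.eta * dot <= 0.0) {         // misclassified (or on boundary)
                for (std::size_t i = 0; i < dim; ++i) w[i] += s.eta * s.x[i];
                updated = true;
            }
        }
    }
    return w;
}
```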
For our problem of CT-image classification, one can hardly assume that positive
and negative examples are separable by a single linear threshold function. In order to
reduce the classification error, we try to compute a bounded-depth circuit consisting
of linear threshold functions. The threshold functions, in particular the gates of
the first level, are determined by a learning procedure from positive and negative
examples of the classification problem. The aim of the learning procedure is to
minimize the error on the set of examples.
To our knowledge, the first paper on learning-based methods applied to X-
ray diagnosis was published by Asada et al.7 Since then, the research has been
concentrating on using commercially available neural network tools or hardware
systems, respectively, for medical image classification.11,13,19,20,30,32,33
Usually, feature extraction is used in learning-based classification methods. For
example, Ref. 27 introduces the assignment of fractal dimensions to tumour struc-
tures. The fractal dimensions are assigned to contours which have been extracted
by commonly used filtering operations. In fact, these contours represent polygonal
structures within a binary image. For example, the fractal dimensions D1 = 1.13
and D2 = 1.40 are assigned to the boundary and the interior, respectively, of a
glioblastoma.
A high classification rate of nearly 98% is reported in Ref. 23, where the
Wisconsin breast cancer diagnosis (WBCD) database of 683 cases is taken for learn-
ing and testing. The approach is based on feature extraction from FNA data and
uses nine visually assessed characteristics for learning and testing. Among the char-
acteristics are the uniformity of cell size, the uniformity of cell shape, and the clump
thickness.
In our approach, the only input to the algorithm is the image data, but the
set of training examples is partitioned into classes according to the average value of
the gray scale of pixels. Since focal liver tumour detection is not part of screening
procedures like the detection of microcalcifications,14,16,20,23,25 a certain effort is

required to collect the image material. To our knowledge, results on neural net-
work applications to focal liver tumour detection are not available in the literature.
Therefore, we could not include comparisons to related, previous work in our paper.
First results of our method are presented in Ref. 4 for a small number of 220+220
examples and without the partition into gray scale classes. In the present paper, we
describe the computation of threshold circuits from 400 + 400 positive and negative
examples. Moreover, the parameter setting changes for the underlying simulated
annealing procedure as the average value of misclassified examples becomes smaller,
i.e. we perform an adaptive parameter control. The adaptive approach as well as the
partition of training examples improves the results presented in Ref. 4 significantly.


The input to the algorithm consists of fragments of CT images of size 119×119 with an
8-bit gray scale in DICOM standard format.17 Therefore, the input size is n = 14 161
and the input values range from 0 to 255. For the learning procedure, we used 400
positive (focal liver tumours) and 400 negative (normal liver tissue) examples. The
threshold circuits of depth six were tested on 100 + 100 examples (different from
the learning set), and we obtained a correct classification of about 98%. Further
improvements can be expected when a larger depth of circuits is considered.

2. Basic Definitions
We take into account that computers have a limited register length d to represent
real numbers. This limits the range of threshold functions one can consider and
therefore we take the following set of functions:
$$\mathcal{F} := \bigcup_{n \geq 1} \mathcal{F}_n \,,$$
where
$$\mathcal{F}_n = \left\{ f(\vec{x}) : f(\vec{x}) = \left[ \sum_{i=1}^{n} w_i \cdot x_i \geq \vartheta_f \right], \ w_i \in (\pm 1) \cdot \{0,1\}^d \times \{0,1\}^d \right\} .$$
The variables are represented by $x_i = (p_i, q_i)$, $p_i, q_i \in \{0,1\}^d$.
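A minimal sketch of the fixed-point arithmetic behind $\mathcal{F}_n$, assuming each pair $(p, q)$ is folded into one scaled 64-bit integer; the register length $d = 16$ and the 128-bit accumulator (a GCC/Clang extension) are our choices for the example:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A value (p, q) with p, q in {0,1}^d is treated as the rational p + q/2^d;
// weights carry an additional sign. Folding both parts into one integer
// scaled by 2^d keeps the comparison against theta_f exact.
constexpr int d = 16;

int64_t fixed_point(uint32_t p, uint32_t q, int sign = +1) {
    return sign * ((static_cast<int64_t>(p) << d) | (q & ((1u << d) - 1)));
}

// Evaluate f(x) = [sum_i w_i * x_i >= theta_f] on scaled integers; the
// product of two scaled values carries a factor 2^(2d), so the caller must
// scale theta accordingly.
bool threshold_gate(const std::vector<int64_t>& w,
                    const std::vector<int64_t>& x,
                    int64_t theta_scaled) {
    __int128 sum = 0;                 // wide accumulator avoids overflow
    for (std::size_t i = 0; i < w.size(); ++i)
        sum += static_cast<__int128>(w[i]) * x[i];
    return sum >= theta_scaled;
}
```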

2.1. Threshold circuits


Besides F, we consider single-output circuits C of threshold functions: A circuit C
is defined by the underlying acyclic directed graph G = [E, V ], E ⊂ V × V . The
graph G has n input nodes labeled by variables x1 , . . . , xn , and |V | − n nodes vf
labeled by threshold functions f ∈ F, where the number of incoming edges of vf
has to be consistent with the number of variables of f . Finally, one vf is chosen as
the output vout of C.
The depth of C is the maximum number of edges on a path from an input node
xi to the output node vout . The nodes that are not input nodes are called gates.
The function F (C) computed by C is defined as follows: The gates of the first level

output 1 or 0 depending on whether or not $\sum_{i=1}^{n} w_i \cdot x_i \geq \vartheta_f$. In the same way, the gates at higher levels have Boolean outputs only. Therefore, when all paths from input nodes to $v_{\text{out}}$ are of the same length, the gates at levels $2, 3, \dots$ compute Boolean threshold functions. Thus, we have $F(C): \{0,1\}^{n \cdot d} \to \{0,1\}$. The width of C is the maximum number of gates that have the same distance (maximum number of edges) to input nodes.
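A sketch of how such a circuit can be evaluated, assuming gates are stored in topological order with explicit predecessor lists; this data layout is our illustration, not the paper's implementation:

```cpp
#include <cstddef>
#include <vector>

// Threshold gate: weighted inputs compared against a threshold.
struct Gate {
    std::vector<int>    in;      // indices of predecessor nodes
    std::vector<double> w;       // one weight per incoming edge
    double              theta;   // threshold
};

// values[0..n-1] hold the inputs; gate outputs are 0/1 and are appended
// behind them in topological order. The last gate is v_out.
double evaluate(const std::vector<Gate>& gates, std::vector<double> values) {
    for (const Gate& g : gates) {
        double sum = 0.0;
        for (std::size_t k = 0; k < g.in.size(); ++k)
            sum += g.w[k] * values[g.in[k]];
        values.push_back(sum >= g.theta ? 1.0 : 0.0);
    }
    return values.back();
}
```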
In the present paper, the maximum depth of C is six; see Fig. 1. There are two
types A and B of gates at the first level: In type A gates, the sum of all 14 161
gray scale values from the input is calculated and compared to the minimum and
maximum values defining a class Cli , where 1 ≤ i ≤ m ≤ 5. The number of type


A gates is equal to 2 · m, and the outputs of the gates (two outputs for each i) are
used for the selection of intermediate values at larger depths with respect to the
class Cli to which a particular input belongs.
The first level gates of type B are part of subcircuits that are calculated from a random choice of training examples of the particular classes $Cl_i$ by the combination of logarithmic simulated annealing and the Perceptron algorithm. For each $Cl_i$ (subcircuit), the $k$ output values from the corresponding type B gates are taken together as inputs to a voting function $y_1 + y_2 + \cdots + y_l \geq \lceil l/2 \rceil$, where in our computational experiments $l = 7, 13$. Thus, there are $m$ such subcircuits and $k \times m$ gates of type B at the first level, and the width at the first level is $(l + 2) \cdot m$.

Fig. 1. Structure of the classification circuit. The CT image feeds input nodes $x_1, \dots, x_{14161}$; at depth 1, gray scale comparison gates (type A, one pair per class $1, \dots, m$) operate alongside the type B gates; voting gates follow at depth 2, weighted voting at depth 3, selection gates at depth 4, an OR gate at depth 5, and a final vote over three circuit copies at depth 6.
At depth three, for each of the $m$ classes $Cl_i$, three "neighboring" outputs are taken together by weighted voting functions which depend on $Cl_i$. We use $z_{i-1} + 2 \cdot z_i + z_{i+1} \geq 3$ when $i \neq 1, m$; for the border elements we take $2 \cdot z_1 + z_2 + z_3 \geq 2$ and $z_{m-2} + z_{m-1} + 2 \cdot z_m \geq 2$, respectively, i.e. the sums as well as the thresholds differ from the case $i \neq 1, m$. The choice of these weighted voting functions has been determined by computational experiments. It takes into account that the number of training examples in $Cl_1$ and $Cl_m$ is relatively small compared to $Cl_i$ when $i \neq 1, m$.
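Both gate types are ordinary threshold functions; a sketch with the concrete weights quoted above, where the 0-based index handling and the assumption $m \geq 3$ are our conventions:

```cpp
#include <cstddef>
#include <vector>

// Majority voting gate y_1 + ... + y_l >= ceil(l/2) over l Boolean inputs.
int vote(const std::vector<int>& y) {
    int sum = 0;
    for (int v : y) sum += v;
    return 2 * sum >= static_cast<int>(y.size()) ? 1 : 0;  // sum >= ceil(l/2)
}

// Weighted voting at depth three: z_{i-1} + 2 z_i + z_{i+1} >= 3 for inner
// classes, and the modified sums with threshold 2 for the border classes.
// Indices are 0-based here; z.size() == m >= 3 is assumed.
int weighted_vote(const std::vector<int>& z, std::size_t i) {
    std::size_t m = z.size();
    if (i == 0)     return 2 * z[0] + z[1] + z[2] >= 2;
    if (i == m - 1) return z[m - 3] + z[m - 2] + 2 * z[m - 1] >= 2;
    return z[i - 1] + 2 * z[i] + z[i + 1] >= 3;
}
```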
At depth four, each output from subcircuits assigned to a particular Cli is taken
together m times with two outputs from the type A gates from the first level by a
three-input AND gate, i.e. we have m2 gates of the type u1 ∧ u2 ∧ u3 at depth four.
The m2 outputs are collected together by an OR gate at depth five.
Finally, at depth six we have a simple majority function of the type $v_1 + v_2 + v_3 \geq 2$. In Fig. 1, the reference to three copies of circuits means that the circuits have the same structure. Since the threshold functions from the first level (type B gates) are calculated by a stochastic procedure, it is unlikely that the "copies of circuits" represent the same function.
Thus, the most time-consuming part is the calculation of first level gates of type
B and we pay particular attention to the time complexity analysis of computing
the weights and threshold values of these linear threshold functions.

2.2. Simulated annealing


Simulated annealing algorithms (see Refs. 1 and 18) act within a configuration space in accordance with a certain neighborhood relation, where the particular transitions between adjacent elements of the configuration space are governed by an objective function.

2.2.1. Configuration space and neighborhood relation


Given a sample set S, we set m := |S|, and we assume that the coordinates of all
elements are represented as rational numbers by pairs of binary tuples as in the
case of Fn , see Sec. 2.1:
$$S = \{[\vec{x}, \eta] : \vec{x} = (x_1, \dots, x_n), \ x_i = (p_i, q_i), \ p_i, q_i \in \{0,1\}^d, \ \text{and} \ \eta \in \{+, -\}\} \,.$$
Furthermore, we consider a particular number n of variables only and we take the
set F := Fn as the configuration space.
The objective of our optimization procedure is to minimize the number $|S_{\Delta f}|$ of misclassified examples, $S_{\Delta f} := \{[\vec{x}, \eta] : f(\vec{x}) < 0 \ \& \ \eta = + \ \text{or} \ f(\vec{x}) > 0 \ \& \ \eta = -\}$. The objective function is defined by
$$Z(f) := |S_{\Delta f}| \,. \tag{1}$$
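A direct sketch of the objective (1), counting the elements of $S_{\Delta f}$; the sample layout and the convention $f(\vec{x}) = \sum_i w_i x_i - \vartheta_f$ are our assumptions for the example:

```cpp
#include <cstddef>
#include <vector>

struct LabeledSample { std::vector<double> x; int eta; };  // eta in {+1,-1}

// Z(f) = |S_{Delta f}|: a sample [x, eta] is counted when the weighted sum
// falls strictly on the wrong side of the threshold, per the definition
// of S_{Delta f}.
int Z(const std::vector<double>& w, double theta,
      const std::vector<LabeledSample>& S) {
    int errors = 0;
    for (const LabeledSample& s : S) {
        double sum = -theta;
        for (std::size_t i = 0; i < w.size(); ++i) sum += w[i] * s.x[i];
        if ((sum < 0 && s.eta == +1) || (sum > 0 && s.eta == -1)) ++errors;
    }
    return errors;
}
```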

The set of minimal solutions is given by
$$\mathcal{F}_{\min}(S) := \{ f(\vec{x}) : \forall f' \, (f' \in \mathcal{F} \rightarrow Z(f) \leq Z(f')) \} \,.$$
Given $f = \left[\sum_{i=1}^{n} w_i \cdot x_i \geq \vartheta_f\right]$, the neighborhood relation is suggested by the Perceptron algorithm$^{26}$ and defined by
$$w_i(f') := w_i - \frac{y_j}{\sqrt{\sum_{i=1}^{n} w_i^2}} \cdot x_{ij} \,, \quad j \in \{1, 2, \dots, m\} \ \text{and} \ y_j = \sum_{i=1}^{n} w_i \cdot x_{ij} \,, \tag{2}$$
for all $i$ simultaneously and for a specified $j$ that maximizes $|y_j - \vartheta_f|$. The threshold $\vartheta_{f'}$ is equal to $\vartheta_f + y_j / \sqrt{\sum_{i=1}^{n} w_i^2}$.
2.2.2. Transition probabilities


Given a pair $[f, f']$, $f' \in N_f$, we denote by $G[f, f']$ the probability of generating $f'$ from $f$ and by $A[f, f']$ the probability of accepting $f'$ once it has been generated from $f$. As in most applications of simulated annealing, we take a uniform generation probability $G[f, f']$, which is given by
$$G[f, f'] := \frac{1}{|N_f|} \,, \quad f' \in N_f \,. \tag{3}$$
We recall that during a single transition only one particular weight is changed.
The acceptance probabilities $A[f, f']$, $f' \in \mathcal{F}$, are derived from the underlying analogy to thermodynamic systems:
$$A[f, f'] := \begin{cases} 1, & \text{if } Z(f') - Z(f) \leq 0 \,, \\ e^{-(Z(f') - Z(f))/c}, & \text{otherwise} \,, \end{cases} \tag{4}$$
where $c$ is a control parameter having the interpretation of a temperature in annealing procedures.
The probability of performing the transition between $f$ and $f'$ is defined by
$$\Pr\{f \to f'\} = \begin{cases} G[f, f'] \cdot A[f, f'], & \text{if } f' \neq f \,, \\ 1 - \sum_{g \neq f} G[f, g] \cdot A[f, g], & \text{otherwise} \,. \end{cases} \tag{5}$$
By definition, $\Pr\{f \to f'\}$ depends on the control parameter $c$.
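The acceptance step (4) amounts to the usual Metropolis test; a minimal sketch, where the function name and the integer objective values are our conventions:

```cpp
#include <cmath>
#include <random>

// Metropolis acceptance per (4): downhill moves are always accepted;
// uphill moves with probability exp(-(Z(f') - Z(f)) / c).
bool accept(int z_old, int z_new, double c, std::mt19937& rng) {
    if (z_new <= z_old) return true;
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(-(z_new - z_old) / c);
}
```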

2.2.3. Markov chains


Let $a_f(k)$ denote the probability of being in the configuration $f$ after $k$ steps performed for the same value of $c$. The probability $a_f(k)$ can be calculated in accordance with
$$a_f(k) := \sum_{h} a_h(k-1) \cdot \Pr\{h \to f\} \,. \tag{6}$$

The recursive application of (6) defines a Markov chain of probabilities af (k), where
f ∈ F and k = 1, 2, . . .. If the parameter c is a constant, the chain is said to be
a homogeneous Markov chain; otherwise, if $c = c(k)$ is lowered at any step, the sequence of probability vectors $\vec{a}(k)$ is an inhomogeneous Markov chain.
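One step of recursion (6) is a vector-matrix product; a sketch for a toy-sized configuration space (enumerating all of $\mathcal{F}$ like this is, of course, infeasible for the real problem):

```cpp
#include <cstddef>
#include <vector>

// One step of (6): a_f(k) = sum_h a_h(k-1) * Pr{h -> f}.
// P[h][f] holds the transition probability Pr{h -> f}.
std::vector<double> chain_step(const std::vector<double>& a,
                               const std::vector<std::vector<double>>& P) {
    std::vector<double> next(a.size(), 0.0);
    for (std::size_t h = 0; h < a.size(); ++h)
        for (std::size_t f = 0; f < a.size(); ++f)
            next[f] += a[h] * P[h][f];
    return next;
}
```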
In the literature, the application of homogeneous Markov chains to optimization problems dominates; see Refs. 1 and 18. The convergence at fixed temperatures
is well-studied and depends on certain matrix properties, cf. Ref. 28. Under natural
assumptions, the Markov chain tends to the Boltzmann distribution with its explicit
dependency on Z(f ). The convergence to optimum solutions, however, requires an
infinite number of transitions at any temperature and additionally a decreasing
temperature that tends to zero. Therefore, the convergence to optimum solutions
cannot be guaranteed in practical applications where only a finite number of steps


can be performed for any fixed c(k).
These difficulties can be avoided when inhomogeneous Markov chains are ap-
plied. The general framework of inhomogeneous Markov chains has been studied,
e.g. by Hajek$^{15}$ and Catoni.$^{9,10}$ Let $a_f(k)$ denote the probability to obtain the linear threshold function $f \in \mathcal{F}$ (i.e. one threshold gate from the first level of the threshold circuit) after $k$ steps of an inhomogeneous Markov chain. We have to minimize $|\{\vec{x} : [\vec{x}, \eta] \in S \ \& \ f(\vec{x}) \neq \eta\}|$, and the problem is to find a lower bound for $k$ such that $\sum_{f \in \mathcal{F}_{\min}} a_f(k) > 1 - \delta$.
In the present paper we focus on a special type of inhomogeneous Markov chains where the value $c(k)$ changes in accordance with
$$c(k) = \frac{\Gamma}{\ln(k + 2)} \,, \quad k = 0, 1, \dots \tag{7}$$
The value $\Gamma$ is a parameter that depends both on the configuration space $\mathcal{F}$ and the neighborhood relation $N_f$.
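For reference, the schedule (7) in code form is a one-liner; the choice of $\Gamma$, discussed next via Hajek's Theorem, is the substantive issue:

```cpp
#include <cmath>

// Logarithmic cooling schedule (7): c(k) = Gamma / ln(k + 2), k = 0, 1, ...
double cooling(double gamma, long k) {
    return gamma / std::log(static_cast<double>(k) + 2.0);
}
```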
The conditions that determine the choice of Γ are derived from Hajek’s
Theorem15 on logarithmic cooling schedules for inhomogeneous Markov chains:

Theorem. Under some (natural) assumptions about the reversibility of the configuration space, the simulated annealing algorithm which is based on (7) implies the convergence $\sum_{f \in \mathcal{F}_{\min}} a_f(k) \xrightarrow[k \to \infty]{} 1$ if and only if $\Gamma$ is larger than or equal to the maximum of the minimum escape heights from local minima.

For a more generalized neighborhood relation, where transitions by the Perceptron rule are replaced by reversible elementary steps, one obtains from Hajek's Theorem:

Corollary. The inhomogeneous Markov chain which is generated in accordance with (3)-(7) tends to the probability distribution $\lim_{k \to \infty} \sum_{f \in \mathcal{F}_{\min}} a_f(k)$.

The convergence analysis from Ref. 5 indicates a time complexity of roughly $n^{\Gamma + O(1)}$, i.e. after $n^{\Gamma} + \log^{O(1)}(1/\delta)$ transition steps of the Markov chain (each transition is performed in $n^{O(1)}$ time), the confidence that a minimum-error threshold function has been computed is larger than $1 - \delta$.

3. Computational Experiments
As already mentioned, simulated annealing-based heuristics are designed in most
applications for homogeneous Markov chains where the convergence to the
Boltzmann distribution at fixed temperatures is important for the performance
of the algorithm. In our approach, we utilize the general framework of logarithmic
simulated annealing described in Sec. 2 for the design of a pattern classification
heuristic. We paid particular attention to the choice of the parameter Γ which is
crucial to the quality of solutions as well as to the run-time of our heuristic.
3.1. A simulated annealing-based heuristic


Since the cooling schedule is defined by (7) and the objective function is given by (1), it only remains to choose a suitable neighborhood relation. To speed up the local search for minimum error solutions, we employ the neighborhood $N_f$ from (2)
together with a nonuniform generation probability where the transitions are forced
into the direction of the maximum deviation. The approach has been described in
Ref. 5 and is motivated by the computational experiments on equilibrium compu-
tations from Ref. 3. For the implementation of simulated annealing-based heuris-
tics a significant speed-up was obtained when transitions were performed into the
direction of maximum local forces.
The nonuniform generation probability is derived from the Perceptron algorithm: When $f$ is the current hypothesis, we set
$$U(\vec{x}) := \begin{cases} -f(\vec{x}), & \text{if } f(\vec{x}) < \vartheta_f \ \text{and} \ \eta(\vec{x}) = + \,, \\ f(\vec{x}), & \text{if } f(\vec{x}) \geq \vartheta_f \ \text{and} \ \eta(\vec{x}) = - \,, \\ 0, & \text{otherwise} \,. \end{cases} \tag{8}$$
The $f'$ in (2) are related to the $\vec{x} \in S_{\Delta f}$ and therefore it is justified to define
$$G[f, f'] := \frac{U(\vec{x})}{\sum_{\vec{x} \in S_{\Delta f}} U(\vec{x})} \,. \tag{9}$$
Thus, preference is given to the neighbors that maximize the deviation. Now, our
heuristic can be summarized in the following way:
1. The initial hypothesis is defined by $w_i = 1$, $i = 1, 2, \dots, n$ and $\vartheta = 0$.
2. For the current hypothesis, the probabilities $U(\vec{x})$ are calculated; see (8).
3. To determine the next hypothesis $f_k$, a random choice is made among the elements of $N_{f_{k-1}}$ according to (9).
4. When $Z(f_k) \leq Z(f_{k-1})$, we proceed with the new hypothesis $f_k$.
5. In case of $Z(f_k) > Z(f_{k-1})$, a random number $\rho \in [0, 1]$ is drawn uniformly.
6. If $e^{-(Z(f_k) - Z(f_{k-1}))/c(k)} \geq \rho$, the function $f_k$ is the new hypothesis. Otherwise, we return to step 3 with $f_{k-1}$. Here, $\Gamma$ from $c(k)$ changes in the following adaptive way: For a constant number of steps $L$, the maximum value of $|S_{\Delta f_k}|$ is evaluated after $k = L \cdot l$ steps for $l = 1, 2, \dots$. The value $\max_{L \cdot (l-1) \leq k \leq L \cdot l} |S_{\Delta f_k}|$ is taken as the new $\Gamma$ for the cooling schedule when larger than zero (otherwise $\Gamma := 1$).
7. The computation is terminated after a predefined number of steps $K$.
Hence, instead of following unrestricted increases of the objective function, our
heuristic tries to find another “initial” hypothesis when the difference of the number
of misclassified examples is too large.
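Putting steps 1-7 together, a condensed, self-contained sketch of the search loop might read as follows; all type and helper names are ours, and the floating-point arithmetic stands in for the fixed-point representation of Sec. 2:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct Sample { std::vector<double> x; int eta; };        // eta in {+1,-1}
struct Hyp    { std::vector<double> w; double theta; };

static double margin(const Hyp& f, const Sample& s) {     // f(x) - theta
    double sum = -f.theta;
    for (std::size_t i = 0; i < f.w.size(); ++i) sum += f.w[i] * s.x[i];
    return sum;
}

static int Z(const Hyp& f, const std::vector<Sample>& S) {  // objective (1)
    int e = 0;
    for (const Sample& s : S)
        if ((margin(f, s) < 0 && s.eta > 0) ||
            (margin(f, s) > 0 && s.eta < 0)) ++e;
    return e;
}

static Hyp step(const Hyp& f, const Sample& s) {          // transition rule (2)
    double norm = 0.0;
    for (double wi : f.w) norm += wi * wi;
    norm = std::sqrt(norm);
    double y = margin(f, s) + f.theta;                    // y_j = sum w_i x_ij
    Hyp g = f;
    for (std::size_t i = 0; i < g.w.size(); ++i) g.w[i] -= (y / norm) * s.x[i];
    g.theta = f.theta + y / norm;
    return g;
}

Hyp anneal(const std::vector<Sample>& S, double gamma, long K, long L,
           std::mt19937& rng) {
    Hyp f{std::vector<double>(S[0].x.size(), 1.0), 0.0};  // step 1
    std::uniform_real_distribution<double> u(0.0, 1.0);
    int worst = 0;
    for (long k = 1; k <= K; ++k) {                       // step 7: K steps
        int zf = Z(f, S);
        if (zf == 0) break;                               // zero-error hypothesis
        worst = std::max(worst, zf);
        std::vector<double> U;                            // step 2: values of (8)
        for (const Sample& s : S) {
            double m = margin(f, s);
            U.push_back(m < 0 && s.eta > 0 ? -m
                        : (m >= 0 && s.eta < 0 ? m : 0.0));
        }
        std::discrete_distribution<int> pick(U.begin(), U.end());   // (9)
        Hyp g = step(f, S[pick(rng)]);                    // step 3
        double c = gamma / std::log(k + 2.0);             // schedule (7)
        int zg = Z(g, S);
        if (zg <= zf || u(rng) <= std::exp(-(zg - zf) / c))  // steps 4-6
            f = g;
        if (k % L == 0) { gamma = std::max(worst, 1); worst = 0; }  // adapt Gamma
    }
    return f;
}
```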
The heuristic was implemented in C++ and we performed computational experiments on a SUN Ultra 5/360 workstation with 256 MB RAM under Solaris 2.6.
3.2. Classification of CT images



The heuristic has been applied to the recognition of focal liver tumours from
fragments of CT images. The input to the heuristic was derived from the DICOM
standard representation of CT images.17 In Figs. 2 and 3, examples of input in-
stances are shown in the DICOM format. From these 128×128 images we calculated
8-bit gray scale representations of size 119×119 in order to avoid interferences from
the border of images.
From a total number of 500 positive (abnormal findings) and 500 negative ex-
amples (normal liver tissue) we separated 100 positive and 100 negative examples

Fig. 2. An example of normal liver tissue (negative example).

Fig. 3. An example of tumour tissue (positive example).



Table 1. Distribution of samples according to the average gray scale value ($n = 14\,161$, $|POS| = |NEG| = 400$, $|T\,POS| = |T\,NEG| = 100$).

            Cl31   Cl32   Cl33   Cl51   Cl52   Cl53   Cl54   Cl55
POS          177    190     33     90    136    115     44     15
NEG           28    274     98      8     80    175    123     14
T POS         45     47      8     25     26     31     15      3
T NEG          3     71     26      0      9     51     35      5
for test purposes (images no. $0, \dots, 99$ for both types, respectively). The sets of training examples are denoted by POS and NEG, the test sets by T POS and T NEG. All four sets were analyzed with respect to their average gray scale value $(\sum_{j=1}^{14161} x_j)/14\,161$, $x_j \in \{0, \dots, 255\}$, and then subdivided into $m = 3, 5$ classes where each class covers an interval of length (max_av_gray − min_av_gray)/m. In the following, T POS and T NEG are classified with respect to max_av_gray and min_av_gray from POS and NEG, respectively. Table 1 shows the result of the classification for both values of $m$, i.e. the entries are the number of elements of the corresponding class. The minimum and maximum values were calculated as min_av_gray = 132.1 and max_av_gray = 210.4 over POS ∪ NEG.
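The class assignment used for Table 1 can be sketched as follows; the function name and the clamping at the interval borders are our choices for the example:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Assign an image to one of m average-gray-value classes Cl_1..Cl_m;
// min_av and max_av are taken over POS u NEG (132.1 and 210.4 here).
int gray_class(const std::vector<uint8_t>& pixels, int m,
               double min_av, double max_av) {
    double av = std::accumulate(pixels.begin(), pixels.end(), 0.0)
                / static_cast<double>(pixels.size());   // 14 161 pixels
    double width = (max_av - min_av) / m;               // class interval
    int cls = static_cast<int>((av - min_av) / width);  // 0-based index
    if (cls < 0) cls = 0;
    if (cls >= m) cls = m - 1;                          // clamp at max_av edge
    return cls + 1;                                     // classes 1..m
}
```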
We computed threshold circuits for $m = 3$, $l = 7, 13$ and $m = 5$, $l = 7$. Let $Num_p(i)$ and $Num_n(i)$ denote the number of positive and negative examples in $Cl_{mi}$ as shown in Table 1. Each particular function $f = \left[\sum_{j=1}^{n} w_j \cdot x_j \geq \vartheta\right]$ was trained on a random choice of $Num_p(i)/2 + Num_n(i)/2$ examples out of the $Num_p(i) + Num_n(i)$ examples. When $Num_p(i)$ and $Num_n(i)$ differ by a large margin, as e.g. in $Cl_{31}$ and $Cl_{54}$, all examples from the smaller set were taken for the training procedure.
The computations were performed for two cases:

(α) For a fixed $\Gamma$ equal to $(Num_p(i) + Num_n(i))/3$.
(β) For an adaptive value of $\Gamma$ according to step 6 (see Sec. 3.1) with $L = 250$ and an initial value of $\Gamma$ as in the first case.

The setting of the initial value of Γ was determined by computational experiments.


We used a similar adaptive method before in Ref. 29 where logarithmic simulated
annealing was applied to job shop scheduling.
We used a combined termination criterion: The learning process terminates for
a single function when either the number of hypotheses is larger than Kmax or
the percentage of correctly classified examples is larger than or equal to 99%. It is
interesting to note that in all cases the examples were learned with zero error.
The result (output) of the learning procedure is a set of $l$ independently calculated linear threshold functions. The $l$ functions are taken together by a simple voting function to form a depth-two threshold circuit (cf. Fig. 1) for each $Cl_{mj}$, $j = 1, \dots, m$. Then, the weighted sum over three "neighboring" subcircuits is taken as described in Sec. 2.1.

The procedure is repeated three times, i.e. three subcircuits of depth five
(according to Fig. 1) are calculated independently. Each of the three subcircuits
consists of m subcircuits for classes Clmj , j = 1, . . . , m. With an additional gate
x1 + x2 + x3 ≥ 2, the subcircuits result in a depth-six threshold circuit. For m = 3
and l = 7, the run-time to compute a depth-six type α circuit is about 22 hours,
and for type β circuits the run-time is about 28 hours.
In Tables 2–4, we summarize the results of our computational experiments,
including intermediate results from outputs at depth five. Since the learning phase
is performed for each class separately, the computation of threshold functions from
level three (see Fig. 1) is much shorter compared to the run-times presented in
Ref. 4 and therefore it was possible to complete a large number of runs, at least
three for each pair [l, m] and depths five and six. The results in Tables 2–4 are
representatives of an average outcome for the given [l, m]; the results from different
runs for m = 3 and the same settings are stable and differ only marginally. The test
on a single image from the 200 test examples is performed within a few seconds.
Table 2 shows the results for l = 7, m = 3, and the two types of experiments α
and β. There is only a marginal improvement for the adaptive annealing (type β)
and the run-time becomes longer in this case because it is more difficult to escape
from local minima when $\Gamma$ is small. On the other hand, the increase in computation time usually leads to better results. The best results we obtained for type α and type β calculations were the same, namely 3% ([4, 2] and [3, 3] errors, respectively).
In Table 3, the run-time is much shorter since the number of learning examples
in each of the classes Cl5j , j = 1, . . . , 5 is relatively small (see Table 1). Due to
the small number of samples, the classification rate becomes worse and therefore
we consider for l = 13 only the case m = 3. For the settings from Table 3, the
results for type β circuits were always better than type α results, with a particular
improvement for depth-six circuits.

Table 2. Results on test samples for l = 7 and three gray scale classes.

                       Type α                            Type β
Circuit   Time     Errors on       Total    Time     Errors on       Total
Depth     (Min.)   T POS  T NEG    Errors   (Min.)   T POS  T NEG    Errors
5          428       6      4       5.0%     529       3      7       5.0%
6         1315       5      3       4.0%    1665       4      3       3.5%

Table 3. Results on test samples for l = 7 and five gray scale classes.

                       Type α                            Type β
Circuit   Time     Errors on       Total    Time     Errors on       Total
Depth     (Min.)   T POS  T NEG    Errors   (Min.)   T POS  T NEG    Errors
5           93      13      6       9.5%     177       9      9       9.0%
6          265      13      5       9.0%     549       8      4       6.0%

Table 4. Results on test samples for l = 13 and three gray scale classes.

                       Type α                            Type β
Circuit   Time     Errors on       Total    Time     Errors on       Total
Depth     (Min.)   T POS  T NEG    Errors   (Min.)   T POS  T NEG    Errors
5         1077       4      4       4.0%    1268       3      3       3.0%
6         3165       2      3       2.5%    4117       1      2       1.5%

The results in Table 4 remain almost the same for values of l larger than 13.
The adaptive logarithmic cooling schedule provides a further improvement of the results obtained for type α circuits. The best results for type α were a 2% error rate, with 2 errors on the positive and the negative test examples, respectively. For type β circuits, some runs (depth six) finished with only a single misclassification on the set of 200 test examples.
To our knowledge, results on neural network applications to focal liver tumour
detection are not reported in the literature. Therefore, we cannot include a direct
comparison to related, previous work in our paper. In Ref. 23, a database of 683 cases is taken for learning and testing. The approach is based on feature extraction from image data and uses nine visually assessed characteristics for learning and testing. Consequently, with this condensed input information, a high classification rate between 97.1% and 97.8% is achieved.
Usually, in neural network research only a small depth of networks is considered
(one or two “hidden layers” in back-propagation networks) and the input informa-
tion is down-sized by preprocessing steps (feature extraction). In principle, neural
networks can be used to map the pixel images directly onto the target output values,
i.e. the entire pixel information is directly passed to the neural network. However,
such an approach will typically generate poor results as discussed in Ref. 22. With
respect to our specific problem setting, the computational effort of back-propagation
applied to n = 14 161 input values and a large number of “hidden units” seems to
be difficult to manage. For a discussion of the power of small-depth neural networks,
see Ref. 31.
Recently, support vector machines have been widely studied as a new learning tool; see Ref. 8. In this approach, higher dimensional separations are modeled by introducing auxiliary variables representing components of the separation functions, e.g. $y_1 = x_1^2$, $y_2 = 2 \cdot x_1 \cdot x_2$, etc. When we try to keep the entire image information as an input to the learning procedure, the introduction of a multiple number of variables would significantly increase the time complexity as well as the memory space complexity of the learning process.
In our experiments, we obtain a correct classification on 200 test examples of
more than 98%. Thus, we think that the approach presented in the paper is par-
ticularly suited to large scale learning problems where the original information is
retained during the training phase. The main features are (i) the combination of log-
arithmic simulated annealing with the Perceptron algorithm, (ii) the random choice

of samples out of the entire set of training samples, and (iii) the specific structure of
the bounded-depth classification circuit, including the partition of training exam-
ples by maintaining the entire image information. We expect further improvements
of the classification rate when circuits of depth seven are considered.

4. Concluding Remarks
We applied a combination of simulated annealing and the Perceptron algorithm
to the recognition of focal liver tumours. From 400 positive (focal liver tumour)
Int. J. Patt. Recogn. Artif. Intell. 2002.16:573-588. Downloaded from www.worldscientific.com

and 400 negative (normal liver tissue) examples we calculated threshold circuits of
depth six. The functions of the first layer are determined by a simulated annealing
by NORTHEASTERN UNIVERSITY on 02/09/15. For personal use only.

procedure with a logarithmic cooling schedule, where the neighborhood function is


specified by the Perceptron algorithm. The main parameter Γ of the logarithmic
cooling schedule Γ/ ln (k + 2) changes for increasing k according to the (in general,
decreasing) number of misclassified examples. The examples are presented in the
DICOM format and have the size n = 14 161 = 119 × 119. On test sets of 100 + 100
examples (disjoint from the learning set) we obtained a correct classification of more
than 98%. The test on a single image is performed within a few seconds, whereas
the run-time to compute the depth-six circuit with the best classification rate is
about 70 hours on a SUN Ultra 5/360 workstation.

Acknowledgment
The authors would like to thank the referees for their careful reading of the
manuscript and helpful suggestions that resulted in an improved presentation.
The authors are grateful to Eike Hein and Daniela Melzer (HUB, Institute of Radiology) for preparing the image material.

References
1. E.H.L. Aarts, Local Search in Combinatorial Optimization, Wiley, NY, 1998.
2. S. Agmon, “The relaxation method for linear inequalities,” Canadian J. Math. 6, 3
(1954) 382–392.
3. A. Albrecht, S. K. Cheung, K. S. Leung and C. K. Wong, "Stochastic simulations of
two-dimensional composite packings," J. Comput. Phys. 136, 2 (1997) 559–579.
4. A. Albrecht, M. J. Loomes, K. Steinhöfel and M. Taupitz, “A modified perceptron
algorithm for computer-assisted diagnosis,” Research and Development in Intelligent
Systems XVII, eds. M. Bramer, A. Preece and F. Coenen, BCS Series, Springer-Verlag,
2000, pp. 199–211.
5. A. Albrecht and C. K. Wong, “On logarithmic simulated annealing,” Theoretical
Computer Science: Exploring New Frontiers of Theoretical Informatics, eds. J. van
Leeuwen, O. Watanabe, M. Hagiya, P. D. Mosses and T. Ito, Lecture Notes in Com-
puter Science, Vol. 1872, 2000, pp. 301–314.
6. A. Albrecht and C. K. Wong, “Combining the perceptron algorithm with logarithmic
simulated annealing,” Neural Process. Lett. 14, 1 (2001) 75–83.

7. N. Asada, K. Doi, H. McMahon, S. Montner, M. L. Giger, C. Abe and Y. C. Wu,


“Neural network approach for differential diagnosis of interstitial lung diseases: a pilot
study,” Radiology 177 (1990) 857–860.
8. E. J. Bredensteiner and K. P. Bennett, “Multicategory classification by support vector
machines,” Comp. Optim. Appl. 12 (1999) 53–79.
9. O. Catoni, “Rough large deviation estimates for simulated annealing: applications to
exponential schedules,” Ann. Probab. 20, 3 (1992) 1109–1146.
10. O. Catoni, “Metropolis, simulated annealing, and iterated energy transformation al-
gorithms: theory and experiments,” J. Complex. 12, 4 (1996) 595–623.
11. L. P. Clarke, “Computer assisted-diagnosis: advanced adaptive filters, wavelets and
neural networks for image compression, enhancement and detection,” Proc. Meeting
of the Radiological Society of North America, 1994, p. 225.
12. E. Cohen, “Learning noisy perceptrons by a perceptron in polynomial time,”
Proc. 38th IEEE Symp. Foundations of Computer Science, 1997, pp. 514–523.
13. K. Doi, M. L. Giger, R. M. Nishikawa, H. McMahon and R. A. Schmidt, “Artificial
intelligence and neural networks in radiology: application to computer-aided diagnos-
tic schemes,” Digital Imaging, eds. W. Hendee and J. H. Trueblood, 1993, pp. 301–322.
14. D. B. Fogel, E. C. Wasson III, E. M. Boughton and V. W. Porto, “Evolving artificial
neural networks for screening features from mammograms,” Artif. Intell. Med. 14, 3
(1998) 317.
15. B. Hajek, “Cooling schedules for optimal annealing,” Math. Operat. Res. 13 (1988)
311–329.
16. H. Handels, Th. Roß, J. Kreusch, H. H. Wolff and S. J. Pöppl, “Feature selection for
optimized skin tumour recognition using genetic algorithms,” Artif. Intell. Med. 16,
3 (1999) 283–297.
17. R. Hindel, Implementation of the DICOM 3.0 Standard, RSNA Handbook, 1994.
18. S. Kirkpatrick, C. D. Gelatt, Jr. and M. P. Vecchi, “Optimization by simulated
annealing,” Science 220 (1983) 671–680.
19. X. Li, S. Bhide and M. R. Kabuka, “Labeling of MR brain images using Boolean
neural network,” IEEE Trans. Med. Imag. 15, 5 (1997) 628–638.
20. S. B. Lo, Y. C. Wu, M. T. Freedman, S. K. Mun and A. Hasegawa, "Detection
of microcalcifications by using adaptive-sized neural networks," Proc. Meeting of the
Radiological Society of North America, 1994, p. 171.
21. M. L. Minsky and S. A. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.
22. Z. Pan, A. G. Rust and H. Bolouri, “Image redundancy reduction for neural net-
work classification using discrete cosine transforms,” Proc. Int. Joint Conf. Neural
Networks, Vol. 3, Como, 2000, pp. 149–154.
23. C. A. Peña-Reyes and M. Sipper, "A fuzzy-genetic approach to breast cancer
diagnosis," Artif. Intell. Med. 17, 2 (1999) 131–155.
24. F. Romeo and A. Sangiovanni-Vincentelli, “A theoretical framework for simulated
annealing,” Algorithmica 6, 3 (1991) 302–345.
25. A. L. Ronco, “Use of artificial neural networks in modeling associations of discriminant
factors: towards an intelligent selective breast cancer screening,” Artif. Intell. Med.
16, 3 (1999) 299–309.
26. F. Rosenblatt, Principles of Neurodynamics, Spartan Books, NY, 1962.
27. C. Roßmanith, H. Handels, S. J. Pöppel, E. Rinast and H. D. Weiss, “Computer-
assisted diagnosis of brain tumors using fractals, texture and morphological
image analysis,” Proc. Computer-Assisted Radiology, ed. H. U. Lemke, 1995,
pp. 375–380.
28. E. Seneta, Non-Negative Matrices and Markov Chains, Springer-Verlag, NY, 1981.

29. K. Steinhöfel, A. Albrecht and C. K. Wong, “On various cooling schedules for sim-
ulated annealing applied to the job shop problem,” Randomization and Approxi-
mation Techniques in Computer Science, eds. M. Luby, J. Rolim and M. Serna,
Lecture Notes in Computer Science, Vol. 1518, Springer-Verlag, Barcelona, 1998,
pp. 260–279.
30. R. Tawel, T. Dong, B. Zheng, W. Qian and L. P. Clarke, “Neuroprocessor hardware
card for real-time microcalcification detection at digital mammography,” Proc. Meet-
ing of the Radiological Society of North America, 1994, p. 172.
31. P. E. Utgoff, D. J. Stracuzzi and R. P. Cochran, “Many-layered versus few-layered
learning,” TR-01-14, Department of Computer Science, University of Massachusetts,
January 9, 2001.
32. Y. C. Wu, K. Doi and M. L. Giger, “Detection of lung nodules in digital chest
radiographs using artificial neural networks: a pilot study,” J. Digit. Imag. 8 (1995)
88–94.
33. Y. Zhu and H. Yan, “Computerized tumour boundary detection using a Hopfield
neural network,” IEEE Trans. Med. Imag. 16, 1 (1997) 55–67.

Andreas Albrecht is a Lecturer in the Computer Science Department of Hertfordshire University. He received a Diploma in mathematics (summa cum laude) from Moscow State University, and the Ph.D. (Dr.rer.nat.) and the Habilitation (Dr.sc.nat.) degrees in mathematics from Humboldt University at Berlin. His research interests include complexity problems of Boolean functions, combinatorial optimization and algorithmic learning theory.

Kathleen Steinhöfel is a researcher in computer architecture and software technology at the Fraunhofer Institute, Berlin. She received her Diploma in Informatics from the Technical University, Leipzig and the Ph.D. degree (summa cum laude) from the Technical University, Berlin. Her research interests are in stochastic algorithms and learning theory.

Martin Loomes is a Professor at the Department of Computer Science of Hertfordshire University. He received a mathematics degree from Bath University, his Ph.D. from Surrey University and is a chartered mathematician. His research interests include formal methods, adaptive systems and interactive systems, with particular reference to learning systems.

Matthias Taupitz has specialised in diagnostic radiology and is an attending physician in the Department of Radiology, University Hospital Charité, Humboldt University at Berlin. He received his Diploma in physics and his doctoral degree (Dr. med.) in medicine from the Freie Universität Berlin. His research interests include cross-sectional imaging of the abdomen in general and, in particular, magnetic resonance imaging and organ specific contrast materials.
