• h = dim(~si), (i = t−k, . . . , t+m) is the dimensionality of the state layer. h is independent from d1 and d2.

The crucial problem of any classification is to sharply discriminate between the existing classes in the underlying feature space and to determine the discriminating and thus significant features. Given an input object ~xt, the associated class or type µ(~xt) ∈ C has to be resolved, as given by the typing function µ : O → C. Analogously to the so-called spread spectrum technique [4] from mobile communication, where data is spread over a wide bandwidth for transmission via the air interface, we exploit this mechanism to enhance the discrimination between classes. The spread spectrum mechanism is characterized by a wideband transmission of signals(1), which is very robust against external interference and noise. In Direct Sequence Spread Spectrum (DSSS) technology [4] all transmissions share a common carrier channel, which is furthermore exposed to environmental noise and various interferences. Analogously, for machine learning the carrier medium is represented by the h-dimensional internal state layer of the RNN, which serves as the state transition space S = R^h.

In contrast to wireless signal transmission, the signal to be transmitted is intentionally changed by the forward propagation of the recurrent neural network (cp. formula 1) in order to match the desired target class µ(~xt) ∈ C represented by ~yt+1. In terms of mobile communication, the sent signal carries the attribute values of the respective object ~xt ∈ R^d1 that should be mapped to its known class µ(~xt), which has to be learned during the training phase by minimizing the Euclidean distance ||~ot+1 − ~yt+1||2. All input sequences ~xt−k, . . . , ~xt are propagated through the recurrent state layer ~st−k, . . . , ~st, . . . , ~st+m in forward direction. Subsequently, the deviations from the targets ~yt+1, . . . , ~yt+m to be learned by the RNN are sent backwards. In the case of object classification, evaluated in section 4, the input-target sequences degenerate to input-target pairs (~xt 7→ ~yt+1) ∈ TS, where TS is the training set.

In the operative classification phase, the received signal has to be decoded to the correct class µ(~xt) ∈ C. This information is drawn from the spread output vector ~ot+1 (observed output), which has dimensionality dim(~ot+1) ≤ d2. After having used the targets ~yt+1 = f(C~st) (cp. formula 1) for network training, d2 is only an upper bound on the output dimensionality, since we allow vectors of variable dimensionality as encodings of the class labels. So the question is how to recover the class information from the output signal. A solution to this issue is given by the despreading mechanism in section 3.2.

3.1 Encoding of Class Labels Using Spread Spectrum

The spread spectrum encoding of a target class label Ci ∈ C, r(Ci) = ~b = b1 b2 . . . bn, bi ∈ {0, 1}, is performed by applying an XOR operation to the basic (unary) encoding of Ci. Thereby ~b is XORed with a fixed binary code – the so-called spreading code(2) – which imposes well-defined redundancy on the basic, non-redundant code vector ~b. We used either Barker codes [16] of different lengths or OVSF codes (cp. section 3.2) as spreading sequences of the form ~c = c1 c2 . . . cλ, ci ∈ {0, 1}, L = λ · n, λ ∈ N, where λ is the spreading factor and L is the overall length of the resulting code.

Spreading Process  The spreading process is defined by the function spr, which convolutes an arbitrary bit vector ~b – representing the object class, for example – with a well-defined spreading code ~c:

    spr(~b, ~c) = xor(b1, c1) xor(b1, c2) . . . xor(b1, cλ)
                 xor(b2, c1) xor(b2, c2) . . . xor(b2, cλ)
                 . . .
                 xor(bn, c1) . . . xor(bn, cλ)                    (4)

A demonstrative example is given in section 4.1. The spread spectrum technique is imposed as an additional encoding here, which significantly improves the type classification for the computed output signal ~ot+1. Each class label Cj is assigned its own spreading sequence ~cj such that all instances ~xk ∈ Cj of the same class are encoded by ~cj.

3.2 Classification by Despreading

The data spreading in the form of an additional encoding introduces redundancy depending on the fixed spreading factor λ ∈ N and thus also a drawback in computational efficiency. On the other hand, the obtained process gain justifies the insertion of redundancy. The process gain, which is visualized in figure 2, is originally defined as

    PG := 10 · log10(carrier bandwidth / information bandwidth) [dB]    (5)

and is measured in decibels [19]. When employed in terms of neural processing, the bandwidth is measured as the number of bits used to encode a class Ci, that is, the dimensionality of the spread target vector ~yt+1. The insertion of redundancy is similar to adding parity information for error recognition in binary sequences, as done by Cyclic Redundancy Check (CRC) or Hamming codes. The strong benefit of our adaptation of this technique is the achievable degree of discrimination between all existing classes Cj, Ck, j ≠ k. As a consequence, the correct class µ(~xt) = Cj can be determined with higher probability after the despreading process.

(1) Utilized by Code Division Multiple Access (CDMA) in UMTS.
(2) Also called chipping or spreading sequence.
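The spreading function spr of formula (4) simply XORs every bit of the class code ~b with each chip of the spreading code ~c. A minimal Python sketch (function and variable names are ours, not from the paper):

```python
def spr(b, c):
    """Spread bit vector b with spreading code c (formula 4).

    Every bit b_k is XORed with each chip c_1..c_lambda, so the
    result has length len(b) * len(c) = n * lambda = L.
    """
    return [bk ^ ci for bk in b for ci in c]

# Example from section 4.1: class code (1, 0) spread with Barker code (1, 1, 0, 1).
print(spr([1, 0], [1, 1, 0, 1]))  # [0, 0, 1, 0, 1, 1, 0, 1]
```

Note that the spread code grows linearly with λ, which is exactly the redundancy that the process gain of formula (5) pays for.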
Table 1. Lookup-table containing the basic encoding and the assigned spreading codes that are unique for each class.

    bitSum[k] := Σ_{i=((k−1)·λ)+1}^{k·λ} τi                        (8)

    b̂k := θ_minV(bitSum[k]),  minV := λ/2,  k ∈ {1, . . . , n}     (9)

Despreading Example  When the network predicts the output vector ~o = (0.0024, 0.9998, 0.00035, 0.9999, 0.9997, 0.0023, 0.9998, 0.0022), this vector is first digitized by an appropriate Heaviside function: ~o_dig = (0, 1, 0, 1, 1, 0, 1, 0). Then ~o_dig is despread with each of the existing spreading codes that represent the object classes.
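Formulas (8) and (9) can be read as: XOR the digitized output with the candidate code, sum each λ-bit block, and threshold the block sum at minV = λ/2. A hedged Python sketch — the tie handling at exactly λ/2 and the certainty measure are our interpretation, not spelled out in the paper:

```python
def despread(o_dig, c):
    """Despread digitized output o_dig with code c (formulas 8, 9).

    Returns the decoded bits (None marks an undecidable block whose
    bit sum lies exactly on the threshold minV = lambda/2) and the
    worst-case block certainty in [0, 1].
    """
    lam = len(c)
    n = len(o_dig) // lam
    tau = [o ^ c[i % lam] for i, o in enumerate(o_dig)]   # XOR with repeated code
    bits, certainty = [], 1.0
    for k in range(n):
        bit_sum = sum(tau[k * lam:(k + 1) * lam])         # formula (8)
        dist = abs(bit_sum - lam / 2) / (lam / 2)         # 0 = ambiguous, 1 = certain
        certainty = min(certainty, dist)
        bits.append(None if bit_sum * 2 == lam else int(bit_sum * 2 > lam))  # formula (9)
    return bits, certainty

# The despreading example from the text:
print(despread([0, 1, 0, 1, 1, 0, 1, 0], [1, 0, 0, 1]))  # ([None, None], 0.0)
print(despread([0, 1, 0, 1, 1, 0, 1, 0], [1, 0, 1, 0]))  # ([1, 0], 1.0)
```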
Figure 3. Schematic processing steps and required gates for despreading of the predicted numerical output signal ~ot+1 .
    despr((0, 1, 0, 1, 1, 0, 1, 0), (1, 0, 0, 1)) = (1, 1, 0, 0 | 0, 0, 1, 1) → (?, ?)

    despr((0, 1, 0, 1, 1, 0, 1, 0), (1, 0, 1, 0)) = (1, 1, 1, 1 | 0, 0, 0, 0) → (1, 0)

The vertical bar separates the two λ-bit blocks; "?" marks a block whose bit sum equals minV and thus cannot be decoded. Here, a 100% certainty for class C2 is reached by despreading with the second code ~c2, while the first code does not allow a unique decoding at all (0% certainty). Thus the classification is unique and a final lookup in table 1 reveals the predicted class label. The removal of the beforehand imposed redundancy yields the main advantage of the spread spectrum technique: robustness against external interference such as noise blurring the input data. Those effects are spread out (minimized) by the despreading process, which is also shown in the evaluation in the final section. For automatically generating a sufficient number of spreading codes, the concept of Orthogonal Variable Spreading Factor (OVSF) can be used as an alternative to Barker codes. To separate a higher number of different object classes, these are assigned unique OVSF codes, which possess appropriate correlation properties and can be recursively generated via a tree schema [19].

Autocorrelation: Correlation of a bipolar(3) sequence ~c of N elements with all phase shifts n of itself:

    Φ_{~c,~c}(n) = (1/N) · Σ_{m=1}^{N} c[m] · c[m+n]

A spreading code ~c holds a good autocorrelation if its inner product Φ_{~c,~c}(0) with itself is high and Φ_{~c,~c}(n) is low for all shifts n = 1, . . . , N−1.

Cross-correlation: Correlation of two sequences ~c and ~d, while ~d is shifted N times:

    Φ_{~c,~d}(n) = (1/N) · Σ_{m=1}^{N} c[m] · d[m+n],  n = 0, . . . , N−1,  c[i], d[i] ∈ {−1, 1}

For good separation properties between different classes Ci, Cj, the respective spreading codes ~ci and ~cj must have a low cross-correlation value Φ_{~ci,~cj}(n) for all shifts n = 1, . . . , N−1. If their cross-correlation is zero, these codes are said to be fully orthogonal.

Barker codes hold an autocorrelation of Φ_{~c,~c}(0) = 1 in the unshifted case and Φ_{~c,~c}(n) = −1/N, N = dim(~c), n = 1, . . . , N−1, for the N−1 shifts of themselves, which is intended for good recognition as well as distinction of phase shifts. In comparison to OVSF codes, the Barker codes used here are also distinguished by their different lengths, which is a further separating property that compensates for not being fully orthogonal.

In signal transmission, good autocorrelation is essential to achieve synchronization between sender and receiver. Here, it is useful for recognizing the length of the employed spreading code in the decoding phase, since codes of different lengths are allowed for different classes. Due to its modular design, the RNN is capable of processing vectors of variable dimensionality to learn class labels spread by codes of different lengths.

(3) −1, 1 instead of 0, 1.
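The two correlation measures, the Barker autocorrelation property and the recursive OVSF tree construction can be checked numerically. A sketch under our own naming; we assume the shift m+n wraps around cyclically (the formulas leave the wrap-around implicit), and we verify OVSF orthogonality only in the unshifted case:

```python
def bipolar(c):
    """Map a {0, 1} code to the {-1, +1} alphabet used by the correlations."""
    return [2 * x - 1 for x in c]

def corr(c, d, n):
    """Cross-correlation Phi_{c,d}(n) with cyclic shift n; corr(c, c, n) is the autocorrelation."""
    N = len(c)
    return sum(c[m] * d[(m + n) % N] for m in range(N)) / N

def ovsf(depth):
    """Recursively generate the 2**depth OVSF codes of length 2**depth via the code tree:
    each code c spawns the children (c, c) and (c, not c)."""
    codes = [[1]]
    for _ in range(depth):
        codes = [child for c in codes
                 for child in (c + c, c + [1 - x for x in c])]
    return codes

barker7 = bipolar([1, 1, 1, 0, 0, 1, 0])                 # Barker code of length 7
print(corr(barker7, barker7, 0))                         # 1.0 (peak)
print([corr(barker7, barker7, n) for n in range(1, 7)])  # all equal to -1/7

walsh = ovsf(3)                                          # 8 codes of length 8
# Any two distinct OVSF codes are orthogonal in the unshifted case:
print(all(corr(bipolar(c), bipolar(d), 0) == 0
          for c in walsh for d in walsh if c != d))      # True
```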
3.3 Complexity Discussion

The spread spectrum enabled classification is efficient, since the despreading step itself can be performed in O(|C|) = O(r) in principle, where r is the number of classes – in comparison to s_r := Σ_{i=0}^{r} |Ci| as the number of instances in all classes of the given dataset (r << s_r). Furthermore, the number N_comp of weighted-sum computations (Σ_{i=1}^{d} xi wij), wij ∈ M ∈ {A, B, C}, d ∈ {d1, d2}, performed by the (trained) neural network to compute the classification increases linearly in d1. The growth depends on the fixed dimension h of the hidden state layer, which is constant: N_comp = h · d1 + h · d2 = h · (d1 + (λ_max · r)). The factor λ_max is the length of the longest assigned spreading code. As described in section 3.1, d2 = λ_max · r is the upper bound on the number of bits required to represent all target classes. The despreading process of figure 3, which itself requires d2 = λ_max · r computations, has to be performed r times, once for each class. So the entire complexity to classify an input object is O((N_comp + d2) · r) = O((h(d1 + d2) + d2) · r) = O([hd1 + (h+1)d2] · r), which, dropping the constant terms, is O((d1 + d2) · r) = O(d1 r + λ_max r²), since the constant h determines the network resources and is widely independent of the object representation (with dimensionality d1). Compared to a nearest-neighbor approach, the term s_r for calculating all neighbor distances is omitted, while the factor r is added to the complexity and d2 becomes λ_max times bigger.

4 Evaluation

The proposed spread spectrum classification based on a connectionist model is a novel approach. Therefore, the accuracy of the classifier was evaluated on three standard classification benchmarks. The first one is the publicly available MUSK2 dataset [11], which is a binary distribution of molecules. In order to test the discrimination capability, we also applied our technique to a multi-class problem that requires r > 2 different spreading codes.

The first dataset describes molecules by their phenotypic appearance, which is called conformation. The dataset contains 6,598 different conformations of 102 molecules; 39 of these molecules are judged by human experts to be musks and the remaining 63 are judged to be non-musks. A single molecule is processed as a multi-represented object that contains all of its feature values. "Musk odor is a specific and clearly identifiable sensation, although the mechanisms underlying it are poorly understood. Musk odor is determined almost entirely by steric (i.e., "molecular shape") effects (Ohloff, 1986)." [13]. These characteristics are captured as molecule conformations represented as feature vectors and typed by our typing mechanism.

We show the robustness of the proposed classifier by artificially imposing noise onto the untrained test instances within the respective test set. Thereby the performance of the neural classifier is compared for spread spectrum encoded and basic encoded (non-spread) class labels. Both approaches are evaluated under the influence of stepwise increased noise levels. Instead of applying the complex and non-intuitive spread spectrum process for imposing type information, one could also follow a straightforward approach by using the basic encoding of the r existing classes, while skipping the spreading step.

4.1 Binary Classification of Molecule Data

The simple approach unarily encodes the d2 := r considered classes by d2-dimensional target vectors ~y, where (yi = 1 ∧ yj = 0, ∀j ≠ i) ⇔ µ(~x) = Ci. In this case the classification of the input object ~x via the output vector ~o := ~ot+1 is done by picking the maximum component o_max: µ(~x) = C_max ⇔ o_max = max{o1, . . . , o_d2}. This solution is less powerful than the spread spectrum classification variant. To actually support this hypothesis, we compared both techniques with regard to their robustness against noise.

Training Data Representation  The molecules of the MUSK2 dataset to be classified by our connectionist classifier appear in different conformations, which are described by 166 features. This information is used to build a straightforward feature vector representation:

    ~xt = (x1, . . . , x162, x163, . . . , x166),  xi ∈ R           (10)

where the first 162 components hold the features f1 to f162 and the last four the features f163 to f166. The real-valued attributes f1 to f162 are distances measured in hundredths of Angstroms. The distances may be negative or positive.

Depending on the chosen spreading factor λ, the spreading process computes a (2 · λ)-bit target vector in the case of the 2 classes Musk and Non-Musk. The following example spreads each class by the Barker codes (1, 1, 0, 1) and (1, 1, 1, 0) of the same length λ = 4 using the spreading function spr defined in section 3.1:

    spr((1, 0), (1, 1, 0, 1)) = (0, 0, 1, 0, 1, 1, 0, 1)
    spr((0, 1), (1, 1, 1, 0)) = (1, 1, 1, 0, 0, 0, 0, 1)

The training and test patterns both hold the form input 7→ target, while the test patterns were excluded from training. For the presented evaluation a spreading factor of 16 was chosen:

    r(o) 7→ spr(~ei, ~ci)  ⇔  (x1, . . . , x166) 7→ spr(~ei, (ci1, . . . , ci16))
Table 2. Average classification accuracy for the MUSK2 dataset for different noise levels after 10-fold cross-validation. The
measure variation indicates the range of the obtained accuracy values over the 10 individual test sets. For all predictions the network
was trained until a residual error level of ≤ 1.5% was reached. Best accuracy is in bold.
Dataset Measurement Accuracy [%] of classification by Spread Spectrum for uniform noise n
n=0% n≤5% n≤10% n≤15% n≤20.0%
MUSK2 Mean 97.26 96.42 93.47 77.74 88.03
Variation [95.15 - 98.03] [94.24 - 97.73] [90.15 - 95.75] [58.79 - 95.14] [84.55 - 90.61]
n≤22.5% n≤25.0% n≤27.5% n≤35.0% n≤50.0%
Mean 82.95 75.89 37.37 67.02 61.05
Variation [61.67 - 95.15] [61.21 - 88.03] [23.82 - 73.48] [30.15 - 82.42] [21.21 - 81.52]
Accuracy [%] of classification by Basic Encoding for uniform noise n
n=0% n≤5% n≤10% n≤15% n≤20.0%
MUSK2 Mean 97.73 96.56 93.56 74.08 80.04
Variation [93.94 - 99.85] [94.24 - 97.88] [90.15 - 96.52] [54.24 - 90.91] [28.94 - 92.42]
n≤22.5% n≤25.0% n≤27.5% n≤35.0% n≤50.0%
Mean 80.04 66.79 30.54 59.34 52.35
Variation [28.94 - 92.42] [25.91 - 92.12] [16.69 - 62.88] [29.09 - 86.06] [21.09 - 87.88]
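The uniform noise used for table 2 adds a random offset of magnitude at most n, with random sign, to each component of the [0, 1]-scaled input vector. A hedged sketch of how such test noise can be generated (our own helper, not the authors' code; the fixed seed only makes the sketch reproducible):

```python
import random

def impose_uniform_noise(x, n, rng=random.Random(0)):
    """Distort a [0, 1]-scaled feature vector: x_i +/- n_i with n_i in [0, n].

    The result may leave [0, 1], e.g. (0.65, 0.34, ...) -> (1.34, 0.10, ...)
    as in the example in the text.
    """
    return [xi + rng.choice([-1.0, 1.0]) * rng.uniform(0.0, n) for xi in x]

noisy = impose_uniform_noise([0.65, 0.34, 0.86, 0.07, 0.95], 0.5)
# Every component is displaced by at most the noise level n:
print(all(abs(a - b) <= 0.5
          for a, b in zip(noisy, [0.65, 0.34, 0.86, 0.07, 0.95])))  # True
```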
The function r creates the [0, 1]-scaled feature vector representation of any multi-represented object o that consists of categorical or metric features. When the spreading code is of length λ = 16 and there are 2 classes C = {C1, C2}, the spreading process computes a 32-bit (16 · 2) target vector. Ten-fold cross-validation was used to obtain significant accuracy measurements, so the 6,598 objects were divided into 10 disjoint test sets. Additionally, from each of the 10 corresponding training sets of 5,641 patterns, a fraction of 297 patterns (1/20) was split off and used for checking the connectionist model quality and for determining when to stop the training. Network training was stopped when the training error was below 1.5% and the model quality was just declining after continuously rising. The average model quality on the 10 auxiliary sets that were excluded from network training was 97.51% for the basic encoding approach and 99.85% for the spread spectrum variant. Table 2 shows the results of the evaluation process for different degrees of imposed noise and different spreading factors. Uniformly distributed noise n interferes with the [0, 1]-scaled numeric input pattern by adding the noise realizations to each of its components: xi ± ni, xi ∈ [0, 1], ni ∈ [0.0, n · 1.0], n ∈ {5%, 10%, 15%, . . .}; for example, (0.65, 0.34, 0.86, 0.07, 0.95, . . .) is distorted to (1.34, 0.10, 1.05, −0.23, 1.49, . . .).

The results clearly show the high robustness of the spread spectrum supported classification, which is less affected by the increasing noise level than the unspread variant. Noise of n ≤ 27.5% leads to outlying accuracies for both techniques, since the classification accuracy drops down to 30.54% for basic encoding and 37.37% for the spread spectrum variant. This is plausible, since with a certain (low) probability, noise of lower amplitude but with the "right" signs can do more harm than a higher noise level that does not displace the feature values as much, due to the sign of each noise component. Summing up the accuracy over all noise levels, the spread spectrum technique achieves an overall accuracy of 77.72% and thus outperforms the simple approach, which achieves 72.24%, in several cases leading by more than 10%. Furthermore, the lower classification variance of the spread spectrum technique is a sign of its higher confidence and discrimination between the classes.

As a reference, the algorithm Iterated Discrimination, which belongs to the class of Axis Parallel Rectangle (APR) methods, was used by Dietterich et al. [15] to classify the MUSK1 pharmaceutical dataset. MUSK1 is related to the MUSK2 benchmark, and the APR method achieved an accuracy of 92.4% on it. Dietterich et al. [13] achieved their best result when using a domain-specific (dynamic reposing) neural network approach for classifying the molecules of the MUSK2 dataset. Without imposing noise on the objects, 10-fold cross-validation led to a maximal accuracy of 91%, which is almost 7% below the result at hand.

4.2 Multi-Class Classification of Forest Data

The second dataset stems from geography and deals with forest cover types [7]. By virtue of 12 metric and categorical attributes, the 7 cover types Spruce/Fir, Krummholz,
Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen and Douglas-fir are to be classified. The dataset, consisting of 581,012 instances, was originally obtained from the US Geological Survey (USGS). We have chosen this dataset since it provides a multi-class problem for evaluating the discriminative capability of the spread spectrum technique in the case of more than two classes. Furthermore, a neural network approach was already conducted on this dataset, which achieved a classification accuracy of about 70.0% [7]. Blackard's connectionist classification (70.58%), which was generated by model averaging using thirty networks with randomly selected initial weights, significantly outperformed the alternative linear discriminant analysis model (58.38%).

The forest data was elicited from cartographic material only and describes the respective cover type by characteristic attributes like elevation, aspect, slope, horizontal and vertical distance to the nearest hydrology (water surface), and others, measured in meters, azimuth or degrees. Each instance represents a 30 x 30 meter cell, which was labeled with one of seven forest cover types (classes). A significance test showed that the 44 binary variables indicating the wilderness area and the soil type can be disregarded, so we excluded these variables from training – additionally improving training efficiency by the reduced input dimensionality. The underlying dataset together with a detailed description can be obtained from Blackard [7].

Again we compared the classification performance of our robust classification method with a support vector classifier under the influence of noise, based on the same training and test instances. We used the same network topology as before with a hidden layer dimension of h = 30. Seven distinct OVSF codes served for the encoding of the existing cover type classes: ~c1 := (1, 0, 0, 1, 0, 1, 1, 0), C1 := "Douglas-fir"; ~c2 := (1, 0, 0, 1, 1, 0, 0, 1), C2 := "Aspen"; ~c3 := (1, 0, 1, 0, 0, 1, 0, 1), C3 := "Cottonwood/Willow"; ~c4 := (1, 0, 1, 0, 1, 0, 1, 0), C4 := "Ponderosa Pine"; ~c5 := (1, 1, 0, 0, 0, 0, 1, 1), C5 := "Lodgepole Pine"; ~c6 := (1, 1, 0, 0, 1, 1, 0, 0), C6 := "Spruce/Fir"; ~c7 := (1, 1, 1, 1, 0, 0, 0, 0), C7 := "Krummholz".

The cross-correlation among these OVSF codes is actually zero, Φ_{~ci,~cj}(n) = 0, i ≠ j, i, j = 1, . . . , 7, n = 0, . . . , N−1, which makes them fully orthogonal. The output space, which is 7-dimensional for unary encoding, becomes (7 · 8 = 56)-dimensional in the case of spread spectrum output coding, since the required OVSF code length is dim(~ci) = 8, i = 1, . . . , 7. The classification accuracies are shown in table 3 and a direct comparison with an SVM classifier is given in table 4.

Table 4. Comparison of the classification accuracy of the Spread Spectrum and the Support Vector Machine (SVM) classification of the forest cover type dataset. The performance advantage ∆ of our novel classification technique is significantly positive for all degrees of uniform noise except one. All values are given in percent [%].

    Noise Level   Spread Spectrum   SVM     ∆
    0             85.91             66.28   + 19.63
    5             58.94             33.96   + 24.98
    10            42.00             35.85   + 6.15
    15            31.33             18.88   + 12.45
    20            26.29             32.08   - 5.79

As expected, the results of the spread spectrum approach are even more convincing for the multi-class problem, simply due to the lower success probability P(K(o) = C(o)) = 1/7 for a test object o and a classifier K (equal distribution assumed). The output encoding by spreading codes with well-defined correlation properties enhances the classification ability by the sharper class discrimination that prevents misclassification more effectively. The highest accuracy of 85.91% was realized by the spread spectrum technique based on a test set consisting of 1,050 randomly chosen instances.

In order to compare the results with an advanced classification technique, we trained a support vector machine (SVM)(4) classifier on exactly the same training set. Without noise, 10-fold cross-validation resulted in a classification accuracy of 66.28%, which means that the SVM approach falls short by almost 20% compared to the performance of our connectionist technique. Under the influence of the same uniformly distributed noise, the performance (percentage of correctly classified test instances) of the SVM classifier dropped dramatically to about 34% with 5% noise, 35.85% with 10% noise, 18.88% with 15% noise and 32.08% with 20% noise. This underpins the difficulty of reliably classifying feature vectors that are exposed to considerable noise, which is better achieved by the connectionist classifier based on spread spectrum.

4.3 Classification of Diabetes Patients

Finally, we employed our method for a medical classification task based on a binary distribution of Pima Indian diabetes patients. The dataset consisted of 768 instances with 8 attributes, each instance belonging to one of the classes healthful (65.1%) and diseased (34.9%) [9, 5]. Again we evaluated the classification performance for different degrees of noise by ten-fold cross-validation. The results are presented in table 5. We also used Gaussian instead of uniformly distributed noise to distort the input vectors, which was gener-

(4) The SVM classifier (SMO) from the Weka data mining package was used [24].
Table 3. Average classification accuracy for the forest cover type dataset for different noise levels after 10-fold cross-validation.
The cumulated advance of the spread spectrum technique compared to the basic encoding amounts to 18.33% over all considered
noise levels. For all predictions the network was trained until a residual error level of ≤ 7.9% was reached. Best accuracy is in bold.
Dataset Measurement Accuracy [%] of classification by Spread Spectrum for uniform noise n
n=0% n≤5% n≤10% n≤15% n≤20.0% n≤55.0%
Forest Mean 85.91 58.94 42.00 31.33 26.29 27.91
CoverType Variation [80.19 - 91.51] [51.43 - 66.98] [23.81 - 49.52] [23.58 - 39.05] [13.33 - 38.68] [19.81 - 44.76]
Accuracy [%] of classification by Basic Encoding for uniform noise n
n=0% n≤5% n≤10% n≤15% n≤20.0% n≤55.0%
Forest Mean 83.94 54.37 45.53 29.13 20.60 21.05
CoverType Variation [75.47 - 89.62] [40.57 - 67.92] [33.96 - 57.55] [19.81 - 42.45] [3.81 - 32.08] [12.26 - 33.96]