Noise Robust Classification Based On Spread Spectrum

Submitted for Blind Review

Abstract

In this technical paper we develop a robust classification mechanism based on a connectionist model in order to learn and classify objects from arbitrary feature spaces. Thereby, a joint approach of recurrent neural networks and spread spectrum symbol encoding is implemented in order to classify any kind of object that can be represented by feature vectors.

Our main contribution is to adapt the spread spectrum method from signal transmission technology to the classification of feature vectors, which are encoded for neural processing by means of unique spreading sequences. The idea behind this data spreading approach is related to the field of error-correcting output coding, but is furthermore characterized by a despreading mechanism that results in high classification accuracy and robustness against noisy or incomplete data.

We applied our technique to three publicly available classification benchmarks, which stem from three different domains, namely biology, geography and medicine. In case of the MUSK2 molecule dataset (biology), ten-fold cross-validation of our technique revealed a classification accuracy of 97.7% at maximum, which is about 7% better than any published algorithm. In presence of a noise level of up to 25.0%, an accuracy of 75.9% was still achieved.

1 Introduction

Many knowledge-driven and domain-specific problems like speech and handwriting recognition, biometric identification, credit scoring or document classification can be turned into statistical classification problems O := {o_1, ..., o_m} → {C_1, ..., C_r} =: C by partitioning the domain objects o_i ∈ O into appropriate classes C_j ∈ C. A classification mechanism must be robust and capable of handling fuzzy, incomplete and partially incorrect data, which result from incomplete and inaccurate sensor measurements. The central idea of this paper is to boost classification accuracy and robustness by a proven and noise-resistant method from signal transmission (spread spectrum), which is adapted to a neural classifier. Multi-represented objects [3] should be reliably classified even when affected by high extents of noise.

The area of knowledge discovery in databases already provides a variety of techniques for noise robust clustering in high-dimensional spaces and in arbitrary subspaces of those, which are based either on similarity, density or subspace hyperplanes [1, 2, 26, 12, 17]. By contrast, noise robust classification is mainly found in speech recognition [21, 25], but not as a general-purpose application.

Meta-algorithms for classification or regression problems such as bootstrap aggregating (bagging) [22] have been employed to increase the robustness of classification techniques by selecting more representative training sets from the basic population. These are relatively simple approaches to reliable classification that average several classifiers or predictors and thus are not useful for improving linear models. Additionally, one has to accept the overhead of constructing and evaluating all participating individual classifiers.

As opposed to meta-algorithms, our technique transforms the output space into a higher-dimensional space that eventually serves for the object classification. This idea was already employed in a similar way by Error-Correcting Output Coding (ECOC) [18, 20]. Error-correcting codes have been used with decision trees and neural networks for classification tasks, for example by Dietterich et al. [14]. Berger [6] improved the classification of unstructured text using ECOC. The voting that is performed among the multiple classifiers in case of ECOC corresponds to the despreading step of our technique, which also determines the class that matches best with the computed output signal. Compared to these error-correction approaches, we do not solve k-class supervised learning problems by training multiple 2-class classifiers. Instead, only one instance of our recurrent neural network is trained on the whole training set.

According to Dietterich et al., ECOC reduces both bias and variance of the used classification model. In contrast to the bias, the related concepts noise and variance represent unsystematic errors. For example, the residuals resulting from a least squares optimization may still contain information, but noise as an unsystematic error does not. The variance of a classifier can be measured when classifying unseen instances from the test set with a certain misclassification rate. Similar to error-correcting output codes, the spread spectrum technique also reduces bias and variance of the classification model, which is validated by the higher generalization performance on the benchmark datasets evaluated in section 4.

Nevertheless, there are no extensive studies of how the performance of the mentioned error-correcting techniques changes when the objects' attribute values are interfered with by noise, which actually occurs due to deviations of the used measuring instruments or measuring errors. The objects affected by these and further unwanted effects may exhibit systematic (e.g. periodic) or non-systematic (e.g. white noise) deviations from the expected distribution. In general it is difficult to tell which characteristics determine noise: "One person's noise could be another person's signal."

Figure 1. Schematic topology of the proposed modular recurrent network RNN. The block arrows indicate the internal state transition s_t → s_{t+1}. A, B and C are weight matrices. x_t is the external input vector at time t, y_{t+m} is the correspondingly predicted output vector. As opposed to sequence prediction, for classification only one output unit y_{t+1} is used. The depicted block arrow direction shows the forward propagation phase.

2 The Classification Engine

Recurrent neural networks are a subclass of artificial neural networks, which are characterized by recurrent connections between their units. These typically form a directed cycle, while common feed-forward networks do not allow any cycles [10]. A main reason for choosing a neural network to solve the classification problems described in section 4 was the promising results reported by Blackard [7].

We have designed a modular recurrent neural network (RNN) that is employed as the enabling technology in this paper and was already presented in [?]. The fundamental data structure processed by the RNN is a sequence of arbitrarily dimensional feature vectors that stand for multi-represented objects. Multi-representation is a concept to address the manifold contents carried by complex domain objects; that is, multi-represented objects capture several aspects of the same object. A recent example is the encapsulation of all biometric features of a person, like voice pattern, image and fingerprint, by a single multi-represented object.

Recurrent Network Model   The basic design of the recurrent neural network is defined by the following propagation model and is visualized by the schema in figure 1.

• A ∈ R^{h×d_1}, B ∈ R^{h×h} and C ∈ R^{d_2×h} are weight matrices.

• d_1 is the dimensionality of the input feature space and d_2 the dimensionality of the output feature space. Regarding the output space, the network operates like a bit processor throughout the whole spreading and despreading process described in the next sections.

• h = dim(s_i), i = t−k, ..., t+m, is the dimensionality of the state layer. h is independent of d_1 and d_2 and was set to h = 15 (experimentally determined) to provide sufficient network resources.

\vec{s}_t = f(B\vec{s}_{t-1} + A\vec{x}_t), \quad \vec{s}_0 := \vec{0} \qquad (1)
\vec{o}_{t+1} = f(C\vec{s}_t) \qquad (2)
\vec{o}_{t+i} \xrightarrow{\text{training}} \vec{y}_{t+i}, \quad i = 1, \ldots, m \qquad (3)

The vector s_t stands for the internal state at the discrete time step t. The state layer composed of these vectors is the backbone for learning the input-target sequences and for classifying or predicting symbol sequences. Each neuron record (the block arrows depicted in figure 1) serves both as hidden unit and as context unit, because s_{t−1} provides a context for the recursive computation of the subsequent hidden state s_t.

The crucial recurrent equation 1 combines an external input x_t with the previous state s_{t−1} into the subsequent state s_t, which indirectly depends on all foregoing external inputs x_{t−k}, ..., x_{t−1} and internal states s_{t−k}, ..., s_{t−1}. In case of supervised network training, the target symbols y_{t+1}, ..., y_{t+m} are known, while in case of actual structure classification the output sequence o_{t+1}, ..., o_{t+m} is computed solely based on the respective inputs. Here, the activation function is chosen as the sigmoid function f(x) = 1/(1 + exp(−x)). The RNN is trained with a modified Backpropagation Through Time (BPTT) algorithm [10, 23] and is able to process variably dimensional vectors x_i ∈ R^m and y_j ∈ R^n, m ≠ n.
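To make the propagation model concrete, the following minimal sketch implements the forward pass of equations 1 and 2 with NumPy. The dimensions and randomly initialized weight matrices are illustrative placeholders, not the trained parameters of the evaluated networks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(A, B, C, inputs):
    """Propagate a sequence of input vectors through the recurrent state layer.

    A: (h, d1) input weights, B: (h, h) recurrent weights, C: (d2, h) output weights.
    Returns the output vector o_{t+1} computed from the final state (equation 2)."""
    h = B.shape[0]
    s = np.zeros(h)                       # s_0 := 0
    for x in inputs:                      # equation 1: s_t = f(B s_{t-1} + A x_t)
        s = sigmoid(B @ s + A @ x)
    return sigmoid(C @ s)                 # equation 2: o_{t+1} = f(C s_t)

# Hypothetical dimensions: d1 = 4 input features, h = 15 state units, d2 = 8 output bits.
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(15, 4)), rng.normal(size=(15, 15)), rng.normal(size=(8, 15))
o = forward(A, B, C, [rng.normal(size=4)])
print(o.shape)  # (8,)
```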
3 Spread Spectrum Based Classification

The crucial problem of any classification is to sharply discriminate between the existing classes in the underlying feature space and to determine the discriminating and thus significant features. Given an input object x_t, the associated class or type µ(x_t) ∈ C has to be resolved, given by the typing function µ : O → C. Analogously to the so-called spread spectrum technique [4] from mobile communication, where data is spread over a wide bandwidth for transmission via the air interface, we exploit this mechanism to enhance the discrimination between classes. The spread spectrum mechanism is characterized by a wideband transmission of signals¹, which is very robust against external interferences and noise. In Direct Sequence Spread Spectrum (DSSS) technology [4] all transmissions share a common carrier channel, which is furthermore exposed to the environmental noise and various interferences. Analogously, for machine learning the carrier medium is represented by the h-dimensional internal state layer of the RNN that serves as state transition space S = R^h.

In contrast to wireless signal transmission, the signal to be transmitted is intentionally changed by the forward propagation of the recurrent neural network (cp. formula 1) in order to match the desired target class µ(x_t) ∈ C represented by y_{t+1}. In terms of mobile communication, the sent signal carries the attribute values of the respective object x_t ∈ R^{d_1} that should be mapped to its known class µ(x_t), which has to be learned during the training phase by minimizing the Euclidean distance ||o_{t+1} − y_{t+1}||_2. All input sequences x_{t−k}, ..., x_t are propagated through the recurrent state layer s_{t−k}, ..., s_t, ..., s_{t+m} in forward direction. Subsequently, the deviations from the targets y_{t+1}, ..., y_{t+m} to be learned by the RNN are sent backwards. In case of the object classification evaluated in section 4, the input-target sequences degenerate to input-target pairs (x_t ↦ y_{t+1}) ∈ TS, where TS is the training set.

In the operative classification phase, the received signal has to be decoded to the correct class µ(x_t) ∈ C. This information is drawn from the spread output vector o_{t+1} (observed output), which has dimensionality dim(o_{t+1}) ≤ d_2. After having used the targets y_{t+1} = f(C s_t) (cp. formula 1) for network training, d_2 is only an upper bound on the output dimensionality, since we allow variably dimensional vectors as encoding of the class labels. So the question is how to recover the class information from the output signal. A solution to that issue will be given by the despreading mechanism in section 3.2.

3.1 Encoding of Class Labels Using Spread Spectrum

The spread spectrum encoding of a target class label C_i ∈ C, r(C_i) = b = b_1 b_2 ... b_n, b_i ∈ {0, 1}, is performed by applying an XOR operation to the basic (unary) encoding of C_i. Thereby b is XORed with a fixed binary code – the so-called spreading code², which imposes well-defined redundancy on the basic and non-redundant code vector b. We used Barker codes [16] of different lengths as well as OVSF codes (cp. section 3.2) as spreading sequences of the form c = c_1 c_2 ... c_λ, c_i ∈ {0, 1}, L = λ · n, λ ∈ N, where λ is the spreading factor and L is the overall length of the resulting code.

Spreading Process   The spreading process is defined by the function spr, which convolutes an arbitrary bit vector b – representing the object class, for example – with a well-defined spreading code c.

spr(\vec{b}, \vec{c}) =
\begin{matrix}
xor(b_1, c_1) & xor(b_1, c_2) & \ldots & xor(b_1, c_\lambda) \\
xor(b_2, c_1) & xor(b_2, c_2) & \ldots & xor(b_2, c_\lambda) \\
\vdots & & & \\
xor(b_n, c_1) & & \ldots & xor(b_n, c_\lambda)
\end{matrix}
\qquad (4)

A demonstrative example is given in section 4.1. The spread spectrum technique is imposed as an additional encoding here, which significantly improves the type classification for the computed output signal o_{t+1}. Each class label C_j is assigned its own spreading sequence c_j such that all instances x_k ∈ C_j of the same class are encoded by c_j.
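As a concrete reading of the spreading function spr (formula 4), the following sketch XORs every bit of the basic encoding with the complete spreading code; the example values are the ones later used for the two MUSK2 classes in section 4.1, and the helper name is chosen here for illustration only.

```python
def spr(b, c):
    """Spread the basic encoding b (formula 4): every bit b_i is XORed with the
    complete spreading code c, giving a vector of length len(b) * len(c)."""
    return [bi ^ ci for bi in b for ci in c]

# Example from section 4.1: the two MUSK2 classes spread with Barker codes of length 4.
print(spr([1, 0], [1, 1, 0, 1]))  # [0, 0, 1, 0, 1, 1, 0, 1]
print(spr([0, 1], [1, 1, 1, 0]))  # [1, 1, 1, 0, 0, 0, 0, 1]
```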
3.2 Classification by Despreading

The data spreading in the form of an additional encoding causes redundancy depending on the fixed spreading factor λ ∈ N and thus also a drawback in computational efficiency. On the other hand, the obtained process gain justifies the insertion of redundancy. The process gain, which is visualized in figure 2, is originally defined as

PG := 10 \log_{10}\left(\frac{\text{carrier bandwidth}}{\text{information bandwidth}}\right)\,[\mathrm{dB}] \qquad (5)

and is measured in decibels [19]. When employed in terms of neural processing, the bandwidth is measured as the number of bits used to encode a class C_i, that is, the dimensionality of the spread target vector y_{t+1}. The insertion of redundancy is similar to adding parity information for error recognition in binary sequences, as done by Cyclic Redundancy Check (CRC) or Hamming codes. The strong benefit of our adaption of this technique is the achievable degree of discrimination between all existing classes C_j, C_k, j ≠ k. As a consequence, the correct class µ(x_t) = C_j can be determined with higher probability after the despreading process.

¹ Utilized by Code Division Multiple Access (CDMA) in UMTS.
² Also called chipping or spreading sequence.
Table 1. Lookup-table containing the basic encoding and the assigned spreading codes that are unique for each class.

  Class   Basic encoding   Spreading code
  C1      b_1 = (0,1)      c_1 = (1,0,1,0)
  C2      b_2 = (1,0)      c_2 = (1,0,0,1)

Figure 2. Visualization of the result of spreading and despreading and the obtained process gain in the analogy of signal transmission. The process gain becomes manifest in the amplitude of the despread signal in the right chart. PSD is the Power Spectral Density that specifies the power of a signal in an infinitesimal frequency band. The integral over all frequency portions gives the complete signal power.

Despreading Process   Let o := o_{t+1} = (o_1 o_2 ... o_{d_2}) be the predicted output vector, L ≤ d_2.

\theta_t(x) := \begin{cases} 1, & \text{if } x > t \\ 0, & \text{else} \end{cases} \qquad (6)

The modified Heaviside function θ_t(x) serves for digitalization of the numeric output signal; for the following equations, t is set to 0.5 and the short form θ(x) := θ_{0.5}(x) is used.

despr(\vec{o}, \vec{c}) =
\begin{matrix}
xor(\theta(o_1), c_1) & xor(\theta(o_2), c_2) & \ldots & xor(\theta(o_\lambda), c_\lambda) \\
xor(\theta(o_{\lambda+1}), c_1) & & \ldots & xor(\theta(o_{2\lambda}), c_\lambda) \\
\vdots & & & \\
xor(\theta(o_{((n-1)\lambda)+1}), c_1) & xor(\theta(o_{((n-1)\lambda)+2}), c_2) & \ldots & xor(\theta(o_L), c_\lambda)
\end{matrix}
= \tau_1 \tau_2 \ldots \tau_L, \quad \tau_i \in \{0, 1\} \qquad (7)

The despreading is done λ-blockwise, because each block τ_{(k·λ)+1} ... τ_{(k+1)·λ} of the spread output vector corresponds to a single bit of the original unspread representation. The uniqueness of the decoding b̂_k with respect to the actual bit b_k is considered as the distance of the prediction from maximal entropy, where no clear decision can be made, neither for 0 nor for 1.

bitSum[k] := \sum_{i=((k-1)\cdot\lambda)+1}^{k\cdot\lambda} \tau_i \qquad (8)

\hat{b}_k := \theta_{minV}(bitSum[k]), \quad minV := \frac{\lambda}{2}, \quad k \in \{1, \ldots, n\} \qquad (9)

Classification Certainty   The relative certainty cert for the k-th decoded bit is given by the distance from the mean value minV (minimum number of votes required for a "1"). The farther the result is separated from this mean value, the more unique the decoding is:

cert := \frac{1}{n \cdot minV} \sum_{k=1}^{n} |bitSum[k] - minV|.

Thus the despreading certainty for a voting consensus 00...0 or 11...1 of all τ_i in a λ-block corresponds to a certainty of 100% for the bit b_k to decode as b̂_k = 0 ∨ b̂_k = 1. Since different spreading codes c_i, c_j may have different lengths λ_i ≠ λ_j, the number of minimum votes varies but still does not influence the relative certainty cert. Different code lengths are allowed, because the minimal number n = ⌈log_2 |C|⌉ of bits to represent all classes in the respective dataset is fixed and known a priori.

Figure 3 illustrates the complete despreading process, starting with the unclassified feature vector as input for the RNN, the network prediction and the subsequent digital despreading process. The downstreamed despreading through the various xor-gates is repeated for all existing object classes. Thus the output signal is despread with all spreading codes and each result is compared with the lookup-table that holds the basic (unary) encodings of all object classes C. If there is exactly one match C_hit ∈ C, then this class label is returned. Otherwise it may occur that none or several of the despread sequences each match one symbol from the lookup-table. Then the most probable class label is predicted, which is determined by the classification certainty cert. This class usually leads with a high certainty overhang with regard to alternatively matching classes. The certainty measure usually enables a clear decision between multiple decodings b'_dec = b'_1 b'_2 ... b'_n and b''_dec, with b'_k ∈ {0, 1}. The following despreading example will illustrate the used formulas.

Despreading Example   When the network predicts the output vector o = (0.0024, 0.9998, 0.00035, 0.9999, 0.9997, 0.0023, 0.9998, 0.0022), this vector is first digitized by the appropriate Heaviside function: o_dig = (0,1,0,1,1,0,1,0). Then o_dig is despread with each of the existing spreading codes that represent the object classes.
Figure 3. Schematic processing steps and required gates for despreading of the predicted numerical output signal o_{t+1}.

despr((0,1,0,1,1,0,1,0), (1,0,0,1)) = (\underbrace{1,1,0,0}_{?}\,,\; \underbrace{0,0,1,1}_{?})

despr((0,1,0,1,1,0,1,0), (1,0,1,0)) = (\underbrace{1,1,1,1}_{1}\,,\; \underbrace{0,0,0,0}_{0})

Here, a 100% certainty for class C2 is reached by the second despreading, while the first code does not allow a unique decoding at all (0% certainty). Thus the classification is unique, and a final table-lookup in table 1 reveals the predicted class label.
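For illustration, the despreading and voting steps (formulas 6-9, the certainty measure cert, and the table-lookup sketched around figure 3) can be summarized as follows. The function names and the lookup structure are our own illustrative choices, not the implementation used for the experiments, and the classify step reflects our reading of the procedure: every despread result is compared against all basic encodings of the lookup-table.

```python
def despread(o, code, threshold=0.5):
    """Despread the numeric output vector o with one spreading code (formulas 6-9).
    Returns the decoded bits and the relative certainty cert (0 = ambiguous, 1 = unique)."""
    lam = len(code)                                   # spreading factor lambda
    n = len(o) // lam                                 # number of despread bits
    tau = [(1 if o[i] > threshold else 0) ^ code[i % lam] for i in range(n * lam)]
    min_v = lam / 2.0                                 # minimum number of votes for a "1"
    bits, deviation = [], 0.0
    for k in range(n):
        bit_sum = sum(tau[k * lam:(k + 1) * lam])     # formula 8
        bits.append(1 if bit_sum > min_v else 0)      # formula 9
        deviation += abs(bit_sum - min_v)
    return bits, deviation / (n * min_v)

def classify(o, lookup):
    """Despread with every spreading code, compare each result against the basic
    encodings of the lookup-table, and return the matching class with highest certainty."""
    best_cls, best_cert = None, -1.0
    for _, (_, code) in lookup.items():
        bits, cert = despread(o, code)
        for cls, (encoding, _) in lookup.items():
            if tuple(bits) == encoding and cert > best_cert:
                best_cls, best_cert = cls, cert
    return best_cls

# Despreading example from the text, using the lookup-table of table 1.
lookup = {"C1": ((0, 1), (1, 0, 1, 0)), "C2": ((1, 0), (1, 0, 0, 1))}
o = [0.0024, 0.9998, 0.00035, 0.9999, 0.9997, 0.0023, 0.9998, 0.0022]
print(despread(o, (1, 0, 0, 1)))   # ([0, 0], 0.0) -> no unique decoding
print(despread(o, (1, 0, 1, 0)))   # ([1, 0], 1.0) -> matches the basic encoding of C2
print(classify(o, lookup))         # C2
```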
The removal of the beforehand imposed redundancy results in the main advantage of the spread spectrum technique, which is robustness against external interferences like noise blurring the input data. Those effects are spread out (minimized) by the despreading process, which is also shown in the evaluation in the final section. For automatically generating a sufficient number of spreading codes, the concept of Orthogonal Variable Spreading Factor (OVSF) can be used as an alternative to Barker codes. For the separation of a higher number of different object classes, these are assigned unique OVSF codes, which hold appropriate correlation properties and can be recursively generated via a tree schema [19].
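The recursive tree construction referenced via [19] can be sketched as follows: each code spawns one child by repetition and one by repetition plus complement. The helper name is ours; the length-8 codes produced this way include the seven codes later assigned to the cover-type classes in section 4.2.

```python
def ovsf_codes(length):
    """Generate all OVSF spreading codes of the given length (a power of two)
    by the recursive tree schema: each code c has the children c|c and c|~c."""
    codes = [[1]]
    while len(codes[0]) < length:
        codes = [child
                 for c in codes
                 for child in (c + c, c + [1 - b for b in c])]
    return codes

for code in ovsf_codes(8):
    print(code)
# Apart from the all-ones code, these are the seven codes c_1 ... c_7 used in section 4.2.
```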
Autocorrelation: Correlation of a bipolar³ sequence c of N elements with all phase shifts n of itself.

\Phi_{\vec{c},\vec{c}}(n) = \frac{1}{N} \sum_{m=1}^{N} c[m] \cdot c[m+n]

A spreading code c holds a good autocorrelation if its inner product Φ_{c,c}(0) with itself is high and Φ_{c,c}(n) is low for all shifts n = 1 ... N−1.

Cross-correlation: Correlation of two sequences c and d, while d is shifted N times.

\Phi_{\vec{c},\vec{d}}(n) = \frac{1}{N} \sum_{m=1}^{N} c[m] \cdot d[m+n],

with n = 0 ... N−1 and c[i], d[i] ∈ {−1, 1}. For good separation properties between different classes C_i, C_j, the respective spreading codes c_i and c_j must have a low cross-correlation value Φ_{c_i,c_j}(n) for all shifts n = 1 ... N−1. If their cross-correlation is zero, then these codes are said to be fully orthogonal.

Barker codes hold an autocorrelation of Φ_{c,c}(0) = 1 in the unshifted case and Φ_{c,c}(n) = −1/N, N = dim(c), n = 1, ..., N−1, for the N−1 shifts of themselves, which is intended for a good recognition as well as distinction of phase shifts. In comparison to OVSF codes, the used Barker codes are also distinguished by their different lengths, which is a further separating property that compensates for not being fully orthogonal.

In signal transmission, good autocorrelation is essential to achieve synchronization between sender and receiver. Here, it is useful for recognizing the length of the employed spreading code in the decoding phase, since codes of different lengths are allowed for different classes. Due to its modular design, the RNN has the capability of processing variably dimensional vectors to learn the class labels spread by codes of different lengths.

³ −1, 1 instead of 0, 1.
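The two correlation measures can be computed as below; we read the shifted index cyclically (modulo N), which is one common interpretation of the formulas above, and the length-7 Barker code is used here only as an additional illustration of the stated −1/N property.

```python
def bipolar(bits):
    """Map a {0,1} spreading code to the bipolar {-1,+1} alphabet used in the correlation formulas."""
    return [1 if b else -1 for b in bits]

def correlation(c, d, n):
    """Cyclic correlation Phi_{c,d}(n) of two bipolar sequences of equal length N."""
    N = len(c)
    return sum(c[m] * d[(m + n) % N] for m in range(N)) / N

c1, c2 = bipolar([1, 0, 1, 0]), bipolar([1, 0, 0, 1])     # the two codes from table 1

# Cross-correlation of the table-1 codes is zero for every shift: fully orthogonal.
print([correlation(c1, c2, n) for n in range(4)])          # [0.0, 0.0, 0.0, 0.0]

# Autocorrelation of the length-7 Barker code: 1 at shift 0 and -1/7 for all other cyclic shifts.
barker7 = bipolar([1, 1, 1, 0, 0, 1, 0])
print([round(correlation(barker7, barker7, n), 3) for n in range(7)])
```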
3.3 Complexity Discussion

The spread spectrum enabled classification is efficient, since the despreading step itself can be performed in O(|C|) = O(r) in principle, where r is the number of classes – in comparison to s_r := Σ_{i=0}^{r} |C_i| as the number of instances in all classes of the given dataset (r << s_r). Furthermore, the number N_comp of weighted-sum computations (Σ_{i=1}^{d} x_i w_ij), w_ij ∈ M ∈ {A, B, C}, d ∈ {d_1, d_2}, in the (trained) neural network needed to compute the classification increases linearly in d_1. The growth depends on the fixed dimension h of the hidden state layer, which is constant: N_comp = h·d_1 + h·d_2 = h·(d_1 + (λ_max·r)). The factor λ_max is the length of the longest assigned spreading code. As described in section 3.1, d_2 = λ_max·r is the upper bound on the number of required bits to represent all target classes. The despreading process of figure 3, which itself requires d_2 = λ_max·r computations, has to be performed r times, once for each class. So the entire complexity to classify an input object is O((N_comp + d_2)·r) = O((h(d_1 + d_2) + d_2)·r) = O([h·d_1 + (h+1)·d_2]·r) = O((d_1 + d_2)·r) (dropping constant terms) = O(d_1·r + λ_max·r²), since the constant h determines the network resources and is widely independent of the object representation (with dimensionality d_1). Compared to a nearest neighbor approach, the term s_r for calculating all neighbor distances is omitted, while the factor r is added to the complexity and d_2 becomes λ_max times bigger.

4 Evaluation

The proposed spread spectrum classification based on a connectionist model is a novel approach. Therefore the accuracy of the classifier was evaluated on three standard classification benchmarks. The first one is the publicly available MUSK2 dataset [11], which is a binary distribution of molecules. In order to test the discrimination capability, we also applied our technique to a multi-class problem that requires r > 2 different spreading codes.

The first dataset describes molecules by their phenotypic appearance, which is called conformation. The dataset contains 6,598 different conformations of 102 molecules; 39 of these molecules are judged by human experts to be musks and the remaining 63 are judged to be non-musks. A single molecule is processed as a multi-represented object that contains all of its feature values. "Musk odor is a specific and clearly identifiable sensation, although the mechanisms underlying it are poorly understood. Musk odor is determined almost entirely by steric (i.e., 'molecular shape') effects (Ohloff, 1986)." [13]. These characteristics are captured as molecule conformations represented as feature vectors and typed by our typing mechanism.

We show the robustness of the proposed classifier by artificially imposing noise onto the untrained test instances within the respective test set. Thereby the performance of the neural classifier is compared for spread spectrum encoded and basic encoded (non-spread) class labels. Both approaches are evaluated under the influence of stepwise increased noise levels. Instead of applying the complex and non-intuitive spread spectrum process for imposing type information, one could also follow a straightforward approach by using the basic encoding of the r existing classes, while skipping the spreading step.

4.1 Binary Classification of Molecule Data

The simple approach unary encodes the d_2 := r considered classes by d_2-dimensional target vectors y, where (y_i = 1 ∧ y_j = 0, ∀ j ≠ i) ⇔ µ(x) = C_i. In this case the classification of the input object x via the output vector o := o_{t+1} is done by picking the maximum component o_max: µ(x) = C_max ⇔ o_max = max{o_1, ..., o_{d_2}}. This solution is less powerful than the spread spectrum classification variant. To actually show this hypothesis, we compared both techniques with regard to their robustness against noise.
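The unary baseline therefore reduces to an argmax over the output components; a minimal sketch (names chosen for illustration):

```python
def classify_unary(o):
    """Baseline without spreading: the predicted class is the index of the largest output component."""
    return max(range(len(o)), key=lambda i: o[i])   # index of C_max

print(classify_unary([0.12, 0.87, 0.05]))  # 1, i.e. class C_2
```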
Training Data Representation   The molecules of the MUSK2 dataset to be classified by our connectionist classifier appear in different conformations, which are described by 166 features. This information is used to build a straightforward feature vector representation.

\vec{x}_t = (\underbrace{x_1, \ldots, x_{162}}_{f_1 \text{ to } f_{162}}, \underbrace{x_{163}, \ldots, x_{166}}_{f_{163} \text{ to } f_{166}}), \quad x_i \in \mathbb{R} \qquad (10)

The real-valued attributes f_1 to f_162 are distances that are measured in hundredths of Angstroms. The distances may be negative or positive.

Depending on the chosen spreading factor λ, the spreading process computes a (2·λ)-bit target vector in case of the 2 classes Musk and Non-Musk. The following example spreads each class by the Barker codes (1,1,0,1) and (1,1,1,0) of the same length λ = 4, using the spreading function spr defined in section 3.1.

spr((1, 0), (1, 1, 0, 1)) = (0, 0, 1, 0, 1, 1, 0, 1)
spr((0, 1), (1, 1, 1, 0)) = (1, 1, 1, 0, 0, 0, 0, 1)

The training and test patterns both hold the form input ↦ target, while the test patterns were excluded from training. For the presented evaluation a spreading factor of 16 was chosen.

r(o) ↦ spr(e_i, c_i)  ⇔  (x_1, ..., x_{166}) ↦ spr(e_i, (c_{i1}, ..., c_{i16}))
Table 2. Average classification accuracy for the MUSK2 dataset for different noise levels after 10-fold cross-validation. The measure Variation indicates the range of the obtained accuracy values over the 10 individual test sets. For all predictions the network was trained till a residual error level of ≤ 1.5%. Best accuracy is in bold.

MUSK2 – Accuracy [%] of classification by Spread Spectrum for uniform noise n
  Noise n     n=0%            n≤5%            n≤10%           n≤15%           n≤20.0%
  Mean        97.26           96.42           93.47           77.74           88.03
  Variation   [95.15 - 98.03] [94.24 - 97.73] [90.15 - 95.75] [58.79 - 95.14] [84.55 - 90.61]
  Noise n     n≤22.5%         n≤25.0%         n≤27.5%         n≤35.0%         n≤50.0%
  Mean        82.95           75.89           37.37           67.02           61.05
  Variation   [61.67 - 95.15] [61.21 - 88.03] [23.82 - 73.48] [30.15 - 82.42] [21.21 - 81.52]

MUSK2 – Accuracy [%] of classification by Basic Encoding for uniform noise n
  Noise n     n=0%            n≤5%            n≤10%           n≤15%           n≤20.0%
  Mean        97.73           96.56           93.56           74.08           80.04
  Variation   [93.94 - 99.85] [94.24 - 97.88] [90.15 - 96.52] [54.24 - 90.91] [28.94 - 92.42]
  Noise n     n≤22.5%         n≤25.0%         n≤27.5%         n≤35.0%         n≤50.0%
  Mean        80.04           66.79           30.54           59.34           52.35
  Variation   [28.94 - 92.42] [25.91 - 92.12] [16.69 - 62.88] [29.09 - 86.06] [21.09 - 87.88]

The function r creates the [0, 1]-scaled feature vector representation of any multi-represented object o that consists of categorical or metric features. When the spreading code is of length λ = 16 and there are 2 classes C = {C1, C2}, the spreading process computes a 32-bit (16·2) target vector. Ten-fold cross-validation was used to obtain significant accuracy measurements, so the 6,598 objects were divided into 10 disjoint test sets. Additionally, from each of the 10 corresponding training sets of 5,641 patterns, a fraction of 297 patterns (1/20) was split off and used for checking the connectionist model quality and to determine when to stop the training. Network training was stopped when the training error was below 1.5% and the model quality was just declining after continuously rising. The average model quality on the 10 auxiliary sets that were excluded from network training was 97.51% for the basic encoding approach and 99.85% for the spread spectrum variant. Table 2 shows the results of the evaluation process for different degrees of imposed noise and different spreading factors. Uniformly distributed noise n interferes with the [0, 1]-scaled numeric input pattern by adding the noise realizations to each of its components: x_i ± n_i, x_i ∈ [0, 1], n_i ∈ [0.0, n·1.0], n ∈ {5%, 10%, 15%, ...}; for example, (0.65, 0.34, 0.86, 0.07, 0.95, ...) is distorted to (1.34, 0.10, 1.05, −0.23, 1.49, ...).
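This noise model can be read as follows; the sketch reflects our interpretation of the described procedure (a uniform amplitude up to the level n, added or subtracted per component), with names chosen for illustration.

```python
import random

def add_uniform_noise(x, level, rng=random):
    """Distort a [0,1]-scaled feature vector: add or subtract a uniform noise realization
    with amplitude up to `level` to every component (e.g. level=0.25 for n <= 25%)."""
    return [xi + rng.choice((-1, 1)) * rng.uniform(0.0, level) for xi in x]

random.seed(1)
print(add_uniform_noise([0.65, 0.34, 0.86, 0.07, 0.95], 0.5))
```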

The result clearly shows the high robustness of the spread spectrum supported classification, which is less affected by the increasing noise level than the unspread variant. Noise of n ≤ 27.5% leads to outlying accuracies for both techniques, since the classification accuracy drops down to 30.54% for basic encoding and 37.37% for the spread spectrum variant. This is reasonable, since with a certain (low) probability noise of lower amplitudes but with the "right" signs can do more harm than a higher noise level that does not displace the feature values as much, due to the sign of each noise component. Summing up the accuracy over all noise levels, the spread spectrum technique holds an overall accuracy of 77.72% and thus outperforms the simple approach, which achieves 72.24%, in several cases reaching an advance of even more than 10%. Furthermore, the lower classification variance of the spread spectrum technique is a sign of its higher confidence and discrimination between the classes.

As a reference, the algorithm Iterated Discrimination, which belongs to the class of Axis Parallel Rectangle (APR) methods, was used by Dietterich et al. [15] to classify the MUSK1 pharmaceutical dataset. MUSK1 is related to the MUSK2 benchmark, and the APR method achieved an accuracy of 92.4% thereupon. Dietterich et al. [13] achieved their best result when using a domain-specific (dynamic reposing) neural network approach for classifying the molecules of the MUSK2 dataset. Without imposing noise on the objects, 10-fold cross-validation led to a maximal accuracy of 91%, which is almost 7% below the result at hand.

4.2 Multi-Class Classification of Forest Data

The second dataset stems from geography and deals with forest cover types [7].
By virtue of 12 metric and categorical attributes, the 7 cover types Spruce/Fir, Krummholz, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen and Douglas-fir should be classified. The dataset, consisting of 581,012 instances, was originally obtained from the US Geological Survey (USGS). We have chosen this dataset since it provides a multi-class problem for evaluating the discriminative capability of the spread spectrum technique in case of more than two classes. Furthermore, a neural network approach was already conducted upon this dataset, which achieved a classification accuracy of about 70.0% [7]. Blackard's connectionist classification (70.58%), which was generated by model averaging using thirty networks with randomly selected initial weights, significantly outperformed the alternative linear discriminant analysis model (58.38%).

The forest data was elicited from cartographic material only and describes the respective cover type by characteristic attributes like elevation, aspect, slope, horizontal and vertical distance to the nearest hydrology (water surface) and others, measured in meters, azimuth or degrees. Each instance represents a 30 x 30 meter cell, which was labeled with one of seven forest cover types (classes). A significance test showed that the 44 binary variables indicating the wilderness area and the soil type can be disregarded, so we excluded these variables from training – additionally improving training efficiency by the reduced input dimensionality. The underlying dataset together with a detailed description can be obtained from Blackard [7].

Again we compared the classification performance of our robust classification method with a support vector classifier under the influence of noise, based on the same training and test instances. We used the same network topology as before with a hidden layer dimension of h = 30. Seven distinct OVSF codes served for encoding of the existing cover type classes: c_1 := (1,0,0,1,0,1,1,0), C_1 := "Douglas-fir"; c_2 := (1,0,0,1,1,0,0,1), C_2 := "Aspen"; c_3 := (1,0,1,0,0,1,0,1), C_3 := "Cottonwood/Willow"; c_4 := (1,0,1,0,1,0,1,0), C_4 := "Ponderosa Pine"; c_5 := (1,1,0,0,0,0,1,1), C_5 := "Lodgepole Pine"; c_6 := (1,1,0,0,1,1,0,0), C_6 := "Spruce/Fir"; c_7 := (1,1,1,1,0,0,0,0), C_7 := "Krummholz".

The cross-correlation among these OVSF codes is actually zero, Φ_{c_i,c_j}(n) = 0, i ≠ j, i, j = 1, ..., 7, n = 0, ..., N−1, which makes them fully orthogonal. The output space, which is 7-dimensional for unary encoding, becomes (7 · 8 = 56)-dimensional in case of spread spectrum output coding, since the required OVSF code length is dim(c_i) = 8, i = 1, ..., 7. The classification accuracies are shown in table 3 and a direct comparison with an SVM classifier is given in table 4.

Table 4. Comparison of the classification accuracy of the Spread Spectrum and the Support Vector Machine (SVM) classification of the forest cover type dataset. The performance advantage ∆ of our novel classification technique is significantly positive for all degrees of uniform noise except one. All values are given in percent [%].

  Noise Level   Spread Spectrum   SVM     ∆
  0             85.91             66.28   + 19.63
  5             58.94             33.96   + 24.98
  10            42.00             35.85   +  6.15
  15            31.33             18.88   + 12.45
  20            26.29             32.08   -  5.79

As expected, the results of the spread spectrum approach are even more convincing for the multi-class problem, simply due to the lower success probability P(K(o) = C(o)) = 1/7 for a test object o and a classifier K (equal distribution assumed). The output encoding by spreading codes with well-defined correlation properties enhances the classification ability by the sharper class discrimination that prevents misclassification more effectively. The highest accuracy of 85.91% was realized by the spread spectrum technique based on a test set consisting of 1,050 randomly chosen instances.

In order to compare the results with an advanced classification technique, we trained a support vector machine (SVM)⁴ classifier on exactly the same training set. Without noise, 10-fold cross-validation resulted in a classification accuracy of 66.28%, which means that the SVM approach falls short by almost 20% compared to the performance of our connectionist technique. Under the influence of the same uniformly distributed noise, the performance (percentage of correctly classified test instances) of the SVM classifier dropped dramatically to about 34% with 5% noise, 35.85% with 10% noise, 18.88% with 15% noise and 32.08% with 20% noise. This underpins the difficulty of reliably classifying feature vectors when they are exposed to considerable noise portions, which is better achieved by the connectionist classifier based on spread spectrum.

⁴ The SVM classifier (SMO) from the Weka data mining package was used [24].

4.3 Classification of Diabetes Patients

Finally, we employed our method for a medical classification task based on a binary distribution of Pima Indian diabetes patients. The dataset consisted of 768 instances with 8 attributes, each instance belonging to one of the classes healthful (65.1%) and diseased (34.9%) [9, 5]. Again we evaluated the classification performance for different degrees of noise by ten-fold cross-validation. The results are presented in table 5.
Table 3. Average classification accuracy for the forest cover type dataset for different noise levels after 10-fold cross-validation. The cumulated advance of the spread spectrum technique compared to the basic encoding amounts to 18.33% for all considered noise levels. For all predictions the network was trained till a residual error level of ≤ 7.9%. Best accuracy is in bold.

Forest CoverType – Accuracy [%] of classification by Spread Spectrum for uniform noise n
  Noise n     n=0%            n≤5%            n≤10%           n≤15%           n≤20.0%         n≤55.0%
  Mean        85.91           58.94           42.00           31.33           26.29           27.91
  Variation   [80.19 - 91.51] [51.43 - 66.98] [23.81 - 49.52] [23.58 - 39.05] [13.33 - 38.68] [19.81 - 44.76]

Forest CoverType – Accuracy [%] of classification by Basic Encoding for uniform noise n
  Noise n     n=0%            n≤5%            n≤10%           n≤15%           n≤20.0%         n≤55.0%
  Mean        83.94           54.37           45.53           29.13           20.60           21.05
  Variation   [75.47 - 89.62] [40.57 - 67.92] [33.96 - 57.55] [19.81 - 42.45] [3.81 - 32.08]  [12.26 - 33.96]

Table 5. Comparison of the classification accuracy in [%] based on the diabetes patient dataset with and without the Spread Spectrum technique. The performance advantage ∆ of the Spread variant compared to the Unary approach is significantly positive for all noise levels except one. Both uniform (σ² ≤ 0.082) and Gaussian noise (σ² ≤ 0.25) are evaluated.

  Uniform N.    Spread Spectrum   Unary   ∆
  0             74.48             68.75   +  5.73
  5             75.52             67.71   +  7.81
  10            75.00             67.71   +  7.29
  25            74.48             67.71   +  6.77
  40            60.94             53.13   +  7.81
  50            68.75             57.29   + 11.46

  Gaussian N.   Spread Spectrum   Unary   ∆
  5             71.86             71.86   +  0.00
  10            72.40             69.27   +  3.13
  15            70.83             67.71   +  3.12
  25            57.29             55.21   +  2.08
  40            60.94             64.06   -  3.12
  50            63.54             48.44   + 15.10

We also used Gaussian instead of uniformly distributed noise to distort the input vectors, which was generated by the Box-Muller method [8]. Gaussian white noise with a higher variance led to a lower performance advantage compared to unary encoding on average. Anyhow, the spread spectrum method outperforms the alternative classification technique for the third time, especially in terms of noise robustness.
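For completeness, a sketch of the Box-Muller transform used to generate the Gaussian noise realizations; this is the standard transform from [8], not code taken from our implementation.

```python
import math, random

def box_muller(mu=0.0, sigma=1.0, rng=random):
    """Generate one Gaussian random number from two uniform variates (Box-Muller transform)."""
    u1 = 1.0 - rng.random()          # avoid log(0)
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z

# Gaussian noise with variance sigma^2 <= 0.25 (sigma = 0.5), as evaluated in table 5.
noisy = [x + box_muller(0.0, 0.5) for x in (0.65, 0.34, 0.86)]
```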
5 Conclusion

We have developed a robust connectionist classification aimed at arbitrary domain objects based on a recurrent neural network (RNN). Our spread spectrum typing and classification mechanism improves the robustness of the RNN as a classifier for multi-represented objects. In general, sequences consisting of objects from different classes can be learned by the network. In conclusion, we developed two main functionalities:

Learning / Classifying Multi-Represented Objects: We demonstrated that objects containing heterogeneous features can be learned and classified by the connectionist system. The spread spectrum mechanism can be used to represent the class membership of domain objects in a robust way.

Classification Despite Noise or Incomplete Data: The RNN with spread spectrum technique is capable of reliably classifying incomplete or distorted domain objects, which are exposed to significant noise portions.

The high classification robustness is achieved by the integrated spreading mechanism, which is borrowed from state-of-the-art code multiplexing in mobile communication and adapted to neural information processing. We evaluated the connectionist classifier on the basis of three benchmark datasets. The proposed spread spectrum mechanism was superior to the unspread variant regarding its robustness against noisy and sparse data. For the MUSK2 dataset, it is remarkable that both variants outperform the neural network approach from Dietterich et al. [15] in case of absent noise, which achieved 91.0% accuracy instead of 97.26%. Furthermore, the new technique was successfully applied to a multi-class dataset on forest cover types, whose instances should be classified based on ten topological attributes. The spread spectrum variant achieved both higher accuracy and robustness against uniformly distributed noise of different levels. We also compared our new classification model with an advanced SVM classifier, whose classification accuracy collapsed by about 32% when affected by uniform noise, and thus provides evidence for the need for robust classification methods. We conclude that our technique can be used to improve the robustness of any connectionist or statistical classifier.
6 Acknowledgments

This paper would not have been possible without the support of B. B. from XX, who contributed to the problem definition and the structure of this paper.

References

[1] E. Achtert, C. Böhm, J. David, P. Kröger, and A. Zimek. Noise robust clustering in arbitrarily oriented subspaces. In Proceedings of the SIAM Conference on Data Mining (SDM 08), Society for Industrial and Applied Mathematics. Institute for Informatics, Ludwig-Maximilians-Universität München, Germany, 2008.
[2] E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, and A. Zimek. Robust, complete, and efficient correlation clustering. In Proceedings of the 7th SIAM International Conference on Data Mining (SDM), Minneapolis, MN. Institute for Informatics, Ludwig-Maximilians-Universität München, Germany, 2007.
[3] E. Achtert, H.-P. Kriegel, A. Pryakhin, and M. Schubert. Hierarchical density-based clustering for multi-represented objects. In Workshop on Mining Complex Data (MCD'05), ICDM, Houston, TX. Institute for Computer Science, University of Munich, 2005.
[4] C. Andren. Short PN sequences for direct sequence spread spectrum radios. Harris Semiconductor, Palm Bay, Florida, http://www.sss-mag.com/pdf/shortpn.pdf, 1997.
[5] K. Bennett and J. Blue. A support vector machine approach to decision trees. R.P.I. Math Report 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997.
[6] A. Berger. Error-correcting output coding for text classification. In IJCAI'99: Workshop on Machine Learning for Information Filtering, 1999.
[7] J. A. Blackard. Comparison of neural networks and discriminant analysis in predicting forest cover types. Department of Forest Sciences, Colorado State University, Fort Collins, Colorado, 1998. http://www.cormactech.com/neunet/sampdata/forests.html.
[8] G. E. P. Box and M. E. Muller. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, volume 29, pages 610-611, 1958.
[9] A. Bulsari et al. Neural networks in medical diagnosis: Comparison with other methods. In Proceedings of the International Conference EANN '96, pages 427-430, 1996.
[10] R. Callan. Neuronale Netze im Klartext. Pearson Studium, 2003.
[11] D. Chapman and A. Jain. Musk "clean2" database. Technical report, AI Group at Arris Pharmaceutical Corporation, 1994.
[12] J.-L. Chen and J.-H. Wang. A new robust clustering algorithm – density-weighted fuzzy c-means. Volume 3, pages 90-94, 1999.
[13] T. G. Dietterich, A. Jain, R. H. Lathrop, and T. Lozano-Perez. A comparison of dynamic reposing and tangent distance for drug activity prediction. In Advances in Neural Information Processing Systems, San Mateo, CA, pages 216-223. Morgan Kaufmann, 1994.
[14] T. G. Dietterich and E. B. Kong. Error-correcting output coding corrects bias and variance. In International Conference on Machine Learning, pages 313-321, 1995.
[15] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, volume 89, pages 31-71. Addison-Wesley, Bonn, 1997.
[16] J. Fakatselis. Processing gain for direct sequence spread spectrum communication systems and PRISM, http://www.qsl.net/n9zia/pdf/an9633.pdf. Technical report, Intersil Corporation, Melbourne, 1996.
[17] H. Frigui and R. Krishnapuram. A robust clustering algorithm based on competitive agglomeration and soft rejection of outliers. In Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96), page 550. IEEE Computer Society, Washington, DC, USA, 1996.
[18] R. Ghani. Using error-correcting codes for text classification. In Proceedings of ICML-00, 17th International Conference on Machine Learning, 2000.
[19] A. Küpper. Mobile Communications 1: Multiplexing and Modulation, http://www.mobile.ifi.lmu.de/vorlesungen/ss06/mk/chapter4.pdf. Technical report, Mobile and Distributed Systems Group, University of Munich, Germany, 2004.
[20] Y. Liu. Using SVM and error-correcting codes for multiclass dialog act classification in meeting corpus. In INTERSPEECH 2006 – ICSLP, 2006.
[21] R. Rifkin, K. Schutte, M. Saad, J. Bouvrie, and J. Glass. Noise robust phonetic classification with linear regularized least squares and second-order features. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), volume 4, 2007.
[22] M. Skurichina, L. Kuncheva, and R. P. W. Duin. Bagging and boosting for the nearest mean classifier: Effects of sample size on diversity and accuracy. In Multiple Classifier Systems, 2002.
[23] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, volume 78, pages 1550-1560, 1990.
[24] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[25] H. Xu, Z.-H. Tan, P. Dalsgaard, and B. Lindberg. Robust speech recognition from noise-type based feature compensation and model interpolation in a multiple model framework. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), volume 1, 2006.
[26] M.-S. Yang and K.-L. Wu. A similarity-based robust clustering method. Volume 26, pages 434-448, 2004.
