Abstract— In this paper we discuss in detail a recently proposed kernel-based version of the recursive least-squares (RLS) algorithm for fast adaptive nonlinear filtering. Unlike other previous approaches, the studied method combines a sliding-window approach (to fix the dimensions of the kernel matrix) with conventional ridge regression (to improve generalization). The resulting kernel RLS algorithm is applied to several nonlinear system identification problems. Experiments show that the proposed algorithm is able to operate in a time-varying environment and to adjust to abrupt changes in either the linear filter or the nonlinearity.

Index Terms— kernel methods, kernel RLS, sliding-window, system identification

I. INTRODUCTION

A considerable number of kernel methods have been proposed in recent years, including support vector machines [1], kernel principal component analysis [2], kernel Fisher discriminant analysis [3] and kernel canonical correlation analysis [4], [5], with applications in classification and nonlinear regression problems. In their original forms, most of these algorithms cannot operate online, since the kernel formulation introduces a number of difficulties, such as growing time and memory complexities (because of the growing kernel matrix) and the need to avoid overfitting.

Recently a kernel RLS algorithm was proposed that dealt with both difficulties [6]: by applying a sparsification procedure, the kernel matrix size was limited and the order of the problem was reduced. This kernel RLS algorithm allows online training but cannot handle time-varying data. Other approaches to reduce the order of the kernel matrix have also been proposed, including a recent kernel principal component analysis (PCA) algorithm [6] which is able to track time-varying data.

In this paper we discuss a different approach, applying conventional ridge regression instead of reducing the order of the feature space. A sliding-window technique is applied to fix the size of the kernel matrix in advance, allowing the algorithm to operate online in time-varying environments.

After describing the proposed sliding-window kernel RLS algorithm, we discuss its application to different nonlinear system identification problems. First, it is applied to identify a Wiener system, a simple nonlinear model that typically appears in satellite communications [7] or digital magnetic recording systems [8]. Then we apply the kernel RLS algorithm to a nonlinear regression problem. Apart from its applications in system identification, the proposed sliding-window kernel RLS algorithm can also be used as a building block for other, more complex adaptive algorithms. In particular, it has been applied to develop a sliding-window kernel canonical correlation analysis (CCA) algorithm [9].

The rest of this paper is organized as follows: in Section II the least-squares criterion is briefly explained. Kernel methods are introduced in Section III, and a detailed description of the sliding-window algorithm is given in Section IV. In Section V the algorithm is applied to two nonlinear system identification problems, and Section VI summarizes the main conclusions of this work.

This paper is based on "A Sliding-Window Kernel RLS Algorithm and its Application to Nonlinear Channel Identification," by S. Van Vaerenbergh, J. Vía, and I. Santamaría, which appeared in the Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), Toulouse, France, May 2006. © 2006 IEEE. This work was supported by MEC (Ministerio de Educación y Ciencia) under grant TEC2004-06451-C05-02 and FPU grant AP2005-5366.

II. REGULARIZED LEAST-SQUARES

A. Least-Squares Problem

The least-squares (LS) criterion [10] is a widely used method in signal processing. Given a vector y ∈ R^{N×1} and a data matrix X ∈ R^{N×M} of N observations, it consists in seeking the optimal vector h ∈ R^{M×1} that solves

    J = min_h ‖y − Xh‖².    (1)

When the number of observations N is at least equal to the number of unknowns M, the least-squares problem is overdetermined and has either one unique solution or an infinite number of solutions. If X^T X is a non-singular matrix, the unique solution of the least-squares problem is given by

    ĥ = (X^T X)^{-1} X^T y.    (2)

If X is rank deficient, X^T X is a singular matrix, corresponding to an LS problem with infinitely many solutions. For full-rank matrices X with N ≥ M, the solution ĥ can be represented in the basis defined by the rows of X. Hence it can also be written as h = X^T a, making it a linear combination of the input patterns (this is sometimes denoted as the "dual representation").
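As a quick numerical illustration of (1) and (2), the following minimal NumPy sketch (variable names and problem sizes are ours, chosen only for illustration) builds an overdetermined LS problem, applies the closed-form solution (2), and cross-checks it against a library solver:

    import numpy as np

    # Minimal numerical check of Eqs. (1)-(2); sizes are illustrative only.
    rng = np.random.default_rng(0)
    N, M = 20, 4                          # N observations, M unknowns (overdetermined)
    X = rng.standard_normal((N, M))       # data matrix X in R^{N x M}
    h_true = rng.standard_normal(M)
    y = X @ h_true + 0.01 * rng.standard_normal(N)

    # Eq. (2): closed-form LS solution, valid when X^T X is non-singular
    h_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Cross-check against a library solver of problem (1)
    h_ref = np.linalg.lstsq(X, y, rcond=None)[0]
    assert np.allclose(h_hat, h_ref)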
To improve generalization and increase the smoothness of the solution, a regularization term is often included in the standard least-squares formulation (1):

    J = min_h [ ‖y − Xh‖² + c h^T h ],    (3)

where c is a regularization constant. This problem is known as regularized least-squares, and its solution is given by

    ĥ = (X^T X + cI)^{-1} X^T y,    (4)

where I is the identity matrix.

The LS solution can also be updated recursively as new observations become available, where x_j^T denotes the j-th row of X, for a growing number of observations N. For a complete formulation of this algorithm, we refer to [10].

III. KERNEL METHODS

In recent years, the framework of reproducing kernel Hilbert spaces (RKHS) has led to the family of algorithms known as kernel methods [12]. These algorithms use Mercer kernels to produce nonlinear versions of conventional linear algorithms.

Kernel methods are based on the implicit nonlinear transformation of the data x_i from the input space to a high-dimensional feature space, x̃_i = Φ(x_i). The key property of kernel methods is that scalar products in the feature space can be seen as nonlinear (kernel) functions of the data in the input space. Since this property avoids the explicit mapping to the feature space, any conventional algorithm based on scalar products can be performed in the feature space by simply replacing the scalar products with the Mercer kernel function evaluated in the input space.

In this way, any positive definite kernel function satisfying Mercer's condition [1], κ(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩, has an implicit mapping to some higher-dimensional feature space. This simple and elegant idea is known as the "kernel trick", and it is commonly applied by using a nonlinear kernel such as the Gaussian kernel

    κ(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ),

which implies an infinite-dimensional feature space, or the polynomial kernel of order p,

    κ(x_i, x_j) = (1 + x_i^T x_j)^p.

A. Kernel LS

A nonlinear "kernel" version of the linear LS algorithm can be obtained by transforming the data into feature space. Using the transformed vector h̃ ∈ R^{M'×1} and the transformed data matrix X̃ ∈ R^{N×M'}, the LS problem (1) can be written in feature space as the minimization of ‖y − X̃h̃‖² over h̃. Its solution can be expressed in terms of the kernel matrix K = X̃X̃^T, whose elements are the kernels κ(x_i, x_j), where x_i^T and x_j^T are the i-th and j-th rows of X. As a consequence, the computational complexity of operating in this high-dimensional space is not necessarily larger than that of working in the original low-dimensional space.

B. Measures Against Overfitting

For most useful kernel functions, the dimension of the feature space, M', will be much higher than the number of available data points N (for instance, when the Gaussian kernel is used, the feature space has dimension M' = ∞). As mentioned in Section II, the kernel LS problem (8) could have an infinite number of solutions in these cases, due to the rank deficiency of X̃. A number of techniques to handle this overfitting have been presented.

1) Kernel RLS with sequential sparsification: In [6] a sparsification process was proposed in which an input sample is only admitted into the kernel matrix if its image in feature space cannot be sufficiently well approximated by a combination of the previously admitted samples. In this way the resulting kernel RLS algorithm reduces the order of the feature space [4], [5]. A related kernel LS algorithm was recently presented in [13].

2) Kernel RLS by kernel PCA: Another approach to reduce the order of the feature space is to apply principal component analysis (PCA) [14]. PCA is a technique based on the singular value decomposition (SVD) that extracts the dominant eigenvectors of a matrix.
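The two kernel functions above can be evaluated directly in the input space, which is all that is needed to build the kernel matrix K = X̃X̃^T. The following short sketch (function and variable names are ours, not taken from the paper) illustrates this for the Gaussian and polynomial kernels:

    import numpy as np

    def gaussian_kernel(X, Y, sigma):
        """Gaussian kernel k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
        d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        d2 = np.maximum(d2, 0.0)              # guard against tiny negative round-off
        return np.exp(-d2 / (2 * sigma**2))

    def polynomial_kernel(X, Y, p):
        """Polynomial kernel of order p: k(x_i, y_j) = (1 + x_i^T y_j)^p."""
        return (1 + X @ Y.T) ** p

    # Kernel matrix evaluated implicitly in the input space (the "kernel trick"):
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 3))
    K = gaussian_kernel(X, X, sigma=1.0)      # 6 x 6, no explicit feature mapping needed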
3) Kernel ridge regression: A further way to regularize the solution of the kernel RLS problem is to penalize the norm of the solution h̃ as in (3). This (kernel) ridge regression approach yields the following problem:

    J'' = min_h̃ [ ‖y − X̃h̃‖² + c h̃^T h̃ ].    (10)
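For reference, problem (10) admits the standard dual-form solution of kernel ridge regression, in which only the N × N kernel matrix appears. The sketch below illustrates this general idea; it is not the paper's exact formulation (the specific regularized kernel matrix used by the sliding-window algorithm is defined in Section IV, which is not reproduced in this text), and the helper names are ours:

    import numpy as np

    def kernel_ridge_fit(K, y, c):
        """Dual coefficients alpha = (K + c I)^{-1} y, with K the N x N kernel matrix."""
        N = K.shape[0]
        return np.linalg.solve(K + c * np.eye(N), y)

    def kernel_ridge_predict(k_new, alpha):
        """Prediction for a new input x, given k_new[i] = kappa(x_i, x)."""
        return k_new @ alpha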
Algorithm 1  Summary of the proposed adaptive algorithm.
    Initialize K_0 and calculate K_0^{-1}.
    for n = 1, 2, ... do
        Downsizing: obtain K̂_{n-1} out of K_{n-1}.
        Calculate K̂_{n-1}^{-1} according to Eq. (15).
        Upsizing: obtain K_n according to Eq. (13).
        Calculate K_n^{-1} according to Eq. (16).
        Obtain the updated solution α_n = K_n^{-1} y_n.
    end for

[Figure 2. Block diagram of the Wiener system: the input x_n is filtered by the linear channel H(z), producing u_n, which is passed through the memoryless nonlinearity f(·) to give v_n; additive noise yields the observed output y_n.]

[Figure 3. Signal in the middle of the Wiener system (u_n) vs. output signal (v_n) for binary input symbols and different indices n.]
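The following Python sketch mirrors the structure of Algorithm 1. Since Eqs. (13), (15) and (16) are not reproduced in this text, the downsizing and upsizing inverse updates below use the standard partitioned-inverse (Schur complement) identities, which are consistent with the relations given in the appendix on upsized matrices; the regularization is simplified here to adding c·I to the kernel matrix, and the kernel function is passed in (for example, one of the kernels sketched in Section III). Treat it as an illustrative reconstruction, not the paper's exact implementation.

    import numpy as np

    def downsize_inverse(K_inv):
        """Inverse of K with its first row and column removed, computed from K_inv."""
        e = K_inv[0, 0]
        f = K_inv[1:, 0]
        G = K_inv[1:, 1:]
        return G - np.outer(f, f) / e

    def upsize_inverse(A_inv, b, d):
        """Inverse of [[A, b], [b^T, d]] given A_inv (cf. the appendix relations)."""
        g = 1.0 / (d - b @ A_inv @ b)
        f = -g * (A_inv @ b)
        E = A_inv + np.outer(f, f) / g
        return np.block([[E, f[:, None]], [f[None, :], np.array([[g]])]])

    def sliding_window_kernel_rls(X, y, N, kernel, c):
        """Keep the last N samples; update K_n^{-1} and alpha_n = K_n^{-1} y_n online."""
        Xw, yw = X[:N], y[:N]
        K = kernel(Xw, Xw) + c * np.eye(N)
        K_inv = np.linalg.inv(K)                        # initialization: K_0, K_0^{-1}
        alphas = []
        for n in range(N, len(y)):
            K_inv = downsize_inverse(K_inv)             # drop the oldest sample
            Xw = np.vstack([Xw[1:], X[n]])
            yw = np.append(yw[1:], y[n])
            b = kernel(Xw[:-1], X[n:n + 1]).ravel()     # kernels vs. the new sample
            d = kernel(X[n:n + 1], X[n:n + 1])[0, 0] + c
            K_inv = upsize_inverse(K_inv, b, d)         # add the new sample
            alphas.append(K_inv @ yw)                   # alpha_n = K_n^{-1} y_n
        return alphas

    # Example usage with a polynomial kernel of order 3 (cf. Section V) and c = 0.1:
    # alphas = sliding_window_kernel_rls(X_embedded, y, N=150,
    #                                    kernel=lambda A, B: (1 + A @ B.T) ** 3, c=0.1)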
... an N × N matrix filled with ones, and I is the unit matrix. The complete algorithm is summarized in Alg. 1.

V. EXPERIMENTS

In this section we investigate the performance of the sliding-window kernel RLS algorithm in identifying two nonlinear systems. In both cases the results are compared to online identification by a multi-layer perceptron (MLP), which is a standard approach for identifying nonlinear systems. Since kernel methods provide a natural nonlinear extension of linear regression methods [12], the proposed system is expected to perform well compared to the MLP.

[Figure 4. MSE of the identification of the nonlinear Wiener system of Fig. 2, for the standard method using an MLP and for the window-based kernel RLS algorithm with window length N = 150. A change in filter coefficients of the nonlinear Wiener system was introduced after sending 500 data points.]
A. Identification of a Wiener System with an Abrupt Channel Change

The Wiener system is a well-known and simple nonlinear system which consists of a series connection of a linear filter and a memoryless nonlinearity (see Fig. 2). Such a nonlinear channel can be encountered in digital satellite communications and in digital magnetic recording. Traditionally, the problem of blind nonlinear equalization or identification has been tackled by considering nonlinear structures such as MLPs [16], recurrent neural networks [17], or piecewise linear networks [18].

We consider a supervised identification problem in which, moreover, the linear channel coefficients are changed abruptly at a given time instant, in order to compare the tracking capabilities of both algorithms. During the first part of the simulation the linear channel is H1(z) = 1 − 0.3668z^{-1} − 0.4764z^{-2} + 0.8070z^{-3}, and after 500 symbols it is changed into H2(z) = 1 − 0.8326z^{-1} + 0.6656z^{-2} + 0.7153z^{-3}. A binary signal (x_n ∈ {−1, +1}) is sent through this channel, after which the signal is transformed nonlinearly according to v = tanh(u), where u is the linear channel output. A scatter plot of u_n versus v_n can be found in Fig. 3. Finally, 20 dB of additive white Gaussian noise (AWGN) is added. The Wiener system is treated as a black box of which only input and output are known. To fill the initial sliding windows, zero-padding of the signals was applied. The MLP was trained in an online manner, i.e. in every iteration one new data point was used to update the network weights.

1) Comparison to MLP: System identification was first performed by an MLP with 8 neurons in its hidden layer and learning rate 0.1, and then by the sliding-window kernel RLS algorithm with window size N = 150 and a polynomial kernel of order p = 3. For both methods we applied time-embedding techniques in which the length L of the linear channel was known. More specifically, the MLP was a time-delay MLP with L inputs, and the input vectors for the kernel RLS algorithm were time-delayed vectors of length L, x_n = [x_{n−L+1}, ..., x_n]^T. The length of the linear channels H1 and H2 was L = 4.

In iteration n the input-output pair (x_n, y_n) is fed into the identification algorithm. The performance is evaluated by estimating the next output sample v̂_{n+1}, given the next input vector x_{n+1}, and comparing it to the actual output v_{n+1}. For both methods, the mean square error (MSE) was averaged over 250 Monte Carlo simulations.

The MSE for both approaches is shown in Fig. 4. Between iterations 500 and 650 it can be seen that the algorithm needs approximately one window length of new samples to adjust to the abrupt channel change.
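For readers who want to reproduce the setup, the sketch below generates the Wiener-system data as described above. The channel coefficients, the tanh nonlinearity, the abrupt change after 500 symbols, the 20 dB noise level, zero-padding and the time-embedding with L = 4 are taken from the text; the use of SciPy's lfilter, the SNR-based noise scaling and all helper names are our own illustrative choices:

    import numpy as np
    from scipy.signal import lfilter

    rng = np.random.default_rng(0)
    n_symbols = 1500
    x = rng.choice([-1.0, 1.0], size=n_symbols)           # binary input x_n

    h1 = [1, -0.3668, -0.4764, 0.8070]                    # H1(z), first 500 symbols
    h2 = [1, -0.8326, 0.6656, 0.7153]                      # H2(z), after the change
    u = np.concatenate([lfilter(h1, [1], x[:500]),
                        lfilter(h2, [1], x[500:])])        # linear channel output u_n
    v = np.tanh(u)                                         # memoryless nonlinearity
    snr_db = 20.0
    noise_std = np.sqrt(np.var(v) / 10 ** (snr_db / 10))
    y = v + noise_std * rng.standard_normal(n_symbols)     # observed output y_n

    def time_embed(x, L):
        """Time-delayed input vectors x_n = [x_{n-L+1}, ..., x_n]^T (zero-padded)."""
        xp = np.concatenate([np.zeros(L - 1), x])
        return np.stack([xp[n:n + L] for n in range(len(x))])

    X = time_embed(x, L=4)    # inputs for the kernel RLS / time-delay MLP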
[Figure 5. Comparison of the identification results of the kernel RLS algorithm for different window lengths (N = 75, 150, 300) in Wiener system identification. Larger windows obtain a better MSE. In general, the number of iterations needed for convergence is of the order of the window length.]

[Figure 6. Comparison of the identification results of the kernel RLS algorithm for different Wiener system channel lengths (L = 2, 4, 6). Since the same sliding-window length was used in the three situations, best results are obtained for the shortest channel. Note that the channel length was known while performing identification.]

[Figure 7. Identification results when the correct channel length L = 4 is not known. The curve L_est = 3 shows that underestimation of the channel length leads to worse performance than overestimation, which corresponds to L_est > 4.]

[Figure 8. Identification results of the kernel RLS algorithm for different system SNR values (10, 20 and 30 dB) in Wiener system identification.]

[Figure 9. Influence of the regularization constant on the algorithm's performance. For c ≤ 20 the obtained performance is very similar. For larger values, such as c = 50, regularization has a dominating effect on the performance.]

[Figure 10. MSE for time-tracking of a Wiener system in which the channel varies linearly, compared to a constant channel.]
B. The Inverse of an Upsized Matrix

By upsizing a matrix we refer to adding one row and one column at the end. The matrix A shown below is upsized to the matrix K. Given the inverse matrix A^{-1} and the upsized matrix K, the inverse matrix K^{-1} can be obtained as follows:

    K = [ A    b ]          K^{-1} = [ E    f ]
        [ b^T  d ],                  [ f^T  g ].

From K K^{-1} = I it follows that

    A E + b f^T = I,
    A f + b g   = 0,
    b^T f + d g = 1.

REFERENCES

[1] V. N. Vapnik, The Nature of Statistical Learning Theory.

Ignacio Santamaría received his Telecommunication Engineer Degree and his Ph.D. in Electrical Engineering from the Polytechnic University of Madrid, Spain, in 1991 and 1995, respectively. In 1992 he joined the Department of Communications Engineering, University of Cantabria, Spain, where he is currently Professor. In 2000 and 2004 he spent visiting periods at the Computational NeuroEngineering Laboratory (CNEL), University of Florida. Dr. Santamaría has more than 90 publications in refereed journals and international conference papers. His current research interests include nonlinear modeling techniques, machine learning theory and their application to digital communication systems, in particular to inverse problems (deconvolution, equalization and identification problems).
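As a complement to the appendix relations above, the short check below verifies them numerically using the standard closed-form block-inverse (Schur complement) expressions for E, f and g; the matrix sizes and names are ours, chosen only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 5
    A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)   # well-conditioned A
    b = rng.standard_normal(N)
    d = 5.0

    A_inv = np.linalg.inv(A)
    g = 1.0 / (d - b @ A_inv @ b)          # scalar block of K^{-1}
    f = -g * (A_inv @ b)                   # off-diagonal block
    E = A_inv + np.outer(f, f) / g         # upper-left block

    K = np.block([[A, b[:, None]], [b[None, :], np.array([[d]])]])
    K_inv = np.block([[E, f[:, None]], [f[None, :], np.array([[g]])]])
    assert np.allclose(K @ K_inv, np.eye(N + 1))
    # The three appendix identities hold as well:
    assert np.allclose(A @ E + np.outer(b, f), np.eye(N))    # A E + b f^T = I
    assert np.allclose(A @ f + b * g, 0)                     # A f + b g = 0
    assert np.isclose(b @ f + d * g, 1.0)                    # b^T f + d g = 1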