
Kernel Principal Components Analysis

Max Welling
Department of Computer Science
University of Toronto
10 King's College Road
Toronto, M5S 3G5 Canada

This is a note to explain kernel PCA (kPCA).

Let's first see what PCA is when we do not worry about kernels and feature spaces. We will always assume that we have centered data, i.e. $\sum_i x_i = 0$. This can always be achieved by a simple translation of the axes. Our aim is to find meaningful projections of the data. However, we are facing an unsupervised problem where we don't have access to any labels; if we had labels, we should be doing Linear Discriminant Analysis. Due to this lack of labels, our aim will be to find the subspace of largest variance, where we choose the number of retained dimensions beforehand. This is clearly a strong assumption, because it may happen that there is interesting signal in the directions of small variance, in which case PCA is not a suitable technique (and we should perhaps use a technique called independent component analysis). However, usually it is true that the directions of smallest variance represent uninteresting noise. To make progress, we start by writing down the sample covariance matrix $C$,

$$C = \frac{1}{N} \sum_i x_i x_i^T \qquad (1)$$
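As a concrete illustration of eq. (1), here is a minimal NumPy sketch; the toy data and all variable names are my own, not part of the note:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 200 points in d = 3 dimensions, deliberately anisotropic
# so the variances along the axes differ strongly.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

# Center: subtract the mean so that sum_i x_i = 0, as assumed in the note.
X = X - X.mean(axis=0)

# Sample covariance C = (1/N) sum_i x_i x_i^T, a d x d symmetric matrix (eq. 1).
N = X.shape[0]
C = X.T @ X / N
```

The eigen-structure of this $C$ is what the rest of the derivation works with.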


The eigenvalues of this matrix represent the variance in the eigen-directions of data-space. The eigenvector corresponding to the largest eigenvalue is the direction in which the data is most stretched out. The second direction is orthogonal to it and picks the direction of largest variance in that orthogonal subspace, etc. Thus, to reduce the dimensionality of the data, we compute the eigen-decomposition

$$C = U \Lambda U^T = \sum_a \lambda_a u_a u_a^T \qquad (2)$$

and project the data onto the retained eigen-directions of largest variance,

$$y_i = U_k^T x_i \quad \forall i \qquad (3)$$

where $U_k$ denotes the $d \times k$ sub-matrix containing the first $k$ eigenvectors as columns. As a side effect, we can now show that the projected data are de-correlated in this new basis:

$$\frac{1}{N} \sum_i y_i y_i^T = U_k^T \left( \frac{1}{N} \sum_i x_i x_i^T \right) U_k = U_k^T C U_k = U_k^T U \Lambda U^T U_k = \Lambda_k \qquad (4)$$
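The projection (3) and the de-correlation property (4) can be verified numerically. A small NumPy sketch with synthetic data; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data: N = 500 points in d = 5 dimensions.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)          # center, as assumed throughout
N = X.shape[0]
C = X.T @ X / N                 # sample covariance (eq. 1)

# Eigen-decomposition C = U Lambda U^T, eigenvalues sorted descending.
lam, U = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# Project onto the top-k eigen-directions: y_i = U_k^T x_i (eq. 3).
k = 2
Uk = U[:, :k]
Y = X @ Uk                      # rows of Y are the projected y_i

# De-correlation check (eq. 4): (1/N) sum_i y_i y_i^T equals Lambda_k.
S = Y.T @ Y / N
print(np.allclose(S, np.diag(lam[:k])))  # True
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, hence the explicit re-sort before keeping the top $k$.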
where $\Lambda_k$ is the (diagonal) $k \times k$ sub-matrix corresponding to the largest eigenvalues. Another convenient property of this procedure is that the reconstruction error in $L_2$-norm between $y$ and $x$,

$$\sum_i ||x_i - P_k x_i||^2 \qquad (5)$$

is minimal, where $P_k = U_k U_k^T$ is the projection onto the subspace spanned by the columns of $U_k$.

Now imagine that there are more dimensions than data-cases, i.e. some dimensions remain unoccupied by the data. In this case it is not hard to show that the eigenvectors that span the projection space must lie in the subspace spanned by the data-cases. This also implies that instead of the eigenvalue equation $Cu = \lambda u$ we may consider the $N$ projected equations $x_i^T C u = \lambda x_i^T u \;\forall i$. This can be seen as follows,

$$\lambda_a u_a = C u_a = \frac{1}{N} \sum_i x_i x_i^T u_a = \frac{1}{N} \sum_i (x_i^T u_a)\, x_i \;\Rightarrow\; u_a = \sum_i \frac{(x_i^T u_a)}{N \lambda_a}\, x_i = \sum_i \alpha_i^a x_i \qquad (6)$$

where $u_a$ is some arbitrary eigenvector of $C$. The last expression can be interpreted as: "every eigenvector can be exactly written (i.e. losslessly) as some linear combination of the data-vectors, and hence it must lie in their span." From this equation the coefficients $\alpha_i^a$ can be computed efficiently in a space of dimension $N$ (and not $d$) as follows,

$$x_i^T C u_a = \lambda_a x_i^T u_a \;\Rightarrow\; \frac{1}{N}\, x_i^T \sum_k x_k x_k^T \sum_j \alpha_j^a x_j = \lambda_a x_i^T \sum_j \alpha_j^a x_j \;\Rightarrow\; \frac{1}{N} \sum_{j,k} [x_i^T x_k][x_k^T x_j]\, \alpha_j^a = \lambda_a \sum_j [x_i^T x_j]\, \alpha_j^a \qquad (7)$$

We now rename the matrix $[x_i^T x_j] = K_{ij}$ to arrive at,

$$K^2 \alpha_a = N \lambda_a K \alpha_a \;\Rightarrow\; K \alpha_a = \tilde{\lambda}_a \alpha_a \quad \text{with} \quad \tilde{\lambda}_a = N \lambda_a \qquad (8)$$

So, we have derived an eigenvalue equation for $\alpha_a$ which in turn completely determines the eigenvectors $u_a$. By requiring that $u_a$ is normalized we find,

$$u_a^T u_a = 1 \;\Rightarrow\; \sum_{i,j} \alpha_i^a \alpha_j^a [x_i^T x_j] = \alpha_a^T K \alpha_a = N \lambda_a\, \alpha_a^T \alpha_a = 1 \;\Rightarrow\; ||\alpha_a|| = 1/\sqrt{N \lambda_a} \qquad (9)$$

Finally, when we receive a new data-case $t$ and would like to compute its projection onto the new reduced space, we compute,

$$u_a^T t = \sum_i \alpha_i^a x_i^T t = \sum_i \alpha_i^a K(x_i, t) \qquad (10)$$

This equation should look familiar: it is central to most kernel methods. Obviously, the whole exposition was set up so that in the end we only needed the matrix $K$ to do our calculations. This implies that we are now ready to kernelize the procedure by replacing $x_i \rightarrow \Phi(x_i)$ and defining $K_{ij} = \Phi(x_i)\Phi(x_j)^T$.

2 Centering Data in Feature Space

It is in fact very difficult to explicitly center the data in feature space. But we know that the final algorithm only depends on the kernel matrix, so if we can center the kernel matrix we are done as well. A kernel matrix is given by $K_{ij} = \Phi_i \Phi_j^T$, where $\Phi(x_i) = \Phi_{ia}$. We now center the features using,
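Eqs. (8)–(10) suggest the following NumPy sketch. It uses a plain linear kernel so that the kernel-side answer can be cross-checked against an explicit eigenvector of $C$; all data and variable names are illustrative, not from the note:

```python
import numpy as np

rng = np.random.default_rng(2)

# Centered data: N = 50 cases in d = 10 dimensions (here N > d, but the
# same code works unchanged when d >> N, which is the interesting regime).
X = rng.normal(size=(50, 10))
X = X - X.mean(axis=0)
N = X.shape[0]

K = X @ X.T                     # linear kernel, K_ij = x_i^T x_j

# Eigenvalue problem K alpha = lambda_tilde alpha (eq. 8), sorted descending.
lt, A = np.linalg.eigh(K)
order = np.argsort(lt)[::-1]
lt, A = lt[order], A[:, order]

# Normalize the top component so ||alpha|| = 1/sqrt(N lambda)
# = 1/sqrt(lambda_tilde) (eq. 9); eigh returns unit-norm columns.
alpha = A[:, 0] / np.sqrt(lt[0])

# Projection of a new case t onto u_a (eq. 10): sum_i alpha_i K(x_i, t).
t = rng.normal(size=10)
proj = alpha @ (X @ t)

# Cross-check: same answer (up to sign) as the explicit top eigenvector of C.
lamC, U = np.linalg.eigh(X.T @ X / N)
u = U[:, np.argmax(lamC)]
print(np.isclose(abs(proj), abs(u @ t)))  # True
```

With a non-linear kernel the cross-check is no longer available (there is no explicit $C$), which is exactly the point: only $K$ is ever needed.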

$$\Phi_i^c = \Phi_i - \frac{1}{N} \sum_k \Phi_k \qquad (11)$$

Hence the kernel in terms of the new features is given by,

$$K_{ij}^c = \left(\Phi_i - \frac{1}{N}\sum_k \Phi_k\right)\left(\Phi_j - \frac{1}{N}\sum_l \Phi_l\right)^T \qquad (12)$$
$$= \Phi_i \Phi_j^T - \left[\frac{1}{N}\sum_k \Phi_k\right]\Phi_j^T - \Phi_i\left[\frac{1}{N}\sum_l \Phi_l^T\right] + \left[\frac{1}{N}\sum_k \Phi_k\right]\left[\frac{1}{N}\sum_l \Phi_l^T\right] \qquad (13)$$
$$= K_{ij} - \kappa_i 1_j^T - 1_i \kappa_j^T + k\, 1_i 1_j^T \qquad (14)$$

with

$$\kappa_i = \frac{1}{N} \sum_k K_{ik} \qquad (15)$$
$$k = \frac{1}{N^2} \sum_{ij} K_{ij} \qquad (16)$$

Hence, we can compute the centered kernel in terms of the non-centered kernel alone, and no features need to be accessed. At test-time we need to compute,

$$K^c(t_i, x_j) = \left[\Phi(t_i) - \frac{1}{N}\sum_k \Phi(x_k)\right]\left[\Phi(x_j) - \frac{1}{N}\sum_l \Phi(x_l)\right]^T \qquad (17)$$

Using a similar calculation (left for the reader) you can find that this can be expressed easily in terms of $K(t_i, x_j)$ and $K(x_i, x_j)$ as follows,

$$K^c(t_i, x_j) = K(t_i, x_j) - \kappa(t_i) 1_j^T - 1_i \kappa(x_j)^T + k\, 1_i 1_j^T \qquad (18)$$
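Eqs. (12)–(18) amount to a few matrix products on $K$. A NumPy sketch using a linear kernel, cross-checked against explicitly centered features (this explicit check is only possible for the linear kernel; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Training data, deliberately NOT centered.
X = rng.normal(size=(40, 6)) + 2.0
N = X.shape[0]
K = X @ X.T                    # non-centered linear kernel

# Centered kernel (eqs. 12-16): Kc_ij = K_ij - kappa_i - kappa_j + kbar,
# written compactly with the N x N averaging matrix O = (1/N) 1 1^T.
O = np.ones((N, N)) / N
Kc = K - O @ K - K @ O + O @ K @ O

# Test-time version (eqs. 17-18) for a batch of new points.
T_new = rng.normal(size=(5, 6))
Kt = T_new @ X.T               # K(t_i, x_j), shape (5, N)
Ot = np.ones((5, N)) / N
Kt_c = Kt - Ot @ K - Kt @ O + Ot @ K @ O

# Cross-check against explicitly centered features (linear kernel only).
mu = X.mean(axis=0)
print(np.allclose(Kc, (X - mu) @ (X - mu).T))        # True
print(np.allclose(Kt_c, (T_new - mu) @ (X - mu).T))  # True
```

Here `O @ K` has $\kappa_j$ in every row and `K @ O` has $\kappa_i$ in every column, so the one-line expression for `Kc` reproduces eq. (14) element-wise, and the note's promise holds: no features are ever accessed, only kernel evaluations.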