
IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, VOL. 28, NO. 1, FEBRUARY 2015

Fault Detection Using Random Projections and k-Nearest Neighbor Rule for Semiconductor Manufacturing Processes

Zhe Zhou, Chenglin Wen, Member, IEEE, and Chunjie Yang

Manuscript received June 23, 2014; revised October 11, 2014; accepted November 19, 2014. Date of publication November 26, 2014; date of current version January 30, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61034006, Grant 61104028, Grant 61273170, Grant 61203094, Grant 61290324, and Grant 61333005, and in part by the National High-tech R&D Program of China (863 Program) under Grant 2012AA041709. A brief version of this paper was presented at the 33rd Chinese Control Conference, July 28, 2014. (Corresponding author: C. Wen.)

Z. Zhou and C. Yang are with the State Key Laboratory of Industrial Control Technology, Institute of Industrial Process Control, Zhejiang University, Hangzhou 310027, China (e-mail: zhouzhe@zju.edu.cn; cjyang@iipc.zju.edu.cn).

C. Wen is with the Institute of Systems Science and Control Engineering, School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China, and also with the College of Electrical Engineering, Henan University of Technology, Zhengzhou 450007, China (e-mail: wencl@hdu.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSM.2014.2374339

0894-6507 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—Fault detection techniques are essential for improving the overall equipment efficiency of the semiconductor manufacturing industry. It has been recognized that fault detection based on the k-nearest neighbor rule (kNN) can effectively deal with some characteristics of semiconductor processes, such as multimode batch trajectories and nonlinearity. However, the computational complexity and storage space involved in the neighbor searching of kNN prevent it from being used for online monitoring, especially in high-dimensional cases. To deal with this difficulty, principal component-based kNN has been presented in the literature, in which dimension reduction is done by principal component analysis (PCA) before the kNN rule is applied to fault detection. However, dimension reduction by PCA may distort the distances between pairs of samples (trajectories). Thus, the false alarms and missing detections of kNN for fault detection may increase in the principal component subspace, because PCA fails to preserve pairwise distances there. To overcome this drawback, we propose a new fault detection method based on random projection and the kNN rule, which combines the advantages of random projection in distance preservation (in expectation) and of the kNN rule in dealing with the problems of multimodality and nonlinearity that often coexist in semiconductor manufacturing processes. An industrial example illustrates the performance of the proposed method.

Index Terms—Fault detection, k-nearest neighbor rule (kNN), random projection (RP), distance preservation.

I. INTRODUCTION

FAULT detection and classification techniques continuously play an important role in the sustained growth of the semiconductor manufacturing industry. Combined with advanced process control techniques, they can characterize and control variability in critical manufacturing processes and thus reduce wafer scrap, increase equipment uptime and reduce the usage of test wafers [1]–[4]. Traditional multivariate statistical process monitoring (MSPM) methods, such as principal component analysis (PCA) and partial least squares (PLS), have been extensively used in semiconductor manufacturing process monitoring [5]–[8]. However, as a typical batch process, semiconductor processes have some characteristics, such as multimode batch trajectories, nonlinearity and non-Gaussian distributed data (see Fig. 11 in [1] and Fig. 3 in [9]), which have posed challenges to these MSPM methods. Several modified PCA-based methods for nonlinearity [10], [11], multimode operation [12], [13], and non-Gaussian data [14] also encounter difficulties when these problems coexist.

To overcome these limitations, He and Wang [1] proposed a fault detection method based on the k-nearest neighbor rule (FD-kNN). Unlike the well-known k-nearest neighbor rule for multi-class classification, it is used as an anomaly detection algorithm in which only normal data are available for model building. It performs fault detection based on the following criterion: the trajectory of a normal test sample is similar to the trajectories of the normal training samples (obtained under normal operating conditions), while the trajectory of a faulty sample should deviate significantly from the trajectories of the normal samples in the training set. The deviation is measured by the kNN distance, which is the average squared distance between the incoming test sample and its k nearest neighbors from the normal training set. The reason FD-kNN is superior to PCA for fault detection in semiconductor processes is that PCA fails to capture those characteristics of semiconductor manufacturing processes; it implies an assumption of a multivariate Gaussian distribution on the measurement variables and tries to characterize the global variation of the underlying data. In contrast, FD-kNN overcomes these problems by utilizing the relation of distances among local samples to perform fault detection. This has been evidently illustrated through simulation and industrial examples in [1].

In general, the measurement data of a batch process are collected in a 3-D array, denoted as X ∈ R^{I×J×K}, where I, J and K represent the number of batches, variables and sampling times, respectively. Before applying PCA, several processing steps are needed, including unfolding the 3-D array into a 2-D matrix X. The way adopted in FD-kNN is batch unfolding (X ∈ R^{I×JK}): each batch trajectory, which characterizes the variation over the whole batch duration, is represented by a high-


dimensional vector (x ∈ R^{JK×1}) after unfolding. It is obvious that a high computational load and large storage space are required (for neighbor searching in X) when FD-kNN is applied to online monitoring, especially when many such fault detection models are implemented simultaneously. It is also difficult to realize this fault detection algorithm on an Electronic Control Unit (ECU) with limited capacity, which is often embedded in a control system [15].

To reduce the computational complexity and storage requirement, He and Wang proposed a method called principal component-based kNN (PC-kNN) to deal with this drawback of FD-kNN [2], [3]. PC-kNN first projects the high-dimensional data matrix X onto the principal component subspace (PCS); the kNN rule is then applied to the score matrix in the PCS to build a model. In this way, the computational load and storage space can be reduced significantly if few principal components are preserved.

Although the computational complexity and storage space can be reduced by PC-kNN, the performance of fault detection is more important and should be the primary consideration, especially for critical processes. Unfortunately, the pairwise distances in the original space cannot be preserved in the principal component subspace, because the objective of dimension reduction by PCA is to reconstruct the data in the original space as well as possible using a few principal components, not to preserve the distances between samples. That is to say, a sample detected as faulty (deviating significantly from the normal samples) by kNN in the original space may be recognized as normal in the PCS due to the distance distortion caused by the PCA-based projection, and vice versa. Thus the false alarms and missing detections of PC-kNN may increase compared to FD-kNN. Therefore, dimension reduction by PCA cannot guarantee the performance of kNN in the principal component subspace.

In this paper, a new fault detection method, which integrates the advantages of random projection in preserving pairwise distances and of kNN in dealing with the problems of multimodality and nonlinearity, is presented to overcome the drawbacks of the methods mentioned above. Random projection can approximately preserve the distances between pairs of samples in the random subspace. It has been extensively applied in several areas, such as machine learning [16]–[18], image processing [19], [20] and compressed sensing [21], [22].

The rest of this paper is organized as follows. In Section II, we review the relevant methods, including PCA, FD-kNN and PC-kNN. Comparisons between PCA and random projection for dimension reduction are given in Section III. Our main results are stated in Section IV, where we present a new method for fault detection based on the kNN rule in a compressive measurement space. In Section V, we illustrate the proposed method using industrial data from a semiconductor etch process. Finally, Section VI gives conclusions and some discussion.

II. BRIEF REVIEW OF RELEVANT METHODS

In this section, some relevant methods will be briefly reviewed, including PCA, FD-kNN and PC-kNN.

A. PCA-Based Fault Detection

As a fundamental method for multivariate statistical process monitoring, PCA decomposes the normalized data (scaled to zero mean and unit variance for each variable) X ∈ R^{n×m} as follows:

X = X̂ + X̃ = XPP^T + XP̃P̃^T = TP^T + T̃P̃^T    (1)

where n and m represent the number of samples and variables, respectively, and T ∈ R^{n×l} and P ∈ R^{m×l} represent the score and loading matrices, respectively. [T T̃] is orthogonal, [P P̃] is orthonormal and its columns are eigenvectors of the matrix X^T X. P is composed of the l eigenvectors corresponding to the l largest eigenvalues of X^T X, and P̃ contains the other eigenvectors.

The Hotelling T² statistic and the SPE (squared prediction error) statistic can then be constructed in the principal component subspace spanned by P and the residual subspace spanned by P̃, respectively. The SPE statistic, which measures the deviation of an incoming sample x from the principal component subspace, is defined as

SPE(x) = ‖x̃‖² = ‖P̃P̃^T x‖² = ‖(I − PP^T) x‖²    (2)

where I is the identity matrix. If SPE(x) ≤ δ_α, then x is recognized as a normal sample; otherwise, x is a faulty sample. δ_α represents the control limit for SPE under significance level α; the calculation of δ_α was given by Jackson and Mudholkar [23] under the assumption that x follows a multivariate normal distribution.

The Hotelling T² statistic, which characterizes the variation of a sample in the PCS, is defined as

T²(x) = x^T P Λ⁻¹ P^T x    (3)

where Λ is a diagonal matrix whose diagonal elements are the l largest eigenvalues of X^T X. The sample x is considered normal if

T²(x) ≤ T²_α    (4)

where T²_α is the control limit with significance level α and can be estimated in several ways [24].

B. FD-kNN

The well-known kNN rule is a simple and practical method for classification and has been extensively used in several areas [25]–[28]. In the field of industrial process control, kNN has also been used to deal with the problem of fault classification [29]. However, only a few faulty samples and a large number of normal samples are available in real industrial processes, so it is insufficient to learn a reliable kNN classification model or to characterize the entire distribution of faulty samples with limited labeled faulty samples.

The FD-kNN method given by He and Wang [1] only uses the normal samples obtained under normal operating conditions, and


is designed as an anomaly detection algorithm. The details of the algorithm are as follows.

• Model Building
  – Find the k nearest neighbors, using the Euclidean distance, for each sample x_i in the training set¹:

    d_{i,j} = ‖x_i − x_j‖, j = 1, …, n, j ≠ i    (5)

  – Calculate the average squared distance between each sample and its k nearest neighbors²:

    D_i² = (1/k) Σ_{j=1}^{k} d_{i,j}²    (6)

    where d_{i,j}² denotes the squared Euclidean distance between the ith sample and its jth nearest neighbor in the training set.
  – Determine the threshold D_α² (control limit) used in the next stage of fault detection. The distribution of D_i² can be approximated by a noncentral chi-square distribution under the assumption that the d_{i,j} are normally distributed with a nonzero mean; thus, the threshold D_α² can be estimated. Another way is to choose it as the (1 − α)-empirical quantile of the D_i²:

    D_α² = D²_{(⌊n(1−α)⌋)}    (7)

    where D²_{(i)}, i = 1, …, n, is the rearrangement of D_i², i = 1, …, n, in descending order, and ⌊n(1−α)⌋ is the integer part of n(1 − α).

• Fault detection
  For a new sample x,
  – Find the k nearest neighbors of x in the training set using (5).
  – Calculate D_x² using (6).
  – Compare D_x² with the threshold D_α². If D_x² > D_α², then x is considered a faulty sample; otherwise, x is a normal sample.

¹k can be determined by means of cross validation [1]. How to select an appropriate k is not the main topic of this paper.
²This is different from [1], where there is no averaging in the definition of (6). The expression used in [3] (latest version) is adopted in this paper.

It is worth noting that the reason FD-kNN can deal with the problems in semiconductor processes is that the distances between local samples are utilized to perform fault detection. However, the computation involved in neighbor searching (calculating distances between high-dimensional vectors and sorting) prevents it from being used for online process monitoring.

C. PC-kNN

To overcome the drawbacks of FD-kNN for online implementation, He and Wang [3] proposed principal component-based kNN. The unfolded array X is first projected onto the principal component subspace; the kNN rule is then applied to the score matrix in the PCS to perform fault detection. In this way, the neighbor searching can be carried out in the R^l subspace if a few principal components (l ≪ JK) capture most of the variation in X; thus the computational complexity can be reduced significantly. The details of PC-kNN are as follows.

• Dimension reduction by PCA
  – Calculate the projection matrix P = [p_1, …, p_l], where p_i, i = 1, …, l, are the eigenvectors corresponding to the l largest eigenvalues of X^T X.
  – Project X onto the PCS:

    T = XP    (8)

• Model Building
  – Find the k nearest neighbors, using the Euclidean distance, for each sample in the reduced training set T:

    d̂_{i,j} = ‖t_i − t_j‖, j = 1, …, n, j ≠ i    (9)

  – Calculate the average squared distance between each sample and its k nearest neighbors:

    D̂_i² = (1/k) Σ_{j=1}^{k} d̂_{i,j}²    (10)

    where d̂_{i,j}² denotes the squared Euclidean distance between sample i and its jth nearest neighbor in the PCS.
  – Determine the threshold D̂_α² (control limit). The distribution of D̂_i² can be approximated by a noncentral chi-square distribution under the assumption that the d̂_{i,j} are normally distributed with a nonzero mean. Another way is to choose the threshold as the (1 − α)-empirical quantile of the D̂_i²:

    D̂_α² = D̂²_{(⌊n(1−α)⌋)}    (11)

    where D̂²_{(i)}, i = 1, …, n, is the rearrangement of D̂_i², i = 1, …, n, in descending order, and ⌊n(1−α)⌋ is the integer part of n(1 − α).

• Fault detection
  For a new sample x,
  – Project x onto the PCS: t = P^T x.
  – Find the k nearest neighbors of t in the reduced training set T using (9) and calculate D̂_t² using (10).
  – Compare D̂_t² with the threshold D̂_α². If D̂_t² > D̂_α², then x is classified as a faulty sample; otherwise, x is a normal sample.

Although PC-kNN can realize the goal of reducing the computational complexity and storage space, the advantages of FD-kNN in dealing with the problems of multimodality and nonlinearity depend on whether the pairwise distances can be preserved in the projection subspace, which is necessary for designing a reliable and robust fault detection algorithm. Unfortunately, the pairwise distances in the original space cannot be preserved in the principal component subspace. This will be discussed in detail in the next section.

III. RANDOM PROJECTION FOR DIMENSIONALITY REDUCTION

In this section, another dimension reduction method, random projection (RP), is introduced. The capabilities of PCA and random projection in distance preservation, which decide the performance of the kNN rule in the subspace, will be compared.
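The FD-kNN steps reviewed in Section II can be condensed into a short sketch. The following Python code is our own illustrative implementation on made-up data (the function names, the toy data set, and the use of the empirical-quantile threshold are our assumptions, not the authors' code):

```python
import numpy as np

def knn_stat(train, x, k=3, exclude_self=False):
    # Average squared Euclidean distance from x to its k nearest
    # neighbors in `train` -- the statistic of Eq. (6).
    d2 = np.sort(np.sum((train - x) ** 2, axis=1))
    if exclude_self:
        d2 = d2[1:]          # drop the zero self-distance
    return d2[:k].mean()

def fdknn_threshold(X, k=3, alpha=0.05):
    # Model building: compute the kNN statistic of every training
    # sample, then take the (1 - alpha)-empirical quantile as the
    # control limit, in the spirit of Eq. (7).
    D2 = np.array([knn_stat(X, x, k, exclude_self=True) for x in X])
    return np.quantile(D2, 1 - alpha)

def fdknn_is_faulty(X, x, limit, k=3):
    # Fault detection: flag x when its kNN statistic exceeds the limit.
    return knn_stat(X, x, k) > limit

# Toy usage on made-up data: 100 hypothetical normal batches,
# each unfolded to a 50-dimensional vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))
limit = fdknn_threshold(X, k=3)
x_faulty = rng.normal(size=50) + 4.0   # large shift in every variable
print(fdknn_is_faulty(X, x_faulty, limit))   # -> True
```

PC-kNN differs only in that X is replaced by its score matrix T = XP before the same two steps are applied; the RPkNN method of Section IV likewise replaces it by XR.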


Fig. 1. Fault detection using FD-kNN with k = 3.

Fig. 2. Fault detection using PC-kNN with PCs = 1.
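The situation depicted in Figs. 1 and 2 can be reproduced numerically. The sketch below uses made-up data (normal samples on a quadratic curve and three hypothetical off-curve faults; it is not the paper's actual example) to show that the kNN statistic separates the faults in the original two-dimensional space but not after projection onto the first principal component:

```python
import numpy as np

# Made-up data in the spirit of Figs. 1 and 2: normal samples lie on
# a quadratic curve, three hypothetical faults lie off the curve.
x1 = np.linspace(-1.5, 1.5, 30)
normal = np.column_stack([x1, x1 ** 2])
faults = np.array([[-1.0, 3.0], [0.0, 2.0], [1.0, 3.0]])

def knn_stat(train, x, k=3, exclude_self=False):
    # Average squared Euclidean distance from x to its k nearest
    # neighbors in `train`.
    d2 = np.sort(np.sum((train - x) ** 2, axis=1))
    if exclude_self:
        d2 = d2[1:]
    return d2[:k].mean()

# kNN statistics in the original 2-D space.
D2_normal = np.array([knn_stat(normal, x, exclude_self=True) for x in normal])
D2_fault = np.array([knn_stat(normal, f) for f in faults])

# Project everything onto the first principal component (a 1-D PCS).
mu = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mu, full_matrices=False)
p1 = vt[:1].T                        # loading matrix, shape (2, 1)
t_normal = (normal - mu) @ p1        # scores of the normal samples
t_fault = (faults - mu) @ p1

T2_normal = np.array([knn_stat(t_normal, t, exclude_self=True) for t in t_normal])
T2_fault = np.array([knn_stat(t_normal, t) for t in t_fault])

# In the original space every fault exceeds the largest normal
# statistic; in the 1-D PCS the fault scores hide below it.
print("separable in original space:", D2_fault.min() > D2_normal.max())
print("separable in PCS:          ", T2_fault.max() < T2_normal.max())
```

With this data the first line prints True and the second False-like behavior is reversed: the faults' statistics in the PCS fall below the largest normal statistic, so a threshold set on the normal samples cannot flag them.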

A. Distance Preservation by PCA

The conclusion is that projection by PCA cannot preserve the capability of the kNN rule. Here, a two-dimensional example is given to show that dimension reduction by PCA fails to preserve the distances among samples. Fig. 1 is an illustration of fault detection by FD-kNN with k = 3. It is a nonlinear case with two variables, and the ten normal samples are represented by circles. It can be seen from the distribution of the normal samples that the relation between the variables is approximately quadratic. The faulty samples are represented by three red squares. Obviously, all three faulty samples deviate significantly from their neighbors among the normal samples, so these three faulty samples are easily detected by FD-kNN.

When PC-kNN is applied to this example, PCA is first applied to the normal samples to learn the basis of the PCS. The variance along the direction of axis x1 is the largest, so an orthonormal basis along the direction x1 is selected as the basis of the PCS. The normal samples are then projected onto the PCS, and the threshold is calculated in the PCS. For online monitoring, each new incoming sample is first projected onto the PCS; its kNN distance is then calculated and compared with the threshold to perform fault detection. The results are shown in Fig. 2. From Fig. 2, we can see that the projections of the three faulty samples from the original space have mixed with the projections of the normal samples in the PCS, and their neighbors in the PCS are not identical to those in the original space. Therefore, kNN fails to detect the three faulty samples in the PCS, even though the threshold in the PCS differs from the original one. This means that dimension reduction by PCA changes the pairwise distances in the PCS and is thus unable to retain the advantages of kNN there, which may result in an increase of false alarms or missing detections.

If the fault occurs in a direction dominated by the first few principal components, it may still be detected by kNN in the PCS. However, it can never be detected by kNN in the PCS if the fault occurs in the residual subspace. Actually, a fault is more likely to be detected in the residual subspace, as has been indicated in several MSPM-related works [6], [23], [24]. As in the two-dimensional example given in this subsection, if the information in the residual subspace could be utilized, for instance by implementing the kNN rule in the residual subspace as well, those three faults could be detected correctly. However, the advantage of low computational consumption and storage requirement would then disappear.

In addition, it is hard to determine the number of retained principal components l, due to the difficulty of measuring the effect of l on distance preservation (which further affects kNN). This will be seen in the industrial example in Section V: the best result of PC-kNN is obtained when the number of PCs is small (capturing only about 10% of the variance); worse results are obtained with more PCs.

B. Distance Preservation by Random Projection

It is evident that projection based on PCA captures a global property and cannot provide any local guarantees. Random projection, which is introduced in this subsection, can provide the preservation of pairwise distances. It has been extensively used in machine learning [16]–[18], image processing [19], [20] and compressed sensing [21], [22].

An important result—the Johnson-Lindenstrauss (JL) lemma [30]—says that any set of n points in (high) d-dimensional Euclidean space can be mapped into an O(ε⁻² log n)-dimensional Euclidean space such that the distance between any two points is distorted by only a factor of 1 ± ε (0 < ε < 1) [31]. Later, several researchers provided simpler proofs of the original JL lemma and specified a bound on the reduced dimension L [32]–[34]. To further improve the efficiency of the projection operation, Achlioptas [34] designed two very simple projection matrices whose elements are either 0 or ±1, and provided a new bound on L similar to the result in [33].

Theorem 1: Let Q be an arbitrary set of n points in R^d, represented as an n × d matrix A. Given ε, β > 0, let

L_0 = (4 + 2β) log n / (ε²/2 − ε³/3)    (12)

For any integer L ≥ L_0, let R be a d × L random matrix with R(i, j) = r_{i,j}, where the {r_{i,j}} are independent random variables

drawn from either one of the following two probability distributions:

r_{i,j} = +1 with probability 1/2; −1 with probability 1/2    (13)

r_{i,j} = +1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6    (14)

Let

E = (1/√L) A R    (15)

and let f : R^d → R^L map the ith row of A to the ith row of E. With probability at least 1 − n^{−β}, for all u, v ∈ Q,

(1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖²    (16)

where the parameter ε controls the accuracy of distance preservation and β controls the probability of success.

From the JL lemma and Theorem 1, it can be seen that the reduced dimension L does not depend on the dimension d of the original space, but mainly on the number of points n and the distortion parameter ε. This means that random projection is extremely suitable for applications with high dimension and limited samples. In addition, the projection matrix is nonadaptive and independent of the underlying data used for model building. This is different from PCA, where a complex eigendecomposition of the high-dimensional X^T X must be solved. It is worth noting that the reduced dimension L may be larger than d if the number of samples n is large enough or ε is very small.

IV. RANDOM PROJECTION BASED kNN RULE

In this section, we propose a new fault detection method based on random projection and the k-nearest neighbor rule, which combines the advantages of random projection in distance preservation and of the kNN rule in fault detection.

A. Random Projection-Based kNN (RPkNN)

The details of the RPkNN algorithm are as follows.

• Dimension reduction by random projection
  – Construct the projection matrix R according to either (13) or (14).
  – Project X onto the random subspace (RS): T_RP = XR.
• Model building based on T_RP
  – Find the k nearest neighbors, using the Euclidean distance, for each sample in T_RP.
  – Calculate the average squared Euclidean distance between each sample and its k nearest neighbors:

    D̄_i² = (1/k) Σ_{j=1}^{k} d̄_{i,j}²    (17)

    where d̄_{i,j}² denotes the squared distance between sample i and its jth nearest neighbor.
  – Determine the threshold D̄_α² (control limit). The threshold is chosen as the (1 − α)-empirical quantile of the D̄_i²:

    D̄_α² = D̄²_{(⌊n(1−α)⌋)}    (18)

    where D̄²_{(i)}, i = 1, …, n, is the rearrangement of D̄_i², i = 1, …, n, in descending order, and ⌊n(1−α)⌋ represents the integer part of n(1 − α).

• Fault detection
  For a new sample y,
  – Project y onto the subspace: t_y = R^T y.
  – Calculate the kNN distance D̄_{t_y}² of t_y using (17).
  – If D̄_{t_y}² > D̄_α², then y is classified as a faulty sample; otherwise, y is a normal sample.

Note that:
1) For a given set of samples, the dimension of the random subspace depends on the distortion parameter ε. If a small value of ε is chosen, the degree of distortion is low and the performance of kNN is well preserved; however, a small ε (large L) results in less effective dimension reduction. In contrast, a large value of ε results in a higher degree of distortion, which in turn affects the performance of kNN, though the dimension can be reduced extensively in this case. It is also worth noting that ε relates to the worst case: the biggest distortion of any pairwise distance is no larger than ε. That is to say, in order to preserve all pairwise distances, a small value of ε may be forced by a few outliers that deviate from the other samples. At least the distortion can be controlled by random projection, which makes it superior to PCA.
2) The nonadaptive design of the random projection matrix R may be useful for the problem of model migration, where there are not enough data for learning a reliable projection model.

B. Choice of the Reduced Dimension L

In this subsection, we discuss how to select an appropriate L. For the purpose of fault detection, the optimal L is the one that reduces the computational complexity as much as possible while simultaneously guaranteeing the performance of fault detection.

It is indicated in (12) that the parameter ε determines the degree to which pairwise distances are retained, which further affects the chosen neighbors of the online test sample. This means that the chosen neighbors of a test sample in the original space will not be identical to those in the random subspace if an inappropriate ε is selected. Hence, a faulty sample detected in the original space by kNN may be recognized as a normal sample in the subspace, and vice versa. Therefore, we want to determine ε under the criterion of maintaining the same neighbors of the test sample in both the original space and the random subspace; L can then be determined according to (12).

Assume that all the pairwise distances in the training set are collected as

D = ⎡ d_{1,(1)}  ⋯  d_{1,(k)}  d_{1,(k+1)}  ⋯  d_{1,(n−1)} ⎤
    ⎢     ⋮            ⋮            ⋮               ⋮       ⎥
    ⎢ d_{i,(1)}  ⋯  d_{i,(k)}  d_{i,(k+1)}  ⋯  d_{i,(n−1)} ⎥
    ⎢     ⋮            ⋮            ⋮               ⋮       ⎥
    ⎣ d_{n,(1)}  ⋯  d_{n,(k)}  d_{n,(k+1)}  ⋯  d_{n,(n−1)} ⎦    (19)


where d_{i,(j)}, j = 1, …, n − 1, represents the distance between the ith sample and its jth nearest neighbor in the training set, excluding the ith sample itself. We denote by N_index^i the index set of the k nearest neighbors of the ith sample.

Furthermore, the distance matrix and the index sets in the projection subspace are denoted by D̄ = [d̄_{i,(j)}] and N̄_index^i, respectively. The following theorem gives a condition on ε under which the chosen neighbors of each sample are identical in the original space and the random subspace, namely N_index^i = N̄_index^i.

Theorem 2: Let Q be an arbitrary set of n points in R^d, and let the distance matrix D contain the Euclidean distances between any two samples in Q. Let N_index^i represent the index set of the k nearest neighbors of the ith sample. For a random matrix R ∈ R^{d×L} generated according to (13) or (14), project these n points onto the random subspace using R. Let D̄ = [d̄_{i,(j)}] and N̄_index^i denote the distance matrix and the index set of each sample in the projection subspace, respectively. If ε satisfies

ε ≤ ( min_{i∈{1,…,n}} d_{i,(k+1)}/d_{i,(k)} − 1 ) / ( min_{i∈{1,…,n}} d_{i,(k+1)}/d_{i,(k)} + 1 )    (20)

then

N̄_index^i = N_index^i, i = 1, …, n    (21)

The proof is given in the appendix.

Remark 1: Once ε is calculated according to the result of Theorem 2, L can be further determined. It is worth noting that L should be calculated from (12) with n + 1 points, in order to account for the effect of the random matrix on the incoming test sample.

Remark 2: The bound on ε in Theorem 2 is slightly relaxed. It only guarantees that the sets of neighbors of each sample in the original space and the subspace are the same, not the ordering of the k nearest neighbors.

Remark 3: The JL lemma is based on the worst case, which guarantees that the distance distortion between any two samples is less than or equal to ε, so the bound on ε in Theorem 2 is also derived from the worst case. This means that the value of ε may be very small due to an extreme case, which leads to trivial results (i.e., the dimension of the random subspace may be larger than that of the original space, L > d). However, the performance of the kNN rule in the random subspace may not degrade much, despite the conservative bound on ε.

V. INDUSTRY EXAMPLE

In this section, a benchmark industrial data set is used to demonstrate the performance of the proposed fault detection method. Three relevant methods are compared in this experiment.

A. Data Description

TABLE I
INDUCED FAULTS

TABLE II
PROCESS VARIABLES USED FOR MONITORING

The data are collected from an Al stack etch process performed on a commercially available Lam 9600 plasma etch tool [6], [35]. The goal of this process is to etch the TiN/Al-0.5% Cu/TiN/oxide stack with an inductively coupled BCl3/Cl2 plasma. The data consist of 108 normal wafers and 21 faulty wafers, which were intentionally induced by changing the values of the variables shown in Table I. These data were obtained from three experiments (in February, March, and April 1996, respectively). Due to a large amount of missing data in two batches (one each in the normal and fault sets), only 107 normal wafers and 20 faulty wafers are used in this case study. The standard etch process consists of six steps; similar to [1], only the samples from steps 4 and 5 are used, and 17 non-setpoint process variables³ (see Table II) are used for fault detection in the experiments.

³Two variables (bias RF reflected power and TCP reflected power) remain almost zero during the batch.

B. Data Preprocessing

In order to maximize the level of automation in fault detection for industrial applications, the relevant fault detection methods are compared with minimal data preprocessing in the experiments [1], [3]. Firstly, equal-length batch records are obtained by removing the initial five sample points, so that the effect of initial fluctuations is eliminated, and by keeping 85 sample points in order to accommodate the shorter batches among all the batches. The equal-length batch array is then unfolded, and the obtained two-dimensional data matrix is scaled to zero mean and unit variance for each variable. The same preprocessing is done for all training, validation and test data.

C. Results of Fault Detection

In order to reduce the effect of randomness on the results, the experiments are repeated 100 times. In each experiment, the 107 normal wafers are further randomly separated into two parts: 97 wafers for training and 10 for validation. This means a different training set is used to build the model in each experiment. The parameter settings are as follows: the number of neighbors is k = 3 and the confidence level is set


TABLE III
FAULT D ETECTION BY T HREE M ETHODS ON VALIDATION S ET

TABLE IV
FAULT D ETECTION R ATES OF T HREE M ETHODS

Fig. 3. PC-kNN with PCs ∈ {1, 2, . . . , 96}.4

to be 95% for all three methods, and the number of principal components is l = 3 for PC-kNN; these two parameters are selected to be the same as in [3]. The distortion degree is ε = 0.4 (L = 469) and β = 1 (i.e., 1 − n^(−β) = 0.9993) for RPkNN.

The results of the three methods on the validation and test sets are shown in Tables III and IV. The total false alarm rate of FD-kNN is higher than that of the other two methods in Table III. The entries in the last three columns of Table IV represent the total detection rate of each fault over the 100 experiments. It can be seen that the faulty batches detected by FD-kNN and PC-kNN differ from each other, though the total fault detection rates are similar. Some batches determined as faulty (e.g., fault 11, fault 14) or normal (e.g., fault 5, fault 6) by FD-kNN in the original space are recognized as normal or faulty, respectively, by PC-kNN in the principal component subspace. Because the distortions of pairwise distances after projection by PCA are not controlled, the performance of the kNN rule for fault detection cannot be guaranteed in the PCS. Therefore, we cannot confidently use PC-kNN as a reliable fault detection method for semiconductor processes.

On the other hand, the results of the proposed RPkNN are almost the same as those given by FD-kNN, except that fault 9 is detected by RPkNN in some cases. This means that the fault detection performance of RPkNN is nearly unchanged after dimension reduction compared with FD-kNN, while the computational complexity and storage space are reduced, which is consistent with the earlier analysis.

In addition, since the information (variance) retained in the PCS by three principal components is only 10.6%, it is unlikely (or at least no information shows) that the other 89.4% of the information discarded in the residual subspace is all noise, which was used in [3] as evidence to explain why PC-kNN has nearly the same performance as FD-kNN.

To further inspect the effects of the number of PCs on PC-kNN and of ε on RPkNN, we compare both methods under various parameter values. The results are shown in Figs. 3 and 4. The blue circle line in Fig. 3 represents the contribution rate (the ratio between the sum of the first few largest eigenvalues and the total sum of all eigenvalues) under different numbers of retained principal components. The red triangular line represents the total detection rate of the 20 faults in the experiments repeated 100 times. As the number of PCs (the retained information) increases, the total fault detection rate of PC-kNN gradually decreases instead. It is hard to explain why the fault detection rate is higher when fewer PCs (less information) are retained, and we can hardly learn how to select an appropriate number of PCs from this experiment. Note that the contribution-rate line in Fig. 3 is almost linear; this means that the correlation between variables is weak, so PCA may not be suitable for dimension reduction in this case.

In Fig. 4, the blue circle line represents the dimension ratio (i.e., L/d) under different values of ε. It is reasonable that as the value of ε increases and the distance distortion becomes larger, the total fault detection rate of RPkNN gradually decreases. Although the degree of distortion is high when ε approaches 1, the total fault detection rate can still be maintained around 70%. The fluctuation of the total fault detection rate is due to the effect of randomness. Note that the dimension ratio is larger than 1 when ε = 0.2; in this case, the total fault detection rate is the same as that of FD-kNN. It is also worth noting that if ε is determined by (20), then the trivial result (i.e., L > d, no dimension reduction) is obtained. The reason is that the distances between some pairs of samples in the training set are too close to each other, thus resulting in an extremely small value of ε.

D. Comparison of Distance Distortion

In this subsection, the distance distortions caused by PCA and by random projection in the dimension-reduction stage of the experiments in Section V-C are compared. The degree of

4 The rank of X is 97.


Fig. 4. RPkNN with ε ∈ [0.2, 0.9].

Fig. 5. Boxplot of distance distortion by PC-kNN.

Fig. 6. Boxplot of distance distortion by RPkNN on training data.

Fig. 7. Boxplot of distance distortion by RPkNN on validation data.

distortion (relative error) is used as the metric of distance distortion:

    (d_{i,j}^p − d_{i,j}) / d_{i,j}    (22)

where d_{i,j} represents the distance between the ith and jth samples, and d_{i,j}^p represents the distance between the corresponding two samples in the projection subspace.
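The relative-error metric in (22) can be sketched as follows, using a simple ±1 random projection matrix of the database-friendly type described in [34]; the sample sizes and dimensions here are illustrative assumptions, not the etch data:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d, L = 30, 500, 100                 # samples, original dim, reduced dim

X = rng.standard_normal((n, d))

# Simple ±1 random projection matrix [34], scaled so that squared
# distances are preserved in expectation.
R = rng.choice([-1.0, 1.0], size=(d, L)) / np.sqrt(L)
Xp = X @ R

# Relative distortion per (22): (d^p_ij - d_ij) / d_ij for every pair.
distortions = []
for i, j in combinations(range(n), 2):
    d_ij = np.linalg.norm(X[i] - X[j])     # distance in original space
    dp_ij = np.linalg.norm(Xp[i] - Xp[j])  # distance after projection
    distortions.append((dp_ij - d_ij) / d_ij)
distortions = np.array(distortions)
print(distortions.min(), distortions.max())  # concentrated near zero
```

For a moderate reduced dimension L, the distortions cluster tightly around zero, which is the behavior the boxplots in Figs. 6–8 illustrate for RPkNN.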
Fig. 8. Boxplot of distance distortion by RPkNN on test data.

The experiments are conducted on the training, validation, and test data sets; over the 100 repeated experiments, they amount to C(97, 2) × 100 = 465600 pairwise distances on the training set, 97 × 10 × 100 = 97000 on the validation set, and 97 × 20 × 100 = 194000 on the test set, respectively. The boxplots of these results are shown in Figs. 5–8. It can be seen in Fig. 5 that the distance distortion by PCA on the training data is mainly concentrated in [−0.4, −0.3], which means the pairwise distances change by nearly 35% after projection. It also implies that the threshold (control limit) will decrease by about 35% compared with the threshold constructed in the original space. The distance distortion by PCA on the test data is mainly concentrated in [−0.6, −0.35], which is larger than that on the training data. Thus, the kNN distances of some faulty batches (which should deviate from the normal batches) may fall below the threshold. This may explain why some faulty batches are undetected by kNN in the principal component subspace.

From Figs. 6–8, the distance distortions caused by random projection under different values of ε are mainly concentrated in [−0.02, 0.02] on the training, validation, and test data. This means the changes caused by random projection are very limited. It also explains why an acceptable total fault detection rate can be obtained with a large ε. Hence, the distortion parameter ε = 0.4 is a conservative choice in the experiments in Section V-C.

In summary, the degree of distance distortion caused by PCA and random projection in this experiment further explains


TABLE V
COMPARISON OF COMPUTATION SPEED

why dimension reduction by random projection is superior to PCA when combined with the kNN-based fault detection method.

E. Comparison of Computation Time

In this subsection, we compare the computation time of the three methods. The experiments are carried out on a desktop computer with the following hardware environment: dual-core processor (3.0 GHz) and 1.5 GB RAM; the software environment is Windows XP and Matlab R2011b (7.13). The parameter settings are the same as in Section V-C (i.e., l = 3, ε = 0.4, and k = 3). The results are given in Table V.

In the stage of model building, we find that PC-kNN actually does not reduce the computation time (compared with FD-kNN) due to the cost of the SVD used to calculate the eigenvectors. In contrast, the cost of generating a simple random projection matrix (e.g., (13) or (14)), which avoids floating-point multiplication, is low in model building with RPkNN. In the stage of online monitoring, the average computation time for processing a test sample by RPkNN is less than that of FD-kNN. The cost of PC-kNN is the smallest among the three methods because few principal components (l = 3) are used in this experiment. However, as analyzed in the previous experiments, PC-kNN cannot guarantee the fault detection performance. The proposed RPkNN reduces the computational complexity while guaranteeing the fault detection performance.

VI. CONCLUSION

In this paper, a fault detection method has been proposed by combining random projection and the k nearest neighbor rule. The proposed method not only reduces the computational complexity and storage space, but also approximately preserves the advantages of the kNN rule in dealing with the problems of multimode batch trajectories and nonlinearity that often coexist in semiconductor processes.

It is worth noting that the idea of using random projection to reduce the computation of kNN classification is not new. However, to our knowledge, the idea of combining random projection with kNN for fault detection (one-class classification) of semiconductor processes has not been published before. The proposed method should be seen as an alternative fault detection method; it is not superior to PCA in all cases. Some interesting directions for future work are:

• To exploit the ability of kNN for isolating faulty variables and estimating the fault magnitude. Once a fault is detected, the next step is to identify the faulty variables which are attributed to this fault. One can also define the variable contribution for the kNN distance in the light of the idea of contribution plots. The estimation of the fault magnitude is useful for fault-tolerant control; the distances from normal samples are, to a certain extent, related to the fault magnitude.
• To apply RPkNN to other batch or continuous processes. The proposed method is not limited to semiconductor processes; it can also be applied to other high-dimensional batch processes and continuous processes.

APPENDIX

In this section, we present the proof of Theorem 2.
Proof: We first prove that, for an ε in (0, 1), if the following condition is satisfied:

    (1 + ε) d_{i,(k)} ≤ (1 − ε) d_{i,(k+1)},  i = 1, . . . , n    (23)

then we can derive the conclusion that N̄_index^i = N_index^i, i = 1, . . . , n. From (23) and Theorem 1, we have

    d̄_{i,(p)} ≤ (1 + ε) d_{i,(k)} ≤ (1 − ε) d_{i,(k+1)} ≤ d̄_{i,(q)}    (24)

where d̄_{i,(p)}, p = 1, . . . , k, and d̄_{i,(q)}, q = k + 1, . . . , n, represent the distances between the ith sample and the other samples in the random subspace. This means that the k nearest neighbors of the ith sample in the random subspace are the same as those in the original space. So we get

    N̄_index^i = N_index^i,  i = 1, . . . , n    (25)

Next, we derive ε from the condition given in (23):

    (1 + ε) d_{i,(k)} ≤ (1 − ε) d_{i,(k+1)},  i = 1, . . . , n
    ⇒ (1 + ε)/(1 − ε) ≤ d_{i,(k+1)}/d_{i,(k)},  i = 1, . . . , n  (ε ∈ (0, 1), d_{i,(k)} ≠ 0)
    ⇒ (1 + ε)/(1 − ε) ≤ min_{i∈{1,...,n}} d_{i,(k+1)}/d_{i,(k)}
    ⇒ ε ≤ ( min_{i∈{1,...,n}} d_{i,(k+1)}/d_{i,(k)} − 1 ) / ( min_{i∈{1,...,n}} d_{i,(k+1)}/d_{i,(k)} + 1 )    (26)

and this concludes the proof.

ACKNOWLEDGMENT

The authors would like to thank Prof. Z. Song, Prof. Z. Ge, and Prof. C. Zhao at Zhejiang University, and X. Zhang at Shenyang University of Chemical Technology for valuable discussions.

REFERENCES

[1] Q. P. He and J. Wang, “Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes,” IEEE Trans. Semicond. Manuf., vol. 20, no. 4, pp. 345–354, Nov. 2007.
[2] Q. P. He and J. Wang, “Principal component based k-nearest-neighbor rule for semiconductor process fault detection,” in Proc. Amer. Control Conf., Seattle, WA, USA, Jun. 2008, pp. 1606–1611.
[3] Q. P. He and J. Wang, “Large-scale semiconductor process fault detection using a fast pattern recognition-based method,” IEEE Trans. Semicond. Manuf., vol. 23, no. 2, pp. 194–200, May 2010.
[4] Z. Zhou, C. Yang, and C. Wen, “Random projection based k nearest neighbor rule for semiconductor process fault detection,” in Proc. 33rd Chin. Control Conf. (CCC), Nanjing, China, Jul. 2014, pp. 3169–3174.


[5] B. M. Wise, N. B. Gallagher, and E. B. Martin, “Application of PARAFAC2 to fault detection and diagnosis in semiconductor etch,” J. Chemometr., vol. 15, no. 4, pp. 285–298, 2001.
[6] B. M. Wise, N. B. Gallagher, S. W. Butler, D. D. White, and G. G. Barna, “A comparison of principal component analysis, multiway principal component analysis, trilinear decomposition and parallel factor analysis for fault detection in a semiconductor etch process,” J. Chemometr., vol. 13, nos. 3–4, pp. 379–396, 1999.
[7] G. A. Cherry and S. J. Qin, “Multiblock principal component analysis based on a combined index for semiconductor fault detection and diagnosis,” IEEE Trans. Semicond. Manuf., vol. 19, no. 2, pp. 159–172, May 2006.
[8] Z. Q. Ge and Z. H. Song, “Semiconductor manufacturing process monitoring based on adaptive substatistical PCA,” IEEE Trans. Semicond. Manuf., vol. 23, no. 1, pp. 99–108, Feb. 2010.
[9] Q. P. He and J. Wang, “Statistics pattern analysis: A new process monitoring framework and its application to semiconductor batch processes,” AIChE J., vol. 57, no. 1, pp. 107–121, 2011.
[10] S. W. Choi, C. Lee, J.-M. Lee, J. H. Park, and I.-B. Lee, “Fault detection and identification of nonlinear processes based on kernel PCA,” Chemometr. Intell. Lab. Syst., vol. 75, no. 1, pp. 55–67, 2005.
[11] Z. Q. Ge, C. J. Yang, and Z. H. Song, “Improved kernel PCA-based monitoring approach for nonlinear processes,” Chem. Eng. Sci., vol. 64, no. 9, pp. 2245–2255, 2009.
[12] Z. Q. Ge and Z. H. Song, “Mixture Bayesian regularization method of PPCA for multimode process monitoring,” AIChE J., vol. 56, no. 11, pp. 2838–2849, 2010.
[13] C. H. Zhao, Y. Yao, F. R. Gao, and F. L. Wang, “Statistical analysis and online monitoring for multimode processes with between-mode transitions,” Chem. Eng. Sci., vol. 65, no. 22, pp. 5961–5975, 2010.
[14] X. Q. Liu, L. Xie, U. Kruger, T. Littler, and S. Q. Wang, “Statistical-based monitoring of multivariate non-Gaussian systems,” AIChE J., vol. 54, no. 9, pp. 2379–2391, 2008.
[15] S. X. Ding, Model-Based Fault Diagnosis Techniques: Design Schemes, Algorithms, and Tools, 2nd ed. Berlin, Germany: Springer, 2013.
[16] D. Fradkin and D. Madigan, “Experiments with random projections for machine learning,” in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., Washington, DC, USA, Aug. 2003, pp. 517–522.
[17] Q. F. Shi, C. H. Shen, R. Hill, and A. van den Hengel, “Is margin preserved after random projection?” in Proc. 29th Int. Conf. Mach. Learn., Edinburgh, U.K., 2012, pp. 591–598.
[18] C. Boutsidis, A. Zouzias, and P. Drineas, “Random projections for κ-means clustering,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2010, pp. 298–306.
[19] E. Bingham and H. Mannila, “Random projection in dimensionality reduction: Applications to image and text data,” in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., San Francisco, CA, USA, Aug. 2001, pp. 245–250.
[20] A. Eftekhari, M. Babaie-Zadeh, and H. A. Moghaddam, “Two-dimensional random projection,” Signal Process., vol. 91, no. 7, pp. 1589–1603, 2011.
[21] E. Candes and J. Romberg, “Practical signal recovery from random projections,” in Proc. SPIE Int. Symp. Electron. Imag. Comput. Imag. III, 2005, pp. 76–86.
[22] E. Candes and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?” IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.
[23] J. E. Jackson and G. S. Mudholkar, “Control procedures for residuals associated with principal component analysis,” Technometrics, vol. 21, no. 3, pp. 341–349, 1979.
[24] S. J. Qin, “Statistical process monitoring: Basics and beyond,” J. Chemometr., vol. 17, nos. 8–9, pp. 480–502, 2003.
[25] T. Denoeux, “A k-nearest neighbor classification rule based on Dempster–Shafer theory,” IEEE Trans. Syst., Man, Cybern., vol. 25, no. 5, pp. 804–813, May 1995.
[26] J. M. Keller, M. R. Gray, and J. A. Givens, “A fuzzy k-nearest neighbor algorithm,” IEEE Trans. Syst., Man, Cybern., vol. SMC-15, no. 4, pp. 580–585, Jul. 1985.
[27] H. B. Shen and K. C. Chou, “Using optimized evidence-theoretic k-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types,” Biochem. Biophys. Res. Commun., vol. 334, no. 1, pp. 288–292, 2005.
[28] Y. C. Lee, “Handwritten digit recognition using k nearest-neighbor, radial-basis function, and backpropagation neural networks,” Neural Comput., vol. 3, no. 3, pp. 440–449, 1991.
[29] C. Schmidt et al., “Fault detection and classification (FDC) for a via etching process,” in Proc. 5th Eur. AEC/APC Conf., Dresden, Germany, Apr. 2004.
[30] W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” in Proc. Conf. Mod. Anal. Probab., vol. 26. New Haven, CT, USA, 1984, pp. 189–206.
[31] D. Sivakumar, “Algorithmic derandomization via complexity theory,” in Proc. 34th Annu. ACM Symp. Theory Comput., Montreal, QC, Canada, May 2002, pp. 619–626.
[32] P. Frankl and H. Maehara, “The Johnson-Lindenstrauss lemma and the sphericity of some graphs,” J. Comb. Theory B, vol. 44, no. 3, pp. 355–362, 1988.
[33] S. Dasgupta and A. Gupta, “An elementary proof of a theorem of Johnson and Lindenstrauss,” Random Struct. Algorithms, vol. 22, no. 1, pp. 60–65, 2003.
[34] D. Achlioptas, “Database-friendly random projections,” in Proc. 20th ACM SIGMOD-SIGACT-SIGART Symp. Prin. Database Syst., Montreal, QC, Canada, May 2001, pp. 274–281.
[35] B. M. Wise. (1999). Metal Etch Data for Fault Detection Evaluation. [Online]. Available: http://software.eigenvector.com/Data/Etch/index.html

Zhe Zhou received the B.E. and M.S. degrees from the School of Automation, Hangzhou Dianzi University, Hangzhou, China, in 2009 and 2012, respectively. He is currently pursuing the Ph.D. degree with the Department of Control Science and Engineering, Zhejiang University, Hangzhou. His current research interests are data-driven fault diagnosis and its applications in industry.

Chenglin Wen (M’10) received the bachelor’s and master’s degrees in mathematics from Henan University, Kaifeng, China, and Zhengzhou University, Zhengzhou, China, and the Ph.D. degree from Northwestern Polytechnical University, Xi’an, China, in 1986, 1996, and 1999, respectively. He is currently a Professor and the Chair with the Institute of Systems Science and Control Engineering, School of Automation, Hangzhou Dianzi University, Hangzhou, China. His current research interests include multisensor networked information fusion theory, multitarget tracking, fault diagnosis of complex systems and devices, reliability assessment and health control, and recognition and tracking of hypersonic vehicles. He is currently a Committee Member of the Intelligent Automation Committee and the Process Fault Diagnosis and Security Committee of the Chinese Association of Automation.

Chunjie Yang received the Ph.D. degree in control theory and engineering from Zhejiang University, Hangzhou, China, in 1998. He is currently a Professor with the Department of Control Science and Engineering, Zhejiang University. His research interests include system modeling, control, and fault diagnosis of industrial processes, soft sensor technology, and implementation for complex industrial systems.

