
Journal of Manufacturing Processes 77 (2022) 190–206


Chatter detection in turning using machine learning and similarity measures of time series via dynamic time warping

Melih C. Yesilli a,*, Firas A. Khasawneh a, Andreas Otto b

a Department of Mechanical Engineering, Michigan State University, East Lansing, MI 48824, USA
b Fraunhofer Institute for Machine Tools and Forming Technology IWU, Reichenhainer Straße 88, 09126 Chemnitz, Germany

ARTICLE INFO

Keywords:
Chatter detection
Dynamic time warping
Time series analysis
Transfer learning
Turning
Parallel computing
Approximate and eliminate search algorithm

ABSTRACT

Chatter detection from sensor signals has been an active field of research. While some success has been reported using several featurization tools and machine learning algorithms, existing methods have several drawbacks, including the need for data pre-processing by an expert. In this paper, we present an alternative approach for chatter detection based on the K-Nearest Neighbor (KNN) algorithm for classification and Dynamic Time Warping (DTW) as a time series similarity measure. The time series used are the acceleration signals acquired from the tool holder in a series of turning experiments. Our results show that this approach achieves detection accuracies that can outperform existing methods, and it does not require data pre-processing. We compare our results to the traditional methods based on the Wavelet Packet Transform (WPT) and the Ensemble Empirical Mode Decomposition (EEMD), as well as to the more recent Topological Data Analysis (TDA) based approach. We show that in two out of four cutting configurations our DTW-based approach is within the error range of the highest accuracy or attains the highest classification rate, reaching in one case as high as 98% accuracy. Moreover, we combine the Approximate and Eliminate Search Algorithm (AESA) and parallel computing with the DTW-based approach to achieve chatter classification in less than 2 s, thus making our approach applicable for online chatter detection.

1. Introduction

Machine tool chatter manifests as excessive vibrations of the cutting tool or the workpiece during machining. Cutting tool life shortens during cutting experiments due to chatter, tool wear [1], and the adhesion of the workpiece to the cutting tool [2]. The chatter phenomenon also often leads to poor surface finish. Therefore, various methods for chatter prediction and mitigation have been proposed in the past several decades, including increasing the stiffness of the machine tool, passive and active damping techniques as well as changing the spindle speed [3]. While chatter prediction tools are important for process planning, cutting parameters often drift during the process, which necessitates utilizing sensor signals to detect chatter in a practical setting.

Traditionally, chatter detection tools combine methods from signal processing, such as the Wavelet Packet Transform (WPT) and Empirical Mode Decomposition (EMD) or Ensemble Empirical Mode Decomposition (EEMD), with machine learning algorithms. While the support vector machine (SVM) is the most widely adopted classification algorithm for chatter detection in the literature, other successful classifiers for chatter diagnosis include neural network classification [4], logistic regression [5], Quadratic Discriminant Analysis [6] and Hidden Markov Models [7]. For example, WPT is combined with Recursive Feature Elimination (RFE) for chatter detection in milling processes [8,9]. Tangjitsitcharoen et al. monitored cutting force signals and used the wavelet transform to detect chatter in a ball-end milling process [10]. Tran et al. utilized EEMD and then used the Hilbert-Huang Transform to analyze the data set. They introduced a fuzzy entropy approach for feature selection to improve the classification accuracy [11]. Ji et al. extracted informative Intrinsic Mode Functions (IMFs) from accelerometer signals in milling and used the standard deviation, power spectral entropy, and the fractal dimension as features in an SVM chatter classifier [12]. Reference [13] proposed a chatter detection algorithm based on the mean value and the standard deviation of the spectrum of the Hilbert-Huang Transform of IMFs. Chen et al. used the Fisher Discriminant Ratio for ranking of features obtained with EEMD [14]. In addition to WPT and EEMD, frequency spectrum analysis is also used for feature matrix generation. For instance, Thaler et al. proposed chatter diagnosis methods based on the analysis of sound signals in band sawing processes using the Short-Time Fourier Transform

* Corresponding author.
E-mail addresses: yesillim@egr.msu.edu (M.C. Yesilli), khasawn3@egr.msu.edu (F.A. Khasawneh), andreas.otto@iwu.fraunhofer.de (A. Otto).

https://doi.org/10.1016/j.jmapro.2022.03.009
Received 17 December 2021; Received in revised form 5 March 2022; Accepted 7 March 2022
Available online 18 March 2022
1526-6125/© 2022 The Society of Manufacturing Engineers. Published by Elsevier Ltd. All rights reserved.

[6]. Lamraoui et al. applied multi-band resonance filtering and envelope analysis to milling vibration signals [4]. Yesilli and Khasawneh combined the Fast Fourier Transform (FFT), Power Spectral Density (PSD), and Auto-correlation Function (ACF) with supervised classification algorithms to detect chatter in turning signals. They used the coordinates of the peaks of FFT, PSD, and ACF plots as features in classification algorithms [15]. The Fourier Transform is used in signal-based methods for chatter detection, and Liu et al. [16] combined signal-based and model-based methods to build a hybrid method for chatter detection. Wang et al. used the power spectrum and the Q-factor as a descriptor of chatter, and they combined these features with SVM [17]. Variational Mode Decomposition (VMD) is another method for chatter detection. For example, Liu et al. developed a method to automatically select the VMD parameters and to extract the corresponding features using signal energy entropy [18].

There are also chatter classification methods that do not rely on signal decomposition. For example, Tarng et al. utilized unsupervised neural networks with adaptive resonance theory [19]. Tangjitsitcharoen et al. proposed three different parameters which are based on the variance of the cutting force signals to diagnose different cutting states [20]. Fu et al. used a deep belief network with an automatic feature construction model based on unsupervised greedy layer-wise pre-training and supervised fine-tuning to monitor the state of milling processes [21]. Cherukuri et al. used an Artificial Neural Network (ANN) on synthetic turning data for chatter classification [22]. However, using ANN (or other black-box machine learning methods) requires large training sets. That amount of data may not always be available, especially in small-batch production processes which constitute a large portion of discrete manufacturing.

More recent methods to extract features from cutting vibration signals are based on Topological Data Analysis (TDA) [23–26]. For example, Yesilli et al. studied chatter classification accuracy in turning using various topological features extracted from persistence diagrams [27]. Specifically, it was found that Carlsson coordinates [28] and template functions [29] yield the highest overall accuracy rates. The generation of the persistence diagrams and the feature extraction can be done without requiring any user expertise. However, the transfer learning capabilities of these tools have not been tested, and reducing their computational time is still a topic of current research [30].

Although prior studies on chatter detection have shown some success, these tools typically share two main limitations: (1) training a classifier requires significant manual pre-processing of the data, and (2) the trained classifier is sensitive to the differences between the training set and the test set [31]. For example, in the WPT or EEMD method, the signal is decomposed into wavelet packets or IMFs, respectively. The pre-processing requires the selection of the informative wavelet packets or informative IMFs via choosing the packet or IMF that falls within the range of the chatter frequency. These informative packets or informative IMFs are used for extracting frequency and time features, which are often ranked with the Recursive Feature Elimination (RFE) method. Then, an incoming data stream can be classified based on these features and a classification algorithm such as SVM. This means that there are fundamental limitations if the chatter frequencies change significantly, for example, due to changing natural frequencies or changing process parameters. Specifically, Yesilli et al. assessed the transfer learning performance of each of WPT and EEMD [31]. Further, these methods require a level of skill for feature extraction and classifier training that precludes their wide adoption in chatter detection settings. Therefore, there is a need for an accurate machine learning algorithm for chatter diagnosis that can (1) be easily and automatically applied, and (2) be computed in a reasonable time.

In this paper, we describe a novel method for chatter identification that satisfies these conditions based on combining the K-Nearest Neighbor (KNN) classifier with time series similarity measures: Dynamic Time Warping (DTW) and the Approximate and Eliminate Search Algorithm (AESA). DTW has been used in many application domains including speech recognition [32–36], time series classification [37–39], and signature verification [40–43]. In this study, we combine DTW with AESA to detect chatter in signals obtained from turning experiments.

The paper is organized as follows. Section 2 describes the experimental setup. Section 3 details our approach which combines similarity measures of time series with machine learning for chatter detection. Section 3.2 briefly explains the K-Nearest Neighbor algorithm. Section 4 describes the Approximate and Eliminate Search Algorithm (AESA). Section 5 provides the results and comparisons with other methods, while our concluding remarks can be found in Section 6.

1.1. Our contribution

State-of-the-art methods for chatter detection that are based on choosing certain basis functions from signal decomposition require intense manual preprocessing and thus have low automation potential [15,31]. For example, the frequency spectrum of the signals obtained from wavelet packets and IMFs must be checked to choose the informative wavelet packets or IMFs. In contrast, our approach eliminates the feature extraction step since it does not depend on signal decomposition, but rather relies on computing pairwise distances between time series. All the steps in our approach can be performed automatically, and they only need two input parameters: the number of neighbors required for the KNN classifier, and the looseness constant (H) for the AESA algorithm.

Although this study focuses on turning as a use case, our approach is applicable to other machining processes where the data stream is in the form of time series. A comparison of the resulting chatter classification success rates between our DTW approach and other widely used methods in the literature shows that our approach has the highest average accuracy, or is within the error band of the highest accuracy, in two out of the four cutting configurations.

In terms of transfer learning performance, we show that our DTW approach has similar transfer learning capabilities to its counterparts but without needing any manual preprocessing. We also show how to drastically reduce the computation time by either combining our approach with the Approximate and Eliminate Search Algorithm (AESA), or with parallel computing. Although AESA has been widely used in word recognition, pattern recognition, and handwritten character recognition [44–48], to the best of our knowledge, our work is the first to combine AESA and DTW for analyzing engineering systems. In addition, our results obtained with AESA and parallel computing show that after training a classifier offline we can label an incoming time series in less than 2 s. Therefore, our approach is very conducive to online chatter detection applications.

2. Experimental setup and data preprocessing

This section describes the experimental setup, the preprocessing of the measurements, and the manual tagging of the data.

2.1. Experimental setup

Fig. 1 shows the experimental setup for the turning cutting tests. A Clausing-Gamet 33 cm (13 in.) engine lathe was utilized, and it was equipped with three accelerometers to measure vibration signals: two PCB 352B10 uniaxial accelerometers and a PCB 356B11 triaxial accelerometer. The uniaxial accelerometers were attached to an S10R-SCLCR3S boring bar, where the latter is part of a Grizzly T10439 carbide insert boring bar set. A titanium nitride coated insert with a radius of 0.04 cm (0.015 in.) was used. The distance between the uniaxial accelerometers and the cutting insert was set to 3.81 cm (1.5 in.) to protect the accelerometers from the cutting debris. The vibration signals were collected at 160 kHz using an NI USB-6366 data acquisition box and Matlab's data acquisition toolbox.

Experiments were performed for four different cutting configurations


Fig. 1. The experimental setup (left), and an illustration of the four overhang lengths (right).

based on the overhang distance, which is the distance between the backside of the cutting tool holder and the heel of the boring bar, see Fig. 1. Changing the overhang distance varies the stiffness of the boring rod, and consequently the corresponding chatter frequencies. The four overhang lengths are: 5.08 cm (2 in.), 6.35 cm (2.5 in.), 8.89 cm (3.5 in.) and 11.43 cm (4.5 in.). For each overhang length, we performed cutting tests using several combinations of spindle speeds and depths of cut. Moreover, a PCB 130E20 microphone and a Terahertz LT880 laser tachometer were also utilized to collect audio and once-per-revolution signals of the spindle, respectively. Note that audio signals were also collected using a microphone; however, the corresponding SNR was lower than its vibration counterpart, so we ended up using the latter in the analysis.

2.2. Preprocessing of experimental data

The sampling frequency for the experiment was 160 kHz. The reason for oversampling is that we did not use an in-line analog filter during the experiments. Instead, to avoid aliasing effects, a Butterworth low-pass filter of order 100 was used, and the data was subsequently downsampled to 10 kHz, which is the upper limit of the accelerometer's measuring range. The filtered and downsampled signals were used to label the data as explained in Section 2.3. Both the raw and filtered data are available for download in a Mendeley repository [49].

Before using the data for similarity analyses, there are two additional conditioning steps: 1) normalizing the time series to zero mean and a standard deviation of one. This normalization does not change the characteristics of the time series, but it is necessary to eliminate the effect of features with higher values on the cost functions in the classification algorithms [50]. And 2) subdividing the time series while maintaining the corresponding tagging. To decrease the computation time, the time series are subdivided into shorter segments whose lengths are nearly equal to 10,000 points.
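The sketch below outlines this conditioning chain under stated assumptions: the raw signal is a NumPy array sampled at 160 kHz, and the function and parameter names are illustrative rather than taken from the authors' scripts.

```python
import numpy as np
from scipy import signal

def preprocess(raw, fs_raw=160_000, fs_target=10_000, seg_len=10_000):
    # Low-pass below the new Nyquist frequency (5 kHz) to avoid aliasing;
    # the paper uses an order-100 Butterworth filter, designed here in
    # second-order sections for numerical stability.
    sos = signal.butter(100, fs_target / 2, btype="low", fs=fs_raw, output="sos")
    filtered = signal.sosfiltfilt(sos, raw)
    # Downsample 160 kHz -> 10 kHz by keeping every 16th point.
    x = filtered[:: fs_raw // fs_target]
    # Normalize to zero mean and unit standard deviation.
    x = (x - x.mean()) / x.std()
    # Subdivide into segments of roughly 10,000 points each.
    return np.array_split(x, max(1, len(x) // seg_len))
```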

2.3. Data tagging

While several sensors were used to collect data during cutting, only the x-axis vibration signal of the tri-axial accelerometer was used. This is because most of the acceleration signals obtained from the other accelerometers were found to be redundant, and the x-axis vibration signals of the tri-axial accelerometer had the best Signal-to-Noise Ratio (SNR). The data was labeled using four different tags: stable or chatter-free (s), intermediate chatter (i), chatter (c), and unknown (u). Note that the same cutting signal can contain regions with different labels as shown in Fig. 2.

When tagging the time series manually, both time and frequency domain characteristics were considered. In the time domain, the raw data was divided according to its amplitude. In the frequency domain, we only considered frequencies below 5 kHz. Using this information, tagging was performed by looking at the peaks in the time and frequency domain. If the signal has low peaks in both domains, then it is labeled as stable or chatter-free (s). These time series also have high peaks at the spindle rotation frequency (see Fig. 2b) [51]. Time series with intermediate chatter have low peaks in the time domain, but they have high peaks in the frequency domain. On the other hand, a time series was labeled as a chatter signal if it had high peaks in both domains. Any signal that did not fit any of the above criteria was labeled as unknown (u). The resulting tags were verified by spot-checking the corresponding surface finish of the workpiece as shown using the example photos in Fig. 2. The number of time series belonging to each label for all overhang cases is provided in Table 1.

Table 1
Number of time series in each label for all overhang cases.

Overhang length (cm (inch))   Stable   Mild chatter   Chatter   Total
5.08 (2)                        443         31           118      592
6.35 (2.5)                       82         33            13      128
8.89 (3.5)                       55          5             6       66
11.43 (4.5)                     114         14            48      176
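As a rough illustration only, the sketch below encodes these peak-based rules for a single 10 kHz segment; the amplitude thresholds t_thresh and f_thresh are hypothetical placeholders, since in the study the tagging was performed manually and verified against the surface finish.

```python
import numpy as np

def tag_segment(x, fs=10_000, t_thresh=1.0, f_thresh=1.0):
    # Spectrum restricted to the band considered during tagging (< 5 kHz).
    freq = np.fft.rfftfreq(len(x), d=1 / fs)
    spec = np.abs(np.fft.rfft(x))
    band = spec[freq < 5_000]
    high_time = np.max(np.abs(x)) > t_thresh   # high peaks in the time domain?
    high_freq = np.max(band) > f_thresh        # high peaks in the frequency domain?
    if high_time and high_freq:
        return "c"   # chatter: high peaks in both domains
    if high_freq:
        return "i"   # intermediate chatter: low time-domain, high frequency-domain peaks
    if not high_time:
        return "s"   # stable / chatter-free: low peaks in both domains
    return "u"       # anything else is left unknown
```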
3. Similarity-based method for chatter detection using Dynamic Time Warping (DTW) and K-Nearest Neighbor (KNN)

This section describes a novel method for chatter detection using the similarity between time series. We use Dynamic Time Warping (DTW) for measuring similarity. We first define DTW in Section 3.1. We then describe how DTW generates similarity matrices that can be used for machine learning in Section 3.3.

Fig. 2. (a) Sample tagged time series and some samples of the resulting surface finish. The right panel shows a comparison of the frequency spectra between a signal
labeled as chatter versus (b) no chatter, (c) intermediate chatter, and (d) unknown.

3.1. Dynamic Time Warping (DTW)

Dynamic Time Warping is an algorithm that is capable of measuring the distance or similarity between two time series even if they have dissimilar lengths. Let TS1 and TS2 be two time series with elements xi and yj whose lengths are m and n, respectively:

$$TS_1 = x_1, x_2, \ldots, x_i, \ldots, x_m,$$
$$TS_2 = y_1, y_2, \ldots, y_j, \ldots, y_n.$$

Berndt and Clifford state that the warping path wk = (xi(k), yj(k)) between two time series can be represented by mapping the corresponding elements of the time series on an m × n matrix (see Fig. 3 for a warping path example) [52]. The warping path is composed of the points wk which indicate alignment between the elements xi(k) and yj(k) of the time series. The length L of the warping path fulfills the constraints m ≤ L ≤ n, where we assume that n ≥ m. For instance, w3 in Fig. 3b corresponds to the alignment of x2 and y3. In general, warping paths are not unique and several warping paths can be generated for the same two time series. For two different time series, the DTW algorithm chooses the warping path that gives the minimum distance between the element pairs under certain constraints. While there are several options for computing the distance between a pair (xi, yj) of elements of the time series, in this implementation, the Manhattan distance d(xi, yj) = ‖xi − yj‖1 is used. The minimization of the distance between TS1 and TS2 in the DTW algorithm can then be written according to [52] such that

$$DTW(TS_1, TS_2) = \min\left( \sum_{k=1}^{L} d(w_k) \right). \tag{1}$$

There are several restrictions to define the optimum warping path. These are monotonicity, continuity, adjustment window condition, slope constraint, and boundary conditions. These restrictions are applied on the alignment window to reduce the possible number of warping paths since there is an excessive number of possibilities for warping paths without any constraint [53].

• Monotonicity: The indices i and j should always either increase or stay the same such that i(k) ≥ i(k − 1) and j(k) ≥ j(k − 1).
• Continuity: The indices i and j can only increase at most by one such that i(k) − i(k − 1) ≤ 1 and j(k) − j(k − 1) ≤ 1.
• Boundary condition: The warping paths should start where i and j are equal to 1 and should end where i = n and j = m.
• Adjustment window condition: The warping path with minimum distance is searched on a restricted area of the alignment window to avoid a significant timing difference between the two paths [53]. The restricted area is given by i − r ≤ j ≤ i + r.
• Slope constraint: This condition avoids significant movement in one direction [52]. After a steps in the horizontal or vertical direction, the path cannot move in the same direction without having b steps in the diagonal direction [53]. The effective intensity of the slope constraint can be defined as P = b/a. We chose P = 1, which was reported as an optimum value in an experiment on speech recognition [53].

Fig. 3. DTW alignment (a) and warping path (b) for two different time series.


In this paper, the distances between the time series are computed using the cDTW package. There is another widely adopted algorithm named FastDTW [54]. The time complexity of the FastDTW algorithm is given as N(8r + 14), while the cDTW package has a time complexity of rN [54,55]. Here, N is the number of points in the time series and r is a parameter for the adjustment window condition explained above. Both packages have similar time complexity, which is O(N).
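To make the recursion behind Eq. (1) concrete, the following is a minimal pure-Python sketch of banded DTW with the Manhattan distance and the adjustment window |i − j| ≤ r. It omits the slope constraint for brevity and only stands in for the compiled cDTW routine used in this study.

```python
import numpy as np

def dtw_distance(ts1, ts2, r=100):
    # Assumes the length difference of the two series is within the band r;
    # otherwise no warping path satisfies the window condition.
    m, n = len(ts1), len(ts2)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        # Adjustment window condition: restrict the search to i - r <= j <= i + r.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = abs(ts1[i - 1] - ts2[j - 1])   # Manhattan distance d(x_i, y_j)
            # Monotonicity and continuity: a step extends a path coming from
            # (i-1, j), (i, j-1), or (i-1, j-1).
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]   # boundary condition: the path ends at (m, n)
```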
chatter-free labels to the test sample. After repeating these processes for
3.2. K-Nearest Neighbor (KNN) each of the test samples, we compare the predicted labels with the
ground truth and define the accuracy of our method. Splitting data into
In this study, we used a K-Nearest Neighbor (KNN) algorithm to train training and test sets is repeated 10 times. The distances between new
a classifier. KNN is a supervised machine learning algorithm based on test sets and training sets do not have to be computed since we already
classifying objects with respect to labels of nearest neighbors [56]. The compute the pairwise distances between all samples in the beginning.
‘K’ corresponds to the number of neighbors chosen to decide the label of Finding the indices of the samples in each iteration will be enough to
newly introduced samples. Fig. 4 shows an example that illustrates the generate new similarity matrices for training and test sets. The standard
classification process with KNN. deviation of the classification is also provided since classification is
Specifically, Fig. 4 assumes that we have two different classes for a repeated 10 times.
classification problem denoted by pentagons and stars. Pentagons and When a new test sample is introduced to a classifier, distance com­
stars belong to the training set and the red square belongs to a new putations between all training samples and the test sample is required
sample from the test set. When a new sample is encountered (the square which can be computationally expensive. Therefore, we implement the
in the figure), we assign a tag based on the number of K nearest Approximate and Eliminate Search Algorithm (AESA), which is
neighbors to each class. For instance, for the 1-NN case, the closest explained in Section 4, to reduce the number of DTW computations per
neighbor is from the star class; therefore, the test sample is tagged as a new test sample.
star class. On the other hand, for the 5-NN case, the test sample has two
neighbors from the star class and three neighbors from the pentagon 4. Approximate and Eliminate Search Algorithm (AESA)
class. Consequently, the label for the test sample is set as pentagon since
there are more neighbors from this class than there are from the star Approximate and Eliminate Search Algorithm (AESA) is a method
class. If the number of nearest neighbors is the same for multiple classes, designed for reducing the number of distance computations during the
the class label is assigned randomly with equal probability [57]. test phase of classification. Derivation of the AESA starts with the
question of whether DTW is a metric or not. There are four requirements
4. Approximate and Eliminate Search Algorithm (AESA)

The Approximate and Eliminate Search Algorithm (AESA) is a method designed for reducing the number of distance computations during the test phase of classification. The derivation of AESA starts with the question of whether DTW is a metric or not. There are four requirements for a function to be a metric [58]:

1. D(x, y) ≥ 0,
2. D(x, y) = 0 ⇔ x = y,
3. D(x, y) = D(y, x),
4. D(x, z) ≤ D(x, y) + D(y, z).

Fig. 4. K-Nearest Neighbor classification example for two-class classification.


Although DTW always satisfies the first two properties, the commutativity condition is only satisfied when the DTW algorithm does not approximate the distance measure. Therefore, if the exact DTW is computed, then the first three conditions will be satisfied. However, the fourth condition may not be satisfied, and a combination of three time series that violates the triangular inequality can be found in a data set. Consequently, DTW is not accepted as a metric. Depending on the data set, the fourth condition may or may not be satisfied; therefore, DTW can still be used by relaxing the strict triangular inequality condition. Specifically, Ruiz et al. introduced the triangle inequality looseness in [44] as follows:

$$H(x, y, z) = D(x, y) + D(y, z) - D(x, z),$$

where x, y and z represent time series in a data set. The loose triangular inequality is also defined as

$$D(x, y) + D(y, z) \geq D(x, z) + H. \tag{2}$$

Ruiz et al. call the DTW distance a loose metric space when a data set does not violate the loose triangular inequality [44]. All training samples are considered as potential candidates for the nearest sample to a new test sample. The main purpose of the algorithm is to eliminate these training samples and to approximate to the nearest one correctly. In addition, the algorithm performs the classification part and assigns the label of the nearest training sample to the test sample. In other words, it applies 1-NN classification, where the classification step can be part of the AESA algorithm. We will present results based on 1-NN classification in this implementation.

The pseudo code for the algorithm is given in Algorithm 1 [44]. Illustrations of some steps are provided in Fig. 5. Assume that we have a training set P whose samples pi are shown in Fig. 5a. We introduce a new test sample x to our classifier (see Fig. 5b). The aim of AESA is to classify x as accurately as possible with fewer DTW distance computations. The first step of the algorithm is to select a first candidate for the nearest training sample s to x. Reference [44] points out that using the pseudo center c of P as the first candidate improves the results; alternatively, s can be randomly selected. The definition of the pseudo center is

$$PC(P) = \left\{ c \,\middle|\, \sum_{\forall p \in P} DTW(c, p) = \min_{\forall q \in P} \left( \sum_{\forall p \in P} DTW(q, p) \right) \right\}. \tag{3}$$

After this choice of s in Fig. 5c, the distance between s and x is computed. That distance is shown in Fig. 5d with r0, and our count number increases by one. The nearest sample n to x is defined by comparing the distances made inside of the algorithm. Since only one distance computation is performed so far, s is assigned as the nearest sample n. This completes the approximation step in the first iteration; now the elimination should be performed. Reference [44] defined the elimination criteria such that the training set samples pi which do not satisfy DTW(x, pi) < DTW(x, n) are eliminated. Applying Eq. (2) yields two elimination criteria such that (Fig. 6)

$$DTW(p_i, s) < DTW(x, s) + DTW(x, n) - H,$$
$$DTW(p_i, s) > DTW(x, s) - DTW(x, n) + H. \tag{4}$$

These two conditions correspond to the two circles shown in Fig. 5e with r1 and r2. In Fig. 5e we define the region where we look for possible candidates for the nearest sample (n) based on the elimination criteria. The training samples outside of the shaded, green region are eliminated, as shown in gray in Fig. 5f, thus completing the first iteration. The set that includes the eliminated samples is called E. Iterations continue until P = E. We have only eliminated two samples (see Fig. 5g) so far in our example. Therefore, we continue searching for a new s in the second iteration.

Fig. 5. Illustration of elimination and approximation steps of AESA. Count refers to the number of distance computations made in the algorithm.


The new s = q is selected for all iterations except the first one such that

$$\min_{q \in \{P - E\}} \left( \sum_{\forall u \in U} |DTW(q, u) - DTW(x, u)| \right),$$

where U is the set of samples whose distances to x have already been computed [44]. This choice makes p5 the new s, as shown in orange in Fig. 5h. Now, we compute the distance between s and x again and increase the count by one (see Fig. 5i). Then, we draw the circles that define the elimination criteria (see Fig. 5j), and it is seen that all remaining samples are outside of the defined region. Therefore, we eliminate all of them (see Fig. 5k). Since all training set samples are eliminated, there will be no further iterations in the algorithm. In the last step, we assign s as the nearest sample n to the test sample x if DTW(x, s) < DTW(x, n) (see Fig. 5l). This completes the algorithm and we can assign the label of n to x.

In traditional classification, it would be required to compute the distances between P and x. This would be equal to six DTW computations for the example in Fig. 5. However, we made only two DTW computations with the AESA algorithm. This demonstrates how AESA is capable of significantly reducing the number of DTW computations.
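A simplified sketch of this search is shown below; it follows the approximation and elimination steps described above under the assumption that the pairwise training distances D_train were precomputed, and it is not a literal transcription of Algorithm 1.

```python
import numpy as np

def aesa_classify(x, train_ts, labels, D_train, dtw, H=0.0):
    P = set(range(len(train_ts)))            # remaining candidate training samples
    dist_to_x = {}                           # DTW distances computed so far (the "count")
    s = int(np.argmin(D_train.sum(axis=1)))  # first candidate: pseudo center, Eq. (3)
    n, d_n = None, np.inf
    while P:
        d_s = dtw(x, train_ts[s])
        dist_to_x[s] = d_s
        P.discard(s)
        if d_s < d_n:                        # approximation: update the nearest sample n
            n, d_n = s, d_s
        # Elimination, Eq. (4): keep p only if |D(p, s) - DTW(x, s)| < DTW(x, n) - H.
        P = {p for p in P if abs(D_train[p, s] - d_s) < d_n - H}
        if not P:
            break
        # Next candidate: the survivor most consistent with the distances seen so far.
        s = min(P, key=lambda q: sum(abs(D_train[q, u] - du)
                                     for u, du in dist_to_x.items()))
    return labels[n], len(dist_to_x)         # 1-NN label and the number of DTW calls
```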
Table 2 provides classification accuracies and the time needed to
The choice of the parameter H influences the computation reduction and the classification accuracy. To show that, we provide two illustrations that correspond to two H values in Fig. 7. Fig. 7 shows how the defined area is reduced when H is increased. The decrease in the area increases the number of eliminated samples from the training set, and P and E will be equal to each other in a small number of iterations. Thus, the algorithm performs fewer DTW computations as we increase H. In addition, increasing H can lead to misclassification. A larger shaded area with low H can cover the nearest sample n in the region (see Fig. 7 (left)), while a smaller one can exclude n (see Fig. 7 (right)). This exclusion can lead to misclassification since the true n is eliminated. Therefore, we expect a decrease in classification accuracy when H becomes too large. The process for choosing H is described in more detail in Section 5.4.

Fig. 6. Illustration for the elimination criteria.

5. Results

This section presents the results for the classification accuracy using our approach as well as current state-of-the-art methods in the literature. Specifically, Section 5.1 compares the classification accuracy using the same data set for the similarity-based method to the WPT/EEMD methods [26], and the TDA-based results [27]. Section 5.2 shows the transfer learning performance of the similarity measure method in comparison to the WPT/EEMD results [26]. Section 5.3 describes how parallel computing is employed with the DTW approach. Further, Section 5.4 provides the results obtained using the Approximate and Eliminate Search Algorithm (AESA) explained in Section 4.

5.1. Classification results for Dynamic Time Warping (DTW)

Table 2 provides the classification accuracies and the time needed to compute the distance between two time series. Although the same time series is used to find the time needed to compute the distance, cDTW is faster compared to FastDTW, as seen from Table 2. As a larger r parameter is selected for FastDTW, its computational time increases. However, lower r values do not approximate the distance between two time series well. In addition, the cDTW algorithm was able to match the accuracy obtained from FastDTW. Therefore, we used the cDTW algorithm to obtain the results in this study.

Table 2
Comparison of classification accuracy and the time required to compute the distance between two time series for the cDTW and FastDTW (r = 21) packages.

Overhang length cm (inch)   cDTW Accuracy     cDTW Time (s)   FastDTW Accuracy    FastDTW Time (s)
5.08 (2)                    98.34% ± 1.08%    1.5             99.24% ± 0.73%      51.1
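As a hedged illustration of such a timing comparison, the snippet below times the banded implementation sketched in Section 3.1 against the fastdtw package [54] with r = 21; the exact API of the compiled cDTW package is not reproduced here.

```python
import time
from fastdtw import fastdtw

def compare_runtimes(ts1, ts2, r=100):
    t0 = time.perf_counter()
    d_band = dtw_distance(ts1, ts2, r=r)       # illustrative banded DTW from Section 3.1
    t1 = time.perf_counter()
    d_fast, _ = fastdtw(ts1, ts2, radius=21)   # FastDTW with r = 21 as in Table 2
    t2 = time.perf_counter()
    return (d_band, t1 - t0), (d_fast, t2 - t1)
```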
Table 3 compares the best classification scores obtained from WPT, EEMD, and the TDA-based methods to the results from DTW. We obtained results for K = 1, 2, …, 5 for the KNN classifier and reported the best accuracies for DTW in Table 3. The cells highlighted in green are the ones with the highest overall classification score, while those highlighted in blue represent results with error bands that overlap with the best overall accuracy in the same row. A full list of the average classification scores and the corresponding standard deviations can be found in Tables 8 and 9.

Fig. 7. Effect of H on the elimination criteria and accuracy of the classification.


Table 3
Comparison of results for similarity-based methods with their counterparts available in literature.

*Intermediate chatter cases are excluded.

Table 3 shows that features based on WPT, Carlsson Coordinates (a TDA-based method), and DTW give the highest accuracy for the different overhang cases. For the 5.08 and 8.89 cm (2 and 3.5 in.) overhang cases, DTW and Carlsson Coordinates have the highest classification accuracies of 98.3% and 95.7%, respectively. However, DTW is within the error band of the highest accuracy for the 8.89 cm (3.5 in.) case. On the other hand, feature extraction with WPT and RFE is the most accurate for the 6.35 and 11.43 cm (2.5 and 4.5 in.) overhang cases, scoring 100% and 87.5%. While the results from other methods are not the highest for any of the considered cases, some of them still lie within the error bars for the 11.43 cm (4.5 in.) overhang case. Specifically, EEMD is within one standard deviation of the best results for the 11.43 cm (4.5 in.) case.

Note that the results provided in Table 3 are for two-class classification: chatter and chatter-free. In the column named DTW*, we present the results when we exclude the intermediate chatter cases. In contrast, the DTW column shows the results when intermediate chatter cases are treated as chatter in classification. The reason we exclude the intermediate chatter cases in DTW is revealed in the heat maps of the similarity matrices. The heatmaps of the average DTW distances between the three classes (chatter, intermediate chatter, and no-chatter/stable) are given in Figs. 8, 9, 13, 14. Each of the nine regions in these figures shows the average DTW distance between all the cases marked according to the row and the column labels in that region, e.g., the top right region reports the average DTW distance between the time series tagged as no-chatter versus those tagged as chatter. These heatmaps show us how similar the time series belonging to different classes are to each other. Ideally, the average distance between time series with the same label is expected to be small compared to the average distance between time series with different labels. This can also be observed in Fig. 8. The average distance between stable cases is the lowest, while the one between stable and unstable (chatter) cases is the highest. Therefore, the DTW algorithm can distinguish the signals with different labels. However, this may not hold all the time. For instance, the heat map of the 6.35 cm (2.5 in.) case in Fig. 9 indicates that the average distance between intermediate chatter and no-chatter time series is almost identical to the average distance among intermediate chatter time series. This explains why we have low accuracy when we include the intermediate chatter as a separate class in Table 3. In this case, the classification algorithm can classify intermediate cases as stable ones, although these cases are taken into account as chatter cases, thus reducing the resulting accuracy. When we exclude the intermediate cases completely, the classification score increases from 72.3% to 86.9% since there is a large difference between the average distances of stable and chatter cases, as seen in Fig. 9.

In the 11.43 cm (4.5 in.) case, Fig. 14 indicates that there is no significant difference in the average DTW distances. As a consequence, there might be errors in the classification of the chatter cases, and this leads to low overall classification accuracy as shown in Table 3. When we do not include intermediate chatter cases, the classification score is 70.9%. Since the difference between the average distance of intermediate-stable and intermediate-chatter is small, the classification algorithm can still classify intermediate cases as chatter. This can increase the classification accuracy when we include the intermediate cases, as shown in Table 3. The score increases from 70.9% to 75.7%. For the 5.08 and 8.89 cm (2 and 3.5 in.) cases, the heatmaps (see Figs. 8 and 13)

Fig. 8. The heat map of average DTW distances of time series belonging to the three classes for the 5.08 cm (2 in.) case.


Fig. 9. The heat map of average DTW distances of time series belonging to the three classes for the 6.35 cm (2.5 in.) case.

show clear differences between the three cases. This explains the high classification accuracies for both cases.

However, there might be interest in identifying intermediate chatter as part of a prediction algorithm that intervenes before the process develops into full chatter. Alternatively, inducing or sustaining intermediate chatter might be desirable for surface texturing applications. Figs. 2b–d show that the power spectra for chatter and intermediate chatter are very similar, making the featurization in the frequency domain extremely challenging. However, Fig. 2a shows a clear difference between the two chatter regimes in the time domain. Therefore, it is more advantageous to extract features in the time domain for a three-class classification (chatter, intermediate chatter, and no chatter).

For the 5.08 and 8.89 cm (2 and 3.5 in.) overhang cases, Figs. 8 and 13 show that DTW can differentiate between chatter and intermediate chatter, as evidenced by the high average distance between time series tagged as chatter and intermediate chatter. Looking at the regions that list the distances between intermediate chatter and chatter cases, we can see that the distances between chatter-chatter and chatter-intermediate chatter cases are quite different. This confirms the ability of our approach to distinguish the differences between these two cases. However, the KNN algorithm may not differentiate these cases due to the similarities between the average distances in the case of the 6.35 and 11.43 cm (2.5 and 4.5 in.) overhang distances. As a concrete example for the three-class classification, we computed the distance matrices for all overhang distances and applied the KNN classification algorithm to obtain the best 3-class classification accuracy. Table 4 provides the best results of the corresponding cases (the full classification results can be found in Table 10). Table 4 shows that the DTW approach successfully distinguishes the three different classes and the success rates are only slightly below the success rates of the two-class classification (cf. Table 3) for 5.08 and 8.89 cm (2 and 3.5 in.), while low classification accuracy is observed for the 6.35 and 11.43 cm (2.5 and 4.5 in.) cases, as expected.

Table 4
The best accuracy results for three-class classification with the DTW approach, and the corresponding number of nearest neighbors used in the KNN algorithm.

Overhang length cm (inch)   DTW             K-NN algorithm
5.08 (2)                    97.7% ± 1.1%    1-NN
6.35 (2.5)                  71.4% ± 7.0%    4-NN
8.89 (3.5)                  95.5% ± 5.4%    3-NN
11.43 (4.5)                 73.9% ± 4.6%    4-NN

5.2. Transfer learning results

An important question with practical implications is how well will a trained classifier perform on a real manufacturing center? This question is strongly related to the concept of transfer learning, i.e., the idea of training a classifier on some cutting configuration, and hoping that this classifier will be robust enough to the inevitable changes in the systems' dynamic parameters during manufacturing. In order to assess the capability of transfer learning using time series similarity measures, Fig. 10 shows the average distance for each quadrant of the distance matrix between the 5.08 and 11.43 cm (2 and 4.5 in.) overhang lengths. It can be easily seen that cases with different labels, albeit from different cutting configurations, can still be differentiated since their pairwise distances are distinct. In contrast, Fig. 2 shows that, for instance, chatter and intermediate chatter show similar frequency bands, which complicates extracting distinguishing features between chatter and intermediate chatter using frequency-based features.

Motivated by Fig. 10, we performed a transfer learning analysis using the DTW-based approach. This analysis involved training a classifier using the 5.08 cm (2 in.) data and testing it on the 11.43 cm (4.5 in.) signals. The same analysis was repeated using the 11.43 cm (4.5 in.) data as the training set, and the 5.08 cm (2 in.) data as the test set. These two cases were chosen in order to test the effectiveness of the classifier in detecting chatter when the eigenfrequencies of the system change between two extremes. 67% of the training set was used to train the classifier, and testing was performed using 67% of the testing set. In each case, a KNN classifier was trained for K ∈ {1, 2, …, 5}, and the highest resulting accuracy is listed in Table 5 (the full classification accuracies can be found in Table 11). This table also compares the transfer learning results from DTW to its WPT and EEMD counterparts obtained from Ref. [31].

Table 5 shows that WPT and EEMD outperform DTW. However, the results of DTW are within the error band of the best results when the training set is the 5.08 cm (2 in.) overhang case, and DTW provides this accuracy with a smaller deviation compared to WPT Level 4. On the other hand, EEMD provides the best accuracy with 94.9% when the training set is the 11.43 cm (4.5 in.) case, while DTW places second with 81.4%. It is worth noting that one of the main limitations of the WPT and EEMD methods is the manual preprocessing required to choose informative decompositions for feature extraction, as stated in Ref. [31]. DTW does not require manual preprocessing and it can be fully automated. Therefore, these results show that a classifier trained using the DTW approach retains good classification accuracies even if the dynamic parameters of the machining process deviate from their original values.
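A sketch of this transfer learning experiment is given below, assuming D_train_train holds the pairwise DTW distances within the source configuration and D_test_train the distances between the target and source configurations; the names and splitting fractions follow the description above, but the code is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def transfer_accuracy(D_train_train, D_test_train, y_train, y_test, k=1, seed=0):
    # Train on 67% of the source overhang case ...
    tr, _ = train_test_split(np.arange(len(y_train)), train_size=0.67,
                             random_state=seed)
    # ... and evaluate on 67% of the target overhang case.
    te, _ = train_test_split(np.arange(len(y_test)), train_size=0.67,
                             random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    knn.fit(D_train_train[np.ix_(tr, tr)], np.asarray(y_train)[tr])
    return knn.score(D_test_train[np.ix_(te, tr)], np.asarray(y_test)[te])
```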

5.3. Parallel computing


Fig. 10. The heat map that represents the average DTW pairwise distances between 5.08 and 11.43 cm (2 and 4.5 in.) cases. The number marked in each region is the
average DTW distance of all the time series pairs in that region.

Table 5
Transfer learning results comparison between DTW, WPT and EEMD.

In this study, we utilized parallel computing to expedite calculating the distance matrices. In a parallelized way, we are able to compute multiple distances at the same time, and this reduces the total run time. We used the High Performance Computing Center (HPCC) of Michigan State University (MSU) and obtained all the distance matrices required to produce the results shown in this study. HPCC is composed of several supercomputers, each with hundreds of nodes. Each of these nodes can be thought of as a computer with a certain number of CPUs and cores. In parallel computing, we submitted 175, 82, 22 and 154 jobs at the same time for the 5.08, 6.35, 8.89, and 11.43 cm (2, 2.5, 3.5 and 4.5 in.) cases, respectively. For the 5.08 cm (2 in.) case, the number of distance computations made in each job is 1000, while it is 100 for the other cases. We requested one to five nodes and 5 CPUs per job and 2 GB of RAM per CPU. Therefore, each job is run with 10 GB of RAM in total. It is also worth mentioning that HPCC-MSU has a job submission policy, and this policy determines the queue time for a user depending on the resources requested from HPCC-MSU. Each time a user requests a large amount of resources, the queue time for the job submitted by that user gets higher. In addition, the queue time depends on the current usage of HPCC-MSU since it is open to all university members. Although we submit all jobs at the same time to HPCC-MSU, their computation is not started at the same time due to queue time, and this causes deviations from the ideal time, which is equal to the runtime of a single job.

Table 6 provides the times required to obtain classification results with DTW and its counterparts. For DTW, we report two different run times: traditional and parallel. In traditional DTW, only one distance computation is performed at a time. Therefore, a significant difference is observed between parallel and traditional computations in spite of the fact that the times reported in Table 6 for DTW (Parallel) include the queue time. The times reported for traditional computation are estimated based on the time required to complete one distance computation on a Dell OptiPlex 7050 desktop with an Intel Core i7-7700 CPU and 16.0 GB RAM. In addition, the times reported for the TDA-based feature extraction methods are obtained by applying parallel computing only to the persistence diagram computations, while parallel computing is not involved in WPT and EEMD. Even though WPT and EEMD are the fastest methods, parallel computing with DTW significantly reduces the total run time, making the latter the third-fastest method.
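The same idea can be sketched at workstation scale with Python's multiprocessing instead of an HPC scheduler, since the N(N − 1)/2 pairwise DTW computations are independent; dtw_distance refers to the illustrative function sketched in Section 3.1 and must be importable by the worker processes.

```python
from itertools import combinations
from multiprocessing import Pool
import numpy as np

def _pair(args):
    i, j, a, b = args
    return i, j, dtw_distance(a, b)     # illustrative DTW from Section 3.1

def distance_matrix_parallel(series, processes=8):
    pairs = [(i, j, series[i], series[j])
             for i, j in combinations(range(len(series)), 2)]
    D = np.zeros((len(series), len(series)))
    with Pool(processes) as pool:
        for i, j, d in pool.imap_unordered(_pair, pairs):
            D[i, j] = D[j, i] = d       # DTW is commutative: fill both halves
    return D
```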

Table 6
Time (seconds) comparison between similarity measure methods and its counterparts.

*These run times are rough estimations.


We point out that the classification based on WPT is optimized, whereas the classification based on DTW is far from optimal. The aim of the paper is the introduction of DTW for chatter detection and the presentation of its general performance. In fact, this includes the necessity for an optimization of the algorithm, as can be seen from Table 6 even when parallel computing is used. However, the values reported in Table 6 for WPT and EEMD do not include the time required for choosing the informative wavelet packets or IMFs using manual preprocessing, since the speed of manual preprocessing depends on the skill of the user and is more difficult to track. We expect that the overall time will be much higher for WPT and EEMD after including the manual processing time.

First, the computing time can be decreased by decreasing the length and/or the number of time series. At the moment it is not clear how such a reduction affects the performance of the method. For example, the success rates for the 8.89 cm (3.5 in.) case and the 5.08 cm (2 in.) case are comparable even though the available amount of training data is much smaller for the former. Second, we computed the upper diagonal of the distance matrix including the pairwise distances between the training data. Although Table 6 shows that DTW clocks the second-fastest runtime, we note that this slowdown is mostly related to the training phase because of the large number of necessary pairwise distance computations during the training/testing phase. However, once the classifier is obtained, the necessary runtime for DTW will be significantly reduced because any new data is classified upon computing its pairwise distances with the training set, i.e., the only needed computation is equivalent to the evaluation of one row of the training/testing similarity matrix (see Table 7). Finally, it is possible that the code for calculating the DTW distance matrix can be further optimized. Many researchers have published on DTW optimization, especially for data mining [59], including a speedup of the runtime for distance matrix computations between time series and a query. The resulting algorithm allows performing fast queries on a single core machine in a very short time, thus allowing small consumer electronics to handle the data and possibly extract features from it in real-time. However, we apply the AESA algorithm explained in Section 4 and given in Algorithm 1 to reduce the number of distance computations and the time required for testing.

5.4. Approximate and eliminate search algorithm results

The run times for the DTW methods with parallel computing given in Table 6 are for training a classifier. When a new test sample is introduced, we need to compute the distances between the test sample and all the training set samples to identify the nearest neighbors. However, computing these distances the traditional way can be too lengthy, especially when considering DTW for online chatter detection applications. Therefore, we employ the AESA algorithm to reduce the number of DTW computations for a test sample during classification.

The first step is to check if there is any violation of the loose triangular inequality given in Eq. (2). The looseness constants for all combinations of three different time series are computed for the turning cutting data set, and the plots are provided in Fig. 11. This figure shows that all combinations of three time series comply with the triangular inequality since the frequencies are accumulated on positive looseness values. Then, we choose a range of the looseness constant (H) between 0 and 10,000 with an increment of 100; therefore, we use 101 different H values as input to the AESA algorithm. We split the data set of each overhang distance into training (67%) and test (33%) sets. For every H, the number of DTW distance computations made per test sample and the predicted labels of the test samples are obtained as output. Then, these predicted labels are compared with the true labels of the time series to determine the accuracy level, and we take the average of the number of distance computations made for the samples in the test set. We provide a plot to show how the average number of distance computations per test sample and the classification accuracy change with varying H in Fig. 12. All of the results presented in Fig. 12 are obtained with HPCC-MSU and using the 1-NN implementation in the AESA. However, AESA can be modified to perform classification with a larger number of nearest neighbors.
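A sketch of this procedure is shown below; it reuses the illustrative aesa_classify function from Section 4, first checking the looseness of triples (which should be non-negative, as in Fig. 11) and then recording the accuracy and the average number of DTW computations for each H, as in Fig. 12. The example H range is an assumption consistent with the sweep described above.

```python
import numpy as np

def looseness(D, i, j, k):
    return D[i, j] + D[j, k] - D[i, k]      # H(x, y, z) from Section 4

def sweep_H(test_ts, y_test, train_ts, y_train, D_train, dtw, H_values):
    results = []
    for H in H_values:                      # e.g. np.arange(0, 10_001, 100)
        out = [aesa_classify(x, train_ts, y_train, D_train, dtw, H=H)
               for x in test_ts]
        labels, counts = zip(*out)
        acc = np.mean(np.asarray(labels) == np.asarray(y_test))
        results.append((H, acc, np.mean(counts)))   # one point of Fig. 12 per H
    return results
```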
Fig. 12 indicates that there is a decrease in both the average number of distance computations and the accuracy as the looseness constant increases. However, the average number of distance computations decreases dramatically. Therefore, the AESA algorithm can reduce the number of DTW distance computations while keeping the accuracy as high as possible. One can find a value of H with a low number of distance computations and a high accuracy value. Furthermore, an increase in accuracy with increasing H is observed for the 6.35, 8.89 and 11.43 cm (2.5, 3.5, and 4.5 in.) overhang distances. However, the 6.35 and 8.89 cm (2.5 and 3.5 in.) cases have fewer samples in comparison to the other cases (see Table 1). The fewer data samples in an overhang distance case cause bigger jumps in accuracy, as shown in Fig. 12. In addition, the plots shown in Fig. 12 are only for one train-test split. One may obtain a smoother decrease in the accuracy plots by applying the train-test split several times and taking the average. Then, the accuracy values for small H will converge to the values which are obtained with 10 train-test splits and provided in Table 3.

We choose some H values and find their corresponding classification scores and average numbers of distance computations for all overhang distance cases from Fig. 12. The time required to classify one test sample is estimated for the traditional way and the parallelized way, and we provide a comparison between traditional computing, parallel computing, and the AESA algorithm in Table 7. The accuracy reported in Table 7 corresponds to the accuracies shown in Fig. 12 and Table 3.

The H values chosen in Table 7 are examples; one can choose different values for H as well. There is a trade-off between accuracy and the average number of distance computations. Higher H values can be selected to obtain a small number of distance computations, but this comes with a lower classification score. For the traditional way, the times we report are estimated based on the time required to complete one distance computation. On the other hand, parallel computing could be the fastest method among them.

Table 7
Classification time of one test sample for the traditional way, parallel computing, and the AESA algorithm, and the corresponding average accuracy obtained from Table 3 and Fig. 12.

Overhang distances cm (inch)   Traditional           AESA                                        Parallel
                               Acc.    Time (s)*     H      Acc.     Avg. count   Time (s)      Acc.    Time (s)
5.08 (2)                       98.3%   595.5         6900   90.81%   1.13         1.70          98.3%   ≈1.5
6.35 (2.5)                     72.3%   130.5         3300   74.42%   20.04        30.06         72.3%   ≈1.5
8.89 (3.5)                     92.9%   66            6300   86.96%   1.09         1.64          92.9%   ≈1.5
11.43 (4.5)                    75.7%   175.5         5600   81.36%   1.19         1.79          75.7%   ≈1.5

* These run times are rough estimates.


Fig. 11. Histograms of the looseness constant for all combinations of three different time series for all overhang distances.

Fig. 12. Classification accuracy (%) (red solid line) and average number of DTW computations (green dashed line) for varying looseness constant (H). (For
interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Ideally, if all distance computations between the test set sample and the training set are sent to HPCC-MSU in separate jobs, each job will take nearly 1.5 s to complete. However, there might be some queue time, which leads to a delay in obtaining the results. One can also use the workstations available on the market to perform parallel computing without needing the extravagant supercomputers which HPCC-MSU has. Using parallel computing is optional, and it is a viable option if a cluster for high performance computing is available. Alternatively, the user can speed up the computations by implementing the AESA algorithm on a workstation. Moreover, the AESA algorithm provides promising classification times, as seen from Table 7. Classification can be completed in less than 2 s with the chosen H values for all overhang distances except the 6.35 cm (2.5 in.) case. One can choose another value of H for the 6.35 cm (2.5 in.) case to obtain results faster, but this corresponds to a lower classification score. Table 7 shows us that


Table 7 shows that AESA and parallel computing can enable in-process chatter detection on cutting centers with the similarity measure approach, using a classifier that is first trained offline and then loaded onto a controller attached to the manufacturing center. Moreover, implementing AESA in an online manufacturing application is cheaper than buying supercomputers or workstations to perform parallel computing.

The DTW-based approach operates directly on the time series, thus bypassing the preprocessing step involved in the WPT and EEMD methods. Further, in contrast to deep learning techniques, such as neural networks, using DTW does not necessitate a large number of datasets for training. The feasibility of implementing DTW on a real cutting center is examined by investigating its transfer learning capabilities. DTW yields results that are in the error band of the best result or differ only slightly from it, as seen in Table 5. Providing these results with smaller deviation while eliminating the manual preprocessing is a significant advantage of our approach, which still achieves high classification accuracies even if the system parameters (in our case the eigenfrequencies) shift during the process.

6. Conclusion

This paper presents a novel method for chatter detection that combines similarity measures of time series via Dynamic Time Warping (DTW) with machine learning. In this approach, the similarity of different time series is measured using their DTW distance, and any incoming data stream is then classified using the KNN algorithm. We test the classification accuracy of our approach using a set of turning experiments with four different tool overhang lengths, and we compare the resulting accuracy to two widely used methods, the Wavelet Packet Transform (WPT) and the Ensemble Empirical Mode Decomposition (EEMD), as well as to newly developed tools based on Topological Data Analysis (TDA). Our results in Table 3 show that the DTW classification accuracy matches or exceeds those of existing methods for two out of the four overhang cases. This indicates that temporal features extracted using DTW are effective markers for detecting chatter in cutting processes. The results of the TDA-based methods are also close to those of the similarity measures; however, one advantage of the DTW approach in comparison to TDA-based tools is that it does not require embedding the data into a point cloud, hence avoiding the complications associated with choosing appropriate embedding parameters.

The DTW approach can distinguish stable and unstable cases from each other, as evidenced by the heat maps of the average distances between time series with different labels (see Figs. 8, 9 and 13). The DTW approach also successfully distinguishes between chatter and intermediate chatter, as shown in Table 4. These comparisons are difficult or impossible using only frequency domain features because the frequency content in these two cases is too similar, while the time domain signals are different, as evidenced by Fig. 2 and the heat maps shown in Figs. 8 and 13.

In addition, the combination of the DTW approach with the AESA algorithm provides a significant decrease in the number of DTW computations, thus reducing the classification time of a single test sample to less than 2 s for three out of the four overhang distances. Depending on the user's choice of H, one can obtain the results even faster at the cost of a loss in classification accuracy, as seen in Fig. 12. Therefore, there is a trade-off between the reduction in the number of distance computations and the classification performance. In contrast to the AESA algorithm, parallel computing can complete the classification of a single time series in about 1.5 s under ideal conditions, such as zero queue time and enough resources (number of processors/cores and RAM capacity) to run all jobs simultaneously. Therefore, both AESA and parallel computing can be combined with similarity measures for real-time online chatter detection applications.

We also note that Table 6 does not include the time required for the manual preprocessing in the WPT and EEMD methods for choosing informative packets or decompositions. The actual time for these two methods is larger than the ones provided in the table, depending on the number of investigated time series and the skill of the person performing the preprocessing. This is because WPT and EEMD require checking the frequency spectrum of the time series and examining the energy ratios of the wavelet packets. Furthermore, whereas the WPT algorithms are highly optimized, the Python scripts that we used in this paper for computing the DTW have little to no optimization. We hypothesize that further optimization, using for example the ideas in [59], and combining the DTW approach with AESA and parallel computing will speed up the runtime for the similarity measures, making them a viable option for on-machine chatter detection. Future work includes optimizing the DTW algorithm and studying its performance when combined with other machining operations.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant Nos. CMMI-1759823 and DMS-1759824 with PI FAK.

Appendix A

Algorithm 1. AESA.
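The pseudocode image for Algorithm 1 is not reproduced here, so the following Python sketch conveys the approximate-and-eliminate idea it implements: the precomputed training-to-training distance matrix D and the triangle inequality give a lower bound on each candidate's distance to the test sample, and the looseness constant H controls how aggressively candidates are discarded. The elimination rule shown is one common form and is our assumption; the exact condition in Algorithm 1 may differ, and since DTW is not a true metric [44], the elimination is approximate even for H = 0.

```python
# Hypothetical sketch of the AESA nearest-neighbor search. `dist` is the
# distance function (e.g., dtw.distance from the dtaidistance package) and
# D[i, j] holds the precomputed train-train distances (computed offline).
import numpy as np

def aesa_nearest(test, train_series, D, dist, H=0.0):
    n = len(train_series)
    alive = set(range(n))
    lower = np.zeros(n)                  # running lower bounds G(p)
    best_idx, best_dist = None, np.inf
    while alive:
        # approximate: pick the live candidate with the smallest lower bound
        s = min(alive, key=lambda p: lower[p])
        alive.remove(s)
        d = dist(test, train_series[s])  # one actual DTW computation
        if d < best_dist:
            best_idx, best_dist = s, d
        # eliminate: triangle-inequality bound, loosened by H
        for p in list(alive):
            lower[p] = max(lower[p], abs(d - D[s, p]))
            if lower[p] > best_dist - H:  # larger H -> more eliminations
                alive.remove(p)
    return best_idx, best_dist
```

With this rule, larger H discards more candidates after each computed distance, reproducing the trade-off in Fig. 12: fewer DTW computations at the cost of possibly eliminating the true nearest neighbor.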


Table 8
Two-class classification (chatter and stable) results obtained with the DTW similarity measure method for the 2, 2.5, 3.5 and 4.5 in. overhang cases.

Table 9
Two-class classification (chatter, including intermediate chatter cases, versus stable) results obtained with the DTW similarity measure method for the 2, 2.5, 3.5 and 4.5 in. overhang cases.


Fig. 13. The heat map of average DTW distances between the time series belonging to the three classes for the 8.89 cm (3.5 in.) case.

Fig. 14. The heat map of average DTW distances between the time series belonging to the three classes for the 11.43 cm (4.5 in.) case.

Table 10
Three-class classification results obtained with the DTW approach.


Table 11
Transfer learning results for Dynamic Time Warping (DTW).

References

[1] Wang R, Song Q, Liu Z, Ma H, Liu Z. Multi-condition identification in milling Ti-6Al-4V thin-walled parts based on sensor fusion. Mech Syst Signal Process 2022;164:108264. https://doi.org/10.1016/j.ymssp.2021.108264.
[2] Dadgari A, Huo D, Swailes D. Investigation on tool wear and tool life prediction in micro-milling of Ti-6Al-4V. Nanotechnol Precis Eng 2018;1(4):218–25. https://doi.org/10.1016/j.npe.2018.12.005.
[3] Munoa J, Beudaert X, Dombovari Z, Altintas Y, Budak E, Brecher C, Stepan G. Chatter suppression techniques in metal cutting. CIRP Ann 2016;65(2):785–808. https://doi.org/10.1016/j.cirp.2016.06.004.
[4] Lamraoui M, Barakat M, Thomas M, Badaoui ME. Chatter detection in milling machines by neural network classification and feature selection. J Vib Control 2013;21(7):1251–66. https://doi.org/10.1177/1077546313493919.
[5] Ding L, Sun Y, Xiong Z. Early chatter detection based on logistic regression with time and frequency domain features. In: 2017 IEEE international conference on advanced intelligent mechatronics (AIM). IEEE; 2017. p. 1052–7. https://doi.org/10.1109/aim.2017.8014158.
[6] Thaler T, Potočnik P, Bric I, Govekar E. Chatter detection in band sawing based on discriminant analysis of sound features. Appl Acoust 2014;77:114–21. https://doi.org/10.1016/j.apacoust.2012.12.004.
[7] Han Z, Jin H, Han D, Fu H. ESPRIT- and HMM-based real-time monitoring and suppression of machining chatter in smart CNC milling system. Int J Adv Manuf Technol 2016;89(9–12):2731–46. https://doi.org/10.1007/s00170-016-9863-y.
[8] Tran M-Q, Liu M-K, Elsisi M. Effective multi-sensor data fusion for chatter detection in milling process. ISA Trans. https://doi.org/10.1016/j.isatra.2021.07.005.
[9] Chen GS, Zheng QZ. Online chatter detection of the end milling based on wavelet packet transform and support vector machine recursive feature elimination. Int J Adv Manuf Technol 2017;95(1–4):775–84. https://doi.org/10.1007/s00170-017-1242-9.
[10] Tangjitsitcharoen S, Saksri T, Ratanakuakangwan S. Advance in chatter detection in ball end milling process by utilizing wavelet transform. J Intell Manuf 2013;26(3):485–99. https://doi.org/10.1007/s10845-013-0805-3.
[11] Tran M-Q, Elsisi M, Liu M-K. Effective feature selection with fuzzy entropy and similarity classifier for chatter vibration diagnosis. Measurement 2021;184:109962. https://doi.org/10.1016/j.measurement.2021.109962.
[12] Ji Y, Wang X, Liu Z, Wang H, Jiao L, Wang D, Leng S. Early milling chatter identification by improved empirical mode decomposition and multi-indicator synthetic evaluation. J Sound Vib 2018;433:138–59. https://doi.org/10.1016/j.jsv.2018.07.019.
[13] Liu C, Zhu L, Ni C. The chatter identification in end milling based on combining EMD and WPD. Int J Adv Manuf Technol 2017;91(9–12):3339–48. https://doi.org/10.1007/s00170-017-0024-8.
[14] Chen Y, Li H, Hou L, Wang J, Bu X. An intelligent chatter detection method based on EEMD and feature selection with multi-channel vibration signals. Measurement 2018;127:356–65. https://doi.org/10.1016/j.measurement.2018.06.006.
[15] Yesilli MC, Khasawneh FA. On transfer learning of traditional frequency and time domain features in turning. In: Manufacturing processes; manufacturing systems; nano/micro/meso manufacturing; quality and reliability. Volume 2. American Society of Mechanical Engineers; 2020. https://doi.org/10.1115/msec2020-8274.
[16] Liu M-K, Tran M-Q, Chung C, Qui Y-W. Hybrid model- and signal-based chatter detection in the milling process. J Mech Sci Technol 2020;34(1):1–10. https://doi.org/10.1007/s12206-019-1201-5.
[17] Wang Y, Bo Q, Liu H, Hu L, Zhang H. Mirror milling chatter identification using Q-factor and SVM. Int J Adv Manuf Technol 2018;98(5–8):1163–77. https://doi.org/10.1007/s00170-018-2318-x.
[18] Liu C, Zhu L, Ni C. Chatter detection in milling process based on VMD and energy entropy. Mech Syst Signal Process 2018;105:169–82. https://doi.org/10.1016/j.ymssp.2017.11.046.
[19] Tarng YS, Chen MC. An intelligent sensor for detection of milling chatter. J Intell Manuf 1994;5(3):193–200. https://doi.org/10.1007/bf00123923.
[20] Tangjitsitcharoen S. Advance in detection system to improve the stability and capability of CNC turning process. J Intell Manuf 2009;22(6):843–52. https://doi.org/10.1007/s10845-009-0355-x.
[21] Fu Y, Zhang Y, Gao H, Mao T, Zhou H, Sun R, Li D. Automatic feature constructing from vibration signals for machining state monitoring. J Intell Manuf 2017;30(3):995–1008. https://doi.org/10.1007/s10845-017-1302-x.
[22] Cherukuri, Perez-Bernabeu, Selles, Schmitz. Machining chatter prediction using a data learning model. J Manuf Mater Process 2019;3(2):45. https://doi.org/10.3390/jmmp3020045.
[23] Khasawneh FA, Munch E. Stability determination in turning using persistent homology and time series analysis. In: Dynamics, vibration, and control. Volume 4B. ASME; 2014. https://doi.org/10.1115/imece2014-40221.
[24] Khasawneh FA, Munch E. Chatter detection in turning using persistent homology. Mech Syst Signal Process 2016;70–71:527–41. https://doi.org/10.1016/j.ymssp.2015.09.046.
[25] Khasawneh FA, Munch E, Perea JA. Chatter classification in turning using machine learning and topological data analysis. IFAC-PapersOnLine 2018;51(14):195–200. https://doi.org/10.1016/j.ifacol.2018.07.222.
[26] Yesilli MC, Tymochko S, Khasawneh FA, Munch E. Chatter diagnosis in milling using supervised learning and topological features vector. In: 2019 18th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2019. p. 1211–8. https://doi.org/10.1109/icmla.2019.00200.
[27] Yesilli MC, Khasawneh FA, Otto A. Topological feature vectors for chatter detection in turning processes. Int J Adv Manuf Technol. https://doi.org/10.1007/s00170-021-08242-5.
[28] Adcock A, Carlsson E, Carlsson G. The ring of algebraic functions on persistence bar codes. Homology Homotopy Appl 2016;18(1):381–402. https://doi.org/10.4310/hha.2016.v18.n1.a21.
[29] Perea JA, Munch E, Khasawneh FA. Approximating continuous functions on persistence diagrams using template functions. arXiv preprint arXiv:1902.07190. http://arxiv.org/abs/1902.07190v2.
[30] Bauer U, Edelsbrunner H, Jablonski G, Mrozek M. Čech-Delaunay gradient flow and homology inference for self-maps. arXiv preprint arXiv:1709.04068v2.
[31] Yesilli MC, Khasawneh FA, Otto A. On transfer learning for chatter detection in turning using wavelet packet transform and ensemble empirical mode decomposition. CIRP J Manuf Sci Technol 2020;28:118–35. https://doi.org/10.1016/j.cirpj.2019.11.003.
[32] Myers C, Rabiner L, Rosenberg A. Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Trans Acoust Speech Signal Process 1980;28(6):623–35. https://doi.org/10.1109/tassp.1980.1163491.
[33] Myers CS, Rabiner LR. A comparative study of several dynamic time-warping algorithms for connected-word recognition. Bell Syst Tech J 1981;60(7):1389–409. https://doi.org/10.1002/j.1538-7305.1981.tb00272.x.
[34] Sakoe H, Chiba S, Waibel A, Lee K. Dynamic programming algorithm optimization for spoken word recognition. In: Readings in speech recognition. 159; 1990. p. 224.
[35] Juang B-H. On the hidden Markov model and dynamic time warping for speech recognition - a unified view. AT&T Bell Lab Tech J 1984;63(7):1213–43. https://doi.org/10.1002/j.1538-7305.1984.tb00034.x.
[36] Itakura F. Minimum prediction residual principle applied to speech recognition. IEEE Trans Acoust Speech Signal Process 1975;23(1):67–72. https://doi.org/10.1109/tassp.1975.1162641.
[37] Niennattrakul V, Ratanamahatana CA. On clustering multimedia time series data using k-means and dynamic time warping. In: 2007 international conference on multimedia and ubiquitous engineering (MUE'07). IEEE; 2007. https://doi.org/10.1109/mue.2007.165.
[38] Yu F, Dong K, Chen F, Jiang Y, Zeng W. Clustering time series with granular dynamic time warping method. In: 2007 IEEE international conference on granular computing (GRC 2007). IEEE; 2007. p. 393–8. https://doi.org/10.1109/grc.2007.34.
[39] Petitjean F, Forestier G, Webb GI, Nicholson AE, Chen Y, Keogh E. Dynamic time warping averaging of time series allows faster and more accurate classification. In: 2014 IEEE international conference on data mining. IEEE; 2014. p. 470–9. https://doi.org/10.1109/icdm.2014.27.
[40] Shanker AP, Rajagopalan A. Off-line signature verification using DTW. Pattern Recogn Lett 2007;28(12):1407–14. https://doi.org/10.1016/j.patrec.2007.02.016.
[41] Munich M, Perona P. Continuous dynamic time warping for translation-invariant curve alignment with applications to signature verification. In: Proceedings of the seventh IEEE international conference on computer vision. IEEE; 1999. https://doi.org/10.1109/iccv.1999.791205.
[42] Parizeau M, Plamondon R. A comparative analysis of regional correlation, dynamic time warping, and skeletal tree matching for signature verification. IEEE Trans Pattern Anal Mach Intell 1990;12(7):710–7. https://doi.org/10.1109/34.56215.
[43] Martens R, Claesen L. On-line signature verification by dynamic time-warping. In: Proceedings of 13th international conference on pattern recognition. IEEE; 1996. p. 38–42. https://doi.org/10.1109/icpr.1996.546791.
[44] Ruiz EV, Nolla FC, Segovia HR. Is the DTW "distance" really a metric? An algorithm reducing the number of DTW comparisons in isolated word recognition. Speech Comm 1985;4(4):333–44. https://doi.org/10.1016/0167-6393(85)90058-5.
[45] Vidal E, Lloret M-J. Fast speaker-independent DTW recognition of isolated words using a metric-space search algorithm (AESA). Speech Comm 1988;7(4):417–22. https://doi.org/10.1016/0167-6393(88)90059-3.
[46] Ruiz EV. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recogn Lett 1986;4(3):145–57. https://doi.org/10.1016/0167-8655(86)90013-9.
[47] Micó ML, Oncina J, Vidal E. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn Lett 1994;15(1):9–17. https://doi.org/10.1016/0167-8655(94)90095-7.
[48] Rico-Juan JR, Micó L. Comparison of AESA and LAESA search algorithms using string and tree-edit-distances. Pattern Recogn Lett 2003;24(9–10):1417–26. https://doi.org/10.1016/s0167-8655(02)00382-3.
[49] Khasawneh F, Otto A, Yesilli M. Turning dataset for chatter diagnosis using machine learning. v1. Mendeley Data; 2019. https://doi.org/10.17632/hvm4wh3jzx.1.
[50] Theodoridis S, Koutroumbas K. Feature selection. In: Pattern recognition. Elsevier; 2009. p. 261–322. https://doi.org/10.1016/b978-1-59749-272-0.50007-4.
[51] Insperger T, Mann BP, Surmann T, Stépán G. On the chatter frequencies of milling processes with runout. Int J Mach Tool Manuf 2008;48(10):1081–9. https://doi.org/10.1016/j.ijmachtools.2008.02.002.
[52] Berndt DJ, Clifford J. Using dynamic time warping to find patterns in time series. In: KDD workshop. Vol. 10. Seattle, WA; 1994. p. 359–70.
[53] Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust Speech Signal Process 1978;26(1):43–9. https://doi.org/10.1109/tassp.1978.1163055.
[54] Salvador S, Chan P. Toward accurate dynamic time warping in linear time and space. Intell Data Anal 2007;11(5):561–80. https://doi.org/10.3233/IDA-2007-11508.
[55] Han R, Li Y, Gao X, Wang S. An accurate and rapid continuous wavelet dynamic time warping algorithm for end-to-end mapping in ultra-long nanopore sequencing. Bioinformatics 2018;34(17):i722–31. https://doi.org/10.1093/bioinformatics/bty555.
[56] Dudani SA. The distance-weighted k-nearest-neighbor rule. IEEE Trans Syst Man Cybern 1976;(4):325–7.
[57] Horváth T, Wrobel S, Bohnebeck U. Relational instance-based learning with lists and terms. Mach Learn 2001;43(1):53–80. https://doi.org/10.1023/A:1007668716498.
[58] Choudhary B. The elements of complex analysis. New York: J. Wiley; 1993.
[59] Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E. Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2012. p. 262–70.
