The Annals of Statistics

1997, Vol. 25, No. 3, 1251-1276

ALIGNMENT OF CURVES BY DYNAMIC TIME WARPING 1

BY KONGMING WANG AND THEO GASSER

University of Zürich
When studying some process or development in different subjects or units, be it biological, chemical or physical, we usually see a typical pattern, common to all curves. Yet there is variation both in amplitude and dynamics between curves. Following some ideas of structural analysis introduced by Kneip and Gasser, we study a method, dynamic time warping with a proper cost function, for estimating the shift or warping function from one curve to another in order to align the two functions. For some models this method can identify the true shift functions if the data are noise free. Noisy data are smoothed by a nonparametric function estimate such as a kernel estimate. It is shown that the proposed estimator is asymptotically normal and converges to the true shift function as the sample size per subject goes to infinity. Some simulation results are presented to illustrate the performance of this method.

1. Introduction. When studying some process or development (e.g., a biological, chemical or physical process), we usually see a typical pattern which is common to different subjects or units, and yet there are variations in both amplitude and phase (or timing) between curves. One example is the growth of humans or animals, where growth evolves at different intensities and at different paces in different individuals. Another example is speech signals, where the same words are spoken with varying loudness and varying speed. Classical statistical approaches such as repeated measures analysis of variance or principal component analysis deal exclusively with amplitude variation, and methods to deal with both are scarce. In Kneip and Gasser (1992), "structural analysis" was proposed to align or shift curves to a common average time scale before applying further statistics such as averaging curves. Estimating "structural average curves" proved to be successful for studying human growth for variables which had been inaccessible because of their small size and relatively large residual variation [Gasser, Kneip, Binding, Prader and Molinari (1991) and Gasser, Kneip, Ziegler, Molinari, Prader and Largo (1994)]. However, the step leading to individual shift functions for the alignment of curves is somewhat delicate and time-consuming.

In the engineering literature, a different approach, called dynamic time warping, was developed to align two signals with different dynamics [Parsons (1986), Rabiner and Schmidt (1980), Qi (1992)]. This method has been mainly

Received May 1995; revised August 1996.
1 Research supported by Swiss NSF (21-36042.92).
AMS 1991 subject classifications. Primary 62G07; secondary 62H05.
Key words and phrases. Curves, shift functions, structural analysis, dynamic time warping, kernel estimation.

applied to speech analysis and speech recognition. We will give details in the next section and sketch here how time warping works. Suppose that two sequences {f(i), i = 1, ..., M} and {g(j), j = 1, ..., N} characterize two signals f and g, respectively. We want to find the best match between f and g by some alignment w, based on minimizing a cost function. The classical cost function is given by

    inf_w Σ_{(i,j)∈w} (f(i) − g(j))².

Here w = {(i, j)} is a warping path connecting (1, 1) and (M, N) in a two-dimensional square lattice and satisfying monotonicity and connectedness. This means that both coordinates of the parametrized path w = {(i(k), j(k)): k = 1, ..., K; i(1) = j(1) = 1, i(K) = M, j(K) = N} have to be nondecreasing, and that they can only increase by 0 or 1 when going from k to k + 1. Obviously this is a minimization problem which can be solved efficiently by dynamic programming. What controls the warping result is the cost function. How a sample of functions can be aligned by dynamic time warping to some average time scale, however, needs some further thought.
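The minimization over warping paths can be carried out by the standard dynamic programming recursion. The following is a minimal sketch for the classical cost function, assuming NumPy and zero-based indexing and omitting window and slope constraints; it illustrates the recursion only and is not the cost function proposed later in this paper.

```python
import numpy as np

def dtw_classical(f, g):
    """Dynamic program for the classical cost sum over (i, j) in w of (f(i) - g(j))^2.

    Returns the minimal cost and one optimal warping path from (0, 0) to
    (M-1, N-1), with steps of 0 or 1 in each coordinate (monotonicity and
    connectedness)."""
    M, N = len(f), len(g)
    D = np.full((M, N), np.inf)          # D[i, j]: minimal cost of a path ending at (i, j)
    D[0, 0] = (f[0] - g[0]) ** 2
    for i in range(M):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,
                       D[i, j - 1] if j > 0 else np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = (f[i] - g[j]) ** 2 + prev
    # Backtrack one optimal path.
    path, i, j = [(M - 1, N - 1)], M - 1, N - 1
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: D[c])
        path.append((i, j))
    return D[-1, -1], path[::-1]
```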
We now describe briefly the methods proposed in the statistical literature to analyze samples of functions. Suppose the observed data y_ij follow the model

(1)    y_ij = f_i(t_ij) + ε_ij,    j = 1, ..., n_i (times for subject i);  i = 1, ..., m (subjects).

Here {f_i: i = 1, ..., m} are unknown smooth functions, t_ij ∈ J ≡ [0, 1] ⊂ R (J could be any closed interval), and {ε_ij: j = 1, ..., n_i; i = 1, ..., m} are independent random variables with mean E(ε_ij) = 0 and variance V(ε_ij) = σ_i² > 0. Note that the {t_ij} need not be equally spaced.
When all subjects have the same number of measurements (n = n_i), the resulting multivariate data matrix can be analyzed by principal component analysis, as suggested by Rao (1958). In this way, m functions are reduced to a small number of "elementary" functions. This approach does not take into account the inherent smoothness of the functions f_i; and, more crucially, it does not account for "dynamic variability" but only for amplitude variability. The first drawback was eliminated in the proposal by Rice and Silverman (1991), incorporating a penalized smoothing approach into PCA [see also the related paper by Ramsay and Dalzell (1991)]. To account for the second drawback, Silverman (1995) also incorporated an individually constant shift parameter into PCA.

The cross-sectional average Σ_{i=1}^m f_i / m is in general not a good estimate of the average dynamics and amplitude. Let us assume that f_i(t) = s(u_i(t)) in (1), where s is some shape function and {u_i} is an i.i.d. random sample of shift functions satisfying E(u_i(t)) = t and strict monotonicity. If s is nonlinear, then E(f_1) ≠ s. By the law of large numbers, m⁻¹ Σ_{i=1}^m f_i → E(f_1) ≠ s in probability.

The structural analysis suggested by Kneip and Gasser (1992), leading in particular to a structural average curve, proceeds as follows.

1. Kernel estimates f̂_i, f̂_i^(1), f̂_i^(2) of f_i, f_i′, f_i″ are obtained.
2. Individual structural points are identified from f̂_i, f̂_i^(1) and/or f̂_i^(2). Roughly speaking, structural points (called "landmarks" in shape analysis) are features that are common to all or most curves.
3. Shift functions û_i are constructed from the locations of the structural points such that individual structural points are shifted to the respective average location. In between, smooth monotonic interpolation is used.
4. A structural average f̂_0 (in contrast to a cross-sectional average) is obtained by averaging the aligned (smoothed) curves:

(2)    f̂_0(·) = (1/m) Σ_{i=1}^m f̂_i(û_i(·)).

Figure 1 shows the alignment of the velocities of two growth curves of shoulder width (boys) by steps (2)-(3) and by dynamic time warping. The structural points used in steps (2)-(3) are simply extremes and inflection points. Dynamic time warping is introduced in the next section. Both methods produce rather good results in this example.

A delicate step is (2), and to a lesser extent (3); it is not easy to define features which are common to most curves and to determine them unequivocally from noisy data, such that they have an equivalent meaning. Dynamic time warping, which addresses a similar problem, does not need such prerequisites. The method is fully nonparametric. It is thus of interest to establish whether and when dynamic time warping leads to meaningful shift functions. This problem is dealt with here mainly in the context of aligning one function with respect to another. The possible extension to m functions is discussed but not treated exhaustively.
Section 2 is devoted to an introduction to dynamic time warping and to some improvements. First, a variational problem in continuous time is formulated in order to obtain smooth shift functions instead of a warping path. Second, new cost functions are introduced to offer an improvement. At the end of Section 2, we present approaches for aligning m curves to their common time scale so that a structural average curve can be computed. A theoretical analysis of this method is beyond the scope of this paper. In Section 3, various classes of models are introduced and it is shown that dynamic time warping identifies the correct shift functions in these classes. Asymptotic properties of the estimators of shift functions are derived in Section 4 (with proofs in the Appendix). Simulations with a small number of replications are presented in Section 5.

FIG. 1. Example of aligning two curves: top left gives two smoothed velocity curves; top right is the alignment by dynamic time warping; bottom left is the alignment by steps (2)-(3); bottom right gives the differences between the shift functions and the identity function. The shift functions are produced by dynamic time warping (solid line) and by steps (2)-(3) (dotted).

2. Dynamic time warping. Dynamic time warping has been designed for aligning one curve with respect to another, and it is well documented in the engineering literature. The article by Sakoe and Chiba (1978) is a good reference for basic ideas, and we present this approach in Section 2.1. A new cost function is introduced there. In Section 2.2, we deal with the problem of aligning m regression functions.

2.1. Aligning two regression functions. In speech recognition, a word or a sentence can be expressed as a sequence of features by feature extraction methods. This feature vector, rather than the original audio signal, is then analyzed further. The recognition process consists of comparing the recorded word with words in a template set. If its feature sequence closely matches the feature sequence of a word in the template set, then the word (or speech) is recognized. For this comparison, the time-axis fluctuations between the given word and a template have to be eliminated. The template for a word might be what we call a structural average here; it is obtained by averaging samples spoken by many people.

We describe briefly the version of dynamic time warping given in Sakoe and Chiba (1978). Let F = (f(1), ..., f(M)) and G = (g(1), ..., g(N)) be two feature vectors. Asking whether the two speech patterns are of the same category is similar to asking whether there is a mapping w of the form

(3)    w = ((i(1), j(1)), ..., (i(K), j(K)))

such that the discrepancy

(4)    C_0(F, G, w) ≡ Σ_{k=1}^K d(f(i(k)), g(j(k))) r(k)

is small enough. The warping path w has to satisfy several side conditions.

CONDITIONS. (i) Monotonicity: i(k) ≤ i(k + 1) and j(k) ≤ j(k + 1).
(ii) Continuity: i(k + 1) − i(k) ≤ 1 and j(k + 1) − j(k) ≤ 1.
(iii) Boundary: i(1) = j(1) = 1, i(K) = M and j(K) = N.
(iv) Window: |i(k) − j(k)| ≤ a given positive integer.
(v) Slope constraint: neither too steep nor too gentle a gradient should be allowed.

In (4), the function d(·, ·) is a distance measure and r is a nonnegative weighting function [usually one takes r(k) ≡ 1]. The length K of the warping path w is determined by the warping process. Note that this cost function is symmetric in F and G.

The time-normalized distance between the speech patterns F and G is defined as the solution of the following minimization problem:

(5)    D_0(F, G) = inf_w C_0(F, G, w).

Conditions (i) and (ii) are natural. Condition (iii) is imposed since the start and end of words are detected before time normalization. Condition (iv) reflects the idea that time-axis fluctuations should not lead to excessive differences in timing. Finally, (v) is used to prevent unrealistic warping when |M − N| is relatively large. Figure 2 shows what time warping does.
FIG. 2. Tutorial illustration of dynamic time warping. Top left: two given curves; top right: warping function; bottom figures: the warping of one curve to the other.

In principle a cost function could be any functional of the two input sequences and a warping path, depending on the purpose of the application. The cost functions used in speech recognition and pattern classification are variants of the classical quadratic cost function (see Table 1 for some examples).

Some explanations might be helpful. In the second line of the table, P(w) is a penalty function which penalizes jumps and flat spots on the warping path w, and b is a coefficient which controls how severe the penalty is [b = 0.0075 in Roberts, Lawrence, Eisen and Hoirch (1987), chosen by trial and error]. From the equality ab = −(a − b)²/2 + (a² + b²)/2, we see that

    −Σ_{(i,j)∈w} f(i) g(j) = (1/2) Σ_{(i,j)∈w} (f(i) − g(j))² − (1/2) Σ_{(i,j)∈w} (f(i)² + g(j)²).

The first term is just the classical cost function. Since the second term tends to make the warping path longer, a penalty is needed. A version of P(w) given in Roberts, Lawrence, Eisen and Hoirch (1987) is the following. Write w = {(i(k), j(k)): k = 1, ..., K}, where K is the length of w. Then

    P(w) = Σ_{k=2}^K p(w, k),

TABLE 1
Cost functions

Σ_{(i,j)∈w} (f(i) − g(j))²                                   Classical

−Σ_{(i,j)∈w} f(i) g(j) − bP(w)                               Roberts, Lawrence, Eisen and Hoirch

∫₀¹ [ α² ( f(t)/||f|| − g(u(t))/||g|| )²
      + (1 − α)² ( f′(t)/||f′|| − g′(u(t))/||g′|| )²
      + φ(u′(t)) ] dt                                        Proposed here

where

    p(w, k) = (k − r_k − 1)²,  if i(k) = i(k − 1), with r_k = min{r ≤ k − 1: i(r) = i(k)},
            = (k − r_k − 1)²,  if j(k) = j(k − 1), with r_k = min{r ≤ k − 1: j(r) = j(k)},
            = 0,               otherwise.

With the penalty term added, the cost function is no longer a convex function.

The third line of Table 1 is the cost function proposed in this paper. Details are given below. Our cost function is inspired by Sobolev norms and by the least squares principle. The normalization by the sup-norm || · || in (6) is intended to reduce the differences in amplitude between the curves when estimating the shift functions. This should prevent us from explaining amplitude variability between curves in terms of dynamic variability. There are other cost functions used in speech analysis. For example, a cost function defined in terms of the linear predictive coding feature sets of signals is used in Höhne, Coker, Levinson and Rabiner (1983).

For our purpose, where aligning maxima, minima and inflection points of curves is important, we incorporate derivatives of the functions into the cost function. Apart from heuristics, theoretical and simulation analysis lends support to this idea. Incorporating higher order derivatives (the second derivative in particular) is possible in principle, but problems of estimating higher order derivatives from noisy data might arise.
Now we give details of the new cost function. Define a functional F of the functions f, f′, g, g′, u and a real variable α by

(6)    F(f, f′, g, g′, u, α)(t) ≡ α² ( f(t)/||f|| − g(u(t))/||g|| )² + (1 − α)² ( f′(t)/||f′|| − g′(u(t))/||g′|| )².

Here f′ is the derivative of f and ||f′|| is the supremum norm of f′. The cost function is then defined by

(7)    C(f, f′, g, g′, u, α) ≡ ∫₀¹ [ F(f, f′, g, g′, u, α)(t) + φ(u′(t)) ] dt.

The function φ serves as a penalty function which plays a role similar to the side conditions (i)-(v). It is specified as follows. Let M > δ > 0 be constants and define φ to be a convex function satisfying the following conditions: φ(x) = 0 for x ∈ [δ + ρ, M − ρ] with a small positive number ρ; φ(δ+) = φ(M−) = ∞; φ(x) = ∞ for x ∈ (δ, M)ᶜ; and φ ∈ C⁴(δ, M). An example of such a φ is given by

    φ(t) = c [ (δ + ρ − t)⁵ I_(δ, δ+ρ](t) / (t − δ) + (t − M + ρ)⁵ I_[M−ρ, M)(t) / (M − t) ],    t ∈ (δ, M),

for a constant c. Here I_A(t) is the indicator function of a set A.


The best warping or shift function between two functions f and g is given by the solution of the following variational problem:

(8)    inf { C(f, f′, g, g′, u, α): u ∈ C¹, α ∈ R }.

The cost function (7) is motivated not only by the least squares principle. For two important classes of models, the true shift functions can be recovered by using this cost function; compare Section 3. The simulations in Section 5 also show its usefulness.
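For illustration, a discretized evaluation of the cost (7) can be sketched as follows (assuming NumPy). The barrier below is a crude stand-in for the smooth convex penalty φ defined above, and linear interpolation is used to evaluate g and g′ at the warped times; none of this is prescribed by the paper.

```python
import numpy as np

def barrier(up, delta=0.1, big=10.0):
    # Crude stand-in for the penalty phi: zero when u' stays inside (delta, big),
    # infinite otherwise (the paper's phi is a smooth convex barrier).
    return np.where((up > delta) & (up < big), 0.0, np.inf)

def warp_cost(t, f, fp, s, g, gp, u, alpha):
    """Discretized version of cost (7).

    t:      grid on [0, 1] where f is observed;   f, fp: samples of f and f' on t
    s:      grid on which g is observed;          g, gp: samples of g and g' on s
    u:      candidate shift function evaluated on t (u[k] ~ u(t[k]))
    alpha:  amplitude/derivative trade-off in [0, 1]"""
    sup = lambda x: np.max(np.abs(x))
    g_u = np.interp(u, s, g)             # g(u(t)) by linear interpolation
    gp_u = np.interp(u, s, gp)           # g'(u(t))
    term1 = (f / sup(f) - g_u / sup(g)) ** 2
    term2 = (fp / sup(fp) - gp_u / sup(gp)) ** 2
    up = np.gradient(u, t)               # numerical derivative u'
    integrand = alpha**2 * term1 + (1 - alpha)**2 * term2 + barrier(up)
    return np.trapz(integrand, t)
```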
It is easy to verify via Euler's equation that if the optimal solution u of (8) satisfies δ + ρ < u′(t) < M − ρ, t ∈ (0, 1), then any extremum of g is aligned to a stationary point of f, where f′ = 0.

Note that we do not require u(0) = 0. This flexibility allows us to study certain models in more detail (Section 3). In applications one usually has u(0) = 0. This is the case in speech recognition, where the start and end points of words or sentences are detected before time warping. It is also the case when growth curves are analyzed. The case u(0) ≠ 0 arises, for example, when one signal has not been observed from the beginning.

The warping path in a discrete setting is in parametrized curve form {(i(k), j(k)): k = 1, ..., K}, while the warping function u in our continuous setting is not. The reason is that the warping path in a discrete setting need not be a one-to-one mapping. Note that the parametric form makes the variational problem symmetric in the two functions being matched. Since we require that the shift function is in C¹ and that the optimal shift function is strictly increasing, no parametrization in curve form is needed.
The variational problem (8) can be solved as follows. For a given α, use dynamic programming to find an optimal u corresponding to this α. The minimization over α can then be restricted to [0, 1] (see the proof of Lemma 4.1) and can be done, say, by grid search.
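The outer step of this two-stage scheme can be sketched as follows; here solve_warp_for_alpha is a hypothetical placeholder for the inner dynamic-programming solver of a discretization of (8) at a fixed α, which is not shown.

```python
import numpy as np

def minimize_over_alpha(solve_warp_for_alpha, n_alpha=21):
    """Outer loop of the two-step scheme: grid search over alpha in [0, 1].

    solve_warp_for_alpha(alpha) is assumed to return the pair (u, cost)
    obtained by dynamic programming for that alpha."""
    best_cost, best_u, best_alpha = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_alpha):
        u, cost = solve_warp_for_alpha(alpha)
        if cost < best_cost:
            best_cost, best_u, best_alpha = cost, u, alpha
    return best_u, best_alpha, best_cost
```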

2.2. Aligning m regression functions. We now return to model (1) of m regression functions and present a method for aligning all curves to their average time scale, based on aligning one function to another. Once this is done, further statistical analysis such as structural averaging is straightforward.

Dynamic time warping produces a relative shift function between two curves. To align a sample of curves to a common time scale, we need a reference curve. Then all curves can be aligned to this reference curve, and hence the average timing can be computed. It is assumed in this subsection that all curves are observed continuously and are noise free. In applications a further preliminary step of smoothing the data is required.

The principle is simple. Let f_e be the chosen reference curve. Warp each curve f_i, i = 1, ..., m, to f_e and denote the warping function by h_i(t), t ∈ [0, 1]. Then

    h(t) ≡ (1/m) Σ_{i=1}^m h_i(t)

is the average timing with respect to f_e. Since each h_i is strictly increasing, the function h is strictly increasing and has an inverse h⁻¹. Now it is clear that

    u_i(t) ≡ h_i(h⁻¹(t))

is the correct shift function to transform f_i to the average time scale. A structural average of {f_i: i = 1, ..., m} is then computed as

    f_0(t) = (1/m) Σ_{i=1}^m f_i(u_i(t)).
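A minimal numerical sketch of this averaging step, assuming NumPy, all curves sampled on a common grid of [0, 1] and each h_i given on that grid; the inverse h⁻¹ is computed by linear interpolation and boundary effects are ignored.

```python
import numpy as np

def structural_average(t, curves, warps_to_ref):
    """Average timing and structural average from pairwise warps to f_e.

    t:            common grid on [0, 1]
    curves:       list of arrays, curve f_i sampled on t
    warps_to_ref: list of arrays, h_i evaluated on t (each strictly increasing)"""
    h_bar = np.mean(warps_to_ref, axis=0)          # average timing h(t)
    h_inv = np.interp(t, h_bar, t)                 # h^{-1}(t) by swapping axes
    aligned = []
    for f_i, h_i in zip(curves, warps_to_ref):
        u_i = np.interp(h_inv, t, h_i)             # u_i(t) = h_i(h^{-1}(t))
        aligned.append(np.interp(u_i, t, f_i))     # f_i(u_i(t))
    return np.mean(aligned, axis=0)                # f_0(t)
```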

The reference curve should be close to the typical pattern of the sample curves and should have more or less the same features as most sample curves. A consideration in choosing a reference curve is the trade-off between accuracy and computational effort. Several possibilities are given here.

1. In principle one could choose a curve at random from the sample as the reference curve, following the arguments given above. This is computationally attractive even when m is large, but the statistical quality may suffer if an atypical curve is selected. This would inevitably make it more difficult to estimate the warping functions well.
2. Take each f_i, i = 1, ..., m, as reference curve. Warp every other curve to f_i and compute the total cost (i.e., the sum of the costs for warping f_j to f_i, j ≠ i). Now choose f_e to be the curve corresponding to the minimum total cost. The main problem with this procedure is that it can require prohibitive computing time if m is large. Note that one would need to solve the variational problem (8) m(m − 1)/2 times.
3. An iterative method can be used. First take f_e(t) = (1/m) Σ_{i=1}^m f_i(t), the cross-sectional average. Then compute a structural average based on f_e. In the following steps, take the structural average computed in the previous step as the reference curve of the next step and iterate. Computation is not a problem since a few iterations are enough. This proposal shows good statistical properties if the relative shifts among curves are small. If the shifts are large, the cross-sectional average might be too atypical to start with, since structure gets lost.
4. For large m, one could select a random sample of size k from {f_i}. Assume k = 2^j. Partition this selected sample into 2^(j−1) pairs and compute a structural average for each pair by a single warping. Now we have 2^(j−1) structural averages. Partition this group into 2^(j−2) pairs and compute a structural average for each pair. Repeat this procedure until only one structural average is left. Take this one as the reference curve f_e (a sketch is given below).
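A sketch of proposal (4), assuming NumPy and a hypothetical routine pair_average that returns the structural average of two curves via a single warping (for instance, built on the cost (7)).

```python
import numpy as np

def reference_by_halving(sample, pair_average, rng=None):
    """Proposal (4): build a reference curve by repeated pairing.

    sample:       list of curves (arrays on a common grid); length k = 2**j
    pair_average: hypothetical routine returning the structural average of
                  two curves via a single warping"""
    assert len(sample) & (len(sample) - 1) == 0, "sample size should be 2**j"
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(sample))
    group = [sample[i] for i in order]
    while len(group) > 1:
        # replace each pair by its structural average, halving the group size
        group = [pair_average(group[2 * k], group[2 * k + 1])
                 for k in range(len(group) // 2)]
    return group[0]
```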

Now the question arises as to which of proposals (1)-(4) should be used in practice. Our suggestion is the following. If m is small, (2) should be the choice. If m is large but one is confident that the relative shifts among the sample curves are small and that there are no outliers in the sample, (3) would be a good choice. Finally, if m is large and no prior information about the quality of the sample is available, (4) is recommended. Another possibility would be to combine (2) and (4): select a random sample of size k from {f_i} and perform (2) on this subsample to compute a reference curve. As mentioned before, an analysis of the methods proposed in this subsection is not considered in this paper.

3. Alignment for some semiparametric models. It has become clear that dynamic time warping is a nonparametric technique for the alignment of regression functions. The question is now whether dynamic time warping achieves its goal when some parametric or semiparametric model is assumed to be known. In the following we study some semiparametric models. Most of these models have been studied in recent years in the statistical literature. We want to investigate whether dynamic time warping identifies the right alignment in the absence of noise. Let us postulate a functional model of the following form for the regression model (1):

(9)    f_i(t) = s(t, θ_i),    t ∈ [0, 1],  i = 1, ..., m.

Here s is some prespecified function with individual parameters θ_i ∈ R^d. When data for many functions are available, and when some general structure for s can be postulated, the semiparametric problem of estimating θ_i in the presence of the infinite-dimensional nuisance parameter s can be successfully treated [Kneip and Gasser (1988)], and no specific form for s needs to be specified. In the context of this paper, it is interesting that most of the semiparametric classes of models considered so far have amplitude and shift variation as their basic structure. In the simplest case this variation is modeled linearly, leading to the so-called shape-invariant model (SIM):

(10)    f_i(t) = a_i s((t − b_i)/c_i) + d_i.

Estimating the parameters a_i, b_i, c_i, d_i and the shape function s for such a model has been studied in Lawton, Sylvestre and Maggio (1972), Kneip and Gasser (1988) and Kneip and Engel (1995). This class contains logistic, Gompertz and other important nonlinear regression models. It is successful for modeling human growth [Stützle, Gasser, Molinari, Largo, Prader and Huber (1980)] by postulating two components for prepubertal and pubertal growth, respectively.

It is easy to see that the optimal shift function between f_i and f_j is given by the solution (u, α) of (8), where

    (u(t), α) = (c_j(t − b_i)/c_i + b_j, 0).

Thus, the newly introduced cost function is able to identify linear shifts within SIM correctly. To appreciate this, note that using the other cost functions in Table 1 will not lead to correct shift functions even within this simple model.
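As a small numerical check of this statement, with a hypothetical shape function and arbitrarily chosen SIM parameters (none of them taken from the paper), one can verify that the linear shift u(t) = c_2(t − b_1)/c_1 + b_2 makes the two curves agree up to amplitude and level:

```python
import numpy as np

# Hypothetical shape function and SIM parameters, for illustration only.
s = lambda x: np.exp(-0.5 * ((x - 0.5) / 0.15) ** 2)
a1, b1, c1, d1 = 1.0, 0.05, 1.1, 0.0
a2, b2, c2, d2 = 0.7, -0.02, 0.9, 2.0

f1 = lambda t: a1 * s((t - b1) / c1) + d1
f2 = lambda t: a2 * s((t - b2) / c2) + d2

t = np.linspace(0.0, 1.0, 201)
u = c2 * (t - b1) / c1 + b2              # the linear shift identified by (8)

# After the shift, f2(u(t)) and f1(t) share the same inner argument (t - b1)/c1,
# so their extremes and inflection points line up exactly.
lhs = (f2(u(t)) - d2) / a2
rhs = (f1(t) - d1) / a1
assert np.allclose(lhs, rhs)
```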
For comparing two functions f_1 and f_2, Härdle and Marron (1990) considered a somewhat more general model with amplitude-phase modulation:

(11)    f_2(t) = S_{θ_0}^{-1} f_1(T_{θ_0}^{-1} t),

where S_θ and T_θ are invertible parametric transformations. For linear transformations S_θ and T_θ this reduces to (10). They proposed to estimate θ_0 by minimizing the loss function

    L(θ) = ∫ [ f_1(t) − S_θ f_2(T_θ t) ]² r(t) dt,

with some nonnegative weight function r. None of the cost functions in Table 1 is successful in identifying T_{θ_0} in general, but a back-fitting method [such as the one proposed in Kneip and Gasser (1988)] would work.
A more general nonlinear shift model (NLSM) allows a shift function u_i to be specified nonparametrically:

(12)    f_i(t) = a_i s(u_i(t)),

where u_i ∈ C¹ is strictly increasing and a_i is a real and positive parameter. Despite the great generality allowed for the shifts, this model is still identifiable. The optimal shift function from f_j to f_i can be obtained by dynamic time warping with cost function (7) as

    (u(t), α) = (u_j^{-1}(u_i(t)), 1).

Thus, shift functions can only be extracted in relative terms with respect to some function chosen as reference (f_i in this example). Again, the other cost functions in Table 1 are not successful in extracting the correct shift function.
An interesting generalization emerges when the individual factor a_i is replaced by some parametric function a_i(t, β_i):

(13)    f_i(t) = a_i(t, β_i) s(u_i(t)) + d_i.

Here a_i is an a priori known function with individual parameter β_i ∈ R^d, and d_i is an unknown constant. Possibly, this quite general semiparametric model is also identifiable when proper conditions are required for β_i, u_i and d_i. In any case, recovering the shift function between two such functions would require a more complicated cost function than (7). It is plausible that a back-fitting procedure, such as the one used in Kneip and Gasser (1988), could solve this problem. However, the general amplitude-phase modulated model

    f_i(t) = a_i(t) s(u_i(t))

is clearly not identifiable. This is true even when requiring obvious conditions such as E a_i(t) ≡ 1 and E u_i(t) ≡ t. Nonetheless, simulations show that cost function (7) yields reasonable results even for this model.

With the new cost function (7), shift functions can be fully recovered by dynamic time warping in the models SIM and NLSM if the data are noise free. This seems to us an important achievement, since dynamic time warping is a relatively easy, automatic method. It can be attributed to the inclusion of derivatives in the cost function. For noisy data, a nonparametric function fitting method such as kernel estimation or local polynomial fitting allows the estimation of the function itself and of its derivatives as a preliminary step. The problem of estimating derivatives from noisy data might have prevented their earlier use in a cost function.

4. Estimation and asymptotics. In practice, a regression function f is unobservable and has to be estimated from noisy data {y_1, ..., y_n}, where y_j = f(t_j) + ε_j. We use convolution-type kernel smoothing to estimate derivatives of order ν ≥ 0 of f. Specifically, let K_ν be a kernel of order (ν, ν + 2) [Gasser, Müller and Mammitzsch (1985)] for ν = 0, 1. That is,

    ∫_{−1}^{1} K_ν(t) t^j dt = 0,            0 ≤ j ≤ ν + 1, j ≠ ν,
                             = (−1)^ν ν!,    j = ν,
                             = β ≠ 0,        j = ν + 2,

and the support of K_ν is [−1, 1]. Note that optimal kernels are explicitly known as polynomials of order (ν + 2). Then we define

    f̂(t) = (1/b_0) Σ_{j=1}^n y_j ∫_{s_{j−1}}^{s_j} K_0((v − t)/b_0) dv,

    f̂′(t) = (1/b_1²) Σ_{j=1}^n y_j ∫_{s_{j−1}}^{s_j} K_1((v − t)/b_1) dv.

Here s_j = (t_{j−1} + t_j)/2 for 1 ≤ j ≤ n, s_0 = 0 and s_n = 1. Following asymptotic theory we take b_0 = O(n^{−1/5}) and b_1 = O(n^{−1/7}). Other smoothing methods such as local polynomial fitting or smoothing splines could be employed here instead of kernel smoothing. This would not change the convergence rates given in Theorem 4.1 below, since we assume a fixed design.
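A minimal sketch of the convolution-type estimator f̂ for ν = 0, assuming NumPy. The Epanechnikov kernel is used here merely as a simple kernel of order (0, 2), not the optimal kernel of Gasser, Müller and Mammitzsch (1985), and the inner integrals are approximated by the trapezoidal rule.

```python
import numpy as np

def epanechnikov(x):
    # A kernel of order (0, 2) with support [-1, 1].
    return np.where(np.abs(x) <= 1, 0.75 * (1.0 - x**2), 0.0)

def gasser_mueller(t_eval, t, y, b, kernel=epanechnikov):
    """Convolution-type kernel estimate of f at the points t_eval (a sketch).

    t, y: design points and observations on [0, 1]; b: bandwidth.
    Derivative estimation (nu = 1) would use a kernel of order (1, 3)
    and a division by b**2 instead of b."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    t_eval = np.atleast_1d(np.asarray(t_eval, float))
    s = np.concatenate(([0.0], (t[:-1] + t[1:]) / 2.0, [1.0]))   # s_0, ..., s_n
    fhat = np.zeros_like(t_eval)
    for k, te in enumerate(t_eval):
        for j in range(len(y)):
            grid = np.linspace(s[j], s[j + 1], 20)
            fhat[k] += y[j] * np.trapz(kernel((grid - te) / b), grid)
        fhat[k] /= b
    return fhat
```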
For any two functions f and g, the optimal shift function between f and g is then estimated by the solution of

    inf { C(f̂, f̂′, ĝ, ĝ′, u, α): u ∈ C¹, α ∈ R }.

Obviously some theoretical and practical questions need to be addressed. Does the variational problem (8) have a solution? Is the solution unique? How does dynamic time warping perform with noisy data? We address these questions in this section. As stated before, the analysis focuses on the alignment of two functions. First we prove the existence of a solution to (8).

LEMMA 4.1. If f, g ∈ C¹, then there exists a solution of the variational problem (8).

A proof is given in the Appendix. Since f̂, ĝ, f̂′, ĝ′ are continuous, this lemma shows that (8) has a solution if f, g, f′, g′ are replaced by f̂, ĝ, f̂′, ĝ′.

It is easy to see that (8) has many solutions in some cases. An example is g(u(t)) = c f(t) for some strictly increasing u and some constant c, where f(t) = constant on some open interval (a, b) ⊂ [0, 1]. Obviously (u, 1) is a solution of (8). Now take (a_1, b_1) ⊂ (a, b) and define v by v(t) = u(t) on [0, 1] \ (a_1, b_1) and arbitrarily on (a_1, b_1), such that v is strictly increasing and v ∈ C¹[0, 1]. Then (v, 1) is also a solution of (8). To make the solution unique, one could replace φ in (8) by a strictly convex function with minimum at 1, but this may drive the solution of (8) away from the optimal shift function if the identity u(t) = t is not the optimal shift function. Similarly, if g′(u(t)) = c f′(t) and f′(t) = constant on (a, b) ⊂ [0, 1], then (8) has many solutions. Note that even though (8) has many solutions in each of these cases, the alignment is unique. That is,

    ||g(u) − g(v)||_2 ||g′(u) − g′(v)||_2 = 0.

Here || · ||_2 stands for the L_2 norm of square integrable functions on [0, 1]. We suspect that this equation holds in general whenever (8) has more than one solution, but we could not prove it.
In data analysis, parametric linear function fitting is the approach used most often. Here, as in other problems in nonparametric function fitting, a linear regression function becomes a degenerate case. However, it is more a theoretical than a practical problem, since linear or almost linear functions can easily be spotted in an exploratory analysis. If both f and g are linear and if both their derivatives are positive or negative, then f′/||f′|| − g′/||g′|| ≡ 0. Thus for any u defined on [0, 1], the pair (u, 0) is a solution of (8). This results in incorrect alignment in this case. There are two ways to get around this problem. First, choose the penalty function φ as a strictly convex function with minimum φ(1) = 0. Then the solution of (8) is (u, 0) with u(t) = t. As mentioned before, this may drive the solution of (8) away from the optimal alignment between two functions in general. The second way is the following. Observe that d g(u(t))/dt = g′(u(t)) u′(t). It is therefore natural to replace g′(u) in (6) by g′(u) u′. When both f and g are linear, the second term of (6) then becomes (1 − u′(t))². The theoretical analysis would not change much if this replacement were made, but computation becomes more difficult, mainly because one needs a way of computing u′ such that it is still possible to solve problem (8) by dynamic programming.
Some notation is needed for the statistical analysis. Let

    A(t) ≡ A(f, f′, g, g′, u, α)(t)
         = ( F_uu(f, f′, g, g′, u, α)   F_uα(f, f′, g, g′, u, α) )
           ( F_uα(f, f′, g, g′, u, α)   F_αα(f, f′, g, g′, u, α) ).

We have used the notation F_uu(f, f′, g, g′, u, α) for F_uu(f(t), f′(t), g(t), g′(t), u(t), α), since t will be fixed. Here (u, α) is an optimal solution of (8). Let (û, α̂) denote a solution of (8) with f, g, f′, g′ replaced by f̂, ĝ, f̂′, ĝ′.

Recall that α = 0, 1 correspond to models where the true shift functions can be recovered (Section 3). Below and in the remainder of this paper, let f̂^(k)(t) = d^k f̂/dt^k, and similarly for the derivatives of kernels and other estimators. We have the following theorem.

THEOREM 4.1. Assume conditions (i)-(iii):

(i) f, g ∈ C⁴(R) and the optimal shift function u between f and g satisfies δ + ρ < u′(t) < M − ρ, t ∈ [0, 1];
(ii) A(t) is invertible at some t ∈ (0, 1);
(iii) b_0 = O(n^{−1/5}) and b_1 = O(n^{−1/7}).

Then the following conclusions hold:

(a) E(û, α̂)(t) − (u, α)(t) = O(b_1²) + o(n^{−1/2} b_1^{−3/2} log²(n)) if α ≠ 1, and
    E(û, α̂)(t) − (u, α)(t) = O(b_0²) + o(n^{−1/2} b_0^{−1/2} log²(n)) if α = 1.

(b) If α ≠ 1 then

    √(n b_1⁵) [ (û, α̂)(t) − E(û, α̂)(t) ] ⇒ N(0, A(t)⁻¹ V(α) A(t)⁻¹)

with N a multinormal distribution and

    V(α) = 4(1 − α)² σ² ( f′(t)/||f′|| − g′(u(t))/||g′|| )² ∫_{−1}^{1} [K_1^(1)(x)]² dx diag(||g′||⁻², 0).

If V(α) = 0 then √(n b_1⁵) [ (û, α̂)(t) − E(û, α̂)(t) ] → 0 in probability.
(c) If α = 1 then

    √(n b_0³) [ (û, α̂)(t) − E(û, α̂)(t) ] → N(0, A(t)⁻¹ V_1 A(t)⁻¹)

with N a multinormal distribution and

    V_1 = 4σ² ( f(t)/||f|| − g(u(t))/||g|| )² ∫_{−1}^{1} [K_0^(1)(x)]² dx diag(||g||⁻², 0).

If V_1 = 0 then √(n b_0³) [ (û, α̂)(t) − E(û, α̂)(t) ] → 0 in probability.
The proof is given in the Appendix. It shows that dynamic time warping performs reasonably well with noisy data, though the convergence rate is not very fast (n^{−1/7} for α ≠ 1 and n^{−1/5} for α = 1). We point out that these results on bias and variance hold for any cost function with three continuous Fréchet derivatives. This will be clear from the proof. We make some remarks about this theorem.

REMARKS. (a) Condition (i) is not a restriction, since δ, ρ and M⁻¹ can be taken as small as one wants.

(b) It would be nice to have a convergence rate for ||û − u||. As a result of the nonuniqueness of the solutions of (8), it could, however, be complicated to show even that there exists an optimal solution û such that ||û − u|| → 0.

(c) Since (u, α) is an optimal solution, the second variation of the cost function at (u, α), given by

    D²C(f, f′, g, g′, u, α)[u_1, α_1; u_1, α_1] = ∫₀¹ [ (u_1, α_1) A(t) (u_1, α_1)ᵀ + φ″(u′)(u_1′)² ] dt,

is nonnegative for any admissible (u_1, α_1). See the proof in the Appendix for the definition of the second variation, or second Fréchet derivative. Since u satisfies (i), φ″(u′) ≡ 0. Hence A(t) is positive semidefinite for all t ∈ [0, 1]. Since A(t) is continuous in t, condition (ii) implies that A(t) is positive definite in a neighborhood of t. This makes the solution of (8) unique in a neighborhood of t, and therefore makes it possible to prove the pointwise result given in this theorem.

(d) One can check condition (ii) for some interesting models. SIM is an easy example, while the proof for NLSM is not so easy.

(e) In the structural analysis proposed by Kneip and Gasser [(1992), Theorem 3, page 1289], the estimated shift function from noisy data converges to the true shift function at a rate of O(n^{−1/5}). Here the rate (n b_1⁵)^{−1/2} = O(n^{−1/7}) for α ≠ 1 is slower because the second derivative of ĝ is involved when solving (8) (see the proof given in the Appendix).

5. Simulations. We undertook a small scale simulation (100 runs) to evaluate the practical performance of dynamic time warping. Both the classical and our new cost function are evaluated. The evaluation is performed for the basic problem of warping one function to a second one. The assumption u(0) = 0, which is natural and useful in many applications, is not made in this section.
The base function (or shape function) is shown in Figure 3 and is defined by

(14)    s(t) = s_1(t) + 4t sin(31.4(0.45 − t)),               t ≤ 0.45,
             = s_1(t) + 10(s_1(t) − s_2(t))(0.45 − t),        0.45 ≤ t ≤ 0.55,
             = s_2(t) + 4(1 − t) sin(31.4(t − 0.55)),         t ≥ 0.55,

with

    s_1(t) = 0.25(t − 5)² − 8(0.45 − t),    s_2(t) = −0.25(t + 4)² + 8(t − 0.55).

We consider the model

    f_i(t) = a_i s(h_i(t)) + d_i,    i = 1, 2,

where a_i and d_i are constants and h_i is a strictly increasing shift function. The base function s has 8 extremes, denoted by τ_j, j = 1, ..., 8. Let t_ij = h_i^{-1}(τ_j), j ∈ K_i, be the corresponding extremes of f_i which are present on the graph {(t, f_i(t)): t ∈ [0, 1]}; K_i is a set of indices. Note that t_ij is present on the graph if and only if t_ij ∈ [0, 1].

FIG. 3. Shape function.

Under optimal shifting of the second function to the first one, the extreme t_2j should be shifted to t_1j for j ∈ K ≡ K_1 ∩ K_2. Let t̂_1j be the image of t_2j under dynamic time warping for j ∈ K. We use

(15)    r ≡ (1/|K|) Σ_{j∈K} (t̂_1j − t_1j)²

as an error measurement for aligning extremes. This choice is made for the following reasons. First, the norm ||û − u||_2, with û and u the estimated and true shift functions, respectively, is not very sensitive and thus not appropriate as a criterion (a different standardization might, however, help). One could also compute the difference ||f_1(·) − f_2(u(·))||_2, but this is often not informative enough about the appropriateness of the alignment.
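The error measure (15) is straightforward to compute once the locations of the extremes and of their images under the estimated warping are known; a minimal sketch, assuming NumPy:

```python
import numpy as np

def alignment_error(t1, t1_hat):
    """Error measure (15): mean squared distance between the extremes t_1j of
    the first curve and the images of the corresponding extremes of the second
    curve under the estimated warping (indices j in the common set K)."""
    t1, t1_hat = np.asarray(t1, float), np.asarray(t1_hat, float)
    return np.mean((t1_hat - t1) ** 2)
```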

5.1. Shape-invariant model. In this simulation we consider the shape-invariant model; that is, the shift functions are

    h_i(t) = (t − b_i)/c_i.

For the 100 runs, first 200 sample curves are generated and grouped into 100 pairs. Then we apply dynamic time warping with a prespecified cost function to warp one curve of each pair to the other and compute the warping error by (15). The bandwidth for kernel smoothing was chosen data-adaptively via a plug-in rule [Gasser, Kneip and Köhler (1991)]. The parameters of the sample curves are generated as follows:

    a_i = max(1 + 5σ N(0, 1), 0.5),    b_i = 0.1σ N(0, 1),
    c_i = 1 + σ(U(0, 1) − 0.5),        d_i = 20σ N(0, 1).

Here N(0, 1) stands for a standard normal variable and U(0, 1) for a uniform variable on (0, 1). The noise ε_ij is generated as N(0, σ_ε²). The data are then formed as

    y_ij = f_i(t_ij) + ε_ij.
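A sketch of the data generation for one pair of SIM curves, following the parameter draws above (assuming NumPy and a user-supplied shape function s):

```python
import numpy as np

def simulate_sim_pair(shape, n=100, sigma=0.2, rng=None):
    """Generate one pair of noisy SIM curves as in Section 5.1.

    shape: the base/shape function s (a callable accepting arrays);
    the parameter draws mirror the description in the text, with sigma = sigma_eps."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, n)
    curves = []
    for _ in range(2):
        a = max(1 + 5 * sigma * rng.standard_normal(), 0.5)
        b = 0.1 * sigma * rng.standard_normal()
        c = 1 + sigma * (rng.uniform() - 0.5)
        d = 20 * sigma * rng.standard_normal()
        y = a * shape((t - b) / c) + d + sigma * rng.standard_normal(n)
        curves.append(y)
    return t, curves
```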

TABLE 2
Simulation with the SIM model

Cost function    σ_ε    Mean       Variance
Classical        0.1    1.81e-4    1.53e-7
New              0.1    8.77e-5    4.59e-9
Classical        0.2    9.71e-4    1.92e-6
New              0.2    2.74e-4    1.75e-7
Classical        0.4    3.28e-3    2.52e-5
New              0.4    1.59e-3    8.89e-6
Classical        0.6    3.95e-3    2.80e-5
New              0.6    2.80e-3    3.13e-5
Classical        1.0    2.87e-2    8.43e-3
New              1.0    2.80e-2    6.94e-3

The sample size for each curve is 100. We happen to take σ = σ_ε in this and the subsequent subsections, but there is no special reason for doing so.

Table 2 gives the results of the 100 runs. Let r_i, i = 1, ..., 100, be the error (15) of the ith run. The column "Mean" is the sample mean of {r_i}, while the column "Variance" is the sample variance of {r_i}, leading easily to standard errors of the simulation. The rows starting with "Classical" give the results for the classical quadratic cost function, and the rows starting with "New" give the results for cost function (7).

A typical run with σ = σ_ε = 0.2 is shown in Figure 4. The error of aligning extremes using the classical cost function is 3.33e-3, and the error using the new cost function is 1.66e-4. The true shift function is linear for the SIM model.

The results are visually appealing, and even more so for the new cost function.

5.2. Nonlinear shift model. We consider a nonlinear shift model for this simulation. The shift function is modeled by

    h_i(t) = t + (α_i / 2π) sin(2π(b_i t + γ_i)),

and the model is again

    f_i(t) = a_i s(h_i(t)) + d_i.

The parameters a_i, d_i are generated as in the last simulation, while α_i, b_i, γ_i are generated as follows:

    α_i = max{−0.7, min{0.35 N(0, 1), 0.7}},
    b_i = min{1 + 0.2 N(0, 1), 1.4},    γ_i = 0.5(U(0, 1) − 0.5).

Note that typically |b_i| ≤ 1.4 and therefore

    h_i′(t) = 1 + α_i b_i cos(2π(b_i t + γ_i)) ≥ 1 − |α_i b_i| > 0.

Thus we have strictly increasing shift functions.
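A sketch of drawing one such shift function on a grid (assuming NumPy), including a check of strict monotonicity on the grid:

```python
import numpy as np

def nlsm_shift(n=100, rng=None):
    """Draw one NLSM shift function h_i on a grid, as in Section 5.2.

    The truncations of alpha_i and b_i keep |alpha_i * b_i| < 1 (apart from
    negligible tail events), so h_i' = 1 + alpha_i b_i cos(2 pi (b_i t + gamma_i)) > 0."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.clip(0.35 * rng.standard_normal(), -0.7, 0.7)
    b = min(1 + 0.2 * rng.standard_normal(), 1.4)
    gamma = 0.5 * (rng.uniform() - 0.5)
    t = np.linspace(0.0, 1.0, n)
    h = t + alpha / (2 * np.pi) * np.sin(2 * np.pi * (b * t + gamma))
    assert np.all(np.diff(h) > 0)      # strictly increasing on the grid
    return t, h
```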

FIG. 4. Alignment for the SIM model. Top left: two smoothed sample curves; top right: warping result using the classical cost function; bottom left: warping result using the new cost function; bottom right: the two warping functions (dotted line for the classical cost function, solid line for the new cost function).

The simulation is done as follows. First 100 shift functions are generated and the data for 100 curves are formed as

    y_ij = f_i(t_ij) + N(0, σ_ε²).

Each curve is sampled at 100 points. After kernel smoothing, each curve is aligned to the shape function s(t) and the error r_i is computed. Finally, we compute the sample mean and variance of {r_i}. The results from 100 runs are given in Table 3.

A typical run with σ = σ_ε = 0.2 is shown in Figure 5. The error of aligning extremes using the classical cost function is 1.2e-4, and the error using the new cost function is 2.55e-5. Evidently, dynamic time warping with the classical cost function has difficulty in aligning the signals properly. This is not due to noise but to the more complicated shift function. The new cost function (7) performs much better.

TABLE 3
Simulation with the NLSM model

Cost function    σ_ε    Mean       Variance
Classical        0.1    1.44e-3    7.56e-6
New              0.1    8.08e-4    2.24e-6
Classical        0.2    2.16e-3    7.04e-6
New              0.2    1.10e-3    3.61e-6
Classical        0.4    2.32e-3    1.71e-5
New              0.4    8.75e-4    2.82e-6
Classical        0.6    2.01e-3    7.49e-6
New              0.6    1.01e-3    3.09e-6
Classical        1.0    2.75e-3    1.36e-5
New              1.0    1.79e-3    8.94e-6

5.3. Conclusions from the simulations. Dynamic time warping is an appropriate method for aligning two functions. The new cost function outperforms the classical cost function because of the use of derivatives. Furthermore, dynamic time warping with the new cost function produces smoother shift functions and prevents the breakdown shown in Figure 5 (top left graphics). As expected, the performance of warping depends mainly on the amount of true shift, that is, the difference between the true shift functions and the identity function (see the simulation in Section 5.1). The noise level does not affect the warping results very much (see the simulation in Section 5.2), since the data are smoothed before warping.

Compared to the structural analysis described in steps (1)-(4) of the Introduction, the advantage of dynamic time warping is that it automatically aligns the structural points of two functions. It thus needs less a priori knowledge and less manpower, but structural analysis could still be preferable in difficult situations.

APPENDIX A

Proofs.

A.1. Proof of Lemma 4.1. First note that we can restrict α to [0, 1], because for any α ∈ [0, 1]ᶜ, C(f, f′, g, g′, u, α) ≥ min{C(f, f′, g, g′, u, 0), C(f, f′, g, g′, u, 1)}. Since C(f, f′, g, g′, u, α) is bounded from below, there exists a feasible sequence (u_n, α_n) such that

    lim_{n→∞} C(f, f′, g, g′, u_n, α_n) = inf_{u, α} C(f, f′, g, g′, u, α).

Let I(δ, M) = {u ∈ C¹[0, 1]: |u(0)| ≤ M, δ ≤ u′ ≤ M}. Then I(δ, M) is compact under the norm |u| = sup{|u(t)| + |u′(t)|: t ∈ [0, 1]}. Therefore I(δ, M) × [0, 1] is compact and {(u_n, α_n)} has a subsequence which converges uniformly to a limit. This limit is a solution of (8) because C is continuous in (u, α). □

FIG. 5. Alignment for the NLSM model. Top left: two smoothed sample curves; top right: warping result using the classical cost function; bottom left: warping result using the new cost function; bottom right: warping functions (dotted line for the classical cost function, solid line for the new cost function, dashed line for the true shift function).

A.2. Proof of Theorem 4.1. The proof proceeds as follows. First, the difference (û, α̂) − (u, α) is represented as a linear functional of some basic statistics plus higher order error terms. These basic statistics are f̂(t) − f(t), f̂^(1)(t) − f′(t), f̂′(t) − f′(t), f̂′^(1)(t) − f″(t), ||f̂|| − ||f||, ||f̂′|| − ||f′||, and similar statistics with f(t) replaced by g(u(t)). Note that f̂^(1)(t) is different from f̂′(t). To treat bias and variance asymptotically, we need only take care of these basic statistics. The bias of the linear functional is simply the linear combination of the biases of those basic statistics, which are available in the literature. To deal with the variance we note that the dominating terms are those involving second derivatives: f̂′^(1)(t) − f″(t) and ĝ′^(1)(u(t)) − g″(u(t)). All other terms can be neglected as far as the asymptotic variance is concerned. Therefore the main issue is to develop such a representation, which is the first thing to do.

Some more notation is needed. Since f, g ∈ C⁴, C(f, f′, g, g′, u, α) has three continuous Fréchet derivatives with respect to (u, α). Denote these derivatives by DC, D²C and D³C, respectively. That is,

    DC(f, f′, g, g′, u, α)[u_1, α_1]
      = lim_{d→0} [ C(f, f′, g, g′, u + d u_1, α + d α_1) − C(f, f′, g, g′, u, α) ] / d,

    D²C(f, f′, g, g′, u, α)[u_1, α_1; u_2, α_2]
      = lim_{d→0} ( DC(f, f′, g, g′, u + d u_2, α + d α_2)[u_1, α_1] − DC(f, f′, g, g′, u, α)[u_1, α_1] ) / d,

and D³C is defined in the same way.
Since (û, α̂) is the minimizer of C(f̂, f̂′, ĝ, ĝ′, u, α), for any (u_1, α_1) with u_1 ∈ C¹[0, 1],

(16)    0 = DC(f̂, f̂′, ĝ, ĝ′, û, α̂)[u_1, α_1]
          = DC(f, f′, g, g′, û, α̂)[u_1, α_1] + R(f, f′, g, g′, f̂, f̂′, ĝ, ĝ′, û, α̂)[u_1, α_1]
          = D²C(f, f′, g, g′, u, α)[u_1, α_1; û − u, α̂ − α] + O(|(û − u, α̂ − α)|²)
            + R(f, f′, g, g′, f̂, f̂′, ĝ, ĝ′, û, α̂)[u_1, α_1].

We have used the fact that DC(f, f′, g, g′, u, α)[u_1, α_1] = 0. The O term equals

    D³C(f, f′, g, g′, c_u, c_α)[u_1, α_1; û − u, α̂ − α; û − u, α̂ − α]

for some (c_u, c_α) between (u, α) and (û, α̂). The term R is simply

    DC(f̂, f̂′, ĝ, ĝ′, û, α̂)[u_1, α_1] − DC(f, f′, g, g′, û, α̂)[u_1, α_1].

Since φ does not depend on f, f′, g, g′, f̂, f̂′, ĝ, ĝ′ and their derivatives, R does not depend on φ (it cancels out). Consequently,

(17)    R(f, f′, g, g′, f̂, f̂′, ĝ, ĝ′, û, α̂)[u_1, α_1]
          = ∫₀¹ [ ( F_u(f̂, f̂′, ĝ, ĝ′, û, α̂) − F_u(f, f′, g, g′, û, α̂) ) u_1
                + ( F_α(f̂, f̂′, ĝ, ĝ′, û, α̂) − F_α(f, f′, g, g′, û, α̂) ) α_1 ] dt.

Easy computations show that

    D²C(f, f′, g, g′, u, α)[u_1, α_1; û − u, α̂ − α]
      = ∫₀¹ [ (u_1, α_1) A(t) (û − u, α̂ − α)ᵀ + φ″(u′) u_1′ (û′ − u′) ] dt.

Since (u_1, α_1) can be chosen arbitrarily, it follows from (16), (17) and φ″(u′) ≡ 0 that

(18)    A(t)(û − u, α̂ − α)ᵀ = B̂(t) + O(|(û − u, α̂ − α)|²),

with

    B̂(t) = ( F_u(f̂, f̂′, ĝ, ĝ′, û, α̂) − F_u(f, f′, g, g′, û, α̂),
             F_α(f̂, f̂′, ĝ, ĝ′, û, α̂) − F_α(f, f′, g, g′, û, α̂) )ᵀ.

One could also derive (18) from the Euler equations. Since the inverse A⁻¹ exists and since F_u, F_α are continuous, one immediately gets (û − u, α̂ − α) → 0 in probability from the fact that (f̂, f̂′, ĝ, ĝ′, f̂^(1), ĝ^(1), f̂′^(1), ĝ′^(1)) − (f, f′, g, g′, f′, g′, f″, g″) → 0 in probability.
We now deal with B̂. First note the following two algebraic equalities:

    f̂/||f̂|| = f/||f|| + (f̂ − f)/||f|| − (f/||f||)(||f̂|| − ||f||)/||f|| − ((||f̂|| − ||f||)/||f||)( f̂/||f̂|| − f/||f|| ),

    f̂^(1)/||f̂|| = f′/||f|| + (f̂^(1) − f′)/||f|| − (f′/||f||)(||f̂|| − ||f||)/||f|| − ((||f̂|| − ||f||)/||f||)( f̂^(1)/||f̂|| − f′/||f|| ).

The same relations hold if we replace f, f′, f̂, f̂^(1) by f′, f″, f̂′, f̂′^(1). For any (u, α), we simply write f for f(t), g for g(u(t)), and so on. Then it follows from the two algebraic equalities that

    F_u(f̂, f̂′, ĝ, ĝ′, u, α) − F_u(f, f′, g, g′, u, α)
      = 2α² [ (ĝ/||ĝ|| − f̂/||f̂||) ĝ^(1)/||ĝ|| − (g/||g|| − f/||f||) g′/||g|| ]
        + 2(1 − α)² [ (ĝ′/||ĝ′|| − f̂′/||f̂′||) ĝ′^(1)/||ĝ′|| − (g′/||g′|| − f′/||f′||) g″/||g′|| ]
      = 2α² G_1(g, ĝ, f, f̂, u, α) + o(|G_1|) + 2(1 − α)² G_2(g′, ĝ′, f′, f̂′, u, α) + o(|G_2|).

The G functionals are defined by

    G_1(g, ĝ, f, f̂, u, α)
      = (g′/||g||)(ĝ − g)/||g|| − 2 (g′ g/||g||²)(||ĝ|| − ||g||)/||g|| − (g′/||g||)(f̂ − f)/||f||
        + 2 (g/||g||)(||f̂|| − ||f||)/||f|| + ( g/||g|| − f/||f|| )(ĝ^(1) − g′)/||g||,

    G_2(g′, ĝ′, f′, f̂′, u, α)
      = (g″/||g′||)(ĝ′ − g′)/||g′|| − 2 (g″ g′/||g′||²)(||ĝ′|| − ||g′||)/||g′|| − (g″/||g′||)(f̂′ − f′)/||f′||
        + 2 (g′/||g′||)(||f̂′|| − ||f′||)/||f′|| + ( g′/||g′|| − f′/||f′|| )(ĝ′^(1) − g″)/||g′||.

F_α is treated in the same way and one gets

    F_α(f̂, f̂′, ĝ, ĝ′, u, α) − F_α(f, f′, g, g′, u, α)
      = 4α H(f, f̂, g, ĝ, u, α) − 4(1 − α) H(f′, f̂′, g′, ĝ′, u, α) + o(|H|).

The H functional is defined by

    H(f, f̂, g, ĝ, u, α)
      = ( f/||f|| − g/||g|| ) ( (f̂ − f)/||f|| − (f/||f||)(||f̂|| − ||f||)/||f|| − (ĝ − g)/||g|| + (g/||g||)(||ĝ|| − ||g||)/||g|| ),

and the o term is

    o(|H|) = o(|H(f, f̂, g, ĝ, u, α)|) + o(|H(f′, f̂′, g′, ĝ′, u, α)|).
Combining all the above considerations, we have shown that

    B̂(t) = B̂_0(t) + higher order error terms,

with

    B̂_0(t) = ( 2α̂² G_1(g, ĝ, f, f̂, û, α̂) + 2(1 − α̂)² G_2(g′, ĝ′, f′, f̂′, û, α̂),
               4α̂ H(f, f̂, g, ĝ, û, α̂) − 4(1 − α̂) H(f′, f̂′, g′, ĝ′, û, α̂) )ᵀ.

Let us make one more observation here. If we replace (û, α̂) by (u, α) in B̂_0(t), the resulting error terms are of the form (ĝ^(j)(ξ) − g^(j)(ξ))(û − u), with some ξ between û and u, by a first order Taylor expansion. Therefore,

    B̂(t) = B̂_1(t) + higher order error terms,

where B̂_1(t) is obtained from B̂_0(t) by replacing (û, α̂) with (u, α). Now (18) becomes

(19)    (û − u, α̂ − α)ᵀ = A(t)⁻¹ B̂_1(t) + higher order error terms.

We are ready to complete the proof of Theorem 4.1.

PROOF OF (a). This part follows from (19) and the lemma of Kneip and Gasser [(1992), pages 1291 and 1292], which shows that

    E(f̂ − f) = O(b_0²),    E(f̂^(1) − f′) = O(b_0²),
    E(f̂′ − f′) = O(b_1²),  E(f̂′^(1) − f″) = O(b_1²).

Also,

    E| ||f̂|| − ||f|| | ≤ E||f̂ − Ef̂|| + ||Ef̂ − f|| = o( (log²(n)/(n b_0))^{1/2} ) + O(b_0²),
    E| ||f̂′|| − ||f′|| | ≤ E||f̂′ − Ef̂′|| + ||Ef̂′ − f′|| = o( (log²(n)/(n b_1³))^{1/2} ) + O(b_1²).

The same is true for g. With b_0 = O(n^{−1/5}) and b_1 = O(n^{−1/7}), we have

    O(b_0²) = o(b_1²),    (n b_0)⁻¹ = o((n b_1³)⁻¹).

Part (a) follows.

PROOF OF (b). If α ≠ 1 then B̂_1(t) depends on ĝ′^(1). It follows from (b) of the lemma of Kneip and Gasser (1992) that the variances of f̂′, ĝ′, f̂^(ν) and ĝ^(ν), ν = 0, 1, are dominated by the variance of ĝ′^(1). We also need that the variances of ||f̂′||, ||ĝ′||, ||f̂|| and ||ĝ|| are dominated by that of ĝ′^(1). This can be seen, for example, for ||f̂′||, as follows:

    E( ||f̂′|| − E||f̂′|| )² = E( ||f̂′|| − ||Ef̂′|| + ||Ef̂′|| − E||f̂′|| )²
                            ≤ 2 E( ||f̂′|| − ||Ef̂′|| )² + 2 ( ||Ef̂′|| − E||f̂′|| )²
                            ≤ 4 E||f̂′ − Ef̂′||² = o( log⁴(n)/(n b_1³) ) = o( n⁻¹ b_1⁻⁵ ) = o( E( ĝ′^(1) − Eĝ′^(1) )² ).

Now it is clear that the covariance of any two different terms in B̂_1(t) is an o(·) term of Var(ĝ′^(1)), because |Cov(X, Y)| ≤ (Var(X) Var(Y))^{1/2} for any X and Y.

Part (b) now follows from (19) and the fact that [cf. the lemma of Kneip and Gasser (1992)]

    Var(ĝ′^(1)) = (σ²/(n b_1⁵)) ∫_{−1}^{1} [K_1^(1)(x)]² dx + o(1/(n b_1⁵)),

and that ĝ′^(1)(t) − Eĝ′^(1)(t), properly scaled, converges to a normal variable; compare Gasser and Müller (1984).

PROOF OF (c). If α = 1, then B̂_1(t) does not depend on f̂′, ĝ′ and ĝ′^(1). The dominating term, as far as the asymptotic variance-covariance is concerned, is then the term involving ĝ^(1). Again, the covariance of any two terms in B̂_1(t) is an o(·) term of Var(ĝ^(1)). Note that

    Var(ĝ^(1)) = (σ²/(n b_0³)) ∫_{−1}^{1} [K_0^(1)(x)]² dx + o(1/(n b_0³)).

The rest of the proof of (c) is the same as in the proof of (b). □

Acknowledgments. We thank two referees and an Associate Editor for helpful comments.

REFERENCES

GASSER, TH., KNEIP, A., BINDING, A., PRADER, A. and MOLINARI, L. (1991). The dynamics of linear growth in distance, velocity and acceleration. Ann. of Human Biology 18 187-205.
GASSER, TH., KNEIP, A. and KÖHLER, W. (1991). A flexible and fast method for automatic smoothing. J. Amer. Statist. Assoc. 86 643-652.
GASSER, TH., KNEIP, A., ZIEGLER, P., MOLINARI, L., PRADER, A. and LARGO, R. (1994). Development and outcome of indices of obesity in normal children. Ann. of Human Biology 21 275-286.
GASSER, TH. and MÜLLER, H. (1984). Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist. 11 171-185.
GASSER, TH., MÜLLER, H. and MAMMITZSCH, V. (1985). Kernels for nonparametric curve estimation. J. Roy. Statist. Soc. Ser. B 47 238-252.
HÄRDLE, W. and MARRON, J. S. (1990). Semiparametric comparison of regression curves. Ann. Statist. 18 63-89.
HÖHNE, H., COKER, C., LEVINSON, S. and RABINER, L. (1983). On temporal alignment of sentences of natural and synthetic speech. IEEE Trans. Acoust. Speech Signal Process. 31 807-813.
KNEIP, A. and ENGEL, J. (1995). Model estimation in nonlinear regression under shape invariance. Ann. Statist. 23 551-570.
KNEIP, A. and GASSER, TH. (1988). Convergence and consistency results for self-modeling nonlinear regression. Ann. Statist. 16 82-112.
KNEIP, A. and GASSER, TH. (1992). Statistical tools to analyze data representing a sample of curves. Ann. Statist. 20 1266-1305.
LAWTON, W. H., SYLVESTRE, E. A. and MAGGIO, M. S. (1972). Self-modeling regression. Technometrics 14 513-532.
PARSONS, T. (1986). Voice and Speech Processing. McGraw-Hill, New York.
QI, Y. (1992). Time normalization in voice analysis. J. Acoust. Soc. Am. 92 2569-2576.
RABINER, L. and SCHMIDT, C. (1980). Application of dynamic time warping to connected digit recognition. IEEE Trans. Acoust. Speech Signal Process. 28 377-388.
RAMSAY, J. and DALZELL, C. (1991). Some tools for functional data analysis (with discussion). J. Roy. Statist. Soc. Ser. B 53 539-572.
RAO, C. (1958). Some statistical methods for the comparison of growth curves. Biometrics 14 1-17.
RICE, J. and SILVERMAN, B. (1991). Estimating the mean and covariance structure nonparametrically when data are curves. J. Roy. Statist. Soc. Ser. B 53 233-243.
ROBERTS, K., LAWRENCE, P., EISEN, A. and HOIRCH, M. (1987). Enhancement and dynamic time warping of somatosensory evoked potential components applied to patients with multiple sclerosis. IEEE Trans. Biomed. Eng. BME-34 397-405.
SAKOE, H. and CHIBA, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26 43-49.
SILVERMAN, B. (1995). Incorporating parametric effects into functional principal components analysis. J. Roy. Statist. Soc. Ser. B 57 673-689.
STÜTZLE, W., GASSER, TH., MOLINARI, L., LARGO, R., PRADER, A. and HUBER, P. (1980). Shape-invariant modeling of human growth. Ann. of Human Biology 7 507-528.

DEPARTMENT OF PSYCHIATRY
NEURODYNAMICS LAB
STATE UNIVERSITY OF NEW YORK
450 CLARKSON AVE., BOX 1203
BROOKLYN, NEW YORK 11203-2098
E-MAIL: kmw@mendel.neurodyn.hscbklyn.edu

DEPARTMENT OF BIOSTATISTICS
UNIVERSITY OF ZÜRICH
CH-8006 ZÜRICH
SWITZERLAND
E-MAIL: tgasser@ifspm.unizh.ch
