Machine Learning Strategies for Time Series Prediction
Machine Learning Summer School
(Hammamet, 2013)
Gianluca Bontempi
2. Adaptive real-time machine learning for credit card fraud detection (2012-2013).
s_t = g(t) + \varepsilon_t, \qquad t = 1, \dots, T
[Figure: decomposition of a time series into observed, trend, seasonal, and random components.]
p(\varphi_2 \mid \varphi_1) = \frac{p(\varphi_1, \varphi_2)}{p(\varphi_1)}

p(\varphi_1, \dots, \varphi_T)
E[\varphi_t] = \mu_t = \mu, \qquad \text{Var}[\varphi_t] = \sigma_t^2 = \sigma^2
\rho(k) = \frac{\gamma(k)}{\gamma(0)}
p(\varphi_{t+k} \mid \varphi_t) = p(\varphi_{t+k})
It follows that this process has constant mean and variance. Also, \gamma(k) = 0 for k = 1, 2, \dots
A purely random process is strictly stationary.
A purely random process is sometimes called white noise by engineers.
An example of purely random process is the series of numbers drawn by
a roulette wheel in a casino.
[Figure: a white noise realization (top) and its sample ACF for lags 0 to 30 (bottom): the autocorrelation is negligible at all nonzero lags.]
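As a quick empirical check, the following sketch (NumPy assumed; sample_acf is an illustrative helper, not from the slides) simulates a purely random process and estimates its autocorrelation, which should vanish at every nonzero lag:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(loc=0.0, scale=1.0, size=5000)   # white noise realization

def sample_acf(x, max_lag):
    """Sample autocorrelation rho(k) = gamma(k) / gamma(0)."""
    x = x - x.mean()
    gamma0 = np.dot(x, x) / len(x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / len(x) / gamma0
                     for k in range(max_lag + 1)])

print(np.round(sample_acf(w, 5), 3))   # approx [1, 0, 0, 0, 0, 0]
```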
\varphi_t = \varphi_{t-1} + w_t \qquad \text{(random walk)}

\nabla\varphi_t = \varphi_t - \varphi_{t-1} = w_t \qquad \text{(first difference)}
[Figure: a random walk realization of length 500.]
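A minimal check of this differencing property (NumPy assumed): the first difference of a simulated random walk recovers exactly the white-noise shocks.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=500)
phi = np.cumsum(w)               # random walk: phi_t = phi_{t-1} + w_t
diff = np.diff(phi)              # nabla phi_t = phi_t - phi_{t-1}
assert np.allclose(diff, w[1:])  # the difference is exactly the noise
```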
\varphi_t = a_1 \varphi_{t-1} + \dots + a_n \varphi_{t-n} + w_t
This means that the next value is a linear weighted sum of the past n
values plus a random shock.
Finite memory filter.
If w is a normal variable, \varphi_t will be normal too.
Note that this is like a linear regression model where \varphi is regressed not on independent variables but on its own past values (hence the prefix auto).
The stationarity properties depend on the values a_i, i = 1, \dots, n.
\varphi_t = a \varphi_{t-1} + w_t

Then

E[\varphi_t] = 0, \qquad \text{Var}[\varphi_t] = \sigma_w^2 (1 + a^2 + a^4 + \dots)

Then if |a| < 1 the variance is finite and equals

\text{Var}[\varphi_t] = \sigma^2 = \sigma_w^2 / (1 - a^2)

\rho(k) = a^k, \qquad k = 0, 1, 2, \dots
General order AR(n) process
It has been shown that a necessary and sufficient condition for stationarity is that the complex roots of the equation

A(z) = 1 - a_1 z - \dots - a_n z^n = 0

lie outside the unit circle.
[Figure: a realization of an AR process (top) with its sample ACF (middle) and partial ACF (bottom) for lags 0 to 30.]
\varphi_t = a_1 \varphi_{t-1} + \dots + a_n \varphi_{t-n} + w_t

\hat{a} = \arg\min_{a} \sum_{t=n+1}^{T} \left[ \varphi_t - a_1 \varphi_{t-1} - \dots - a_n \varphi_{t-n} \right]^2
\hat{a} = \arg\min_{a} \sum_{i=1}^{N} (y_i - x_i^T a)^2 = \arg\min_{a} (Y - Xa)^T (Y - Xa)
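As a hedged illustration of this least-squares identification (NumPy assumed; fit_ar_ls is a name introduced here, not from the slides), the sketch below builds the lagged design matrix and recovers the coefficient of a simulated AR(1) process:

```python
import numpy as np

def fit_ar_ls(series, n):
    """Estimate AR(n) coefficients a_1..a_n by ordinary least squares."""
    T = len(series)
    # column j holds the lag-(j+1) values paired with targets series[n:]
    X = np.column_stack([series[n - j - 1: T - j - 1] for j in range(n)])
    y = series[n:]
    a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a_hat

rng = np.random.default_rng(2)
x = np.zeros(2000)
for t in range(1, 2000):                  # simulate AR(1) with a = 0.8
    x[t] = 0.8 * x[t - 1] + rng.normal()
print(fit_ar_ls(x, 1))                    # approx [0.8]
```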
Linear methods interpret all the structure in a time series through linear correlation.
Deterministic linear dynamics can only lead to simple exponential or periodically oscillating behavior, so all irregular behavior is attributed to external noise, while deterministic nonlinear equations can produce very irregular data.
In real problems it is extremely unlikely that the variables are linked by a linear relation.
In practice, the form of the relation is often unknown and only a limited amount of samples is available.
[Diagram: a training dataset feeds model learning; the learned model is then used for prediction.]
[Figure: the same dataset (x vs. Y) fitted with models of increasing complexity: a linear model f(x) = \beta_0 + \beta_1 x, a cubic polynomial f(x) = \beta_0 + \beta_1 x + \dots + \beta_3 x^3, and a degree-18 polynomial f(x) = \beta_0 + \beta_1 x + \dots + \beta_{18} x^{18}.]
where the intrinsic noise term reflects the target alone, the bias reflects the target's relation with the learning algorithm, and the variance term reflects the learning algorithm alone.
This result is purely theoretical, since these quantities cannot be measured on the basis of a finite amount of data.
However, it provides insight into what makes a learning process accurate.
[Figure: bias decreases and variance increases as model complexity grows.]
2. within the family f(x, \alpha), to estimate on the basis of the training set D_N the parameter \alpha_N which best approximates f (parametric identification).
In order to accomplish that, a learning procedure is made of two nested
loops:
1. an external structural identification loop which goes through different
model structures
2. an inner parametric identification loop which searches for the best
parameter vector within the family structure.
\alpha_N = \alpha(D_N) = \arg\min_{\alpha} \widehat{\text{MISE}}_{\text{emp}}(\alpha)

\widehat{\text{MISE}}_{\text{emp}}(\alpha) = \frac{\sum_{i=1}^{N} \left( y_i - f(x_i, \alpha) \right)^2}{N}
\text{FPE} = \widehat{\text{MISE}}_{\text{emp}}(\alpha_N) \, \frac{1 + p/N}{1 - p/N}

with p = n + 1,
the Generalized Cross-Validation (GCV)

\text{GCV} = \frac{\widehat{\text{MISE}}_{\text{emp}}(\alpha_N)}{(1 - p/N)^2}
\text{AIC} = \frac{p}{N} - \frac{1}{N} L(\alpha_N)

C_p = \frac{\widehat{\text{MISE}}_{\text{emp}}(\alpha_N)}{\hat{\sigma}_w^2} + 2p - N

where \hat{\sigma}_w^2 is an estimate of the variance of the noise,
the Predicted Squared Error (PSE)

\text{PSE} = \widehat{\text{MISE}}_{\text{emp}}(\alpha_N) + 2 \hat{\sigma}_w^2 \frac{p}{N}

where \hat{\sigma}_w^2 is an estimate of the variance of the noise.
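A hedged sketch of how these criteria behave (NumPy assumed; the data, noise level, and polynomial family are illustrative): the empirical MISE always decreases with the degree, while FPE, GCV, and PSE penalize the p = n + 1 free parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
x = np.linspace(-2, 2, N)
y = 1 + 2 * x + rng.normal(scale=0.5, size=N)   # true model is linear

for degree in (1, 3, 9):
    p = degree + 1
    coef = np.polyfit(x, y, degree)
    mise_emp = np.mean((y - np.polyval(coef, x)) ** 2)
    fpe = mise_emp * (1 + p / N) / (1 - p / N)
    gcv = mise_emp / (1 - p / N) ** 2
    pse = mise_emp + 2 * 0.25 * p / N   # sigma_w^2 = 0.25 assumed known
    print(degree, round(fpe, 3), round(gcv, 3), round(pse, 3))
```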
\widehat{\text{MISE}}_{\text{CV}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i^{k(i)} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i, \alpha^{k(i)}) \right)^2

where \hat{y}_i^{k(i)} denotes the fitted value for the i-th observation returned by the model estimated with the k(i)-th part of the data removed.
[Diagram: in 10-fold cross-validation each fold uses 90% of the data for training and the remaining 10% for testing.]
\widehat{\text{MISE}}_{\text{LOO}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i^{-i} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i, \alpha^{-i}) \right)^2
[Diagram: model selection — a realization of the stochastic process provides the training set; an inner parametric identification loop fits each structure s = 1, \dots, S; an outer structural identification loop validates them; model selection returns the learned model \alpha_N.]
1. For each structure s = 1, \dots, S and for each j = 1, \dots, N:
(a) perform the parametric identification on the N - 1 samples with the j-th one put aside;
(b) e_j = y_j - f(x_j, \alpha^s_{N-1})

\widehat{\text{MISE}}_{\text{LOO}}(s) = \frac{1}{N} \sum_{j=1}^{N} e_j^2

2. Model selection: \bar{s} = \arg\min_{s=1,\dots,S} \widehat{\text{MISE}}_{\text{LOO}}(s)

3. Final parametric identification within the structure \bar{s}:

\alpha_N = \arg\min_{\alpha} \sum_{i=1}^{N} \left( y_i - f(x_i, \alpha) \right)^2

4. The model with complexity \bar{s}, trained on the whole dataset D_N, is used for future predictions.
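A minimal sketch of this procedure (NumPy assumed; the polynomial structures and the mise_loo helper are illustrative): leave-one-out validation over the structures, followed by a final fit of the selected structure on all the data.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 40
x = np.linspace(-2, 2, N)
y = x ** 3 - x + rng.normal(scale=0.3, size=N)

def mise_loo(degree):
    errs = []
    for j in range(N):                       # put the j-th sample aside
        keep = np.arange(N) != j
        coef = np.polyfit(x[keep], y[keep], degree)
        errs.append((y[j] - np.polyval(coef, x[j])) ** 2)
    return np.mean(errs)

S = 6
best_s = min(range(1, S + 1), key=mise_loo)  # model selection
final = np.polyfit(x, y, best_s)             # final parametric identification
print(best_s, final)                         # typically selects degree 3
```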
[Figure: fraction of points falling within a neighborhood of radius r, as a function of r, for input dimension n = 1, 2, 3, 100.]
The size of the neighborhood on which we can estimate local features of the
output (e.g. E[y|x]) increases with dimension n, making the estimation
coarser and coarser.
[Figure: underfitting vs. overfitting — bias and variance as a function of 1/bandwidth.]
[Diagram: two ways to compute the leave-one-out error — left: a single parametric identification on N samples followed by the PRESS statistic; right: N repetitions of parametric identification on N - 1 samples, each tested on the j-th sample put aside.]
The leave-one-out error can be computed in two equivalent ways: the slowest way (on the right), which repeats the training and test procedure N times, and the fastest way (on the left), which performs the parametric identification only once and then computes the PRESS statistic.
H = X (X^T X)^{-1} X^T

e_j^{\text{loo}} = \frac{e_j}{1 - H_{jj}}
Note that PRESS is not an approximation of the loo error but simply a faster
way of computing it.
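This identity can be checked numerically; in the sketch below (NumPy assumed) the PRESS residuals e_j / (1 - H_jj) coincide, up to floating-point error, with N explicit leave-one-out refits.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 30, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
e = y - H @ y                            # ordinary residuals
press = e / (1 - np.diag(H))             # PRESS leave-one-out residuals

loo = np.empty(N)
for j in range(N):                       # slow way: N refits
    keep = np.arange(N) != j
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo[j] = y[j] - X[j] @ beta

assert np.allclose(press, loo)           # identical, not an approximation
```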
\hat{y}_q(k) = x_q^T \hat{\beta}(k)

\hat{y}_q = x_q^T \hat{\beta}(\bar{k}), \qquad \text{with } \bar{k} = \arg\min_k \widehat{\text{MISE}}_{\text{LOO}}(k)

\hat{y}_q = \frac{\sum_{i=1}^{b} \zeta_i \, \hat{y}_q(k_i)}{\sum_{i=1}^{b} \zeta_i},

where the weights are the inverse of the mean square errors: \zeta_i = 1 / \widehat{\text{MISE}}_{\text{LOO}}(k_i).
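A hedged sketch of this combination for a local constant (k-nearest-neighbor average) model, where the leave-one-out residual of a mean has a closed form (NumPy assumed; lazy_predict is an illustrative helper, not the original Lazy Learning code):

```python
import numpy as np

def lazy_predict(X, y, xq, ks):
    """Combine k-NN averages for several k, weighted by 1/MISE_loo(k)."""
    d = np.linalg.norm(X - xq, axis=1)
    order = np.argsort(d)
    preds, weights = [], []
    for k in ks:                                 # k >= 2 assumed
        idx = order[:k]
        preds.append(y[idx].mean())              # local constant prediction
        # loo residual of a k-point mean: (y_j - mean) * k / (k - 1)
        loo = (y[idx] - y[idx].mean()) * k / (k - 1)
        weights.append(1.0 / np.mean(loo ** 2))  # zeta = 1 / MISE_loo
    preds, weights = np.array(preds), np.array(weights)
    return np.sum(weights * preds) / np.sum(weights)

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
print(lazy_predict(X, y, np.array([1.0]), ks=(3, 5, 9)))  # approx sin(1)
```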
[Diagram: one-step predictor — the delayed values of the series, obtained through unit-delay blocks z^{-1}, feed the approximator f.]
The approximator f returns the prediction of the value of the time series at time t + 1 as a function of the n previous values (the rectangular box containing z^{-1} represents a unit delay operator, i.e., \varphi_{t-1} = z^{-1} \varphi_t).
[Figure: nearest-neighbor one-step prediction.]
We want to predict at time t - 1 the next value of the series y of order n = 6. The pattern {y_{t-16}, y_{t-15}, \dots, y_{t-11}} is the most similar to the pattern {y_{t-6}, y_{t-5}, \dots, y_{t-1}}; then the prediction \hat{y}_t = y_{t-10} is returned.
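A minimal sketch of this pattern-matching predictor (NumPy assumed; nn_one_step is an illustrative name): find the past window most similar to the last n values and return the value that followed it.

```python
import numpy as np

def nn_one_step(series, n):
    """Predict the next value via the nearest past n-length pattern."""
    query = series[-n:]                          # last n observed values
    best_d, best_t = np.inf, None
    for t in range(n, len(series) - n):          # candidate past windows,
        d = np.linalg.norm(series[t - n:t] - query)  # query itself excluded
        if d < best_d:
            best_d, best_t = d, t
    return series[best_t]                        # value after best match

x = np.sin(np.arange(300) * 0.3)
print(nn_one_step(x, n=6), np.sin(300 * 0.3))    # close to the true next value
```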
[Diagram: iterated predictor — the output of f is fed back through the unit-delay blocks z^{-1} so that f can be applied recursively.]
The approximator f returns the prediction of the value of the time series at time t + 1 by iterating the predictions obtained in the previous steps (the rectangular box containing z^{-1} represents a unit delay operator, i.e., \varphi_{t-1} = z^{-1} \varphi_t).
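A minimal sketch of the iterated strategy (NumPy assumed; the coefficients could come from the hypothetical fit_ar_ls sketch above): each one-step prediction is fed back as an input for the next step.

```python
import numpy as np

def iterate_forecast(series, a, H):
    """H-step forecast by iterating a one-step AR predictor; a[0] is
    the coefficient of the most recent value (lag 1)."""
    window = list(series[-len(a):])          # last n observed values
    preds = []
    for _ in range(H):
        nxt = sum(c * v for c, v in zip(a, reversed(window)))
        preds.append(nxt)                    # prediction at the next step...
        window = window[1:] + [nxt]          # ...fed back as an input
    return np.array(preds)
```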
[Figure: panels a) and b) contrasting two multi-step prediction strategies over a five-step horizon.]
[Figure: the A chaotic time series.]
The A chaotic time series has a training set of 1000 values: the task is to
predict the continuation for 100 steps, starting from different points.
One-step assessment criterion
[Figure: two 100-step continuations of the A series used for the one-step assessment.]
[\varphi_{t+H-1}, \dots, \varphi_{t}] = F(\varphi_{t-1}, \dots, \varphi_{t-n}) + \mathbf{w}
[Diagram: three multi-step strategies for predicting \varphi_{t+1}, \varphi_{t+2}, \varphi_{t+3} from \varphi_{t-1} and \varphi_t.]
This quantity is smaller than one if the predictor performs better than the naivest predictor, i.e., the average.
Other measures rely on relative or percentage error

pe_{t+h} = 100 \, \frac{\varphi_{t+h} - \hat{\varphi}_{t+h}}{\varphi_{t+h}}

like

\text{MAPE} = \frac{\sum_{h=1}^{H} |pe_{t+h}|}{H}
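A minimal sketch of this measure (NumPy assumed):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error over the horizon H."""
    pe = 100 * (actual - predicted) / actual
    return np.mean(np.abs(pe))

print(mape(np.array([100., 200., 400.]),
           np.array([110., 190., 400.])))   # (10 + 5 + 0) / 3 = 5.0
```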
Applications in my lab
Side-channel attack
[Figure: temperature (°C) recorded by a sensor over 20 hours.]
[Diagram: the metric trades off model error, model complexity, and communication costs.]
\text{AR}(p): \quad s_i[t] = \sum_{j=1}^{p} a_j \, s_i[t - j]
D(Q_j, T) = \frac{1}{N - n + 1} \sum_{t=n}^{N} \left( f_{Q_j}\left( T^{(t-1)}, T^{(t-2)}, \dots, T^{(t-n+1)} \right) - T^{(t)} \right)^2

\hat{Q} = \arg\min_{Q_j \in [0,\, 2^{k-1}]} D(Q_j, T)
Conclusions
Highly recommended!