
Time Series Forecasting With
Feed-Forward Neural Networks:
Guidelines And Limitations

Eric Plummer
Computer Science Department
University of Wyoming
February 15, 2010
Topics
• Thesis Goals
• Time Series Forecasting
• Neural Networks
• K-Nearest-Neighbor
• Test-Bed Application
• Empirical Evaluation
• Data Preprocessing
• Contributions
• Future Work
• Conclusion
• Demonstration



Thesis Goals

• Compare neural networks and k-nearest-neighbor for time series forecasting
• Analyze the response of various configurations to data series with specific characteristics
• Identify when neural networks and k-nearest-neighbor are inadequate
• Evaluate the effectiveness of data preprocessing



Time Series Forecasting –
Description
• What is it?
– Given an existing data series, observe or model the
data series to make accurate forecasts
• Example data series
– Financial (e.g., stocks, rates)
– Physically observed (e.g., weather, sunspots)
– Mathematical (e.g., Fibonacci sequence)



Time Series Forecasting –
Difficulties
• Why is it difficult?
– Limited quantity of data
• Observed data series sometimes too short to partition
– Noise
• Erroneous data points
• Obscuring component
– Moving Average
– Nonstationarity
• Fundamentals change over time
• Nonstationary mean: “Ascending” data series
– First-difference preprocessing
– Forecasting method selection
• Statistics
• Artificial intelligence



Time Series Forecasting –
Importance
• Why is it important?
– Preventing undesirable events by forecasting the
event, identifying the circumstances preceding the
event, and taking corrective action so the event can be
avoided (e.g., inflationary economic period)
– Forecasting undesirable, yet unavoidable, events to
preemptively lessen their impact (e.g., solar maximum
w/ sunspots)
– Profiting from forecasting (e.g., financial markets)



Neural Networks –
Background
• Loosely based on the human brain’s neuron structure
• Timeline
– 1940s – McCulloch and Pitts – proposed neuron models in the form of binary threshold devices and stochastic algorithms
– 1950s & 1960s – Rosenblatt – class of learning machines called perceptrons
– Late 1960s – Minsky and Papert – discouraging analysis of perceptrons (limited to linearly separable classes)
– 1980s – Rumelhart, Hinton, and Williams – generalized delta rule for learning by back-propagation for training multilayer perceptrons
– Present – many new training algorithms and architectures, but nothing “revolutionary”



Neural Networks –
Architecture
• A feed-forward neural
network can have any
number of:
– Layers
– Units per layer
– Network inputs
– Network outputs
• Hidden layers (A, B)
• Output layer (C)



Neural Networks –
Units
• A unit has:
– Connections
– Weights
– Bias
– Activation function
• Weights and bias are randomly initialized
before training
• Unit’s input consists of:
– The sum of the products of each connection value and its associated weight
– Plus the bias
• Input is then fed into unit’s activation
function
• Unit’s output is the output of activation
function
– Hidden layers: Sigmoid
– Output layer: Linear
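
As a concrete illustration, a single unit’s computation might look like the following Python/NumPy sketch (illustrative only, not FORECASTER’s actual C++ code):

import numpy as np

def sigmoid(x):
    # Hidden-layer activation: squashes the net input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def unit_output(inputs, weights, bias, hidden=True):
    # Unit input: the sum of each connection value times its weight, plus the bias
    net = np.dot(inputs, weights) + bias
    # Hidden units apply the sigmoid; output units are linear
    return sigmoid(net) if hidden else net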



Neural Networks –
Training
• Partition data series into:
– Training set
– Validation set (optional)
– Test set (optional)
• Typically, the training procedure is:
– Perform backpropagation training with training set
– After n epochs, compute total squared error on training set
and validation set
– If the validation error consistently rises while the training error falls, stop training
• Overfitting: Training set learned too well
• Generalization: Given inputs not in training and validation sets,
able to accurately forecast
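
One common way to implement this stopping rule is a patience check, sketched below; train_one_epoch and validation_error are hypothetical callbacks standing in for the network being trained, and the patience logic is just one reading of “consistently”:

def train_with_early_stopping(train_one_epoch, validation_error,
                              n=10, patience=3, max_epochs=10000):
    # Every n epochs, recheck the validation error; stop once it has
    # failed to improve for `patience` consecutive checks: a sign the
    # training set is being learned too well (overfitting)
    best_val, strikes, epoch = float('inf'), 0, 0
    while epoch < max_epochs and strikes < patience:
        for _ in range(n):
            train_one_epoch()
        epoch += n
        val_err = validation_error()
        if val_err < best_val:
            best_val, strikes = val_err, 0
        else:
            strikes += 1
    return epoch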
Neural Networks –
Training
• Backpropagation training:
– First, examples in the form of <input, output> pairs are
extracted from the data series
– Then, the network is trained with backpropagation on the
examples:
1. Present an example’s input vector to the network inputs and
run the network sequentially forward
2. Propagate the error sequentially backward from the output layer
3. For every connection, change the weight modifying that
connection in proportion to the error
– When all three steps have been performed for all examples,
one epoch has occurred
– Goal is to converge to a near-optimal solution based on the
total squared error
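
A minimal NumPy sketch of one such epoch, for a network with a single sigmoid hidden layer and a linear output layer, following the unit-output, error, and weight-change formulas given at the end of this deck (a sketch under those assumptions, not the FORECASTER implementation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(examples, W1, b1, W2, b2, alpha=0.01):
    # One epoch: all three steps performed for every <input, output> example
    for x, d in examples:
        # 1. Run the network sequentially forward
        h = sigmoid(W1 @ x + b1)                        # hidden layer (sigmoid)
        o = W2 @ h + b2                                 # output layer (linear)
        # 2. Propagate the error sequentially backward
        delta_out = d - o                               # h'(x) = 1 for linear units
        delta_hid = h * (1.0 - h) * (W2.T @ delta_out)  # sigmoid derivative
        # 3. Change each weight in proportion to the error: dw = alpha * delta * O_p
        W2 += alpha * np.outer(delta_out, h)
        b2 += alpha * delta_out
        W1 += alpha * np.outer(delta_hid, x)
        b1 += alpha * delta_hid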



Neural Networks –
Training
[Figure: Backpropagation training cycle]



Neural Networks –
Forecasting
• Forecasting method
depends on examples
• Examples depend on
step-ahead size

– If step-ahead size is one: Iterative forecasting
– If step-ahead size is greater than one: Direct forecasting
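
For example, examples might be extracted from a series as follows (an illustrative Python sketch; the window and step-ahead sizes are parameters):

def make_examples(series, window, step_ahead=1):
    # Pair each run of `window` consecutive values with the value
    # `step_ahead` points past the end of that run
    examples = []
    for i in range(len(series) - window - step_ahead + 1):
        examples.append((series[i:i + window],
                         series[i + window + step_ahead - 1]))
    return examples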



Neural Networks –
Forecasting
[Figure: Iterative forecasting – can continue this indefinitely]
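
A sketch of the iterative scheme in Python, where predict stands in for any trained one-step-ahead model (a hypothetical callable, not part of FORECASTER):

def iterative_forecast(predict, history, window, steps):
    # Each forecast is appended to the series and fed back in as an
    # input for the next forecast; this can continue indefinitely
    data = list(history)
    for _ in range(steps):
        data.append(predict(data[-window:]))
    return data[len(history):]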



Neural Networks –
Forecasting
[Figure: Directly forecasting n steps – this is the only forecast]



K-Nearest-Neighbor –
Forecasting
• No model to train
• Simple linear search
• Compare reference to candidates
• Select k candidates with lowest error
• Forecast is average of k next values
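
The whole procedure fits in a few lines; a minimal NumPy sketch (illustrative, not FORECASTER’s code):

import numpy as np

def knn_forecast(series, window, k):
    # Compare the trailing reference window against every earlier
    # candidate window via a simple linear search
    series = np.asarray(series, dtype=float)
    reference = series[-window:]
    errors, next_values = [], []
    for i in range(len(series) - window):   # each candidate must have a next value
        candidate = series[i:i + window]
        errors.append(np.sum((candidate - reference) ** 2))
        next_values.append(series[i + window])
    # Select the k candidates with lowest error; the forecast is the
    # average of their next values
    best = np.argsort(errors)[:k]
    return float(np.mean(np.asarray(next_values)[best]))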



Test-Bed Application –
FORECASTER
• Written in Visual C++ with MFC
• Object-oriented
• Multithreaded
• Wizard-based
• Easily modified
• Implements feed-forward neural networks & k-nearest-neighbor
• Used for time series forecasting
• Eventually will be upgraded for classification
problems



Empirical Evaluation – Data Series

[Charts: the Original series with its Less Noisy, More Noisy, and Ascending variants (Value vs. Data Point), and Sunspots 1784-1983 (Count vs. Year)]

Empirical Evaluation –
Neural Network Architectures
• Number of network inputs based on data series
• Need to make unambiguous examples
• For “sawtooths”:
– 24 inputs are necessary
– Test networks with 25 & 35 inputs
– Test networks with 1 hidden layer with 2, 10, & 20 hidden layer units
– One output layer unit
• For sunspots:
– 30 inputs
– 1 hidden layer with 30 units
• For real-world data series, selection may be trial-and-error!


Empirical Evaluation –
Neural Network Training
• Heuristic method:
– Start with aggressive learning rate
– Gradually lower learning rate as validation error increases
– Stop training when learning rate cannot be lowered anymore (see the sketch below)
• Simple method:
– Use conservative learning rate
– Training stops when:
• Number of training epochs equals the epochs limit, or
• Training error is less than or equal to error limit
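
A sketch of the heuristic schedule (the callbacks, decay factor, and check interval are hypothetical, illustrative choices):

def heuristic_train(train_one_epoch, validation_error, lr=0.5,
                    lr_min=1e-4, decay=0.5, n=10, max_epochs=10000):
    # Start aggressive; lower the learning rate whenever the validation
    # error rises; stop once the rate cannot be lowered any further
    best_val, epoch = float('inf'), 0
    while lr >= lr_min and epoch < max_epochs:
        for _ in range(n):
            train_one_epoch(lr)
        epoch += n
        val_err = validation_error()
        if val_err > best_val:
            lr *= decay
        best_val = min(best_val, val_err)
    return lr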



Empirical Evaluation –
Neural Network Forecasting
• Metric to compare forecasts: Coefficient of Determination (see the sketch below)
– Value may be (-∞, 1]
– Want value between 0 and 1, where 0 is forecasting the mean of the data series and 1 is forecasting the actual value
– Must have actual values to compare with forecasted values
• For networks trained on original, less noisy, and more noisy data series, forecast will be compared to original series
• For networks trained on ascending data series, forecast will be compared to continuation of ascending series
• For networks trained on sunspots data series, forecast will be compared to test set
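
Computing the metric is straightforward; a minimal sketch matching the r² formula on the “Forecast Error Formulas” slide:

import numpy as np

def coefficient_of_determination(actual, forecast):
    # 1 = forecasting the actual values; 0 = no better than forecasting
    # the series mean; negative = worse than the mean (range (-inf, 1])
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    ss_residual = np.sum((actual - forecast) ** 2)
    ss_mean = np.sum((actual - actual.mean()) ** 2)  # undefined for a constant series
    return 1.0 - ss_residual / ss_mean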



Empirical Evaluation –
K-Nearest-Neighbor
• Choosing window size analogous to choosing
number of neural network inputs
• For sawtooth data series:
– k=2
– Test window sizes of 20, 24, and 30
• For sunspots data series:
– k=3
– Window size of 10
• Compare forecasts via coefficient of determination



Empirical Evaluation –
Candidate Selection
• Neural networks
– For each training method, data series, and
architecture, 3 candidates were trained
– Also, average of 3 candidates’ forecasts was taken:
forecasting by committee
– Best forecast was selected based on coefficient of
determination
• K-nearest-neighbor
– For each data series, k, and window size, only one
search was performed (only one needed)



Empirical Evaluation – Original Data Series

[Charts: “Nets Trained on Original” for the Heuristic NN and Simple NN (Original; 35,2; 35,10; 35,20), “Nets Trained on Original” for smaller nets (Original; 25,10; 25,20), and “K-Nearest-Neighbor on Original” (Original; 2,20; 2,24; 2,30); Value vs. Data Point]

Empirical Evaluation – Less Noisy Data Series

[Charts: “Nets Trained on Less Noisy” for the Heuristic NN and Simple NN (Original; 35,2; 35,10; 35,20) and “K-Nearest-Neighbor on Less Noisy” (Original; 2,20; 2,24; 2,30); Value vs. Data Point]

Empirical Evaluation – More Noisy Data Series

[Charts: “Nets Trained on More Noisy” for the Heuristic NN and Simple NN (Original; 35,10; 35,20) and “K-Nearest-Neighbor on More Noisy” (Original; 2,20; 2,24; 2,30); Value vs. Data Point]

Empirical Evaluation – Ascending Data Series

[Charts: “Nets Trained on Ascending” for the Heuristic NN (Ascending; 35,10; 35,20) and Simple NN (Ascending; 35,2; 35,10; 35,20); Value vs. Data Point]

Empirical Evaluation – Longer Forecast
[Charts: Heuristic NN “Nets Trained on Less Noisy (Longer Forecast)” (Original; 35,2; 35,10; 35,20) and “Nets Trained on More Noisy (Longer Forecast)” (Original; 35,10; 35,20); Value vs. Data Point]

Empirical Evaluation – Sunspots Data Series

[Chart: Simple NN & K-N-N, “Sunspots 1950-1983” – Test Set, 30,30 Neural Net, and 3,10 K-Nearest-Neighbor; Count vs. Year]

Empirical Evaluation –
Discussion
• Heuristic training method observations:
– Networks train longer (more epochs) on smoother data series like
the original and ascending data series
– The total squared error and unscaled error are higher for noisy data
series
– Neither the number of epochs nor the errors appear to correlate
well with the coefficient of determination
– In most cases, the committee forecast is worse than the best
candidate's forecast
• When actual values are unavailable, choosing the best candidate is
difficult!



Empirical Evaluation –
Discussion
• Simple training method observations:
– The total squared error and unscaled error are higher for noisy data
series with the exception of the 35:10:1 network trained on the
more noisy data series
– The errors do not appear to correlate well with the coefficient of
determination
– In most cases, the committee forecast is worse than the best
candidate's forecast
– There are four networks whose coefficient of determination is
negative, compared with two for the heuristic training method
[Charts: “Coefficient of Determination Comparison” for the 35,2; 35,10; and 35,20 networks under each training method, across the Original, Less Noisy, More Noisy, and Ascending data series]



Empirical Evaluation –
Discussion
• General observations:
– One training method did not appear to be clearly better
– Increasingly noisy data series increasingly degraded the forecasting
performance
– Nonstationarity in the mean degraded the performance
– Networks with too few hidden units (e.g., 35:2:1) forecasted well on simpler data series, but failed on more complex ones
– Excessive numbers of hidden units (e.g., 35:20:1) did not hurt performance
– Twenty-five network inputs was not sufficient
– K-nearest-neighbor was consistently better than the neural networks
– Feed-forward neural networks are extremely sensitive to architecture
and parameter choices, and making such choices is currently more
art than science, more trial-and-error than absolute, more practice
than theory!



Data Preprocessing

• First-difference
– For ascending data series, a neural network trained on first-
difference can forecast near perfectly
– In that case, it is better to train and forecast on first-
difference
– FORECASTER reconstitutes forecast from its first-difference
• Moving average
– For noisy data series, moving average would eliminate much
of the noise
– But would also smooth out peaks and valleys
– Series may then be easier to learn and forecast
– But in some series, the “noise” may be important data (e.g.,
utility load forecasting)
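
Both transforms are simple to express; a minimal NumPy sketch of first-difference (with reconstitution) and moving average (FORECASTER’s own implementation may differ):

import numpy as np

def first_difference(series):
    # Differencing removes a trending (nonstationary) mean
    return np.diff(series)

def reconstitute(first_value, diffs):
    # Rebuild the original scale from a forecast made on first-differences
    return first_value + np.cumsum(diffs)

def moving_average(series, width):
    # Smooths out noise, but also flattens genuine peaks and valleys
    kernel = np.ones(width) / width
    return np.convolve(series, kernel, mode='valid')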



Contributions

• Filled a void in the feed-forward neural network time series forecasting literature: knowing how networks respond to various data series characteristics in a controlled environment
• Showed that k-nearest-neighbor is a better forecasting method
for the data series used in this research
• Reaffirmed that neural networks are very sensitive to
architecture, parameter, and learning method changes
• Presented some insight into neural network architecture
selection: selecting number of network inputs based on data
series
• Presented a neural network training heuristic that produced
good results



Future Work

• Upgrade FORECASTER to work with classification problems
• Add more complex network types, including wavelet networks for time series forecasting
• Investigate k-nearest-neighbor further
• Add other forecasting methods (e.g., decision trees for classification)



Conclusion

• Presented:
– Time series forecasting
– Neural networks
– K-nearest-neighbor
– Empirical evaluation
• Learned a lot about the implementation details of the
forecasting techniques
• Learned a lot about MFC programming



Demonstration

Various files can be found at:
http://w3.uwyo.edu/~eplummer



Unit Output, Error, and Weight
Change Formulas

Hidden unit output: $O_c = h_{\mathrm{Hidden}}\left(\sum_{p=1}^{P} i_{c,p} w_{c,p} + b_c\right)$ where $h_{\mathrm{Hidden}}(x) = \dfrac{1}{1 + e^{-x}}$

Output unit output: $O_c = h_{\mathrm{Output}}\left(\sum_{p=1}^{P} i_{c,p} w_{c,p} + b_c\right)$ where $h_{\mathrm{Output}}(x) = x$

Output layer error: $\delta_c = h'_{\mathrm{Output}}(x)\,(D_c - O_c)$

Hidden layer error: $\delta_c = h'_{\mathrm{Hidden}}(x) \sum_{n=1}^{N} \delta_n w_{n,c}$

Weight change: $\Delta w_{c,p} = \alpha\, \delta_c\, O_p$
Forecast Error Formulas

Total squared error: $E_C = \frac{1}{2} \sum_{c=1}^{C} (D_c - O_c)^2$

Unscaled error: $UE_C = \sum_{c=1}^{C} \left| UD_c - UO_c \right|$

Coefficient of determination:

$r^2 = 1 - \dfrac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$r^2 \begin{cases} = 1 & \text{if } \hat{x}_i = x_i \;\forall i \\ \in (0, 1) & \text{if } \hat{x}_i \text{ is a better forecast than } \bar{x} \\ = 0 & \text{if generally } \hat{x}_i = \bar{x} \\ < 0 & \text{if } \hat{x}_i \text{ is a worse forecast than } \bar{x} \end{cases}$
Related Work

• Drossu and Obradovic (1996): hybrid stochastic and neural network approach to time series forecasting
• Zhang and Thearling (1994): parallel implementations of neural networks and memory-based reasoning
• Geva (1998): multiscale fast wavelet transform and an array of feed-forward neural networks
• Lawrence, Tsoi, and Giles (1996): encodes the series with a self-organizing map and uses recurrent neural networks
• Kingdon (1997): automated intelligent system for financial forecasting using neural networks and genetic algorithms
