
Time Series Forecasting With
Feed-Forward Neural Networks:
Guidelines And Limitations

Eric Plummer
Computer Science Department
University of Wyoming
February 15, 2010
Topics
• Thesis Goals
• Time Series Forecasting
• Neural Networks
• K-Nearest-Neighbor
• Test-Bed Application
• Empirical Evaluation
• Data Preprocessing
• Contributions
• Future Work
• Conclusion
• Demonstration



Thesis Goals

• Compare neural networks and k-nearest-neighbor for time series forecasting
• Analyze the response of various configurations to data series with specific characteristics
• Identify when neural networks and k-nearest-neighbor are inadequate
• Evaluate the effectiveness of data preprocessing



Time Series Forecasting –
Description
• What is it?
– Given an existing data series, observe or model the
data series to make accurate forecasts
• Example data series
– Financial (e.g., stocks, rates)
– Physically observed (e.g., weather, sunspots)
– Mathematical (e.g., Fibonacci sequence)



Time Series Forecasting –
Difficulties
• Why is it difficult?
– Limited quantity of data
• Observed data series sometimes too short to partition
– Noise
• Erroneous data points
• Obscuring component
– Moving Average
– Nonstationarity
• Fundamentals change over time
• Nonstationary mean: “Ascending” data series
– First-difference preprocessing
– Forecasting method selection
• Statistics
• Artificial intelligence



Time Series Forecasting –
Importance
• Why is it important?
– Preventing undesirable events by forecasting the
event, identifying the circumstances preceding the
event, and taking corrective action so the event can be
avoided (e.g., inflationary economic period)
– Forecasting undesirable, yet unavoidable, events to
preemptively lessen their impact (e.g., solar maximum
w/ sunspots)
– Profiting from forecasting (e.g., financial markets)



Neural Networks –
Background
• Loosely based on the human brain’s neuron structure
• Timeline
– 1940s – McCulloch and Pitts – proposed neuron models in the form of binary threshold devices and stochastic algorithms
– 1950s & 1960s – Rosenblatt – class of learning machines called perceptrons
– Late 1960s – Minsky and Papert – discouraging analysis of perceptrons (limited to linearly separable classes)
– 1980s – Rumelhart, Hinton, and Williams – generalized delta rule for learning by back-propagation for training multilayer perceptrons
– Present – many new training algorithms and architectures, but nothing “revolutionary”



Neural Networks –
Architecture
• A feed-forward neural
network can have any
number of:
– Layers
– Units per layer
– Network inputs
– Network outputs
• Hidden layers (A, B)
• Output layer (C)



Neural Networks –
Units
• A unit has:
– Connections
– Weights
– Bias
– Activation function
• Weights and bias are randomly initialized
before training
• Unit’s input consists of:
– The sum of the products of each connection value and its associated weight
– Plus the bias
• Input is then fed into unit’s activation
function
• Unit’s output is the output of activation
function
– Hidden layers: Sigmoid
– Output layer: Linear
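
As a concrete illustration, a single unit’s computation might look like the following Python/NumPy sketch (illustrative only, not FORECASTER’s actual C++ code):

import numpy as np

def sigmoid(x):
    # Hidden-layer activation: squashes the net input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def unit_output(inputs, weights, bias, hidden=True):
    # Unit input: the sum of each connection value times its weight, plus the bias
    net = np.dot(inputs, weights) + bias
    # Hidden units apply the sigmoid; output units are linear
    return sigmoid(net) if hidden else net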



Neural Networks –
Training
• Partition data series into:
– Training set
– Validation set (optional)
– Test set (optional)
• Typically, the training procedure is:
– Perform backpropagation training with training set
– After n epochs, compute total squared error on training set
and validation set
– If the validation error consistently rises while the training error falls, stop training
• Overfitting: Training set learned too well
• Generalization: Given inputs not in training and validation sets,
able to accurately forecast
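
One common way to implement this stopping rule is a patience check, sketched below; train_one_epoch and validation_error are hypothetical callbacks standing in for the network being trained, and the patience logic is just one reading of “consistently”:

def train_with_early_stopping(train_one_epoch, validation_error,
                              n=10, patience=3, max_epochs=10000):
    # Every n epochs, recheck the validation error; stop once it has
    # failed to improve for `patience` consecutive checks: a sign the
    # training set is being learned too well (overfitting)
    best_val, strikes, epoch = float('inf'), 0, 0
    while epoch < max_epochs and strikes < patience:
        for _ in range(n):
            train_one_epoch()
        epoch += n
        val_err = validation_error()
        if val_err < best_val:
            best_val, strikes = val_err, 0
        else:
            strikes += 1
    return epoch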
Neural Networks –
Training
• Backpropagation training:
– First, examples in the form of <input, output> pairs are
extracted from the data series
– Then, the network is trained with backpropagation on the
examples:
1. Present an example’s input vector to the network inputs and
run the network sequentially forward
2. Propagate the error sequentially backward from the output layer
3. For every connection, change the weight modifying that
connection in proportion to the error
– When all three steps have been performed for all examples,
one epoch has occurred
– Goal is to converge to a near-optimal solution based on the
total squared error
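
A minimal NumPy sketch of one such epoch, for a network with a single sigmoid hidden layer and a linear output layer, following the unit-output, error, and weight-change formulas given at the end of this deck (a sketch under those assumptions, not the FORECASTER implementation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(examples, W1, b1, W2, b2, alpha=0.01):
    # One epoch: all three steps performed for every <input, output> example
    for x, d in examples:
        # 1. Run the network sequentially forward
        h = sigmoid(W1 @ x + b1)                        # hidden layer (sigmoid)
        o = W2 @ h + b2                                 # output layer (linear)
        # 2. Propagate the error sequentially backward
        delta_out = d - o                               # h'(x) = 1 for linear units
        delta_hid = h * (1.0 - h) * (W2.T @ delta_out)  # sigmoid derivative
        # 3. Change each weight in proportion to the error: dw = alpha * delta * O_p
        W2 += alpha * np.outer(delta_out, h)
        b2 += alpha * delta_out
        W1 += alpha * np.outer(delta_hid, x)
        b1 += alpha * delta_hid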



Neural Networks –
Training
[Figure: Backpropagation training cycle]



Neural Networks –
Forecasting
• Forecasting method
depends on examples
• Examples depend on
step-ahead size

– If step-ahead size is one: Iterative forecasting
– If step-ahead size is greater than one: Direct forecasting
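
For example, examples might be extracted from a series as follows (an illustrative Python sketch; the window and step-ahead sizes are parameters):

def make_examples(series, window, step_ahead=1):
    # Pair each run of `window` consecutive values with the value
    # `step_ahead` points past the end of that run
    examples = []
    for i in range(len(series) - window - step_ahead + 1):
        examples.append((series[i:i + window],
                         series[i + window + step_ahead - 1]))
    return examples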



Neural Networks –
Forecasting
[Figure: Iterative forecasting – can continue this indefinitely]
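
A sketch of the iterative scheme in Python, where predict stands in for any trained one-step-ahead model (a hypothetical callable, not part of FORECASTER):

def iterative_forecast(predict, history, window, steps):
    # Each forecast is appended to the series and fed back in as an
    # input for the next forecast; this can continue indefinitely
    data = list(history)
    for _ in range(steps):
        data.append(predict(data[-window:]))
    return data[len(history):]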



Neural Networks –
Forecasting
[Figure: Directly forecasting n steps – this is the only forecast]



K-Nearest-Neighbor –
Forecasting
• No model to train
• Simple linear search
• Compare reference to candidates
• Select k candidates with lowest error
• Forecast is average of k next values
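
The whole procedure fits in a few lines; a minimal NumPy sketch (illustrative, not FORECASTER’s code):

import numpy as np

def knn_forecast(series, window, k):
    # Compare the trailing reference window against every earlier
    # candidate window via a simple linear search
    series = np.asarray(series, dtype=float)
    reference = series[-window:]
    errors, next_values = [], []
    for i in range(len(series) - window):   # each candidate must have a next value
        candidate = series[i:i + window]
        errors.append(np.sum((candidate - reference) ** 2))
        next_values.append(series[i + window])
    # Select the k candidates with lowest error; the forecast is the
    # average of their next values
    best = np.argsort(errors)[:k]
    return float(np.mean(np.asarray(next_values)[best]))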



Test-Bed Application –
FORECASTER
• Written in Visual C++ with MFC
• Object-oriented
• Multithreaded
• Wizard-based
• Easily modified
• Implements feed-forward neural networks & k-nearest-neighbor
• Used for time series forecasting
• Eventually will be upgraded for classification
problems



Empirical Evaluation – Data Series

[Charts: the Original series with its Less Noisy, More Noisy, and Ascending variants (Value vs. Data Point), and Sunspots 1784-1983 (Count vs. Year)]

Empirical Evaluation –
Neural Network Architectures
• Number of network inputs based on data series
• Need to make unambiguous examples
• For “sawtooths”:
– 24 inputs are necessary
– Test networks with 25 & 35 inputs
– Test networks with 1 hidden layer with 2, 10, & 20 hidden layer units
– One output layer unit
• For sunspots:
– 30 inputs
– 1 hidden layer with 30 units
• For real-world data series, selection may be trial-and-error!


Empirical Evaluation –
Neural Network Training
• Heuristic method:
– Start with aggressive learning rate
– Gradually lower learning rate as validation error increases
– Stop training when learning rate cannot be lowered anymore (see the sketch below)
• Simple method:
– Use conservative learning rate
– Training stops when:
• Number of training epochs equals the epochs limit, or
• Training error is less than or equal to error limit
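
A sketch of the heuristic schedule (the callbacks, decay factor, and check interval are hypothetical, illustrative choices):

def heuristic_train(train_one_epoch, validation_error, lr=0.5,
                    lr_min=1e-4, decay=0.5, n=10, max_epochs=10000):
    # Start aggressive; lower the learning rate whenever the validation
    # error rises; stop once the rate cannot be lowered any further
    best_val, epoch = float('inf'), 0
    while lr >= lr_min and epoch < max_epochs:
        for _ in range(n):
            train_one_epoch(lr)
        epoch += n
        val_err = validation_error()
        if val_err > best_val:
            lr *= decay
        best_val = min(best_val, val_err)
    return lr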



Empirical Evaluation –
Neural Network Forecasting
• Metric to compare forecasts: Coefficient of Determination (see the sketch below)
– Value may be (-∞, 1]
– Want value between 0 and 1, where 0 is forecasting the mean of the data series and 1 is forecasting the actual value
– Must have actual values to compare with forecasted values
• For networks trained on original, less noisy, and more noisy data series, forecast will be compared to original series
• For networks trained on ascending data series, forecast will be compared to continuation of ascending series
• For networks trained on sunspots data series, forecast will be compared to test set
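
Computing the metric is straightforward; a minimal sketch matching the r² formula on the “Forecast Error Formulas” slide:

import numpy as np

def coefficient_of_determination(actual, forecast):
    # 1 = forecasting the actual values; 0 = no better than forecasting
    # the series mean; negative = worse than the mean (range (-inf, 1])
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    ss_residual = np.sum((actual - forecast) ** 2)
    ss_mean = np.sum((actual - actual.mean()) ** 2)  # undefined for a constant series
    return 1.0 - ss_residual / ss_mean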



Empirical Evaluation –
K-Nearest-Neighbor
• Choosing window size analogous to choosing
number of neural network inputs
• For sawtooth data series:
– k=2
– Test window sizes of 20, 24, and 30
• For sunspots data series:
– k=3
– Window size of 10
• Compare forecasts via coefficient of determination



Empirical Evaluation –
Candidate Selection
• Neural networks
– For each training method, data series, and
architecture, 3 candidates were trained
– Also, average of 3 candidates’ forecasts was taken:
forecasting by committee
– Best forecast was selected based on coefficient of
determination
• K-nearest-neighbor
– For each data series, k, and window size, only one
search was performed (only one needed)



Empirical Evaluation – Original Data Series

[Charts: “Nets Trained on Original” for the Heuristic NN and Simple NN (Original; 35,2; 35,10; 35,20), “Nets Trained on Original” for smaller nets (Original; 25,10; 25,20), and “K-Nearest-Neighbor on Original” (Original; 2,20; 2,24; 2,30); Value vs. Data Point]

Empirical Evaluation – Less Noisy Data Series

[Charts: “Nets Trained on Less Noisy” for the Heuristic NN and Simple NN (Original; 35,2; 35,10; 35,20) and “K-Nearest-Neighbor on Less Noisy” (Original; 2,20; 2,24; 2,30); Value vs. Data Point]

Empirical Evaluation – More Noisy Data Series

[Charts: “Nets Trained on More Noisy” for the Heuristic NN and Simple NN (Original; 35,10; 35,20) and “K-Nearest-Neighbor on More Noisy” (Original; 2,20; 2,24; 2,30); Value vs. Data Point]

Empirical Evaluation – Ascending Data Series

[Charts: “Nets Trained on Ascending” for the Heuristic NN (Ascending; 35,10; 35,20) and Simple NN (Ascending; 35,2; 35,10; 35,20); Value vs. Data Point]

Empirical Evaluation – Longer Forecast
[Charts: Heuristic NN “Nets Trained on Less Noisy (Longer Forecast)” (Original; 35,2; 35,10; 35,20) and “Nets Trained on More Noisy (Longer Forecast)” (Original; 35,10; 35,20); Value vs. Data Point]

Empirical Evaluation – Sunspots Data Series

[Chart: Simple NN & K-N-N, “Sunspots 1950-1983” – Test Set, 30,30 Neural Net, and 3,10 K-Nearest-Neighbor; Count vs. Year]

Empirical Evaluation –
Discussion
• Heuristic training method observations:
– Networks train longer (more epochs) on smoother data series like
the original and ascending data series
– The total squared error and unscaled error are higher for noisy data
series
– Neither the number of epochs nor the errors appear to correlate
well with the coefficient of determination
– In most cases, the committee forecast is worse than the best
candidate's forecast
• When actual values are unavailable, choosing the best candidate is
difficult!



Empirical Evaluation –
Discussion
• Simple training method observations:
– The total squared error and unscaled error are higher for noisy data
series with the exception of the 35:10:1 network trained on the
more noisy data series
– The errors do not appear to correlate well with the coefficient of
determination
– In most cases, the committee forecast is worse than the best
candidate's forecast
– There are four networks whose coefficient of determination is
negative, compared with two for the heuristic training method
[Charts: “Coefficient of Determination Comparison” for the 35,2; 35,10; and 35,20 networks under each training method, across the Original, Less Noisy, More Noisy, and Ascending data series]



Empirical Evaluation –
Discussion
• General observations:
– One training method did not appear to be clearly better
– Increasingly noisy data series increasingly degraded the forecasting
performance
– Nonstationarity in the mean degraded the performance
– Networks with too few hidden units (e.g., 35:2:1) forecasted well on simpler data series, but failed on more complex ones
– Excessive numbers of hidden units (e.g., 35:20:1) did not hurt performance
– Twenty-five network inputs was not sufficient
– K-nearest-neighbor was consistently better than the neural networks
– Feed-forward neural networks are extremely sensitive to architecture
and parameter choices, and making such choices is currently more
art than science, more trial-and-error than absolute, more practice
than theory!



Data Preprocessing

• First-difference
– For ascending data series, a neural network trained on first-
difference can forecast near perfectly
– In that case, it is better to train and forecast on first-
difference
– FORECASTER reconstitutes forecast from its first-difference
• Moving average
– For noisy data series, moving average would eliminate much
of the noise
– But would also smooth out peaks and valleys
– Series may then be easier to learn and forecast
– But in some series, the “noise” may be important data (e.g.,
utility load forecasting)
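
Both transforms are simple to express; a minimal NumPy sketch of first-difference (with reconstitution) and moving average (FORECASTER’s own implementation may differ):

import numpy as np

def first_difference(series):
    # Differencing removes a trending (nonstationary) mean
    return np.diff(series)

def reconstitute(first_value, diffs):
    # Rebuild the original scale from a forecast made on first-differences
    return first_value + np.cumsum(diffs)

def moving_average(series, width):
    # Smooths out noise, but also flattens genuine peaks and valleys
    kernel = np.ones(width) / width
    return np.convolve(series, kernel, mode='valid')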



Contributions

• Filled a void in the feed-forward neural network time series forecasting literature: knowing how networks respond to various data series characteristics in a controlled environment
• Showed that k-nearest-neighbor is a better forecasting method
for the data series used in this research
• Reaffirmed that neural networks are very sensitive to
architecture, parameter, and learning method changes
• Presented some insight into neural network architecture
selection: selecting number of network inputs based on data
series
• Presented a neural network training heuristic that produced
good results



Future Work

• Upgrade FORECASTER to work with classification problems
• Add more complex network types, including wavelet networks for time series forecasting
• Investigate k-nearest-neighbor further
• Add other forecasting methods (e.g., decision trees for classification)



Conclusion

• Presented:
– Time series forecasting
– Neural networks
– K-nearest-neighbor
– Empirical evaluation
• Learned a lot about the implementation details of the
forecasting techniques
• Learned a lot about MFC programming



Demonstration

Various files can be found at:
http://w3.uwyo.edu/~eplummer



Unit Output, Error, and Weight
Change Formulas

Hidden unit output: $O_c = h_{\mathrm{Hidden}}\left(\sum_{p=1}^{P} i_{c,p} w_{c,p} + b_c\right)$ where $h_{\mathrm{Hidden}}(x) = \dfrac{1}{1 + e^{-x}}$

Output unit output: $O_c = h_{\mathrm{Output}}\left(\sum_{p=1}^{P} i_{c,p} w_{c,p} + b_c\right)$ where $h_{\mathrm{Output}}(x) = x$

Output layer error: $\delta_c = h'_{\mathrm{Output}}(x)\,(D_c - O_c)$

Hidden layer error: $\delta_c = h'_{\mathrm{Hidden}}(x) \sum_{n=1}^{N} \delta_n w_{n,c}$

Weight change: $\Delta w_{c,p} = \alpha\, \delta_c\, O_p$
Forecast Error Formulas

Total squared error: $E_C = \frac{1}{2} \sum_{c=1}^{C} (D_c - O_c)^2$

Unscaled error: $UE_C = \sum_{c=1}^{C} \left| UD_c - UO_c \right|$

Coefficient of determination:

$r^2 = 1 - \dfrac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$r^2 \begin{cases} = 1 & \text{if } \hat{x}_i = x_i \;\forall i \\ \in (0, 1) & \text{if } \hat{x}_i \text{ is a better forecast than } \bar{x} \\ = 0 & \text{if generally } \hat{x}_i = \bar{x} \\ < 0 & \text{if } \hat{x}_i \text{ is a worse forecast than } \bar{x} \end{cases}$
Related Work

• Drossu and Obradovic (1996): hybrid stochastic and neural network approach to time series forecasting
• Zhang and Thearling (1994): parallel implementations of neural networks and memory-based reasoning
• Geva (1998): multiscale fast wavelet transform and an array of feed-forward neural networks
• Lawrence, Tsoi, and Giles (1996): encodes the series with a self-organizing map and uses recurrent neural networks
• Kingdon (1997): automated intelligent system for financial forecasting using neural networks and genetic algorithms
